Site Reliability Engineer Job Description

Site Reliability Engineer Job Description Template

Site Reliability Engineers keep online platforms running reliably by managing system risk, automating operational processes, and improving software performance. Key tasks include troubleshooting, system design, and incident response.

Responsibilities:

Collaborate with development teams to design and implement highly available and scalable applications
Develop and maintain automation tools for deployment, monitoring, and alerting
Monitor system performance and troubleshoot issues, including network, hardware, and software
Ensure high availability and disaster recovery capabilities of systems and services
Participate in an on-call rotation to respond to system outages and incidents
Continuously evaluate and improve system infrastructure to optimize performance and minimize downtime
Maintain documentation of system architecture, configurations, and processes
Stay up to date with industry trends and emerging technologies to identify opportunities for innovation and improvement

Requirements:

Minimum of 3 years of experience as a Site Reliability Engineer or in a similar role
Deep understanding of cloud infrastructure and experience with cloud providers such as AWS, Azure, or Google Cloud
Proficiency in at least one programming language such as Python, Java, or Go
Experience with containerization technologies such as Docker and Kubernetes
Strong analytical and problem-solving skills to identify and troubleshoot issues related to infrastructure and applications
Experience with monitoring and logging tools such as Prometheus, Grafana, and the ELK stack
Good knowledge of networking protocols and experience with network troubleshooting
Ability to work collaboratively with cross-functional teams, including developers, DevOps, and other stakeholders, to ensure high availability and reliability of systems