Site Reliability Engineer Job Description

Site Reliability Engineer Job Description Template

Site Reliability Engineers keep online platforms running reliably by managing system risk, automating operational processes, and improving software performance. Key tasks include troubleshooting, system design, and incident response.

Responsibilities:

  • Collaborate with development teams to design and implement highly available and scalable applications
  • Develop and maintain automation tools for deployment, monitoring, and alerting
  • Monitor system performance and troubleshoot issues, including network, hardware, and software
  • Ensure high availability of systems and services and maintain disaster recovery capabilities
  • Participate in an on-call rotation to respond to system outages and incidents
  • Continuously evaluate and improve system infrastructure to optimize performance and minimize downtime
  • Maintain documentation of system architecture, configurations, and processes
  • Stay up to date with industry trends and emerging technologies to identify opportunities for innovation and improvement

Requirements:

  • Minimum of 3 years of experience as a Site Reliability Engineer or in a similar role
  • Deep understanding of cloud infrastructure and experience with cloud providers such as AWS, Azure, or Google Cloud
  • Proficiency in at least one programming language such as Python, Java, or Go
  • Experience with containerization technologies such as Docker and Kubernetes
  • Strong analytical and problem-solving skills to identify and troubleshoot issues related to infrastructure and applications
  • Experience with monitoring and logging tools such as Prometheus, Grafana, and the ELK stack
  • Good knowledge of networking protocols and experience with network troubleshooting
  • Ability to work collaboratively with cross-functional teams, including developers, DevOps engineers, and other stakeholders, to ensure high availability and reliability of systems