Senior Site Reliability Engineer Job Description

Senior Site Reliability Engineer Job Description Template

Senior Site Reliability Engineers ensure the seamless operation of complex, large-scale systems. Responsibilities include developing software, automating tasks, troubleshooting issues, and improving system performance and security. Expertise in cloud platforms and programming languages is essential.

Responsibilities:

  • Design and implement reliable and scalable systems for production environments
  • Develop and maintain tools for automation, monitoring, and alerting
  • Collaborate with cross-functional teams to identify and resolve performance and reliability issues
  • Lead incident management and root cause analysis efforts
  • Create and maintain documentation for infrastructure, processes, and procedures
  • Stay up to date with industry trends and emerging technologies in site reliability engineering
  • Mentor and coach junior engineers to improve their technical and professional skills
  • Participate in the on-call rotation to provide 24/7 support for production systems

Requirements:

  • [X]+ years of experience in a site reliability engineering role
  • Strong understanding of cloud technologies, particularly AWS
  • Expertise in designing and implementing scalable and reliable systems
  • Proficiency in at least one programming language, such as Python or Java
  • Experience with configuration management tools like Ansible or Puppet
  • Ability to troubleshoot complex issues in production environments
  • Excellent communication and collaboration skills
  • Experience with monitoring and logging tools such as Prometheus, Grafana, and the ELK Stack