Senior Site Reliability Engineer Job Description
Senior Site Reliability Engineer Job Description Template
Senior Site Reliability Engineers keep complex, large-scale systems running reliably. Responsibilities include developing software, automating operational tasks, troubleshooting production issues, and improving system performance and security. Expertise in cloud platforms and programming languages is essential.
Responsibilities:
- Design and implement reliable and scalable systems for production environments
- Develop and maintain tools for automation, monitoring, and alerting
- Collaborate with cross-functional teams to identify and resolve performance and reliability issues
- Lead incident management and root cause analysis efforts
- Create and maintain documentation for infrastructure, processes, and procedures
- Stay up to date with industry trends and emerging technologies in site reliability engineering
- Mentor and coach junior engineers to improve their technical and professional skills
- Participate in an on-call rotation to provide 24/7 support for production systems
Requirements:
- + years of experience in a site reliability engineering role
- Strong understanding of cloud technologies, particularly AWS
- Expertise in designing and implementing scalable and reliable systems
- Proficiency in at least one programming language, such as Python or Java
- Experience with configuration management tools like Ansible or Puppet
- Ability to troubleshoot complex issues in production environments
- Excellent communication and collaboration skills
- Experience with monitoring and logging tools such as Prometheus, Grafana, and the ELK stack