10+ Years Of Relevant Experience
We are seeking an experienced Site Reliability Engineer (SRE) to ensure the stability, scalability, and reliability of our production systems. The ideal candidate will focus on automation, incident management, performance optimization, and proactive monitoring, while collaborating closely with development and operations teams to build resilient infrastructure.
Key Responsibilities:
- System Reliability: Collaborate with production support teams to build scalable, maintainable systems and continuously improve infrastructure and application architecture.
- Toil Reduction & Automation: Develop and maintain tools, scripts, and automation for deployments, monitoring, incident response, and repetitive operational tasks to minimize manual effort and human error.
- Incident Management: Participate in on-call rotations, respond to incidents and outages, investigate issues, and drive problem management through root cause analysis and preventive measures.
- Monitoring & Alerting: Implement and maintain proactive monitoring systems and alerts to detect and address issues before they impact users.
- Capacity Planning & Performance Optimization: Monitor performance metrics, identify bottlenecks, collaborate with engineering teams on optimization, and plan for future scalability.
- Error Budgeting & Chaos Engineering: Track error budgets and conduct resiliency tests, mock drills, and stability assessments to improve system fault tolerance.
- Documentation: Create and maintain detailed documentation for system configurations, operational processes, and troubleshooting guidelines.
Required Skills & Experience:
- Strong understanding of cloud platforms (AWS, Google Cloud, or Azure).
- Experience with containerization technologies (Docker, Kubernetes).
- Proficiency with infrastructure-as-code tools (Terraform, Ansible).
- Solid grasp of incident management processes and production operations.
Desirable Skills:
- Software development experience in Python or Java.
- Familiarity with monitoring and logging tools (Splunk Cloud, ThousandEyes).
- Strong networking fundamentals.
- Strong problem-solving skills and the ability to work in fast-paced, cross-functional environments.