Apply for Site Reliability Engineer Job

Full time | Work From Office
Listed on Dec 19, 2023

Job Description of Site Reliability Engineer

10+ Years of Relevant Experience

We are seeking an experienced Site Reliability Engineer (SRE) to ensure the stability, scalability, and reliability of our production systems. The ideal candidate will focus on automation, incident management, performance optimization, and proactive monitoring, while collaborating closely with development and operations teams to build resilient infrastructure.

Key Responsibilities:

  • System Reliability: Collaborate with production support teams to build scalable, maintainable systems and continuously improve infrastructure and application architecture.
  • Toil Reduction & Automation: Develop and maintain tools, scripts, and automation for deployments, monitoring, incident response, and repetitive operational tasks to minimize manual effort and human error.
  • Incident Management: Participate in on-call rotations, respond to incidents and outages, investigate issues, and drive problem management through root cause analysis and preventive measures.
  • Monitoring & Alerting: Implement and maintain proactive monitoring systems and alerts to detect and address issues before they impact users.
  • Capacity Planning & Performance Optimization: Monitor performance metrics, identify bottlenecks, collaborate with engineering teams on optimization, and plan for future scalability.
  • Error Budgeting & Chaos Engineering: Conduct resiliency tests, mock drills, and stability assessments to improve system fault tolerance.
  • Documentation: Create and maintain detailed documentation for system configurations, operational processes, and troubleshooting guidelines.

Required Skills & Experience:

  • Strong understanding of cloud platforms (AWS, Google Cloud, or Azure).
  • Experience with containerization technologies (Docker, Kubernetes).
  • Proficiency with infrastructure-as-code tools (Terraform, Ansible).
  • Solid grasp of incident management processes and production operations.

Desirable Skills:

  • Software development experience in Python or Java.
  • Familiarity with monitoring and logging tools (Splunk Cloud, ThousandEyes).
  • Strong networking fundamentals.
  • Ability to work in fast-paced, cross-functional environments with strong problem-solving skills.

Required Skills for Site Reliability Engineer Job

  • AWS
  • Google Cloud
  • Azure
  • Docker
  • Kubernetes
  • Terraform
  • Ansible

Our Hiring Process

  • Screening (HR Round)
  • Technical Round 1
  • Technical Round 2
  • Final HR Round
