4+ Years of Relevant Experience
We are seeking a skilled and hands-on Site Reliability Engineer (SRE) with proven experience in Temporal, Docker, and Kubernetes in production environments. In this role, you'll work at the intersection of infrastructure and development to ensure system reliability, performance, and scalability. You'll integrate Temporal workflows across a microservices architecture while supporting modern CI/CD and observability practices.
Key Responsibilities:
- Deploy and manage Temporal workflows in production environments.
- Build integrations with microservices, messaging systems (e.g., Kafka, RabbitMQ), and databases to enable reliable and resilient workflows.
- Design and maintain Kubernetes clusters, ensuring high availability and fault tolerance.
- Manage containerized applications using Docker and orchestrate them with Kubernetes.
- Contribute to system reliability by implementing monitoring, alerting, and auto-healing mechanisms.
- Collaborate with development and operations teams to drive DevOps and SRE best practices.
- Support incident management and root cause analysis for production issues.
Required Skills:
- Hands-on experience with Temporal in a production environment.
- Strong knowledge of Kubernetes and Docker.
- Solid understanding of microservices architecture, distributed systems, and resilient design patterns.
- Experience integrating Temporal with messaging and event-streaming systems such as Kafka or RabbitMQ.
- Familiarity with CI/CD pipelines and SRE principles.
- Basic understanding of cloud platforms (AWS, GCP, or Azure).
Nice to Have:
- Exposure to observability tools such as Prometheus, Grafana, OpenTelemetry, or similar.
- Experience with infrastructure-as-code and deployment tooling (e.g., Terraform, Helm).
- Familiarity with chaos engineering and automated testing for resilience.
Soft Skills:
- Strong problem-solving and troubleshooting abilities.
- Excellent communication and team collaboration skills.
- Ability to work in a fast-paced, agile environment.