Site Reliability Engineer (SRE) – Reliability Platform
Zapier, a company building a platform for automation and AI that helps millions of businesses globally scale, is seeking a Site Reliability Engineer (SRE). This high-impact role is on the Reliability Platform team, which owns observability, incident response, and service ownership, with the mission of strengthening Zapier’s reliability posture at scale.
This is a full-time, remote position for the NAMER (West Coast) region. The salary range is $141,000 – $211,700 annually.
Role Summary and Observability Mandate
This SRE role goes beyond typical infrastructure work, focusing heavily on observability, incident response, and coding to build systems that make Zapier more resilient. You’re expected to enjoy writing production-grade code and to proactively find ways to reduce toil and automate repetitive work.
Things You’ll Do:
- Platform Tooling: Build and improve platform tooling that helps Zapier engineers observe and operate their services.
- Observability Evolution: Operate and evolve core observability systems, including logging, metrics, alerting, and dashboards, using tools like Grafana, Datadog, OpenSearch, and Prometheus.
- Incident Response: Participate in the team’s on-call rotation and contribute to the broader incident response program by improving the processes, tooling, and practices used to detect, respond to, and learn from incidents.
- Automation & Infra: Write code to automate operations, improve developer experience, and contribute to infrastructure reliability using AWS, Kubernetes, and Terraform.
- Best Practices: Review instrumentation designs, suggest improvements, and advocate for effective alerting to raise the bar on observability and reliability across product teams.
- AI Exploration: Explore and pilot AI-augmented tools (e.g., debugging agents, alert correlation) to improve reliability workflows.
Required Experience and Technical Qualifications
The ideal candidate is an experienced engineer with a strong coding background, deep familiarity with the cloud-native SRE stack, and a proactive, problem-solving mindset.
- Experience (Mandatory):
  - 4+ years in systems, infrastructure, or backend software roles (SaaS, cloud-native environments preferred).
  - Hands-on experience with observability (metrics, logging, dashboards, alerts) and the ability to reason about instrumentation and alert design.
  - Comfortable jumping into incidents, diagnosing across telemetry, coordinating, and contributing to postmortems.
- Core Technical Stack:
  - Thrives writing production-grade code in Go, Python, or equivalent.
  - Experience with Infrastructure-as-Code (Terraform, or equivalent).
  - Experience with cloud (AWS) and container orchestration (Kubernetes).
- Attitude: Thinks proactively about reducing toil and is comfortable influencing peers by suggesting better practices and driving cross-team improvements. Approaches new tools and ideas (especially AI in reliability) with curiosity and openness.
Job Features
| Job Category | Cloud Engineering, Data |