Site Reliability Engineer (SRE) – Internal Platform & AI Focus

Remote

Posted 8 months ago

Zapier, a company focused on automation and AI for businesses, is hiring a Site Reliability Engineer (SRE) for its Reliability Platform team within the Internal Platform. This role is pivotal in strengthening Zapier’s reliability posture by improving observability, incident response, and system resilience at scale.

This is a full-time, remote position for candidates based in the NAMER (West Coast) time zone. The salary range is $141,000 – $211,700/yr.

Why This Role Matters: Reliability, Observability, and AI

This SRE role goes beyond traditional infrastructure work: it centers on writing production-grade code to solve complex systems challenges and on integrating emerging AI tools into reliability workflows.

Key Responsibilities:

Platform Tooling: Build and improve platform tooling that helps Zapier engineers observe and operate their services.
Observability Systems: Operate and evolve core observability systems, including logging, metrics, alerting, and dashboards (Grafana, Datadog, OpenSearch, Prometheus).
Incident Response:
- Participate in the team’s on-call rotation and contribute to Zapier’s broader incident response program.
- Improve the processes, tooling, and practices used to detect, respond to, and learn from incidents.
Automation & IaC: Write code (Go, Python) to automate operations, improve developer experience, and reduce manual toil. Contribute to infrastructure reliability using AWS, Kubernetes, and Terraform.
Best Practices & Influence: Help shape observability and reliability best practices: review instrumentation designs, suggest improvements, and advocate for effective alerting.
AI Integration (Forward-Looking): Explore and pilot AI-augmented tools (e.g., debugging agents, alert correlation, query recommendations) to improve reliability workflows.

Required Experience and Technical Stack

The ideal candidate is an experienced, production-focused engineer who is comfortable coding and operating complex, modern cloud-native systems.

Experience: 4+ years in systems, infrastructure, or backend software roles (SaaS or cloud-native environments preferred).
Coding Proficiency: Thrives on writing production-grade code in Go, Python, or an equivalent language.
Core Stack: Hands-on experience with:
- Infrastructure-as-Code (Terraform).
- Cloud (AWS).
- Container orchestration (Kubernetes).
Observability: Hands-on experience with observability concepts (metrics, logging, dashboards, alerts) and the ability to reason about instrumentation and alert design.
Incident Management: Comfortable jumping into incidents, diagnosing across telemetry, coordinating with teams, and contributing to postmortems.
AI Mindset: Approaches new tools and ideas with curiosity and openness, especially around AI in reliability workflows.

Zapier’s Stack Highlights:

Cloud & Infra: AWS, Kubernetes, Redis, Kafka, Terraform
Languages: Go, Python, TypeScript
Observability: Grafana, Datadog, OpenSearch, Prometheus, Sentry
CI/CD: GitLab, ArgoCD

Job Features

Job Category

AI (Artificial Intelligence), Data, Software Engineering
