{"id":534,"date":"2025-11-05T09:26:03","date_gmt":"2025-11-05T09:26:03","guid":{"rendered":"https:\/\/skillbasedmatching.com\/jobs\/?post_type=jobpost&#038;p=534"},"modified":"2025-11-05T09:26:06","modified_gmt":"2025-11-05T09:26:06","slug":"site-reliability-engineer-sre-internal-platform-ai-focus","status":"publish","type":"jobpost","link":"https:\/\/skillbasedmatching.com\/jobs\/current-jobs\/site-reliability-engineer-sre-internal-platform-ai-focus\/","title":{"rendered":"Site Reliability Engineer (SRE) \u2013 Internal Platform &#038; AI Focus"},"content":{"rendered":"\n<p><strong>Zapier<\/strong>, a company focused on automation and AI for businesses, is hiring a <strong>Site Reliability Engineer (SRE)<\/strong> for its Reliability Platform team within the Internal Platform. This role is pivotal in strengthening Zapier\u2019s reliability posture by improving observability, incident response, and system resilience at scale.<\/p>\n\n\n\n<p>This is a <strong>Full-time, Remote<\/strong> position, specifically located in the <strong>NAMER (West Coast)<\/strong> timezone. 
The salary range is <strong>$141,000 \u2013 $211,700\/yr<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Why This Role Matters: Reliability, Observability, and AI<\/h3>\n\n\n\n<p>This SRE role goes beyond traditional infrastructure, focusing on writing production-grade code to solve complex systems challenges and integrate emerging AI tools into reliability workflows.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Key Responsibilities:<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Platform Tooling:<\/strong> Build and improve platform tooling that helps Zapier engineers <strong>observe and operate their services<\/strong>.<\/li>\n\n\n\n<li><strong>Observability Systems:<\/strong> Operate and evolve core observability systems, including <strong>logging, metrics, alerting, and dashboards (Grafana, Datadog, OpenSearch, Prometheus)<\/strong>.<\/li>\n\n\n\n<li><strong>Incident Response:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Participate in the team\u2019s <strong>on-call rotation<\/strong> and contribute to Zapier\u2019s broader incident response program.<\/li>\n\n\n\n<li>Improve the processes, tooling, and practices used to detect, respond to, and learn from incidents.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Automation &amp; IaC:<\/strong> Write code (Go, Python) to <strong>automate operations<\/strong>, improve developer experience, and reduce manual toil. 
Contribute to infrastructure reliability using <strong>AWS, Kubernetes, and Terraform<\/strong>.<\/li>\n\n\n\n<li><strong>Best Practices &amp; Influence:<\/strong> Help shape observability and reliability best practices: <strong>review instrumentation designs, suggest improvements, and advocate for effective alerting<\/strong>.<\/li>\n\n\n\n<li><strong>AI Integration (Forward-Looking):<\/strong> <strong>Explore and pilot AI-augmented tools<\/strong> (e.g., debugging agents, alert correlation, query recommendations) to improve reliability workflows.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Required Experience and Technical Stack<\/h3>\n\n\n\n<p>The ideal candidate is an experienced, production-focused engineer who is comfortable coding and operating complex, modern cloud-native systems.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Experience:<\/strong> <strong>4+ years<\/strong> in systems, infrastructure, or backend software roles (SaaS, cloud-native environments preferred).<\/li>\n\n\n\n<li><strong>Coding Proficiency:<\/strong> <strong>Thrive writing production-grade code<\/strong> in <strong>Go, Python<\/strong>, or equivalent.<\/li>\n\n\n\n<li><strong>Core Stack:<\/strong> Hands-on experience with:\n<ul class=\"wp-block-list\">\n<li><strong>Infrastructure-as-Code (Terraform)<\/strong>.<\/li>\n\n\n\n<li><strong>Cloud (AWS)<\/strong>.<\/li>\n\n\n\n<li><strong>Container orchestration (Kubernetes)<\/strong>.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Observability:<\/strong> Hands-on experience with observability concepts (metrics, logging, dashboards, alerts) and the ability to <strong>reason about instrumentation and alert design<\/strong>.<\/li>\n\n\n\n<li><strong>Incident Management:<\/strong> Comfortable <strong>jumping into incidents, diagnosing across telemetry<\/strong>, coordinating with teams, and contributing to postmortems.<\/li>\n\n\n\n<li><strong>AI 
Mindset:<\/strong> Approaches new tools and ideas with curiosity and openness, especially around <strong>AI in reliability workflows<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Zapier&#8217;s Stack Highlights:<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cloud &amp; Infra:<\/strong> AWS, Kubernetes, Redis, Kafka, Terraform<\/li>\n\n\n\n<li><strong>Languages:<\/strong> Go, Python, TypeScript<\/li>\n\n\n\n<li><strong>Observability:<\/strong> Grafana, Datadog, OpenSearch, Prometheus, Sentry<\/li>\n\n\n\n<li><strong>CI\/CD:<\/strong> GitLab, ArgoCD<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Zapier, a company focused on automation and AI for businesses, is hiring a Site Reliability Engineer (SRE) for its Reliability Platform team within the Internal Platform. This role is pivotal in strengthening Zapier\u2019s reliability posture by improving observability, incident response, and system resilience at scale. 
This is a Full-time, Remote position, specifically located in the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"menu_order":0,"template":"","jobpost_category":[42,46,45],"jobpost_job_type":[39],"jobpost_location":[],"jobpost_tag":[2567,188,1144,2566,2565,1261,1232,1002,2563,2564,1004],"class_list":["post-534","jobpost","type-jobpost","status-publish","hentry","jobpost_category-ai-artificial-intelligence","jobpost_category-data","jobpost_category-software-engineering","jobpost_job_type-remote","jobpost_tag-ai-in-sre","jobpost_tag-aws","jobpost_tag-cloud-native","jobpost_tag-datadog-grafana","jobpost_tag-go-python-coding","jobpost_tag-incident-response","jobpost_tag-kubernetes","jobpost_tag-observability","jobpost_tag-remote-namer","jobpost_tag-site-reliability-engineer-sre","jobpost_tag-terraform"],"_links":{"self":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost\/534","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost"}],"about":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/types\/jobpost"}],"author":[{"embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/users\/1"}],"wp:attachment":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/media?parent=534"}],"wp:term":[{"taxonomy":"jobpost_category","embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_category?post=534"},{"taxonomy":"jobpost_job_type","embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_job_type?post=534"},{"taxonomy":"jobpost_location","embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_location?post=534"},{"taxonomy":"jobpost_tag","embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_tag?post=534"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tr
ue}]}}