Senior Site Reliability Engineer – AI & Automation Focus
This role moves SRE beyond traditional scripting toward Agentic Workflows and Autonomous Infrastructure. You will build self-sustaining systems that use AI to eliminate operational toil, integrating Large Language Models (LLMs) and orchestration frameworks directly into the production lifecycle to automate incident response and system scaling.
- Focus: AI Operations (AIOps), Autonomous Agents, and Predictive Observability.
- Core Frameworks: LangChain, LangGraph, n8n, CrewAI, AutoGPT.
- Automation Tools: Airplane.dev, Custom AI Flow Builders.
- Key Metric: Elimination of toil through self-healing systems.
Autonomous Agent Orchestration
You will design and deploy agentic workflows using frameworks like LangGraph or CrewAI. Unlike standard linear automation, these autonomous agents can reason through complex infrastructure alerts, interact with APIs, and execute remediation steps independently. You will be tasked with integrating these “AI Copilots” into production systems to handle routine maintenance and complex multi-step recoveries.
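To give candidates a concrete sense of the alert-to-remediation loop described above, here is a minimal sketch in plain Python. It is an illustration only, not our production design and not the LangGraph or CrewAI API: the names `Alert`, `RunLog`, `diagnose`, and `run_agent` are hypothetical, and the hard-coded playbook stands in for the reasoning step an LLM would perform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    service: str
    symptom: str  # hypothetical symptom labels, e.g. "high_memory"


@dataclass
class RunLog:
    steps: List[str] = field(default_factory=list)
    resolved: bool = False


def diagnose(alert: Alert) -> str:
    """Stand-in for the LLM reasoning step: map a symptom to an action name."""
    playbook = {
        "high_memory": "restart_service",
        "disk_full": "rotate_logs",
    }
    return playbook.get(alert.symptom, "escalate_to_human")


def run_agent(
    alert: Alert,
    tools: Dict[str, Callable[[str], None]],
    is_healthy: Callable[[str], bool],
    max_attempts: int = 2,
) -> RunLog:
    """Choose an action, execute it, verify the result, and record every step."""
    log = RunLog()
    for attempt in range(1, max_attempts + 1):
        action = diagnose(alert)
        log.steps.append(f"attempt {attempt}: chose {action}")
        if action not in tools:
            # No automated remediation is registered: hand off to a human.
            log.steps.append("no tool available, escalating")
            return log
        tools[action](alert.service)
        if is_healthy(alert.service):
            log.resolved = True
            log.steps.append("health check passed")
            return log
    log.steps.append("attempts exhausted, escalating")
    return log


# Usage with an in-memory fake service instead of a real API.
service_state = {"checkout": "degraded"}


def restart_service(service: str) -> None:
    service_state[service] = "ok"


result = run_agent(
    Alert(service="checkout", symptom="high_memory"),
    tools={"restart_service": restart_service},
    is_healthy=lambda s: service_state[s] == "ok",
)
print(result.steps, result.resolved)
```

The bounded retry count and the explicit escalation path are deliberate in this sketch: an agent that acts on production systems should stop and hand off rather than loop indefinitely, and the step log keeps its decisions auditable.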
AI-Driven Observability & Predictive SLOs
A major component of this role is evolving traditional monitoring into Predictive Observability. You will build LLM-based assistants that help engineers query system state using natural language and design dashboards that predict Service Level Objective (SLO) breaches before they occur. By instrumenting systems comprehensively (measuring “everything”), you will create the feedback loops that AI needs to understand and maintain system health.
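One simple way to think about predicting an SLO breach is to project the error-budget burn rate forward. The sketch below is an assumption-laden illustration, not a description of our tooling: `SloWindow`, `burn_rate`, and `hours_until_breach` are hypothetical names, and the projection assumes traffic and error rate stay constant for the rest of the window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SloWindow:
    target: float          # availability target, e.g. 0.999
    window_hours: float    # length of the SLO window, e.g. 720 for 30 days
    elapsed_hours: float   # time elapsed in the current window
    total_requests: int    # requests observed so far in the window
    failed_requests: int   # failed requests observed so far


def burn_rate(w: SloWindow) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly at the end of the window;
    anything higher spends it early.
    """
    if w.total_requests == 0:
        return 0.0
    allowed = 1.0 - w.target
    observed = w.failed_requests / w.total_requests
    return observed / allowed


def hours_until_breach(w: SloWindow) -> Optional[float]:
    """Hours until the error budget runs out, or None if no breach is projected.

    Assumes the current error rate and traffic level hold steady.
    """
    rate = burn_rate(w)
    if rate <= 1.0:
        return None
    # At burn rate r the whole budget lasts window / r hours from the window start.
    remaining = w.window_hours / rate - w.elapsed_hours
    return max(remaining, 0.0)


window = SloWindow(
    target=0.999,
    window_hours=720,
    elapsed_hours=100,
    total_requests=1_000_000,
    failed_requests=3_000,
)
print(f"burn rate: {burn_rate(window):.2f}")
print(f"hours until breach: {hours_until_breach(window)}")
```

With these sample numbers the service is burning budget about three times faster than the SLO allows, so the projection lands at roughly 140 hours before the budget is exhausted. A real predictive dashboard would likely use trend or seasonality models rather than a constant-rate assumption.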
Documentation & Communication
Clarity is critical when automating high-stakes infrastructure. You will be responsible for documenting complex AI flow logic and communicating technical resolutions to partners and customers. This ensures that, even as systems become more autonomous, human operators retain full visibility and “precision of understanding” of how the AI is managing the platform.
Summary: You are at the forefront of the “SRE 2.0” movement. By replacing manual toil with intelligent, agent-driven automation and predictive analytics, you ensure that enterprise-scale systems are not just reliable, but inherently self-sustaining.
Job Features
| Job Category | AI (Artificial Intelligence) |