Site Reliability Engineer (SRE)
An opportunity is available for a Site Reliability Engineer (SRE) to join the infrastructure team of a product and engineering organization. This key technical role is responsible for ensuring the scalability, reliability, and performance of the company’s cloud-based services through automation and continuous improvement.
This is a full-time, remote position.
Role Summary and Reliability Mandate
The SRE will work closely with IT, Engineering, and Security teams to design, secure, and maintain highly available, cost-efficient, and observable systems. A strong emphasis is placed on Infrastructure as Code (IaC), incident management, and modern cloud practices.
Key Responsibilities
- Cloud Reliability & Performance: Own the scalability, reliability, and performance of the company's cloud-based services.
- Automation & IaC: Drive automation and continuous improvement; design and maintain systems using Infrastructure as Code (Terraform preferred).
- Observability: Implement and manage log aggregation and observability tools (e.g., Sumo Logic, Datadog, ELK) for monitoring and proactive management.
- Container Orchestration: Manage containerized services with Kubernetes (EKS) and Helm.
- Security & Compliance: Design and maintain secure systems, with familiarity with compliance frameworks (SOC 2, HIPAA, etc.).
- Incident Management: Apply incident management practices and SRE principles (SLAs, SLOs, error budgets) to ensure operational excellence.
- Collaboration: Work closely with cross-functional teams to design systems that are secure, observable, and cost-efficient.
Required Experience and Technical Qualifications
The ideal candidate is a hands-on SRE or DevOps professional with deep expertise in AWS and Kubernetes and experience using IaC and observability tools in a fast-paced environment.
- Experience: 5+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
- AWS Expertise (Strong Proficiency): Hands-on experience with core AWS services, including IAM, EC2, ECS/Fargate, S3, and RDS, and with provisioning them through CloudFormation or Terraform.
- DevOps Tooling: Experience with Infrastructure as Code (Terraform preferred) and GitHub (workflow automation, PR workflows, secrets management).
- Observability: Hands-on experience with log aggregation and observability tools (Sumo Logic or equivalents like Datadog, ELK).
- Containerization: Experience with container orchestration using Kubernetes (EKS) and Helm.
- Environment: Prior experience in fast-paced SaaS or startup environments is highly valued.
- Principles: Familiarity with incident management practices and SRE principles (SLAs, SLOs, error budgets).
Job Features
| Job Category | Cloud Engineering, Product Management |