Site Reliability Engineer (SRE)

Remote

Posted 6 months ago

An opportunity is available for a Site Reliability Engineer (SRE) to join the infrastructure team of a product and engineering organization. This key technical role is responsible for ensuring the scalability, reliability, and performance of the company’s cloud-based services through automation and continuous improvement.

This is a full-time, remote position.

Role Summary and Reliability Mandate

The SRE will work closely with IT, Engineering, and Security teams to design, secure, and maintain highly available, cost-efficient, and observable systems. A strong emphasis is placed on Infrastructure as Code (IaC), incident management, and modern cloud practices.

Key Responsibilities

Cloud Reliability & Performance: Own the overall scalability, reliability, and performance of the company's cloud-based services.
Automation & IaC: Strong emphasis on automation and continuous improvement. Design and maintain systems using Infrastructure as Code (Terraform preferred).
Observability: Implement and manage log aggregation and observability tools (e.g., Sumo Logic, Datadog, ELK) for monitoring and proactive management.
Container Orchestration: Work with Kubernetes (EKS), Helm, and container orchestration to manage services.
Security & Compliance: Design and maintain secure systems, drawing on familiarity with compliance frameworks (SOC 2, HIPAA, etc.).
Incident Management: Utilize incident management practices and SRE principles (SLAs, SLOs, error budgets) to ensure operational excellence.
Collaboration: Work closely with cross-functional teams to design systems that are secure, observable, and cost-efficient.

Required Experience and Technical Qualifications

The ideal candidate is a hands-on SRE or DevOps professional with deep expertise in AWS and Kubernetes and a track record of using IaC and observability tools in a fast-paced environment.

Experience: 5+ years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
AWS Expertise (Strong Proficiency): Hands-on experience with core AWS services, including IAM, EC2, ECS/Fargate, S3, and RDS, provisioned with CloudFormation or Terraform.
DevOps Tooling: Experience with Infrastructure as Code (Terraform preferred) and GitHub (workflow automation, PR workflows, secrets management).
Observability: Hands-on experience with log aggregation and observability tools (Sumo Logic or equivalents like Datadog, ELK).
Containerization: Experience with Kubernetes (EKS), Helm, and container orchestration.
Environment: Prior experience in fast-paced SaaS or startup environments is highly valued.
Principles: Familiarity with incident management practices and SRE principles (SLAs, SLOs, error budgets).

Job Features

Job Category

Cloud Engineering, Product Management

