Site Reliability Engineer (SRE) – Health Tech Operations

Remote
United States
Posted 1 week ago

SafeRide Health is seeking a Site Reliability Engineer (SRE) to join its IT Infrastructure team. This critical role is responsible for keeping user-facing services and production systems highly available, reliable, and scalable by developing and implementing new processes that support software delivery excellence and operational discipline.

This is a full-time, remote position in the United States.


Core Responsibilities and Operational Discipline Mandate

The SRE will focus on minimizing downtime, automating tasks, and proactively managing system health and capacity. A key component involves defining and monitoring Service Level Objectives (SLOs) and collaborating closely with development teams.

  • Reliability & Availability: Focus on availability, reliability, and scalability to keep systems and services running smoothly with minimal downtime.
  • Incident Management: Define and monitor SLOs, respond to and diagnose system incidents, and conduct post-mortems to prevent future occurrences.
  • Automation: Develop and maintain tools and scripts to automate repetitive tasks such as deployments, configuration management, and monitoring.
  • Monitoring & Alerting: Implement and manage monitoring and alerting systems (Prometheus, DataDog, New Relic, Grafana, Splunk) to provide visibility and quickly detect potential issues.
  • Capacity & Risk Mitigation: Perform capacity planning by monitoring resource usage to forecast future needs, and collaborate with stakeholders to identify and mitigate operational risks.
  • Optimization: Analyze metrics from operating systems and applications to identify areas for performance improvement.

Required Experience and Technical Qualifications

The ideal candidate has progressive experience in technology operations, with hands-on proficiency in production monitoring, containerized cloud infrastructure, and automation scripting.

  • Minimum Experience (Mandatory):
    • Minimum of 5 years of progressive experience in an IT, Software Engineering, Technology Operations, or Business Continuity role.
    • Minimum of 2 years of hands-on experience in a Site Reliability, DevOps, or IT Observability role.
  • Technical Proficiency:
    • Basic proficiency in an AWS containerized environment running infrastructure as code.
    • Demonstrated proficiency with production monitoring and alerting tools (DataDog is a major plus!).
  • Key Skills:
    • Cloud Technologies: Expertise in major cloud platforms such as AWS and Azure.
    • Tools & Technologies: Experience with tools for Infrastructure as Code (Terraform) and containerization (Docker).
    • Systems Engineering: Deep knowledge of operating systems, networking, storage, and distributed systems.
    • Programming & Scripting: Proficiency in coding languages like Python, Ruby, and JavaScript for automation and infrastructure management.

Job Features

Job Category: Information Technology
