Site Reliability Engineer (SRE) – Health Tech Operations

Remote

United States

Posted 5 months ago

SafeRide Health is seeking a Site Reliability Engineer (SRE) to join its IT Infrastructure team. This critical role is responsible for keeping user-facing services and production systems highly available, reliable, and scalable by developing and implementing processes that support software delivery excellence and operational discipline.

This is a full-time, remote position in the United States.

Core Responsibilities and Operational Discipline Mandate

The SRE will focus on minimizing downtime, automating tasks, and proactively managing system health and capacity. A key component involves defining and monitoring Service Level Objectives (SLOs) and collaborating closely with development teams.

Reliability & Availability: Focus on availability, reliability, and scalability to keep systems and services running smoothly with minimal downtime.
Incident Management: Define and monitor SLOs, respond to and diagnose system incidents, and conduct post-mortems to prevent future occurrences.
Automation: Develop and maintain tools and scripts to automate repetitive tasks such as deployments, configuration management, and monitoring.
Monitoring & Alerting: Implement and manage monitoring and alerting systems (Prometheus, DataDog, New Relic, Grafana, Splunk) to provide visibility and quickly detect potential issues.
Capacity & Risk Mitigation: Perform capacity planning by monitoring resource usage to forecast future needs, and collaborate with stakeholders to identify and mitigate operational risks.
Optimization: Analyze metrics from operating systems and applications to identify areas for performance improvement.

Required Experience and Technical Qualifications

The ideal candidate has progressive experience in technology operations, with hands-on proficiency in production monitoring, containerized cloud infrastructure, and automation scripting.

Minimum Experience (Mandatory):
- Minimum of 5 years of progressive experience in an IT, Software Engineering, Technology Operations, or Business Continuity role.
- Minimum of 2 years of hands-on experience in a Site Reliability, DevOps, or IT Observability role.
Technical Proficiency:
- Basic proficiency working in a containerized AWS environment managed with infrastructure as code.
- Demonstrated proficiency with production monitoring and alerting tools (DataDog is a major plus!).
Key Skills:
- Cloud Technologies: Expertise in major cloud platforms such as AWS and Azure.
- Tools & Technologies: Experience with tools for Infrastructure as Code (Terraform) and containerization (Docker).
- Systems Engineering: Deep knowledge of operating systems, networking, storage, and distributed systems.
- Programming & Scripting: Proficiency in languages such as Python, Ruby, and JavaScript for automation and infrastructure management.

Job Features

Job Category: Information Technology
