Site Reliability Engineer (SRE) – Health Tech Operations
SafeRide Health is seeking a Site Reliability Engineer (SRE) to join its IT Infrastructure team. This critical role keeps user-facing services and production systems highly available, reliable, and scalable by developing and implementing processes that support software delivery excellence and operational discipline.
This is a full-time, remote position based in the United States.
Core Responsibilities and Operational Discipline Mandate
The SRE will focus on minimizing downtime, automating tasks, and proactively managing system health and capacity. Key parts of the role are defining and monitoring Service Level Objectives (SLOs) and collaborating closely with development teams.
- Reliability & Availability: Focus on availability, reliability, and scalability to keep systems and services running smoothly with minimal downtime.
- Incident Management: Define and monitor SLOs, respond to and diagnose system incidents, and conduct post-mortems to prevent future occurrences.
- Automation: Develop and maintain tools and scripts to automate repetitive tasks such as deployments, configuration management, and monitoring.
- Monitoring & Alerting: Implement and manage monitoring and alerting systems (Prometheus, DataDog, New Relic, Grafana, Splunk) to provide visibility and quickly detect potential issues.
- Capacity & Risk Mitigation: Perform capacity planning by monitoring resource usage to forecast future needs, and collaborate with stakeholders to identify and mitigate operational risks.
- Optimization: Analyze metrics from operating systems and applications to identify areas for performance improvement.
Required Experience and Technical Qualifications
The ideal candidate has progressive experience in technology operations, with hands-on proficiency in production monitoring, containerized cloud infrastructure, and automation scripting.
- Minimum Experience (Mandatory):
  - Minimum of 5 years of progressive experience in an IT, Software Engineering, Technology Operations, or Business Continuity role.
  - Minimum of 2 years of hands-on experience in a Site Reliability, DevOps, or IT Observability role.
- Technical Proficiency:
  - Basic proficiency in an AWS containerized environment running infrastructure as code.
  - Demonstrated proficiency with production monitoring and alerting tools (DataDog is a major plus!).
- Key Skills:
  - Cloud Technologies: Expertise in major cloud platforms such as AWS and Azure.
  - Tools & Technologies: Experience with tools for Infrastructure as Code (Terraform) and containerization (Docker).
  - Systems Engineering: Deep knowledge of operating systems, networking, storage, and distributed systems.
  - Programming & Scripting: Proficiency in coding languages such as Python, Ruby, and JavaScript for automation and infrastructure management.
Job Features
| Job Category | Information Technology |