Senior Cloud Infrastructure Engineer – Multi-Cloud & MLOps

Hybrid

New York, NY

Posted 8 months ago

Hatch, a fast-moving technology company solving real-world business problems with AI, is seeking a Senior DevOps Engineer (titled Cloud Infrastructure Engineer) to join its high-impact engineering team. This senior role is focused on building the resilient, secure, and scalable infrastructure that powers the company’s core platform and AI product lines.

This is a full-time, hybrid position based in SoHo, New York City. Candidates must be based in NYC, and visa sponsorship is not available.

Core Mandate: Infrastructure, MLOps, and Reliability

You will own the infrastructure that enables the company’s velocity, focusing on the specialized compute and data needs of machine learning workflows.

Infrastructure at Scale:
- Evolve the multi-cloud infrastructure (AWS & GCP) using Infrastructure-as-Code (Terraform or Ansible).
- Manage scalable, secure, and cost-efficient environments across all stages (dev, staging, production).
- Implement systems that support the compute-heavy and storage-intensive needs of machine learning and data processing pipelines.
- Participate in an on-call rotation.
ML Platform Support (MLOps):
- Collaborate with ML engineers to productionize models and manage workflows across training, testing, and deployment.
- Implement infrastructure to support versioning, orchestration, and monitoring of ML models (using tools like Kubeflow, SageMaker, or Vertex AI).
- Optimize data pipelines and model serving for low latency and high throughput.
Reliability & Observability:
- Drive the strategy for observability, logging, and alerting across distributed systems.
- Lead incident response, root cause analysis (RCA), and system hardening.
- Implement best practices for infrastructure security and container hardening.

Required Experience and Technical Stack

The role requires a senior engineer with deep AWS experience, IaC expertise, and specialized knowledge of the MLOps lifecycle.

Experience: 3+ years of experience in DevOps, SRE, or platform engineering in high-growth environments.
Cloud Expertise: 3+ years of experience with AWS infrastructure and services, including networking, IAM, ECS/EKS, and serverless computing.
MLOps Experience: Experience supporting machine learning teams or MLOps platforms (e.g., model training pipelines, feature stores, online inference).
IaC & CI/CD: Strong experience with Terraform or Ansible and CI/CD tooling (GitHub Actions, Argo CD, etc.).
Containerization: Strong knowledge of container orchestration (Kubernetes preferred).
Observability: Strong knowledge of observability stacks (Prometheus, Grafana, Sentry, Datadog, etc.).
Programming: Familiarity with at least one programming language (Python, Go, Erlang, Rust, etc.).
Preferred: Exposure to agentic programming workflows; RHCE/RHCSA or equivalent certifications.

Job Features

Job Category

Cloud Engineering, Data
