Senior Cloud Infrastructure Engineer – Multi-Cloud & MLOps
Hatch, a fast-moving technology company solving real-world business problems with AI, is seeking a Senior Cloud Infrastructure Engineer (a senior DevOps role) to join its high-impact engineering team. The role focuses on building the resilient, secure, and scalable infrastructure that powers the company’s core platform and AI product lines.
This is a full-time, hybrid position based in SoHo, New York City. Candidates must be based in NYC; visa sponsorship is not available.
Core Mandate: Infrastructure, MLOps, and Reliability
You will own the infrastructure that enables the company’s velocity, focusing on the specialized compute and data needs of machine learning workflows.
- Infrastructure at Scale:
  - Evolve the multi-cloud infrastructure (AWS & GCP) using Infrastructure-as-Code (Terraform or Ansible).
  - Manage scalable, secure, and cost-efficient environments across all stages (dev, staging, production).
  - Implement systems that support the compute-heavy and storage-intensive needs of machine learning and data processing pipelines.
  - Participate in an on-call rotation.
- ML Platform Support (MLOps):
  - Collaborate with ML engineers to productionize models and manage workflows across training, testing, and deployment.
  - Implement infrastructure to support versioning, orchestration, and monitoring of ML models (using tools like Kubeflow, SageMaker, or Vertex AI).
  - Optimize data pipelines and model serving for low latency and high throughput.
- Reliability & Observability:
  - Drive the strategy for observability, logging, and alerting across distributed systems.
  - Lead incident response, root cause analysis (RCA), and system hardening.
  - Implement best practices for infrastructure security and container hardening.
Required Experience and Technical Stack
The role requires a senior engineer with deep AWS experience, IaC expertise, and specialized knowledge of the MLOps lifecycle.
- Experience: 3+ years of experience in DevOps, SRE, or platform engineering in high-growth environments.
- Cloud Expertise: 3+ years of experience with AWS infrastructure and services, including networking, IAM, ECS/EKS, and serverless computing.
- MLOps Experience: Experience supporting machine learning teams or MLOps platforms (e.g., model training pipelines, feature stores, online inference).
- IaC & CI/CD: Strong experience with Terraform or Ansible and CI/CD tooling (GitHub Actions, Argo CD, etc.).
- Containerization: Strong knowledge of container orchestration (Kubernetes preferred).
- Observability: Strong knowledge of observability stacks (Prometheus, Grafana, Sentry, Datadog, etc.).
- Programming: Familiarity with at least one programming language (Python, Go, Erlang, Rust, etc.).
- Preferred: Exposure to agentic programming workflows; RHCE/RHCSA or equivalent certifications.
Job Features
| Job Category | Cloud Engineering, Data |