Senior Cloud Infrastructure Engineer – Multi-Cloud & MLOps

Hybrid
New York, NY
Posted 2 days ago

Hatch, a fast-moving technology company solving real-world business problems with AI, is seeking a Senior DevOps Engineer (titled Cloud Infrastructure Engineer) to join its high-impact engineering team. This senior role is focused on building the resilient, secure, and scalable infrastructure that powers the company’s core platform and AI product lines.

This is a full-time, hybrid position based in SoHo, New York City. Candidates must be based in NYC, and visa sponsorship is not available.


Core Mandate: Infrastructure, MLOps, and Reliability

You will own the infrastructure that enables the company’s velocity, focusing on the specialized compute and data needs of machine learning workflows.

  • Infrastructure at Scale:
    • Evolve the multi-cloud infrastructure (AWS & GCP) using Infrastructure-as-Code (Terraform or Ansible).
    • Manage scalable, secure, and cost-efficient environments across all stages (dev, staging, production).
    • Implement systems that support the compute-heavy and storage-intensive needs of machine learning and data processing pipelines.
    • Participate in an on-call rotation.
  • ML Platform Support (MLOps):
    • Collaborate with ML engineers to productionize models and manage workflows across training, testing, and deployment.
    • Implement infrastructure to support versioning, orchestration, and monitoring of ML models (using tools like Kubeflow, SageMaker, or Vertex AI).
    • Optimize data pipelines and model serving for low latency and high throughput.
  • Reliability & Observability:
    • Drive the strategy for observability, logging, and alerting across distributed systems.
    • Lead incident response, root cause analysis (RCA), and system hardening.
    • Implement best practices for infrastructure security and container hardening.

Required Experience and Technical Stack

The role requires a senior engineer with deep AWS experience, IaC expertise, and specialized knowledge of the MLOps lifecycle.

  • Experience: 3+ years of experience in DevOps, SRE, or platform engineering in high-growth environments.
  • Cloud Expertise: 3+ years of experience with AWS infrastructure and services, including networking, IAM, ECS/EKS, and serverless computing.
  • MLOps Experience: Experience supporting machine learning teams or MLOps platforms (e.g., model training pipelines, feature stores, online inference).
  • IaC & CI/CD: Strong experience with Terraform or Ansible and CI/CD tooling (GitHub Actions, Argo CD, etc.).
  • Containerization: Strong knowledge of container orchestration (Kubernetes preferred).
  • Observability: Strong knowledge of observability stacks (Prometheus, Grafana, Sentry, Datadog, etc.).
  • Programming: Familiarity with at least one programming language (Python, Go, Erlang, Rust, etc.).
  • Preferred: Exposure to agentic programming workflows and RHCE/RHCSA or equivalent certifications.

Job Features

Job Category: Cloud Engineering, Data
