HPC Support Engineer – Deep Learning Cloud
Lambda, The Superintelligence Cloud, which builds Gigawatt-scale AI Factories, is seeking an experienced HPC Support Engineer to join its Operations department. This highly technical role focuses on providing expert support for complex software and hardware issues across Lambda's deep learning cloud and High-Performance Computing (HPC) infrastructure.
This is a Full-time, Remote (USA) position with a competitive compensation range of $137K – $206K annually. The role is part of a 24/7 coverage model and requires working one of three set schedules, including potential weekend and evening shifts.
Role Summary and Technical Support Mandate
You will be expected to dive deep into complex challenges related to large-scale AI/ML and HPC workloads. The role requires a strong focus on advanced Linux system administration, cluster orchestration (Kubernetes/Slurm), and high-throughput networking technologies.
What You’ll Do:
- Advanced Troubleshooting: Dive into complex software and hardware issues and provide timely, effective solutions to customers.
- Escalation & Mentorship: Take escalations from peers while providing training and education to enhance team capabilities.
- Product Collaboration: Collaborate closely with engineering teams to identify customer pain points and develop innovative solutions, contributing expertise to shape the future of the deep learning cloud.
- Documentation & Improvement: Craft comprehensive documentation of solutions and contribute to enhancing support procedures.
- Project Work: Collaborate cross-functionally on projects focused on creating and improving support tooling.
- On-Call: Participate in a rotating on-call schedule with responsibility for major incidents and major customer alerts.
Required Experience and Technical Qualifications
The ideal candidate has extensive experience in cloud/systems engineering, deep Linux expertise, and specific knowledge of GPU-related hardware and software components critical to AI/ML and HPC environments.
- Experience (Mandatory):
  - 7+ years in cloud support operations or systems engineering.
  - Very strong understanding of and experience with Linux (Ubuntu) system administration.
  - Proven experience in HPC environments, with a strong preference for Kubernetes and/or Slurm cluster orchestration.
  - Strong experience with public cloud platforms (AWS, Azure, GCP) or GPU cloud providers.
- AI/HPC-Specific Knowledge:
  - Experience with CUDA, NCCL, NVLink, MIG, and GPUDirect RDMA.
  - Experience with high-throughput networking technologies (InfiniBand/RoCE).
  - Knowledge of distributed AI/ML or HPC workloads.
  - Knowledge of TCP/IP, VPNs, and firewalls in cloud environments.
- DevOps & Troubleshooting:
  - Proficiency with monitoring/logging tools (Prometheus, Grafana, Datadog).
  - Strong skills in log analysis, kernel-level debugging, and performance profiling.
  - Experience with virtualization and container technologies (Docker, Kubernetes).
- Soft Skills: Ability to work independently and mentor junior support engineers.
- Nice to Have (Bonus Points):
  - Very strong experience with Python, including environment management tools such as venv, conda, and pyenv.
  - Experience with storage providers and technologies (VAST, Ceph, Lustre, Weka, DDN).
  - Familiarity with Infrastructure-as-Code (IaC) tools (Terraform, Puppet, Ansible, Chef, etc.).
  - NVIDIA and InfiniBand certifications.
Job Features
Job Category: Cloud Engineering, Support Service, Technical Services