{"id":469,"date":"2025-10-28T13:27:40","date_gmt":"2025-10-28T13:27:40","guid":{"rendered":"https:\/\/skillbasedmatching.com\/jobs\/?post_type=jobpost&#038;p=469"},"modified":"2025-10-28T13:27:43","modified_gmt":"2025-10-28T13:27:43","slug":"hpc-support-engineer-deep-learning-cloud","status":"publish","type":"jobpost","link":"https:\/\/skillbasedmatching.com\/jobs\/current-jobs\/hpc-support-engineer-deep-learning-cloud\/","title":{"rendered":"HPC Support Engineer \u2013 Deep Learning Cloud"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong>Lambda, The Superintelligence Cloud<\/strong>, which builds Gigawatt-scale AI Factories, is seeking an experienced <strong>HPC Support Engineer<\/strong> to join its Operations department. This role is highly technical, focusing on providing expert support for complex software and hardware issues within its deep learning cloud and High-Performance Computing (HPC) infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is a <strong>Full-time, Remote (USA)<\/strong> position with a competitive compensation range of <strong>$137K \u2013 $206K<\/strong> annually. The role is part of a <strong>24\/7 coverage model<\/strong> and requires working one of three set schedules, including potential weekend and evening shifts.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Role Summary and Technical Support Mandate<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will be expected to dive deep into complex challenges related to large-scale AI\/ML and HPC workloads. 
The role requires a strong focus on advanced Linux system administration, cluster orchestration (Kubernetes\/Slurm), and high-throughput networking technologies.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">What You\u2019ll Do:<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Advanced Troubleshooting:<\/strong> Dive into <strong>complex software and hardware issues<\/strong> and provide timely, effective solutions to customers.<\/li>\n\n\n\n<li><strong>Escalation &amp; Mentorship:<\/strong> Take <strong>escalations from peers<\/strong> while providing training and education to enhance team capabilities.<\/li>\n\n\n\n<li><strong>Product Collaboration:<\/strong> Collaborate closely with engineering teams to <strong>identify customer pain points<\/strong> and develop innovative solutions, contributing expertise to shape the future of the deep learning cloud.<\/li>\n\n\n\n<li><strong>Documentation &amp; Improvement:<\/strong> Craft comprehensive documentation of solutions and contribute to enhancing support procedures.<\/li>\n\n\n\n<li><strong>Project Work:<\/strong> Work cross-functionally on projects that create and improve support tooling.<\/li>\n\n\n\n<li><strong>On-Call:<\/strong> Participate in a <strong>rotating on-call schedule<\/strong> covering major incidents and major customer alerts.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Required Experience and Technical Qualifications<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The ideal candidate has extensive experience in cloud\/systems engineering, deep Linux expertise, and specific knowledge of GPU-related hardware and software components critical to AI\/ML and HPC environments.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Experience (Mandatory):<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>7+ years<\/strong> in cloud support operations or systems engineering.<\/li>\n\n\n\n<li><strong>Very 
strong understanding of and experience with Linux (Ubuntu) system administration<\/strong>.<\/li>\n\n\n\n<li><strong>Proven experience in HPC environments<\/strong>, with a strong preference for <strong>Kubernetes and\/or Slurm<\/strong> for cluster orchestration.<\/li>\n\n\n\n<li>Strong experience with public cloud platforms (<strong>AWS, Azure, GCP<\/strong>) or GPU cloud providers.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>AI\/HPC Specific Knowledge:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Experience with <strong>CUDA, NCCL, NVLink, MIG, GPUDirect RDMA<\/strong>.<\/li>\n\n\n\n<li>Experience with <strong>high-throughput networking technologies (IB\/RoCE)<\/strong>.<\/li>\n\n\n\n<li>Knowledge of <strong>distributed AI\/ML or HPC workloads<\/strong>.<\/li>\n\n\n\n<li>Knowledge of <strong>TCP\/IP, VPN, and firewalls<\/strong> in cloud environments.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>DevOps &amp; Troubleshooting:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Proficiency with <strong>monitoring\/logging tools (Prometheus, Grafana, Datadog)<\/strong>.<\/li>\n\n\n\n<li>Strong skills in <strong>log analysis, debugging kernel-level issues, and performance profiling<\/strong>.<\/li>\n\n\n\n<li>Experience with <strong>virtualization and container technologies (Docker, Kubernetes)<\/strong>.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Soft Skills:<\/strong> Ability to work independently and <strong>mentor junior support engineers<\/strong>.<\/li>\n\n\n\n<li><strong>Nice to Have (Bonus Points):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Very strong experience with <strong>Python<\/strong> (including virtual environments like venv, conda, pyenv).<\/li>\n\n\n\n<li>Experience with storage providers and technologies (<strong>VAST, Ceph, Lustre, Weka, DDN<\/strong>).<\/li>\n\n\n\n<li>Familiarity with <strong>Infrastructure-as-Code (IaC)<\/strong> tools (<strong>Terraform, Puppet, Ansible, Chef<\/strong>, etc.).<\/li>\n\n\n\n<li>NVIDIA and InfiniBand 
certifications.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Lambda, The Superintelligence Cloud, which builds Gigawatt-scale AI Factories, is seeking an experienced HPC Support Engineer to join its Operations department. This role is highly technical, focusing on providing expert support for complex software and hardware issues within its deep learning cloud and High-Performance Computing (HPC) infrastructure. This is a Full-time, Remote (USA) position with [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"menu_order":0,"template":"","jobpost_category":[1294,734,1098],"jobpost_job_type":[39],"jobpost_location":[1041],"jobpost_tag":[2117,2121,997,2116,2122,2115,2119,2124,1232,2118,2123,1259,24,2014,2120],"class_list":["post-469","jobpost","type-jobpost","status-publish","hentry","jobpost_category-cloud-engineering","jobpost_category-support-service","jobpost_category-technical-services","jobpost_job_type-remote","jobpost_location-united-states","jobpost_tag-ai-factories","jobpost_tag-cuda","jobpost_tag-datadog","jobpost_tag-deep-learning-cloud","jobpost_tag-gpu-cloud","jobpost_tag-hpc-support-engineer","jobpost_tag-ib-roce-networking","jobpost_tag-kernel-debugging","jobpost_tag-kubernetes","jobpost_tag-linux-cluster-administration","jobpost_tag-nccl","jobpost_tag-prometheus","jobpost_tag-python","jobpost_tag-remote-us","jobpost_tag-slurm"],"_links":{"self":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost\/469","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost"}],"about":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/types\/jobpost"}],"author":[{"embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/users\/1"}],"wp:attachment":[{"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/media?parent=469"}],"wp:term":[{"taxonomy":"jobpost_category","embeddable":true
,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_category?post=469"},{"taxonomy":"jobpost_job_type","embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_job_type?post=469"},{"taxonomy":"jobpost_location","embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_location?post=469"},{"taxonomy":"jobpost_tag","embeddable":true,"href":"https:\/\/skillbasedmatching.com\/jobs\/wp-json\/wp\/v2\/jobpost_tag?post=469"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}