Lambda

Technology

HPC Operations Engineer

$240–356k

San Francisco, California, United States

Full Time
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“HPC Operations Engineer at Lambda. Skills: HPC clusters, AI workloads, Cloud infrastructure, Networking. Deploy and configure large-scale HPC clusters. Install and configure operating systems”

Industry & Context.

Technology

Problems you'll solve

Problem solving; Troubleshooting

Eligibility Requirements

Presence in the San Francisco / Bellevue office 4 days a week; Flexibility to travel to data centers as on-site needs arise

What They're Looking For.

Must Have

5+ years of experience in deploying and configuring HPC clusters for AI workloads

Deeply experienced HPC engineer

Comfortable with logical provisioning of a cluster

Expert in configuring and troubleshooting SFP+ fiber, InfiniBand, and 100 GbE network fabrics

Expert in configuring and troubleshooting Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments

Expert in configuring and troubleshooting Linux-based compute nodes, firmware updates, and driver installation

Expert in configuring and troubleshooting SLURM, Kubernetes, or other job scheduling systems

Work well under deadlines and structured project plans

Excellent problem-solving and troubleshooting skills

Flexibility to travel to North American data centers

Able to work independently and as part of a team

Comfortable mentoring and supporting junior HPC engineers

Nice to Have

Experience with machine learning and deep learning frameworks

Experience with benchmarking tools

Experience with containerization technologies

Experience working with GPU acceleration, virtualization, and cloud computing

Keen situational awareness in customer situations

Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience

What You'll Do.

Deploy and configure large-scale HPC clusters

Install and configure operating systems

Install and configure firmware

Install and configure software

Install and configure networking

Troubleshoot and resolve HPC cluster issues

Provide requirements to engineering teams

Contribute to Standard Operating Procedures

Provide updates to project leads

Mentor and assist team members

Stay up-to-date on HPC/AI technologies

How You'll Work.

Team & Collaboration

Working closely with physical deployment teams; Work as part of a team; Supporting junior HPC engineers

Communication Scope

Clear and detailed requirements; Regular and well-communicated updates

Process & Methodology

Structured project plans

Full Job Description

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU.

If you'd like to build the world's best AI cloud, join us.

*Note: This position requires presence in our San Francisco / Bellevue office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.

Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

What You’ll Do

- Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
- Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
- Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site
- Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
- Contribute to the creation and maintenance of Standard Operating Procedures
- Provide regular and well-communicated updates to project leads throughout each deployment
- Mentor and assist less experienced team members
- Stay up-to-date on the latest HPC/AI technologies and best practices

You

- Are a deeply experienced HPC engineer comfortable with logical provisioning of a cluster
- Have a strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
- Have 5+ years of experience in deploying and configuring HPC clusters for AI workloads
- Have an innate attention to detail
- Are an expert in configuring and troubleshooting:
  - SF

How to Apply on Ashby

  • Ashby is a fast modern ATS — most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.
