NVIDIA

AI/ML Technology

Principal AI and ML Infra Software Engineer, GPU Clusters

$272–431k · Santa Clara, California, United States · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Principal candidates.

The Brief

“Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA. Skills: AI and ML Infra Software Engineering, GPU Clusters, High Performance Computing (HPC), accelerated computing, distributed training operations, AI/ML workflows. Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements. Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve efficiency.”

What You'll Achieve.

enhancing efficiency for our researchers; facilitating groundbreaking AI and ML research on GPU Clusters; ensuring high availability, scalability, and efficient resource utilization; ensuring that our actions are in line with measurable results

Industry & Context.

AI/ML Technology

Problems you'll solve

pinpoint and address infrastructure deficiencies; identify researcher efficiency bottlenecks; systematically improve researcher efficiency; optimize the performance of our infrastructure

What They're Looking For.

Must Have

BS or similar background in Computer Science or related area (or equivalent experience)

15+ years of demonstrated expertise in AI/ML and HPC tasks and systems

Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure

In-depth knowledge of accelerated computing (e.g., GPU, custom silicon)

In-depth knowledge of storage (e.g., Lustre, GPFS, BeeGFS)

In-depth knowledge of scheduling & orchestration (e.g., Slurm, Kubernetes, LSF)

In-depth knowledge of high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA)

In-depth knowledge of container technologies (Docker, Enroot)

Capability in supervising and improving substantial distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX

In-depth understanding of AI/ML workflows, involving data processing, model training, and inference pipelines

Proficiency in programming & scripting languages such as Python, Go, Bash

Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)

Experience with parallel computing frameworks and paradigms

Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector

Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds

What You'll Do.

Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements.

Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve efficiency. Drive the direction and long-term roadmaps for such initiatives.

Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.

Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.

Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.

Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.

How You'll Work.

Team & Collaboration

Engage closely with our AI and ML research teams; Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals; Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds

Communication Scope

Excellent communication and collaboration skills

Process & Methodology

Lead initiatives; drive the direction and long-term roadmaps

Full Job Description

We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will have a pivotal role in enhancing efficiency for our researchers by implementing improvements throughout the entire stack. Your main task will revolve around collaborating closely with customers to pinpoint and address infrastructure deficiencies, facilitating groundbreaking AI and ML research on GPU Clusters. Together, we can craft potent, effective, and scalable solutions as we mold the future of AI/ML technology!

**What you will be doing:**

* Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements.
* Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve efficiency. Drive the direction and long-term roadmaps for such initiatives.
* Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
* Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
* Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
* Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.

**What we need to see:**

* BS or similar background in Computer Science or related area (or equivalent experience).
* 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
* Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kuber

SKILL SIGNAL 44 detected · ranked by frequency

AI/ML workflows ×5

supervising and improving substantial distributed training operations ×3

data processing ×3

model training ×3

inference pipelines ×3

programming ×3

scripting ×3

AI and ML Infra Software Engineering ×2

GPU Clusters ×2

High Performance Computing (HPC) ×2

accelerated computing ×2

distributed training operations ×2

Lustre ×2

GPFS ×2

BeeGFS ×2

Slurm ×2

Kubernetes ×2

LSF ×2

Infiniband ×2

RoCE ×2

Amazon EFA ×2

Docker ×2

Enroot ×2

PyTorch ×2

NeMo ×2

JAX ×2

Python ×2

Go ×2

Bash ×2

AWS ×2

GCP ×2

Azure ×2

BEHAVIOURAL

collaboration skills · work effectively with teams and individuals of different backgrounds · passionate · independent

Role Details

Seniority senior

Experience 15+ yrs

Level Principal

Type FULL TIME

Education BS or similar background in Computer Science or related area

Salary Band 200k+

AI-Extracted Insights

Domain Areas

ai-ml · hpc · accelerated-computing · storage · scheduling-orchestration · high-speed-networking · containers-technologies · distributed-training-operations

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
