NVIDIA
AI/ML Technology
Principal AI and ML Infra Software Engineer, GPU Clusters
What You'll Achieve.
enhancing efficiency for our researchers; facilitating groundbreaking AI and ML research on GPU Clusters; ensuring high availability, scalability, and efficient resource utilization; ensuring that our actions are in line with measurable results
Industry & Context.
pinpoint and address infrastructure deficiencies; identify researcher efficiency bottlenecks; systematically improve researcher efficiency; optimize the performance of our infrastructure
What They're Looking For.
Must Have
- BS or similar background in Computer Science or related area (or equivalent experience)
- 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems
- Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure
- In-depth knowledge of accelerated computing (e.g., GPU, custom silicon)
- In-depth knowledge of storage (e.g., Lustre, GPFS, BeeGFS)
- In-depth knowledge of scheduling & orchestration (e.g., Slurm, Kubernetes, LSF)
- In-depth knowledge of high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA)
- In-depth knowledge of container technologies (Docker, Enroot)
- Capability in supervising and improving substantial distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX
- In-depth understanding of AI/ML workflows, involving data processing, model training, and inference pipelines
- Proficiency in programming & scripting languages such as Python, Go, and Bash
- Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
- Experience with parallel computing frameworks and paradigms
- Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector
- Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds
What You'll Do.
Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements.
Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve researcher efficiency. Drive the direction and long-term roadmaps for such initiatives.
Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.
How You'll Work.
Team & Collaboration
Engage closely with our AI and ML research teams; Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals; Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds
Communication Scope
Excellent communication and collaboration skills
Process & Methodology
Lead initiatives, Drive the direction and long-term roadmaps
Full Job Description
We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will have a pivotal role in enhancing efficiency for our researchers by implementing progressions throughout the entire stack. Your main task will revolve around collaborating closely with customers to pinpoint and address infrastructure deficiencies, facilitating groundbreaking AI and ML research on GPU Clusters. Together, we can craft potent, effective, and scalable solutions as we mold the future of AI/ML technology!

**What you will be doing:**

* Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers, converting those insights into actionable improvements.
* Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve researcher efficiency. Drive the direction and long-term roadmaps for such initiatives.
* Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
* Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
* Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
* Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.

**What we need to see:**

* BS or similar background in Computer Science or related area (or equivalent experience).
* 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
* Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kuber
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.