Nvidia

AI

Senior Software Engineer, AI Resiliency

$184–288k Redmond, Washington, United States FULL TIME

The Brief

“Senior Software Engineer, AI Resiliency at Nvidia. Skills: AI Software Resiliency, distributed systems, fault tolerance, C++, Python, large-scale computing environments, AI frameworks. Develop AI Software Resiliency Features. Implement and optimize software features that improve AI system reliability at a massive scale”

What You'll Achieve.

Driving down cluster downtime towards zero; ensuring that our AI systems remain robust and reliable at all times; and making AI training and inference more reliable, scalable, and efficient.

Industry & Context.

AI
Problems you'll solve

Excellent problem-solving skills

What They're Looking For.

Must Have

  • Bachelor’s, Master’s or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Proficiency in C++ and Python, with experience in writing efficient, high-performance code
  • 6+ years of relevant experience
  • Understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
  • Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment

Nice to Have

  • Hands-on experience in training models or working with model training teams
  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale
  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads
  • Systems programming skills and experience with low-level performance tuning

What You'll Do.

Develop AI Software Resiliency Features

Implement and optimize software features that improve AI system reliability at a massive scale

Hands-On Coding & Optimization

Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code

Enhance performance for AI workloads running on thousands of GPUs

Fault Tolerance & Debugging

Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios

Assist in developing monitoring tools for proactive failure mitigation

Testing & Automation

Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms

Contribute to CI/CD pipelines to automate validation of AI workloads

Support Production Deployments

Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads

How You'll Work.

Team & Collaboration

Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA

Full Job Description

We are now looking for a Senior Software Engineer for AI Resiliency! At NVIDIA, we are pushing the boundaries of what’s possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.

What You’ll Be Doing:

  • Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
  • Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
  • Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
  • Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
  • Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
  • Support Production Deployments: Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.

What We Need to See:

  • You've achieved a Bachelor’s, Master’s or PhD in Computer Science,


How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
