Nvidia
AI
Senior Software Engineer, AI Resiliency
Neural analysis suggests this role is optimal for Senior candidates.
“Senior Software Engineer, AI Resiliency at Nvidia. Skills: AI Software Resiliency, distributed systems, fault tolerance, C++, Python, large-scale computing environments, AI frameworks. Develop AI Software Resiliency Features. Implement and optimize software features that improve AI system reliability at a massive scale”
What You'll Achieve.
Driving down cluster downtime towards zero; ensuring that our AI systems remain robust and reliable at all times; making AI training and inference more reliable, scalable, and efficient.
Industry & Context.
Excellent problem-solving skills
What They're Looking For.
Must Have
- Bachelor’s, Master’s or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- Proficiency in C++ and Python, with experience in writing efficient, high-performance code
- 6+ years of relevant experience
- Understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
- Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
- Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
- Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment
Nice to Have
- Hands-on experience in training models or working with model training teams
- Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale
- Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
- Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads
- Systems programming skills and experience with low-level performance tuning
What You'll Do.
Develop AI Software Resiliency Features
Implement and optimize software features that improve AI system reliability at a massive scale.
Hands-On Coding & Optimization
Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
Fault Tolerance & Debugging
Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
Testing & Automation
Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
Support Production Deployments
Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.
How You'll Work.
Team & Collaboration
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
Full Job Description
We are now looking for a Senior Software Engineer for AI Resiliency! At NVIDIA, we are pushing the boundaries of what’s possible in AI. We are currently seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will play a pivotal role in defining and implementing critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will be crucial in driving down cluster downtime towards zero, ensuring that our AI systems remain robust and reliable at all times.

**What You’ll Be Doing:**

* Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
* Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
* Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
* Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
* Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
* Support Production Deployments: Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.

**What We Need to See:**

* You've achieved a Bachelor’s, Master’s or PhD in Computer Science,
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.