NVIDIA
Technology
Senior Software Engineer, DGX Cloud AI Infrastructure
Neural analysis suggests this role is optimal for Senior candidates.
“Senior Software Engineer, DGX Cloud AI Infrastructure at NVIDIA. Skills: AI Infrastructure, Distributed Systems, GPU Computing, Performance Engineering. Lead bring-up of AI clusters. Lead validation of AI clusters”
Industry & Context.
Analytical skills; Debugging skills; Root cause analysis; Troubleshooting
What They're Looking For.
Must Have
Bachelor's or Master's in Computer Science, 8+ years developing software infrastructure, Track record of technical leadership, Expertise debugging AI applications, Deep hands-on experience with NCCL, CUDA-aware distributed execution experience, Debugging multi-GPU/multi-node workloads, Architecting, debugging, scaling distributed systems, Expert-level Python programming, Expert-level C/C++ programming, Operating workloads in scheduled environments, Operating workloads in containerized environments, Excellent analytical skills, Excellent debugging skills, Excellent communication skills
Nice to Have
Debugging and optimizing AI workloads at large scale, Familiarity with RDMA software stack, Knowledge of GPU cluster fabrics, Knowledge of GPU cluster topology, Experience building acceptance tests, Experience building benchmark harnesses, Experience building regression gates, Experience building cluster qualification tooling, Experience building resilience systems, Experience building fault-detection systems, Experience building failure-attribution systems
What You'll Do.
Lead bring-up of AI clusters
Lead validation of AI clusters
Lead debugging of AI clusters
Lead bring-up of infrastructure
Lead validation of infrastructure
Lead debugging of infrastructure
Lead bring-up of workloads
Lead validation of workloads
Lead debugging of workloads
Tune AI pre-training workloads
Tune AI post-training workloads
Tune AI inference workloads
Benchmark AI pre-training workloads
Benchmark AI post-training workloads
Benchmark AI inference workloads
Profile workload performance
Optimize workload performance
Analyze scaling efficiency
Translate findings into guidance
Own root-cause analysis
Define resilience stack
Build resilience stack
Define failure-attribution stack
Build failure-attribution stack
Build repeatable benchmark suites
Build acceptance criteria
Build qualification workflows
Tune runtime settings
Tune communication parameters
Tune deployment configurations
Deliver data-driven recommendations
Drive technical standards
Act as force multiplier
How You'll Work.
Team & Collaboration
Partner with framework teams; Partner with systems teams; Partner with platform teams; Influence across teams
Communication Scope
Communication skills
Full Job Description
NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. We are looking for a Senior Software Engineer to lead the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales we run.

In this role you will set technical direction across communication libraries, model frameworks, and inference/training stacks to ensure state-of-the-art LLM workloads run efficiently and reliably at scale. You will lead deep performance and reliability investigations on multi-GPU and multi-node deployments, define how we benchmark and qualify new platforms, and build the resilience and failure-attribution capabilities that keep large clusters productive.

This is a hands-on senior individual-contributor role for an engineer who operates at the intersection of deep learning systems, GPU performance, distributed computing, and large-scale operations, and who raises the bar for the engineers around them.

**What you’ll be doing:**

* Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
* Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
* Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
* Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
* Own root-cause analysis of complex failures: hangs, performance regressions, topology sensitivity in large distributed environments.
* Define and build t
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.