NVIDIA
Technology
Software Engineer, DGX Cloud AI Infrastructure
Neural analysis suggests this role is optimal for mid-level candidates.
“Software Engineer, DGX Cloud AI Infrastructure at NVIDIA. Skills: Distributed systems, AI infrastructure, GPU platforms, LLM workloads. Bring up and validate AI clusters.”
Industry & Context.
Analytical skills; Debugging; Troubleshooting; Root cause analysis
What They're Looking For.
Must Have
Bachelor's or Master's in Computer Science, 3+ years of experience, Hands-on experience with multi-GPU, Hands-on experience with multi-node workloads, Hands-on experience with CUDA-aware distributed execution, Background with debugging distributed systems, Background with scaling distributed systems, Experience debugging AI applications, Experience triaging AI applications, Experience operating workloads in scheduled environments, Experience operating workloads in containerized environments, Python programming skills, C/C++ programming skills
Nice to Have
Hands-on experience with NCCL, Deep familiarity with RDMA software stack, Familiarity with InfiniBand / RoCE congestion debugging, Experience building acceptance tests, Experience building benchmark harnesses, Experience building regression gates, Experience building cluster qualification tooling, Experience with MLPerf, Experience diagnosing performance jitter, Experience building resilience systems, Experience building fault-detection systems, Experience building failure-attribution systems
What You'll Do.
Bring up infrastructure
Validate infrastructure
Bring up end-to-end workloads
Validate end-to-end workloads
Debug end-to-end workloads
Tune AI pre-training workloads
Benchmark AI pre-training workloads
Tune AI post-training workloads
Benchmark AI post-training workloads
Tune AI inference workloads
Benchmark AI inference workloads
Perform root-cause analysis
Contribute to resilience tooling
Contribute to failure-attribution tooling
Detect node failures
Triage node failures
Attribute node failures
Detect fabric failures
Triage fabric failures
Attribute fabric failures
Detect workload failures
Triage workload failures
Attribute workload failures
Build repeatable benchmark suites
Build acceptance criteria
Build qualification workflows
Tune runtime settings
Tune communication parameters
Tune deployment configurations
Deliver actionable recommendations
Provide data-driven recommendations
How You'll Work.
Team & Collaboration
Framework teams; Systems teams; Platform teams; Cross-functional teams
Communication Scope
Communication skills
Full Job Description
NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. We are looking for a Software Engineer focused on bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales we run.

In this role you will help bring up, benchmark, and debug distributed LLM workloads on multi-GPU and multi-node deployments, and own the design and implementation of the benchmarking tooling, automation, and debugging workflows that support them. This is a hands-on role for an engineer who enjoys deep technical problems across deep learning systems, GPU performance, distributed computing, and large-scale operations.

**What you’ll be doing:**

* Bring up, validate, and debug large-scale AI clusters, infrastructure, and end-to-end workloads.
* Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
* Perform root-cause analysis of failures in large distributed environments.
* Contribute to the resilience and failure-attribution tooling that detects, triages, and attributes node, fabric, and workload failures across the cluster.
* Build and maintain repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
* Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
* Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.

**What we need to see:**

* Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
* 3+ years of experience developing software for AI, HPC, or systems-level applications.
* Hands-on experience with multi-GPU or multi-node workload
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.