NVIDIA

Technology

Software Engineer, DGX Cloud AI Infrastructure

$116–224k · Santa Clara, California, United States · FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for mid-level candidates.

The Brief

“Software Engineer, DGX Cloud AI Infrastructure at NVIDIA. Skills: Distributed systems, AI infrastructure, GPU platforms, LLM workloads. Bring up and validate AI clusters.”

Industry & Context.

Technology

Problems you'll solve

Analytical skills; Debugging; Troubleshooting; Root cause analysis

What They're Looking For.

Must Have

Bachelor's or Master's in Computer Science, 3+ years of experience, Hands-on experience with multi-GPU, Hands-on experience with multi-node workloads, Hands-on experience with CUDA-aware distributed execution, Background with debugging distributed systems, Background with scaling distributed systems, Experience debugging AI applications, Experience triaging AI applications, Experience operating workloads in scheduled environments, Experience operating workloads in containerized environments, Python programming skills, C/C++ programming skills

Nice to Have

Hands-on experience with NCCL, Deep familiarity with RDMA software stack, Familiarity with InfiniBand / RoCE congestion debugging, Experience building acceptance tests, Experience building benchmark harnesses, Experience building regression gates, Experience building cluster qualification tooling, Experience with MLPerf, Experience diagnosing performance jitter, Experience building resilience systems, Experience building fault-detection systems, Experience building failure-attribution systems

What You'll Do.

Bring up infrastructure

Validate infrastructure

Bring up end-to-end workloads

Validate end-to-end workloads

Debug end-to-end workloads

Tune AI pre-training workloads

Benchmark AI pre-training workloads

Tune AI post-training workloads

Benchmark AI post-training workloads

Tune AI inference workloads

Benchmark AI inference workloads

Perform root-cause analysis

Contribute to resilience tooling

Contribute to failure-attribution tooling

Attribute node failures

Detect fabric failures

Triage fabric failures

Attribute fabric failures

Detect workload failures

Triage workload failures

Attribute workload failures

Build repeatable benchmark suites

Build acceptance criteria

Build qualification workflows

Tune runtime settings

Tune communication parameters

Tune deployment configurations

Deliver actionable recommendations

Provide data-driven recommendations

How You'll Work.

Team & Collaboration

Framework teams; Systems teams; Platform teams; Cross-functional teams

Communication Scope

Communication skills

Full Job Description

NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. We are looking for a Software Engineer focused on bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales we run.

In this role you will help bring up, benchmark, and debug distributed LLM workloads on multi-GPU and multi-node deployments, and own the design and implementation of the benchmarking tooling, automation, and debugging workflows that support them. This is a hands-on role for an engineer who enjoys deep technical problems across deep learning systems, GPU performance, distributed computing, and large-scale operations.

**What you’ll be doing:**

* Bring up, validate, and debug large-scale AI clusters, infrastructure, and end-to-end workloads.
* Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
* Perform root-cause analysis of failures in large distributed environments.
* Contribute to the resilience and failure-attribution tooling that detects, triages, and attributes node, fabric, and workload failures across the cluster.
* Build and maintain repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
* Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
* Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.

**What we need to see:**

* Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
* 3+ years of experience developing software for AI, HPC, or systems-level applications.
* Hands-on experience with multi-GPU or multi-node workload

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
