NVIDIA

Technology

Senior Software Engineer, DGX Cloud AI Infrastructure

$184–357k · Santa Clara, California, United States · FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Software Engineer, DGX Cloud AI Infrastructure at NVIDIA. Skills: AI Infrastructure, Distributed Systems, GPU Computing, Performance Engineering. Lead bring-up of AI clusters. Lead validation of AI clusters”

Industry & Context.

Technology

Problems you'll solve

Analytical skills; Debugging skills; Root cause analysis; Troubleshooting

What They're Looking For.

Must Have

Bachelor's or Master's in Computer Science, 8+ years developing software infrastructure, Track record of technical leadership, Expertise debugging AI applications, Deep hands-on experience with NCCL, CUDA-aware distributed execution experience, Debugging multi-GPU/multi-node workloads, Architecting, debugging, scaling distributed systems, Expert-level Python programming, Expert-level C/C++ programming, Operating workloads in scheduled environments, Operating workloads in containerized environments, Excellent analytical skills, Excellent debugging skills, Excellent communication skills

Nice to Have

Debugging and optimizing AI workloads at large scale, Familiarity with RDMA software stack, Knowledge of GPU cluster fabrics, Knowledge of GPU cluster topology, Experience building acceptance tests, Experience building benchmark harnesses, Experience building regression gates, Experience building cluster qualification tooling, Experience building resilience systems, Experience building fault-detection systems, Experience building failure-attribution systems

What You'll Do.

Lead bring-up of AI clusters

Lead validation of AI clusters

Lead debugging of AI clusters

Lead bring-up of infrastructure

Lead validation of infrastructure

Lead debugging of infrastructure

Lead bring-up of workloads

Lead validation of workloads

Lead debugging of workloads

Tune AI pre-training workloads

Tune AI post-training workloads

Tune AI inference workloads

Benchmark AI pre-training workloads

Benchmark AI post-training workloads

Benchmark AI inference workloads

Profile workload performance

Optimize workload performance

Analyze scaling efficiency

Translate findings into guidance

Own root-cause analysis

Define resilience stack

Build resilience stack

Define failure-attribution stack

Build failure-attribution stack

Build repeatable benchmark suites

Build acceptance criteria

Build qualification workflows

Tune runtime settings

Tune communication parameters

Tune deployment configurations

Deliver data-driven recommendations

Drive technical standards

Act as force multiplier

How You'll Work.

Team & Collaboration

Partner with framework teams; Partner with systems teams; Partner with platform teams; Influence across teams

Communication Scope

Communication skills

Full Job Description

NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. We are looking for a Senior Software Engineer to lead the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales we run.

In this role you will set technical direction across communication libraries, model frameworks, and inference/training stacks to ensure state-of-the-art LLM workloads run efficiently and reliably at scale. You will lead deep performance and reliability investigations on multi-GPU and multi-node deployments, define how we benchmark and qualify new platforms, and build the resilience and failure-attribution capabilities that keep large clusters productive.

This is a hands-on senior individual-contributor role for an engineer who operates at the intersection of deep learning systems, GPU performance, distributed computing, and large-scale operations, and who raises the bar for the engineers around them.

**What you’ll be doing:**

* Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
* Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
* Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
* Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
* Own root-cause analysis of complex failures: hangs, performance regressions, topology sensitivity in large distributed environments.
* Define and build t

SKILL SIGNAL 39 detected · ranked by frequency

Distributed computing ×3

GPU performance ×3

Performance tuning ×3

Scaling efficiency analysis ×3

Failure analysis ×3

Resilience building ×3

Automation ×3

Qualification workflows ×3

AI Infrastructure ×2

Distributed Systems ×2

GPU Computing ×2

Performance Engineering ×2

Python

C/C++

NCCL

CUDA

PyTorch

NeMo

Megatron

TensorRT-LLM

RDMA

UCX

Libfabric

NVLink

NVSwitch

PCIe

RoCE

InfiniBand

Technical leadership

Performance optimization

Reliability engineering

Large-scale operations

BEHAVIOURAL

Leadership

Mentoring

Role Details

Seniority senior

Experience 5–10 yrs

Level Senior

Work Mode Onsite

Type FULL TIME

Salary Band $184–357k

AI-Extracted Insights

Domain Areas

deep-learning-systems, large-scale-ai-systems, hpc-systems, distributed-computing, gpu-performance, large-scale-operations, ai-workloads, llm-workloads

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
