NVIDIA

Technology

Senior Technical Program Manager, DGX Cloud Software Products and Services

$200–322k · Santa Clara, California, United States · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Technical Program Manager, DGX Cloud Software Products and Services at NVIDIA. Skills: Technical Program Management, Resilience, Reliability, Operational Scale, Cloud Infrastructure, AI Training Environments. Lead strategic programs emphasizing resilience, reliability, and goodput. Drive improvements in resilience, service stability, and operational scale”

What You'll Achieve.

Improve resilience, reliability, and goodput; drive improvements in resilience, service stability, and operational scale; deliver fault-tolerant, high-availability training and inference environments at scale; improve end-to-end service stability; strengthen service readiness and recovery; facilitate reusable and extensible capabilities across the platform; improve observability, failure detection and attribution, root cause analysis, recovery orchestration, and operational readiness; increase usable fleet capacity, workload efficiency, and customer outcomes at scale; track program health, reliability posture, operational maturity, and performance

Industry & Context.

Technology

Problems you'll solve

analytical skills with the ability to assess issues across infrastructure, software, and operational layers; problem-solving skills

What They're Looking For.

Must Have

MS EE or CS degree, or equivalent experience; 8+ years of experience in program management of large-scale software or infrastructure projects; Proven track record of leading complex cross-functional programs in cloud, infrastructure, distributed systems, or platform environments; Analytical skills with the ability to assess issues across infrastructure, software, and operational layers; Excellent organizational skills; Solid understanding of reliability engineering, resilience development, and service performance metrics, including goodput, efficiency, and utilization; Experience working alongside engineering, SRE, operations, and technical collaborators to advance projects in ambiguous, high-complexity environments; Outstanding communication and presentation skills for diverse technical and non-technical audiences, with problem-solving and conflict management skills

Nice to Have

Background in computer science, machine learning, deep learning, open-source software, and GPU technology, AI infrastructure, or large-scale compute platforms; Experience with large-scale AI training environments (e.g., distributed training frameworks, checkpointing, NCCL, Slurm or other schedulers); Prior experience in the management of customer workflows using large-scale distributed computing and working with AI researchers or directly training and evaluating AI models; Proven ability to harness AI-enabled workflows and tools to improve program management efficiency, decision-making, execution visibility, and operational efficiency

What You'll Do.

Lead strategic programs emphasizing resilience, reliability, and goodput

Drive improvements in resilience, service stability, and operational scale

Guide architectural decisions related to resilience reference architecture

Lead programs spanning DGXC infrastructure, Resilience Tools, and core platform services to deliver fault-tolerant, high-availability training and inference environments at scale

Develop scalable resilience strategies

Improve operational performance

Assist in building open, modular software components and reference stacks for DGX Cloud at scale

Lead cross-functional programs that improve resilience, reliability, operational scale, and fleet-wide goodput across DGX Cloud

Drive the definition and adoption of resilience reference stacks, operational standards, and scalable guidelines that strengthen service readiness and recovery

Support the development and delivery of open, modular software components for resilience, facilitating reusable and extensible capabilities across the platform

Build and scale resilience tooling and operational mechanisms that improve observability, failure detection and attribution, root cause analysis, recovery orchestration, and operational readiness

Use data-driven insights to increase usable fleet capacity, workload efficiency, and customer outcomes at scale

Establish clear metrics and operating cadences to track program health, reliability posture, operational maturity, and performance

How You'll Work.

Team & Collaboration

Collaboration across multiple teams; Partner across infrastructure, platform, site reliability, operational, and tenant teams to identify systemic risks, resolve cross-stack dependencies, and improve end-to-end service stability; Partner with engineering teams and researchers to support the development and delivery of open, modular software components for resilience; Work closely with engineering, SRE, operations, and researchers to develop scalable resilience strategies, improve operational performance, and assist in building open, modular software components and reference stacks for DGX Cloud at scale; Experience working alongside engineering, SRE, operations, and technical collaborators to advance projects in ambiguous, high-complexity environments

Communication Scope

Outstanding communication and presentation skills for diverse technical and non-technical audiences

Process & Methodology

program management of large-scale software or infrastructure projects, leading complex cross-functional programs, project management tools (e.g. Jira, Aha!, Confluence), program health, operating cadences

Full Job Description

NVIDIA's DGX Cloud (DGXC) powers AI for strategic research and product workloads. The company seeks an expert Technical Program Manager (IC5) to lead strategic programs emphasizing resilience, reliability, and goodput. This role requires collaboration across multiple teams. It involves driving improvements in resilience, service stability, and operational scale. The TPM also guides architectural decisions related to resilience reference architecture. The TPM leads programs spanning DGXC infrastructure, Resilience Tools, and core platform services to deliver fault-tolerant, high-availability training and inference environments at scale.

We are looking for a TPM who is analytical, technically skilled, and comfortable working with cloud infrastructure, software, operations, and environments driven by data and research. You will work closely with engineering, SRE, operations, and researchers to develop scalable resilience strategies, improve operational performance, and assist in building open, modular software components and reference stacks for DGX Cloud at scale.

What You'll Be Doing:

* Lead cross-functional programs that improve resilience, reliability, operational scale, and fleet-wide goodput across DGX Cloud.
* Partner across infrastructure, platform, site reliability, operational, and tenant teams to identify systemic risks, resolve cross-stack dependencies, and improve end-to-end service stability.
* Drive the definition and adoption of resilience reference stacks, operational standards, and scalable guidelines that strengthen service readiness and recovery.
* Partner with engineering teams and researchers to support the development and delivery of open, modular software components for resilience, facilitating reusable and extensible capabilities across the platform.
* Build and scale resilience tooling and operational mechanisms that improve observability, failure detection and attribution, root cause analysis, recovery orchestration, and operational read


SKILL SIGNAL 41 detected · ranked by frequency

Resilience ×5

Operational Scale ×3

Cloud Infrastructure ×3

technical skills ×3

analytical skills ×3

organizational skills ×3

reliability engineering ×3

service performance metrics ×3

goodput ×3

efficiency ×3

utilization ×3

communication skills ×3

presentation skills ×3

Technical Program Management ×2

Reliability ×2

AI Training Environments ×2

Git ×2

distributed systems

platform environments

AI infrastructure

large-scale compute platforms

distributed training frameworks

NCCL

Slurm

program management of large-scale software or infrastructure projects

resilience reference architecture

fault-tolerant, high-availability training and inference environments at scale

service stability

operational performance

resilience strategies

operational standards

service readiness

BEHAVIOURAL

collaboration across multiple teams; analytical; comfortable working with data and research driven environments; problem-solving; conflict management

Role Details

Seniority manager

Experience 8–10 yrs

Level Senior

Work Mode Remote Friendly

Type FULL TIME

Education MS EE or CS degree, or equivalent experience

Salary Band 200k+

AI-Extracted Insights

Domain Areas

dgx-cloud-software-products-and-services, ai, cloud-infrastructure, distributed-systems, platform-environments, reliability-engineering, resilience-development, service-performance-metrics

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
