NVIDIA

Artificial Intelligence, High-Performance Computing, Visualization

Senior Software Engineer, DGX Cloud Production Engineering

$184–357k · Santa Clara, California, United States · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Software Engineer, DGX Cloud Production Engineering at NVIDIA. Skills: Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, Day 2 operability. Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments. Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations”

What You'll Achieve.

Make GPU clusters reliable, scalable, and safe to run; reduce manual production touches

Industry & Context.

Artificial Intelligence, High-Performance Computing, Visualization

Problems you'll solve

Troubleshooting distributed systems in production

Eligibility Requirements

Participation in on-call and incident response

What They're Looking For.

Must Have

8+ years of experience building or operating production infrastructure

Strong programming skills in Python, Go, or similar

Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation

Ability to troubleshoot distributed systems in production

Clear communication and ability to work across teams

BS/MS in Computer Science or equivalent experience

Nice to Have

Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation

Experience with SLOs, on-call, incident response, observability, and reliability practices

Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure

What You'll Do.

Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments

Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations

Improve Day 0 / Day 1 / Day 2 workflows for cluster bringup, handoff, and production operations

Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows

Participate in on-call, incident response, debugging, and durable follow-up work

Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready

How You'll Work.

Team & Collaboration

Work across teams; partner with platform, storage, networking, security, and workload teams

Communication Scope

Clear communication

Full Job Description

NVIDIA DGX Cloud is building and operating large-scale GPU infrastructure for AI research and production workloads. We are looking for Senior Software Engineers to help build the automation, tooling, and operational systems that make GPU clusters reliable, scalable, and safe to run. This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.

**What you’ll be doing:**

* Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
* Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
* Improve Day 0 / Day 1 / Day 2 workflows for cluster bringup, handoff, and production operations.
* Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
* Participate in on-call, incident response, debugging, and durable follow-up work.
* Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready.

**What we need to see:**

* 8+ years of experience building or operating production infrastructure.
* Strong programming skills in Python, Go, or similar.
* Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation.
* Ability to troubleshoot distributed systems in production.
* Clear communication and ability to work across teams.
* BS/MS in Computer Science or equivalent experience.

**Ways to stand out from the crowd:**

* Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation.
* Experience with SLOs, on-call, incident response, observability, and reliability practices.
* Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure.

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performa

SKILL SIGNAL 37 detected · ranked by frequency

GitOps ×4

Day 2 operability ×3

Build and operate automation for large-scale GPU clusters ×3

Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations ×3

Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows ×3

Partner with platform, storage, networking, security, and workload teams ×3

Kubernetes-based infrastructure ×2

GPU cluster operations ×2

reliability ×2

automation ×2

Terraform ×2

ArgoCD ×2

Python

Linux

Kubernetes

containers

cloud infrastructure

infrastructure automation

GPU infrastructure

Kubernetes operators

fleet automation

BMaaS

VMaaS

managed Kubernetes

multi-cloud infrastructure

production engineering

Day 0 / Day 1 / Day 2 workflows

cluster bringup

handoff

production operations

platform

BEHAVIOURAL

Clear communication · ability to work across teams · creative · hard-working · self-motivated

Role Details

Seniority senior

Experience 8–10 yrs

Level Senior

Type FULL TIME

Education BS/MS in Computer Science or equivalent experience

Salary Band 150k-200k

AI-Extracted Insights

Domain Areas

ai-research-and-production-workloads · gpu-infrastructure · kubernetes-based-infrastructure · gpu-cluster-operations · reliability · automation · gitops · day-2-operability

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
