NVIDIA

Artificial Intelligence

Senior Software Engineer, DGX Cloud Production Engineering

$184–357k · Santa Clara, California, United States · FULL TIME
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Software Engineer, DGX Cloud Production Engineering at NVIDIA. Skills: Kubernetes, GPU infrastructure, automation, production engineering. Build and operate automation for GPU clusters.”

What You'll Achieve.

Make GPU clusters reliable, scalable, and safe to run; make infrastructure production-ready.

Industry & Context.

Artificial Intelligence
Problems you'll solve

troubleshoot distributed systems

Eligibility Requirements

on-call, incident response

What They're Looking For.

Must Have

  • 8+ years of experience building or operating production infrastructure
  • Strong programming skills in Python, Go, or similar
  • Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation
  • Ability to troubleshoot distributed systems in production
  • Clear communication and ability to work across teams
  • BS/MS in Computer Science or equivalent experience

Nice to Have

  • Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation
  • Experience with SLOs, on-call, incident response, observability, and reliability practices
  • Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure

What You'll Do.

Build and operate automation for GPU clusters

Develop tools for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle

Improve Day 0, Day 1, and Day 2 workflows

Reduce manual production touches

Participate in on-call, incident response, debugging, and durable follow-up work

Partner with platform, storage, networking, security, and workload teams

How You'll Work.

Team & Collaboration

Work across teams; partner with platform, storage, networking, security, and workload teams

Communication Scope

Clear communication

Full Job Description

NVIDIA DGX Cloud is building and operating large-scale GPU infrastructure for AI research and production workloads. We are looking for Senior Software Engineers to help build the automation, tooling, and operational systems that make GPU clusters reliable, scalable, and safe to run. This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.

**What you’ll be doing:**

* Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
* Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
* Improve Day 0 / Day 1 / Day 2 workflows for cluster bringup, handoff, and production operations.
* Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
* Participate in on-call, incident response, debugging, and durable follow-up work.
* Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready.

**What we need to see:**

* 8+ years of experience building or operating production infrastructure.
* Strong programming skills in Python, Go, or similar.
* Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation.
* Ability to troubleshoot distributed systems in production.
* Clear communication and ability to work across teams.
* BS/MS in Computer Science or equivalent experience.

**Ways to stand out from the crowd:**

* Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation.
* Experience with SLOs, on-call, incident response, observability, and reliability practices.
* Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure.

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performa

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
