NVIDIA

Principal Product Manager

$240–380k · Santa Clara, California, United States · FULL TIME · Remote Friendly

The Brief

“Principal Product Manager at NVIDIA. Skills: Resilient automation at AI Factory, Break-fix automation, Product strategy, Roadmap development, Operator experience, Scalable, reliable product development. Manage break-fix automation. Develop product strategy”

What You'll Achieve.

Convert tokens to intelligence at scale to power AI demands of tomorrow; Maintain AI infrastructure at scale with smart automation; Orchestration engine for AI factory break-fix runs live in production at DGX Cloud; Build a scalable, reliable product from an engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs; Ensure on-call engineers have the context they need to act quickly and confidently; Follow through from detection to resolution for failure attribution and automated repair actions; Optimize repair workflows at scale

Industry & Context.

Problems you'll solve

Break-fix automation; Workflow orchestration; Failure attribution; Automated repair actions

What They're Looking For.

Must Have

15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background; BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience; Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation; Track record owning products with real-world operational consequences: you understand blast radius and build accordingly; Operator UX instincts: proven ability to translate complex system state into workflows that on-call engineers can act on under pressure; Ability to build alignment across engineering, SRE, and external vendor partner teams

Nice to Have

Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments; Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at large scale; Background in reliability engineering, SLO build, or chaos/fault-injection testing; Prior experience at a cloud service provider or hyperscaler infrastructure team; Experience building agentic AI workflow software

What You'll Do.

Manage break-fix automation

Develop product strategy

Improve operator experience

Guide roadmap for professionals

Build a scalable, reliable product from an engineering foundation

Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs

Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety

Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently

Drive the integration between failure attribution and automated repair actions, following through from detection to resolution

Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability

Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale

How You'll Work.

Team & Collaboration

Collaborate with NCP operators, SRE teams, and hardware vendor partners; Build alignment across engineering, SRE, and external vendor partner teams

Process & Methodology

Roadmap development

Full Job Description

NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud.

As the Product Manager leading all aspects of resilient automation at AI Factory, you will manage break-fix automation. You will develop the product strategy, improve operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to compose how AI factories self-heal!

**What You’ll Be Doing:**

* Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
* Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
* Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
* Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
* Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
* Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

**What We Need to See:**

* 15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
* BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
* Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.

SKILL SIGNAL 27 detected · ranked by frequency

Break-fix automation ×5

Product strategy ×3

Roadmap development ×3

Automation confidence thresholds ×3

Blocking issue criteria ×3

Human-in-the-loop intervention points ×3

Repair queues ×3

Workflow transparency ×3

Audit trails ×3

Failure attribution ×3

Automated repair actions ×3

Repair SLOs ×3

Metrics framework ×3

Time-to-drain ×3

Time-to-healthy ×3

Fleet availability ×3

Chaos/fault-injection testing ×3

Resilient automation at AI Factory ×2

Operator experience ×2

Scalable, reliable product development ×2

Distributed systems

Workflow orchestration

Agentic AI workflow software

Operator experience improvement

SLO build

RMA processes

Vendor SLA oversight

BEHAVIOURAL

Ability to build alignment across engineering, SRE, and external vendor partner teams

Role Details

Seniority manager

Experience 15+ yrs

Level Principal

Work Mode Hybrid

Type FULL TIME

Education BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience

Salary Band 200k+

AI-Extracted Insights

Domain Areas

ai-factories, ai-infrastructure, distributed-systems, workflow-orchestration, mlops, gpu-infrastructure, datacenter-operations, ai-factory-environments

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
