NVIDIA

AI

Principal Product Manager

$240–380k · Santa Clara, California, United States · Full Time · Remote Friendly

The Brief

“Principal Product Manager at NVIDIA. Skills: product management, resilient automation, break-fix automation, AI factory. Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs. Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety”

What You'll Achieve.

Improve operator experience; build a scalable, reliable product that NVIDIA Cloud Partners depend on to uphold their SLAs; balance speed with operational safety; ensure on-call engineers have the context they need to act quickly and confidently; follow through from detection to resolution; optimize repair workflows at scale; own the metrics for overall fleet availability

Industry & Context.

AI
Problems you'll solve

break-fix automation; failure attribution; automated repair actions

What They're Looking For.

Must Have

  • 15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background
  • BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience
  • Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation
  • Track record owning products with real-world operational consequences; operator UX instincts
  • Ability to build alignment across engineering, SRE, and external vendor partner teams

Nice to Have

  • Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments
  • Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at large scale
  • Background in reliability engineering, SLO build, or chaos/fault-injection testing
  • Prior experience at a cloud service provider or hyperscaler infrastructure team
  • Experience building agentic AI workflow software

What You'll Do.

  • Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
  • Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
  • Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
  • Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
  • Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.

How You'll Work.

Team & Collaboration

Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale; Ability to build alignment across engineering, SRE, and external vendor partner teams

Process & Methodology

Strategic direction and roadmap development

Full Job Description

NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud.

As the Product Manager leading all aspects of resilient automation at AI Factory, you will manage break-fix automation. You will develop the product strategy, improve operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to compose how AI factories self-heal!

What You’ll Be Doing:

  • Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
  • Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
  • Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
  • Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
  • Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
  • Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

What We Need to See:

  • 15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
  • BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
  • Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
