NVIDIA
AI
Principal Product Manager
“Principal Product Manager at NVIDIA. Skills: Resilient automation at AI Factory, Break-fix automation, Product strategy, Roadmap development, Operator experience, Scalable, reliable product development. Manage break-fix automation. Develop product strategy”
What You'll Achieve.
Convert tokens to intelligence at scale to power AI demands of tomorrow; Maintain AI infrastructure at scale with smart automation; Orchestration engine for AI factory break-fix runs live in production at DGX Cloud; Build a scalable, reliable product from an engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs; Ensure on-call engineers have the context they need to act quickly and confidently; Follow through from detection to resolution for failure attribution and automated repair actions; Optimize repair workflows at scale
Industry & Context.
Break-fix automation; Workflow orchestration; Failure attribution; Automated repair actions
On-call engineers have the context they need to act quickly and confidently
What They're Looking For.
Must Have
15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background; BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience; Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation; Track record owning products with real-world operational consequences: you understand blast radius and build accordingly; Operator UX instincts: proven ability to translate complex system state into workflows that on-call engineers can act on under pressure; Ability to build alignment across engineering, SRE, and external vendor partner teams
Nice to Have
Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments; Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at large scale; Background in reliability engineering, SLO build, or chaos/fault-injection testing; Prior experience at a cloud service provider or hyperscaler infrastructure team; Experience building agentic AI workflow software
What You'll Do.
Manage break-fix automation
Develop product strategy
Improve operator experience
Guide roadmap for professionals
Build a scalable, reliable product from an engineering foundation
Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs
Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety
Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently
Drive the integration between failure attribution and automated repair actions, following through from detection to resolution
Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability
Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale
How You'll Work.
Team & Collaboration
Collaborate with NCP operators, SRE teams, and hardware vendor partners; Ability to build alignment across engineering, SRE, and external vendor partner teams
Process & Methodology
Roadmap development
Full Job Description
NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud. As the Product Manager leading all aspects of resilient automation at AI Factory, you will manage break-fix automation. You will develop the product strategy, improve operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to compose how AI factories self-heal!

**What You’ll Be Doing:**

* Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
* Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
* Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
* Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
* Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
* Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

**What We Need to See:**

* 15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
* BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
* Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.