NVIDIA

Senior Product Manager, AI Factory Infra

$208k–$380k · Santa Clara, California, United States · Full Time · Remote Friendly · USD 208,000–379,500

The Brief

“Senior Product Manager, AI Factory Infra at NVIDIA. Skills: AI Factory Infra, break-fix automation, product strategy, operator UX. Manage break-fix automation. Develop product strategy.”

Industry & Context.

Problems you'll solve

Break-fix automation; Failure attribution; Automated repair actions

What They're Looking For.

Must Have

12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background

BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience

Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation

Track record owning products with real-world operational consequences, operator UX instincts

Ability to build alignment across engineering, SRE, and external vendor partner teams

Nice to Have

Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments

Experience with RMA logistics, vendor SLA oversight, and hardware repair processes on a large scale

Background in reliability engineering, SLO definition, or chaos/fault-injection testing

Prior experience at a cloud service provider or hyperscaler infrastructure team

Experience building agentic AI workflow software

What You'll Do.

Manage break-fix automation

Develop product strategy

Improve operator experience

Guide roadmap for professionals

Take responsibility for strategic direction

Define automation confidence thresholds

Build operator UX for repair queues

Drive integration between failure attribution and automated repair actions

Own metrics framework

Collaborate with NCP operators

Collaborate with SRE teams

Collaborate with hardware vendor partners

How You'll Work.

Team & Collaboration

Build alignment across engineering, SRE, and external vendor partner teams; Collaborate with NCP operators, SRE teams, and hardware vendor partners

Full Job Description

NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud.

As the Product Manager leading all aspects of resilient automation at AI Factory, you will manage break-fix automation. You will develop the product strategy, improve operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to compose how AI factories self-heal!

**What You’ll Be Doing:**

* Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
* Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
* Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
* Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
* Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
* Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

**What We Need to See:**

* 12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
* BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
* Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
