NVIDIA
Senior Product Manager, AI Factory Infra
“Senior Product Manager, AI Factory Infra at NVIDIA. Skills: AI Factory Infra, Break-fix automation, Product strategy, Operator UX. Manage break-fix automation. Develop product strategy”
Industry & Context.
Break-fix automation; Failure attribution; Automated repair actions
What They're Looking For.
Must Have
12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background
BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience
Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation
Track record of owning products with real-world operational consequences, and operator UX instincts
Ability to build alignment across engineering, SRE, and external vendor partner teams
Nice to Have
Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments
Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at scale
Background in reliability engineering, SLO design, or chaos/fault-injection testing
Prior experience at a cloud service provider or on a hyperscaler infrastructure team
Experience building agentic AI workflow software
What You'll Do.
Manage break-fix automation
Develop product strategy
Improve operator experience
Guide roadmap for professionals
Take responsibility for strategic direction
Define automation confidence thresholds
Build operator UX for repair queues
Drive integration between failure attribution and automated repair actions
Own metrics framework
Collaborate with NCP operators
Collaborate with SRE teams
Collaborate with hardware vendor partners
How You'll Work.
Team & Collaboration
Build alignment across engineering, SRE, and external vendor partner teams; Collaborate with NCP operators, SRE teams, and hardware vendor partners
Full Job Description
NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud. As the Product Manager leading all aspects of resilient automation at AI Factory, you will manage break-fix automation. You will develop the product strategy, improve operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product from a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to compose how AI factories self-heal!

**What You’ll Be Doing:**

* Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
* Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
* Build the operator UX for repair queues, workflow transparency, and audit trails, ensuring on-call engineers have the context they need to act quickly and confidently.
* Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
* Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
* Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

**What We Need to See:**

* 12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
* BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
* Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.