NVIDIA

AI/ML

Systems Quality and Reliability Lead - LPU

$168–311k · Santa Clara, California, United States · Full Time · Remote Friendly
The Brief

“Systems Quality and Reliability Lead - LPU at NVIDIA. Skills: systems quality and reliability engineering, failure analysis (FA), root-cause analysis, hardware quality performance, RMA management. Own, build, and manage the RMA and FA debug and root-cause analysis for existing and new NVIDIA AI/ML products. Conduct tests and root-cause analysis.”

What You'll Achieve.

Achieve key performance indicators, including FA cycle times, fault duplication rates, and fault isolation rates

Industry & Context.

AI/ML
Problems you'll solve

Root-cause analysis; Fault isolation

What They're Looking For.

Must Have

  • BS/MS in EE, Physics, or a related degree (or equivalent experience)
  • 8+ years of hands-on systems test and/or validation engineering experience
  • Proven hands-on management and leadership experience
  • Competence using lab equipment such as oscilloscopes, logic analyzers, and power analyzers
  • Experience enabling reliability tests such as HTOL and quality tests such as burn-in
  • Proficiency with high-speed interfaces (SerDes, PCIe, DDR)
  • Proficiency in Python, Perl, C++, or other languages on UNIX/Linux
  • Excellent knowledge of PCB card and system-level test and debug, and the ability to manage factory floor partners (CMs) for RMA/FA activities

Nice to Have

  • Working knowledge of FA techniques and tools such as FIB, SEM, TDR, VNA, and CSAM
  • Knowledge of fault isolation techniques such as OBIRCH, DLS/LADA, LVP, and LVI

What You'll Do.

Own, build, and manage the RMA and FA debug and root-cause analysis for existing and new NVIDIA AI/ML products

Conduct tests and root-cause analysis

Conduct and lead debug and root-cause analysis of field RMAs

Scale root cause FA capabilities within your organization

Create FA result reports that align with the standard 8D process or a similar one

Identify trends and raise quality alerts when necessary

and mitigation plans for such quality alerts

Oversee hardware quality performance, monitoring field quality data and associated metrics including RMA rates and Reliability Ratio

Manage operational performance of FA at CMs, ensuring partners achieve key performance indicators including FA cycle times, fault duplication rates, and fault isolation rates

Oversee the setup of new products into Failure Analysis operations

How You'll Work.

Team & Collaboration

Collaborate with systems engineers, hardware engineers, software engineers, and operations engineers as required; manage factory floor partners (CMs) for RMA/FA activities

Communication Scope

Create FA result reports

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
