NVIDIA

Technology

Senior RAS and Power Management Firmware Architect

$550–850k (AI estimate) · Yokneam, Israel · Full time

The Brief

“Senior RAS and Power Management Firmware Architect at NVIDIA. Skills: RAS architecture, Power management, Firmware architecture. Define platform-level firmware architecture. Own error detection architecture”

Industry & Context.

Technology

Problems you'll solve

Failure-mode analysis; Customer failure analysis

What They're Looking For.

Must Have

  • BSc, MS, or PhD
  • 7+ years of relevant experience
  • Deep understanding of RAS principles
  • Experience architecting firmware
  • Knowledge of power management concepts
  • Familiarity with boot firmware
  • Understanding of hardware/software interfaces
  • Programming and debugging fundamentals
  • Ability to lead cross-functional architecture discussions

Nice to Have

  • Experience with PCIe AER
  • Experience with CXL RAS
  • Experience with memory RAS
  • Experience with ECC/parity
  • Experience with accelerator RAS
  • Experience with networking RAS
  • Experience with high-availability systems
  • Experience with large-scale data center platforms
  • Knowledge of ACPI
  • Knowledge of SMBIOS
  • Knowledge of UEFI
  • Knowledge of PLDM
  • Knowledge of MCTP
  • Knowledge of Redfish
  • Knowledge of IPMI
  • Knowledge of cloud telemetry systems
  • Experience with power/thermal fault handling
  • Experience with dynamic power management
  • Experience with platform power sequencing
  • Experience with low-power states
  • Experience with autonomous recovery mechanisms
  • Background in silicon bring-up
  • Background in platform validation
  • Background in production diagnostics
  • Background in customer failure analysis
  • Prior technical leadership experience

What You'll Do.

Define platform-level firmware architecture

Own error detection architecture

Own error classification architecture

Own error containment architecture

Own error recovery architecture

Own error escalation architecture

Own error reporting architecture

Define firmware architecture for power sequencing

Define firmware architecture for power states

Define firmware architecture for reset flows

Define firmware architecture for thermal fault handling

Define firmware architecture for power fault handling

Define firmware architecture for idle management

Define firmware architecture for recovery from power-related failures

Create firmware specifications for hardware error handling

Create firmware specifications for health monitoring

Create firmware specifications for crash capture

Create firmware specifications for telemetry

Create firmware specifications for diagnostics

Create firmware specifications for debug data

Create firmware specifications for field serviceability

Define interfaces between firmware and hardware

Define contracts between firmware and hardware

Define interfaces between firmware and operating systems

Define contracts between firmware and operating systems

Define interfaces between firmware and BMCs

Define contracts between firmware and BMCs

Define interfaces between firmware and management controllers

Define contracts between firmware and management controllers

Define interfaces between firmware and platform software

Define contracts between firmware and platform software

Define interfaces between firmware and cloud infrastructure

Define contracts between firmware and cloud infrastructure

Define interfaces between firmware and service infrastructure

Define contracts between firmware and service infrastructure

Drive architecture reviews

Drive tradeoff discussions

Drive failure-mode analysis

Drive validation strategy

Drive long-term RAS roadmap planning

Drive long-term power management roadmap planning

Establish standards for error logs

Establish standards for event schemas

Establish standards for telemetry flows

Establish standards for recovery policies

Establish standards for service diagnostics

Establish standards for production debug infrastructure

Guide engineering teams through implementation

Guide engineering teams through validation

Guide engineering teams through silicon bring-up

Guide engineering teams through platform integration

Guide engineering teams through production deployment

Analyze customer failures

Analyze field failures

Identify architectural gaps

Feed lessons learned into future designs

How You'll Work.

Team & Collaboration

Cross-functional teams; Hardware, firmware, software teams; Validation teams; Customer engineering teams; External partners; Cross-functional architecture discussions

Communication Scope

Communication skills

Process & Methodology

Roadmap planning

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
