NVIDIA
Technology
Senior RAS and Power Management Firmware Architect
“Senior RAS and Power Management Firmware Architect at NVIDIA. Skills: RAS architecture, Power management, Firmware architecture. Define platform-level firmware architecture. Own error detection architecture”
Industry & Context.
Failure-mode analysis; Customer failure analysis
What They're Looking For.
Must Have
BSc, MS, or PhD; 7+ years of relevant experience; Deep understanding of RAS principles; Experience architecting firmware; Knowledge of power management concepts; Familiarity with boot firmware; Understanding of hardware/software interfaces; Programming and debugging fundamentals; Ability to lead cross-functional architecture discussions
Nice to Have
Experience with PCIe AER, Experience with CXL RAS, Experience with memory RAS, Experience with ECC/parity, Experience with accelerator RAS, Experience with networking RAS, Experience with high-availability systems, Experience with large-scale data center platforms, Knowledge of ACPI, Knowledge of SMBIOS, Knowledge of UEFI, Knowledge of PLDM, Knowledge of MCTP, Knowledge of Redfish, Knowledge of IPMI, Knowledge of cloud telemetry systems, Experience with power/thermal fault handling, Experience with dynamic power management, Experience with platform power sequencing, Experience with low-power states, Experience with autonomous recovery mechanisms, Background in silicon bring-up, Background in platform validation, Background in production diagnostics, Background in customer failure analysis, Prior technical leadership experience
What You'll Do.
Define platform-level firmware architecture
Own error detection architecture
Own error classification architecture
Own error containment architecture
Own error recovery architecture
Own error escalation architecture
Own error reporting architecture
Define firmware architecture for power sequencing
Define firmware architecture for power states
Define firmware architecture for reset flows
Define firmware architecture for thermal fault handling
Define firmware architecture for power fault handling
Define firmware architecture for idle management
Define firmware architecture for recovery from power-related failures
Create firmware specifications for hardware error handling
Create firmware specifications for health monitoring
Create firmware specifications for crash capture
Create firmware specifications for telemetry
Create firmware specifications for diagnostics
Create firmware specifications for debug data
Create firmware specifications for field serviceability
Define interfaces between firmware and hardware
Define contracts between firmware and hardware
Define interfaces between firmware and operating systems
Define contracts between firmware and operating systems
Define interfaces between firmware and BMCs
Define contracts between firmware and BMCs
Define interfaces between firmware and management controllers
Define contracts between firmware and management controllers
Define interfaces between firmware and platform software
Define contracts between firmware and platform software
Define interfaces between firmware and cloud infrastructure
Define contracts between firmware and cloud infrastructure
Define interfaces between firmware and service infrastructure
Define contracts between firmware and service infrastructure
Drive architecture reviews
Drive tradeoff discussions
Drive failure-mode analysis
Drive validation strategy
Drive long-term RAS roadmap planning
Drive long-term power management roadmap planning
Establish standards for error logs
Establish standards for event schemas
Establish standards for telemetry flows
Establish standards for recovery policies
Establish standards for service diagnostics
Establish standards for production debug infrastructure
Guide engineering teams through implementation
Guide engineering teams through validation
Guide engineering teams through silicon bring-up
Guide engineering teams through platform integration
Guide engineering teams through production deployment
Analyze customer failures
Analyze field failures
Identify architectural gaps
Feed lessons learned into future designs
How You'll Work.
Team & Collaboration
Cross-functional teams; Hardware, firmware, software teams; Validation teams; Customer engineering teams; External partners; Cross-functional architecture discussions
Communication Scope
Communication skills
Process & Methodology
Roadmap planning
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.