NVIDIA

Technology

Principal Software Engineer - Compute Infrastructure

$248–391k · Santa Clara, California, United States · Full Time · Remote Friendly

The Brief

“Principal Software Engineer - Compute Infrastructure at NVIDIA. Skills: Platform Architecture, AI Inference Infrastructure, Capacity & Scale, Paved Road development, Complex Migrations. Define platform architecture: architect and transform our global enterprise compute platform.”

What You'll Achieve.

  • Drive efficiency and define platform architecture
  • Optimize the performance of our infrastructure
  • Operationalize our internal frontier-class AI inference systems, scaling to frontier-class models
  • Navigate extreme hardware supply constraints
  • Drive cultural adoption of standard platforms
  • Build the "Paved Road"
  • Build "Day 2" operational maturity

Industry & Context.

Technology
Problems you'll solve

  • Mitigating hardware-level failures, silent data corruption, and anomalies in large-scale environments
  • Building automated remediation pipelines and advanced auto-remediation

What They're Looking For.

Must Have

  • Bachelor's degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience
  • 15+ years of proven experience in compute platform engineering, site reliability, or systems architecture with a heavy focus on automation at massive scale
  • Deep expertise in Kubernetes architecture and designing/deploying virtualization architectures, specifically operating VMs inside K8s (KubeVirt, OpenShift)
  • In-depth knowledge of hardware technologies (GPUs, high-speed backplane networking) with a track record of mitigating hardware-level failures, silent data corruption, and anomalies in large-scale environments
  • Experience running large global environments spanning bare metal, virtualized infrastructure, and cloud with a unified GitOps posture (ArgoCD or similar)
  • Proficiency in programming languages such as Go and/or Python, alongside expert-level infrastructure-as-code development (Terraform, Config Management)
  • Leadership skills with the ability to influence technical direction across highly autonomous teams without relying on top-down mandates

Nice to Have

  • Hands-on experience managing bleeding-edge, pre-release hardware in production environments
  • Deep understanding of advanced storage migrations and protocols (NFSv4, NVMe/TCP, Hyperconverged storage)
  • Solid understanding of microservices architecture and seamless multi-cloud deployment strategies (AWS, GCP)
  • Proven track record of building "Day 2" operational maturity (self-service, advanced auto-remediation, strict SLAs) from the ground up on existing foundations

What You'll Do.

Define Platform Architecture

Architect and transform our global enterprise compute platform by defining service tiers and automated cluster lifecycles.

Operationalize Frontier AI Infrastructure

Build the operational foundation for our internal AI inference platform scaling to frontier-class models. Develop automated remediation pipelines and telemetry for pre-release, rack-scale GPU systems.

Drive Strategic Capacity & Scale

Collect and review system data for capacity planning. Develop proactive strategies, including public cloud bursting and evaluating alternative compute architectures.

Build the "Paved Road"

Design compelling self-service architectures and Terraform/OpenTofu providers.

Lead Complex Migrations

Evaluate existing application architectures and drive the fraught but critical migration of massive legacy workloads, including large-scale, long-running VDI environments, into modern Kubernetes orchestration.

How You'll Work.

Team & Collaboration

Collaborate with highly autonomous NVIDIA engineering teams to drive cultural adoption of standard platforms

Full Job Description

NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. It is a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, generative AI, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.

We are seeking a highly skilled Principal Software Engineer to join our dynamic team. Our company is at the forefront of technological innovation, and we are dedicated to driving efficiency, defining platform architecture, and optimizing the performance of our infrastructure both on-prem and in the cloud. You will lead the architectural vision for a massive global platform and spearhead the operationalization of our internal frontier-class AI inference systems. Join us in this exciting endeavor!

What You Will Be Doing:

  • Define Platform Architecture: Lead initiatives to architect and transform our global enterprise compute platform, which runs thousands of nodes and tens of thousands of VMs and containers via OpenShift and KubeVirt, by defining service tiers, SLAs, and automated cluster lifecycles.
  • Operationalize Frontier AI Infrastructure: Build the operational foundation for our internal AI inference platform scaling to frontier-class models. You will develop automated remediation pipelines, hardware watchdogs, and telemetry for pre-release, rack-scale GPU systems (including Blackwell and upcoming architectures).
  • Drive Strategic Capacity & Scale: Collect and review system data for capacity planning to navigate extreme hardware supply constraints. Develop proactive strategies, including public cloud bursting, hardware dogfooding, and evaluating alt

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
