NVIDIA
Technology
Principal Software Engineer - Compute Infrastructure
“Principal Software Engineer - Compute Infrastructure at NVIDIA. Skills: Platform Architecture, AI Inference Infrastructure, Capacity & Scale, Paved Road development, Complex Migrations. Define platform architecture: architect and transform our global enterprise compute platform.”
What You'll Achieve.
driving efficiency; defining platform architecture; optimizing the performance of our infrastructure; operationalizing our internal frontier-class AI inference systems; scaling to frontier-class models; navigating extreme hardware supply constraints; driving cultural adoption of standard platforms; building the "Paved Road"; building "Day 2" operational maturity
Industry & Context.
mitigating hardware-level failures; mitigating silent data corruption; mitigating anomalies in large-scale environments; automated remediation pipelines; advanced auto-remediation
What They're Looking For.
Must Have
- Bachelor's degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
- 15+ years of proven experience in compute platform engineering, site reliability, or systems architecture with a heavy focus on automation at massive scale.
- Deep expertise in Kubernetes architecture and designing/deploying virtualization architectures, specifically operating VMs inside K8s (KubeVirt, OpenShift).
- In-depth knowledge of hardware technologies (GPUs, high-speed backplane networking) with a track record of mitigating hardware-level failures, silent data corruption, and anomalies in large-scale environments.
- Experience running large global environments spanning bare metal, virtualized infrastructure, and cloud with a unified GitOps posture (ArgoCD or similar).
- Proficiency in programming languages such as Go and/or Python, alongside expert-level infrastructure-as-code development (Terraform, Config Management).
- Leadership skills with the ability to influence technical direction across highly autonomous teams without relying on top-down mandates.
Nice to Have
- Hands-on experience managing bleeding-edge, pre-release hardware in production environments.
- Deep understanding of advanced storage migrations and protocols (NFSv4, NVMe/TCP, Hyperconverged storage).
- Solid understanding of microservices architecture and seamless multi-cloud deployment strategies (AWS, GCP).
- Proven track record of building "Day 2" operational maturity (self-service, advanced auto-remediation, strict SLAs) from the ground up on existing foundations.
What You'll Do.
Define Platform Architecture
Architect and transform our global enterprise compute platform, defining service tiers and automated cluster lifecycles.
Operationalize Frontier AI Infrastructure
Build the operational foundation for our internal AI inference platform scaling to frontier-class models.
Develop automated remediation pipelines and telemetry for pre-release, rack-scale GPU systems.
Drive Strategic Capacity & Scale
Collect and review system data for capacity planning.
Develop proactive strategies, including public cloud bursting and evaluating alternative compute architectures.
Build the "Paved Road"
Design compelling self-service architectures and Terraform/OpenTofu providers.
Lead Complex Migrations
Evaluate existing application architectures.
Drive the fraught but critical migration of massive legacy workloads, including large-scale, long-running VDI environments, into modern Kubernetes orchestration.
How You'll Work.
Team & Collaboration
Collaborate with highly autonomous NVIDIA engineering teams to drive cultural adoption of standard platforms
Full Job Description
NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. It is a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, generative AI, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.

We are seeking a highly skilled Principal Software Engineer to join our dynamic team. Our company is at the forefront of technological innovation, and we are dedicated to driving efficiency, defining platform architecture, and optimizing the performance of our infrastructure both on-prem and in the cloud. You will lead the architectural vision for a massive global platform and spearhead the operationalization of our internal frontier-class AI inference systems. Join us in this exciting endeavor!

**What You Will Be Doing:**

* Define Platform Architecture: Lead initiatives to architect and transform our global enterprise compute platform (running thousands of nodes and tens of thousands of VMs and containers via OpenShift and KubeVirt) by defining service tiers, SLAs, and automated cluster lifecycles.
* Operationalize Frontier AI Infrastructure: Build the operational foundation for our internal AI inference platform scaling to frontier-class models. You will develop automated remediation pipelines, hardware watchdogs, and telemetry for pre-release, rack-scale GPU systems (including Blackwell and upcoming architectures).
* Drive Strategic Capacity & Scale: Collect and review system data for capacity planning to navigate extreme hardware supply constraints. Develop proactive strategies, including public cloud bursting, hardware dogfooding, and evaluating alt
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.