NVIDIA

Principal High-Performance LLM Training Engineer

$272–431k Santa Clara, California, United States FULL TIME Remote Friendly

The Brief

“Principal High-Performance LLM Training Engineer at NVIDIA. Skills: High-Performance LLM Training, AI Training Performance Optimization, Distributed Systems, GPU Architecture, Deep Learning Frameworks. Lead end-to-end performance analysis and optimization of innovative LLM pre-training and post-training workloads on the latest NVIDIA hardware and software platforms. Drive workloads closer to speed-of-light performance by identifying and removing bottlenecks across compute, memory, communication,”

What You'll Achieve.

drive improvements across frameworks such as PyTorch, JAX, NeMo, and NeMo RL; help shape future NVIDIA GPU, system, and software roadmaps; directly improve performance; set technical direction; raise the bar for the organization; influence multi-functional decisions across NVIDIA; improve training performance, efficiency, and developer velocity; guide future GPU, networking, system, and software architecture decisions; advocate for changes that improve performance and efficiency across the AI ecosystem; establish best practices for large-scale AI performance analysis and optimization

Industry & Context.

Problems you'll solve

analyze and optimize frontier-scale LLM workloads; identify and remove bottlenecks; diagnose complex bottlenecks and drive measurable improvements

What They're Looking For.

Must Have

MS, or PhD (or equivalent experience) in Computer Science, Electrical Engineering, Computer Engineering, or a related field; 12+ years of relevant work or research experience; Demonstrated principal-level technical impact in one or more of the following areas: large-scale AI training systems, GPU performance optimization, distributed systems, high-performance computing, ML frameworks, compilers/runtimes, or hardware/software co-design; Deep hands-on experience analyzing and optimizing performance of large-scale deep learning workloads, especially transformer-based models, LLM pre-training, reinforcement learning, fine-tuning, or other post-training workloads; Understanding of GPU and AI accelerator architecture from individual accelerators to datacenter-scale systems; Experience with distributed training techniques such as data parallelism, tensor parallelism, pipeline parallelism, expert parallelism, sequence parallelism, activation checkpointing, mixed precision training, and communication/computation overlap; A track record of using profiling, tracing, benchmarking, and performance modeling tools to diagnose complex bottlenecks and drive measurable improvements; Excellent communication and technical leadership skills, with the ability to influence architecture and software decisions across multiple teams without relying on direct authority

What You'll Do.

Lead end-to-end performance analysis and optimization of innovative LLM pre-training and post-training workloads on the latest NVIDIA hardware and software platforms

Drive workloads closer to speed-of-light performance by identifying and removing bottlenecks across compute, memory, communication, scheduling, parallelism strategy, kernel efficiency, framework overhead, and system-level scaling

Develop production-quality software, tools, models, benchmarks, and analysis infrastructure that improve training performance, efficiency, and developer velocity across NVIDIA’s AI software stack

Build and refine performance models, workload characterizations, and simulation methodologies to guide future GPU, networking, system, and software architecture decisions

Serve as a technical authority for AI training performance, partnering closely with teams across GPU architecture, systems, CUDA libraries, compilers, networking, frameworks, product management, and applied AI

Translate workload insights into concrete hardware and software recommendations, and advocate for changes that improve performance and efficiency across the AI ecosystem

Mentor and provide technical leadership to engineers across the organization, helping establish best practices for large-scale AI performance analysis and optimization

How You'll Work.

Team & Collaboration

partnering closely with teams across GPU architecture, systems, CUDA libraries, compilers, networking, frameworks, product management, and applied AI; influence architecture and software decisions across multiple teams without relying on direct authority

Communication Scope

Excellent communication and technical leadership skills; ability to influence architecture and software decisions across multiple teams without relying on direct authority

Full Job Description

NVIDIA is seeking a Principal Engineer to drive the performance of large-scale AI training and post-training workloads across NVIDIA’s full hardware and software stack. This role sits at the intersection of distributed training, GPU architecture, systems software, deep learning frameworks, and performance engineering. You will analyze and optimize frontier-scale LLM workloads running on thousands of GPUs, drive improvements across frameworks such as PyTorch, JAX, NeMo, and NeMo RL, and use insights from real workloads to help shape future NVIDIA GPU, system, and software roadmaps.

We are looking for a deeply technical leader who can operate across abstraction layers: from application-level training behavior to framework/runtime internals, CUDA libraries, communication collectives, memory systems, networking, and GPU architecture. At this level, success means both directly improving performance and setting technical direction, raising the bar for the organization, and influencing multi-functional decisions across NVIDIA.

**What you will be doing:**

* Lead end-to-end performance analysis and optimization of innovative LLM pre-training and post-training workloads on the latest NVIDIA hardware and software platforms.
* Drive workloads closer to speed-of-light performance by identifying and removing bottlenecks across compute, memory, communication, scheduling, parallelism strategy, kernel efficiency, framework overhead, and system-level scaling.
* Develop production-quality software, tools, models, benchmarks, and analysis infrastructure that improve training performance, efficiency, and developer velocity across NVIDIA’s AI software stack.
* Build and refine performance models, workload characterizations, and simulation methodologies to guide future GPU, networking, system, and software architecture decisions.
* Serve as a technical authority for AI training performance, partnering closely with teams across GPU architecture, systems, CUDA libraries,

SKILL SIGNAL 56 detected · ranked by frequency

performance modeling ×7

workload characterization ×4

hardware/software co-design ×4

tracing ×4

benchmarking ×4

Distributed Systems ×3

GPU Architecture ×3

Deep Learning Frameworks ×3

performance analysis ×3

optimization ×3

software development ×3

tool development ×3

model building ×3

benchmark creation ×3

analysis infrastructure development ×3

simulation ×3

distributed training techniques ×3

High-Performance LLM Training ×2

AI Training Performance Optimization ×2

PyTorch

JAX

NeMo

NeMo RL

CUDA

AI training

systems software

distributed training

large-scale AI training systems

GPU performance optimization

high-performance computing

ML frameworks

compilers/runtimes

BEHAVIOURAL

communication, technical leadership, influence, collaboration, innovation, autonomy

Role Details

Seniority senior

Experience 12+ yrs

Level Principal

Work Mode No

Type FULL TIME

Education MS, or PhD (or equivalent experience)

Salary Band 200k+

AI-Extracted Insights

Domain Areas

large-scale-ai-training-systems, gpu-performance-optimization, distributed-systems, high-performance-computing, ml-frameworks, compilers-runtimes, hardware-software-co-design, transformer-based-models

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
