NVIDIA

Technology

AI Inference Performance Engineer - New College Grad 2026

$124–242k · Santa Clara, California, United States · FULL TIME · Remote Friendly

The Brief

“AI Inference Performance Engineer - New College Grad 2026 at NVIDIA. Skills: AI Inference, Performance Engineering, Deep Learning, GPU Computing. Drive industry benchmark results. Optimize the end-to-end pipeline.”

Industry & Context.

Technology

Problems you'll solve

Bottleneck decomposition; Troubleshooting

What They're Looking For.

Must Have

BS, MS, or PhD; 2+ years of software development experience; Python or C++ programming; Software design skills; Software engineering skills; Expertise with PyTorch or JAX; Delivering measurable performance improvements

Nice to Have

Prior LLM framework experience; Prior DL compiler experience; Prior performance modeling experience; Prior profiling experience; Prior debugging experience; Prior code optimization experience; Experience with scale-out inference orchestration; Expertise in kernel development; Expertise in compiler/runtime paths; Architectural knowledge of CPUs, GPUs, or FPGAs; Deep learning GPU programming experience; Track record of leading technical programs

What You'll Do.

Drive industry benchmark results

Optimize the end-to-end pipeline

Implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference

Define and optimize cutting-edge workloads

Identify and shape next-generation inference benchmarks and emerging AI use cases

Push performance on LLM-MoE models, vision-language models, video diffusion models, recommendation workloads, and speech workloads

Design and optimize distributed inference execution

Manage performance across GPU clusters

Apply roofline analysis and systematic profiling

Decompose bottlenecks across CUDA kernels, frameworks, and serving layers

Contribute to TensorRT-LLM and other open-source projects

Raise the technical bar for the team

Drive cross-functional execution

Lead a world-class team

How You'll Work.

Team & Collaboration

Cross-functional execution; Framework teams; Kernel teams; Architecture teams; Compiler teams

Process & Methodology

Tight benchmark timelines

Full Job Description

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining the industry's performance standards across language models, video generation, and speech workloads. We work directly within TensorRT-LLM, SGLang, and vLLM, building the tools that evaluate serving performance at scale. This team sits at the intersection of GPU performance engineering and public accountability.

**What You Will Be Doing:**

* Drive industry benchmark results: own the end-to-end optimization pipeline; implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
* Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks, such as multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its extreme on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
* Architect distributed inference: design and optimize execution from single-GPU to rack-scale clusters, managing performance across clusters of GPUs.
* Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
* Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
* Technical leadership: raise the technical bar for the team, drive cross-functional execution on tight benchmark timelines, and lead a world-class team.

**What We Need To See:**

* BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
* 2+ years of relevant software development experience.
* Strong Python or C++ programming, software design, and software engineering skills.
* Expertise with a DL fr

SKILL SIGNAL 38 detected · ranked by frequency

Performance Engineering ×3

Quantization ×3

Scheduling ×3

Memory management ×3

LLM architectures ×3

VLM architectures ×3

Inference mechanics ×3

KV caching ×3

Batching strategies ×3

Decode-phase bottlenecks ×3

Speculative decoding ×3

Disaggregated serving ×3

Kernel development ×3

Compiler paths ×3

AI Inference ×2

Deep Learning ×2

GPU Computing ×2

TensorRT-LLM ×2

SGLang ×2

vLLM ×2

Python

PyTorch

JAX

CUDA

Inference optimization

Distributed inference

Performance methodology

Roofline analysis

Systematic profiling

GPU programming

MPI

NCCL

BEHAVIOURAL

Leadership: Technical leadership

Role Details

Seniority mid

Experience 0–2 yrs

Level Entry

Work Mode Remote

Type FULL TIME

Education Bachelor's

Salary Band 100k-150k

AI-Extracted Insights

Domain Areas

genai-inference · llm-inference · video-generation · speech-workloads · large-scale-llm-moe · vision-language-models · video-diffusion-models · recommendation-workloads
