AI Inference Performance Engineer
Technology
AI Inference Performance Engineer - New College Grad 2026
Neural analysis suggests this role is optimal for entry-level candidates.
What You'll Achieve.
Define industry performance standards; Build tools to evaluate serving performance; Deliver measurable performance improvements
Industry & Context.
Root cause analysis; Troubleshooting
What They're Looking For.
Must Have
BS, MS, or PhD, 2+ years software development, Python or C++ programming, Software design skills, Software engineering skills, Expertise with PyTorch or JAX, Deliver performance improvements
Nice to Have
Prior LLM framework experience, Prior DL compiler experience, Prior performance modeling experience, Prior profiling experience, Prior debug experience, Prior code optimization experience, Experience with scale-out inference orchestration, Kernel development expertise, Compiler/runtime paths expertise, Architectural knowledge of CPU/GPU/FPGA, GPU programming experience, Track record leading technical programs
What You'll Do.
Drive industry benchmark results
Implement optimizations in quantization
Implement optimizations in scheduling
Implement optimizations in memory management
Implement optimizations in distributed inference
Integrate optimizations in quantization
Integrate optimizations in scheduling
Integrate optimizations in memory management
Integrate optimizations in distributed inference
Define cutting-edge workloads
Optimize cutting-edge workloads
Identify next-generation inference benchmarks
Shape next-generation inference benchmarks
Identify emerging AI use cases
Shape emerging AI use cases
Collaborate with framework teams
Collaborate with kernel teams
Push performance on LLM-MoE models
Push performance on vision-language models
Push performance on video diffusion models
Push performance on recommendation workloads
Push performance on speech workloads
Design distributed inference
Optimize distributed inference
Manage performance across GPU clusters
Apply roofline analysis
Apply systematic profiling
Decompose bottlenecks across CUDA kernels
Decompose bottlenecks across frameworks
Decompose bottlenecks across serving layers
Contribute to TensorRT-LLM
Contribute to open-source projects
Partner with architecture teams
Partner with kernel teams
Partner with compiler teams
Raise the technical bar for the team
Drive cross-functional execution
Lead a world-class team
How You'll Work.
Team & Collaboration
Cross-functional execution; Partner with architecture teams; Partner with kernel teams; Partner with compiler teams
Process & Methodology
Benchmark timelines
Full Job Description
We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining the industry’s performance standards across language models, video generation, and speech workloads. We work directly within TensorRT-LLM, SGLang, and vLLM, building the tools that evaluate serving performance at scale. This team sits at the intersection of GPU performance engineering and public accountability.
**What You Will Be Doing:**
* Drive industry benchmark results: own the end-to-end optimization pipeline, implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
* Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks, multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its extreme on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
* Architect distributed inference: Design and optimize execution from single-GPU to rack-scale clusters, managing performance across clusters of GPUs.
* Establish performance methodology: Apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
* Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
* Technical Leadership: Raise the technical bar for the team, drive cross-functional execution on tight benchmark timelines, and lead a world-class team.
**What We Need To See:**
* BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
* 2+ years of relevant software development experience.
* Strong Python or C++ programming, software design, and software engineering skills.
* Expertise with a DL fr
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.