AI Inference Performance Engineer

Technology

AI Inference Performance Engineer - New College Grad 2026

$124–242k · Santa Clara, California, United States · Full Time · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for entry-level candidates.

The Brief

“AI Inference Performance Engineer - New College Grad 2026. Skills: AI inference performance, deep learning inference, LLM optimization, GPU programming. Drive industry benchmark results and implement optimizations in quantization, scheduling, memory management, and distributed inference.”

What You'll Achieve.

Define industry performance standards; Build tools to evaluate serving performance; Deliver measurable performance improvements

Industry & Context.

Technology

Problems you'll solve

Root cause analysis; Troubleshooting

What They're Looking For.

Must Have

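BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience; 2+ years of relevant software development experience; Strong Python or C++ programming, software design, and software engineering skills; Expertise with PyTorch or JAX; Deliver performance improvements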

Nice to Have

Prior experience with LLM frameworks, DL compilers, performance modeling, profiling, debugging, and code optimization; Experience with scale-out inference orchestration; Expertise in kernel development and compiler/runtime paths; Architectural knowledge of CPU/GPU/FPGA; GPU programming experience; Track record of leading technical programs

What You'll Do.

Drive industry benchmark results

Implement optimizations in quantization

Implement optimizations in scheduling

Implement optimizations in memory management

Implement optimizations in distributed inference

Integrate optimizations in quantization

Integrate optimizations in scheduling

Integrate optimizations in memory management

Integrate optimizations in distributed inference

Define cutting-edge workloads

Optimize cutting-edge workloads

Identify next-generation inference benchmarks

Shape next-generation inference benchmarks

Identify emerging AI use cases

Shape emerging AI use cases

Collaborate with framework teams

Collaborate with kernel teams

Push performance on LLM-MoE models

Push performance on vision-language models

Push performance on video diffusion models

Push performance on recommendation workloads

Push performance on speech workloads

Design distributed inference

Optimize distributed inference

Manage performance across GPU clusters

Apply roofline analysis

Apply systematic profiling

Decompose bottlenecks across CUDA kernels

Decompose bottlenecks across frameworks

Decompose bottlenecks across serving layers

Contribute to TensorRT-LLM

Contribute to open-source projects

Partner with architecture teams

Partner with kernel teams

Partner with compiler teams

Raise the technical bar for the team

Drive cross-functional execution

Lead a world-class team

How You'll Work.

Team & Collaboration

Cross-functional execution; Partner with architecture teams; Partner with kernel teams; Partner with compiler teams

Process & Methodology

Benchmark timelines

Full Job Description

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining the industry’s performance standards across language models, video generation, and speech workloads. We work directly within TensorRT-LLM, SGLang, and vLLM, building the tools that evaluate serving performance at scale. This team sits at the intersection of GPU performance engineering and public accountability.

**What You Will Be Doing:**

* Drive industry benchmark results: own the end-to-end optimization pipeline, implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
* Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks, multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its extreme on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
* Architect distributed inference: Design and optimize execution from single-GPU to rack-scale clusters, managing performance across clusters of GPUs.
* Establish performance methodology: Apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
* Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
* Technical Leadership: Raise the technical bar for the team, drive cross-functional execution on tight benchmark timelines, and lead a world-class team.

**What We Need To See:**

* BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
* 2+ years of relevant software development experience.
* Strong Python or C++ programming, software design, and software engineering skills.
* Expertise with a DL fr

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
