Nuance Labs

Technology

Member of Technical Staff — Model Optimization and Inference

$250–350k · Seattle, Washington, United States · FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Member of Technical Staff — Model Optimization and Inference at Nuance Labs. Skills: Model Optimization, Inference Serving, Real-time AI. Own end-to-end inference optimization. Implement and tune KV cache strategies”

What You'll Achieve.

Model responds in under 500ms

Industry & Context.

Technology

Problems you'll solve

Systematic elimination of bottlenecks

What They're Looking For.

Must Have

Deep expertise in LLM inference optimization
Proficiency with inference serving frameworks
Experience optimizing diffusion model inference
Python and PyTorch comfort
Systematic approach to profiling and optimization

Nice to Have

Reading and writing CUDA or Triton kernels
Hands-on experience with post-training quantization
Familiarity with speculative decoding
Familiarity with multimodal or streaming inference architectures
Experience deploying real-time AI systems
Prior work at an AI lab
Prior work at an inference startup
Prior work on high-traffic model serving platform
Contributions to open-source inference frameworks

What You'll Do.

Own end-to-end inference optimization

Implement and tune KV cache strategies

Extend inference serving frameworks

Profile and benchmark end-to-end latency

Identify and eliminate bottlenecks

Build internal tooling

Accelerate diffusion model inference

Apply and develop quantization techniques

Work closely with research and infrastructure

How You'll Work.

Team & Collaboration

Research and infrastructure teams

Full Job Description

About Nuance Labs

Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence: a full-duplex audiovisual system that can listen, speak, react, interrupt, and respond like a real person. We're a Series A company ($60M raised) backed by Lightspeed, Accel, South Park Commons, NVentures, and Define Ventures, with PhDs from MIT, UW, Oxford, CMU, and Johns Hopkins, and industry experience from Apple, Meta, Amazon AGI, and Discord. The team is small, the work is real, and the problems are unsolved.

How Nuance Differentiates

Most conversational AI avatars today are hacks — a face slapped on a speech-to-speech pipeline, stuck in the uncanny valley: emotionless, mechanical, one-turn-at-a-time. Current systems take 2–5 seconds to respond; natural conversation requires sub-500ms. That's a 10x improvement, and it demands rethinking the entire stack.

That rethinking starts with full-duplex: an AI that listens and speaks simultaneously, perceives emotion in real time, and responds with a face that actually reflects it. It's an extremely hard problem, and we're developing foundation models designed for it from the ground up.

About the Role

We can train a great model. The next problem is making it fast enough to actually use in a real-time conversation — and that gap is enormous. A model that responds in 3 seconds is a demo. A model that responds in under 500ms is a product.

We're looking for someone who specializes in taking trained models and squeezing every last millisecond out of them. You understand the full stack from model weights to serving infrastructure — quantization, KV cache optimization, kernel-level acceleration, batching strategies — and you know which lever to pull for which problem. You've worked with vLLM, SGLang, or similar frameworks and have opinions about where they fall short.

Our stack is more complex than a standard LLM deployment: we're serving a full-duplex multimodal system that must satisfy strict real-time latency constrain

SKILL SIGNAL 32 detected · ranked by frequency

Model Optimization ×3

KV cache optimization ×3

Memory layout ×3

Attention kernels ×3

Step distillation ×3

Caching strategies ×3

Custom kernel optimizations ×3

Inference Serving ×2

Real-time AI ×2

Python

PyTorch

CUDA

Triton

LLM

Machine Learning

Deep Learning

Diffusion Models

Quantization

INT8

INT4

GPTQ

AWQ

Inference optimization

Kernel-level acceleration

Batching strategies

Latency reduction

Throughput increase

Quality-performance tradeoffs

vLLM

SGLang

TensorRT-LLM

Role Details

Work Mode: Onsite

Type: FULL TIME

Category: research

Salary Band: $200k+

AI-Extracted Insights

Domain Areas

real-time-conversation · full-duplex-systems · multimodal-systems · foundation-models · conversational-ai · ai-avatars · emotional-intelligence · speech-pipeline

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.
