Tavus

Engineering, Product, & Design

AI Researcher (Multimodal Audio/Video Generation)

San Francisco, California, United States; London, United Kingdom; United States · FULL TIME · Remote Friendly

The Brief

“AI Researcher (Multimodal Audio/Video Generation) at Tavus. Skills: Audio-visual generation, Diffusion models, Long-video generation, Audio-visual modeling. Lead research efforts on audio-visual generation for avatars (Neural Avatars, Talking-Heads), with a focus on conversational settings. Design models that are coupled with conversation flow — capturing and generating verbal + non-verbal signals in sync”

What You'll Achieve.

Publish impactful work

Industry & Context.

Engineering, Product, & Design

What They're Looking For.

Must Have

PhD or equivalent research experience

2–3+ years of hands-on experience applying generative models at scale

Expertise in diffusion models

Experience in multimodal generation spanning video, audio, and language

Proven innovation in long-video generation and/or audio generation

Excellent programming skills: fluent in PyTorch and GPU-optimized workflows

Track record of publications in top-tier venues (CVPR, NeurIPS, BMVC, ICASSP, etc.)

Experience leading research activities or mentoring teams

Nice to Have

Skills in 3D graphics, Gaussian splatting, or large-scale training setups

Broad exposure to generative AI models beyond your specialty

Familiarity with software development best practices

What You'll Do.

Lead research efforts on audio-visual generation for avatars (Neural Avatars, Talking-Heads), with a focus on conversational settings

Design models that are coupled with conversation flow, capturing and generating verbal + non-verbal signals in sync

Drive innovation in diffusion models, long-video generation, and audio-visual modeling

Translate research into production by partnering with Applied ML and engineering

Set research directions and publish impactful work

How You'll Work.

Team & Collaboration

Partnering with Applied ML and engineering


Skill Signal 19 detected

Required

Audio-visual generation ×3

Diffusion models ×3

Long-video generation ×3

Audio-visual modeling ×3

Audio generation ×3

Video generation ×3

Multimodal generation ×3

GPU-optimized workflows ×3

PyTorch ×2

Nice to have

3D graphics

Gaussian splatting

Generative models

Neural Avatars

Talking-Heads

Conversational settings

Conversation flow

Applied ML

Large-scale training setups

Generative AI models

Behavioural

Thrives in ambiguity

Mentoring

Role Details

Work Mode

Hybrid

Type

FULL TIME

Experience

2–3+ yrs

Education

PhD

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
