Cantina

ML Research Engineer, TTS

Europe · Full Time · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“ML Research Engineer, TTS at Cantina. Skills: Speech Systems, TTS, voice cloning, controllable TTS, voice conversion, large-scale audio models, transformer architectures, diffusion models, audio language modelling, distributed model training, software engineering, PyTorch, production-quality code, ML data. Architect, implement, pre-train, fine-tune, and post-train/alignment (e.g., GRPO/DPO) for large-scale speech models. Independently lead small research projects while collaborating on larger ”

What You'll Achieve.

Ship fast, reliable, and cost-aware models, and meet production SLAs

Industry & Context.

Problems you'll solve

Triangulating quality using subjective and objective signals; mitigating misuse and abuse

What They're Looking For.

Must Have

Exceptional research/development experience with large-scale audio models (>3B-parameter models and >500k hours of data)

Exceptional understanding of and hands-on experience with transformer architectures and/or diffusion models (incl. distillation and streaming) and/or audio language modelling

Experience with multi-node and multi-GPU distributed model training

Software engineering skills with a proven track record of building complex systems and writing reliable, production-quality code

Shipped large-scale speech/audio models to production

Background in working with large-scale ML data

Ability to iterate on data and triangulate quality using subjective and objective signals

Notable publications and/or open-source contributions in speech/audio/ML

Experience with voice cloning, speech control, and voice generation

Nice to Have

Shipped large-scale speech/audio models (TTS/VC/ASR) to production

Work on large-scale ML systems

Experience with audio language modelling and transformer architectures

Experience with voice cloning, speech control, and voice generation

Background in processing large-scale ML data

Publications or notable open-source contributions in speech/audio/ML

What You'll Do.

Architect, implement, pre-train, fine-tune, and post-train/alignment (e.g., GRPO/DPO) for large-scale speech models

Independently lead small research projects while collaborating on larger team initiatives

Design, run, and analyze scientific experiments to advance our understanding of the models

Develop and improve dev tooling to enhance team productivity

Contribute to the entire stack, from low-level optimizations to high-level model design

Define data requirements and collaborate on acquisition, curation, augmentation, labeling quality, and synthetic data strategies

Design automated objective/subjective evaluations: listening tests, SV/WER/ASR-based metrics, robustness & bias checks, and red-team studies

Harden the training → evaluation → inference profile latency

and meet production SLAs with robust monitoring and rollback

Partner with infrastructure to run distributed training/inference on cloud fleets and productionize models with reliability and observability

Contribute to safety/consent guardrails and to misuse/abuse mitigation for responsible speech technology

How You'll Work.

Team & Collaboration

Partner closely with research, data, and infra to ship fast, reliable, and cost-aware models; collaborate on larger team initiatives; partner with infrastructure to run distributed training/inference on cloud fleets

Process & Methodology

Independently lead small research projects

Full Job Description

About Cantina

Cantina is a new social platform founded by Sean Parker with the most advanced AI character creator. Our bots are lifelike, social creatures that can interact wherever people are online, across voice, video, and text. Create yourself, imagine someone new, or choose from thousands of characters to share infinitely scalable, personalized content and seamless group chat. If you’re excited about how AI can shape creativity and social interaction, come help us build what’s next.

About the Role:

We’re looking for a Research / ML Engineer to join our Speech Team to build state-of-the-art speech systems end-to-end, from data specs through production inference. You’ll drive the model ↔ data ↔ eval flywheel for TTS and adjacent tasks (voice cloning, controllable TTS, voice conversion and more), partnering closely with research, data, and infra to ship fast, reliable, and cost-aware models. In this role, you will work at the intersection of cutting-edge research and practical engineering, contributing to the development of safe, steerable, and trustworthy AI systems.

What You’ll Do:

- Model Building: Architect, implement, pre-train, fine-tune, and post-train/alignment (e.g., GRPO/DPO) for large-scale speech models.
- Project Leadership: Independently lead small research projects while collaborating on larger team initiatives.
- Experimental Design: Design, run, and analyze scientific experiments to advance our understanding of the models.
- Tool Development: Develop and improve dev tooling to enhance team productivity.
- Full-Stack Contribution: Contribute to the entire stack, from low-level optimizations to high-level model design.
- Data Ownership: Define data requirements and collaborate on acquisition, curation, augmentation, labeling quality, and synthetic data strategies.
- Rigorous Evaluation: Design automated objective/subjective evaluations: listening tests, SV/WER/ASR-based metrics, robustness & bias checks, and red-team studies.
- Pipeline Delivery: Harde

SKILL SIGNAL 55 detected · ranked by frequency

PyTorch ×4

TTS ×3

voice cloning ×3

controllable TTS ×3

voice conversion ×3

transformer architectures ×3

diffusion models ×3

audio language modelling ×3

Model Building ×3

pre-train ×3

fine-tune ×3

post-train/alignment ×3

GRPO/DPO ×3

Experimental Design ×3

Tool Development ×3

Full-Stack Contribution ×3

low-level optimizations ×3

high-level model design ×3

Data Ownership ×3

data acquisition ×3

data curation ×3

data augmentation ×3

labeling quality ×3

synthetic data strategies ×3

Rigorous Evaluation ×3

automated objective/subjective evaluations ×3

listening tests ×3

SV/WER/ASR-based metrics ×3

robustness & bias checks ×3

red-team studies ×3

Pipeline Delivery ×3

training → evaluation → inference profile latency ×3

BEHAVIOURAL

Project Leadership

collaboration

Role Details

Experience 5–10 yrs

Level Senior

Type FULL TIME

Category engineering

AI-Extracted Insights

Domain Areas

speech-systems · tts · voice-cloning · controllable-tts · voice-conversion · audio-models · speech-audio-ml

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
