Pika

Research

Multimodal LLM Researcher (MLLM)

$185–400k · Palo Alto, California, United States · FULL TIME · Remote Friendly

The Brief

“Multimodal LLM Researcher (MLLM) at Pika. Skills: multimodal LLM, LLM, VLM, Audio LM, real-time generation, agentic platforms, deep learning, generative models, diffusion models. Lead and contribute to research efforts focused on real-time multimodal generation (text, image, video, and audio synthesis) and on orchestration of agentic platform infrastructure. Design and prototype novel algorithms and architectures for high-fidelity, real-time multimodal synthesis and interactive experiences.”

What You'll Achieve.

Drive forward our mission to make agentic real-time generative technology accessible, dynamic, and transformative for millions of creators; shape the future of real-time creative platforms; bring research advancements into production-ready technologies.

Industry & Context.

Research

What They're Looking For.

Must Have

5+ years of relevant experience, including research during graduate studies, in large language models, vision-language models, audio language models, deep learning, or related fields

Deep expertise in at least one area: language modeling (LLM), vision-language modeling (VLM), or audio language modeling (Audio LM)

Experience with generative models, including autoregressive and diffusion models, and their real-time deployment

Hands-on experience curating, constructing, or augmenting large, high-quality multimodal datasets

Experience developing and deploying real-time systems and/or agentic orchestration infrastructure

Programming and prototyping skills (Python, PyTorch, TensorFlow, etc.)

Nice to Have

Demonstrated impact as first author on major publications in top conferences or journals (e.g., NeurIPS, CVPR, ICML, ICCV, SIGGRAPH, Interspeech)

What You'll Do.

Lead and contribute to research efforts focused on real-time, multimodal generation (including text, image, video, and audio synthesis) as well as orchestration of agentic platform infrastructure

Design and prototype novel algorithms and architectures for high-fidelity, real-time multimodal synthesis and interactive experiences

Focus on real-time aspects of model inference and synthesis across modalities

Work on diffusion model distillation and/or develop diffusion-based world models for multimodal applications

Train and finetune autoregressive and diffusion models in LLM, VLM, or Audio LM contexts with a focus on real-time performance

Curate specific datasets and sensory-rich data

Collaborate with cross-functional teams to bring research advancements into production-ready technologies

Publish work in top-tier conferences and communicate research results internally and externally

Stay at the cutting edge of real-time multimodal generative AI and agentic orchestration

How You'll Work.

Team & Collaboration

Collaborate closely with engineering, product, and other cross-functional teams; communicate research results internally and externally

Communication Scope

Excellent communication and collaboration skills; communicate research results internally and externally

Skill Signal 35 detected

Core

diffusion models ×6

generative models ×5

autoregressive models ×4

agentic orchestration infrastructure ×4

Required

LLM ×3

VLM ×3

Audio LM ×3

deep learning ×3

multimodal generation ×3

text synthesis ×3

image synthesis ×3

video synthesis ×3

audio synthesis ×3

large multimodal language models ×3

Nice to have

research efforts focused on real-time, multimodal generation

novel algorithms and architectures for high-fidelity, real-time multimodal synthesis and interactive experiences

real-time aspects of model inference and synthesis across modalities

diffusion model distillation

develop diffusion-based world models for multimodal applications

Train and finetune autoregressive and diffusion models

Curate specific datasets

bring research advancements into production-ready technologies

Behavioural

Passion for building creative tools and platforms that empower users

Excellent communication and collaboration skills

Collaborative, mission-driven team environment

Role Details

Work Mode

Hybrid

Type

FULL TIME

Experience

5–10 yrs

Salary Band

150k-200k

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
