Protege

Research Scientist, Benchmarks & Evaluations

United States; Canada · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Research Scientist, Benchmarks & Evaluations at Protege. Skills: Design tasks and benchmarks, Validate evaluations rigorously, Develop the science of evals.”

What You'll Achieve.

Shape the future of data and AI; Push the frontier forward; Publish on the questions that matter; Shape the eval datasets Protege delivers; Establish Protege as the standard-setter; Contribute to broader AI community understanding

Industry & Context.

Problems you'll solve

Solve hard problems

What They're Looking For.

Must Have

Advanced degree in a quantitative field; Hands-on experience evaluating LLMs, agents, or other ML systems; Experience with annotator quality and inter-rater reliability; Excellent scientific writing and communication; Bias toward velocity

Nice to Have

PhD; Experience with RL evaluation techniques; Ability to navigate new customer architectures, data systems, and requirements quickly; Experience with latent-variable models of annotator skill; Track record of published benchmarks or evaluation papers

What You'll Do.

Design tasks and benchmarks

Validate evaluations rigorously

Develop the science of evals

Run evaluations on current frontier models

Translate findings into product

Partner with outsourced annotation vendors

How You'll Work.

Team & Collaboration

Work closely with data and engineering teams; Collaboration

Communication Scope

Scientific writing; Communication; Synthesize technical findings into narratives

Full Job Description

Company Overview: We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time-intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data. Solving AI’s data problem is a generational opportunity. We’re backed by world-class investors and already powering partnerships with some of the most ambitious teams in AI. The company that succeeds will be one of the largest in AI, and in tech. We’re a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.

DataLab is Protege’s research arm, a team of research scientists committed to tackling the fundamental challenges and open questions regarding data for AI. We bridge the gap between research theory and data deployment to push the frontier forward, publishing on the questions that matter: what agentic AI should actually be trained to do, how to quality-control large-scale corpora, and how to build evaluation datasets that reflect the real world rather than the leaderboard. We’re a lean, fast-moving, high-trust team of builders who deeply care about scientific rigor and impact.

The Role

Benchmarks decide what AI gets built. Today, most evals don’t measure what we actually care about: they’re contaminated, gameable, synthetic, or measure capabilities that don’t transfer to the real tasks frontier models are deployed against. We’re hiring a Research Scientist to lead the design of benchmarks and evaluations that frontier labs, enterprises, and policymakers can actually trust. You’ll own the science of evaluation across DataLab: designing tasks that meaningfully separate models, validating t

SKILL SIGNAL 28 detected · ranked by frequency

Item response theory ×3

Contamination analysis ×3

Predictive validity studies ×3

Statistical frameworks ×3

Inter-rater reliability ×3

Annotator bias ×3

Annotator calibration ×3

Design tasks and benchmarks ×2

Validate evaluations rigorously ×2

Develop the science of evals ×2

LLMs

ML systems

RL evaluation techniques

Reward modeling

Off-policy evaluation

RLHF

RLAIF

Agentic RL

Dawid-Skene

MACE

IRT-style approaches

Scientific rigor

Statistical machinery

Trustworthiness scores

Prompting

Scaffolding

Tooling for evals

BEHAVIOURAL

Integrity; Resourceful; Resilient; Urgency; Direct communication; Respectful communication; Honest feedback; Genuine care; Accountability; Shared ownership; Sweat the details

Role Details

Experience 5–10 yrs

Level Senior

Work Mode Remote

Type FULL TIME

Education Advanced degree (PhD preferred, or MSS plus equivalent indus

Category datalab

AI-Extracted Insights

Domain Areas

ai-training-data; agentic-ai; large-scale-corpora; evaluation-datasets; frontier-models; healthcare; finance; scientific-settings

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
