Research Scientist, Agentic Data & Benchmarking

$150–450k · Sunnyvale, California, United States · Full Time
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Research Scientist, Agentic Data & Benchmarking. Skills: Agentic data, Benchmarking, Evaluation, Data curation. Design and run evaluations of agentic capabilities. Turn notions of intelligence into reproducible metrics.”

What You'll Achieve.

Ship a view that makes results legible; Measure the lift on model performance; Keep data and benchmark quality high

Industry & Context.

Research
Problems you'll solve

Diagnose anomalous eval results; Troubleshooting

What They're Looking For.

Must Have

BS, MS, or PhD in Computer Science, Machine Learning, or related field; 2+ years of experience in evaluations or training-data curation for ML systems; Python and PyTorch development experience; Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets; Hands-on experience using LLM agents

Nice to Have

PhD preferred; Experience with reinforcement learning, reward design, or RL environment construction for LLMs; Background in statistics and experimental design; Experience with large-scale dataset sourcing, curation, and processing; Experience building or operating data pipelines and evaluation infrastructure that is reliable at scale; Experience evaluating or generating data for software-engineering or computer-use agents; Contributions to published research, public benchmarks, and/or open-source ML software

What You'll Do.

Design and run evaluations of agentic capabilities

Turn notions of intelligence into reproducible metrics

Build and harden evaluation harnesses

Run experiments characterizing how prompting affects agentic performance

Diagnose anomalous eval results

Determine cause of anomalous eval results

Communicate cause of anomalous eval results

Source, generate, and curate high-quality agentic training data

Design and scale RL environments and reward signals

Measure impact of RL environments on model performance

Manage technical relationships with external data vendors

Evaluate data quality

Iterate quickly on feedback

Develop QA frameworks

Keep data and benchmark quality high

Contribute to technical reports

Contribute to research publications

Contribute to open-source benchmarks and tooling

Partner with research and product teams

Translate capability goals into data artifacts

Translate capability goals into evaluation artifacts

How You'll Work.

Team & Collaboration

Partner with research teams; Partner with product teams

Communication Scope

Communicate clearly

Full Job Description

## Key responsibilities

### Benchmarking & evaluation

  • Design and run evaluations of agentic capabilities (multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties), turning ambiguous notions of "intelligence" into defensible, reproducible metrics.
  • Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.
  • Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.
  • Diagnose anomalous eval results mid-training-run: determine whether the cause is the model, the data, the harness, or the infrastructure, and communicate the answer clearly.

### Agentic data

  • Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.
  • Design and scale RL environments and reward signals, and measure their impact on model performance.
  • Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.
  • Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.

### Across both

  • Contribute to technical reports, research publications, and open-source benchmarks and tooling.
  • Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.

## Qualifications

### Academic qualifications

  • BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.

### Minimum qualifications

  • 2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).
  • Strong Python and PyTorch development experience.
  • Demonstrated experience designing and deep-diving into evaluations, or curating and generating training data

How to Apply on Lever

  • Lever uses a streamlined one-page form — apply in under 5 minutes.
  • LinkedIn import works well; review parsed data before submitting.
  • The cover letter field is optional but visible to reviewers — use it to differentiate.
  • Referral codes from employees can significantly boost visibility of your application.
