Research Scientist, Agentic Data & Benchmarking
Neural analysis suggests this role is optimal for Mid+ candidates.
“Research Scientist, Agentic Data & Benchmarking. Skills: Agentic data, Benchmarking, Evaluation, Data curation. Design and run evaluations of agentic capabilities. Turn notions of intelligence into reproducible metrics”
What You'll Achieve.
Ship a view that makes results legible; Measure the lift on model performance; Keep data and benchmark quality high
Industry & Context.
Diagnose anomalous eval results; Troubleshooting
What They're Looking For.
Must Have
BS, MS, or PhD in Computer Science, Machine Learning, or related field; 2+ years of experience in evaluations or training-data curation for ML systems; Python and PyTorch development experience; Demonstrated experience designing and deep-diving into evaluations, or curating and generating training datasets; Hands-on experience using LLM agents
Nice to Have
PhD preferred; Experience with reinforcement learning, reward design, or RL environment construction for LLMs; Background in statistics and experimental design; Experience with large-scale dataset sourcing, curation, and processing; Experience building or operating data pipelines and evaluation infrastructure reliable at scale; Experience evaluating or generating data for software-engineering or computer-use agents; Contributions to published research, public benchmarks, and/or open-source ML software
What You'll Do.
Design and run evaluations of agentic capabilities
Turn notions of intelligence into reproducible metrics
Build and harden evaluation harnesses
Run experiments characterizing how prompting affects agentic performance
Diagnose anomalous eval results
Determine cause of anomalous eval results
Communicate cause of anomalous eval results
Source, generate, and curate high-quality agentic training data
Design and scale RL environments and reward signals
Measure impact of RL environments on model performance
Manage technical relationships with external data vendors
Evaluate data quality
Iterate quickly on feedback
Develop QA frameworks
Keep data and benchmark quality high
Contribute to technical reports
Contribute to research publications
Contribute to open-source benchmarks and tooling
Partner with research and product teams
Translate capability goals into data artifacts
Translate capability goals into evaluation artifacts
How You'll Work.
Team & Collaboration
Partner with research teams; Partner with product teams
Communication Scope
Communicate clearly
Full Job Description
## Key responsibilities

### Benchmarking & evaluation

- Design and run evaluations of agentic capabilities (multi-step reasoning, tool use, long-horizon planning, computer use, and safety properties), turning ambiguous notions of "intelligence" into defensible, reproducible metrics.
- Build and harden evaluation harnesses so benchmarks run reliably at scale against training checkpoints, with clear signal on regressions and model health.
- Run experiments characterizing how prompting, sampling, scaffolding, and environment design affect agentic performance on internal and public benchmarks.
- Diagnose anomalous eval results mid-training-run (determine whether the cause is the model, the data, the harness, or the infrastructure) and communicate the answer clearly.

### Agentic data

- Source, generate, and curate high-quality agentic training data: trajectories, tool-use traces, and task datasets for new capabilities.
- Design and scale RL environments and reward signals, and measure their impact on model performance.
- Manage technical relationships with external data vendors and domain experts, evaluating data quality and iterating quickly on feedback.
- Develop QA frameworks that catch reward hacking, label noise, and contamination, keeping data and benchmark quality high.

### Across both

- Contribute to technical reports, research publications, and open-source benchmarks and tooling.
- Partner with research and product teams to translate capability goals into measurable data and evaluation artifacts.

## Qualifications

### Academic qualifications

- BS, MS, or PhD (or equivalent experience) in Computer Science, Machine Learning, or a related field.

### Minimum qualifications

- 2+ years of experience with a clear emphasis on evaluations and/or training-data curation for ML systems (related areas: LLM training/fine-tuning, RL, or distributed ML systems).
- Strong Python and PyTorch development experience.
- Demonstrated experience designing and deep-diving into evaluations, or curating and generating training data
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.