Magic

Technology

Member of Technical Staff, Evals

$200–550k · San Francisco, California, United States · FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Member of Technical Staff, Evals at Magic. Skills: platform development, evaluation systems, machine learning evaluation. Build and maintain the internal evals platform.”

What You'll Achieve.

Build trustworthy evaluation systems; Make better research decisions; Build better datasets; Ship better products

Industry & Context.

Technology

Problems you'll solve

Critical reasoning; Ambiguity navigation; Measurement validation

What They're Looking For.

Must Have

Experience building production systems, Experience building internal platforms, Experience building developer infrastructure, Experience working with machine learning systems, Experience working with evaluation frameworks, Experience working with data infrastructure, Experience working with research tooling, Track record of owning technical projects, Skepticism toward results, Ability to reason critically, Experience designing systems at scale, Experience implementing systems at scale, Experience operating systems at scale, Comfortable navigating ambiguity

Nice to Have

Experience with AI/ML, Experience with data engineering, Experience with security

What You'll Do.

Build internal evals platform

Maintain internal evals platform

Develop infrastructure for evaluations

Build systems to measure dataset quality

Identify opportunities to improve training data

Improve evaluation correctness

Improve evaluation reproducibility

Improve evaluation reliability

Audit public benchmarks

Improve public benchmarks

Audit evaluation methodologies

Improve evaluation methodologies

Audit open-source implementations

Improve open-source implementations

Partner with teams to define metrics

Build tooling for decision making

Build frameworks for decision making

How You'll Work.

Team & Collaboration

Research teams; Data teams; Inference teams; Product teams; Cross-functional teams

Process & Methodology

End-to-end project ownership

Full Job Description

Magic’s mission is to build safe AGI that accelerates humanity’s progress on the world’s most important problems. We believe the most promising path to safe AGI lies in automating research and code generation to improve models and solve alignment more reliably than humans can alone. Our approach combines frontier-scale pre-training, domain-specific RL, ultra-long context, and inference-time compute to achieve this goal.

ABOUT THE ROLE

Evals builds the internal platform that teams across Magic use to evaluate the performance of internal and external models. The team supports pre-training, post-training, data, inference, and product, and sits on the critical path of many of the company's most important decisions.

As a Member of Technical Staff on Evals, you will build both the platform and the evaluations themselves. You'll develop infrastructure for large-scale evaluations, data ablations, and dataset quality analysis, while designing and validating the methodologies used to measure model performance.

Sweating the details matters on this team. Many benchmarks, papers, and open-source evaluation frameworks contain subtle bugs or flawed assumptions that lead to misleading conclusions. We care deeply about correctness, reproducibility, and measurement quality.

Evals are essential to the success of the company. By building trustworthy evaluation systems, you will help Magic make better research decisions, build better datasets, and ship better products.

WHAT YOU'LL WORK ON

- Build and maintain the internal evals platform used across Magic
- Design, implement, and validate eval tasks for pre-training, post-training, reinforcement learning, inference, and product systems
- Develop infrastructure for running large-scale evaluations
- Build systems to measure dataset quality and identify opportunities to improve training data
- Improve evaluation correctness, reproducibility, and reliability
- Audit and improve upon public benchmarks, evaluation methodologies, and open-sourc

SKILL SIGNAL 24 detected · ranked by frequency

Large-scale evaluations ×3

Data ablations ×3

Dataset quality analysis ×3

Model performance measurement ×3

Evaluation correctness ×3

Evaluation reproducibility ×3

Evaluation reliability ×3

Public benchmarks audit ×3

Open-source implementations audit ×3

Metric definition ×3

Platform development ×2

Evaluation systems ×2

Machine learning evaluation ×2

Python

Machine Learning

AGI

System design

Methodology validation

Measurement quality

Research decisions

Product decisions

Evaluation frameworks

BEHAVIOURAL

Integrity Focus

Role Details

Type FULL TIME

Category engineering

Salary Band 200k+

AI-Extracted Insights

Domain Areas

safe-agi; agi-alignment; frontier-scale-pre-training; domain-specific-rl; ultra-long-context; inference-time-compute; model-performance; dataset-quality

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
