LawZero
AI Safety
Director, Evaluations
“Director, Evaluations at LawZero. Skills: AI system evaluation, Team leadership, Strategy development, Benchmark design, Dataset creation, Red-teaming, AI safety measurement. Define LawZero’s evaluations strategy and roadmap. Prioritize what needs to be measured and when”
What You'll Achieve.
Ensure evaluations remain independent of the main research stream; Ensure capability and safety claims can be trusted both internally and externally; Build trust with the wider AI safety community
What They're Looking For.
Must Have
- Advanced degree (MSc or higher) in machine learning, computer science, or a closely related field
- 10+ years of experience in machine learning
- At least 5 years in a leadership role building or scaling technical teams working on real-world ML products
- Hands-on expertise in designing and running large-scale evaluations of LLMs or other frontier ML systems across capabilities, safety, and adversarial robustness
- A track record of building evaluation datasets, benchmarks, or interactive environments from scratch, including for safety-relevant properties such as honesty, sycophancy, refusal behaviour, and adversarial robustness
Nice to Have
- Experience leading red-teaming exercises (automated, manual, or both) and working with third-party evaluation or red-teaming partners
- Experience releasing open-source datasets, benchmarks, or evaluation tooling
- Familiarity with current AI safety policy and standards work (UK AISI, US AISI, NIST, EU AI Act, etc.)
- Experience contributing to or coordinating with external safety institutes, grant funders, or government bodies
What You'll Do.
Define LawZero’s evaluations strategy and roadmap
Prioritize what needs to be measured and when
Build, lead, and grow LawZero’s Evaluations Team
Design novel benchmarks
Oversee the design and construction of new datasets, tasks, and virtual or interactive environments
Measure performance of the Scientist AI across capabilities, safety, explainability, causal mechanisms, and detection of adversarial attacks
Lead evaluation of the Scientist AI when deployed as a guardrail
Establish and lead automated and manual red-teaming programmes
Lead the construction of internal tooling and infrastructure needed to run evaluations at scale
Automate and standardize the pipeline wherever possible
Support research and product streams with their own internal requirements with respect to evaluations and benchmarking
Own LawZero’s public communication of evaluation results
How You'll Work.
Team & Collaboration
Close coordination with research and product teams; Operate the team independently of the main research and product streams; Partnership with external providers for red-teaming; Support research and product streams with their own internal requirements; Engagements with AI safety institutes, research collaborators, and grant funders
Communication Scope
Written and verbal communication skills; Ability to translate technical results for non-technical audiences such as executives, funders, and policymakers; Public communication of evaluation results; Represent LawZero externally on evaluations and AI safety measurement
Full Job Description
LawZero is a non-profit building safe-by-design AI systems. We’re building the Scientist AI, an advanced AI system designed from the ground up to be both highly capable and safe. As we develop both general‑purpose Scientist AI models and safety guardrails for frontier LLMs, we need rigorous, independent evaluation of every capability and safety claim we make.
We are looking for a Director of Evaluations to build, lead, and grow LawZero’s Evaluations Team. This is a foundational hire. You will define what world‑class evaluation looks like at LawZero, build the team and infrastructure to deliver it, and ensure that evaluations remain independent of the main research stream so that capability and safety claims can be trusted both internally and externally by the wider AI and AI safety community.
Key responsibilities
- Define LawZero’s evaluations strategy and roadmap, prioritising what needs to be measured and when, in close coordination with both research and product teams.
- Build up the Evaluations Team during your first 3–6 months, scaling to roughly 8–10 people across research, engineering, dataset and benchmark design, and red‑teaming.
- Operate the team independently of the main research and product streams in order to avoid conflicts of interest, including designing novel benchmarks that can be applied apples‑to‑apples to evaluate both the Scientist AI and frontier LLMs.
- Oversee the design and construction of new datasets, tasks, and virtual or interactive environments to measure performance of the Scientist AI across capabilities, safety (including honesty and goal-directedness), explainability, causal mechanisms and detecting adversarial attacks.
- Lead evaluation of the Scientist AI when deployed as a guardrail around frontier models, including its ability to comply with harm specifications, detect and block harmful responses, explain its decisions, and resist adversarial attacks such as jailbreaks, prompt injection, and data poisoning.
- Establish and lead our automa
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.