LawZero

AI Safety

Director, Evaluations

Montréal, Quebec, Canada

The Brief

“Director, Evaluations at LawZero. Skills: AI system evaluation, Team leadership, Strategy development, Benchmark design, Dataset creation, Red-teaming, AI safety measurement. Define LawZero’s evaluations strategy and roadmap. Prioritize what needs to be measured and when”

What You'll Achieve.

Ensure evaluations remain independent of the main research stream; Ensure capability and safety claims can be trusted both internally and externally; Build trust with the wider AI safety community

Industry & Context.

AI Safety

What They're Looking For.

Must Have

Advanced degree (MSc or higher) in machine learning, computer science, or a closely related field; 10+ years of experience in machine learning; At least 5 years in a leadership role building or scaling technical teams working on real-world ML products; Hands-on expertise in designing and running large-scale evaluations of LLMs or other frontier ML systems across capabilities, safety, and adversarial robustness; A track record of building evaluation datasets, benchmarks, or interactive environments from scratch, including for safety-relevant properties such as honesty, sycophancy, refusal behaviour, and adversarial robustness

Nice to Have

Experience leading red-teaming exercises (automated, manual, or both) and working with third-party evaluation or red-teaming partners; Experience releasing open-source datasets, benchmarks, or evaluation tooling; Familiarity with current AI safety policy and standards work (UK AISI, US AISI, NIST, EU AI Act, etc.); Experience contributing to or coordinating with external safety institutes, grant funders, or government bodies

What You'll Do.

Define LawZero’s evaluations strategy and roadmap

Prioritize what needs to be measured and when

Build, lead, and grow LawZero’s Evaluations Team

Design novel benchmarks

Oversee the design and construction of new datasets, tasks, and virtual or interactive environments

Measure performance of the Scientist AI across capabilities, safety, explainability, causal mechanisms, and detecting adversarial attacks

Lead evaluation of the Scientist AI when deployed as a guardrail

Establish and lead automated and manual red-teaming programmes

Lead the construction of internal tooling and infrastructure needed to run evaluations at scale

Automate and standardize the pipeline wherever possible

Support research and product streams with their own internal requirements with respect to evaluations and benchmarking

Own LawZero’s public communication of evaluation results

How You'll Work.

Team & Collaboration

Close coordination with research and product teams; Operate the team independently of the main research and product streams; Partnership with external providers for red-teaming; Support research and product streams with their own internal requirements; Engagements with AI safety institutes, research collaborators, and grant funders

Communication Scope

Written and verbal communication skills; Ability to translate technical results for non-technical audiences such as executives, funders, and policymakers; Public communication of evaluation results; Represent LawZero externally on evaluations and AI safety measurement

Full Job Description

LawZero is a non-profit building safe-by-design AI systems. We’re building the Scientist AI, an advanced AI system designed from the ground up to be both highly capable and safe. As we develop both general‑purpose Scientist AI models and safety guardrails for frontier LLMs, we need rigorous, independent evaluation of every capability and safety claim we make.

We are looking for a Director of Evaluations to build, lead, and grow LawZero’s Evaluations Team. This is a foundational hire. You will define what world‑class evaluation looks like at LawZero, build the team and infrastructure to deliver it, and ensure that evaluations remain independent of the main research stream so that capability and safety claims can be trusted both internally and externally by the wider AI and AI safety community.

Key responsibilities

Define LawZero’s evaluations strategy and roadmap, prioritising what needs to be measured and when, in close coordination with both research and product teams.

Build up the Evaluations Team during your first 3–6 months, scaling to roughly 8–10 people across research, engineering, dataset and benchmark design, and red‑teaming.

Operate the team independently of the main research and product streams in order to avoid conflicts of interest, including designing novel benchmarks that can be applied apples‑to‑apples to evaluate both the Scientist AI and frontier LLMs.

Oversee the design and construction of new datasets, tasks, and virtual or interactive environments to measure performance of the Scientist AI across capabilities, safety (including honesty and goal-directedness), explainability, causal mechanisms and detecting adversarial attacks.

Lead evaluation of the Scientist AI when deployed as a guardrail around frontier models, including its ability to comply with harm specifications, detect and block harmful responses, explain its decisions, and resist adversarial attacks such as jailbreaks, prompt injection, and data poisoning.

Establish and lead our automa

SKILL SIGNAL 46 detected · ranked by frequency

Red-teaming ×5

Building and scaling technical teams ×3

Designing and running large-scale evaluations ×3

Building evaluation datasets ×3

Building benchmarks ×3

Building interactive environments ×3

Automating and standardizing pipelines ×3

Translating technical results for non-technical audiences ×3

AI system evaluation ×2

Team leadership ×2

Strategy development ×2

Benchmark design ×2

Dataset creation ×2

AI safety measurement ×2

LLMs

Machine Learning

Python

Scala

Spark

dbt

Snowflake

BigQuery

Redshift

Databricks

Fivetran

Airflow

dbt Cloud

Delta Lake

scikit-learn

TensorFlow

PyTorch

BEHAVIOURAL

Leadership, Communication, Collaboration, Adaptability, Problem-solving, Strategic thinking, Independent operation, Building structure

Role Details

Experience 10+ yrs

Level Director

Education MSc or higher

Category science

AI-Extracted Insights

Domain Areas

ai-safety, frontier-llms, safe-by-design-ai-systems, scientist-ai-models, harm-specifications, ai-risk-mitigation

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.
