Notion

Engineering

Software Engineer, Agent Dev Velocity

$214–300k San Francisco, California, United States FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Software Engineer, Agent Dev Velocity at Notion. Skills: Agent Dev Velocity tooling, evaluation backbone, developer tooling, distributed systems, measurement, eval runners, eval harnesses, benchmark tooling, dataset tooling, reliability, observability. Build and improve scalable eval runners and harnesses that work locally, in CI, and on scheduled runs. Make it easy for engineers to add high-signal evals: better templates, fixtures, debugging tools, and clear workflows”

What You'll Achieve.

Ship high-quality AI faster and more safely; make AI evaluations easy to create, cheap to run, and hard to ignore; let engineers across the AI org iterate with confidence; keep us honest about quality over time; enable reusable eval workspaces and data-driven workflows that surface issues through data mining and continuous measurement

Industry & Context.

Engineering

Problems you'll solve

failure triage

Eligibility Requirements

Work from our offices on Mondays, Tuesdays, and Thursdays, our designated Anchor Days. Certain teams or positions may require additional in-office workdays.

What They're Looking For.

Must Have

Software engineering fundamentals; experience shipping production systems; proficiency with TypeScript/Node and/or Python; experience building reliable systems in distributed environments (queues, retries, idempotency, and backfills); comfort working with data pipelines (batch processing, data quality, versioning, and reproducibility); practical experience designing measurement or evaluation systems

Nice to Have

LLM eval experience is a plus, though testing and benchmarking instincts also apply; you don’t need to be an AI expert, but you’re curious and willing to adopt AI tools to work smarter and deliver better results; experience building developer tooling (CLI tools, CI integrations, or internal platforms); familiarity with LLM evaluation techniques (rubrics, human review loops, dataset curation, and regression detection); experience collaborating across teams to roll out new workflows and drive adoption

What You'll Do.

Build and improve scalable eval runners and harnesses that work locally, in CI, and on scheduled runs

Make it easy for engineers to add high-signal evals: better templates, fixtures, debugging tools, and clear workflows

Build and maintain benchmark and dataset tooling (curation pipelines, versioning, artifact management, and regression tracking)

Improve reliability and observability for eval execution (retries, idempotency, cost and latency visibility, and failure triage)

How You'll Work.

Team & Collaboration

Partner closely with AI product, AI platform, and infrastructure teams to integrate evals into day-to-day shipping workflows; Experience collaborating across teams to roll out new workflows and drive adoption

Full Job Description

ABOUT US:

Notion helps you build beautiful tools for your life’s work. In today's world of endless apps and tabs, Notion provides one place for teams to get everything done, seamlessly connecting docs, notes, projects, calendar, and email, with AI built in to find answers and automate work. Millions of users, from individuals to large organizations like Toyota, Figma, and OpenAI, love Notion for its flexibility and choose it because it helps them save time and money.

In-person collaboration is essential to Notion's culture. We require all team members to work from our offices on Mondays, Tuesdays, and Thursdays, our designated Anchor Days. Certain teams or positions may require additional in-office workdays.

ABOUT THE ROLE:

Agent Dev Velocity builds the tooling and evaluation backbone that helps Notion ship high-quality AI faster and more safely. We build the infrastructure that makes AI evaluations easy to create, cheap to run, and hard to ignore, so engineers across the AI org can iterate with confidence.

In this role, you will work at the intersection of developer tooling, distributed systems, and measurement. You will build systems for running and maintaining evals at scale, and you will help create durable benchmarks and datasets that keep us honest about quality over time. You will help evolve evals into a system by enabling reusable eval workspaces and data-driven workflows that surface issues through data mining and continuous measurement.

WHAT YOU'LL ACHIEVE:

- Build and improve scalable eval runners and harnesses that work locally, in CI, and on scheduled runs.
- Make it easy for engineers to add high-signal evals: better templates, fixtures, debugging tools, and clear workflows.
- Build and maintain benchmark and dataset tooling (curation pipelines, versioning, artifact management, and regression tracking).
- Improve reliability and observability for eval execution (retries, idempotency, cost and latency visibility, and failure triage).
- Partner closely


SKILL SIGNAL 28 detected · ranked by frequency

developer tooling ×5

software engineering fundamentals ×3

shipping production systems ×3

building reliable systems in distributed environments ×3

data pipelines ×3

measurement systems ×3

evaluation systems ×3

LLM evaluation techniques ×3

Agent Dev Velocity tooling ×2

evaluation backbone ×2

distributed systems ×2

measurement ×2

eval runners ×2

eval harnesses ×2

benchmark tooling ×2

dataset tooling ×2

reliability ×2

observability ×2

TypeScript

Node

Python

AI product

AI platform

infrastructure teams

CLI tools

CI integrations

internal platforms

BEHAVIOURAL

curious

willing to adopt AI tools

builder at heart

enthusiastic

Role Details

Work Mode Hybrid

Type FULL TIME

Category engineering

Salary Band 200k+

AI-Extracted Insights

Domain Areas

ai

llm-evaluation

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
