AGI, INC.

Research Engineer - Evals

San Francisco, California, United States; La Jolla, California, United States · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Research Engineer - Evals at AGI, INC. Skills: eval harness, agent eval, on-device performance. Build the eval harness and the eval suites.”

What You'll Achieve.

Ship eval harness; Ship eval suites; Ship dashboards and tooling; Ship against eval bar; Catch a regression; Clear a launch; Shape research roadmap

Industry & Context.

Problems you'll solve

Decide what better means; Make decisions on real signal

Eligibility Requirements

SF, in person

What They're Looking For.

Must Have

Experience with Python, Experience with machine learning frameworks, Knowledge of PostgreSQL, Proficient in AWS, Proficient in EC2, Proficient in S3, Proficient in Lambda, Proficient in Terraform, Proficient in Docker, Experience with Java, Experience with Spring Boot, Familiarity with Kafka, Familiarity with Redis, Experience with agent eval, Experience with tool use, Experience with long-horizon tasks, Experience with multilingual behavior, Experience with on-device perf trade-offs, Experience with QA at OEM scale, Experience with shipping consumer agents

Nice to Have

Kubernetes

What You'll Do.

Build dashboards and tooling

Set the bar for shipped

Protect bar from deadlines

Measure non-deterministic systems

How You'll Work.

Team & Collaboration

Work with product engineers; Work with partnerships

Full Job Description

THINK DIFFERENT. BUILD THE FUTURE. 🚀

OUR MISSION

Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions. Software shouldn’t wait for commands; it should partner with you, amplifying what you can do every single day.

WHY AGI, INC.

We’re a stealth team of elite founders and AI researchers, with backgrounds spanning Stanford, OpenAI, and DeepMind. We’re industry leaders in mobile and computer-use agents, bringing these capabilities to consumer scale. Grounded in years of agent research, our AI is designed with trustworthiness and reliability as core pillars, not afterthoughts. We are supported by tier-1 investors who funded the first generation of AI giants; now they’re backing us to build the next: everyday AGI. (Watch the demo: https://drive.google.com/file/d/1ZydjdMeMh3x-QItUPQFbJUbhBW-4-XHa/view?usp=sharing)

If you see possibility where others see limits, read on.

You decide what "better" means. Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust, and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI, across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

🤩 TASKS YOU WILL OWN

- The eval suites that gate every model and agent release: capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss
- The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy
- The bar: what counts as ready to ship, and how we know

🤚 AREAS WHERE YOU WILL ASSIST

- Research, by making sure what we measure is what we want
- Product engineers, by instrumenting real-use

SKILL SIGNAL 29 detected · ranked by frequency

Agent eval ×6

Tool use ×4

Long-horizon tasks ×4

Multilingual behavior ×4

On-device perf ×4

QA at OEM scale ×4

Consumer agents ×4

Eval harness ×3

On-device performance ×2

Python

Machine Learning

PostgreSQL

AWS

EC2

Lambda

Terraform

Docker

Kubernetes

Java

Spring Boot

Kafka

Redis

Trustworthiness

Reliability

Consumer scale

OEM partner

Dashboards

Tooling

Role Details

Experience 5–10 yrs

Level Senior

Work Mode In-person

Type FULL TIME

Category engineering

AI-Extracted Insights

Domain Areas

everyday-agi, mobile-agents, computer-use-agents, consumer-scale-agents, trustworthy-ai, reliable-ai, non-deterministic-systems, agent-eval

How to Apply on Ashby

Ashby is a fast, modern ATS; most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
