AGI, INC.
AI
Research Engineer - Evals
Neural analysis suggests this role is optimal for Senior candidates.
“Research Engineer - Evals at AGI, INC. Skills: eval harness, agent eval, on-device performance. Build the eval harness. Build eval suites.”
What You'll Achieve.
Ship eval harness; Ship eval suites; Ship dashboards and tooling; Ship against eval bar; Catch a regression; Clear a launch; Shape research roadmap
Industry & Context.
Decide what better means; Make decisions on real signal
SF, in person
What They're Looking For.
Must Have
Experience with Python, Experience with machine learning frameworks, Knowledge of PostgreSQL, Proficient in AWS, Proficient in EC2, Proficient in S3, Proficient in Lambda, Proficient in Terraform, Proficient in Docker, Experience with Java, Experience with Spring Boot, Familiarity with Kafka, Familiarity with Redis, Experience with agent eval, Experience with tool use, Experience with long-horizon tasks, Experience with multilingual behavior, Experience with on-device perf trade-offs, Experience with QA at OEM scale, Experience with shipping consumer agents
Nice to Have
Kubernetes
What You'll Do.
Build dashboards and tooling
Set the bar for shipped
Protect bar from deadlines
Measure non-deterministic systems
How You'll Work.
Team & Collaboration
Work with product engineers; Work with partnerships
Full Job Description
THINK DIFFERENT. BUILD THE FUTURE. 🚀

OUR MISSION
Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions. Software shouldn’t wait for commands; it should partner with you, amplifying what you can do every single day.

WHY AGI, INC.
We’re a stealth team of elite founders and AI researchers, with backgrounds spanning Stanford, OpenAI, and DeepMind. We’re industry leaders in mobile and computer-use agents, bringing these capabilities to consumer scale. Grounded in years of agent research, our AI is designed with trustworthiness and reliability as core pillars, not afterthoughts. We are supported by tier-1 investors who funded the first generation of AI giants; now they’re backing us to build the next: everyday AGI. (Watch the demo: https://drive.google.com/file/d/1ZydjdMeMh3x-QItUPQFbJUbhBW-4-XHa/view?usp=sharing)

If you see possibility where others see limits, read on.

You decide what "better" means. Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust, and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI, across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

🤩 TASKS YOU WILL OWN
- The eval suites that gate every model and agent release: capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss
- The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy
- The bar: what counts as ready to ship, and how we know

🤚 AREAS WHERE YOU WILL ASSIST
- Research, by making sure what we measure is what we want
- Product engineers, by instrumenting real-use
How to Apply on Ashby
- Ashby is a fast modern ATS — most applications take under 3 minutes.
- The resume parser is strong; verify parsed experience dates and job titles.
- Custom screening questions are often scored algorithmically — answer completely.
- Location field affects geo-based screening; use your actual metro area.