P-1 AI

Member of Technical Staff - Evals

$170–200k · United States · Full-time · Remote-friendly
The Brief

“Member of Technical Staff - Evals at P-1 AI. Skills: evals, software development, AI systems. Implement and operate the system for organizing, transforming, running, grading, and reporting on eval benchmarks. Design and execute the process by which we develop and QA our evals.”

What You'll Achieve.

Ensure that Archie is learning and retaining the skills it needs to perform its engineering work successfully; benchmark it against industry skill expectations; and continuously benchmark our evolving AI platform and the experiments we’re performing around it.

Industry & Context.

AI

Problems you'll solve

Quantitative intuition over physical product domains

Eligibility Requirements

Plan to spend one week per quarter co-working with the rest of the company in our San Mateo office, with occasional team travel workshops in between.

What They're Looking For.

Must Have

  • Experience constructing comprehensive test suites for software and/or AI systems, including coordinating the contributions of others
  • Experience designing metrics to evaluate systems and visualize their performance, including differences across successive generations
  • Proficiency in Python, including complex modules, and in modern software development tools and practices (Git, CI/CD, etc.)
  • Ability to thrive in a fast-paced, dynamic startup environment

Nice to Have

Experience developing, managing, and running evals against LLM-based systems

What You'll Do.

Implement and operate the system for organizing, transforming, running, grading, and reporting on eval benchmarks

Design and execute the process by which we develop and QA our evals

Ensure that evals run effectively within our CI/CD system

Create methods for detecting and testing for common quality challenges of AI

Be a technical leader in the consistent implementation and organization of automated tests across other areas of our technology stacks

How You'll Work.

Team & Collaboration

Coordinating the contributions of others, including incorporating input from our own engineering team, industrial partners, and subject-matter experts

Communication Scope

Good communication skills with a variety of stakeholders (AI researchers, domain experts, application developers)
