Mindrift

Freelance Agent Evaluation Engineer

Remote, part-time
Market Sentiment

High demand. Neural analysis suggests this role is best suited to senior candidates.

The Brief

“Freelance Agent Evaluation Engineer at Mindrift. Skills: AI evaluation, software development, test writing. Build developer environments and design tasks.”

Industry & Context.

Problems you'll solve

Root cause analysis

What They're Looking For.

Must Have

5+ years in software development

Experience writing tests

English proficiency - B2+

What You'll Do.

Build developer environments

Define task solvability

Verify agent solutions

Review agent solutions

How You'll Work.

Communication Scope

English proficiency

Full Job Description

_Please submit your CV in English and indicate your level of English proficiency._

Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. **Participation is project-based, not permanent employment.**

**What this opportunity involves**

We're building a dataset to evaluate AI coding agents - how well a model handles real-world developer tasks. You'll create challenging tasks and evaluation criteria within realistic simulated environments:

* Build realistic developer environments - a virtual company with codebase, infrastructure, and context (tickets, docs, conversations) that forms a believable development history
* Design tasks from intermediate states of these environments - craft the prompt, define what "solved" means, and ensure the task is solvable by an AI agent
* Write tests that verify agent solutions - accept all valid approaches and reject incorrect ones, neither too strict nor too lenient
* Iterate on tasks and tests based on QA feedback - review agent solutions, analyze failures, and refine until the evaluation is fair and robust

**What this is NOT**

* Not data labeling
* Not prompt engineering
* Not writing code from scratch - the agent writes most of the code; you guide and evaluate

**What we look for**

* 5+ years in software development
* Core stack: Python (FastAPI), JavaScript/TypeScript (React), Docker, Postgres, Kafka, Redis
* Experience writing tests (functional, integration)
* English proficiency - B2+

**Why this is hard**

Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution. Tasks have many valid solutions - writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.

**How it works**

Apply → Pass qualification(s) → Join a proj
