Mindrift
Freelance Agent Evaluation Engineer
Neural analysis suggests this role is
optimal for Senior candidates.
“Freelance Agent Evaluation Engineer at Mindrift. Skills: AI evaluation, Software development, Test writing. Build developer environments. Design tasks”
Industry & Context.
Root cause analysis
What They're Looking For.
Must Have
5+ years in software development, Experience writing tests, English proficiency - B2+
What You'll Do.
Build developer environments
Define task solvability
Verify agent solutions
Review agent solutions
How You'll Work.
Communication Scope
English proficiency
Full Job Description
_Please submit your CV in English and indicate your level of English proficiency._

Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems.

**Participation is project-based, not permanent employment.**

**What this opportunity involves**

We're building a dataset to evaluate AI coding agents - how well a model handles real-world developer tasks. You'll create challenging tasks and evaluation criteria within realistic simulated environments:

* Build realistic developer environments - a virtual company with codebase, infrastructure, and context (tickets, docs, conversations) that forms a believable development history
* Design tasks from intermediate states of these environments - craft the prompt, define what "solved" means, and ensure the task is solvable by an AI agent
* Write tests that verify agent solutions - accept all valid approaches and reject incorrect ones, neither too strict nor too lenient
* Iterate on tasks and tests based on QA feedback - review agent solutions, analyze failures, and refine until the evaluation is fair and robust

**What this is NOT**

* Not data labeling
* Not prompt engineering
* Not writing code from scratch - the agent writes most of the code; you guide and evaluate

**What we look for**

* 5+ years in software development
* Core stack: Python (FastAPI), JavaScript/TypeScript (React), Docker, Postgres, Kafka, Redis
* Experience writing tests (functional, integration)
* English proficiency - B2+

**Why this is hard**

Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution. Tasks have many valid solutions - writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.

**How it works**

Apply → Pass qualification(s) → Join a proj
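
**Illustration: a test that is neither too strict nor too lenient**

The point above about tests that accept every valid approach and reject incorrect ones can be sketched roughly as follows. This is only an illustrative example, not Mindrift's actual tooling: the task ("remove duplicate emails, ignoring case, output order unspecified"), the three candidate solutions, and the `check` helper are all hypothetical. The check asserts the task's contract (no duplicates, nothing missing, nothing invented) rather than one particular output shape.

```python
# Hypothetical task: "remove duplicate emails, ignoring case".
# Output order and letter case are deliberately left unspecified.

def solution_a(emails):
    # Valid: keeps the first occurrence and preserves input order.
    seen, out = set(), []
    for e in emails:
        key = e.lower()
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out


def solution_b(emails):
    # Valid: a different approach that returns sorted, lowercased emails.
    return sorted({e.lower() for e in emails})


def solution_wrong(emails):
    # Invalid: removes exact duplicates only, so case variants survive.
    return list(dict.fromkeys(emails))


def check(solve):
    """Return True if the output satisfies the task's contract."""
    data = ["Ann@x.io", "bob@x.io", "ann@x.io", "BOB@x.io", "cy@x.io"]
    result = [e.lower() for e in solve(list(data))]
    expected = {e.lower() for e in data}
    # No duplicates, nothing missing, nothing invented.
    return len(result) == len(set(result)) and set(result) == expected


if __name__ == "__main__":
    for name, fn in [("a", solution_a), ("b", solution_b), ("wrong", solution_wrong)]:
        print(name, check(fn))
```

A stricter check that compared against one exact list would wrongly reject `solution_b`, while a looser one that only confirmed a list came back would wrongly accept `solution_wrong`.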