Company

Software Engineering for AI

Senior Software Engineer — AI Evaluation & Benchmarks (Python)

Toronto, Ontario, Canada · CONTRACT · Remote Friendly


The Brief

“Senior Software Engineer — AI Evaluation & Benchmarks (Python). Skills: Python, AI Evaluation, Benchmarking. Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work. Design coding benchmarks that evaluate frontier models on real-world programming tasks — reasoning, debugging, and production-quality code”

What You'll Achieve.

benchmarks that meaningfully separate what frontier models can and can't do; evaluations that actually distinguish strong models from weak ones

Industry & Context.

Software Engineering for AI

Problems you'll solve

reasoning; debugging; edge-case failures

Eligibility Requirements

Open to contractors in accepted locations only. Applicants need valid documentation to work as an independent contractor in their country of residence. The role is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. The company is unable to provide offer letters or employment verification for this role.

What They're Looking For.

Must Have

4+ years of professional software engineering experience; Expert Python: clean, performant, well-tested code; Hands-on experience working in large, complex codebases; Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines; Command of Git and modern development workflows; Track record at a high-growth tech company or top-tier software organization; Written English communication

Nice to Have

Senior or Lead-level profile with a history of technical ownership; Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience); Proficiency in additional languages: JavaScript, Go, C++, or others; CI/CD experience and writing robust unit tests (pytest, Mocha, JUnit); Background in security engineering or significant open-source contributions; Familiarity with AI/ML evaluation methodologies or model benchmarking

What You'll Do.

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work

Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and production-quality code

Build and maintain scalable data pipelines for evaluation workflows

Analyze model-generated code for correctness, reliability, and edge-case failures

Construct structured evaluation scenarios across large repos and multi-language environments

Provide detailed technical feedback on model performance and failure patterns

Contribute to evaluation frameworks that set the bar for how coding ability is measured
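
To give a rough sense of the day-to-day work listed above, here is a minimal sketch of an evaluation harness that runs a piece of model-generated code against typical and edge-case checks and reports which ones fail. It is an illustration only, not the employer's tooling: the names Case, load_candidate, evaluate, CANDIDATE and CASES are hypothetical, as are the sample "model output" and the expected value for the empty-list case.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Case:
    """One benchmark check: call arguments, expected result, and a label."""
    args: tuple
    expected: Any
    label: str  # e.g. "typical" or "edge: ..."


def load_candidate(source: str, func_name: str) -> Callable:
    """Compile model-generated source and return the named function.

    exec is used only to keep the sketch short; a real harness would
    isolate untrusted code in a sandbox or a separate process.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[func_name]


def evaluate(source: str, func_name: str, cases: list[Case]) -> dict:
    """Run every case and collect the labels of the failing ones."""
    func = load_candidate(source, func_name)
    failures = []
    for case in cases:
        try:
            ok = func(*case.args) == case.expected
        except Exception:
            ok = False  # a crash counts as a failure
        if not ok:
            failures.append(case.label)
    passed = len(cases) - len(failures)
    return {"passed": passed, "total": len(cases), "failures": failures}


# Hypothetical model output: fine on typical input, crashes on an empty list.
CANDIDATE = "def mean(xs):\n    return sum(xs) / len(xs)\n"

# Hypothetical task spec: the mean of an empty list is defined as 0.0.
CASES = [
    Case(args=([1, 2, 3],), expected=2.0, label="typical"),
    Case(args=([],), expected=0.0, label="edge: empty list"),
]

report = evaluate(CANDIDATE, "mean", CASES)
print(report)
```

Labelling each case lets the report say where a model fails (here, only on the empty-list edge case) rather than just how often, which is the kind of failure-pattern feedback the role describes.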

How You'll Work.

Communication Scope

written English communication

Full Job Description

BEFORE APPLYING

This role is open to contractors in accepted locations only. Please confirm your country is on the list before applying; we're unable to process applications from unlisted locations.

List of accepted countries and locations: https://docs.google.com/document/d/1FK0v1X3O3rqY0oB2k5xt0u5eiYaoYYKv_E4XS3kHXUs/edit?tab=t.0#heading=h.8jwvoue7ks7z

For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

WHAT YOU'LL BE DOING

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

- Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and production-quality code
- Build and maintain scalable data pipelines for evaluation workflows
- Analyze model-generated code for correctness, reliability, and edge-case failures
- Construct structured evaluation scenarios across large repos and multi-language environments
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do, and shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.

WHAT YOU'LL NEED

- 4+ years of professional software engineering experience (non-negotiable)
- Expert Python: clean, performant, well-tested code
- Hands-on experience working in large, complex codebases
- Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines


SKILL SIGNAL 15 detected · ranked by frequency

Python ×3

designing coding benchmarks ×3

building scalable data pipelines ×3

analyzing model-generated code ×3

constructing structured evaluation scenarios ×3

writing robust unit tests ×3

AI Evaluation ×2

Benchmarking ×2

Git ×2

JavaScript

technical ownership

AI coding evaluation

model benchmarking

evaluation methodologies

BEHAVIOURAL

detailed technical feedback

Role Details

Experience 4–10 yrs

Level Senior

Work Mode Fully remote

Type CONTRACT

Category software-engineering-for-ai

Salary Band 150k-200k

AI-Extracted Insights

Domain Areas

ai-evaluation; model-benchmarking; frontier-ai-models; software-engineering-work

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
