Firecrawl

Engineering Team

Research Engineer – Evals

$160–240k San Francisco, California, United States FULL TIME Remote Friendly

The Brief

Research Engineer – Evals at Firecrawl. Build the eval stack from scratch: design and own the systems that measure whether Firecrawl's outputs are actually good across scrape, crawl, extract, and map. Skills: building evaluation systems, designing metrics, building pipelines, generating datasets, owning the feedback loop, designing benchmarks, LLM evaluation methodology, a production mindset, and fast iteration.

Industry & Context.

Engineering Team

Problems you'll solve

Edge cases that break naive approaches; the failure modes of LLM-based evaluation; debugging the places where automated judges lie

Eligibility Requirements

US citizenship/visa requirement: N/A for remote

What They're Looking For.

Must Have

3+ years in ML engineering, applied AI, or data quality, with production systems

Builds their own eval infrastructure

Knows what "good" means for unstructured web data

Fluent in LLM evaluation methodology

Production-minded

Fast and clear

Nice to Have

Any other niche expertise and skills

Previous experience at a scraping, automation, or security-focused startup

Ex-founder

What You'll Do.

Build the eval stack from scratch

Design and own the systems that measure whether Firecrawl's outputs are actually good across scrape, crawl, extract, and map

Integrate evals into CI/CD so regressions get caught before they ship
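
As a rough illustration of what catching regressions in CI could look like, the sketch below compares a candidate build's eval scores against a stored baseline and fails when any metric drops by more than a tolerance. The metric names, the `regression_gate` helper, and the 0.01 tolerance are hypothetical and are not taken from Firecrawl's actual stack.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    regressions: dict[str, float]  # metric name -> size of the drop


def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,  # hypothetical threshold, tune per metric
) -> GateResult:
    """Flag every baseline metric whose candidate score fell by more than `tolerance`."""
    regressions: dict[str, float] = {}
    for name, base_score in baseline.items():
        # A metric missing from the candidate run is treated as a score of 0.0.
        drop = base_score - candidate.get(name, 0.0)
        if drop > tolerance:
            regressions[name] = drop
    return GateResult(passed=not regressions, regressions=regressions)


if __name__ == "__main__":
    # Illustrative scores only.
    result = regression_gate(
        baseline={"scrape": 0.90, "extract": 0.80},
        candidate={"scrape": 0.905, "extract": 0.75},
    )
    # A CI job could exit non-zero here to block the merge.
    raise SystemExit(0 if result.passed else 1)
```

In a CI job, a non-zero exit status from a script like this would block the change until the regression is explained or fixed.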

Design benchmarks that reflect reality

Build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches

Design the ground truth collection and labeling systems too

Own LLM-as-judge pipelines

Design and validate automated judges that score extraction quality at scale

Know the failure modes of LLM-based evaluation

Build the human review tooling needed when automation isn't enough
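
One common way to validate an automated judge is to measure how often it agrees with human reviewers on a shared sample, corrected for chance agreement. The sketch below computes Cohen's kappa in plain Python. The label names and sample data are illustrative assumptions, and nothing here is claimed to be Firecrawl's method.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four extraction outputs.
    judge_labels = ["good", "good", "bad", "bad"]
    human_labels = ["good", "bad", "bad", "bad"]
    print(cohens_kappa(judge_labels, human_labels))
```

A low kappa on a human-labeled sample is one signal that the judge needs a better rubric, or that those cases should be routed to human review.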

Close the loop with models and RL

Turn quality measurements into reward signals and feedback loops that make models meaningfully better

Run fast experiments and communicate clearly

Design experiments that test meaningful hypotheses

Make decisions based on results

How You'll Work.

Team & Collaboration

Close the loop with RL and Search/IR research engineers

Communication Scope

Communicate clearly, so that anyone on the team can understand what you mean and what to do next; no decoder ring required

Skill Signal 40 detected

Core

build evaluation systems ×5

design metrics ×5

build pipelines ×5

generate datasets ×5

own the feedback loop ×5

design benchmarks ×5

LLM evaluation methodology ×4

Required

build benchmark datasets ×3

design ground truth collection and labeling systems ×3

own LLM-as-judge pipelines ×3

design and validate automated judges ×3

build human review tooling ×3

close the loop with models and RL ×3

turn quality measurements into reward signals and feedback loops ×3

run fast experiments ×3

design experiments ×3

write pipelines ×3

Nice to have

ML engineering

applied AI

LLM-as-judge systems

RLHF pipelines

production systems

unstructured web data

production behavior

hard tradeoffs between evaluation depth, coverage, and cost

impact, not tenure

CI/CD

Behavioural

care about what "good" actually means

engineering depth to measure it

understand the difference between an eval that measures something real and one that just flatters the system

communicate clearly

make decisions based on results

anyone on the team can understand what they mean — no decoder ring required

you understand that infra choices directly affect what you're actually measuring

you understand why markdown quality is hard to define

Role Details

Type

FULL TIME

Experience

3–5 yrs

Salary Band

150k-200k

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
