Firecrawl

Engineering Team

Research Engineer – Evals

$160–240k · San Francisco, California, United States · Full Time · Remote Friendly

The Brief

“Research Engineer – Evals at Firecrawl. Build the eval stack from scratch: design and own the systems that measure whether Firecrawl's outputs are actually good across scrape, crawl, extract, and map. Skills: building evaluation systems, designing metrics, building pipelines, generating datasets, owning the feedback loop, designing benchmarks, LLM evaluation methodology, a production mindset, and fast iteration.”

Industry & Context.

Engineering Team

Problems you'll solve

Edge cases that break naive approaches; failure modes of LLM-based evaluation; debugging the places where LLM-based evaluations lie

Eligibility Requirements

US citizenship/visa requirement: N/A for remote

What They're Looking For.

Must Have

  • 3+ years in ML engineering, applied AI, or data quality, with production systems
  • Builds their own eval infrastructure
  • Knows what "good" means for unstructured web data
  • Fluent in LLM evaluation methodology
  • Production-minded
  • Fast and clear

Nice to Have

  • Any other niche expertise and skills
  • Previous experience at a scraping, automation, or security-focused startup
  • Ex-founder

What You'll Do.

Build the eval stack from scratch

Design and own the systems that measure whether Firecrawl's outputs are actually good across scrape, crawl, extract, and map

Integrate evals into CI/CD so regressions get caught before they ship

Design benchmarks that reflect reality

Build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches

Design the collection and labeling systems too

Own LLM-as-judge pipelines

Design and validate automated judges that score extraction quality at scale

Know the failure modes of LLM-based evaluation

Build the human review tooling needed when automation isn't enough

Close the loop with models and RL

Turn quality measurements into reward signals and feedback loops that make models meaningfully better

Run fast experiments and communicate clearly

Design experiments that test meaningful hypotheses

Make decisions based on results

How You'll Work.

Team & Collaboration

Close the loop with RL and Search/IR research engineers

Communication Scope

Communicate clearly, so that anyone on the team can understand what the results mean and what to do next, no decoder ring required

How to Apply on Ashby

  • Ashby is a fast modern ATS — most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.
