Firecrawl
Engineering Team
Research Engineer – Evals
“Research Engineer – Evals at Firecrawl. Skills: build evaluation systems, design metrics, build pipelines, generate datasets, own the feedback loop, design benchmarks, LLM evaluation methodology, production-minded, fast iteration. Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good — across scrape, crawl, extract, and map”
Industry & Context.
edge cases that break naive approaches; failure modes of LLM-based evaluation; debugged the places where they lie
US citizenship/visa requirement: N/A (remote role)
What They're Looking For.
Must Have
- 3+ years in ML engineering, applied AI, or data quality, with production systems
- Builds their own eval infrastructure
- Knows what "good" means for unstructured web data
- Fluent in LLM evaluation methodology
- Production-minded
- Fast and clear
Nice to Have
- Any other niche expertise and skills
- Previous experience at a scraping, automation, or security-focused startup
- Ex-founder
What You'll Do.
Build the eval stack from scratch
Design and own the systems that measure whether Firecrawl's outputs are actually good across scrape, crawl, extract, and map
integrating evals into CI/CD so regressions get caught before they ship
Design benchmarks that reflect reality
Build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches
Design the collection and labeling systems too
Own LLM-as-judge pipelines
Design and validate automated judges that score extraction quality at scale
know the failure modes of LLM-based evaluation
build the human review tooling needed when automation isn't enough
Close the loop with models and RL
Turn quality measurements into reward signals and feedback loops that make models meaningfully better
Run fast experiments and communicate clearly
Design experiments that test meaningful hypotheses and make decisions based on results
How You'll Work.
Team & Collaboration
Close the loop with RL and Search/IR research engineers
Communication Scope
Communicate results clearly, so anyone on the team can understand what they mean and what to do next; no decoder ring required
How to Apply on Ashby
- Ashby is a fast modern ATS — most applications take under 3 minutes.
- The resume parser is strong; verify parsed experience dates and job titles.
- Custom screening questions are often scored algorithmically — answer completely.
- Location field affects geo-based screening; use your actual metro area.