Fundamental

ML Researcher - Evaluations

Barcelona, Spain · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“ML Researcher - Evaluations at Fundamental. Skills: model evaluation, Python programming, building automated testing pipelines, translating real-world problems into quantifiable metrics. Design and implement rigorous evaluation frameworks. Translate real-world requirements gathered internally into measurable metrics that accurately reflect downstream use cases.”

Industry & Context.

Problems you'll solve

Translate ambiguous, real-world data challenges into concrete, defensible metrics; decode the final results; empirically measure exactly why a model fails and where it excels

What They're Looking For.

Must Have

Proven experience in Machine Learning, Data Science, or AI Engineering, with a focus on model evaluation, testing, or benchmarking

Programming skills in Python and relevant libraries such as pandas

A solid understanding of traditional ML metrics alongside emerging ways to evaluate foundation model outputs

Experience building and maintaining automated testing pipelines or evaluation harnesses

Excellent internal communication skills

Experience with translating real-world problems into quantifiable metrics

Nice to Have

Experience with tabular data or time series forecasting

What You'll Do.

Design and implement rigorous evaluation frameworks

Translate real-world requirements gathered internally into measurable metrics that accurately reflect downstream use cases

Build, scale, and maintain the internal Python pipelines and datasets used to stress-test our models on a day-to-day basis

Scout the industry for new, relevant external benchmarks for tabular data

Evaluate our models against these public benchmarks and maintain those pipelines

Monitor external foundation models and classical ML baselines

Integrate and update external foundation models and classical ML baselines within our system

Create and maintain a comprehensive leaderboard and characterization of our models

Report back to the research team exactly where our models are excelling and where they are falling short

How You'll Work.

Team & Collaboration

Working alongside our core researchers; taking signals from our internal deployment teams; reporting back to the research team

Communication Scope

Excellent internal communication skills; comfortable telling the research team hard truths about model regressions; adept at translating field requirements into technical metrics

Full Job Description

ABOUT FUNDAMENTAL

Fundamental is an AI company pioneering the future of enterprise decision-making. Founded by DeepMind alumni, Fundamental has developed NEXUS – the world's most powerful Large Tabular Model (LTM) – purpose-built for the structured records that actually drive enterprise decisions. Backed by world class investors and trusted by Fortune 100 companies, Fundamental unlocks trillions of dollars of value by giving businesses the Power to Predict.

At Fundamental, you'll work on unprecedented technical challenges in foundation model development and build technology that transforms how the world's largest companies make decisions. This is your opportunity to be part of a category-defining company from the ground-up. Join the team defining the future of enterprise AI.

We are looking for a Machine Learning Researcher - Evaluations to establish the ground truth for what our models can actually do. In this role, you will take ambiguous, real-world data challenges and translate them into concrete, defensible metrics that our researchers and leadership can trust. Evaluation is not an afterthought here; it is the engine that drives our research roadmap. Working alongside our core researchers, you will be embedded in the entire lifecycle of model development. This means taking signals from our internal deployment teams to define what matters, tracking performance across live training runs, and decoding the final results. If you are obsessed with empirically measuring exactly why a model fails and where it excels, this role is for you.

KEY RESPONSIBILITIES

- Develop Signal-Driven Evals: Design and implement rigorous evaluation frameworks. You will translate real-world requirements gathered internally into measurable metrics that accurately reflect downstream use cases.

- Own the Evaluation Infrastructure: Build, scale, and maintain the internal Python pipelines and datasets used to stress-test our models on a day-to-day basis.

- Explore External Benchmarks: Scout the

SKILL SIGNAL 26 detected · ranked by frequency

Python programming ×5

building automated testing pipelines ×5

model evaluation ×3

pandas library usage ×3

maintaining automated testing pipelines ×3

building evaluation harnesses ×3

maintaining evaluation harnesses ×3

monitoring external foundation models ×3

integrating external foundation models ×3

updating external foundation models ×3

monitoring classical ML baselines ×3

integrating classical ML baselines ×3

updating classical ML baselines ×3

creating leaderboards ×3

maintaining leaderboards ×3

characterizing models ×3

reporting model performance ×3

translating real-world problems into quantifiable metrics ×2

Python ×2

pandas ×2

model testing

benchmarking

translating real-world requirements into measurable metrics

translating field requirements into technical metrics

BEHAVIOURAL

communication skills

ownership

bias toward action

low-ego culture

Role Details

Type FULL TIME

Category research

AI-Extracted Insights

Domain Areas

tabular-data

time-series-forecasting

foundation-model-development

enterprise-ai

enterprise-decision-making

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
