Wizard

AI Applied Scientist

$225–280k · Erlangen, Germany · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“AI Applied Scientist at Wizard. Skills: AI Applied Scientist, ML, LLM evaluation, experimentation, data analysis. Define and evolve accuracy metrics across the full shopping experience (retrieval, ranking, recommendations, outcomes). Design and run experiments to measure improvements and regressions”

What You'll Achieve.

Define and evolve accuracy metrics; Design and run experiments; Build and maintain evaluation datasets, benchmarks, and scoring frameworks; Improve the LLM judges; Translate ambiguous product questions into clear, measurable hypotheses and analysis; Validate model changes and guide iteration; Drive improvements through data; Make agent performance visible, trusted, and actionable; Own the evaluation framework; Drive measurable improvements to LLM judge quality; Run experiments that influence at least one significant model or product change; Stand up automated evaluation; Build dashboards and reporting; Lead applied science work on the next frontier; Influence team-level strategy; Mentor and help grow the science function; Clear, trusted accuracy metrics are consistently used across product and engineering; A robust automated evaluation framework for both offline and live experiments; Model and product changes are consistently measured before and after launch; Demonstrable improvements in LLM judge quality and eval coverage; Science leadership that informs what we build, not just whether it works

Industry & Context.

Problems you'll solve

Translate ambiguous product questions into clear, measurable hypotheses and analysis; Identify failure modes and edge cases, and drive improvements through data

What They're Looking For.

Must Have

5+ years in Applied ML, AI Research, or Applied Science; Hands-on experience evaluating modern AI/ML systems: LLMs, agents, ranking, or recommendations; Direct experience with LLM-based systems: judge models, RAG, prompt engineering, fine-tuning, RLHF, or similar; Experimentation foundations: A/B testing, causal inference, statistical rigor; Proven ability to operate in ambiguity: defining problems, not just solving pre-defined ones; Clear, structured communication that influences across ML, engineering, and product

Nice to Have

PhD or equivalent depth strongly preferred; GCP Professional Data Engineer; AWS Data Analytics; Databricks Certified; dbt Certified

What You'll Do.

Define and evolve accuracy metrics across the full shopping experience (retrieval, ranking, recommendations, outcomes)

Design and run experiments to measure improvements and regressions

Build and maintain evaluation datasets, benchmarks, and scoring frameworks

Improve the LLM judges that power our evaluation pipeline: prompting, calibration, and fine-tuning where it matters

Translate ambiguous product questions into clear, measurable hypotheses and analysis

Partner with ML Engineers to validate model changes and guide iteration

Identify failure modes and edge cases, and drive improvements through data

Make agent performance visible, trusted, and actionable across product and engineering

Own the evaluation framework: datasets, […] both offline and online

Drive measurable improvements to LLM judge quality (calibration, fine-tuning where appropriate)

Run experiments that influence at least one significant model or product change

Stand up automated evaluation the team trusts before and after every launch

Build dashboards and reporting that make agent performance legible to leadership

Lead applied science work on the next frontier as the agent grows: multi-turn evaluation, conversational understanding

Influence team-level strategy on what we measure

Mentor and help grow the science function as it expands

How You'll Work.

Team & Collaboration

Partner with ML Engineering and AI Engineering; Partner with ML Engineers to validate model changes and guide iteration; Make agent performance visible, trusted, and actionable across product and engineering; Build relationships with ML, AI Engineering, and Product

Communication Scope

Clear, structured communication that influences across ML, engineering, and product

Full Job Description

About Wizard

Wizard is the top-performing AI Shopping Agent, delivering the best products from across the web with unmatched accuracy, quality, and trust.

The Role

We’re looking for an Applied Scientist to own how we measure, understand, and improve the accuracy of our AI agent. This role sits at the intersection of applied ML, evaluation science, and product. You’ll define what “good” looks like for our agent, build the systems to measure it, and lead the science work to improve it, including fine-tuning the LLM judges that power our evaluation pipeline. You’ll partner with ML Engineering and AI Engineering. What you will do is bring scientific rigor to the most important question at Wizard: is our agent getting better, and how do we know?

This is a foundational hire on our science team. Evaluation is the starting point, and the role is scoped to grow into broader applied science work as the surface area of the agent expands (recommendations, personalization, ranking, multimodal, conversational understanding).

What You’ll Do

Define and evolve accuracy metrics across the full shopping experience (retrieval, ranking, recommendations, outcomes)

Design and run experiments to measure improvements and regressions

Build and maintain evaluation datasets, benchmarks, and scoring frameworks

Improve the LLM judges that power our evaluation pipeline: prompting, calibration, and fine-tuning where it matters

Translate ambiguous product questions into clear, measurable hypotheses and analysis

Partner with ML Engineers to validate model changes and guide iteration

Identify failure modes and edge cases, and drive improvements through data

Make agent performance visible, trusted, and actionable across product and engineering

First 3 months

Go deep on the agent, the current eval pipeline, and the metrics we use today

Audit existing accuracy metrics and benchmarks; identify gaps, blind spots, and signals that aren’t trustworthy

Build relationships with ML, AI Engineering, and Product


SKILL SIGNAL 33 detected · ranked by frequency

data analysis ×5

ML ×3

fine-tuning LLM judges ×3

prompting ×3

calibration ×3

experiment design ×3

metric definition ×3

scoring frameworks ×3

automated evaluation ×3

dashboarding ×3

reporting ×3

A/B testing ×3

causal inference ×3

statistical rigor ×3

AI Applied Scientist ×2

LLM evaluation ×2

experimentation ×2

LLM

RAG

RLHF

ranking

recommendations

multimodal

conversational understanding

evaluation science

product strategy

applied science work

AI evaluation strategy

judge models

agent benchmarking

LLM judges

BEHAVIOURAL

communication · collaboration · influence · mentorship

Role Details

Experience 5–10 yrs

Level Senior

Work Mode Remote

Category ai-&-machine-learning

Salary Band 200k+

AI-Extracted Insights

Domain Areas

ai-shopping-agent · shopping-experience · retrieval · ranking · recommendations · outcomes

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.
