PERIODIC LABS

Tech / AI / Software

Research Engineer - Data

$350–400k · Menlo Park, California, United States · FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Research Engineer - Data at PERIODIC LABS. Skills: Data strategy, Data pipelines, Data quality, Distributed data processing, Dataset versioning, Python engineering. Build and drive the data foundation for research efforts. Own data strategy end-to-end”

Industry & Context.

Tech / AI / Software

Problems you'll solve

Translate data needs into pipeline requirements

Eligibility Requirements

Visa sponsorship

What They're Looking For.

Must Have

Experience building large-scale data pipelines for LLM pretraining or midtraining, including web-scale or scientific corpora

Expertise in data quality techniques such as exact and fuzzy deduplication (MinHash, SimHash), perplexity filtering, classifier-based quality scoring, and PII scrubbing

Experience working with diverse scientific data formats (papers, patents, structured databases, simulation outputs, lab instrument exports) and normalizing them for model consumption

Experience with distributed data processing frameworks such as Apache Spark, Ray, or Dask at multi-terabyte to petabyte scale

Familiarity with dataset versioning, lineage tracking, and reproducibility tooling such as DVC, Delta Lake, or custom solutions

Experience sourcing and evaluating third-party datasets, including licensing considerations and quality assessment

Python engineering skills and comfort building production-quality tooling in a research environment

Experience collaborating directly with ML researchers to translate data needs into pipeline requirements and back again

A research-oriented mindset: you run experiments on data, measure outcomes, and iterate with rigor

Nice to Have

Experience curating scientific datasets specifically for domain-adaptive continued pretraining or instruction tuning

Familiarity with synthetic data generation methods, including model-generated data pipelines and quality verification

A background in a physical science or engineering discipline that informs how you think about scientific data quality and structure

Experience with multimodal data: integrating text, structured numerical data, molecular representations, or spectral data into unified training pipelines

What You'll Do.

Build and drive the data foundation for research efforts

Own data strategy end-to-end

Source and procure external datasets

Integrate internally generated experimental data into the training stack

Ensure the team always has the right data — in the right shape — to train and improve frontier models

Collect and organize diverse data sources

Improve data quality through deduplication and preprocessing

Ensure new experimental results are incorporated in a structured, repeatable way that makes them useful for model development

Own data strategy across the training stack: identifying gaps, evaluating new sources, and shaping the overall data roadmap in collaboration with research leads

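Source, evaluate, and procure external datasets across scientific domains including chemistry, physics, materials science, mathematics, and lab instrumentation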

Build and maintain robust pipelines for ingesting, processing, and versioning large-scale datasets from heterogeneous sources

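Design and implement data quality systems including deduplication, domain classification, quality filtering, and format normalization at scale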

Integrate internally generated experimental data, from lab instrumentation and model outputs, into the training stack

Build tooling that makes it easy for researchers to inspect and understand the data

Instrument data pipelines with metadata

Collaborate with pretraining and midtraining engineers on token budget management, data mixing ratios, and curriculum design

Stay current with research on data-efficient training, synthetic data generation, and data selection methods, and bring relevant ideas into production

How You'll Work.

Team & Collaboration

Collaborate with pretraining and midtraining engineers on token budget management, data mixing ratios, and curriculum design

Collaborate directly with ML researchers to translate data needs into pipeline requirements and back again

Full Job Description

ABOUT PERIODIC LABS

The most important scientific discoveries of our time won’t happen in a traditional lab. We’re an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what’s scientifically possible.

ABOUT THE ROLE

You will build and drive the data foundation for our research efforts. This means owning data strategy end-to-end: sourcing and procuring external datasets, integrating internally generated experimental data into the training stack, and ensuring the team always has the right data, in the right shape, to train and improve frontier models.

This role sits at the intersection of data engineering, research infrastructure, and strategy. You will work closely with pretraining, midtraining, and RL researchers to understand what data the models need, then build the pipelines and systems to get it there. The work spans collecting and organizing diverse data sources, improving data quality through deduplication and preprocessing, and ensuring that new experimental results are incorporated in a structured, repeatable way that makes them useful for model development.

WHAT YOU’LL DO

- Own data strategy across the training stack: identifying gaps, evaluating new sources, and shaping the overall data roadmap in collaboration with research leads
- Source, evaluate, and procure external datasets across scientific domains including chemistry, physics, materials science, mathematics, and lab instrumentation
- Build and maintain robust pipelines for ingesting, processing, and versioning large-scale datasets from heterogeneous sources
- Design and implement data quality systems including deduplication, domain classification, quality filtering, and format normalization at scale
- Integrate internal


SKILL SIGNAL 60 detected · ranked by frequency

Data strategy ×6

Lineage tracking ×6

Data quality ×5

Distributed data processing ×5

Dataset versioning ×5

Python engineering ×5

Data engineering ×3

Research infrastructure ×3

Data sourcing ×3

Data procurement ×3

Data integration ×3

Data processing ×3

Data versioning ×3

Deduplication ×3

Preprocessing ×3

Format normalization ×3

Metadata tracking ×3

Reproducibility ×3

Auditable data decisions ×3

Token budget management ×3

Data mixing ratios ×3

Curriculum design ×3

Data-efficient training ×3

Synthetic data generation ×3

Data selection methods ×3

LLM pretraining ×3

LLM midtraining ×3

Web-scale corpora ×3

Scientific corpora ×3

Exact deduplication ×3

Fuzzy deduplication ×3

MinHash ×3

BEHAVIOURAL

Ownership, Insatiable drive, Collaboration, Research-oriented mindset

Role Details

Type FULL TIME

Education Bachelor’s degree or an equivalent combination of education

Category research, llms, machine-learning, infra

Salary Band 200k+

AI-Extracted Insights

Domain Areas

chemistry, physics, materials-science, mathematics, lab-instrumentation, scientific-data-formats, physical-science, engineering-discipline

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
