Omilia

Tech / AI / Software

Senior Data Architect

Remote · Full Time · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Data Architect at Omilia. Skills: data architecture, data engineering, LLM/ML data infrastructure, Snowflake, AWS S3, Airflow, dbt, AWS SageMaker, SQL, Python, data modeling, schema design, data pipeline architecture, annotation requirements, data cataloging, data quality frameworks. Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines. Define and govern data selection and sampling strategy”

Industry & Context.

Tech / AI / Software
Problems you'll solve

Analytical mindset with the ability to make informed trade-off decisions on data quality, diversity, and scale

What They're Looking For.

Must Have

5+ years in data architecture, data engineering, or LLM/ML data infrastructure

Demonstrated ownership of production data systems serving ML/AI model development

Understanding of ML training data requirements

Deep experience with data modeling, schema design, and data pipeline architecture

Proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar)

Experience defining annotation requirements and managing data annotation workflows

Experience with data cataloging, metadata management, and dataset discovery at scale

SQL and Python skills for data pipeline development and data quality analysis

Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization

Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field

Nice to Have

Hands-on experience with LLM training data preparation: instruction tuning datasets, preference data, RLHF/DPO annotation, synthetic data generation

Experience with data anonymization and PII/PCI redaction as part of ML data pipelines

Familiarity with AWS SageMaker ML pipeline integration and active learning/data selection strategies

Knowledge of voice/audio data handling, storage, and processing at scale

Experience with conversational AI data (dialog transcripts, ASR outputs, NLU annotations)

Experience with data governance for regulated industries (financial services, healthcare)

Familiarity with NER/NLU-based data processing approaches (spaCy, HuggingFace, custom entity recognition)

What You'll Do.

Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines

Define and govern data selection and sampling strategy

Build and maintain the data catalog and dataset discovery infrastructure

Define annotation pipeline architecture

Architect the data flywheel

Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker

Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements

Define and maintain the data architecture for Omilia's Training Environment

Design data quality frameworks that directly improve model outcomes

Define annotation requirements for ML model development

Build and maintain the data catalog that enables cross-team dataset discovery

Architect the closed-loop data flywheel

Identify gaps in production training data and define requirements for external data acquisition

Work closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to align data architecture with model training

Maintain comprehensive documentation of data architecture, dataset specifications, and pipeline configurations; produce data architecture RFCs for significant changes and share best practices with ML teams

How You'll Work.

Team & Collaboration

Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management stakeholders; track record of working effectively with ML engineers, platform teams, and product stakeholders

Communication Scope

Ability to translate ML team data needs into concrete pipeline specifications and to explain data architecture decisions to both technical and compliance audiences

Full Job Description

### Accountabilities

* Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
* Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies.
* Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
* Define annotation pipeline architecture: establish requirements for data labeling (intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation) across internal annotators and external vendors.
* Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.
* Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker.

**Key Responsibilities**

* Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements (what conversational patterns improve zero-shot routing accuracy, what dialog structures train better task planners, what edge cases stress-test agentic reasoning) and translate these into concrete dataset specifications and pipeline configurations.
* Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production (OCP) to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipe
