Omilia
Tech / AI / Software
Senior Data Architect
“Senior Data Architect at Omilia. Skills: data architecture, data engineering, LLM/ML data infrastructure, Snowflake, AWS S3, Airflow, dbt, AWS SageMaker, SQL, Python, data modeling, schema design, data pipeline architecture, annotation requirements, data cataloging, data quality frameworks. Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines. Define and govern data selection and sampling strategy”
Industry & Context.
Analytical mindset with the ability to make informed trade-off decisions on data quality, diversity, and scale
What They're Looking For.
Must Have
5+ years in data architecture, data engineering, or LLM/ML data infrastructure
Demonstrated ownership of production data systems serving ML/AI model development
Understanding of ML training data requirements
Deep experience with data modeling, schema design, and data pipeline architecture
Proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar)
Experience defining annotation requirements and managing data annotation workflows
Experience with data cataloging, metadata management, and dataset discovery at scale
SQL and Python skills for data pipeline development and data quality analysis
Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization
Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field
Nice to Have
Hands-on experience with LLM training data preparation: instruction tuning datasets, preference data, RLHF/DPO annotation, synthetic data generation
Experience with data anonymization and PII/PCI redaction as part of ML data pipelines
Familiarity with AWS SageMaker ML pipeline integration and active learning/data selection strategies
Knowledge of voice/audio data handling, storage, and processing at scale
Experience with conversational AI data (dialog transcripts, ASR outputs, NLU annotations)
Experience with data governance for regulated industries (financial services, healthcare)
Familiarity with NER/NLU-based data processing approaches (spaCy, HuggingFace, custom entity recognition)
What You'll Do.
Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines
Define and govern data selection and sampling strategy
Build and maintain the data catalog and dataset discovery infrastructure
Define annotation pipeline architecture
Architect the data flywheel
Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker
Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements
Define and maintain the data architecture for Omilia's Training Environment
Design data quality frameworks that directly improve model outcomes
Define annotation requirements for ML model development
Build and maintain the data catalog that enables cross-team dataset discovery
Architect the closed-loop data flywheel
Identify gaps in production training data and define requirements for external data acquisition
Work closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to align data architecture with model training
Maintain comprehensive documentation of data architecture, dataset specifications, and pipeline configurations; produce data architecture RFCs for significant changes and share best practices with ML teams
How You'll Work.
Team & Collaboration
Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management stakeholders; track record of working effectively with ML engineers, platform teams, and product stakeholders
Communication Scope
Ability to translate ML team data needs into concrete pipeline specifications; explain data architecture decisions to both technical and compliance audiences
Full Job Description
### Accountabilities

* Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
* Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies.
* Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
* Define annotation pipeline architecture: establish requirements for data labeling (intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation) across internal annotators and external vendors.
* Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.
* Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker.

**Key Responsibilities**

* Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements (what conversational patterns improve zero-shot routing accuracy, what dialog structures train better task planners, what edge cases stress-test agentic reasoning) and translate these into concrete dataset specifications and pipeline configurations.
* Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production (OCP) to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipe