Karya
Data Curation Intern
Neural analysis suggests this role is optimal for entry-level candidates.
“Data Curation Intern at Karya. Skills: Data Curation, AI/ML model training, Indian language data, multilingual data, text data pipelines, speech and voice model training. Audit and profile open-source datasets. Design and implement data cleaning pipelines”
What You'll Achieve.
Build high-quality datasets for training AI/ML models; deliver high-quality, timely, and price-competitive data to clients; rapidly scale Karya's impact by bringing economic opportunities to millions of underserved users in India
What They're Looking For.
Must Have
Attention to detail
Comfort with Python for data processing (pandas, regex, basic NLP libraries like spaCy and NLTK)
Familiarity with text data formats (CSV, JSONL, Parquet, plain text corpora)
Curiosity about AI/ML, language technology, or computational linguistics
Ability to work independently, document work clearly, and communicate blockers early
Nice to Have
Prior exposure to NLP datasets or open-source language resources (IndicNLP, AI4Bharat, Hugging Face datasets)
Knowledge of one or more Indian languages beyond English
Experience with data versioning tools (DVC, Git-LFS) or dataset platforms (Hugging Face Hub)
Basic understanding of how language models or speech models are trained
What You'll Do.
Audit and profile open-source datasets
Design and implement data cleaning pipelines
Create and apply metadata tagging schemas
Build validation checklists and quality scorecards
Document data provenance
phonetically diverse text passages
Ensure text selection covers domain
Assist in defining metadata standards for audio datasets
Support the pipeline transition from text corpus to aligned speech dataset
How You'll Work.
Communication Scope
communicate blockers early; document work clearly
Full Job Description
About Karya
Why was Karya on the cover of Time magazine, highlighted by Satya Nadella, and invited to present its work to Sundar Pichai one on one? In part, because Karya is on a mission to provide AI-enabled earning and learning opportunities to communities with high talent but low access to opportunities. Karya achieves this while also delivering high-quality, timely, and price-competitive data to its clients. Karya builds high-quality datasets for large companies like Google and Microsoft, while providing ethical work opportunities and fair wages to its workforce. Karya’s workers make nearly 20 times the Indian minimum wage, and through our one-of-a-kind digital work platform we have delivered over 40 million digital tasks and positively impacted over 100 thousand workers.
In the coming years, our goal is to rapidly scale our impact by bringing economic opportunities to millions of underserved users in India. With a rapidly growing global presence, we are also looking to expand our client base in the Indian market by partnering with leading Indian enterprises.
About the Role
We are looking for a detail-oriented and curious Data Curation Intern to help build high-quality datasets for training AI/ML models, with a specific focus on Indian language and multilingual data. You will work with large open-source datasets (e.g., Sangraha by AI4Bharat) that require significant cleaning, structuring, and enrichment before they can be used effectively in model training pipelines.
This is a hands-on, high-impact role at the intersection of data engineering, linguistics, and AI. You will start with text data pipelines and progressively move toward preparing data for read-speech and voice model training.
What You'll Do
Phase 1: Text Data Curation
Audit and profile open-source datasets (Sangraha, Common Crawl, IndicCorp, etc.) to assess quality, coverage, and noise levels
Design and implement data cleaning pipelines: deduplication, script normalisation, encoding fixe
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.