Karya

Data Curation Intern

Bengaluru, Karnataka, India

The Brief

“Data Curation Intern at Karya. Skills: Data Curation, AI/ML model training, Indian language data, multilingual data, text data pipelines, speech and voice model training. Audit and profile open-source datasets. Design and implement data cleaning pipelines”

What You'll Achieve.

Build high-quality datasets for training AI/ML models; deliver high-quality, timely, and price-competitive data to Karya's clients; rapidly scale Karya's impact by bringing economic opportunities to millions of underserved users in India.

What They're Looking For.

Must Have

Attention to detail

Comfort with Python for data processing (pandas, regex, and basic NLP libraries such as spaCy and NLTK)

Familiarity with text data formats (CSV, JSONL, Parquet, plain text corpora)

Curiosity about AI/ML, language technology, or computational linguistics

Ability to work independently, document work clearly, and communicate blockers early

Nice to Have

Prior exposure to NLP datasets or open-source language resources (IndicNLP, AI4Bharat, Hugging Face datasets)

Knowledge of one or more Indian languages beyond English

Experience with data versioning tools (DVC, Git-LFS) or dataset platforms (Hugging Face Hub)

Basic understanding of how language models or speech models are trained

What You'll Do.

Audit and profile open-source datasets

Design and implement data cleaning pipelines

Create and apply metadata tagging schemas

Build validation checklists and quality scorecards

Document data provenance

Curate phonetically diverse text passages

Ensure text selection covers domain

Assist in defining metadata standards for audio datasets

Support the pipeline transition from text corpus to aligned speech dataset
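
For candidates wondering what a "data cleaning pipeline" might look like in practice, here is a minimal Python sketch. It is only an illustration under stated assumptions, not Karya's actual pipeline: it assumes the input is an iterable of plain text lines, uses Unicode NFC as a stand-in for script normalisation, and removes only exact duplicates. The function names `normalise_text` and `deduplicate` are hypothetical.

```python
import hashlib
import unicodedata


def normalise_text(text: str) -> str:
    """Apply Unicode NFC normalisation and collapse runs of whitespace."""
    # Assumption: NFC is an adequate first-pass normalisation for the corpus.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def deduplicate(lines):
    """Yield each normalised, non-empty line once, keeping first-seen order."""
    seen = set()
    for line in lines:
        cleaned = normalise_text(line)
        if not cleaned:
            continue  # drop lines that are empty after cleaning
        # Hash the cleaned text so the seen-set stays small on large corpora.
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield cleaned


if __name__ == "__main__":
    sample = ["नमस्ते  दुनिया", "नमस्ते दुनिया ", "", "hello"]
    for line in deduplicate(sample):
        print(line)
```

A real pipeline would likely add near-duplicate detection and language-specific normalisation on top of this; the sketch only shows the overall shape of such a step.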

How You'll Work.

Communication Scope

Communicate blockers early; document work clearly.

Full Job Description

About Karya

Why was Karya on the cover of TIME magazine, highlighted by Satya Nadella, and invited to present its work to Sundar Pichai one on one? In part, because Karya is on a mission to provide AI-enabled earning and learning opportunities to communities with high talent but low access to opportunities. Karya achieves this while also delivering high-quality, timely, and price-competitive data to its clients.

Karya builds high-quality datasets for large companies like Google and Microsoft, while providing ethical work opportunities and fair wages to its workforce. Karya's workers make nearly 20 times the Indian minimum wage, and through our one-of-a-kind digital work platform we have delivered over 40 million digital tasks and have positively impacted over 100,000 workers.

In the coming years, our goal is to rapidly scale our impact by bringing economic opportunities to millions of underserved users in India. With a rapidly growing global presence, we are also looking to expand our client base in the Indian market by partnering with leading Indian enterprises.

About the Role

We are looking for a detail-oriented and curious Data Curation Intern to help build high-quality datasets for training AI/ML models, with a specific focus on Indian language and multilingual data. You will work with large open-source datasets (e.g., Sangraha by AI4Bharat) that require significant cleaning, structuring, and enrichment before they can be used effectively in model training pipelines.

This is a hands-on, high-impact role at the intersection of data engineering, linguistics, and AI. You will start with text data pipelines and progressively move toward preparing data for read-speech and voice model training.

What You'll Do

Phase 1: Text Data Curation

Audit and profile open-source datasets (Sangraha, Common Crawl, IndicCorp, etc.) to assess quality, coverage, and noise levels

Design and implement data cleaning pipelines: deduplication, script normalisation, encoding fixe

SKILL SIGNAL 34 detected · ranked by frequency

data processing ×3

text data formats ×3

data cleaning pipelines ×3

metadata tagging ×3

validation checklists ×3

quality scorecards ×3

data provenance ×3

licensing ×3

processing steps ×3

text passage curation ×3

speech data preparation ×3

audio datasets metadata standards ×3

transcription format ×3

Data Curation ×2

AI/ML model training ×2

Indian language data ×2

multilingual data ×2

text data pipelines ×2

speech and voice model training ×2

pandas ×2

spaCy ×2

NLTK ×2

DVC ×2

Git-LFS ×2

Hugging Face Hub ×2

Python

regex

CSV

JSONL

Parquet

Hugging Face datasets

data engineering

BEHAVIOURAL

attention to detail, Curiosity, Ability to work independently, communicate blockers early

Role Details

Experience 0–2 yrs

Level Entry

Category technology

AI-Extracted Insights

Domain Areas

indian-language-data, multilingual-data, language-technology, computational-linguistics, ai-ml, speech-models, language-models

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.
