RajNLP-50K: India's First Open Rajasthani-Hindi Code-Switched Corpus


Motivation

India's linguistic landscape is deeply code-switched: people routinely mix two or more languages within a single sentence. Rajasthani, spoken by over 80 million people, is almost entirely absent from existing NLP benchmarks. Models trained on Hindi or English perform poorly on Rajasthani-Hindi mixed text, which puts NLP tools out of reach for a large population.

RajNLP-50K is the first step toward fixing that.

Dataset Construction

Collection

Text was collected from:

Social media posts (Twitter/X, Facebook groups in Rajasthan)
News portals publishing regional content
Transcriptions of conversational audio recordings
Online forums and community discussions

All sources were filtered for authentic code-switching (not pure Hindi, not pure Rajasthani).
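
The filtering rule above can be sketched as follows. This is a minimal illustration rather than the project's actual filter: it assumes every token already carries a language tag from an upstream language-ID step, and the tag names (`"hi"`, `"raj"`, `"other"`), the function name `is_code_switched`, and the `min_share` threshold are all hypothetical.

```python
from collections import Counter

def is_code_switched(token_langs, min_share=0.2):
    """Return True if both Hindi and Rajasthani tokens each make up at
    least `min_share` of the language-tagged tokens in a sentence.

    `token_langs` is a list of per-token tags: "hi", "raj", or "other".
    The tag names and the threshold are assumptions for illustration.
    """
    counts = Counter(tag for tag in token_langs if tag in ("hi", "raj"))
    total = counts["hi"] + counts["raj"]
    if total == 0:
        return False  # no Hindi or Rajasthani tokens at all
    # The minority language must reach the minimum share.
    return min(counts["hi"], counts["raj"]) / total >= min_share

mixed = ["raj", "raj", "hi", "hi", "other", "hi"]
pure_hindi = ["hi", "hi", "hi", "other"]
print(is_code_switched(mixed))       # True
print(is_code_switched(pure_hindi))  # False
```

A share-based threshold, rather than a simple "both languages present" check, keeps out sentences that are essentially monolingual apart from a single borrowed word.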

Annotation Pipeline

Raw Text → Language ID tagging → Entity labeling → Sentiment scoring → Toxicity flagging

Each sentence was annotated by native Rajasthani speakers with:

NER tags: Person, Location, Organization, Event
Sentiment: Positive / Negative / Neutral
Toxicity: Binary label + severity score
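
One way to picture a single annotated record is the sketch below. The class and field names are hypothetical, and the 0.0 to 1.0 severity range is an assumption, since the severity scale is not specified here; only the label sets come from the scheme above.

```python
from dataclasses import dataclass, field

NER_TYPES = {"Person", "Location", "Organization", "Event"}
SENTIMENTS = {"Positive", "Negative", "Neutral"}

@dataclass
class Entity:
    text: str   # surface string of the mention
    label: str  # one of NER_TYPES

@dataclass
class AnnotatedSentence:
    text: str
    entities: list = field(default_factory=list)
    sentiment: str = "Neutral"
    toxic: bool = False
    severity: float = 0.0  # assumed 0.0-1.0 scale; the real range is not given

    def validate(self):
        """Raise ValueError if any label falls outside the annotation scheme."""
        for ent in self.entities:
            if ent.label not in NER_TYPES:
                raise ValueError(f"unknown NER tag: {ent.label}")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment}")
        if not 0.0 <= self.severity <= 1.0:
            raise ValueError("severity must be between 0.0 and 1.0")
        if not self.toxic and self.severity != 0.0:
            raise ValueError("non-toxic sentences must have severity 0.0")

record = AnnotatedSentence(
    text="<sentence mentioning Jaipur>",  # placeholder, not real corpus text
    entities=[Entity(text="Jaipur", label="Location")],
    sentiment="Positive",
)
record.validate()  # passes silently for a well-formed record
```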

Inter-annotator agreement was measured using Cohen's Kappa (κ > 0.78 across all tasks).
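
For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below is a plain-Python illustration with toy labels, not the project's evaluation script; with more than two annotators it would have to be applied pairwise, and how that was done here is not stated.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Toy example: 1 = toxic, 0 = not toxic.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```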

Scale

| Split | Sentences |
|-------|-----------|
| Train | 40,000 |
| Validation | 5,000 |
| Test | 5,000 |
| Total | 50,000 |
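
A split with these sizes could be produced as in the sketch below. It assumes a simple seeded random shuffle; whether the actual split was random, stratified by source, or built some other way is not stated, and the function name `split_corpus` and the seed are hypothetical.

```python
import random

def split_corpus(sentences, n_val=5000, n_test=5000, seed=13):
    """Shuffle with a fixed seed, then carve out test, validation, and train."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in for the 50,000 sentences: integer ids.
train, val, test = split_corpus(range(50_000))
print(len(train), len(val), len(test))  # 40000 5000 5000
```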

Models Evaluated

We evaluated the following models. GPT-4o was run zero-shot rather than fine-tuned:

| Model | NER F1 | Sentiment Acc | Toxicity F1 |
|-------|--------|---------------|-------------|
| GPT-4o (zero-shot) | 61.2 | 68.4 | 59.8 |
| mBERT | 72.1 | 74.2 | 68.3 |
| XLM-RoBERTa | 75.8 | 77.1 | 71.4 |
| MuRIL (fine-tuned) | 84.3 | 83.7 | 79.6 |

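Fine-tuned MuRIL (Google's Multilingual Representations for Indian Languages model) outperforms zero-shot GPT-4o on all three tasks: by 23.1 points of NER F1, 15.3 points of sentiment accuracy, and 19.8 points of toxicity F1. It also leads mBERT and XLM-RoBERTa on every task. This suggests that targeted fine-tuning on regional language data can beat a large general-purpose model used zero-shot.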
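
The size of MuRIL's lead over each baseline follows directly from the results table. The snippet below only does that subtraction; the scores are copied from the table and the variable names are illustrative.

```python
# Scores copied from the results table: (NER F1, Sentiment Acc, Toxicity F1).
baselines = {
    "GPT-4o (zero-shot)": (61.2, 68.4, 59.8),
    "mBERT": (72.1, 74.2, 68.3),
    "XLM-RoBERTa": (75.8, 77.1, 71.4),
}
muril = (84.3, 83.7, 79.6)
metrics = ("NER F1", "Sentiment Acc", "Toxicity F1")

# MuRIL's margin over each baseline, in points, rounded to one decimal.
margins = {
    model: tuple(round(m - s, 1) for m, s in zip(muril, scores))
    for model, scores in baselines.items()
}

for model, diffs in margins.items():
    print(model, dict(zip(metrics, diffs)))
```

The gap is widest against zero-shot GPT-4o and narrowest against XLM-RoBERTa, the strongest of the other multilingual encoders.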

Key Findings

GPT-4o struggles with Rajasthani lexicon — it often treats Rajasthani words as misspellings of Hindi equivalents
Transliteration variance is the biggest challenge — the same word can appear in 4–6 different romanized spellings
MuRIL's pre-training on Indian language Wikipedia gives it a significant head start over non-India-focused multilingual models
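
The transliteration variance noted above can be partly tamed by mapping romanized words to a rough spelling key before counting or matching them. The sketch below is illustrative only and is not described as part of the project's pipeline: the three rewrite rules, the function names, and the example spellings are assumptions, and a real normalizer would need rules validated by native speakers.

```python
import re
from collections import defaultdict

def romanization_key(word):
    """Map a romanized word to a rough key so spelling variants collide."""
    key = word.lower()
    # Unify common long-vowel digraphs and the w/v alternation.
    key = key.replace("ee", "i").replace("oo", "u").replace("w", "v")
    # Collapse any run of the same letter: "ghanni" -> "ghani".
    key = re.sub(r"(.)\1+", r"\1", key)
    return key

def group_variants(words):
    """Group words that share the same romanization key."""
    groups = defaultdict(set)
    for w in words:
        groups[romanization_key(w)].add(w)
    return dict(groups)

# Illustrative spellings of the greeting "khamma ghani".
variants = ["ghani", "ghanee", "ghanni", "Ghanii", "khamma", "khama", "khammaa"]
for key, forms in group_variants(variants).items():
    print(key, sorted(forms))
```

A key like this trades precision for recall: it merges true variants but can also merge distinct words, so it is better suited to candidate grouping than to rewriting the text itself.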

Impact

RajNLP-50K is released openly to enable:

Hate speech and toxicity detection for Rajasthani social media
Chatbots and voice assistants for Rajasthani speakers
Machine translation research for low-resource Indian languages

Tech Stack

Models: MuRIL, XLM-RoBERTa, mBERT, GPT-4o API
Framework: PyTorch, HuggingFace Transformers
Data processing: Python, Pandas, spaCy
Annotation tool: Label Studio