Motivation
India's linguistic landscape is deeply code-switched: people routinely mix two or more languages within a single sentence. Rajasthani, spoken by over 80 million people, is almost entirely absent from existing NLP benchmarks. Models trained on Hindi or English degrade sharply on Rajasthani-Hindi mixed text, which leaves NLP tools inaccessible to a large population.
RajNLP-50K is the first step toward fixing that.
Dataset Construction
Collection
Text was collected from:
- Social media posts (Twitter/X, Facebook groups in Rajasthan)
- News portals publishing regional content
- Transcriptions of conversational audio recordings
- Online forums and community discussions
All sources were filtered for authentic code-switching (not pure Hindi, not pure Rajasthani).
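The filtering rule is not specified further here. As a minimal sketch, assuming every sentence already carries token-level language tags, a sentence could be kept only when both languages hold a meaningful share of its tokens. The tag names `hi`, `raj`, and `other` and the `min_share` threshold are hypothetical, chosen for illustration only.

```python
from collections import Counter


def is_code_switched(token_langs, min_share=0.2):
    """Return True when both Hindi and Rajasthani tokens are meaningfully present.

    token_langs: per-token language tags, e.g. ["raj", "hi", "other"].
    min_share: hypothetical threshold; the smaller language must hold at
               least this fraction of the Hindi + Rajasthani tokens.
    """
    counts = Counter(tag for tag in token_langs if tag in ("hi", "raj"))
    total = counts["hi"] + counts["raj"]
    if total == 0:
        return False  # no Hindi or Rajasthani tokens at all
    return min(counts["hi"], counts["raj"]) / total >= min_share


mixed = ["raj", "raj", "hi", "hi", "other"]
pure_hindi = ["hi", "hi", "hi", "other"]
print(is_code_switched(mixed))       # True
print(is_code_switched(pure_hindi))  # False
```

A share-based rule like this also rejects sentences where a single borrowed word appears in otherwise monolingual text, which a simple "both tags present" check would let through.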
Annotation Pipeline
Raw Text → Language ID tagging → Entity labeling → Sentiment scoring → Toxicity flagging
Each sentence was annotated by native Rajasthani speakers with:
- NER tags: Person, Location, Organization, Event
- Sentiment: Positive / Negative / Neutral
- Toxicity: Binary label + severity score
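One way to picture a single annotated sentence is as a small record type. The sketch below is an assumption, not the released file format: the class names, field names, character-offset convention, and the example sentence are hypothetical, and the severity scale is left open because it is not described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NER_LABELS = {"Person", "Location", "Organization", "Event"}
SENTIMENTS = {"Positive", "Negative", "Neutral"}


@dataclass
class Entity:
    start: int  # character offset, inclusive (assumed convention)
    end: int    # character offset, exclusive (assumed convention)
    label: str  # one of NER_LABELS


@dataclass
class AnnotatedSentence:
    text: str
    entities: List[Entity] = field(default_factory=list)
    sentiment: str = "Neutral"
    toxic: bool = False
    severity: Optional[float] = None  # scale is not specified in the source

    def __post_init__(self):
        # Reject labels outside the documented tag sets and spans outside the text.
        for ent in self.entities:
            if ent.label not in NER_LABELS:
                raise ValueError(f"unknown NER label: {ent.label}")
            if not (0 <= ent.start < ent.end <= len(self.text)):
                raise ValueError(f"entity span out of range: {ent}")
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment}")


# Illustrative romanized sentence, not taken from the dataset.
example = AnnotatedSentence(
    text="Jaipur me aaj mela hai",
    entities=[Entity(start=0, end=6, label="Location")],
    sentiment="Positive",
)
print(example.text[example.entities[0].start:example.entities[0].end])  # Jaipur
```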
Inter-annotator agreement was measured using Cohen's Kappa (κ > 0.78 across all tasks).
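For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it in plain Python for one pair of annotators; the label lists are made up for illustration and are not dataset annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "neg", "neu", "pos", "neg", "pos"]
annotator_2 = ["pos", "neg", "neu", "neg", "neg", "pos"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.739
```

In practice, `sklearn.metrics.cohen_kappa_score` computes the same quantity and may be more convenient than a hand-rolled version.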
Scale
| Split | Sentences |
|-------|-----------|
| Train | 40,000 |
| Validation | 5,000 |
| Test | 5,000 |
| Total | 50,000 |
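The counts above correspond to an 80/10/10 split. How the split was actually drawn is not described, so the following is only a sketch of one common approach, a seeded random shuffle. The function name, the seed value, and the absence of any stratification by source or label are all assumptions.

```python
import random


def split_dataset(sentences, train=0.8, val=0.1, seed=13):
    """Shuffle deterministically and cut into train / validation / test."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # everything left over becomes the test set
    )


train_set, val_set, test_set = split_dataset(range(50_000))
print(len(train_set), len(val_set), len(test_set))  # 40000 5000 5000
```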
Models Evaluated
We fine-tuned and evaluated the following models; GPT-4o is the exception and was evaluated zero-shot:
| Model | NER F1 | Sentiment Acc | Toxicity F1 |
|-------|--------|---------------|-------------|
| GPT-4o (zero-shot) | 61.2 | 68.4 | 59.8 |
| mBERT | 72.1 | 74.2 | 68.3 |
| XLM-RoBERTa | 75.8 | 77.1 | 71.4 |
| MuRIL (fine-tuned) | 84.3 | 83.7 | 79.6 |
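Fine-tuned MuRIL (Google's Multilingual Representations for Indian Languages model) outperforms zero-shot GPT-4o on all three tasks: by 23.1 points on NER F1, 15.3 on sentiment accuracy, and 19.8 on toxicity F1. It also leads XLM-RoBERTa, the next-best model, by 6.6 to 8.5 points. This suggests that targeted fine-tuning on regional language data can beat a large general-purpose model used zero-shot.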
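To make the size of the gaps concrete, the short script below recomputes MuRIL's margin over each other model directly from the scores in the results table. The dictionary layout and the `margins` helper are illustrative; only the numbers come from the table.

```python
# Scores copied from the results table: (NER F1, Sentiment Acc, Toxicity F1).
RESULTS = {
    "GPT-4o (zero-shot)": (61.2, 68.4, 59.8),
    "mBERT": (72.1, 74.2, 68.3),
    "XLM-RoBERTa": (75.8, 77.1, 71.4),
    "MuRIL (fine-tuned)": (84.3, 83.7, 79.6),
}
TASKS = ("NER F1", "Sentiment Acc", "Toxicity F1")


def margins(best, baseline):
    """Per-task score difference (best minus baseline), rounded to one decimal."""
    return {
        task: round(b - a, 1)
        for task, b, a in zip(TASKS, RESULTS[best], RESULTS[baseline])
    }


for name in RESULTS:
    if name != "MuRIL (fine-tuned)":
        print(name, margins("MuRIL (fine-tuned)", name))
```

The gap to GPT-4o is the largest on every task, but that comparison mixes a fine-tuned model with a zero-shot one, so the margins over mBERT and XLM-RoBERTa are the closer like-for-like figures.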
Key Findings
- GPT-4o struggles with the Rajasthani lexicon: it often treats Rajasthani words as misspellings of their Hindi equivalents
- Transliteration variance is the biggest challenge: the same word can appear in 4–6 different romanized spellings
- MuRIL's pre-training on Indian language Wikipedia appears to give it a significant head start over multilingual models that are not India-focused
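The transliteration problem can be illustrated with a crude normalization key that groups romanized spellings differing only in case or doubled letters. This is a sketch under stated assumptions, not a step from the RajNLP-50K pipeline: the two rules, the function names, and the example spellings are all hypothetical.

```python
import re
from collections import defaultdict


def spelling_key(word):
    """Map a romanized word to a rough key shared by its spelling variants."""
    key = word.lower()
    # Collapse runs of the same letter: "mhaaro" -> "mharo", "katthe" -> "kathe".
    key = re.sub(r"(.)\1+", r"\1", key)
    return key


def group_variants(words):
    """Group words whose normalization keys coincide, preserving input order."""
    groups = defaultdict(list)
    for word in words:
        groups[spelling_key(word)].append(word)
    return dict(groups)


# Illustrative spellings only; not drawn from the dataset.
print(group_variants(["mharo", "mhaaro", "Mharo", "kathe", "katthe"]))
# {'mharo': ['mharo', 'mhaaro', 'Mharo'], 'kathe': ['kathe', 'katthe']}
```

Collapsing doubled letters discards vowel-length and gemination cues that can distinguish real words, so a key like this is better suited to clustering candidate variants for review than to rewriting text automatically.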
Impact
RajNLP-50K is released openly to enable:
- Hate speech and toxicity detection for Rajasthani social media
- Chatbots and voice assistants for Rajasthani speakers
- Machine translation research for low-resource Indian languages
Tech Stack
- Models: MuRIL, XLM-RoBERTa, mBERT, GPT-4o API
- Framework: PyTorch, HuggingFace Transformers
- Data processing: Python, Pandas, spaCy
- Annotation tool: Label Studio
