---
language: te
tags:
  - transliteration
  - telugu
  - tenglish
  - byt5
  - romanized-telugu
  - indicnlp
  - telugu-unicode
  - low-resource
  - indic-languages
  - seq2seq
license: apache-2.0
datasets:
  - harinpurumandla/telugu-transliterator-dataset
metrics:
  - cer
---

# Telugu Transliterator v1.1 (Tenglish → Telugu)

Converts colloquial Romanized Telugu (Tenglish) to Telugu Unicode script.

Fine-tuned from [google/byt5-small](https://huggingface.co/google/byt5-small) on 5.6M transliteration pairs including 6,364 targeted synthetic pairs generated to fix colloquial error patterns identified in v1.0.

---

## Example Outputs

| Input (Tenglish) | Output (Telugu) | Notes |
|---|---|---|
| nenu telugu matladatanu | నేను తెలుగు మాట్లాడతాను | standard sentence |
| ela unnav bro | ఎలా ఉన్నావ్ బ్రో | casual greeting with English code-mix |
| chala bagundi aa movie | చాలా బాగుంది ఆ మూవీ | informal review |
| repu veltunnanu | రేపు వెళ్తున్నాను | near-future plan |
| antey em cheppali | అంటే ఏం చెప్పాలి | disambiguation: antey = "means" |
| ante idi chalu | అంతే ఇది చాలు | disambiguation: ante = "that's all" |
| naaku teliyadu | నాకు తెలియదు | dative suffix: naaku |
| pakka correct ga cheppadu | పక్కా కరెక్ట్ గా చెప్పాడు | Urdu loanword: pakka |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("harinpurumandla/tenglish-to-telugu")
model = AutoModelForSeq2SeqLM.from_pretrained("harinpurumandla/tenglish-to-telugu")

def to_telugu(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(to_telugu("nenu telugu matladatanu"))
# నేను తెలుగు మాట్లాడతాను

print(to_telugu("ela unnav bro"))
# ఎలా ఉన్నావ్ బ్రో
```

---

## Model Details

| Property | Value |
|---|---|
| Base model | google/byt5-small |
| Architecture | Byte-level T5 (seq2seq) |
| Parameters | ~300M |
| Direction | Tenglish (Roman) → Telugu script |
| Training data | 5,586,373 pairs (5 corpus sources + 6,364 synthetic) |
| Training | 1 epoch continued fine-tune, LR=1e-5, bf16, dual RTX 3090 |
| Eval CER | 1.89% |
| Decoding | Beam search (num_beams=4) |
| Max input length | 128 tokens (bytes) |
| License | Apache-2.0 |

---

## Evaluation

Evaluated on a held-out val set (5,000 sampled pairs), beam search decoding (num_beams=4).

### Version History

| Version | CER | Notes |
|---|---|---|
| v1.0 | 7.08% | Baseline — trained from scratch on 5.5M pairs |
| **v1.1** | **1.89%** | Continued fine-tune with targeted synthetic colloquial data |

### Comparison with Gemma4 31B

Evaluated on 50 clean Tenglish inputs (8–60 chars, alphabetic):

| Model | CER | Parameters |
|---|---|---|
| **ByT5-small v1.1 (this model)** | **2.70%** | 300M |
| Gemma4 31B (zero-shot, Ollama) | 8.48% | 31B |

ByT5-small achieves **3× lower error rate at 1% of the parameter count**. The key difference: Gemma4 normalizes colloquial spellings toward standard Telugu (e.g., `chupiyaru` → చూపించారు), while this model faithfully preserves the informal romanization convention (→ చుపియారు). For Tenglish transliteration, faithfulness to the input spelling is the correct behavior.

### Per-Slice (v1.0 baseline for reference)

| Slice | v1.0 CER | Coverage |
|---|---|---|
| Overall | 7.08% | 10,000 pairs |
| Formal | 3.25% | 71 pairs |
| Colloquial | 3.09% | 3,656 pairs |
| Long sentence | 16.21% | 206 pairs |

---

## Synthetic Data

v1.1 added 6,364 targeted synthetic pairs generated with local Ollama gemma4:31b covering 9 confusion categories identified from v1.0 error analysis:

| Category | Pairs | Focus |
|---|---|---|
| antey_disambiguation | 704 | అంటే (means) vs అంతే (that's all) in context |
| dative_suffix_context | 701 | naaku / naku / ki suffix forms |
| colloquial_chat | 705 | WhatsApp-style greetings, questions, reactions |
| urdu_loanwords | 1,000 | pakka, bilkul, mast, zabardast in Telugu sentences |
| code_mix | 1,000 | English words with Telugu morphological markers |
| long_sentences | 800 | Full sentences >15 words |
| reactions_expressions | 701 | Emotional interjections and expressive phrases |
| continuous_negation | 403 | Negation + continuous tense combinations |
| long_vowel_disambiguation | 350 | Phonemically distinct long vs short vowels |

All synthetic pairs were validated for Telugu script density (≥10%), length ratio, and ASCII content before inclusion.

---

## Known Limitations

- **Long sentences (>40 words):** ByT5 1024-byte window truncates inputs. Split long text before transliterating.
- **Rare proper nouns:** Unusual place names and personal names may be inconsistently transliterated.
- **Spelling convention:** When multiple valid Tenglish spellings exist for a word, the model outputs the most common convention learned from training data.
- **Pure English words:** Code-mixed English words pass through but capitalization may not be preserved.

---

## Companion Model

For the reverse direction (Telugu script → Tenglish), see [harinpurumandla/telugu-to-tenglish](https://huggingface.co/harinpurumandla/telugu-to-tenglish).

---

## Citation

```bibtex
@misc{telugu-transliterator-2026,
  title={Telugu Transliterator: Tenglish to Telugu Unicode Transliteration},
  author={Purumandla, Sai Harin Kumar Reddy},
  year={2026},
  url={https://huggingface.co/harinpurumandla/tenglish-to-telugu}
}
```