Telugu Transliterator v1.1 (Tenglish → Telugu)

Converts colloquial Romanized Telugu (Tenglish) to Telugu Unicode script.

Fine-tuned from google/byt5-small on 5.6M transliteration pairs, including 6,364 targeted synthetic pairs generated to fix the colloquial error patterns identified in v1.0.


Example Outputs

| Input (Tenglish) | Output (Telugu) | Notes |
|---|---|---|
| nenu telugu matladatanu | నేను తెలుగు మాట్లాడతాను | standard sentence |
| ela unnav bro | ఎలా ఉన్నావ్ బ్రో | casual greeting with English code-mix |
| chala bagundi aa movie | చాలా బాగుంది ఆ మూవీ | informal review |
| repu veltunnanu | రేపు వెళ్తున్నాను | near-future plan |
| antey em cheppali | అంటే ఏం చెప్పాలి | disambiguation: antey = "means" |
| ante idi chalu | అంతే ఇది చాలు | disambiguation: ante = "that's all" |
| naaku teliyadu | నాకు తెలియదు | dative suffix: naaku |
| pakka correct ga cheppadu | పక్కా కరెక్ట్ గా చెప్పాడు | Urdu loanword: pakka |

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("harinpurumandla/tenglish-to-telugu")
model = AutoModelForSeq2SeqLM.from_pretrained("harinpurumandla/tenglish-to-telugu")

def to_telugu(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(to_telugu("nenu telugu matladatanu"))
# నేను తెలుగు మాట్లాడతాను

print(to_telugu("ela unnav bro"))
# ఎలా ఉన్నావ్ బ్రో

Model Details

| Property | Value |
|---|---|
| Base model | google/byt5-small |
| Architecture | Byte-level T5 (seq2seq) |
| Parameters | ~300M |
| Direction | Tenglish (Roman) → Telugu script |
| Training data | 5,586,373 pairs (5 corpus sources + 6,364 synthetic) |
| Training | 1 epoch continued fine-tune, LR=1e-5, bf16, dual RTX 3090 |
| Eval CER | 1.89% |
| Decoding | Beam search (num_beams=4) |
| Max input length | 128 tokens (bytes) |
| License | Apache-2.0 |
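
Because the model is byte-level, the 128-token limit counts UTF-8 bytes rather than characters or words. The sketch below imitates ByT5-style tokenization without loading the transformers library. It assumes the standard ByT5 layout (IDs 0 to 2 reserved for pad, EOS, and unk, so each byte maps to its value plus 3); the helper name byt5_style_ids is hypothetical and not part of this repository.

```python
# Sketch of ByT5-style byte tokenization in plain Python.
# Assumption: IDs 0-2 are reserved (pad, EOS, unk), so a UTF-8 byte b
# becomes token ID b + 3, and an EOS token (ID 1) is appended.
BYTE_OFFSET = 3
EOS_ID = 1


def byt5_style_ids(text: str) -> list[int]:
    """Return the token IDs a byte-level tokenizer would assign to text."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


roman = "nenu"
telugu = "నేను"

# ASCII letters cost one byte each; Telugu code points cost three bytes each.
print(len(roman), "chars ->", len(byt5_style_ids(roman)), "tokens")
print(len(telugu), "code points ->", len(byt5_style_ids(telugu)), "tokens")
```

A practical consequence is that Telugu output consumes roughly three tokens per code point, so a token budget of 128 is reached much sooner on the output side than on the Romanized input side.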

Evaluation

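Evaluated on a held-out validation set (5,000 sampled pairs) with beam search decoding (num_beams=4). CER is the character error rate: the character-level edit distance between the model output and the reference, divided by the reference length.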
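
For readers who want to compute a comparable metric on their own pairs, here is a minimal character error rate sketch. The exact evaluation script is not part of this card, so two choices below are assumptions: edits are counted over Unicode code points (not grapheme clusters), and the rate is pooled over the whole corpus rather than averaged per sentence.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(predictions: list[str], references: list[str]) -> float:
    """Total edits divided by total reference characters (assumed pooling)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references are empty")
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return edits / total_chars


# A normalized spelling is scored against the faithful reference spelling.
print(f"{corpus_cer(['చూపించారు'], ['చుపియారు']):.2%}")
```

Pooling weights long sentences more heavily than per-sentence averaging does, so the two conventions can give noticeably different numbers on mixed-length data.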

Version History

| Version | CER | Notes |
|---|---|---|
| v1.0 | 7.08% | Baseline: trained from scratch on 5.5M pairs |
| v1.1 | 1.89% | Continued fine-tune with targeted synthetic colloquial data |

Comparison with Gemma4 31B

Evaluated on 50 clean Tenglish inputs (8–60 chars, alphabetic):

| Model | CER | Parameters |
|---|---|---|
| ByT5-small v1.1 (this model) | 2.70% | 300M |
| Gemma4 31B (zero-shot, Ollama) | 8.48% | 31B |

On this 50-input set, ByT5-small v1.1 reaches roughly 3× lower CER than Gemma4 31B (2.70% vs 8.48%) at about 1% of the parameter count (300M vs 31B). The key difference: Gemma4 normalizes colloquial spellings toward standard Telugu (e.g., chupiyaru → చూపించారు), while this model faithfully preserves the informal romanization convention (→ చుపియారు). For Tenglish transliteration, faithfulness to the input spelling is the correct behavior.

Per-Slice (v1.0 baseline for reference)

| Slice | v1.0 CER | Coverage |
|---|---|---|
| Overall | 7.08% | 10,000 pairs |
| Formal | 3.25% | 71 pairs |
| Colloquial | 3.09% | 3,656 pairs |
| Long sentence | 16.21% | 206 pairs |

Synthetic Data

v1.1 added 6,364 targeted synthetic pairs, generated locally with gemma4:31b via Ollama, covering 9 confusion categories identified in the v1.0 error analysis:

| Category | Pairs | Focus |
|---|---|---|
| antey_disambiguation | 704 | అంటే (means) vs అంతే (that's all) in context |
| dative_suffix_context | 701 | naaku / naku / ki suffix forms |
| colloquial_chat | 705 | WhatsApp-style greetings, questions, reactions |
| urdu_loanwords | 1,000 | pakka, bilkul, mast, zabardast in Telugu sentences |
| code_mix | 1,000 | English words with Telugu morphological markers |
| long_sentences | 800 | Full sentences >15 words |
| reactions_expressions | 701 | Emotional interjections and expressive phrases |
| continuous_negation | 403 | Negation + continuous tense combinations |
| long_vowel_disambiguation | 350 | Phonemically distinct long vs short vowels |

All synthetic pairs were validated for Telugu script density (≥10%), length ratio, and ASCII content before inclusion.
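
A filter of this kind might look like the sketch below. Only the 10% Telugu script density threshold comes from this card. The length-ratio bounds, the reading of the ASCII check as "the Romanized side must be pure ASCII", and all function names are assumptions made for illustration.

```python
def telugu_density(text: str) -> float:
    """Fraction of code points in the Telugu Unicode block (U+0C00-U+0C7F)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if "\u0c00" <= ch <= "\u0c7f") / len(text)


def is_valid_pair(roman: str, telugu: str,
                  min_density: float = 0.10,   # threshold stated in the card
                  min_ratio: float = 0.5,      # assumed bound
                  max_ratio: float = 2.0) -> bool:  # assumed bound
    """Apply script-density, length-ratio, and ASCII checks to one pair."""
    roman, telugu = roman.strip(), telugu.strip()
    if not roman or not telugu:
        return False
    # Assumed meaning of the ASCII check: the Tenglish side is pure ASCII.
    if not roman.isascii():
        return False
    if telugu_density(telugu) < min_density:
        return False
    ratio = len(telugu) / len(roman)
    return min_ratio <= ratio <= max_ratio


pairs = [
    ("nenu telugu matladatanu", "నేను తెలుగు మాట్లాడతాను"),
    ("ela unnav bro", "ela unnav bro"),  # no Telugu script on the output side
]
print([is_valid_pair(r, t) for r, t in pairs])
```

Checks like these are cheap and catch the most common generation failures, such as the generator echoing the Romanized input or appending a long explanation to the target.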


Known Limitations

  • Long sentences (>40 words): ByT5's 1024-byte window truncates inputs, and the usage example above truncates earlier, at 128 bytes (max_length=128). Split long text before transliterating.
  • Rare proper nouns: Unusual place names and personal names may be inconsistently transliterated.
  • Spelling convention: When multiple valid Tenglish spellings exist for a word, the model outputs the most common convention learned from training data.
  • Pure English words: Code-mixed English words pass through but capitalization may not be preserved.
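
One way to follow the advice on long sentences is to split text on whitespace into chunks that stay under a byte budget, then transliterate each chunk separately. The helper below is a sketch; its name and the 40-byte default are assumptions. The default is deliberately conservative because Telugu output takes about three bytes per code point, so the 128-token generation limit in the usage example may be reached well before the input limit.

```python
def split_for_transliteration(text: str, max_bytes: int = 40) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of <= max_bytes.

    A single word longer than max_bytes is kept whole in its own chunk
    rather than being cut mid-word.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


long_text = ("nenu repu hyderabad veltunnanu akkada ma friends ni kalisi "
             "movie chusi malli night ki intiki vachestanu")
for chunk in split_for_transliteration(long_text):
    print(len(chunk.encode("utf-8")), chunk)
```

Each chunk can then be passed to to_telugu from the Usage section and the results joined with spaces. Splitting on word boundaries loses cross-chunk context, which may matter for context-dependent cases such as the అంటే / అంతే distinction.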

Companion Model

For the reverse direction (Telugu script → Tenglish), see harinpurumandla/telugu-to-tenglish.


Citation

@misc{telugu-transliterator-2026,
  title={Telugu Transliterator: Tenglish to Telugu Unicode Transliteration},
  author={Purumandla, Sai Harin Kumar Reddy},
  year={2026},
  url={https://huggingface.co/harinpurumandla/tenglish-to-telugu}
}