Telugu Transliterator v1.1 (Tenglish → Telugu)

Converts colloquial Romanized Telugu (Tenglish) to Telugu Unicode script.

Fine-tuned from google/byt5-small on 5.6M transliteration pairs, including 6,364 targeted synthetic pairs generated to fix the colloquial error patterns identified in v1.0.

Example Outputs

Input (Tenglish)	Output (Telugu)	Notes
nenu telugu matladatanu	నేను తెలుగు మాట్లాడతాను	standard sentence
ela unnav bro	ఎలా ఉన్నావ్ బ్రో	casual greeting with English code-mix
chala bagundi aa movie	చాలా బాగుంది ఆ మూవీ	informal review
repu veltunnanu	రేపు వెళ్తున్నాను	near-future plan
antey em cheppali	అంటే ఏం చెప్పాలి	disambiguation: antey = "means"
ante idi chalu	అంతే ఇది చాలు	disambiguation: ante = "that's all"
naaku teliyadu	నాకు తెలియదు	dative suffix: naaku
pakka correct ga cheppadu	పక్కా కరెక్ట్ గా చెప్పాడు	Urdu loanword: pakka

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("harinpurumandla/tenglish-to-telugu")
model = AutoModelForSeq2SeqLM.from_pretrained("harinpurumandla/tenglish-to-telugu")

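def to_telugu(text):
    """Transliterate one Tenglish string to Telugu script."""
    # ByT5 tokenizes at the byte level, so max_length counts UTF-8 bytes, not words.
    inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
    # Beam search with 4 beams. Telugu characters take 3 UTF-8 bytes each, so
    # max_length=128 caps the output at roughly 42 characters; raise it for longer inputs.
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)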

print(to_telugu("nenu telugu matladatanu"))
# నేను తెలుగు మాట్లాడతాను

print(to_telugu("ela unnav bro"))
# ఎలా ఉన్నావ్ బ్రో

Model Details

Property	Value
Base model	google/byt5-small
Architecture	Byte-level T5 (seq2seq)
Parameters	~300M
Direction	Tenglish (Roman) → Telugu script
Training data	5,586,373 pairs (5 corpus sources + 6,364 synthetic)
Training	1 epoch continued fine-tune, LR=1e-5, bf16, dual RTX 3090
Eval CER	1.89%
Decoding	Beam search (num_beams=4)
Max input length	128 tokens (bytes)
License	Apache-2.0

Evaluation

Evaluated on a held-out validation set (5,000 sampled pairs) using beam search decoding (num_beams=4). The metric is character error rate (CER); lower is better.
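
The card does not spell out how CER is computed. A common definition is the character-level Levenshtein distance divided by the reference length. The sketch below assumes that definition, micro-averages over the whole set, and counts Unicode code points as characters (so a Telugu vowel sign counts separately from its consonant). The function names `levenshtein` and `corpus_cer` are illustrative, not part of this model's tooling.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep only the previous row of the DP table: O(len(hyp)) memory.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(references, hypotheses) -> float:
    """Total edit distance divided by total reference length (assumed micro-average)."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0
```

Under this definition, confusing అంటే with అంతే is one substitution over four code points. A macro-average (mean of per-sentence CER) would weight short sentences more heavily and can give a different number on the same data.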

Version History

Version	CER	Notes
v1.0	7.08%	Baseline — trained from scratch on 5.5M pairs
v1.1	1.89%	Continued fine-tune with targeted synthetic colloquial data

Comparison with Gemma4 31B

Evaluated on 50 clean Tenglish inputs (8–60 chars, alphabetic):

Model	CER	Parameters
ByT5-small v1.1 (this model)	2.70%	300M
Gemma4 31B (zero-shot, Ollama)	8.48%	31B

On this 50-input set, ByT5-small v1.1 reaches roughly 3× lower CER than Gemma4 31B (2.70% vs. 8.48%, a ratio of about 3.1) with about 1% of the parameters (300M vs. 31B). The key difference: Gemma4 normalizes colloquial spellings toward standard Telugu (e.g., chupiyaru → చూపించారు), while this model faithfully preserves the informal romanization convention (chupiyaru → చుపియారు). For Tenglish transliteration, faithfulness to the input spelling is the correct behavior.

Per-Slice (v1.0 baseline for reference)

Slice	v1.0 CER	Coverage
Overall	7.08%	10,000 pairs
Formal	3.25%	71 pairs
Colloquial	3.09%	3,656 pairs
Long sentence	16.21%	206 pairs

Synthetic Data

v1.1 added 6,364 targeted synthetic pairs, generated locally with Ollama (gemma4:31b), covering 9 confusion categories identified in the v1.0 error analysis:

Category	Pairs	Focus
antey_disambiguation	704	అంటే (means) vs అంతే (that's all) in context
dative_suffix_context	701	naaku / naku / ki suffix forms
colloquial_chat	705	WhatsApp-style greetings, questions, reactions
urdu_loanwords	1,000	pakka, bilkul, mast, zabardast in Telugu sentences
code_mix	1,000	English words with Telugu morphological markers
long_sentences	800	Full sentences >15 words
reactions_expressions	701	Emotional interjections and expressive phrases
continuous_negation	403	Negation + continuous tense combinations
long_vowel_disambiguation	350	Phonemically distinct long vs short vowels

All synthetic pairs were validated for Telugu script density (≥10%), length ratio, and ASCII content before inclusion.
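
The three checks named above can be sketched as a single filter. Only the 10% Telugu-density threshold comes from this card. The length-ratio bounds, the choice to measure density over non-whitespace characters, and the reading of "ASCII content" as "the Tenglish side must be pure ASCII" are all assumptions made for illustration, as are the function names.

```python
def telugu_density(text: str) -> float:
    """Fraction of non-whitespace characters in the Telugu block (U+0C00 to U+0C7F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0c00" <= c <= "\u0c7f" for c in chars) / len(chars)


def is_valid_pair(tenglish: str, telugu: str,
                  min_density: float = 0.10,  # threshold stated in the card
                  min_ratio: float = 0.5,     # assumed lower bound, not from the card
                  max_ratio: float = 2.0      # assumed upper bound, not from the card
                  ) -> bool:
    """Return True if a synthetic (Tenglish, Telugu) pair passes all three checks."""
    if not tenglish.strip() or not telugu.strip():
        return False
    # Assumed ASCII check: the romanized side should contain no non-ASCII characters.
    if not tenglish.isascii():
        return False
    # Script density check on the Telugu side.
    if telugu_density(telugu) < min_density:
        return False
    # Length ratio in code points, Telugu side over Tenglish side.
    ratio = len(telugu) / len(tenglish)
    return min_ratio <= ratio <= max_ratio
```

A filter like this mainly catches generation failures: a target left in Roman script, a source that already contains Telugu, or a pair where one side was cut short.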

Known Limitations

Long sentences (>40 words): ByT5's 1024-byte window truncates inputs, and the Usage code truncates earlier, at max_length=128 bytes. Split long text before transliterating.
Rare proper nouns: Unusual place names and personal names may be inconsistently transliterated.
Spelling convention: When multiple valid Tenglish spellings exist for a word, the model outputs the most common convention learned from training data.
Pure English words: Code-mixed English words pass through but capitalization may not be preserved.
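
For the long-sentence limitation, one way to split text before transliterating is to pack whole words greedily into chunks that stay under a byte budget. This is a minimal sketch, not the author's preprocessing: the helper name `split_for_transliteration` and the default budget of 120 bytes (a little under the 128-byte input limit, leaving room for the end-of-sequence token) are assumptions.

```python
def split_for_transliteration(text: str, max_bytes: int = 120):
    """Greedily pack whitespace-separated words into chunks of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is emitted as its own chunk and would
    still be truncated by the tokenizer.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        word_len = len(word.encode("utf-8"))
        # The extra byte accounts for the joining space in a non-empty chunk.
        extra = word_len + (1 if current else 0)
        if current and current_len + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_len = [word], word_len
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed to `to_telugu` from the Usage section and the results joined with spaces. If the output limit is left at max_length=128, a smaller budget such as 40 bytes is safer, because each Telugu character occupies 3 UTF-8 bytes on the output side. Splitting at word boundaries rather than sentence boundaries may cost some context for ambiguous words such as ante / antey.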

Companion Model

For the reverse direction (Telugu script → Tenglish), see harinpurumandla/telugu-to-tenglish.

Citation

@misc{telugu-transliterator-2026,
  title={Telugu Transliterator: Tenglish to Telugu Unicode Transliteration},
  author={Purumandla, Sai Harin Kumar Reddy},
  year={2026},
  url={https://huggingface.co/harinpurumandla/tenglish-to-telugu}
}
