Charsiu G2P (byT5-tiny-16-layers) — `safetensors` port

A self-contained, safetensors-packaged conversion of Charsiu's g2p_multilingual_byT5_tiny_16_layers, a multilingual grapheme-to-phoneme (G2P) model.

The upstream repo ships pytorch_model.bin and no tokenizer files. This repo adds:

model.safetensors: same weights, HF-standard T5 key naming, F32.
config.json: upstream's config, with the stale local path stripped.
tokenizer_config.json + special_tokens_map.json: byte-level ByT5 tokenizer config, taken from google/byt5-small (the tokenizer the Charsiu ByT5 G2P models use).
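
For readers unfamiliar with byte-level tokenization, here is a minimal sketch of what the ByT5 tokenizer config amounts to: each UTF-8 byte maps to its value plus 3, because ids 0 to 2 are reserved for padding, end-of-sequence, and unknown. The `encode` and `decode` helpers are illustrative names, not part of this repo or of transformers, and the sketch ignores the extra sentinel ids.

```python
from typing import List

# Sketch of ByT5-style byte-level encoding; not the transformers implementation.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
BYTE_OFFSET = 3  # ids 0..2 are reserved for the special tokens above


def encode(text: str) -> List[int]:
    """Map each UTF-8 byte to byte + 3 and append the end-of-sequence id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: List[int]) -> str:
    """Invert encode(), skipping special ids and anything above the byte range."""
    raw = bytes(i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256)
    return raw.decode("utf-8", errors="ignore")


ids = encode("<eng-us>: hello")
print(len(ids))     # 16: fifteen UTF-8 bytes plus EOS
print(decode(ids))  # <eng-us>: hello
```

Because the vocabulary is just bytes, no vocabulary or merges file is needed, which is why two small JSON files are enough here.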

No architecture or training changes. Predictions match the upstream PyTorch model within floating-point tolerance.
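
As an illustration of what "within floating-point tolerance" can mean in practice, the sketch below compares two output vectors by their largest element-wise difference. The helper names, the sample values, and the `atol=1e-5` threshold are all assumptions for illustration; this is not the check that was actually run for this port.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(a: Sequence[float], b: Sequence[float], atol: float = 1e-5) -> bool:
    """True when no element differs by more than atol (assumed threshold)."""
    return max_abs_diff(a, b) <= atol


# Hypothetical logits from the upstream checkpoint and from this port.
upstream_logits = [0.12, -3.4, 7.0]
port_logits = [0.1200001, -3.4000002, 7.0]
print(outputs_match(upstream_logits, port_logits))  # True
```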

Usage

Python (transformers / PyTorch)

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the tokenizer and the model from the same repo.
REPO_ID = "bearcove/g2p-multilingual-byT5-tiny-16-layers-mlx"
tok = AutoTokenizer.from_pretrained(REPO_ID)
model = T5ForConditionalGeneration.from_pretrained(REPO_ID)

# Prepend the language tag, e.g. "<eng-us>: " for American English.
inputs = tok(["<eng-us>: hello"], return_tensors="pt")
# num_beams=1 is greedy decoding; max_length caps the output length in byte-level tokens.
out = model.generate(**inputs, num_beams=1, max_length=50)
print(tok.batch_decode(out, skip_special_tokens=True))  # => ['ˈhɛɫoʊ']

Rust (MLX, Apple Silicon)

The reference Rust implementation is in bearcove/bee — see the bee-g2p-charsiu-mlx crate. It reads model.safetensors directly via mlx-rs.

use bee_g2p_charsiu_mlx::engine::G2pEngine;

let mut engine = G2pEngine::load("path/to/model-dir")?;
let ipa = engine.g2p("hello", "eng-us")?;  // => "ˈhɛɫoʊ"
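
Reading model.safetensors directly is practical because the format is simple: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. The Python sketch below reads only that header and counts parameters. It is not how the Rust crate is implemented, the function names are illustrative, and the demo runs on a tiny hand-built file rather than on this repo's weights.

```python
import json
import os
import struct
import tempfile
from math import prod
from typing import Set, Tuple


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        return json.loads(f.read(header_len))


def summarize(header: dict) -> Tuple[int, Set[str]]:
    """Count parameters and collect dtypes, skipping the optional metadata entry."""
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    n_params = sum(prod(t["shape"]) for t in tensors.values())
    dtypes = {t["dtype"] for t in tensors.values()}
    return n_params, dtypes


# Demo on a tiny hand-built file: one 2x3 F32 tensor (24 bytes of zeros).
demo_header = {"w": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]}}
blob = json.dumps(demo_header).encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob + bytes(24))
    n_params, dtypes = summarize(read_safetensors_header(demo_path))

print(n_params, dtypes)  # 6 {'F32'}
```

Pointing `read_safetensors_header` at a downloaded model.safetensors should list the T5 tensor names along with their shapes and dtypes.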

Language codes

Charsiu uses ISO 639-derived codes with dialect suffixes where applicable — e.g. eng-us for American English, eng-uk for British English. See the Charsiu language code table for all 100 supported languages.
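
A small sketch of how the language code turns into the model input prefix, following the `<eng-us>: hello` format from the Python usage example. `split_lang_code`, `tag_word`, and `tag_batch` are hypothetical helpers, not part of Charsiu or this repo, and they do not check that a code is one the model was trained on.

```python
from typing import List, Optional, Tuple


def split_lang_code(code: str) -> Tuple[str, Optional[str]]:
    """Split 'eng-us' into ('eng', 'us'); a code without a suffix gives (code, None)."""
    base, sep, dialect = code.partition("-")
    return base, (dialect if sep else None)


def tag_word(word: str, lang: str) -> str:
    """Build one model input: the code in angle brackets, a colon, a space, the word."""
    return f"<{lang}>: {word}"


def tag_batch(words: List[str], lang: str) -> List[str]:
    """Tag every word in a batch with the same language code."""
    return [tag_word(w, lang) for w in words]


print(split_lang_code("eng-uk"))              # ('eng', 'uk')
print(tag_batch(["hello", "world"], "eng-us"))  # ['<eng-us>: hello', '<eng-us>: world']
```

The resulting list can be passed to the tokenizer in place of the single-item list in the Python usage example.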

License

MIT, matching the upstream Charsiu project.

Citation

If you use this model, please cite the original Charsiu paper:

@inproceedings{zhu2022byt5,
  title={{ByT5} model for massively multilingual grapheme-to-phoneme conversion},
  author={Zhu, Jian and Zhang, Cong and Jurgens, David},
  booktitle={Proc. Interspeech 2022},
  year={2022},
  eprint={2204.03067},
  archivePrefix={arXiv}
}

Upstream project: https://github.com/lingjzhu/CharsiuG2P

Model size: 20.9M params, tensor type F32 (safetensors).
