---
library_name: fairseq2
datasets:
- google/fleurs
- Thomcles/Persian-Farsi-Speech
language:
- fa
tags:
- automatic-speech-recognition
- persian
- farsi
- omnilingual-asr
- ctc
---

# Omnilingual ASR CTC 300M v2, FLEURS Persian + Thomcles Continuation

This repository contains the step-5000 checkpoint from continuing `Peacockery/omni-ctc-300m-v2-fleurs-fa-ir` on `Thomcles/Persian-Farsi-Speech`.

The checkpoint is a fairseq2 / Omnilingual ASR checkpoint. It is not packaged as a Transformers model.

## Files

- `checkpoint-step-5000.pt`: final continued model checkpoint.
- `fairseq2_card.yaml`: local fairseq2 asset card for the checkpoint.
- `training-config.yaml`: continuation training configuration.
- `benchmarks/fleurs-test-step5000-thomcles-summary.md`: FLEURS `fa_ir` test benchmark after Thomcles continuation.
- `benchmarks/fleurs-test-before-thomcles-summary.md`: FLEURS `fa_ir` test benchmark before Thomcles continuation.
- `dev-scores/`: Thomcles dev WER scores saved during continuation training.
- `data/thomcles-language_distribution_0.tsv`: prepared Thomcles training-hour summary.

## Results

FLEURS `fa_ir` test, 871 samples:

| Checkpoint | WER | CER |
| --- | ---: | ---: |
| FLEURS + Thomcles step 5000 | 18.02% | 5.11% |
| FLEURS-only step 5000 | 18.55% | 5.28% |

Thomcles dev validation:

| Step | WER |
| --- | ---: |
| 500 | 31.41% |
| 1000 | 28.91% |
| 1500 | 26.54% |
| 2000 | 26.11% |
| 2500 | 25.64% |
| 3000 | 24.69% |
| 3500 | 24.21% |
| 4000 | 23.95% |
| 4500 | 23.73% |
| 5000 | 23.62% |

## Training Notes

- Starting checkpoint: `Peacockery/omni-ctc-300m-v2-fleurs-fa-ir`
- Continuation dataset: `Thomcles/Persian-Farsi-Speech`
- Prepared Thomcles data: 108,306 train rows, 1,095 dev rows, 417.46 total hours
- Tokenizer: `omniASR_tokenizer_written_v2`
- Continuation steps: 5000
- Optimizer learning rate: `1e-5`
- Gradient accumulation: 8 batches
- Precision: bfloat16

See `training-config.yaml` for the exact trainer settings.
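
To put the Results figures in context, the short script below is a minimal sketch that computes the absolute and relative error reductions from the numbers reported above. The dictionary names and the `relative_reduction` helper are illustrative and not part of this repository; the values are copied by hand from the tables, not read from the benchmark files.

```python
# Figures copied by hand from the Results tables in this README.
fleurs_test = {
    "FLEURS-only step 5000": {"wer": 18.55, "cer": 5.28},
    "FLEURS + Thomcles step 5000": {"wer": 18.02, "cer": 5.11},
}

thomcles_dev_wer = {
    500: 31.41, 1000: 28.91, 1500: 26.54, 2000: 26.11, 2500: 25.64,
    3000: 24.69, 3500: 24.21, 4000: 23.95, 4500: 23.73, 5000: 23.62,
}


def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Hypothetical comparison: FLEURS test before vs. after Thomcles continuation.
before = fleurs_test["FLEURS-only step 5000"]
after = fleurs_test["FLEURS + Thomcles step 5000"]
for metric in ("wer", "cer"):
    abs_drop = before[metric] - after[metric]
    rel_drop = relative_reduction(before[metric], after[metric])
    print(f"FLEURS test {metric.upper()}: "
          f"{abs_drop:.2f} points absolute, {rel_drop:.1f}% relative")

# Thomcles dev WER: first logged step vs. last logged step.
first_step, last_step = min(thomcles_dev_wer), max(thomcles_dev_wer)
dev_rel = relative_reduction(thomcles_dev_wer[first_step],
                             thomcles_dev_wer[last_step])
print(f"Thomcles dev WER, step {first_step} to {last_step}: "
      f"{dev_rel:.1f}% relative reduction")
```

On these figures, the continuation lowers FLEURS test WER by 0.53 points (roughly 2.9% relative) and CER by 0.17 points (roughly 3.2% relative), while Thomcles dev WER drops by about a quarter between steps 500 and 5000.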