---
library_name: fairseq2
datasets:
- google/fleurs
- Thomcles/Persian-Farsi-Speech
language:
- fa
tags:
- automatic-speech-recognition
- persian
- farsi
- omnilingual-asr
- ctc
---

# Omnilingual ASR CTC 300M v2, FLEURS Persian + Thomcles Continuation

This repository contains the step-5000 checkpoint obtained by continuing training from `Peacockery/omni-ctc-300m-v2-fleurs-fa-ir` on `Thomcles/Persian-Farsi-Speech`.

The checkpoint is a fairseq2 / Omnilingual ASR checkpoint. It is not packaged as a Transformers model.

## Files

- `checkpoint-step-5000.pt`: final continued model checkpoint.
- `fairseq2_card.yaml`: local fairseq2 asset card for the checkpoint.
- `training-config.yaml`: continuation training configuration.
- `benchmarks/fleurs-test-step5000-thomcles-summary.md`: FLEURS `fa_ir` test benchmark after Thomcles continuation.
- `benchmarks/fleurs-test-before-thomcles-summary.md`: FLEURS `fa_ir` test benchmark before Thomcles continuation.
- `dev-scores/`: Thomcles dev WER scores saved during continuation training.
- `data/thomcles-language_distribution_0.tsv`: prepared Thomcles training-hour summary.

## Results

FLEURS `fa_ir` test, 871 samples:

| Checkpoint | WER | CER |
| --- | ---: | ---: |
| FLEURS + Thomcles step 5000 | 18.02% | 5.11% |
| FLEURS-only step 5000 | 18.55% | 5.28% |
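
To put the two rows in perspective, the gap can be expressed as both an absolute and a relative reduction. The snippet below is a minimal sketch that only reuses the numbers from the table; the names `RESULTS` and `relative_reduction` are illustrative and are not part of this repository or of fairseq2.

```python
# WER/CER in percent, copied from the FLEURS fa_ir table above.
RESULTS = {
    "fleurs_only": {"wer": 18.55, "cer": 5.28},
    "fleurs_plus_thomcles": {"wer": 18.02, "cer": 5.11},
}


def relative_reduction(before: float, after: float) -> float:
    """Return the relative error reduction, in percent of the 'before' value."""
    return (before - after) / before * 100.0


for metric in ("wer", "cer"):
    before = RESULTS["fleurs_only"][metric]
    after = RESULTS["fleurs_plus_thomcles"][metric]
    print(
        f"{metric.upper()}: {before - after:.2f} points absolute, "
        f"{relative_reduction(before, after):.1f}% relative"
    )
```

On these figures the continuation lowers FLEURS WER by 0.53 points (about 2.9% relative) and CER by 0.17 points (about 3.2% relative).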

Thomcles dev set WER during continuation training:

| Step | WER |
| --- | ---: |
| 500 | 31.41% |
| 1000 | 28.91% |
| 1500 | 26.54% |
| 2000 | 26.11% |
| 2500 | 25.64% |
| 3000 | 24.69% |
| 3500 | 24.21% |
| 4000 | 23.95% |
| 4500 | 23.73% |
| 5000 | 23.62% |
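
The trend in this table is easier to read as the change between consecutive evaluations. The following is a small sketch over the values above; `DEV_WER` and `wer_deltas` are hypothetical names used only for illustration, and the files in `dev-scores/` are not parsed here because their format is not described in this card.

```python
# Thomcles dev WER (percent) by training step, copied from the table above.
DEV_WER = {
    500: 31.41, 1000: 28.91, 1500: 26.54, 2000: 26.11, 2500: 25.64,
    3000: 24.69, 3500: 24.21, 4000: 23.95, 4500: 23.73, 5000: 23.62,
}


def wer_deltas(scores: dict[int, float]) -> list[tuple[int, float]]:
    """Return (step, change in WER since the previous evaluation) pairs."""
    steps = sorted(scores)
    return [
        (cur, round(scores[cur] - scores[prev], 2))
        for prev, cur in zip(steps, steps[1:])
    ]


for step, delta in wer_deltas(DEV_WER):
    print(f"step {step}: {delta:+.2f} points")
```

Every interval shows a decrease, but the size of the decrease shrinks from 2.50 points over the first interval to 0.11 points over the last, which suggests the dev WER was flattening out by step 5000.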

## Training Notes

- Starting checkpoint: `Peacockery/omni-ctc-300m-v2-fleurs-fa-ir`
- Continuation dataset: `Thomcles/Persian-Farsi-Speech`
- Prepared Thomcles data: 108,306 train rows, 1,095 dev rows, 417.46 total hours
- Tokenizer: `omniASR_tokenizer_written_v2`
- Continuation steps: 5000
- Optimizer learning rate: `1e-5`
- Gradient accumulation: 8 batches
- Precision: bfloat16

See `training-config.yaml` for the exact trainer settings.
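
As a rough sanity check on the notes above, a couple of derived quantities can be computed directly from the listed figures. This is a back-of-the-envelope sketch, not output from the trainer. It assumes that the 417.46 hours cover the train and dev rows together, and that each of the 5000 steps is one optimizer update built from 8 accumulated batches; neither assumption is confirmed by this card, so check `training-config.yaml` before relying on the results.

```python
# Figures copied from the training notes above.
TRAIN_ROWS = 108_306
DEV_ROWS = 1_095
TOTAL_HOURS = 417.46
OPTIMIZER_STEPS = 5_000
GRAD_ACCUM = 8

# Assumption: TOTAL_HOURS spans train + dev rows combined.
avg_clip_seconds = TOTAL_HOURS * 3600 / (TRAIN_ROWS + DEV_ROWS)

# Assumption: one "step" is one optimizer update after GRAD_ACCUM batches.
forward_backward_batches = OPTIMIZER_STEPS * GRAD_ACCUM

print(f"average clip length: {avg_clip_seconds:.1f} s")
print(f"batches processed:   {forward_backward_batches}")
```

Under these assumptions the average clip is a little under 14 seconds long and the continuation run processed 40,000 batches in total.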