Qwen3-ASR-1.7B Speaker-Tagged
A Qwen3-ASR-1.7B finetune that transcribes English speech with inline speaker labels such as [S0], [S1], [S2], and [S3].
The goal is simple: a single model output stream that carries both the words and the speaker turns, without bolting on an external diarization pipeline.
What It Does
- Produces normal ASR text.
- Adds inline speaker-change tags in the transcript.
- Supports compact multispeaker clips with up to four local speakers.
- Assigns [S0], [S1], etc. by first appearance inside the clip.
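Because tags follow first-appearance order, a transcript whose tags arrive in a different order can be renormalized downstream. A minimal sketch (the helper name is illustrative, not part of the model):

```python
import re

def relabel_by_first_appearance(transcript: str) -> str:
    """Remap [S*] tags so the first speaker heard becomes [S0], the next [S1], etc."""
    mapping = {}
    def repl(m):
        tag = m.group(0)
        if tag not in mapping:
            mapping[tag] = f"[S{len(mapping)}]"
        return mapping[tag]
    return re.sub(r"\[S\d+\]", repl, transcript)

print(relabel_by_first_appearance("[S2] HI [S0] HELLO [S2] YES"))
# -> [S0] HI [S1] HELLO [S0] YES
```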
Prompt
Use a prompt in this style:
Transcribe the English meeting audio and include speaker changes inline. Use only [S0], [S1], [S2], and [S3] speaker tags. Assign speakers by first appearance in the clip and repeat the tag before each speaker turn.
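Because the speaker tags are plain text, the generated transcript can be split into (speaker, text) turns with a small regex. This parser is an illustrative assumption, not something shipped with the model:

```python
import re

def split_turns(transcript: str):
    """Split a [S*]-tagged transcript into (speaker, text) tuples, one per turn."""
    parts = re.split(r"(\[S\d+\])", transcript)
    turns = []
    for i in range(1, len(parts), 2):
        text = parts[i + 1].strip() if i + 1 < len(parts) else ""
        if text:
            turns.append((parts[i].strip("[]"), text))
    return turns

print(split_turns("[S0] OKAY [S1] SORRY CAN YOU SAY THAT AGAIN [S0] SURE"))
# -> [('S0', 'OKAY'), ('S1', 'SORRY CAN YOU SAY THAT AGAIN'), ('S0', 'SURE')]
```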
Quick Eval
On this non-cherry-picked held-out multispeaker slice, the model emitted speaker tags for 6/6 clips, matched the reference speaker count on 3/6 clips, and predicted on average 1.67 fewer turns than the reference.
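The per-clip numbers below (speaker-count match, turn-count delta) can be recomputed from the reference and predicted transcripts alone. A minimal sketch, assuming both sides repeat a [S*] tag before every turn as in the samples:

```python
import re

def clip_stats(reference: str, prediction: str) -> dict:
    """Compare speaker and turn counts between reference and predicted transcripts."""
    def speakers(t):
        return set(re.findall(r"\[S\d+\]", t))  # distinct tags = speaker count
    def turns(t):
        return len(re.findall(r"\[S\d+\]", t))  # one tag per turn
    return {
        "speaker_match": len(speakers(reference)) == len(speakers(prediction)),
        "turn_delta": turns(prediction) - turns(reference),
    }

print(clip_stats("[S0] A [S0] B [S1] C", "[S0] A [S1] B"))
# -> {'speaker_match': True, 'turn_delta': -1}
```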
The samples below are sorted for easier listening and inspection. They are not cherry-picked; they are the first held-out multispeaker eval artifacts from the run path, organized by speaker count and turn count. Full WAV files are stored in the repo; the inline players are compact base64 MP3 previews so the model card remains loadable.
Listen To The Eval
Sample 00: 2 speakers, 7 turns
- Full WAV artifact: stored with the eval artifacts in this repo
- Inline player: base64 MP3 preview generated from the same eval WAV
- Duration: 33.0s
- Speaker count: reference 2 / predicted 3 (mismatch)
- Turn count: reference 7 / predicted 5 (delta -2)
Reference
[S0] SO YOU COULD LIKE UM LIKE IF YOU HAVE I DUNNO IF YOU HAVE SATELLITE IF YOU HAVE A HUNDRED CHANNELS [S0] OKAY [S0] SO THROUGHOUT OUR NEXT COUPLE OF MEETINGS WE'LL WE'LL BE COVERING THESE THINGS [S0] YOU CAN CALL ME ABBIE [S0] SO I GUESS WE HAVE TO REFLECT ON OUR EXPERIENCES WITH REMOTE CONTROLS TO DECIDE WHAT UM WE WOULD LIKE TO SEE IN A CONVENIENT PRACTICAL [S1] SORRY CAN YOU JUST SAY THAT WHAT'S THE WHAT ARE OUR PRICE GOALS AGAIN [S0] ALTHOUGH UH I'LL JUST PUT THERE
Prediction
[S0] SO YOU COULD LIKE UM LIKE IF YOU HAVE I DON'T KNOW IF YOU HAVE A SATELLITE IF YOU HAVE A HUNDRED CHANNELS [S1] OKAY [S0] SO THROUGHOUT OUR NEXT COUPLE OF MEETINGS WE W WILL BE COVERING THESE THINGS [S2] YOU CAN CALL ME ABBY [S0] SO I GUESS WE HAVE TO REFLECT ON OUR EXPERIENCES WITH REMOTE
Sample 01: 3 speakers, 10 turns
- Duration: 20.8s
- Speaker count: reference 3 / predicted 3 (match)
- Turn count: reference 10 / predicted 9 (delta -1)
Reference
[S0] YEAH YEAH YEAH [S0] THEN YOU CAN PRO SORT OF HAVE A PROGRAMMING FUNCTION [S0] FOUR MILLION [S1] UM [S1] UM SO F HERE'S OUR AGENDA FOR TODAY [S2] I'M STEPHANIE AND I AM THE USER INTERFACE DESIGNER [S1] YEAH [S1] SORRY ABOUT THAT [S0] NO IT CAN'T BE UH UH [S2] UM
Prediction
[S0] YEAH YEAH [S1] THEN YOU CAN PRO SORT OF HAVE A PROGRAMMING FUNCTION [S2] OH [S1] UM [S1] UM SO HERE'S OUR AGENDA FOR TODAY [S1] I'M STEPHANY AND I AM THE USER INTERFACE DESIGNER [S0] YEAH YEAH YEAH [S2] SORRY ABOUT THAT [S2] NO IT CAN'T REALLY
Sample 02: 3 speakers, 10 turns
- Duration: 27.2s
- Speaker count: reference 3 / predicted 3 (match)
- Turn count: reference 10 / predicted 8 (delta -2)
Reference
[S0] WE'LL GET ROUND TO THAT LATER [S0] SOMETHING LIKE NOT IT'S NOT ON THE BUTTON BUT IT'S TELLING YOU WHAT TO DO IS THAT WHAT YOU MEAN [S0] UM [S1] BIGGER [S2] UM [S2] THAT'S VERY GOOD [S2] OKAY [S2] MAYBE WE SHOULD DESIGN COUCHES THAT HAVE THE REMOTE CONTROL IN THE SIDE ARM [S1] I DON'T YEAH I MEAN IT BUT LIKE I MEAN JUST I MEAN LIKE YOUR PHONE EVEN JUST HAS SO LIKE IT CAN VIBRATE [S1] I LIKE ITS TAIL
Prediction
[S0] WELL THAT'S ONE THING THAT'S A THING [S1] SOMETHING LIKE NOT IT'S NOT ON THE BOTTOM BUT IT'S TELLING YOU WHAT TO DO [S2] UM [S1] BIGGER [S1] UH [S1] NO SORRY [S0] OKAY [S0] WHO WE SHOULD DESIGN COUCHES THAT HAVE THE REMOTE CONTROL IN THE MIDDLE RIGHT NOW [S1
Sample 03: 3 speakers, 10 turns
- Duration: 24.9s
- Speaker count: reference 3 / predicted 3 (match)
- Turn count: reference 10 / predicted 7 (delta -3)
Reference
[S0] UH [S1] DOES ANYONE KNOW WHAT THEY WANNA DRAW [S0] MARKETING [S2] 'CAUSE I HATE THAT WHEN THEY HAVE TOO MANY BUTTONS AND I MEAN I KNOW IT HAS TO HAVE ENOUGH FUNCTIONS BUT LIKE [S0] SELECT [S1] MM [S1] SO I GUESS YOU'D BE LOOKING AT LOTS OF THE THINGS WE DISCUSSED ABOUT SCREEN AND UM THAT SORT OF THING [S2] MM I GOTTA THINK ABOUT IT FOR A SECOND LIKE [S0] OOH [S0] YEAH LIKE
Prediction
[S0] RIGHT [S1] SO THEY WANNA KNOW WHAT THEY WANNA DRAW [S2] MARKETING [S1] 'CAUSE I HATE THINGS WHEN THEY HAVE TOO MANY BUTTONS AND THEN BUT THEN YOU KNOW I MEAN I KNOW IT HAS TO HAVE NO FUNCTIONS BUT LIKE [S1] SELECT [S1] YEAH [S1] SO I GUESS WE'D BE LOOKING AT ABOUT SOME OF THE THINGS WE
Sample 04: 4 speakers, 10 turns
- Duration: 18.2s
- Speaker count: reference 4 / predicted 3 (mismatch)
- Turn count: reference 10 / predicted 9 (delta -1)
Reference
[S0] YEAH [S1] NO IT REALLY WOULDN'T BE [S2] LIKE SPACESHIP [S3] AND FURRY [S3] WHAT IS TECHNICAL FUNCTIONS EXACTL I I DON'T REALLY KNOW WHAT [S3] YEAH [S2] BUT DEFINITELY NOT WELL I DON'T KNOW [S0] LET'S SHALL WE ALL INTRODUCE OURSELVES [S0] IT'S A PIG [S3] YEAH
Prediction
[S0] YEAH [S1] NO IT REALLY WON'T BE [S2] I DON'T KNOW [S1] AND FURRY [S1] WHAT IS TECHNICAL FUNCTIONS EXACTLY I I DON'T REALLY KNOW [S1] YEAH [S1] BUT DEFINITELY NOT UH I DON'T KNOW [S1] LET'S TRY ALL INTRODUCE OURSELVES [S0] IT'S A P
Sample 05: 4 speakers, 10 turns
- Duration: 23.4s
- Speaker count: reference 4 / predicted 3 (mismatch)
- Turn count: reference 10 / predicted 9 (delta -1)
Reference
[S0] SO WHAT DO WE THINK THIS REMOTE CONTROL SHOULD [S1] IT SAID UM [S1] THAT'S QUITE ANNOYING [S0] INDUSTRIAL DESIGNER WHAT DOES THAT STAND FOR I. D. YEAH UM IS GOING TO BE LOOKING MORE INTO THE WORKING DESIGN [S1] BE A MEDIUM BETWEEN YOU AND THE TELLY I THINK THAT'S UH [S2] OR JUST DRAW IT [S2] YEAH [S3] SUPPOSE I CAN DRAW AN ANIMAL YEAH [S0] SO [S0] YES
Prediction
[S0] SO WHAT DO WE THINK THIS REMOTE CONTROL SHOULD [S1] IT'S A UM [S1] THAT'S QUITE TERRIFYING [S2] INDUSTRIAL DESIGN YEAH I THINK SO [S1] BE A MEDIUM BETWEEN YOU AND THE TALLY [S1] OR JUST TRY IT [S1] YEAH [S1] I SUPPOSE I CAN DRAW AN ANIMAL [S2] SO [
Limitations
- Speaker labels are local to each clip, not global speaker identities.
- The model can merge or over-split speaker turns, especially during short backchannels or overlapping speech.
- This is not a replacement for high-precision diarization where timestamps and speaker identity tracking are required.
- The current eval slice is intentionally small and listenable; broader benchmarking is still needed.
Intended Use
Use this for experiments where you want ASR text and lightweight speaker-turn markup in a single generated transcript. It is most useful for prototyping diarized transcription, meeting-style audio, conversational clips, and model-side formatting experiments.
Training Note
This checkpoint was trained as a full finetune from Qwen3-ASR-1.7B on public English speech data with speaker-turn supervision. The recipe intentionally teaches the model to emit textual speaker tags directly.
Special thanks to Hugging Face for compute support, and to the public speech-data community for making this kind of work possible.
Model: mrfakename/qwen3-asr-1.7b-ami-diarization-fft-r6-20260422
Base model: Qwen/Qwen3-ASR-1.7B