Qwen3-ASR-1.7B Speaker-Tagged
A Qwen3-ASR-1.7B finetune that transcribes English speech with inline speaker labels such as [S0], [S1], [S2], and [S3].
The goal is simple: a single model output stream that carries both the words and the speaker turns, without bolting on an external diarization pipeline.
What It Does
- Produces normal ASR text.
- Adds inline speaker-change tags in the transcript.
- Supports compact multispeaker clips with up to four local speakers.
- Assigns [S0], [S1], etc. by first appearance inside the clip.
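Because tags follow first-appearance order, a transcript whose tags arrive in a different order can be renormalized downstream. A minimal sketch (the helper name is illustrative, not part of the model):

```python
import re

def relabel_by_first_appearance(transcript: str) -> str:
    """Remap [S*] tags so the first speaker heard becomes [S0], the next [S1], etc."""
    mapping = {}
    def repl(m):
        tag = m.group(0)
        if tag not in mapping:
            mapping[tag] = f"[S{len(mapping)}]"
        return mapping[tag]
    return re.sub(r"\[S\d+\]", repl, transcript)

print(relabel_by_first_appearance("[S2] HI [S0] HELLO [S2] YES"))
# -> [S0] HI [S1] HELLO [S0] YES
```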
Prompt
Use a prompt in this style:
Transcribe the English meeting audio and include speaker changes inline. Use only [S0], [S1], [S2], and [S3] speaker tags. Assign speakers by first appearance in the clip and repeat the tag before each speaker turn.
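Because the speaker tags are plain text, the generated transcript can be split into (speaker, text) turns with a small regex. This parser is an illustrative assumption, not something shipped with the model:

```python
import re

def split_turns(transcript: str):
    """Split a [S*]-tagged transcript into (speaker, text) tuples, one per turn."""
    parts = re.split(r"(\[S\d+\])", transcript)
    turns = []
    for i in range(1, len(parts), 2):
        text = parts[i + 1].strip() if i + 1 < len(parts) else ""
        if text:
            turns.append((parts[i].strip("[]"), text))
    return turns

print(split_turns("[S0] OKAY [S1] SORRY CAN YOU SAY THAT AGAIN [S0] SURE"))
# -> [('S0', 'OKAY'), ('S1', 'SORRY CAN YOU SAY THAT AGAIN'), ('S0', 'SURE')]
```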
Quick Eval
On this non-cherry-picked held-out multispeaker slice, the model emitted speaker tags for 6/6 clips, matched the reference speaker count on 3/6 clips, and predicted on average 1.67 fewer turns than the reference.
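The per-clip numbers below (speaker-count match, turn-count delta) can be recomputed from the reference and predicted transcripts alone. A minimal sketch, assuming both sides repeat a [S*] tag before every turn as in the samples:

```python
import re

def clip_stats(reference: str, prediction: str) -> dict:
    """Compare speaker and turn counts between reference and predicted transcripts."""
    def speakers(t):
        return set(re.findall(r"\[S\d+\]", t))  # distinct tags = speaker count
    def turns(t):
        return len(re.findall(r"\[S\d+\]", t))  # one tag per turn
    return {
        "speaker_match": len(speakers(reference)) == len(speakers(prediction)),
        "turn_delta": turns(prediction) - turns(reference),
    }

print(clip_stats("[S0] A [S0] B [S1] C", "[S0] A [S1] B"))
# -> {'speaker_match': True, 'turn_delta': -1}
```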
The samples below are sorted for easier listening and inspection. They are not cherry-picked; they are the first held-out multispeaker eval artifacts from the run path, organized by speaker count and turn count. Full WAV files are stored in the repo; the inline players are compact base64 MP3 previews so the model card remains loadable.
Listen To The Eval
Sample 00: 2 speakers, 7 turns
- Full WAV artifact: stored with the eval artifacts in this repo
- Inline player: base64 MP3 preview generated from the same eval WAV
- Duration: 33.0s
- Speaker count: reference 2 / predicted 3 (mismatch)
- Turn count: reference 7 / predicted 5 (delta -2)
Reference
[S0] SO YOU COULD LIKE UM LIKE IF YOU HAVE I DUNNO IF YOU HAVE SATELLITE IF YOU HAVE A HUNDRED CHANNELS [S0] OKAY [S0] SO THROUGHOUT OUR NEXT COUPLE OF MEETINGS WE'LL WE'LL BE COVERING THESE THINGS [S0] YOU CAN CALL ME ABBIE [S0] SO I GUESS WE HAVE TO REFLECT ON OUR EXPERIENCES WITH REMOTE CONTROLS TO DECIDE WHAT UM WE WOULD LIKE TO SEE IN A CONVENIENT PRACTICAL [S1] SORRY CAN YOU JUST SAY THAT WHAT'S THE WHAT ARE OUR PRICE GOALS AGAIN [S0] ALTHOUGH UH I'LL JUST PUT THERE
Prediction
[S0] SO YOU COULD LIKE UM LIKE IF YOU HAVE I DON'T KNOW IF YOU HAVE A SATELLITE IF YOU HAVE A HUNDRED CHANNELS [S1] OKAY [S0] SO THROUGHOUT OUR NEXT COUPLE OF MEETINGS WE W WILL BE COVERING THESE THINGS [S2] YOU CAN CALL ME ABBY [S0] SO I GUESS WE HAVE TO REFLECT ON OUR EXPERIENCES WITH REMOTE
Sample 01: 3 speakers, 10 turns
- Duration: 20.8s
- Speaker count: reference 3 / predicted 3 (match)
- Turn count: reference 10 / predicted 9 (delta -1)
Reference
[S0] YEAH YEAH YEAH [S0] THEN YOU CAN PRO SORT OF HAVE A PROGRAMMING FUNCTION [S0] FOUR MILLION [S1] UM [S1] UM SO F HERE'S OUR AGENDA FOR TODAY [S2] I'M STEPHANIE AND I AM THE USER INTERFACE DESIGNER [S1] YEAH [S1] SORRY ABOUT THAT [S0] NO IT CAN'T BE UH UH [S2] UM
Prediction
[S0] YEAH YEAH [S1] THEN YOU CAN PRO SORT OF HAVE A PROGRAMMING FUNCTION [S2] OH [S1] UM [S1] UM SO HERE'S OUR AGENDA FOR TODAY [S1] I'M STEPHANY AND I AM THE USER INTERFACE DESIGNER [S0] YEAH YEAH YEAH [S2] SORRY ABOUT THAT [S2] NO IT CAN'T REALLY
Sample 02: 3 speakers, 10 turns
- Duration: 27.2s
- Speaker count: reference 3 / predicted 3 (match)
- Turn count: reference 10 / predicted 8 (delta -2)
Reference
[S0] WE'LL GET ROUND TO THAT LATER [S0] SOMETHING LIKE NOT IT'S NOT ON THE BUTTON BUT IT'S TELLING YOU WHAT TO DO IS THAT WHAT YOU MEAN [S0] UM [S1] BIGGER [S2] UM [S2] THAT'S VERY GOOD [S2] OKAY [S2] MAYBE WE SHOULD DESIGN COUCHES THAT HAVE THE REMOTE CONTROL IN THE SIDE ARM [S1] I DON'T YEAH I MEAN IT BUT LIKE I MEAN JUST I MEAN LIKE YOUR PHONE EVEN JUST HAS SO LIKE IT CAN VIBRATE [S1] I LIKE ITS TAIL
Prediction
[S0] WELL THAT'S ONE THING THAT'S A THING [S1] SOMETHING LIKE NOT IT'S NOT ON THE BOTTOM BUT IT'S TELLING YOU WHAT TO DO [S2] UM [S1] BIGGER [S1] UH [S1] NO SORRY [S0] OKAY [S0] WHO WE SHOULD DESIGN COUCHES THAT HAVE THE REMOTE CONTROL IN THE MIDDLE RIGHT NOW [S1
Sample 03: 3 speakers, 10 turns
- Duration: 24.9s
- Speaker count: reference 3 / predicted 3 (match)
- Turn count: reference 10 / predicted 7 (delta -3)
Reference
[S0] UH [S1] DOES ANYONE KNOW WHAT THEY WANNA DRAW [S0] MARKETING [S2] 'CAUSE I HATE THAT WHEN THEY HAVE TOO MANY BUTTONS AND I MEAN I KNOW IT HAS TO HAVE ENOUGH FUNCTIONS BUT LIKE [S0] SELECT [S1] MM [S1] SO I GUESS YOU'D BE LOOKING AT LOTS OF THE THINGS WE DISCUSSED ABOUT SCREEN AND UM THAT SORT OF THING [S2] MM I GOTTA THINK ABOUT IT FOR A SECOND LIKE [S0] OOH [S0] YEAH LIKE
Prediction
[S0] RIGHT [S1] SO THEY WANNA KNOW WHAT THEY WANNA DRAW [S2] MARKETING [S1] 'CAUSE I HATE THINGS WHEN THEY HAVE TOO MANY BUTTONS AND THEN BUT THEN YOU KNOW I MEAN I KNOW IT HAS TO HAVE NO FUNCTIONS BUT LIKE [S1] SELECT [S1] YEAH [S1] SO I GUESS WE'D BE LOOKING AT ABOUT SOME OF THE THINGS WE
Sample 04: 4 speakers, 10 turns
- Duration: 18.2s
- Speaker count: reference 4 / predicted 3 (mismatch)
- Turn count: reference 10 / predicted 9 (delta -1)
Reference
[S0] YEAH [S1] NO IT REALLY WOULDN'T BE [S2] LIKE SPACESHIP [S3] AND FURRY [S3] WHAT IS TECHNICAL FUNCTIONS EXACTL I I DON'T REALLY KNOW WHAT [S3] YEAH [S2] BUT DEFINITELY NOT WELL I DON'T KNOW [S0] LET'S SHALL WE ALL INTRODUCE OURSELVES [S0] IT'S A PIG [S3] YEAH
Prediction
[S0] YEAH [S1] NO IT REALLY WON'T BE [S2] I DON'T KNOW [S1] AND FURRY [S1] WHAT IS TECHNICAL FUNCTIONS EXACTLY I I DON'T REALLY KNOW [S1] YEAH [S1] BUT DEFINITELY NOT UH I DON'T KNOW [S1] LET'S TRY ALL INTRODUCE OURSELVES [S0] IT'S A P
Sample 05: 4 speakers, 10 turns
- Duration: 23.4s
- Speaker count: reference 4 / predicted 3 (mismatch)
- Turn count: reference 10 / predicted 9 (delta -1)
Reference
[S0] SO WHAT DO WE THINK THIS REMOTE CONTROL SHOULD [S1] IT SAID UM [S1] THAT'S QUITE ANNOYING [S0] INDUSTRIAL DESIGNER WHAT DOES THAT STAND FOR I. D. YEAH UM IS GOING TO BE LOOKING MORE INTO THE WORKING DESIGN [S1] BE A MEDIUM BETWEEN YOU AND THE TELLY I THINK THAT'S UH [S2] OR JUST DRAW IT [S2] YEAH [S3] SUPPOSE I CAN DRAW AN ANIMAL YEAH [S0] SO [S0] YES
Prediction
[S0] SO WHAT DO WE THINK THIS REMOTE CONTROL SHOULD [S1] IT'S A UM [S1] THAT'S QUITE TERRIFYING [S2] INDUSTRIAL DESIGN YEAH I THINK SO [S1] BE A MEDIUM BETWEEN YOU AND THE TALLY [S1] OR JUST TRY IT [S1] YEAH [S1] I SUPPOSE I CAN DRAW AN ANIMAL [S2] SO [
Limitations
- Speaker labels are local to each clip, not global speaker identities.
- The model can merge or over-split speaker turns, especially during short backchannels or overlapping speech.
- This is not a replacement for high-precision diarization where timestamps and speaker identity tracking are required.
- The current eval slice is intentionally small and listenable; broader benchmarking is still needed.
Intended Use
Use this for experiments where you want ASR text and lightweight speaker-turn markup in a single generated transcript. It is most useful for prototyping diarized transcription, meeting-style audio, conversational clips, and model-side formatting experiments.
Training Note
This checkpoint was trained as a full finetune from Qwen3-ASR-1.7B on public English speech data with speaker-turn supervision. The recipe intentionally teaches the model to emit textual speaker tags directly.
Special thanks to Hugging Face for compute support, and to the public speech-data community for making this kind of work possible.
Model: mrfakename/qwen3-asr-1.7b-ami-diarization-fft-r6-20260422
Base model: Qwen/Qwen3-ASR-1.7B