mlx-community/mel-roformer-zfturbo-vocals-v1-mlx
This model was converted to MLX format from the v1.0.0 release asset model_vocals_mel_band_roformer_sdr_8.42.ckpt of ZFTurbo/Music-Source-Separation-Training using a custom Mel-Band-RoFormer MLX port (the Blaizzy/mlx-audio PR + the xocialize/mel-roformer-mlx-swift Swift consumer). Refer to the original repository for more details on the model.
Model
- Family: Mel-Band-RoFormer (Lu, Wang, Won, "Mel-Band RoFormer for Music Source Separation," arXiv:2310.01809)
- Checkpoint: ZFTurbo vocals_v1 (v1.0.0 release, training-time SDR 8.42)
- Parameters: ~33M
- Sample rate: 44100 Hz, stereo
- Stems produced: vocals (single-stem model; derive instrumental as mixture - vocals)
- Chunk size: 352800 samples (~8 s at 44.1 kHz), 50% overlap
- STFT: n_fft=2048, hop_length=512, win_length=2048
- Transformer: dim=192, depth=8, heads=8, dim_head=64
- Bands: 60 mel bands
- Mask estimator depth: 1
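The chunking figures in the list above follow from simple arithmetic, sketched below. The constant names are illustrative, and the STFT frame count assumes a centered STFT (one frame per hop plus one), which this card does not state explicitly.

```python
# Values taken from the list above; names are illustrative only.
SAMPLE_RATE = 44100
CHUNK_SIZE = 352800
OVERLAP = 0.5
HOP_LENGTH = 512

# Duration of one chunk in seconds.
chunk_seconds = CHUNK_SIZE / SAMPLE_RATE

# Distance in samples between the starts of consecutive chunks at 50% overlap.
chunk_step = int(CHUNK_SIZE * (1 - OVERLAP))

# Assumption: a centered STFT yields 1 + n_samples // hop_length frames.
stft_frames = 1 + CHUNK_SIZE // HOP_LENGTH

print(chunk_seconds, chunk_step, stft_frames)
```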
This preset matches the v1.0.0 release asset specifically. It is smaller and uses a different hop_length (512 vs. 441) than the Kim Vocal 2 / viperx Mel-Band-RoFormer configurations. The mask estimator depth was confirmed against the published state-dict shapes: the YAML default mask_estimator_depth=2 does not match this checkpoint, because the shipped weights were trained with mask_estimator_depth=1.
The smaller parameter count gives faster inference (~27% lower RTF in our internal anime-dialogue test) at roughly 95% of Kim Vocal 2's separation strength. Pick this preset when model size or inference latency matters more than top-end separation quality.
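For readers who want to measure RTF on their own material, here is a minimal sketch. It assumes the usual definition of real-time factor (processing time divided by audio duration); the card does not describe how the internal test was timed, and both function names are hypothetical.

```python
import time


def real_time_factor(process, audio_seconds):
    """Time one call to `process` and divide by the audio duration in seconds."""
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def relative_rtf_reduction(rtf_baseline, rtf_candidate):
    """Fraction by which the candidate's RTF is lower than the baseline's."""
    return (rtf_baseline - rtf_candidate) / rtf_baseline
```

A value below 1.0 means the separation runs faster than real time. `process` would wrap a single forward pass on a clip of known length.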
Full hyperparameters in config.json.
Source
- Original repository: https://github.com/ZFTurbo/Music-Source-Separation-Training
- Original author: Roman Solovyev (@ZFTurbo)
- Original license: MIT (inherited from the MSS-Training repository LICENSE)
- Original release: v1.0.0
- Source file: model_vocals_mel_band_roformer_sdr_8.42.ckpt
The MIT license attaches to release assets distributed by an MIT-licensed GitHub repository. ZFTurbo's MSS-Training repository ships its release checkpoints directly from that MIT-licensed repo, so the weights inherit the repository's MIT terms.
License
This redistribution is MIT-licensed, matching the current license of the source repository. See LICENSE.
License timeline
For full transparency: the MSS-Training repository's LICENSE file (MIT) was committed on 2024-11-04 (commit 6149a22), approximately one year after the v1.0.0 release (published 2023-11-06) that produced this checkpoint asset. At the moment the v1.0.0 release was published, the repository did not yet carry an explicit LICENSE file.
This redistribution treats the weights as inheriting the repository's current MIT terms, on the basis that:
- The asset is still distributed today by the now-MIT-licensed repository.
- Adding a LICENSE file is the codified expression of the author's grant; nothing in the repository indicates a different license intent ever applied to the v1.0.0 release assets specifically.
- The author (Roman Solovyev / @ZFTurbo) is an active OSS maintainer who has continued to publish work in the same repository under MIT.
If a downstream redistributor wants a stricter trail (i.e., explicit author confirmation that the v1.0.0 release artifacts were always-MIT), reaching out to the author directly is the recommended path. We did not collect that confirmation prior to this redistribution.
Conversion
- Tool: mlx_audio.sts.models.mel_roformer.convert, merged upstream into Blaizzy/mlx-audio via PR #654 (2026-04-27) and shipped in mlx-audio==0.4.3 and later.
- Tool version at conversion time: 8380ab8 on the feat/mel-band-roformer branch (xocialize fork); this is the exact commit that produced model.safetensors. The merged upstream code is functionally equivalent.
- mlx (Python) version: 0.31.0
- Architecture preset: MelRoFormerConfig.zfturbo_vocals_v1()
- Output precision: float16 (see Precision below for the rationale; bf16 was tested first per the upload guide but failed parity at this small model size)
- Source model_vocals_mel_band_roformer_sdr_8.42.ckpt SHA-256: d9ce706b49cebf0af018590d8deb47ad5434987bf8f7bd3a87a4e5e8c30acb26
- Conversion date: 2026-04-25
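Anyone re-running the conversion can first check that their downloaded source checkpoint matches the SHA-256 listed above. The sketch below uses only the Python standard library; the function names and the local path in the usage comment are illustrative.

```python
import hashlib

# SHA-256 of the source checkpoint, as listed in the Conversion section.
EXPECTED_SHA256 = "d9ce706b49cebf0af018590d8deb47ad5434987bf8f7bd3a87a4e5e8c30acb26"


def sha256_of_file(path, block_size=1 << 20):
    """Hash a file in 1 MiB blocks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected


# Usage (path is a placeholder for wherever the .ckpt was downloaded):
# verify_checkpoint("model_vocals_mel_band_roformer_sdr_8.42.ckpt")
```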
Precision
This repository ships fp16 weights, while the sibling mlx-community/mel-roformer-kim-vocal-2-mlx ships bf16. The reason is empirical: casting to bf16 degrades this checkpoint's parity against the PyTorch reference below the upload acceptance threshold, while Kim's wider architecture absorbs bf16 truncation cleanly. A three-way comparison on the same parity harness with the same input chunk:
| Precision | File size | SDR vs PyTorch reference | Verdict |
|---|---|---|---|
| fp32 | 128.5 MB | 74.86 dB | Bit-exact-equivalent (baseline) |
| fp16 | 64.3 MB | 44.19 dB | Published; > 40 dB target |
| bf16 | 64.3 MB | 21.96 dB | Below target; not published |
The dim=192 / 33.7M-parameter architecture is more sensitive to bf16's narrow 7-bit mantissa than wider Mel-Band-RoFormer presets. fp16 retains more mantissa bits (10) at the same file size and recovers parity. fp32 was rejected as the published precision only on size grounds; if a downstream user prefers fp32 fidelity over the 64 MB savings, the upstream PyTorch checkpoint is freely available from the v1.0.0 release and can be re-converted via mlx_audio.sts.models.mel_roformer.convert with --dtype float32.
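The mantissa-width difference can be seen on a single value with the standard library alone. This is only an illustration of per-value rounding error, not a reproduction of the parity numbers in the table. The helper names are hypothetical, and bf16 is emulated by rounding a float32 bit pattern to its upper 16 bits.

```python
import struct


def round_to_fp16(x):
    """Round through IEEE 754 binary16 (10 explicit mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x):
    """Emulate bf16 (7 explicit mantissa bits): keep the top 16 bits of the
    float32 pattern, rounding to nearest with ties to even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


weight = 0.1234567  # arbitrary example value, not taken from the checkpoint
fp16_err = abs(round_to_fp16(weight) - weight)
bf16_err = abs(round_to_bf16(weight) - weight)
print(f"fp16 error: {fp16_err:.3e}  bf16 error: {bf16_err:.3e}")
```

For this value the bf16 rounding error is several times the fp16 error, consistent with bf16 having three fewer mantissa bits.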
Parity
Verified against the PyTorch reference implementation:
- SDR: 44.19 dB between PyTorch and MLX outputs (target: > 40 dB per the upload guide). See the Precision section above for the cross-dtype comparison.
- PyTorch reference: bs_roformer==0.3.10, with bs_roformer.MelBandRoformer instantiated from the v1.0.0 training YAML with mask_estimator_depth=1 (the YAML default of 2 does not match the shipped weights; confirmed against state-dict shapes). Newer bs_roformer releases (0.4+) reorder the layers ModuleList and add nGPT-style normalization, breaking checkpoint compatibility.
- Test signal: 8-second stereo 44.1 kHz clip (mid-episode anime audio, mono-to-stereo duplication, music + dialogue mix).
- Reproduce: see mlx_audio/tests/sts/test_mel_roformer_parity.py and tests/sts/torch_infer.py in the xocialize/mlx-audio fork.
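As a rough guide to what the parity figure measures, the sketch below computes an SDR between a reference output and a test output using the plain energy-ratio form. This is an assumption: the exact definition used by the parity harness is not given in this card, and `sdr_db` is a hypothetical name.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """10 * log10(reference energy / error energy), in dB.

    Assumption: plain energy-ratio SDR; `eps` guards against division by
    zero when the two signals are identical.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))
```

Under this definition, a 40 dB target means the error energy between the PyTorch and MLX outputs is at most 1/10000 of the reference energy.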
Intended use
Vocal isolation for music source separation. Input is a stereo music mixture; output is a separated vocal stem. Trained for vocals; not validated for general-purpose source separation (drums, bass, other). Derive an instrumental stem as mixture - vocals if needed.
Files
- model.safetensors: MLX weights
- config.json: architecture hyperparameters
- LICENSE: MIT license text
Usage
Python (mlx-audio)
The Mel-Band-RoFormer architecture is included in mlx-audio>=0.4.3 (merged via PR #654 on 2026-04-27). Install with pip install "mlx-audio>=0.4.3".
import soundfile as sf
import numpy as np
import mlx.core as mx
from mlx_audio.sts.models.mel_roformer import MelRoFormer, MelRoFormerConfig
from mlx_audio.utils import load_audio
# 1. Load model + weights from the Hub.
model = MelRoFormer.from_pretrained(
"mlx-community/mel-roformer-zfturbo-vocals-v1-mlx",
config=MelRoFormerConfig.zfturbo_vocals_v1(), # optional if config.json is present
)
model.eval()
# 2. Load the input mixture as 44.1 kHz stereo and add a batch axis.
mixture = load_audio("input_mixture.wav", sample_rate=44100) # mx.array [2, samples]
batched = mixture[None, ...] # [1, 2, samples]
# 3. Separate vocals.
vocals = model(batched)[0] # [2, samples]
# 4. Derive instrumental as (mixture - vocals).
instrumental = mixture - vocals
# 5. Write stems to disk (soundfile expects [samples, channels]).
sf.write("vocals.wav", np.array(vocals).T, 44100)
sf.write("instrumental.wav", np.array(instrumental).T, 44100)
For long inputs, chunk the audio at chunk_size = 352800 samples with 50% overlap and overlap-add the outputs; see the model code for the canonical helper once it's added.
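Until a canonical helper exists, the chunking scheme can be approximated as follows. This is a minimal NumPy sketch, not the library's implementation: `separate_long` and `separate_chunk` are hypothetical names, and the Hann crossfade window with weight normalization is an assumption, since the card specifies only the chunk size, the 50% overlap, and overlap-add.

```python
import numpy as np


def separate_long(mixture, separate_chunk, chunk_size=352800, overlap=0.5):
    """Overlap-add separation of a [channels, samples] array.

    `separate_chunk` is a hypothetical callable mapping a
    [channels, chunk_size] array to an array of the same shape.
    """
    mixture = np.asarray(mixture, dtype=np.float32)
    channels, total = mixture.shape
    step = max(1, int(chunk_size * (1.0 - overlap)))

    # Number of chunks needed to cover the input (ceiling division), at least one.
    n_chunks = max(1, -(-(total - chunk_size) // step) + 1)
    padded_len = (n_chunks - 1) * step + chunk_size
    padded = np.zeros((channels, padded_len), dtype=np.float32)
    padded[:, :total] = mixture

    # Strictly positive Hann-shaped window, so the accumulated weight is never zero.
    window = np.hanning(chunk_size + 2)[1:-1].astype(np.float32)
    out = np.zeros_like(padded)
    weight = np.zeros(padded_len, dtype=np.float32)

    for i in range(n_chunks):
        start = i * step
        chunk = padded[:, start:start + chunk_size]
        out[:, start:start + chunk_size] += np.asarray(separate_chunk(chunk)) * window
        weight[start:start + chunk_size] += window

    # Normalize by the summed window weights and drop the zero padding.
    return (out / weight)[:, :total]
```

With the Python example above, `separate_chunk` could wrap the model call, for instance by converting each chunk to an mx.array, adding a batch axis, and converting the result back to NumPy. Dividing by the accumulated window weight keeps the output level correct at the start and end of the signal, where only one chunk contributes.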
Swift (mel-roformer-mlx-swift)
import SwiftRoFormer
// One-shot Hub download + weight load. The bundled config.json's
// `checkpoint_family` field auto-resolves the correct preset.
let model = try await MelRoFormer.fromPretrained(
"mlx-community/mel-roformer-zfturbo-vocals-v1-mlx"
)
// Forward pass: input is (batch, channels, samples) at 44100 Hz.
let vocals = model(input)
For lower-level control (explicit configuration, pre-downloaded weights, custom Hub clients), construct the model directly and pass a local URL to WeightLoader.loadWeights:
let model = MelRoFormer(config: .zfturboVocalsV1)
try WeightLoader.loadWeights(into: model, from: localWeightsURL)
Citation
If you use this checkpoint, please cite the Mel-Band-RoFormer paper and the ZFTurbo MSS-Training release:
@misc{lu2023melband,
title = {Mel-Band {RoFormer} for Music Source Separation},
author = {Lu, Wei-Tsung and Wang, Ju-Chiang and Won, Minz and Choi, Keunwoo and Song, Xuchen},
year = {2023},
eprint = {2310.01809},
archivePrefix = {arXiv},
primaryClass = {eess.AS},
url = {https://arxiv.org/abs/2310.01809}
}
@misc{zfturbo_mss_training,
title = {Music Source Separation Training: Mel-Band-RoFormer vocals\_v1 (SDR 8.42)},
author = {Solovyev, Roman (ZFTurbo)},
year = {2023--2026},
url = {https://github.com/ZFTurbo/Music-Source-Separation-Training}
}
The MLX implementation lineage is lucidrains/BS-RoFormer (MIT) → Blaizzy/mlx-audio (Apache-2.0).