mlx-community/mel-roformer-zfturbo-vocals-v1-mlx

This model was converted to MLX format from the v1.0.0 release asset model_vocals_mel_band_roformer_sdr_8.42.ckpt of ZFTurbo/Music-Source-Separation-Training using a custom Mel-Band-RoFormer MLX port (the Blaizzy/mlx-audio PR + the xocialize/mel-roformer-mlx-swift Swift consumer). Refer to the original repository for more details on the model.

Model

Family: Mel-Band-RoFormer (Lu, Wang, Won, "Mel-Band RoFormer for Music Source Separation," arXiv:2310.01809)
Checkpoint: ZFTurbo vocals_v1 (v1.0.0 release, training-time SDR 8.42)
Parameters: ~33.7M
Sample rate: 44100 Hz, stereo
Stems produced: vocals (single-stem model — derive instrumental as mixture - vocals)
Chunk size: 352800 samples (8 s at 44.1 kHz), 50% overlap
STFT: n_fft=2048, hop_length=512, win_length=2048
Transformer: dim=192, depth=8, heads=8, dim_head=64
Bands: 60 mel bands
Mask estimator depth: 1
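A few derived quantities follow from the figures above and are useful when wiring up chunked inference. The sketch below works them out. The frame count assumes a centered STFT (one frame per hop, plus one), which is a common convention but is not stated in this card.

```python
# Derived quantities for the preset listed above.
SAMPLE_RATE = 44100
CHUNK_SIZE = 352800
OVERLAP = 0.5
N_FFT = 2048
HOP_LENGTH = 512

chunk_seconds = CHUNK_SIZE / SAMPLE_RATE         # 8.0 s of audio per chunk
chunk_hop = int(CHUNK_SIZE * (1 - OVERLAP))      # 176400 samples between chunk starts
freq_bins = N_FFT // 2 + 1                       # 1025 one-sided frequency bins
# Assumption: centered STFT, so frames = 1 + samples // hop_length.
frames_per_chunk = 1 + CHUNK_SIZE // HOP_LENGTH  # 690 frames

print(chunk_seconds, chunk_hop, freq_bins, frames_per_chunk)
```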

This preset matches the v1.0.0 release asset specifically. It is smaller than the Kim Vocal 2 / viperx Mel-Band-RoFormer configurations and uses a different hop_length (512 vs. 441). One detail was confirmed against the published state-dict shapes: the YAML default mask_estimator_depth=2 does not match this checkpoint, because the shipped weights were trained with mask_estimator_depth=1.

The smaller parameter count gives faster inference (~27% lower RTF in our internal anime-dialogue test) at roughly 95% of Kim Vocal 2's separation strength. Pick this preset when model size or inference latency matters more than top-end separation quality.
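For readers who want to run a comparable speed check on their own material: RTF (real-time factor) is processing time divided by audio duration, so lower is faster. The helper below is a minimal sketch, not the benchmark used for the figure above, whose exact method is not described here. `real_time_factor` and its parameters are illustrative names.

```python
import time

def real_time_factor(process, audio_seconds, repeats=3):
    """Return best-of-N wall-clock time of process() divided by audio duration."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # With MLX, process() should force evaluation (for example via mx.eval
        # on the output), otherwise lazy computation is not included in the timing.
        process()
        best = min(best, time.perf_counter() - start)
    return best / audio_seconds
```

An RTF of 0.25 on an 8-second chunk would mean the chunk was separated in 2 seconds.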

Full hyperparameters in config.json.

Source

Original repository: https://github.com/ZFTurbo/Music-Source-Separation-Training
Original author: Roman Solovyev (@ZFTurbo)
Original license: MIT (inherited from the MSS-Training repository LICENSE)
Original release: v1.0.0
Source file: model_vocals_mel_band_roformer_sdr_8.42.ckpt

The MIT license attaches to release assets distributed by an MIT-licensed GitHub repository. ZFTurbo's MSS-Training repository ships its release checkpoints directly from that MIT-licensed repo, so the weights inherit the repository's MIT terms.

License

This redistribution is MIT-licensed, matching the current license of the source repository. See LICENSE.

License timeline

For full transparency: the MSS-Training repository's LICENSE file (MIT) was committed on 2024-11-04 (commit 6149a22), approximately one year after the v1.0.0 release (published 2023-11-06) that produced this checkpoint asset. At the moment the v1.0.0 release was published, the repository did not yet carry an explicit LICENSE file.

This redistribution treats the weights as inheriting the repository's current MIT terms, on the basis that:

The asset is still distributed today by the now-MIT-licensed repository.
Adding a LICENSE file is the codified expression of the author's grant; nothing in the repository indicates a different license intent ever applied to the v1.0.0 release assets specifically.
The author (Roman Solovyev / @ZFTurbo) is an active OSS maintainer who has continued to publish work in the same repository under MIT.

If a downstream redistributor wants a stricter trail (i.e., explicit author confirmation that the v1.0.0 release artifacts have been MIT-licensed from the start), contacting the author directly is the recommended path. We did not obtain that confirmation before this redistribution.

Conversion

Tool: mlx_audio.sts.models.mel_roformer.convert — merged upstream into Blaizzy/mlx-audio via PR #654 (2026-04-27) and shipped in mlx-audio==0.4.3 and later.
Tool version at conversion time: 8380ab8 on the feat/mel-band-roformer branch (xocialize fork) — this is the exact commit that produced model.safetensors. The merged upstream code is functionally equivalent.
mlx (Python) version: 0.31.0
Architecture preset: MelRoFormerConfig.zfturbo_vocals_v1()
Output precision: float16 (see Precision below for the rationale — bf16 was tested first per the upload guide but failed parity at this small model size)
Source model_vocals_mel_band_roformer_sdr_8.42.ckpt SHA-256: d9ce706b49cebf0af018590d8deb47ad5434987bf8f7bd3a87a4e5e8c30acb26
Conversion date: 2026-04-25
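Before re-running the conversion, it is worth checking that a downloaded checkpoint matches the SHA-256 listed above. A minimal sketch using only the standard library; the function names are illustrative, and the file path is wherever you saved the release asset.

```python
import hashlib

EXPECTED_SHA256 = "d9ce706b49cebf0af018590d8deb47ad5434987bf8f7bd3a87a4e5e8c30acb26"

def sha256_of_file(path, block_size=1 << 20):
    """Hash a file in 1 MiB blocks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """True if the file at `path` has the expected SHA-256 digest."""
    return sha256_of_file(path) == expected
```

For example, `verify_checkpoint("model_vocals_mel_band_roformer_sdr_8.42.ckpt")` should return True for an unmodified copy of the source file.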

Precision

This repository ships fp16 weights, while the sibling mlx-community/mel-roformer-kim-vocal-2-mlx ships bf16. The reason is empirical: casting this checkpoint to bf16 drops its parity with the PyTorch reference below the upload acceptance threshold, while Kim's wider architecture absorbs bf16 truncation cleanly. A three-way comparison on the same parity harness and the same input chunk:

| Precision | File size | SDR vs. PyTorch reference | Verdict |
| --- | --- | --- | --- |
| fp32 | 128.5 MB | 74.86 dB | Near-exact match (baseline) |
| fp16 | 64.3 MB | 44.19 dB | Published; exceeds the 40 dB target |
| bf16 | 64.3 MB | 21.96 dB | Below target; not published |

The dim=192 / 33.7M-parameter architecture is more sensitive to bf16's narrow 7-bit mantissa than wider Mel-Band-RoFormer presets. fp16 retains more mantissa bits (10) at the same file size and recovers parity. fp32 was rejected as the published precision only on size grounds; if a downstream user prefers fp32 fidelity over the 64 MB savings, the upstream PyTorch checkpoint is freely available from the v1.0.0 release and can be re-converted via mlx_audio.sts.models.mel_roformer.convert with --dtype float32.
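The mantissa argument can be illustrated in isolation. The sketch below rounds synthetic Gaussian values (a stand-in, not the real checkpoint) to fp16 and to an emulated bf16, then compares the rounding noise. It measures weight rounding error only and does not reproduce the output-parity figures in the table; `to_bf16` and `rounding_snr_db` are illustrative helpers.

```python
import numpy as np

def to_bf16(x):
    """Emulate a float32 -> bfloat16 cast (round to nearest even), returned as float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
    round_bias = ((bits >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    return ((bits + round_bias) & np.uint32(0xFFFF0000)).view(np.float32)

def rounding_snr_db(reference, rounded):
    """Signal-to-rounding-noise ratio in dB, computed in float64."""
    ref = np.asarray(reference, dtype=np.float64)
    err = ref - np.asarray(rounded, dtype=np.float64)
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

rng = np.random.default_rng(0)
# Synthetic stand-in for weights; the scale 0.02 is an arbitrary assumption.
weights = (0.02 * rng.standard_normal(100_000)).astype(np.float32)

snr_fp16 = rounding_snr_db(weights, weights.astype(np.float16))
snr_bf16 = rounding_snr_db(weights, to_bf16(weights))
print(f"fp16: {snr_fp16:.1f} dB, bf16: {snr_bf16:.1f} dB, gap: {snr_fp16 - snr_bf16:.1f} dB")
```

Three extra mantissa bits correspond to roughly 18 dB less rounding noise on the stored values themselves (about 6 dB per bit). How that noise propagates through the network is what the parity harness measures.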

Parity

Verified against the PyTorch reference implementation:

SDR: 44.19 dB between PyTorch and MLX outputs (target: > 40 dB per the upload guide). See the Precision section above for the cross-dtype comparison.
PyTorch reference: bs_roformer==0.3.10 — bs_roformer.MelBandRoformer instantiated from the v1.0.0 training YAML with mask_estimator_depth=1 (the YAML default 2 does not match the shipped weights — confirmed against state-dict shapes). Newer bs_roformer releases (0.4+) reorder the layers ModuleList and add nGPT-style normalization, breaking checkpoint compatibility.
Test signal: 8-second stereo 44.1 kHz clip (mid-episode anime audio, mono → stereo duplication, music + dialogue mix).
Reproduce: see mlx_audio/tests/sts/test_mel_roformer_parity.py and tests/sts/torch_infer.py in the xocialize/mlx-audio fork.
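For a quick check outside the test suite, the parity metric can be sketched as a plain signal-to-distortion ratio between the two outputs. This assumes the textbook definition, 10·log10(‖ref‖² / ‖ref − est‖²); the harness referenced above may differ in details such as per-channel averaging. `sdr_db` and `passes_parity` are illustrative names.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """SDR in dB of `estimate` against `reference`, computed in float64."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero when the outputs are identical.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

def passes_parity(reference, estimate, target_db=40.0):
    """True if the SDR exceeds the acceptance target (40 dB in this card)."""
    return sdr_db(reference, estimate) > target_db
```

Here `reference` would be the PyTorch output and `estimate` the MLX output for the same input chunk, both converted to NumPy arrays of identical shape.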

Intended use

Vocal isolation for music source separation. Input is a stereo music mixture; output is a separated vocal stem. Trained for vocals; not validated for general-purpose source separation (drums, bass, other). Derive an instrumental stem as mixture - vocals if needed.

Files

model.safetensors — MLX weights
config.json — architecture hyperparameters
LICENSE — MIT license text

Usage

Python (mlx-audio)

The Mel-Band-RoFormer architecture is included in mlx-audio>=0.4.3 (merged via PR #654 on 2026-04-27). Install with pip install "mlx-audio>=0.4.3".

import soundfile as sf
import numpy as np
import mlx.core as mx

from mlx_audio.sts.models.mel_roformer import MelRoFormer, MelRoFormerConfig
from mlx_audio.utils import load_audio

# 1. Load model + weights from the Hub.
model = MelRoFormer.from_pretrained(
    "mlx-community/mel-roformer-zfturbo-vocals-v1-mlx",
    config=MelRoFormerConfig.zfturbo_vocals_v1(),  # optional if config.json is present
)
model.eval()

# 2. Load the input mixture as 44.1 kHz stereo and add a batch axis.
mixture = load_audio("input_mixture.wav", sample_rate=44100)  # mx.array [2, samples]
batched = mixture[None, ...]                                  # [1, 2, samples]

# 3. Separate vocals.
vocals = model(batched)[0]                                    # [2, samples]

# 4. Derive instrumental as (mixture - vocals).
instrumental = mixture - vocals

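# 5. Write stems to disk (soundfile expects [samples, channels]).
#    Cast to float32 first: soundfile accepts float32/float64/int16/int32 arrays,
#    so a float16 output would be rejected.
sf.write("vocals.wav",       np.array(vocals.astype(mx.float32)).T,       44100)
sf.write("instrumental.wav", np.array(instrumental.astype(mx.float32)).T, 44100)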

For long inputs, chunk the audio at chunk_size = 352800 samples with 50% overlap and overlap-add the outputs. A canonical helper for this has not yet been added to the model code.
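Until such a helper exists, the chunking scheme described above can be sketched in NumPy. This is an illustrative implementation, not the upstream one. The Hann window, the zero padding, and the name `overlap_add_separate` are assumptions, and `separate_fn` is a hypothetical callable mapping a [channels, chunk_size] array to an array of the same shape.

```python
import numpy as np

def overlap_add_separate(separate_fn, mixture, chunk_size=352800, overlap=0.5):
    """Run separate_fn on overlapping windows and blend the results.

    mixture: float array of shape [channels, samples].
    """
    mixture = np.asarray(mixture, dtype=np.float32)
    hop = int(round(chunk_size * (1.0 - overlap)))
    assert 0 < hop < chunk_size, "overlap must be strictly between 0 and 1"
    n = mixture.shape[-1]

    # Pad one hop of silence on each side so every real sample sits where
    # window weights are well away from zero, then round up to whole hops.
    extra_hops = max(0, -(-(2 * hop + n - chunk_size) // hop))  # ceiling division
    total = chunk_size + extra_hops * hop
    padded = np.zeros((mixture.shape[0], total), dtype=np.float32)
    padded[:, hop:hop + n] = mixture

    window = np.hanning(chunk_size + 1)[:-1].astype(np.float32)  # periodic Hann
    out = np.zeros_like(padded)
    weight = np.zeros(total, dtype=np.float32)

    for start in range(0, total - chunk_size + 1, hop):
        chunk = padded[:, start:start + chunk_size]
        estimate = np.asarray(separate_fn(chunk), dtype=np.float32)
        out[:, start:start + chunk_size] += estimate * window
        weight[start:start + chunk_size] += window

    # Normalize by the accumulated window weight, then drop the padding.
    out /= np.maximum(weight, 1e-8)
    return out[:, hop:hop + n]
```

With the model from the example above, `separate_fn` could be something like `lambda c: np.array(model(mx.array(c)[None])[0].astype(mx.float32))`. Dividing by the accumulated window weight keeps the output level correct at overlap ratios other than 50% as well.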

Swift (mel-roformer-mlx-swift)

import SwiftRoFormer

// One-shot Hub download + weight load. The bundled config.json's
// `checkpoint_family` field auto-resolves the correct preset.
let model = try await MelRoFormer.fromPretrained(
    "mlx-community/mel-roformer-zfturbo-vocals-v1-mlx"
)

// Forward pass: input is (batch, channels, samples) at 44100 Hz.
let vocals = model(input)

For lower-level control (explicit configuration, pre-downloaded weights, custom Hub clients), construct the model directly and pass a local URL to WeightLoader.loadWeights:

let model = MelRoFormer(config: .zfturboVocalsV1)
try WeightLoader.loadWeights(into: model, from: localWeightsURL)

Citation

If you use this checkpoint, please cite the Mel-Band-RoFormer paper and the ZFTurbo MSS-Training release:

@misc{lu2023melband,
  title         = {Mel-Band {RoFormer} for Music Source Separation},
  author        = {Lu, Wei-Tsung and Wang, Ju-Chiang and Won, Minz and Choi, Keunwoo and Song, Xuchen},
  year          = {2023},
  eprint        = {2310.01809},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2310.01809}
}

@misc{zfturbo_mss_training,
  title  = {Music Source Separation Training — Mel-Band-RoFormer vocals\_v1 (SDR 8.42)},
  author = {Solovyev, Roman (ZFTurbo)},
  year   = {2023--2026},
  url    = {https://github.com/ZFTurbo/Music-Source-Separation-Training}
}

The MLX implementation lineage is lucidrains/BS-RoFormer (MIT) → Blaizzy/mlx-audio (Apache-2.0).

