mlx-community/mel-roformer-zfturbo-vocals-v1-mlx

This model was converted to MLX format from the v1.0.0 release asset model_vocals_mel_band_roformer_sdr_8.42.ckpt of ZFTurbo/Music-Source-Separation-Training using a custom Mel-Band-RoFormer MLX port (the Blaizzy/mlx-audio PR + the xocialize/mel-roformer-mlx-swift Swift consumer). Refer to the original repository for more details on the model.

Model

  • Family: Mel-Band-RoFormer (Lu, Wang, Won, "Mel-Band RoFormer for Music Source Separation," arXiv:2310.01809)
  • Checkpoint: ZFTurbo vocals_v1 (v1.0.0 release, training-time SDR 8.42)
  • Parameters: ~33.7M
  • Sample rate: 44100 Hz, stereo
  • Stems produced: vocals (single-stem model; derive instrumental as mixture - vocals)
  • Chunk size: 352800 samples (8 s at 44.1 kHz), 50% overlap
  • STFT: n_fft=2048, hop_length=512, win_length=2048
  • Transformer: dim=192, depth=8, heads=8, dim_head=64
  • Bands: 60 mel bands
  • Mask estimator depth: 1
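
The chunk and STFT settings above imply a few derived sizes that are handy when allocating buffers. The sketch below works them out from the listed values. The frame count assumes a centered STFT (the torch.stft default); that is an assumption, not something this card states.

```python
# Derived sizes from the hyperparameters listed above.
SAMPLE_RATE = 44100
CHUNK_SIZE = 352800
OVERLAP = 0.5
N_FFT = 2048
HOP_LENGTH = 512

chunk_seconds = CHUNK_SIZE / SAMPLE_RATE        # duration of one chunk in seconds
chunk_step = int(CHUNK_SIZE * (1 - OVERLAP))    # samples between consecutive chunk starts
freq_bins = N_FFT // 2 + 1                      # one-sided STFT frequency bins

# Assumption: centered STFT, i.e. the signal is padded by n_fft // 2 on each side.
frames_per_chunk = 1 + CHUNK_SIZE // HOP_LENGTH

print(chunk_seconds, chunk_step, freq_bins, frames_per_chunk)
```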

This preset matches the v1.0.0 release asset specifically. It is smaller and uses a different hop_length (512 vs. 441) than Kim Vocal 2 / viperx Mel-Band-RoFormer configurations. The configuration was confirmed against the published state-dict shapes: the YAML default mask_estimator_depth=2 does not match this checkpoint, because the shipped weights were trained with mask_estimator_depth=1.

The smaller parameter count gives faster inference (~27% lower RTF in our internal anime-dialogue test) at roughly 95% of Kim Vocal 2's separation strength. Pick this preset when model size or inference latency matters more than top-end separation quality.
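
Real-time factor (RTF) is processing time divided by audio duration, so lower is faster. A minimal way to measure it is sketched below. The `separate` callable and its [channels, samples] input are illustrative assumptions, not part of mlx-audio.

```python
import time

def real_time_factor(separate, audio, sample_rate=44100):
    """Return wall-clock processing time divided by audio duration."""
    n_samples = len(audio[0])              # audio is laid out as [channels, samples]
    start = time.perf_counter()
    separate(audio)                        # hypothetical separation callable
    elapsed = time.perf_counter() - start
    return elapsed / (n_samples / sample_rate)
```

Because MLX evaluates lazily, a callable wrapping an MLX model should force evaluation (for example with mx.eval) before returning; otherwise the measured time can undercount the real work.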

Full hyperparameters in config.json.

Source

The MIT license attaches to release assets distributed by an MIT-licensed GitHub repository. ZFTurbo's MSS-Training repository ships its release checkpoints directly from that MIT-licensed repo, so the weights inherit the repository's MIT terms.

License

This redistribution is MIT-licensed, matching the current license of the source repository. See LICENSE.

License timeline

For full transparency: the MSS-Training repository's LICENSE file (MIT) was committed on 2024-11-04 (commit 6149a22), approximately one year after the v1.0.0 release (published 2023-11-06) that produced this checkpoint asset. At the moment the v1.0.0 release was published, the repository did not yet carry an explicit LICENSE file.

This redistribution treats the weights as inheriting the repository's current MIT terms, on the basis that:

  1. The asset is still distributed today by the now-MIT-licensed repository.
  2. Adding a LICENSE file is the codified expression of the author's grant; nothing in the repository indicates a different license intent ever applied to the v1.0.0 release assets specifically.
  3. The author (Roman Solovyev / @ZFTurbo) is an active OSS maintainer who has continued to publish work in the same repository under MIT.

If a downstream redistributor wants a stricter trail (i.e., explicit author confirmation that the v1.0.0 release artifacts were always-MIT), reaching out to the author directly is the recommended path. We did not collect that confirmation prior to this redistribution.

Conversion

  • Tool: mlx_audio.sts.models.mel_roformer.convert, merged upstream into Blaizzy/mlx-audio via PR #654 (2026-04-27) and shipped in mlx-audio==0.4.3 and later.
  • Tool version at conversion time: 8380ab8 on the feat/mel-band-roformer branch (xocialize fork). This is the exact commit that produced model.safetensors; the merged upstream code is functionally equivalent.
  • mlx (Python) version: 0.31.0
  • Architecture preset: MelRoFormerConfig.zfturbo_vocals_v1()
  • Output precision: float16 (see Precision below for the rationale; bf16 was tested first per the upload guide but failed parity at this small model size)
  • Source model_vocals_mel_band_roformer_sdr_8.42.ckpt SHA-256: d9ce706b49cebf0af018590d8deb47ad5434987bf8f7bd3a87a4e5e8c30acb26
  • Conversion date: 2026-04-25
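
To confirm that a downloaded source checkpoint is the one listed above, its digest can be compared with the published SHA-256. This is a minimal sketch using only the standard library; the function names are illustrative.

```python
import hashlib

# Digest of model_vocals_mel_band_roformer_sdr_8.42.ckpt as listed in this card.
EXPECTED_SHA256 = "d9ce706b49cebf0af018590d8deb47ad5434987bf8f7bd3a87a4e5e8c30acb26"

def sha256_of_file(path, block_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size blocks so large checkpoints do not need to fit in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True when the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected
```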

Precision

This repository ships fp16 weights, while the sibling mlx-community/mel-roformer-kim-vocal-2-mlx ships bf16. The reason is empirical: casting to bf16 pushes this checkpoint's parity against the PyTorch reference below the upload acceptance threshold, while Kim's wider architecture absorbs bf16 truncation cleanly. The three-way comparison below uses the same parity harness and the same input chunk:

Precision | File size | SDR vs PyTorch reference | Verdict
fp32      | 128.5 MB  | 74.86 dB                 | Bit-exact-equivalent (baseline)
fp16      | 64.3 MB   | 44.19 dB                 | Published (> 40 dB target)
bf16      | 64.3 MB   | 21.96 dB                 | Below target (not published)

The dim=192 / 33.7M-parameter architecture is more sensitive to bf16's narrow 7-bit mantissa than wider Mel-Band-RoFormer presets. fp16 retains more mantissa bits (10) at the same file size and recovers parity. fp32 was rejected as the published precision only on size grounds; if a downstream user prefers fp32 fidelity over the 64 MB savings, the upstream PyTorch checkpoint is freely available from the v1.0.0 release and can be re-converted via mlx_audio.sts.models.mel_roformer.convert with --dtype float32.
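
The mantissa argument can be illustrated in isolation by round-tripping synthetic weights through each format and measuring the rounding noise. The sketch below emulates bf16 in NumPy by rounding the low 16 bits of the float32 pattern, since NumPy has no native bfloat16. It uses random stand-in weights, not this checkpoint, and measures weight rounding error only, not the end-to-end parity SDR in the table. Each extra mantissa bit is worth about 6 dB, so a gap of roughly 18 dB between the two formats is the expected order of magnitude.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to bfloat16 precision (round-to-nearest-even)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Add half of the discarded range, plus the tie-breaking bit, then drop the low 16 bits.
    rounding = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + rounding) & np.uint32(0xFFFF0000)).view(np.float32)

def snr_db(reference, approx):
    """Signal-to-noise ratio in dB of `approx` relative to `reference`."""
    ref = reference.astype(np.float64)
    noise = ref - approx.astype(np.float64)
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

# Synthetic stand-in weights (assumption: small, roughly normal values).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)

snr_fp16 = snr_db(weights, weights.astype(np.float16).astype(np.float32))
snr_bf16 = snr_db(weights, to_bf16(weights))
print(f"fp16 round-trip: {snr_fp16:.1f} dB, bf16 round-trip: {snr_bf16:.1f} dB")
```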

Parity

Verified against the PyTorch reference implementation:

  • SDR: 44.19 dB between PyTorch and MLX outputs (target: > 40 dB per the upload guide). See the Precision section above for the cross-dtype comparison.
  • PyTorch reference: bs_roformer==0.3.10, with bs_roformer.MelBandRoformer instantiated from the v1.0.0 training YAML with mask_estimator_depth=1 (the YAML default of 2 does not match the shipped weights, as confirmed against state-dict shapes). Newer bs_roformer releases (0.4+) reorder the layers ModuleList and add nGPT-style normalization, breaking checkpoint compatibility.
  • Test signal: 8-second stereo 44.1 kHz clip (mid-episode anime audio, mono → stereo duplication, music + dialogue mix).
  • Reproduce: see mlx_audio/tests/sts/test_mel_roformer_parity.py and tests/sts/torch_infer.py in the xocialize/mlx-audio fork.
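
For readers who want to check their own conversion, an SDR of the kind quoted above can be computed by treating the PyTorch output as the reference and the MLX output as the estimate. This is a minimal sketch of the standard energy-ratio definition; the exact formula used by the parity harness is an assumption here, so consult the test files for the authoritative version.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB between two equal-shaped arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero when the two outputs are identical.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))
```

Under this definition, a 1% amplitude error on every sample corresponds to 40 dB, which gives a feel for how tight the > 40 dB target is.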

Intended use

Vocal isolation for music source separation. Input is a stereo music mixture; output is a separated vocal stem. Trained for vocals; not validated for general-purpose source separation (drums, bass, other). Derive an instrumental stem as mixture - vocals if needed.

Files

  • model.safetensors: MLX weights
  • config.json: architecture hyperparameters
  • LICENSE: MIT license text

Usage

Python (mlx-audio)

The Mel-Band-RoFormer architecture is included in mlx-audio>=0.4.3 (merged via PR #654 on 2026-04-27). Install with pip install "mlx-audio>=0.4.3".

import soundfile as sf
import numpy as np
import mlx.core as mx

from mlx_audio.sts.models.mel_roformer import MelRoFormer, MelRoFormerConfig
from mlx_audio.utils import load_audio

# 1. Load model + weights from the Hub.
model = MelRoFormer.from_pretrained(
    "mlx-community/mel-roformer-zfturbo-vocals-v1-mlx",
    config=MelRoFormerConfig.zfturbo_vocals_v1(),  # optional if config.json is present
)
model.eval()

# 2. Load the input mixture as 44.1 kHz stereo and add a batch axis.
mixture = load_audio("input_mixture.wav", sample_rate=44100)  # mx.array [2, samples]
batched = mixture[None, ...]                                  # [1, 2, samples]

# 3. Separate vocals.
vocals = model(batched)[0]                                    # [2, samples]

# 4. Derive instrumental as (mixture - vocals).
instrumental = mixture - vocals

# 5. Write stems to disk (soundfile expects [samples, channels]).
sf.write("vocals.wav",       np.array(vocals).T,       44100)
sf.write("instrumental.wav", np.array(instrumental).T, 44100)

For long inputs, chunk the audio at chunk_size = 352800 samples with 50% overlap and overlap-add the outputs; see the model code for the canonical helper once it's added.
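
Until a canonical helper exists, the chunking can be sketched in NumPy as below. This is an illustrative implementation, not the upstream one: `separate_long` and `separate_fn` are hypothetical names, and the Hann cross-fade window is an assumed choice that the eventual helper may not share.

```python
import numpy as np

def separate_long(mixture, separate_fn, chunk_size=352800, overlap=0.5):
    """Run a per-chunk separator over a [channels, samples] array with overlap-add."""
    channels, total = mixture.shape
    hop = int(chunk_size * (1.0 - overlap))

    # Zero-pad so that the last chunk is full length and every sample is covered.
    n_chunks = int(np.ceil(max(total - chunk_size, 0) / hop)) + 1
    padded_len = (n_chunks - 1) * hop + chunk_size
    padded = np.zeros((channels, padded_len), dtype=np.float32)
    padded[:, :total] = mixture

    # Hann window with its zero endpoints dropped, so the weights are strictly positive.
    window = np.hanning(chunk_size + 2)[1:-1].astype(np.float32)

    out = np.zeros_like(padded)
    weight = np.zeros(padded_len, dtype=np.float32)
    for i in range(n_chunks):
        start = i * hop
        chunk = padded[:, start:start + chunk_size]
        estimate = np.asarray(separate_fn(chunk), dtype=np.float32)
        out[:, start:start + chunk_size] += estimate * window
        weight[start:start + chunk_size] += window

    # Normalize by the accumulated window weight and trim the padding.
    return (out / weight)[:, :total]
```

Assuming the model call shown in the Python example above, `separate_fn` could be wired up as `lambda c: np.array(model(mx.array(c)[None, ...])[0])`, which adds the batch axis, runs the model, and converts the result back to NumPy.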

Swift (mel-roformer-mlx-swift)

import SwiftRoFormer

// One-shot Hub download + weight load. The bundled config.json's
// `checkpoint_family` field auto-resolves the correct preset.
let model = try await MelRoFormer.fromPretrained(
    "mlx-community/mel-roformer-zfturbo-vocals-v1-mlx"
)

// Forward pass: input is (batch, channels, samples) at 44100 Hz.
let vocals = model(input)

For lower-level control (explicit configuration, pre-downloaded weights, custom Hub clients), construct the model directly and pass a local URL to WeightLoader.loadWeights:

let model = MelRoFormer(config: .zfturboVocalsV1)
try WeightLoader.loadWeights(into: model, from: localWeightsURL)

Citation

If you use this checkpoint, please cite the Mel-Band-RoFormer paper and the ZFTurbo MSS-Training release:

@misc{lu2023melband,
  title         = {Mel-Band {RoFormer} for Music Source Separation},
  author        = {Lu, Wei-Tsung and Wang, Ju-Chiang and Won, Minz and Choi, Keunwoo and Song, Xuchen},
  year          = {2023},
  eprint        = {2310.01809},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2310.01809}
}

@misc{zfturbo_mss_training,
  title  = {Music Source Separation Training: Mel-Band-RoFormer vocals\_v1 (SDR 8.42)},
  author = {Solovyev, Roman (ZFTurbo)},
  year   = {2023--2026},
  url    = {https://github.com/ZFTurbo/Music-Source-Separation-Training}
}

The MLX implementation lineage is lucidrains/BS-RoFormer (MIT) → Blaizzy/mlx-audio (Apache-2.0).
