---
tags:
- music-aesthetics
- audio-scoring
- quality-estimation
- whisper-features
library_name: torch
pipeline_tag: audio-classification
license: cc-by-4.0
datasets:
- laion/captioned-ai-music-snippets
---

# Music Aesthetics Scorer 

This model predicts **aesthetic quality scores** for music audio files. It is a "Mixture of Experts" model that takes embeddings extracted from the https://huggingface.co/laion/music-whisper model and predicts 5 metrics defined by the SongEval dataset.

## The Metrics
The model rates audio on a scale of **1.0 to 5.0** for the following qualities:

1. **Naturalness**: Does the audio sound like a realistic, human performance? Low scores indicate synthetic artifacts, robotic voices, or obvious generation glitches.
2. **Clarity**: The quality of the production and mixing. Is the audio distinct and clear, or muddy and noisy?
3. **Musicality**: The overall musical quality. Does it follow musical rules (harmony, rhythm) and is it pleasant to listen to?
4. **Coherence**: The structural logic of the piece. Does the music progress naturally, or does it feel random and disjointed?
5. **Memorability**: The "catchiness" of the track. How distinct and memorable is the melody or rhythm?

**Overall Aesthetics Score:** This is calculated by taking the average of the 5 individual metric scores.

## Architecture
This model operates on top of the OpenAI Whisper encoder (specifically the fine-tuned version linked above).

1. **Audio Encoder**: 30 seconds of audio $\to$ Whisper Encoder $\to$ Last Hidden State `(1, 1500, 768)`.
2. **Feature Extraction**: 
    - The 1500 frames are split into 10 segments.
    - For each segment, we calculate `Mean`, `Max`, and `Min` pooling.
    - These are concatenated and flattened into a single vector of size **23,040**.
3. **Aesthetics Model**:
    - **Bottleneck**: Reduces the 23k input to 256 dimensions (Shared knowledge).
    - **Expert Heads**: 5 separate MLPs that predict the specific score for each metric.

## Inference Example

To use this model, you need `librosa`, `transformers`, `torch`, and `huggingface_hub`.

```python
import os
import torch
import numpy as np
import librosa
from huggingface_hub import hf_hub_download
from transformers import WhisperModel, WhisperProcessor
from model_architecture import MusicAestheticsModel # Downloaded from this repo

# Configuration
# 1. The Audio Encoder (The Music Whisper Model)
WHISPER_REPO = "laion/music-whisper"
# 2. This Aesthetics Model
AESTHETICS_REPO = "laion/music-aesthetics"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def load_models():
    print("Loading Whisper Encoder...")
    processor = WhisperProcessor.from_pretrained(WHISPER_REPO)
    # We only need the encoder part of Whisper
    whisper = WhisperModel.from_pretrained(WHISPER_REPO).encoder.to(DEVICE)
    whisper.eval()

    print("Loading Aesthetics Experts...")
    # Initialize the architecture
    model = MusicAestheticsModel().to(DEVICE)
    
    # Download and load weights
    # 1. Load Shared Bottleneck
    bt_path = hf_hub_download(repo_id=AESTHETICS_REPO, filename="stage1_bottleneck.pt")
    model.bottleneck.load_state_dict(torch.load(bt_path, map_location=DEVICE))
    
    # 2. Load Expert Heads
    for metric in model.metrics:
        head_path = hf_hub_download(repo_id=AESTHETICS_REPO, filename=f"expert_{metric}.pt")
        model.heads[metric].load_state_dict(torch.load(head_path, map_location=DEVICE))
    
    model.eval()
    return processor, whisper, model

def predict_score(audio_path, processor, whisper, aesthetic_model):
    # 1. Load and Preprocess Audio
    # Resample to 16kHz and pad/crop to exactly 30s
    audio, sr = librosa.load(audio_path, sr=16000)
    target_len = 16000 * 30
    if len(audio) > target_len:
        start = (len(audio) - target_len) // 2
        audio = audio[start : start + target_len]
    else:
        audio = np.pad(audio, (0, target_len - len(audio)))
        
    # 2. Extract Whisper Features
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        # Get last hidden state from encoder
        outputs = whisper(inputs.input_features.to(DEVICE))
        last_hidden = outputs.last_hidden_state # (1, 1500, 768)
        
        # 3. Apply Feature Pooling (Expert Model Logic)
        # Reshape to (1, 10 segments, 150 frames, 768 dim)
        feats = last_hidden.view(1, 10, 150, 768)
        
        mean_pool = torch.mean(feats, dim=2)        
        max_pool = torch.max(feats, dim=2).values   
        min_pool = torch.min(feats, dim=2).values   
        
        # Concat -> Flatten -> (23040,)
        concat = torch.cat([mean_pool, max_pool, min_pool], dim=2)
        embedding = concat.view(-1).unsqueeze(0) # Add batch dim

    # 4. Predict Scores
    with torch.no_grad():
        outputs = aesthetic_model(embedding)
    
    results = {k: v.item() for k, v in outputs.items()}
    
    # Calculate Average Global Score
    avg_score = sum(results.values()) / len(results)
    results["Overall_Aesthetics"] = avg_score
    
    return results

# Example Usage
if __name__ == "__main__":
    processor, whisper, model = load_models()
    
    # Replace with your audio file
    audio_file = "test_song.mp3" 
    
    if os.path.exists(audio_file):
        scores = predict_score(audio_file, processor, whisper, model)
        
        print("-" * 30)
        print(f"Aesthetics Analysis for {audio_file}")
        print("-" * 30)
        for metric, score in scores.items():
            print(f"{metric:<20}: {score:.2f} / 5.0")
        print("-" * 30)
    else:
        print("Please provide a valid audio file path.")
```