---
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language:
- es
base_model:
- BSC-LT/MrBERT-es
tags:
- biomedical
- spanish
- bi-encoder
- sentence-transformers
- embeddings
- retrieval
---

# ALIA MrBERT Spanish Biomedical Embeddings Model

This repository contains **ALIA MrBERT Spanish Biomedical Embeddings**, a Spanish biomedical domain bi-encoder model for semantic similarity and information retrieval tasks. It is built upon [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es), a bilingual (Spanish-English) foundational language model based on the ModernBERT architecture, and fine-tuned on domain-specific biomedical data using a Curriculum Learning strategy.

> [!WARNING]
> **DISCLAIMER:** This model is a domain-specific proof-of-concept designed to demonstrate retrieval capabilities in the Spanish biomedical domain.
> While optimized for this domain, results should be verified against official clinical sources and expert judgment. The model may fail on out-of-domain or adversarial inputs.

---

## Model Details

### Model Lineage

```
ModernBERT (architecture)
       ↓
  MrBERT-es (BSC-LT)
  Bilingual ES/EN encoder
  150M parameters
       ↓
  ALIA-MrBERT-es-biomedical-embeddings (SINAI)
  Biomedical domain fine-tuning
  Curriculum Learning + Hard Negatives
```

### Key Features

- 🔍 **Domain**: Spanish biomedical and healthcare texts
- 📐 **Architecture**: ModernBERT with Mean Pooling (bi-encoder)
- 📏 **Long context**: up to 8,192 tokens
- 🎓 **Training strategy**: Curriculum Learning (easy → medium → hard)
- ⚙️ **Negative mining**: Positive-Aware Hard Negative Mining (NVIDIA approach)

### Architecture

This model uses the same base architecture as [MrBERT-es](https://huggingface.co/BSC-LT/MrBERT-es), extended with a Mean Pooling layer for sentence-level embeddings:

|                              |               |
|------------------------------|:--------------|
| **Base Architecture**        | ModernBERT    |
| **Total Parameters**         | ~150M         |
| **Hidden size**              | 768           |
| **Intermediate size**        | 1,152         |
| **Attention heads**          | 12            |
| **Hidden layers**            | 22            |
| **Context length**           | 8,192 tokens  |
| **Vocabulary size**          | 51,200        |
| **Precision**                | bfloat16      |
| **Positional encoding**      | RoPE          |
| **Activation function**      | GeLU          |
| **Attention type**           | Mixed (global every 3 layers + sliding window) |
| **Pooling strategy**         | Mean Pooling  |
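
The mixed attention pattern in the table can be pictured with a short sketch. It assumes layers are indexed from 0 and that layer 0 is a global layer; the table only states the 1-in-3 ratio, so the exact offset is an assumption.

```python
# Sketch only: the offset (layer 0 is global) is an assumption, not stated in this card.
NUM_LAYERS = 22      # "Hidden layers" from the table
GLOBAL_EVERY_N = 3   # "global every 3 layers"


def attention_kind(layer_index: int, every_n: int = GLOBAL_EVERY_N) -> str:
    """Return 'global' or 'sliding_window' for a given layer index."""
    return "global" if layer_index % every_n == 0 else "sliding_window"


layout = [attention_kind(i) for i in range(NUM_LAYERS)]
global_layers = [i for i, kind in enumerate(layout) if kind == "global"]
print(global_layers)
```

Under these assumptions, 8 of the 22 layers attend globally and the remaining 14 use the sliding window.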

---

## Training

### Training Strategy: Curriculum Learning

The model was fine-tuned with a two-phase **Curriculum Learning** strategy that progressively increases the difficulty of the training examples, which are drawn from [SINAI/ALIA-es-biomedical-triplets/train](https://huggingface.co/datasets/SINAI/ALIA-es-biomedical-triplets/viewer/hard-negatives/train):

| Phase | Epochs | Negative Type | Difficulty Progression |
|-------|:------:|:-------------:|:----------------------:|
| **Phase 1** | 6 | Random negatives | Easy → Medium → Hard |
| **Phase 2** | 3 | Hard negatives (mined) | Easy → Medium → Hard |
| **Total** | **9** | — | — |

**Phase 1 - Contrastive Learning with Random Negatives:**
Training uses triplets `{query, relevant_doc, [irrelevant_docs]}` with in-batch negatives. Examples are sorted by difficulty across 3 sub-phases (2 epochs each).
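
As a rough illustration of the difficulty ordering, the sketch below sorts examples by a score and splits them into three stages. The card does not say how difficulty is measured or whether stages are disjoint, so the `difficulty` function, the equal-size buckets, and the toy data are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    examples: Sequence[T],
    difficulty: Callable[[T], float],
    num_stages: int = 3,
) -> List[List[T]]:
    """Sort examples by a difficulty score and split them into near-equal stages."""
    ordered = sorted(examples, key=difficulty)
    base, extra = divmod(len(ordered), num_stages)
    stages, start = [], 0
    for stage_index in range(num_stages):
        # The first `extra` stages absorb the remainder, one example each.
        size = base + (1 if stage_index < extra else 0)
        stages.append(ordered[start:start + size])
        start += size
    return stages


# Hypothetical toy data: (query id, difficulty score) pairs.
toy = [("q1", 0.9), ("q2", 0.1), ("q3", 0.5), ("q4", 0.3), ("q5", 0.7), ("q6", 0.2)]
easy, medium, hard = curriculum_stages(toy, difficulty=lambda example: example[1])

EPOCHS_PER_STAGE = 2  # Phase 1: 3 sub-phases x 2 epochs = 6 epochs
schedule = [(name, EPOCHS_PER_STAGE) for name in ("easy", "medium", "hard")]
```

Training would then iterate over `schedule` in order, running the listed number of epochs on each stage.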

**Phase 2 - Advanced Refinement with Hard Negatives:**
Refinement using mined hard negatives with **Positive-Aware Mining** (NVIDIA approach) to avoid false negatives. A candidate is only considered a negative if:
```
score < score_positive - margin   (margin = 0.05)
```
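
The filtering rule above can be sketched in a few lines. This is a minimal illustration, not the mining pipeline used for training: the similarity scores are made up, and `top_k` is a hypothetical parameter for keeping only the hardest surviving candidates.

```python
from typing import Dict, List

MARGIN = 0.05  # margin stated in the rule above


def select_hard_negatives(
    positive_score: float,
    candidate_scores: Dict[str, float],
    margin: float = MARGIN,
    top_k: int = 2,  # hypothetical: number of negatives kept per query
) -> List[str]:
    """Keep candidates scoring below (positive_score - margin), hardest first."""
    threshold = positive_score - margin
    kept = [(doc_id, score) for doc_id, score in candidate_scores.items() if score < threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in kept[:top_k]]


# Made-up similarity scores between one query and four candidate documents.
candidates = {"doc_a": 0.78, "doc_b": 0.74, "doc_c": 0.60, "doc_d": 0.20}
negatives = select_hard_negatives(positive_score=0.80, candidate_scores=candidates)
print(negatives)
```

Here `doc_a` is discarded because its score falls within the margin of the positive and it may be a false negative, while `doc_b` and `doc_c` are kept as the hardest safe negatives.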

### Hyperparameter Optimization

Before training, a hyperparameter search was run with **[Optuna](https://optuna.org/)** (20 trials on subsets of 5,000 examples):
- **Sampler**: TPESampler (Tree-structured Parzen Estimator)
- **Pruner**: MedianPruner
- **Storage**: SQLite for trial persistence

| Hyperparameter | Search Space |
|----------------|:------------:|
| Learning Rate | [1×10⁻⁶, 5×10⁻⁵] (log-uniform) |
| Warmup Ratio | [0.05, 0.2] |
| Weight Decay | [0.0, 0.1] |
| Mini Batch Size | {1, 4, 8, 12} |
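
A search with this configuration could be set up roughly as follows. This is a sketch under assumptions: the study name, the SQLite file name, the `maximize` direction, and the `train_and_evaluate` callable are hypothetical placeholders, not the project's actual script.

```python
def suggest_hyperparameters(trial):
    """Sample one configuration from the search space in the table above."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 5e-5, log=True),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.05, 0.2),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "mini_batch_size": trial.suggest_categorical("mini_batch_size", [1, 4, 8, 12]),
    }


def make_objective(train_and_evaluate):
    """Wrap a user-supplied training function (hypothetical) as an Optuna objective."""
    def objective(trial) -> float:
        return train_and_evaluate(**suggest_hyperparameters(trial))
    return objective


def build_study(storage_url: str = "sqlite:///optuna_trials.db"):
    """Create a study with the sampler, pruner, and storage type listed above."""
    import optuna  # imported lazily so the search space can be inspected without it

    return optuna.create_study(
        study_name="biomedical-embeddings-hpo",  # hypothetical name
        direction="maximize",                    # assumes a higher-is-better metric
        sampler=optuna.samplers.TPESampler(),
        pruner=optuna.pruners.MedianPruner(),
        storage=storage_url,
        load_if_exists=True,                     # resume trials persisted in SQLite
    )


# Usage (requires `pip install optuna` and a real training function):
# study = build_study()
# study.optimize(make_objective(train_and_evaluate), n_trials=20)
```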

### Final Training Hyperparameters

| Hyperparameter | Value | Description |
|----------------|:-----:|:------------|
| **Learning Rate** | 5×10⁻⁵ | Nominal learning rate |
| **Batch Size** | 128 | Global batch size |
| **Cache Mini-Batch** | 4 | For CachedMultipleNegativesRankingLoss |
| **Warmup Ratio** | 0.1595 | Linear LR warmup at start of each phase |
| **Weight Decay** | 0.02143 | L2 regularization |
| **Optimizer** | AdamW | Standard HuggingFace Trainer optimizer |
| **Precision** | bf16 | Bfloat16 for Ampere+ architectures |
| **Max Sequence Length** | 8,192 | Maximum tokens processed |

### Loss Function

- **CachedMultipleNegativesRankingLoss**: Allows training with large batches without exhausting VRAM by recomputing embeddings in smaller mini-batches (mini-batch size: 4).
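
The objective behind this loss is an in-batch ranking loss: each query should score its own positive higher than every other document in the batch. The pure-Python sketch below illustrates that objective only; it assumes cosine similarity with a scale of 20 (the `sentence-transformers` defaults) and leaves out the gradient caching, which changes memory use rather than the loss value.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def in_batch_ranking_loss(
    queries: Sequence[Vector],
    positives: Sequence[Vector],
    scale: float = 20.0,  # assumed default scale
) -> float:
    """Mean cross-entropy where positives[i] is the match for queries[i]
    and every other positives[j] acts as an in-batch negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, doc) for doc in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(value - peak) for value in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_ranking_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is near zero when each query is closest to its own positive and grows when the pairing is wrong, which is why larger batches (more in-batch negatives) make the task harder and more informative.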

### Training Framework

| Component | Details |
|-----------|---------|
| **Library** | `sentence-transformers` |
| **Distributed** | DDP (Distributed Data Parallel) via `torchrun` |
| **Memory optimization** | Gradient Checkpointing |
| **Logging** | WandB (offline mode) |

---

## Intended Use

### Direct Use

This model is designed for semantic similarity and information retrieval tasks in the **Spanish biomedical domain**. Primary use cases include:
- **Semantic search**: Finding relevant biomedical documents from a query
- **RAG pipelines**: Retrieving biomedical context for grounded question answering
- **Biomedical document clustering**: Grouping similar biomedical texts by semantic content
- **Duplicate detection**: Identifying semantically similar biomedical passages

### Out-of-Scope Use

- General-domain retrieval (model is specialized for biomedical Spanish)
- Cross-lingual retrieval beyond Spanish
- Use as a generative model (this is an encoder-only model)
- Clinical decision support without human expert validation

---

## How to Use

### With `sentence-transformers`

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("SINAI/ALIA-MrBERT-es-biomedical-embeddings")

queries = ["¿Cuáles son los síntomas principales de la insuficiencia cardíaca?"]
documents = [
    "La insuficiencia cardíaca puede causar disnea, fatiga y edema periférico...",
    "El tratamiento inicial incluye control de la presión arterial y ajuste farmacológico...",
]

query_embeddings = model.encode(queries, prompt_name="query")
doc_embeddings = model.encode(documents)

scores = model.similarity(query_embeddings, doc_embeddings)
print(scores)
```

### With `transformers` (manual)

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "SINAI/ALIA-MrBERT-es-biomedical-embeddings"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

texts = ["¿Qué recomendaciones existen para el manejo de la hipertensión arterial?"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

embeddings = mean_pooling(outputs, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 768])
```

---

## Evaluation

The model was evaluated using the [MTEB](https://huggingface.co/mteb) (Massive Text Embedding Benchmark) framework, adapted for the biomedical domain. The main reported metric is **NDCG@10** (Normalized Discounted Cumulative Gain at k=10), the standard metric on retrieval leaderboards and the one reported for the MrBERT family.
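
For readers unfamiliar with the metric, NDCG@10 for a single query can be computed as in the sketch below. It is an illustration rather than the MTEB implementation: it assumes linear gain, and the `all_relevances` argument (relevance labels of every judged document, used for the ideal ranking) is a hypothetical convenience.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    k: int = 10,
    all_relevances: Optional[Sequence[float]] = None,
) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of retrieved documents, in the order the model ranked them.
print(ndcg_at_k([1, 0, 0]))  # relevant document ranked first
print(ndcg_at_k([0, 1]))     # relevant document ranked second
```

The reported score is the mean of this value over all queries in a dataset.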

An additional evaluation was carried out with the [ragas](https://github.com/vibrantlabsai/ragas) evaluation framework and the [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) language model. Each reported metric is the average of the per-pair scores over specific subsets of some of the datasets listed below.

### Evaluation Datasets

| Dataset | Category | Description |
|---------|----------|-------------|
| **QA** | Retrieval | Spanish subset of the MIRACL dataset in MTEB format ([jinaai/miracl-es](https://huggingface.co/datasets/jinaai/miracl-es)) |
| **STS** | STS | Combination of three Spanish STS datasets: PAWS-X ([google-research-datasets/paws-x](https://huggingface.co/datasets/google-research-datasets/paws-x)), STS22 ([mteb/sts22-crosslingual-sts](https://huggingface.co/datasets/mteb/sts22-crosslingual-sts)), and SemRel2024 ([SemRel/SemRel2024](https://huggingface.co/datasets/SemRel/SemRel2024)) |
| **CoWeSe** | Retrieval | Generated open-ended questions from the [CoWeSe](https://arxiv.org/abs/2109.07765) corpus ([chrisnb1/cowese-qa-dataset](https://huggingface.co/datasets/chrisnb1/cowese-qa-dataset)) |
| **AbSanitas** | Retrieval | Spanish biomedical information retrieval dataset built from biomedical texts collected from official academic repositories and open-access sources ([BSC-LT/AbSanitas](https://huggingface.co/datasets/BSC-LT/AbSanitas)) |
| **pairs1.5k** | Retrieval | Subset of 1.5K biomedical evaluation pairs (query + passage) located in [SINAI/ALIA-es-biomedical-triplets/test](https://huggingface.co/datasets/SINAI/ALIA-es-biomedical-triplets/viewer/evaluation/test). |
| **pairs3645** | Retrieval | Subset of 3,645 biomedical evaluation pairs (query + passage) located in [SINAI/ALIA-es-biomedical-triplets/test](https://huggingface.co/datasets/SINAI/ALIA-es-biomedical-triplets/viewer/evaluation/test). |

### Results (NDCG@10)

| Model | QA | STS | CoWeSe | AbSanitas | pairs1.5k | pairs3645 |
|-------|:------------:|:-------------:|:--------:|:--------:|:------------:|:-------------:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.9839 | 0.4629 | 0.7974 | 0.7613 | 0.9775 | 0.9840 |
| [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) | **0.9869**| **0.5033** | **0.8807** | **0.8165** | **0.9921** | 0.9962 |
| [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.9419 | 0.4632 | 0.7262 | 0.4870 | 0.7871 | 0.8212 |
| **ALIA-MrBERT-es-biomedical-embeddings (ours)** | 0.9666 | 0.4932 | 0.8551 | 0.8124 | 0.9877 | **0.9966** |

[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) and [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) are state-of-the-art models roughly four times larger than ours. Even so, our model achieves comparable results.

### Results (ragas)

#### Subset: `triplets_queries1724_contexts3645` (pairs3645)

| Model | [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/#context-precision_1) (avg) | [Context Utilization](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/#context-utilization) (avg) | [Context Relevance](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/#context-relevance) (avg) |
| :--- | :---: | :---: | :---: |
| [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) | 0.9577 | 0.9582 | 0.9098 |
| **ALIA-MrBERT-es-biomedical-embeddings (ours)** | **0.9588** | **0.9589** | **0.9102** |

#### Subset: `cowese_queries661_contexts551`

| Model | [Context Relevance](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/#context-relevance) (avg) |
| :--- | :---: |
| [**Qwen/Qwen3-Embedding-0.6B**](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) | **0.4231** |
| ALIA-MrBERT-es-biomedical-embeddings (ours) | 0.3932 |

#### Subset: `absanitas_queries800_contexts25192`

| Model | [Context Relevance](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/nvidia_metrics/#context-relevance) (avg) |
| :--- | :---: |
| [**Qwen/Qwen3-Embedding-0.6B**](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) | **0.9228** |
| ALIA-MrBERT-es-biomedical-embeddings (ours) | 0.9156 |

> **Note:** Each evaluation subset is named following the pattern `{dataset}_queries{N}_contexts{M}`, where `N` is the number of queries evaluated against `M` contexts taken from the datasets.
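
When processing results programmatically, the naming pattern in the note can be parsed with a small helper. The function and class names here are hypothetical and only illustrate the pattern.

```python
import re
from typing import NamedTuple


class EvalSubset(NamedTuple):
    dataset: str
    num_queries: int
    num_contexts: int


_SUBSET_PATTERN = re.compile(r"^(?P<dataset>.+)_queries(?P<n>\d+)_contexts(?P<m>\d+)$")


def parse_subset_name(name: str) -> EvalSubset:
    """Split a `{dataset}_queries{N}_contexts{M}` name into its three parts."""
    match = _SUBSET_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected subset name: {name!r}")
    return EvalSubset(match["dataset"], int(match["n"]), int(match["m"]))


subset = parse_subset_name("absanitas_queries800_contexts25192")
print(subset.dataset, subset.num_queries, subset.num_contexts)
```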

---

## Limitations and Biases

### Known Limitations

- **Domain specificity**: The model is optimized for Spanish biomedical texts. Performance may degrade on general-domain or other specialized texts.
- **Language**: Although MrBERT-es supports Spanish and English, this fine-tuned model focuses on Spanish biomedical content.
- **Clinical accuracy**: Semantic similarity does not guarantee medical correctness. Retrieved documents should be verified by qualified healthcare professionals.
- **Context length**: Despite supporting up to 8,192 tokens, very long documents may still require chunking strategies for optimal retrieval performance.
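
One common chunking strategy for long documents is a sliding window over token ids with some overlap between consecutive chunks. The sketch below shows the idea; the window size of 512 and overlap of 64 are illustrative values, not settings recommended or validated for this model.

```python
from typing import List, Sequence


def chunk_tokens(
    token_ids: Sequence[int],
    max_tokens: int = 512,  # illustrative window size
    overlap: int = 64,      # illustrative overlap between consecutive chunks
) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most max_tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_tokens]))
        if start + max_tokens >= len(token_ids):
            break  # the last window already reaches the end of the document
    return chunks


chunks = chunk_tokens(list(range(10)), max_tokens=4, overlap=1)
print(chunks)
```

Each chunk can then be decoded back to text and embedded separately, with the document's score taken, for example, as the maximum over its chunks.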

### Biases

- The model may reflect biases present in the biomedical corpora used for training.
- It may underperform on subdomains underrepresented in the source data.

---

## Additional Information

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Citation

If you use this model in your research, please cite:

```bibtex
@misc{ALIA-MrBERT-es-biomedical-embeddings,
  title        = {ALIA MrBERT Spanish Biomedical Embeddings Model},
  author       = {SINAI Research Group, Universidad de Jaén},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/SINAI/ALIA-MrBERT-es-biomedical-embeddings}}
}
```

Please also cite the base model:

```bibtex
@misc{tamayo2026mrbertmodernmultilingualencoders,
      title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation}, 
      author={Daniel Tamayo and Iñaki Lacunza and Paula Rivera-Hidalgo and Severino Da Dalt and Javier Aula-Blasco and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2026},
      eprint={2602.21379},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.21379}, 
}
```

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project [ALIA](https://alia.gob.es).

### Acknowledgments
This model has been developed thanks to [CEATIC](https://www.ujaen.es/centros/ceatic/) (Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación) - [UJA](http://www.ujaen.es/) (Universidad de Jaén), which provided the necessary computational resources on its clusters.

---

**Contact:** [ALIA Project](https://www.alia.gob.es/) - [SINAI Research Group](https://sinai.ujaen.es) - [Universidad de Jaén](https://www.ujaen.es/)

**More Information:** [SINAI Research Group](https://sinai.ujaen.es) | [ALIA-UJA Project](https://github.com/sinai-uja/ALIA-UJA)