ALIA MrBERT Spanish Biomedical Embeddings Model
This repository contains ALIA MrBERT Spanish Biomedical Embeddings, a Spanish biomedical domain bi-encoder model for semantic similarity and information retrieval tasks. It is built upon MrBERT-es, a bilingual (Spanish-English) foundational language model based on the ModernBERT architecture, and fine-tuned on domain-specific biomedical data using a Curriculum Learning strategy.
DISCLAIMER: This model is a domain-specific proof-of-concept designed to demonstrate retrieval capabilities in the Spanish biomedical domain. Although it is optimized for this domain, results should be verified against official clinical sources and expert judgment. The model may fail on out-of-domain or adversarial inputs.
Model Details
Model Lineage
ModernBERT (architecture)
↓
MrBERT-es (BSC-LT)
Bilingual ES/EN encoder
150M parameters
↓
ALIA-MrBERT-es-biomedical-embeddings (SINAI)
Biomedical domain fine-tuning
Curriculum Learning + Hard Negatives
Key Features
- 🔍 Domain: Spanish biomedical and healthcare texts
- 📐 Architecture: ModernBERT with Mean Pooling (bi-encoder)
- 📏 Long context: up to 8,192 tokens
- 🎓 Training strategy: Curriculum Learning (easy → medium → hard)
- ⚙️ Negative mining: Positive-Aware Hard Negative Mining (NVIDIA approach)
Architecture
This model uses the same base architecture as MrBERT-es, extended with a Mean Pooling layer for sentence-level embeddings:
| Property | Value |
|---|---|
| Base architecture | ModernBERT |
| Total parameters | ~150M |
| Hidden size | 768 |
| Intermediate size | 1,152 |
| Attention heads | 12 |
| Hidden layers | 22 |
| Context length | 8,192 tokens |
| Vocabulary size | 51,200 |
| Precision | bfloat16 |
| Positional encoding | RoPE |
| Activation function | GeLU |
| Attention type | Mixed (global every 3 layers + sliding window) |
| Pooling strategy | Mean Pooling |
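To make the mixed attention pattern concrete, the sketch below lists which of the 22 layers would use global attention if every third layer is global. It assumes zero-based layer indices with layer 0 being global, which is how the reference ModernBERT implementation is commonly described; the function name `attention_layout` is hypothetical and nothing here is read from the checkpoint configuration.

```python
# Sketch of the global / sliding-window layout described in the table.
# ASSUMPTION: layers are indexed from 0 and layer 0 is global.
NUM_LAYERS = 22
GLOBAL_EVERY = 3


def attention_layout(num_layers: int = NUM_LAYERS, global_every: int = GLOBAL_EVERY) -> list:
    """Return the attention type of each layer, in order."""
    return [
        "global" if index % global_every == 0 else "sliding_window"
        for index in range(num_layers)
    ]


layout = attention_layout()
global_layers = [index for index, kind in enumerate(layout) if kind == "global"]
print(global_layers)
```

Under these assumptions, 8 layers attend globally and the remaining 14 use the sliding window.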
Training
Training Strategy: Curriculum Learning
The model was fine-tuned with a two-phase Curriculum Learning strategy that progressively increases the difficulty of the training examples, using the triplets in SINAI/ALIA-es-biomedical-triplets/train:
| Phase | Epochs | Negative Type | Difficulty Progression |
|---|---|---|---|
| Phase 1 | 6 | Random negatives | Easy → Medium → Hard |
| Phase 2 | 3 | Hard negatives (mined) | Easy → Medium → Hard |
| Total | 9 | — | — |
Phase 1 - Contrastive Learning with Random Negatives:
Training uses triplets {query, relevant_doc, [irrelevant_docs]} with in-batch negatives. Examples are sorted by difficulty across 3 sub-phases (2 epochs each).
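One way to picture the three sub-phases is a schedule that sorts examples by a difficulty score and trains on each bucket for two epochs. The sketch below is only an illustration: the card does not say how difficulty is measured or whether the buckets are cumulative, so the `difficulties` input, the equal-sized disjoint buckets, and the function name `curriculum_schedule` are all assumptions.

```python
from typing import List, Sequence, Tuple

# (query, relevant_doc, irrelevant_docs), mirroring the triplet layout above.
Triplet = Tuple[str, str, List[str]]


def curriculum_schedule(
    examples: Sequence[Triplet],
    difficulties: Sequence[float],
    num_stages: int = 3,
    epochs_per_stage: int = 2,
) -> List[List[Triplet]]:
    """Return one list of examples per epoch, ordered easy -> hard.

    ASSUMPTION: `difficulties` is a hypothetical per-example score
    (lower means easier) and the stages are disjoint, equal-sized buckets.
    """
    if len(examples) != len(difficulties):
        raise ValueError("examples and difficulties must have the same length")
    order = sorted(range(len(examples)), key=lambda i: difficulties[i])
    stage_size = -(-len(order) // num_stages)  # ceiling division
    schedule = []
    for stage in range(num_stages):
        indices = order[stage * stage_size:(stage + 1) * stage_size]
        bucket = [examples[i] for i in indices]
        # Each sub-phase is repeated for `epochs_per_stage` epochs.
        schedule.extend([bucket] * epochs_per_stage)
    return schedule


toy_examples = [(f"q{i}", f"pos{i}", [f"neg{i}"]) for i in range(6)]
toy_difficulties = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
schedule = curriculum_schedule(toy_examples, toy_difficulties)
print(len(schedule))  # number of epochs in the phase
```

With three stages of two epochs each, the schedule covers the 6 epochs listed for Phase 1.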
Phase 2 - Advanced Refinement with Hard Negatives:
Refinement uses mined hard negatives with Positive-Aware Mining (NVIDIA approach) to avoid false negatives. A candidate is only considered a negative if:
score < score_positive - margin (margin = 0.05)
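The filtering rule above can be sketched in a few lines. This is a minimal illustration, not the mining code used for the model: the function name `select_hard_negatives`, the `(doc_id, score)` input layout, and the optional `top_k` cut-off are assumptions; only the threshold rule and the 0.05 margin come from the card.

```python
from typing import List, Optional, Tuple


def select_hard_negatives(
    candidates: List[Tuple[str, float]],
    positive_score: float,
    margin: float = 0.05,
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Keep candidates scoring strictly below `positive_score - margin`.

    Candidates at or above the threshold are discarded because they are
    close enough to the positive to be likely false negatives. The kept
    candidates are returned hardest first (highest score first).
    """
    threshold = positive_score - margin
    kept = [(doc_id, score) for doc_id, score in candidates if score < threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Toy similarity scores between one query and four retrieved candidates.
candidates = [("d1", 0.79), ("d2", 0.70), ("d3", 0.30), ("d4", 0.76)]
hard_negatives = select_hard_negatives(candidates, positive_score=0.80)
print(hard_negatives)
```

Here `d1` and `d4` are dropped because they score too close to the positive, while `d2` and `d3` remain as usable negatives.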
Hyperparameter Optimization
Before training, hyperparameter search was conducted using Optuna (20 trials, subsets of 5,000 examples):
- Sampler: TPESampler (Tree-structured Parzen Estimator)
- Pruner: MedianPruner
- Storage: SQLite for trial persistence
| Hyperparameter | Search Space |
|---|---|
| Learning Rate | [1×10⁻⁶, 5×10⁻⁵] (log-uniform) |
| Warmup Ratio | [0.05, 0.2] |
| Weight Decay | [0.0, 0.1] |
| Mini Batch Size | {1, 4, 8, 12} |
Final Training Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| Learning Rate | 5×10⁻⁵ | Nominal learning rate |
| Batch Size | 128 | Global batch size |
| Cache Mini-Batch | 4 | For CachedMultipleNegativesRankingLoss |
| Warmup Ratio | 0.1595 | Linear LR warmup at start of each phase |
| Weight Decay | 0.02143 | L2 regularization |
| Optimizer | AdamW | Standard HuggingFace Trainer optimizer |
| Precision | bf16 | Bfloat16 for Ampere+ architectures |
| Max Sequence Length | 8,192 | Maximum tokens processed |
Loss Function
- CachedMultipleNegativesRankingLoss: Enables training with large batches without running out of GPU memory, by recomputing embeddings in smaller mini-batches (mini-batch size: 4).
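For intuition, the objective behind this loss treats every other document in the batch as a negative for a given query. The pure-Python sketch below shows only that in-batch objective on toy vectors, not the gradient caching that makes large batches fit in memory. The function names are hypothetical, and the similarity scale of 20 is an assumption (it matches the usual sentence-transformers default, but the card does not state the value used).

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(
    query_embs: List[Sequence[float]],
    doc_embs: List[Sequence[float]],
    scale: float = 20.0,  # ASSUMPTION: not stated in the card
) -> float:
    """Mean cross-entropy where doc i is the positive for query i
    and every other doc in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is near zero when each query is closest to its own document and large when the pairs are swapped, which is why larger batches (more in-batch negatives) give a stronger training signal.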
Training Framework
| Component | Details |
|---|---|
| Library | sentence-transformers |
| Distributed | DDP (Distributed Data Parallel) via torchrun |
| Memory optimization | Gradient Checkpointing |
| Logging | WandB (offline mode) |
Intended Use
Direct Use
This model is designed for semantic similarity and information retrieval tasks in the Spanish biomedical domain. Primary use cases include:
- Semantic search: Finding relevant biomedical documents from a query
- RAG pipelines: Retrieving biomedical context for grounded question answering
- Biomedical document clustering: Grouping similar biomedical texts by semantic content
- Duplicate detection: Identifying semantically similar biomedical passages
Out-of-Scope Use
- General-domain retrieval (model is specialized for biomedical Spanish)
- Cross-lingual retrieval beyond Spanish
- Use as a generative model (this is an encoder-only model)
- Clinical decision support without human expert validation
How to Use
With sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("SINAI/ALIA-MrBERT-es-biomedical-embeddings")

queries = ["¿Cuáles son los síntomas principales de la insuficiencia cardíaca?"]
documents = [
    "La insuficiencia cardíaca puede causar disnea, fatiga y edema periférico...",
    "El tratamiento inicial incluye control de la presión arterial y ajuste farmacológico...",
]

# Queries are encoded with the "query" prompt; documents use the default encoding
query_embeddings = model.encode(queries, prompt_name="query")
doc_embeddings = model.encode(documents)

# Similarity matrix of shape (num_queries, num_documents)
scores = model.similarity(query_embeddings, doc_embeddings)
print(scores)
With transformers (manual)
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "SINAI/ALIA-MrBERT-es-biomedical-embeddings"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

texts = ["¿Qué recomendaciones existen para el manejo de la hipertensión arterial?"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

embeddings = mean_pooling(outputs, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 768])
Evaluation
The model was evaluated using the MTEB (Massive Text Embedding Benchmark) framework, adapted for the biomedical domain. The main reported metric is NDCG@10 (Normalized Discounted Cumulative Gain at k=10), which is the standard metric used in retrieval leaderboards and matches the metric reported for the MrBERT family.
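As a reminder of what NDCG@10 measures, the sketch below computes it for a single query from a list of relevance labels in ranked order. It is a simplified illustration rather than the MTEB implementation: it assumes the linear-gain form of DCG and that `ranked_relevances` contains a label for every judged document, so the ideal ranking can be derived by sorting that same list.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant document, ranked first versus ranked second.
print(ndcg_at_k([1, 0, 0]))
print(ndcg_at_k([0, 1, 0]))
```

Benchmark scores are the mean of this per-query value over all queries, so a single relevant passage slipping from rank 1 to rank 2 costs that query roughly 0.37.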
An additional evaluation was carried out with the ragas evaluation framework and the MiniMaxAI/MiniMax-M2.5 language model. Each metric is the average of the per-pair scores over specific subsets of some of the following datasets.
Evaluation Datasets
| Dataset | Category | Description |
|---|---|---|
| QA | Retrieval | Spanish subset of the MIRACL dataset in MTEB format (jinaai/miracl-es) |
| STS | STS | Combination of three Spanish STS datasets: PAWS-X (google-research-datasets/paws-x), STS22 (mteb/sts22-crosslingual-sts), and SemRel2024 (SemRel/SemRel2024) |
| CoWeSe | Retrieval | Generated open-ended questions from the CoWeSe corpus (chrisnb1/cowese-qa-dataset) |
| AbSanitas | Retrieval | Spanish biomedical information retrieval dataset built from biomedical texts collected from official academic repositories and open-access sources (BSC-LT/AbSanitas) |
| pairs1.5k | Retrieval | Subset of 1.5K biomedical evaluation pairs (query + passage) located in SINAI/ALIA-es-biomedical-triplets/test. |
| pairs3645 | Retrieval | Subset of 3,645 biomedical evaluation pairs (query + passage) located in SINAI/ALIA-es-biomedical-triplets/test. |
Results (NDCG@10)
| Model | QA | STS | CoWeSe | AbSanitas | pairs1.5k | pairs3645 |
|---|---|---|---|---|---|---|
| BAAI/bge-m3 | 0.9839 | 0.4629 | 0.7974 | 0.7613 | 0.9775 | 0.9840 |
| Qwen/Qwen3-Embedding-0.6B | 0.9869 | 0.5033 | 0.8807 | 0.8165 | 0.9921 | 0.9962 |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 0.9419 | 0.4632 | 0.7262 | 0.4870 | 0.7871 | 0.8212 |
| ALIA-MrBERT-es-biomedical-embeddings (ours) | 0.9666 | 0.4932 | 0.8551 | 0.8124 | 0.9877 | 0.9966 |
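BAAI/bge-m3 and Qwen/Qwen3-Embedding-0.6B are state-of-the-art models roughly four times larger than ours. Even so, our model outperforms BAAI/bge-m3 on every dataset except QA, and stays within 0.03 NDCG@10 of Qwen/Qwen3-Embedding-0.6B on all datasets, slightly surpassing it on pairs3645 (0.9966 vs. 0.9962).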
Results (ragas)
Subset: triplets_queries1724_contexts3645 (pairs3645)
| Model | Context Precision (avg) | Context Utilization (avg) | Context Relevance (avg) |
|---|---|---|---|
| Qwen/Qwen3-Embedding-0.6B | 0.9577 | 0.9582 | 0.9098 |
| ALIA-MrBERT-es-biomedical-embeddings (ours) | 0.9588 | 0.9589 | 0.9102 |
Subset: cowese_queries661_contexts551
| Model | Context Relevance (avg) |
|---|---|
| Qwen/Qwen3-Embedding-0.6B | 0.4231 |
| ALIA-MrBERT-es-biomedical-embeddings (ours) | 0.3932 |
Subset: absanitas_queries800_contexts25192
| Model | Context Relevance (avg) |
|---|---|
| Qwen/Qwen3-Embedding-0.6B | 0.9228 |
| ALIA-MrBERT-es-biomedical-embeddings (ours) | 0.9156 |
Note: Each evaluation subset is named following the pattern {dataset}_queries{N}_contexts{M}, where N is the number of queries evaluated against M contexts taken from the datasets.
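When processing several result files, the naming pattern can be parsed mechanically. The helper below is a small sketch; the name `parse_subset_name` and the regular expression are illustrative choices, not part of any released tooling.

```python
import re
from typing import Tuple

# Matches {dataset}_queries{N}_contexts{M}
SUBSET_PATTERN = re.compile(
    r"^(?P<dataset>.+)_queries(?P<queries>\d+)_contexts(?P<contexts>\d+)$"
)


def parse_subset_name(name: str) -> Tuple[str, int, int]:
    """Split a subset name into (dataset, number of queries, number of contexts)."""
    match = SUBSET_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected subset name: {name!r}")
    return match["dataset"], int(match["queries"]), int(match["contexts"])


print(parse_subset_name("absanitas_queries800_contexts25192"))
```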
Limitations and Biases
Known Limitations
- Domain specificity: The model is optimized for Spanish biomedical texts. Performance may degrade on general-domain or other specialized texts.
- Language: Although MrBERT-es supports Spanish and English, this fine-tuned model focuses on Spanish biomedical content.
- Clinical accuracy: Semantic similarity does not guarantee medical correctness. Retrieved documents should be verified by qualified healthcare professionals.
- Context length: Despite supporting up to 8,192 tokens, very long documents may still require chunking strategies for optimal retrieval performance.
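The chunking mentioned under context length could look like the sliding-window sketch below, which splits a list of token IDs into overlapping windows. The window size of 512 and overlap of 64 are arbitrary illustrative values, not recommendations from the authors, and `chunk_token_ids` is a hypothetical helper name.

```python
from typing import List, Sequence


def chunk_token_ids(
    token_ids: Sequence[int],
    max_tokens: int = 512,  # ASSUMPTION: illustrative window size
    overlap: int = 64,      # ASSUMPTION: illustrative overlap
) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most `max_tokens`."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_tokens]))
        if start + max_tokens >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


demo_chunks = chunk_token_ids(list(range(10)), max_tokens=4, overlap=1)
print(demo_chunks)
```

Each chunk can then be embedded separately, and a document's relevance can be taken from its best-scoring chunk; whether that aggregation suits a given corpus is something to validate empirically.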
Biases
- The model may reflect biases present in the biomedical corpora used for training.
- It may underperform on subdomains underrepresented in the source data.
Additional Information
License
Citation
If you use this model in your research, please cite:
@misc{ALIA-MrBERT-es-biomedical-embeddings,
title = {ALIA MrBERT Spanish Biomedical Embeddings Model},
author = {SINAI Research Group, Universidad de Jaén},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/SINAI/ALIA-MrBERT-es-biomedical-embeddings}}
}
Please also cite the base model:
@misc{tamayo2026mrbertmodernmultilingualencoders,
title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation},
author={Daniel Tamayo and Iñaki Lacunza and Paula Rivera-Hidalgo and Severino Da Dalt and Javier Aula-Blasco and Aitor Gonzalez-Agirre and Marta Villegas},
year={2026},
eprint={2602.21379},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.21379},
}
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ALIA.
Acknowledgments
This work was made possible by CEATIC (Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación) – UJA (Universidad de Jaén), which provided the required computational resources on its clusters.
Contact: ALIA Project - SINAI Research Group - Universidad de Jaén
More Information: SINAI Research Group | ALIA-UJA Project