ALIA MrBERT Spanish Biomedical Embeddings Model

This repository contains ALIA MrBERT Spanish Biomedical Embeddings, a Spanish biomedical domain bi-encoder model for semantic similarity and information retrieval tasks. It is built upon MrBERT-es, a bilingual (Spanish-English) foundational language model based on the ModernBERT architecture, and fine-tuned on domain-specific biomedical data using a Curriculum Learning strategy.

DISCLAIMER: This model is a domain-specific proof-of-concept designed to demonstrate retrieval capabilities in the Spanish biomedical domain. Although it is optimized for this domain, results should be verified against official clinical sources and expert judgment. The model may fail on out-of-domain or adversarial inputs.

Model Details

Model Lineage

ModernBERT (architecture)
       ↓
  MrBERT-es (BSC-LT)
  Bilingual ES/EN encoder
  150M parameters
       ↓
  ALIA-MrBERT-es-biomedical-embeddings (SINAI)
  Biomedical domain fine-tuning
  Curriculum Learning + Hard Negatives

Key Features

🔍 Domain: Spanish biomedical and healthcare texts
📐 Architecture: ModernBERT with Mean Pooling (bi-encoder)
📏 Long context: up to 8,192 tokens
🎓 Training strategy: Curriculum Learning (easy → medium → hard)
⚙️ Negative mining: Positive-Aware Hard Negative Mining (NVIDIA approach)

Architecture

This model uses the same base architecture as MrBERT-es, extended with a Mean Pooling layer for sentence-level embeddings:


Property	Value
Base Architecture	ModernBERT
Total Parameters	~150M
Hidden size	768
Intermediate size	1,152
Attention heads	12
Hidden layers	22
Context length	8,192 tokens
Vocabulary size	51,200
Precision	bfloat16
Positional encoding	RoPE
Activation function	GeLU
Attention type	Mixed (global every 3 layers + sliding window)
Pooling strategy	Mean Pooling
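To illustrate the mixed attention pattern listed in the table, the sketch below labels which of the 22 layers would use global attention. It assumes the convention of the Hugging Face ModernBERT implementation, where a zero-based layer index `i` is global when `i` is divisible by the interval of 3. The function name and that indexing convention are assumptions, not details confirmed by this model card.

```python
def attention_pattern(num_layers: int = 22, global_every: int = 3) -> list[str]:
    """Label each layer as 'global' or 'local' (sliding window).

    Assumption: layer i (zero-based) is global when i % global_every == 0.
    """
    return ["global" if i % global_every == 0 else "local" for i in range(num_layers)]


pattern = attention_pattern()
global_layers = [i for i, kind in enumerate(pattern) if kind == "global"]
print(global_layers)
print(f"{len(global_layers)} global, {pattern.count('local')} local")
```

Under this assumption, 8 of the 22 layers attend globally and the remaining 14 use the sliding window.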

Training

Training Strategy: Curriculum Learning

The model was fine-tuned with a two-phase Curriculum Learning strategy that progressively increases the difficulty of the training examples, using the train split of SINAI/ALIA-es-biomedical-triplets:

Phase	Epochs	Negative Type	Difficulty Progression
Phase 1	6	Random negatives	Easy → Medium → Hard
Phase 2	3	Hard negatives (mined)	Easy → Medium → Hard
Total	9	—	—

Phase 1 - Contrastive Learning with Random Negatives: Training uses triplets {query, relevant_doc, [irrelevant_docs]} with in-batch negatives. Examples are sorted by difficulty across 3 sub-phases (2 epochs each).
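The difficulty ordering described for Phase 1 can be sketched as follows. This is a minimal illustration, not the authors' training code: the `difficulty` field is a hypothetical per-example score (the model card does not say how difficulty is measured), and splitting into three equally sized, disjoint stages is also an assumption.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    query: str
    positive: str
    difficulty: float  # hypothetical score; higher means harder


def curriculum_stages(examples: list[Triplet], num_stages: int = 3) -> list[list[Triplet]]:
    """Sort by difficulty and split into near-equal easy/medium/hard stages."""
    ordered = sorted(examples, key=lambda ex: ex.difficulty)
    size, remainder = divmod(len(ordered), num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # Spread any leftover examples over the first stages
        end = start + size + (1 if stage < remainder else 0)
        stages.append(ordered[start:end])
        start = end
    return stages


examples = [Triplet(f"q{i}", f"d{i}", d) for i, d in enumerate([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])]
EPOCHS_PER_STAGE = 2  # Phase 1: 3 sub-phases x 2 epochs = 6 epochs
stages = curriculum_stages(examples)
for name, stage in zip(["easy", "medium", "hard"], stages):
    print(name, [ex.difficulty for ex in stage], f"-> {EPOCHS_PER_STAGE} epochs")
```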

Phase 2 - Advanced Refinement with Hard Negatives: Refinement using hard negatives mined with Positive-Aware Mining (NVIDIA approach), which filters out likely false negatives. A candidate is accepted as a negative only if its similarity score is below the positive's score by more than the margin:

score < score_positive - margin   (margin = 0.05)
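The filtering rule above could be applied along these lines. This is a sketch under stated assumptions: the candidate scores are made-up similarity values, and the `top_k` parameter and function name are hypothetical, since the model card does not state how many negatives are kept per query.

```python
def filter_hard_negatives(
    positive_score: float,
    candidates: list[tuple[str, float]],
    margin: float = 0.05,
    top_k: int = 2,  # hypothetical: number of negatives kept per query
) -> list[str]:
    """Keep the highest-scoring candidates with score < positive_score - margin."""
    threshold = positive_score - margin
    safe = [(doc, score) for doc, score in candidates if score < threshold]
    safe.sort(key=lambda pair: pair[1], reverse=True)  # hardest (most similar) first
    return [doc for doc, _ in safe[:top_k]]


candidates = [("doc_a", 0.86), ("doc_b", 0.78), ("doc_c", 0.74), ("doc_d", 0.31)]
negatives = filter_hard_negatives(positive_score=0.82, candidates=candidates)
print(negatives)
```

With a positive score of 0.82, `doc_a` and `doc_b` are rejected as likely false negatives because they score too close to (or above) the positive, while `doc_c` and `doc_d` pass the margin test.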

Hyperparameter Optimization

Before training, hyperparameter search was conducted using Optuna (20 trials, subsets of 5,000 examples):

Sampler: TPESampler (Tree-structured Parzen Estimator)
Pruner: MedianPruner
Storage: SQLite for trial persistence

Hyperparameter	Search Space
Learning Rate	[1×10⁻⁶, 5×10⁻⁵] (log-uniform)
Warmup Ratio	[0.05, 0.2]
Weight Decay	[0.0, 0.1]
Mini Batch Size	{1, 4, 8, 12}
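To make the search space in the table concrete, the sketch below draws configurations from it with the Python standard library. It uses plain random sampling, not Optuna's TPE sampler, so it only illustrates the ranges and what "log-uniform" means for the learning rate. The function and key names are assumptions.

```python
import math
import random


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from the search space in the table above."""
    # Log-uniform: sample uniformly in log space, then exponentiate
    log_lr = rng.uniform(math.log(1e-6), math.log(5e-5))
    return {
        "learning_rate": math.exp(log_lr),
        "warmup_ratio": rng.uniform(0.05, 0.2),
        "weight_decay": rng.uniform(0.0, 0.1),
        "mini_batch_size": rng.choice([1, 4, 8, 12]),
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]  # 20 trials, as in the search
for cfg in configs[:3]:
    print(cfg)
```

Sampling in log space gives each order of magnitude of the learning rate similar coverage, which a plain uniform draw over [1×10⁻⁶, 5×10⁻⁵] would not.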

Final Training Hyperparameters

Hyperparameter	Value	Description
Learning Rate	5×10⁻⁵	Nominal learning rate
Batch Size	128	Global batch size
Cache Mini-Batch	4	For CachedMultipleNegativesRankingLoss
Warmup Ratio	0.1595	Linear LR warmup at start of each phase
Weight Decay	0.02143	Decoupled weight decay applied by AdamW
Optimizer	AdamW	Standard HuggingFace Trainer optimizer
Precision	bf16	Bfloat16 for Ampere+ architectures
Max Sequence Length	8,192	Maximum tokens processed

Loss Function

CachedMultipleNegativesRankingLoss: Enables training with large batches without exhausting VRAM by computing embeddings in smaller sub-batches (mini-batch size: 4) while the in-batch negatives still span the full batch.
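The objective behind this loss can be sketched in plain Python. The example below shows only the in-batch negatives ranking objective on a precomputed similarity matrix and deliberately omits the gradient caching that makes large batches fit in memory. The `scale` of 20 mirrors the `sentence-transformers` default; whether the authors changed it is not stated, and the similarity values are made up.

```python
import math


def in_batch_ranking_loss(similarities: list[list[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where row i should rank column i (its positive) first.

    similarities[i][j] is the similarity between query i and document j;
    every off-diagonal document acts as an in-batch negative.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(similarities)


good = [[0.9, 0.1, 0.2], [0.0, 0.8, 0.1], [0.2, 0.1, 0.95]]  # positives on the diagonal
bad = [[0.1, 0.9, 0.2], [0.8, 0.0, 0.1], [0.2, 0.95, 0.1]]   # a negative outranks each positive
loss_good = in_batch_ranking_loss(good)
loss_bad = in_batch_ranking_loss(bad)
print(loss_good, loss_bad)
```

The loss is small when each query is most similar to its own document and grows when another document in the batch outranks it, which is why larger batches (more negatives per query) give a stronger training signal.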

Training Framework

Component	Details
Library	`sentence-transformers`
Distributed	DDP (Distributed Data Parallel) via `torchrun`
Memory optimization	Gradient Checkpointing
Logging	WandB (offline mode)

Intended Use

Direct Use

This model is designed for semantic similarity and information retrieval tasks in the Spanish biomedical domain. Primary use cases include:

Semantic search: Finding relevant biomedical documents from a query
RAG pipelines: Retrieving biomedical context for grounded question answering
Biomedical document clustering: Grouping similar biomedical texts by semantic content
Duplicate detection: Identifying semantically similar biomedical passages

Out-of-Scope Use

General-domain retrieval (model is specialized for biomedical Spanish)
Cross-lingual retrieval beyond Spanish
Use as a generative model (this is an encoder-only model)
Clinical decision support without human expert validation

How to Use

With `sentence-transformers`

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("SINAI/ALIA-MrBERT-es-biomedical-embeddings")

queries = ["¿Cuáles son los síntomas principales de la insuficiencia cardíaca?"]
documents = [
    "La insuficiencia cardíaca puede causar disnea, fatiga y edema periférico...",
    "El tratamiento inicial incluye control de la presión arterial y ajuste farmacológico...",
]

query_embeddings = model.encode(queries, prompt_name="query")
doc_embeddings = model.encode(documents)

scores = model.similarity(query_embeddings, doc_embeddings)
print(scores)

With `transformers` (manual)

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "SINAI/ALIA-MrBERT-es-biomedical-embeddings"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

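def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Broadcast the attention mask so that padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average over real tokens only; the clamp guards against division by zero
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)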

texts = ["¿Qué recomendaciones existen para el manejo de la hipertensión arterial?"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

embeddings = mean_pooling(outputs, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 768])

Evaluation

The model was evaluated using the MTEB (Massive Text Embedding Benchmark) framework, adapted for the biomedical domain. The main reported metric is NDCG@10 (Normalized Discounted Cumulative Gain at k=10), the standard metric on retrieval leaderboards and the one reported for the MrBERT family.
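For readers unfamiliar with the metric, a minimal NDCG@k computation is sketched below. It is an illustration rather than the MTEB implementation: the ideal ranking is derived from the labels in the list itself, which assumes every relevant document appears somewhere in that list, and the function names are assumptions.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([1, 0, 0, 0]))  # relevant document ranked first -> 1.0
print(ndcg_at_k([0, 0, 1, 0]))  # relevant document ranked third -> 0.5
```

The logarithmic discount means that moving the single relevant document from rank 1 to rank 3 halves the score.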

An additional evaluation was carried out with the ragas evaluation framework, using the MiniMaxAI/MiniMax-M2.5 language model as the evaluator. Each reported metric is the average of the per-pair scores over specific subsets of some of the datasets listed below.

Evaluation Datasets

Dataset	Category	Description
QA	Retrieval	Spanish subset of the MIRACL dataset in MTEB format (jinaai/miracl-es)
STS	STS	Combination of three Spanish STS datasets: PAWS-X (google-research-datasets/paws-x), STS22 (mteb/sts22-crosslingual-sts), and SemRel2024 (SemRel/SemRel2024)
CoWeSe	Retrieval	Generated open-ended questions from the CoWeSe corpus (chrisnb1/cowese-qa-dataset)
AbSanitas	Retrieval	Spanish biomedical information retrieval dataset built from biomedical texts collected from official academic repositories and open-access sources (BSC-LT/AbSanitas)
pairs1.5k	Retrieval	Subset of 1.5K biomedical evaluation pairs (query + passage) located in SINAI/ALIA-es-biomedical-triplets/test.
pairs3645	Retrieval	Subset of 3,645 biomedical evaluation pairs (query + passage) located in SINAI/ALIA-es-biomedical-triplets/test.

Results (NDCG@10)

Model	QA	STS	CoWeSe	AbSanitas	pairs1.5k	pairs3645
BAAI/bge-m3	0.9839	0.4629	0.7974	0.7613	0.9775	0.9840
Qwen/Qwen3-Embedding-0.6B	0.9869	0.5033	0.8807	0.8165	0.9921	0.9962
sentence-transformers/paraphrase-multilingual-mpnet-base-v2	0.9419	0.4632	0.7262	0.4870	0.7871	0.8212
ALIA-MrBERT-es-biomedical-embeddings (ours)	0.9666	0.4932	0.8551	0.8124	0.9877	0.9966

BAAI/bge-m3 and Qwen/Qwen3-Embedding-0.6B are state-of-the-art models roughly four times larger than ours. Even so, our model achieves comparable results on every dataset and the highest score on pairs3645.

Results (ragas)

Subset: `triplets_queries1724_contexts3645 (pairs3645)`

Model	Context Precision (avg)	Context Utilization (avg)	Context Relevance (avg)
Qwen/Qwen3-Embedding-0.6B	0.9577	0.9582	0.9098
ALIA-MrBERT-es-biomedical-embeddings (ours)	0.9588	0.9589	0.9102

Subset: `cowese_queries661_contexts551`

Model	Context Relevance (avg)
Qwen/Qwen3-Embedding-0.6B	0.4231
ALIA-MrBERT-es-biomedical-embeddings (ours)	0.3932

Subset: `absanitas_queries800_contexts25192`

Model	Context Relevance (avg)
Qwen/Qwen3-Embedding-0.6B	0.9228
ALIA-MrBERT-es-biomedical-embeddings (ours)	0.9156

Note: Each evaluation subset is named following the pattern {dataset}_queries{N}_contexts{M}, where N is the number of queries evaluated against M contexts taken from the datasets.
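The naming pattern can be parsed mechanically, for example when collecting results per subset. The helper below is a hypothetical convenience, not part of any released tooling.

```python
import re

SUBSET_PATTERN = re.compile(r"^(?P<dataset>.+)_queries(?P<queries>\d+)_contexts(?P<contexts>\d+)$")


def parse_subset_name(name: str) -> tuple[str, int, int]:
    """Split '{dataset}_queries{N}_contexts{M}' into (dataset, N, M)."""
    match = SUBSET_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected subset name: {name!r}")
    return match["dataset"], int(match["queries"]), int(match["contexts"])


subset_names = [
    "triplets_queries1724_contexts3645",
    "cowese_queries661_contexts551",
    "absanitas_queries800_contexts25192",
]
for name in subset_names:
    print(parse_subset_name(name))
```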

Limitations and Biases

Known Limitations

Domain specificity: The model is optimized for Spanish biomedical texts. Performance may degrade on general-domain or other specialized texts.
Language: Although MrBERT-es supports Spanish and English, this fine-tuned model focuses on Spanish biomedical content.
Clinical accuracy: Semantic similarity does not guarantee medical correctness. Retrieved documents should be verified by qualified healthcare professionals.
Context length: Despite supporting up to 8,192 tokens, very long documents may still require chunking strategies for optimal retrieval performance.
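One simple chunking strategy for the long-document case mentioned above is a sliding window over token IDs. The sketch below is one possible approach, not a recommendation from the authors: the window size of 512 and overlap of 64 are hypothetical values, and in practice the token IDs would come from the model's tokenizer.

```python
def chunk_tokens(token_ids: list[int], max_len: int = 512, overlap: int = 64) -> list[list[int]]:
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary keeps some of its surrounding context.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


chunks = chunk_tokens(list(range(1200)))  # stand-in for real token IDs
print([(c[0], c[-1], len(c)) for c in chunks])
```

Each chunk can then be embedded separately, and a document scored by its best-matching chunk.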

Biases

The model may reflect biases present in the biomedical corpora used for training.
It may underperform on subdomains underrepresented in the source data.

Additional Information

License

Apache License, Version 2.0

Citation

If you use this model in your research, please cite:

@misc{ALIA-MrBERT-es-biomedical-embeddings,
  title        = {ALIA MrBERT Spanish Biomedical Embeddings Model},
  author       = {SINAI Research Group, Universidad de Jaén},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/SINAI/ALIA-MrBERT-es-biomedical-embeddings}}
}

Please also cite the base model:

@misc{tamayo2026mrbertmodernmultilingualencoders,
      title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation}, 
      author={Daniel Tamayo and Iñaki Lacunza and Paula Rivera-Hidalgo and Severino Da Dalt and Javier Aula-Blasco and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2026},
      eprint={2602.21379},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.21379}, 
}

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ALIA.

Acknowledgments

This model has been developed thanks to CEATIC (Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación) – UJA (Universidad de Jaén), which provided the required computational resources on its clusters.

Contact: ALIA Project - SINAI Research Group - Universidad de Jaén

More Information: SINAI Research Group | ALIA-UJA Project
