---
license: mit
tags:
- chemistry
- smiles
- tokenization
- dynamic-tokenization
- h-net
- hierarchical-networks
- molecular-representation
- polymer
- mamba
- transformer
datasets:
- PI1M
language:
- en
pipeline_tag: feature-extraction
---

# PI1M-1B

**H-Net model for dynamic SMILES tokenization**

Trained on the PI1M polymer dataset for 1.05B bytes (~22 epochs), with 10 SMILES strings concatenated per example, using the 1-stage architecture.

## Model Details

| Property | Value |
|----------|-------|
| **Architecture** | H-Net (Hierarchical Network) |
| **Parameters** | ~350M |
| **Dataset** | PI1M |
| **Training Bytes** | 1.05B |
| **Training Epochs** | 22 |
| **Concatenation** | 10x SMILES per example |
| **Architecture Variant** | 1-stage |
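
To illustrate what "10x SMILES per example" could look like for a byte-level model, here is a minimal sketch that groups SMILES strings and encodes each group as UTF-8 byte values. The function name `build_examples`, the newline separator, and the toy SMILES strings are assumptions made for illustration; the separator and batching actually used in training are not documented in this card.

```python
def build_examples(smiles_list, per_example=10, separator="\n"):
    """Group SMILES strings and encode each group as a list of byte values.

    The separator is an assumption; the card does not state which one was used.
    """
    examples = []
    for start in range(0, len(smiles_list), per_example):
        group = smiles_list[start:start + per_example]
        text = separator.join(group)
        # A byte-level model consumes raw byte values in the range 0..255.
        examples.append(list(text.encode("utf-8")))
    return examples


# Toy polymer-style strings (hypothetical, not taken from PI1M).
toy_smiles = ["*CC*", "*CC(C)*", "*CCO*"] * 4  # 12 strings in total
examples = build_examples(toy_smiles)
print(len(examples), "examples;", len(examples[0]), "bytes in the first one")
```

The last group is shorter than ten strings when the dataset size is not a multiple of `per_example`; whether such remainders were padded, dropped, or kept during training is not specified here.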

### Architecture Layout

1-stage: `['m4', ['T22'], 'm4']`

- **Encoder**: 4 Mamba blocks for byte-level encoding
- **Core**: 22 Transformer blocks with boundary prediction
- **Decoder**: 4 Mamba blocks for final decoding
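
The layout string above can be read mechanically. The following sketch parses a 1-stage layout into named parts; the mapping of `m` to Mamba and `T` to Transformer is inferred from the bullet list, and the helper names `parse_spec` and `describe_layout` are hypothetical, not part of any H-Net codebase.

```python
import re

# Assumed meaning of the one-letter prefixes, inferred from the list above.
BLOCK_KINDS = {"m": "mamba", "T": "transformer"}


def parse_spec(spec):
    """Turn a spec such as 'm4' into ('mamba', 4)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", spec)
    if match is None or match.group(1) not in BLOCK_KINDS:
        raise ValueError(f"unrecognized block spec: {spec!r}")
    return BLOCK_KINDS[match.group(1)], int(match.group(2))


def describe_layout(layout):
    """Split a 1-stage layout [encoder, [core], decoder] into named parts."""
    encoder, core, decoder = layout
    if not (isinstance(core, list) and len(core) == 1):
        raise ValueError("expected a 1-stage layout with a single core spec")
    return {
        "encoder": parse_spec(encoder),
        "core": parse_spec(core[0]),
        "decoder": parse_spec(decoder),
    }


layout = ["m4", ["T22"], "m4"]
description = describe_layout(layout)
print(description)
```

This sketch deliberately rejects deeper nesting, so a 2-stage layout would need a recursive variant.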

## Files

- `checkpoints/checkpoint_bytes_best.pt` - Best checkpoint (lowest validation loss)
- `checkpoints/checkpoint_epoch_*.pt` - Epoch checkpoints
- `metadata.json` - Training configuration and history
- `test_smiles.txt` - Test SMILES used during training
- `visualizations/` - Training evolution GIFs and prediction files

## Usage

```python
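import torch
from pathlib import Path

# Load the checkpoint on CPU (path is relative to the repository root)
checkpoint_path = Path("checkpoints") / "checkpoint_bytes_best.pt"
# PyTorch >= 2.6 defaults to weights_only=True; if loading fails on the extra
# metadata, pass weights_only=False, but only for files you trust.
checkpoint = torch.load(checkpoint_path, map_location="cpu")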

# The checkpoint contains:
# - 'model_state_dict': Model weights
# - 'optimizer_state_dict': Optimizer state
# - 'epoch': Training epoch
# - 'metrics': Training metrics
# - 'cumulative_training_bytes': Total bytes processed

# Load into your H-Net model
# model.load_state_dict(checkpoint['model_state_dict'])
```
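
As a quick sanity check after loading, one can count the entries in the state dict and compare the total with the parameter count in the table above. The sketch below is an assumption-laden helper: `summarize_state_dict` is a hypothetical name, the key names in the demo are invented, and `FakeTensor` only stands in for `torch.Tensor` so the example runs without PyTorch. With a real checkpoint, pass `checkpoint["model_state_dict"]` instead.

```python
def summarize_state_dict(state_dict):
    """Return (total element count, element count per top-level prefix)."""
    total = 0
    by_prefix = {}
    for name, tensor in state_dict.items():
        count = tensor.numel()  # torch.Tensor.numel() gives the element count
        total += count
        prefix = name.split(".", 1)[0]
        by_prefix[prefix] = by_prefix.get(prefix, 0) + count
    return total, by_prefix


class FakeTensor:
    """Stand-in for torch.Tensor, used only to keep this demo dependency-free."""

    def __init__(self, *shape):
        self.shape = shape

    def numel(self):
        n = 1
        for dim in self.shape:
            n *= dim
        return n


# Hypothetical key names; the real checkpoint keys may differ.
demo_state = {
    "encoder.layer0.weight": FakeTensor(8, 4),
    "encoder.layer0.bias": FakeTensor(8),
    "core.attn.weight": FakeTensor(16, 16),
}
total, by_prefix = summarize_state_dict(demo_state)
print(total, by_prefix)
```

A state dict can also contain non-trainable buffers, so the total is an upper bound on the trainable parameter count rather than an exact figure.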

## Performance

### Tokenization Metrics (from paper)

| Metric | Value |
|--------|-------|
| Bits-per-byte (BPB) | 0.64 |
| Mean token length | 2.9 |
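
For readers comparing these numbers with their own runs, the sketch below shows the standard conversion from a cross-entropy loss to bits-per-byte, plus one plausible reading of mean token length. It assumes the loss is a mean negative log-likelihood in nats per byte and that mean token length means bytes per predicted chunk; neither definition is confirmed by this card, and the function names are illustrative.

```python
import math


def bits_per_byte(mean_nll_nats):
    """Convert a mean per-byte loss in nats to bits (divide by ln 2)."""
    return mean_nll_nats / math.log(2)


def mean_token_length(total_bytes, num_chunks):
    """Average number of bytes per chunk (assumed definition)."""
    return total_bytes / num_chunks


# A BPB of 0.64 corresponds to a per-byte loss of 0.64 * ln(2) nats.
loss_nats = 0.64 * math.log(2)
bpb = bits_per_byte(loss_nats)
print(f"loss = {loss_nats:.4f} nats/byte, BPB = {bpb:.2f}")
```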

### Property Prediction (embeddings)

H-Net embeddings outperform RDKit descriptors on classification tasks:
- BBBP: 0.950 AUC (vs 0.927 for RDKit)
- HIV: 0.788 AUC (vs 0.760 for RDKit)

## Citation

```bibtex
@inproceedings{hnet_smiles_2026,
  title={Learning Chemical Grammar: Dynamic Tokenization for SMILES with Hierarchical Networks},
  author={Anonymous},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## Related Models

All models from the paper are available:

**Polymer (PI1M) Models:**
- [PI1M-68M](https://huggingface.co/jordiferrero/PI1M-68M) - 1 epoch, with concatenation
- [PI1M-340M](https://huggingface.co/jordiferrero/PI1M-340M) - 5 epochs, with concatenation
- [PI1M-1B](https://huggingface.co/jordiferrero/PI1M-1B) - 22 epochs, with concatenation (best compression)
- [PI1M-nocat](https://huggingface.co/jordiferrero/PI1M-nocat) - 5 epochs, no concatenation
- [PI1M-2stg](https://huggingface.co/jordiferrero/PI1M-2stg) - 5 epochs, 2-stage architecture

**Molecular (MOSES) Models:**
- [MOSES-340M](https://huggingface.co/jordiferrero/MOSES-340M) - 5 epochs, with concatenation
- [MOSES-nocat](https://huggingface.co/jordiferrero/MOSES-nocat) - 5 epochs, no concatenation
- [MOSES-2stg](https://huggingface.co/jordiferrero/MOSES-2stg) - 5 epochs, 2-stage architecture

## License

MIT License