---
license: mit
tags:
- chemistry
- smiles
- tokenization
- dynamic-tokenization
- h-net
- hierarchical-networks
- molecular-representation
- polymer
- mamba
- transformer
datasets:
- PI1M
language:
- en
pipeline_tag: feature-extraction
---

# PI1M-1B

**H-Net model for dynamic SMILES tokenization**

Trained on the PI1M polymer dataset for 1.05B bytes (~22 epochs) with 10x SMILES concatenation, using the 1-stage architecture.

## Model Details

| Property | Value |
|----------|-------|
| **Architecture** | H-Net (Hierarchical Network) |
| **Parameters** | ~350M |
| **Dataset** | PI1M |
| **Training Bytes** | 1.05B |
| **Training Epochs** | ~22 |
| **Concatenation** | 10x SMILES per example |
| **Architecture Variant** | 1-stage |

### Architecture Layout

1-stage: `['m4', ['T22'], 'm4']`

- **Encoder**: 4 Mamba blocks for byte-level encoding
- **Core**: 22 Transformer blocks with boundary prediction
- **Decoder**: 4 Mamba blocks for final decoding

## Files

- `checkpoints/checkpoint_bytes_best.pt` - Best checkpoint (lowest validation loss)
- `checkpoints/checkpoint_epoch_*.pt` - Epoch checkpoints
- `metadata.json` - Training configuration and history
- `test_smiles.txt` - Test SMILES used during training
- `visualizations/` - Training evolution GIFs and prediction files

## Usage

```python
import torch
from pathlib import Path

# Load the best checkpoint onto the CPU
checkpoint_path = Path("checkpoints/checkpoint_bytes_best.pt")
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# The checkpoint contains:
# - 'model_state_dict': Model weights
# - 'optimizer_state_dict': Optimizer state
# - 'epoch': Training epoch
# - 'metrics': Training metrics
# - 'cumulative_training_bytes': Total bytes processed

# Load into your H-Net model
# model.load_state_dict(checkpoint['model_state_dict'])
```

## Performance

### Tokenization Metrics (from paper)

| Metric | Value |
|--------|-------|
| Bits-per-byte (BPB) | 0.64 |
| Mean token length | 2.9 |

### Property Prediction (embeddings)

H-Net embeddings outperform RDKit descriptors on classification tasks:

- BBBP: 0.950 AUC (vs 0.927 for RDKit)
- HIV: 0.788 AUC (vs 0.760 for RDKit)

## Citation

```bibtex
@inproceedings{hnet_smiles_2026,
  title={Learning Chemical Grammar: Dynamic Tokenization for SMILES with Hierarchical Networks},
  author={Anonymous},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## Related Models

All models from the paper are available:

**Polymer (PI1M) Models:**

- [PI1M-68M](https://huggingface.co/jordiferrero/PI1M-68M) - 1 epoch, with concatenation
- [PI1M-340M](https://huggingface.co/jordiferrero/PI1M-340M) - 5 epochs, with concatenation
- [PI1M-1B](https://huggingface.co/jordiferrero/PI1M-1B) - 22 epochs, with concatenation (best compression)
- [PI1M-nocat](https://huggingface.co/jordiferrero/PI1M-nocat) - 5 epochs, no concatenation
- [PI1M-2stg](https://huggingface.co/jordiferrero/PI1M-2stg) - 5 epochs, 2-stage architecture

**Molecular (MOSES) Models:**

- [MOSES-340M](https://huggingface.co/jordiferrero/MOSES-340M) - 5 epochs, with concatenation
- [MOSES-nocat](https://huggingface.co/jordiferrero/MOSES-nocat) - 5 epochs, no concatenation
- [MOSES-2stg](https://huggingface.co/jordiferrero/MOSES-2stg) - 5 epochs, 2-stage architecture

## License

MIT License
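
## Example: Tokenization Metrics

The two tokenization metrics listed under Performance can be related to quantities that are easy to compute. The sketch below is a minimal illustration, not the evaluation code from the paper. It assumes that the loss is a mean per-byte cross-entropy measured in nats, and that mean token length is the number of bytes divided by the number of chunk boundaries. The function names and the example inputs are hypothetical.

```python
import math
from typing import Sequence


def bits_per_byte(mean_nll_nats: float) -> float:
    """Convert a mean per-byte negative log-likelihood from nats to bits."""
    # Assumption: the loss is averaged over bytes and uses the natural log.
    return mean_nll_nats / math.log(2)


def mean_token_length(boundaries: Sequence[bool]) -> float:
    """Average number of bytes per chunk.

    Assumption: `boundaries` holds one flag per byte, and True marks the
    first byte of a new chunk.
    """
    n_chunks = sum(1 for flag in boundaries if flag)
    if n_chunks == 0:
        raise ValueError("expected at least one chunk boundary")
    return len(boundaries) / n_chunks


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    example_loss_nats = 0.45
    example_boundaries = [True, False, False, True, False, False]
    print(f"BPB: {bits_per_byte(example_loss_nats):.3f}")
    print(f"Mean token length: {mean_token_length(example_boundaries):.1f}")
```

A loss of exactly ln 2 nats per byte corresponds to 1 bit per byte, so lower values indicate better compression.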