Omni-DNA-Multitask-1B HGT Detection LoRA (Step 1300, best)
Model Description
QLoRA adapter for Omni-DNA-Multitask-1B fine-tuned for Horizontal Gene Transfer (HGT) detection.
Task: binary classification, detecting genomic islands (horizontally transferred genes).
- Training data: IslandViewer 4 (11,182 train / 2,796 eval, balanced)
- Format: DNA_sequence + HGT_detection_token + label (0 or 1)
- Novel token: Token 4117 (HGT_detection) added to vocabulary
Training Configuration
| Parameter | Value |
|---|---|
| LoRA Rank | 64 |
| LoRA Alpha | 128 |
| Target Modules | att_proj, attn_out, ff_proj, ff_out |
| LoRA Dropout | 0.05 |
| Quantization | 4-bit NF4 (QLoRA) |
| Trainable Params | ~46.1M |
| Learning Rate | 2e-4 (cosine) |
| Batch Size | 8 (grad accum 4) |
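The batch settings above imply an effective batch size and a step-to-epoch conversion that can be checked against the epoch values reported in this card. A minimal sketch, assuming one optimizer step consumes batch size × gradient accumulation examples and that the adapter uses standard LoRA scaling (alpha / rank); the variable and function names are illustrative, not taken from the training code.

```python
# Values taken from this model card.
TRAIN_EXAMPLES = 11_182
BATCH_SIZE = 8
GRAD_ACCUM = 4
LORA_RANK = 64
LORA_ALPHA = 128

# Assumption: one optimizer step sees BATCH_SIZE * GRAD_ACCUM examples.
effective_batch = BATCH_SIZE * GRAD_ACCUM
steps_per_epoch = TRAIN_EXAMPLES / effective_batch

# Assumption: standard LoRA scaling (alpha / r), not rank-stabilized LoRA.
lora_scale = LORA_ALPHA / LORA_RANK


def step_to_epoch(step: int) -> float:
    """Convert an optimizer step count into a fractional epoch."""
    return step / steps_per_epoch


print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"LoRA scale: {lora_scale}")
print(f"step 1300 -> epoch {step_to_epoch(1300):.2f}")
```

Under these assumptions, step 1300 maps to epoch 3.72, which agrees with the epoch reported for the best checkpoint.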
Best Performance (Step 1300, Epoch 3.72)
| Metric | Value |
|---|---|
| AUC | 0.8736 |
| AP | 0.8735 |
| F1 | 0.6857 |
| Accuracy | 0.736 |
AUC Trajectory
| Step | Epoch | AUC | F1 |
|---|---|---|---|
| 100 | 0.29 | 0.6146 | 0.6967 |
| 200 | 0.57 | 0.7620 | 0.7399 |
| 300 | 0.86 | 0.8101 | 0.7069 |
| 400 | 1.14 | 0.8244 | 0.7172 |
| 500 | 1.43 | 0.8359 | 0.6844 |
| 600 | 1.72 | 0.8381 | 0.6697 |
| 700 | 2.00 | 0.8166 | 0.5854 |
| 800 | 2.29 | 0.8015 | 0.3593 |
| 900 | 2.57 | 0.8224 | 0.7445 |
| 1000 | 2.86 | 0.8213 | 0.5407 |
| 1100 | 3.14 | 0.8227 | 0.7000 |
| 1200 | 3.43 | 0.7214 | 0.3933 |
| 1300 | 3.72 | 0.8736 | 0.6857 (best) |
| 1400 | 4.00 | 0.8267 | 0.7701 |
| 1500 | 4.29 | 0.8389 | 0.7795 |
| 1600 | 4.57 | 0.8048 | 0.5260 |
| 1700 | 4.86 | 0.7646 | 0.4451 |
| 1800 | 5.14 | 0.7948 | 0.4070 |
| 1900 | 5.43 | 0.8115 | 0.5714 |
| 2000 | 5.72 | 0.7920 | 0.5168 |
| 2100 | 6.00 | 0.7985 | 0.5528 |
| 2200 | 6.29 | 0.7355 | 0.3631 |
| 2300 | 6.57 | 0.7689 | 0.3873 |
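The trajectory can also be scanned programmatically to see which checkpoint wins on each metric. The sketch below copies the (step, AUC, F1) rows from the table and picks the maximum of each column; the `TRAJECTORY` name is illustrative.

```python
# (step, AUC, F1) rows copied from the trajectory table in this card.
TRAJECTORY = [
    (100, 0.6146, 0.6967), (200, 0.7620, 0.7399), (300, 0.8101, 0.7069),
    (400, 0.8244, 0.7172), (500, 0.8359, 0.6844), (600, 0.8381, 0.6697),
    (700, 0.8166, 0.5854), (800, 0.8015, 0.3593), (900, 0.8224, 0.7445),
    (1000, 0.8213, 0.5407), (1100, 0.8227, 0.7000), (1200, 0.7214, 0.3933),
    (1300, 0.8736, 0.6857), (1400, 0.8267, 0.7701), (1500, 0.8389, 0.7795),
    (1600, 0.8048, 0.5260), (1700, 0.7646, 0.4451), (1800, 0.7948, 0.4070),
    (1900, 0.8115, 0.5714), (2000, 0.7920, 0.5168), (2100, 0.7985, 0.5528),
    (2200, 0.7355, 0.3631), (2300, 0.7689, 0.3873),
]

# Select the checkpoint with the highest value of each metric.
best_auc = max(TRAJECTORY, key=lambda row: row[1])
best_f1 = max(TRAJECTORY, key=lambda row: row[2])

print(f"best AUC: step {best_auc[0]} (AUC {best_auc[1]:.4f})")
print(f"best F1:  step {best_f1[0]} (F1 {best_f1[2]:.4f})")
```

The two metrics peak at different checkpoints (step 1300 for AUC, step 1500 for F1), so which checkpoint is "best" depends on whether ranking or thresholded classification matters more.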
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model and tokenizer, then register the task token
base = AutoModelForCausalLM.from_pretrained('yahmaachi/omni-dna-multitask-1b')
tokenizer = AutoTokenizer.from_pretrained('yahmaachi/omni-dna-multitask-1b')
tokenizer.add_tokens(['HGT_detection'])
base.resize_token_embeddings(len(tokenizer))

# Attach the LoRA adapter and switch to inference mode
model = PeftModel.from_pretrained(base, 'Nhoodie/omni-dna-hgt-lora-best')
model.eval()

dna = 'ATGCGATCGATCGATCGATC...'  # your sequence
inputs = tokenizer(dna + 'HGT_detection', return_tensors='pt')

with torch.no_grad():
    # Logits at the last position, where the model predicts the label token
    logits = model(**inputs).logits[:, -1, :]
    # Softmax over the full vocabulary
    prob = torch.softmax(logits, dim=-1)

# token 4097 = 1 (HGT), token 4096 = 0 (not HGT)
hgt_prob = prob[0, 4097].item()
print(f'HGT probability: {hgt_prob:.4f}')
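The usage code takes the softmax over the whole vocabulary, so the probability for token 4097 also competes with every non-label token. If a probability restricted to the two label tokens is preferred, one option is to renormalize over just those two logits. A minimal sketch using only the standard library; `two_class_probability` is a hypothetical helper, and its inputs would come from `logits[0, 4096].item()` and `logits[0, 4097].item()` in the usage code.

```python
import math


def two_class_probability(logit_not_hgt: float, logit_hgt: float) -> float:
    """Softmax over the two label logits only; returns P(HGT)."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logit_not_hgt, logit_hgt)
    e_neg = math.exp(logit_not_hgt - m)
    e_pos = math.exp(logit_hgt - m)
    return e_pos / (e_neg + e_pos)


# Illustrative logits, not real model outputs.
print(f"{two_class_probability(1.0, 1.0):.2f}")  # equal logits -> 0.50
```

This does not change the ranking of sequences by much in practice if the label tokens already dominate the distribution, but it guarantees that the two class probabilities sum to 1.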
Training Notes
Training showed a collapse-and-recovery pattern: after overfitting between roughly epochs 2 and 3.4 (AUC fell to 0.7214 at step 1200), the novel token (HGT_detection, ID 4117) underwent a representational reorganization, and AUC reached a new best of 0.8736 at step 1300.
Steps 1300 and 1500 are best treated as sibling models: different representational equilibria rather than one being a refinement of the other. Step 1300 gives the best ranking (AUC 0.8736), while step 1500 gives the best classification (F1 0.7795).
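Because AUC measures ranking while F1 depends on the decision threshold, a checkpoint with high AUC can sometimes improve its F1 by moving the threshold away from 0.5. The sketch below shows one way to pick a threshold on held-out data. It is not part of the training procedure described in this card, and `f1_at_threshold`, `best_f1_threshold`, and the toy scores are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(scores, labels):
    """Try each observed score as a threshold; return (threshold, F1)."""
    candidates = sorted(set(scores))
    return max(
        ((t, f1_at_threshold(scores, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Toy data for illustration only, not model outputs.
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
labels = [0, 0, 1, 0, 1, 1]

threshold, f1 = best_f1_threshold(scores, labels)
print(f"F1 at 0.5: {f1_at_threshold(scores, labels, 0.5):.3f}")
print(f"best threshold: {threshold:.2f} (F1 {f1:.3f})")
```

In the toy data every score is below 0.5, so the default threshold predicts no positives at all, while a lower threshold recovers most of them. Any threshold chosen this way should be selected on a validation split and reported separately from test metrics.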
License
MIT