Omni-DNA-Multitask-1B HGT Detection LoRA (Step 1300, Best)

Model Description

QLoRA adapter for Omni-DNA-Multitask-1B fine-tuned for Horizontal Gene Transfer (HGT) detection.

Task: Binary classification - detect whether a sequence belongs to a genomic island (a region acquired through horizontal gene transfer).

Training data: IslandViewer 4 (11,182 train / 2,796 eval, balanced)
Format: DNA_sequence + HGT_detection_token + label (0 or 1)
Novel token: Token 4117 (HGT_detection) added to vocabulary

Training Configuration

Parameter	Value
LoRA Rank	64
LoRA Alpha	128
Target Modules	att_proj, attn_out, ff_proj, ff_out
LoRA Dropout	0.05
Quantization	4-bit NF4 (QLoRA)
Trainable Params	~46.1M
Learning Rate	2e-4 (cosine)
Batch Size	8 (grad accum 4)
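
As a rough cross-check of the step and epoch numbers reported in this card, the batch settings above can be turned into optimizer steps per epoch. This is a minimal sketch that assumes a single training device (the card does not state the device count); the function name is illustrative, and the training framework may count the partial final batch slightly differently, so epoch fractions can differ in the second decimal.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Approximate optimizer steps needed to see every training example once."""
    # Effective batch = examples consumed per optimizer update.
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_examples / effective_batch)

# Values from this card: 11,182 training examples, batch size 8, grad accum 4.
# num_devices=1 is an assumption, not something the card states.
steps = optimizer_steps_per_epoch(11_182, per_device_batch=8, grad_accum=4)
approx_epoch_at_1300 = 1300 / steps

print(f"steps per epoch: {steps}")
print(f"approximate epoch at step 1300: {approx_epoch_at_1300:.2f}")
```

Under these assumptions step 1300 falls a little past 3.7 epochs, in line with the epoch 3.72 reported for the best checkpoint.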

Best Performance (Step 1300, Epoch 3.72)

Metric	Value
AUC	0.8736
AP	0.8735
F1	0.6857
Accuracy	0.736

AUC Trajectory

Step	Epoch	AUC	F1
100	0.29	0.6146	0.6967
200	0.57	0.7620	0.7399
300	0.86	0.8101	0.7069
400	1.14	0.8244	0.7172
500	1.43	0.8359	0.6844
600	1.72	0.8381	0.6697
700	2.00	0.8166	0.5854
800	2.29	0.8015	0.3593
900	2.57	0.8224	0.7445
1000	2.86	0.8213	0.5407
1100	3.14	0.8227	0.7000
1200	3.43	0.7214	0.3933
1300	3.72	0.8736	0.6857 BEST
1400	4.00	0.8267	0.7701
1500	4.29	0.8389	0.7795
1600	4.57	0.8048	0.5260
1700	4.86	0.7646	0.4451
1800	5.14	0.7948	0.4070
1900	5.43	0.8115	0.5714
2000	5.72	0.7920	0.5168
2100	6.00	0.7985	0.5528
2200	6.29	0.7355	0.3631
2300	6.57	0.7689	0.3873
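
The table above can be scanned programmatically to see which checkpoint wins under each metric. The sketch below simply copies the (step, AUC, F1) values from the table and picks the maximum of each column; it is illustrative and not part of the training code.

```python
# (step, AUC, F1) rows copied from the trajectory table above.
trajectory = [
    (100, 0.6146, 0.6967), (200, 0.7620, 0.7399), (300, 0.8101, 0.7069),
    (400, 0.8244, 0.7172), (500, 0.8359, 0.6844), (600, 0.8381, 0.6697),
    (700, 0.8166, 0.5854), (800, 0.8015, 0.3593), (900, 0.8224, 0.7445),
    (1000, 0.8213, 0.5407), (1100, 0.8227, 0.7000), (1200, 0.7214, 0.3933),
    (1300, 0.8736, 0.6857), (1400, 0.8267, 0.7701), (1500, 0.8389, 0.7795),
    (1600, 0.8048, 0.5260), (1700, 0.7646, 0.4451), (1800, 0.7948, 0.4070),
    (1900, 0.8115, 0.5714), (2000, 0.7920, 0.5168), (2100, 0.7985, 0.5528),
    (2200, 0.7355, 0.3631), (2300, 0.7689, 0.3873),
]

# Select the best checkpoint separately for each metric.
best_by_auc = max(trajectory, key=lambda row: row[1])
best_by_f1 = max(trajectory, key=lambda row: row[2])

print(f"best AUC: step {best_by_auc[0]} (AUC {best_by_auc[1]:.4f}, F1 {best_by_auc[2]:.4f})")
print(f"best F1:  step {best_by_f1[0]} (AUC {best_by_f1[1]:.4f}, F1 {best_by_f1[2]:.4f})")
```

The two criteria select different checkpoints (step 1300 by AUC, step 1500 by F1), so the choice of checkpoint depends on whether ranking or thresholded classification matters more for the application.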

Usage

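import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then register the task token used in training.
base = AutoModelForCausalLM.from_pretrained('yahmaachi/omni-dna-multitask-1b')
tokenizer = AutoTokenizer.from_pretrained('yahmaachi/omni-dna-multitask-1b')
tokenizer.add_tokens(['HGT_detection'])
base.resize_token_embeddings(len(tokenizer))

# Attach the LoRA adapter and switch to inference mode.
model = PeftModel.from_pretrained(base, 'Nhoodie/omni-dna-hgt-lora-best')
model.eval()

dna = 'ATGCGATCGATCGATCGATC...'  # your sequence
# Prompt format matches training: DNA sequence followed by the task token.
inputs = tokenizer(dna + 'HGT_detection', return_tensors='pt')
with torch.no_grad():
    # Logits at the final input position, which predict the next token (the label).
    logits = model(**inputs).logits[:, -1, :]
    # Softmax runs over the full vocabulary, so P(4096) + P(4097) need not equal 1.
    prob = torch.softmax(logits, dim=-1)
    # Token 4097 encodes label 1 (HGT); token 4096 encodes label 0 (not HGT).
    hgt_prob = prob[0, 4097].item()
print(f'HGT probability: {hgt_prob:.4f}')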
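
The snippet above takes a softmax over the whole vocabulary, so the reported value is the probability of token 4097 among all tokens. If a probability normalized over only the two label tokens is preferred, one option is to apply a two-way softmax to the logits of tokens 4096 and 4097. The helper below is a minimal sketch of that idea; the function name is hypothetical, and this is not necessarily how the metrics in this card were computed.

```python
import math

def two_class_prob(logit_neg, logit_pos):
    """Softmax restricted to two logits, i.e. sigmoid(logit_pos - logit_neg)."""
    diff = logit_pos - logit_neg
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

# Example with made-up logits for the "not HGT" and "HGT" label tokens.
example = two_class_prob(logit_neg=1.0, logit_pos=3.0)
print(f"P(HGT | label tokens only) = {example:.4f}")
```

With the usage code above, the inputs would be `logits[0, 4096].item()` and `logits[0, 4097].item()`. Because this renormalization can reorder examples relative to the full-vocabulary probability, ranking metrics computed on it may differ from those reported here.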

Training Notes

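Training showed a collapse-and-recovery pattern. After an early peak (AUC 0.8381 at step 600), performance degraded between roughly epoch 2 and epoch 3.4, apparently from overfitting, and AUC fell to 0.7214 at step 1200. The novel token (HGT_detection, #4117) then appears to have undergone a representational reorganization, and AUC rebounded to a new best of 0.8736 at step 1300.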

Steps 1300 and 1500 are best viewed as sibling models: different representational equilibria rather than one being a refinement of the other. Step 1300 gives the best ranking quality (AUC 0.8736, F1 0.6857), while step 1500 gives the best thresholded classification (F1 0.7795, AUC 0.8389).
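
To illustrate why a checkpoint can lead on AUC while trailing on F1, the toy example below computes both metrics in plain Python. AUC depends only on how the scores order positives against negatives, whereas F1 depends on where the scores fall relative to a decision threshold. The data is made up for illustration, the 0.5 threshold is an assumption, and this is not the evaluation code used for this card.

```python
def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at(labels, scores, threshold=0.5):
    """F1 of the positive class after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

labels = [0, 0, 0, 1, 1, 1]
scores_a = [0.10, 0.20, 0.60, 0.40, 0.70, 0.90]
scores_b = [s * 0.5 for s in scores_a]  # same ordering, all scores pushed lower

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(f"{name}: AUC={auc(labels, scores):.4f}  F1@0.5={f1_at(labels, scores):.4f}")
```

Both score sets rank the examples identically and so share the same AUC, but the second never crosses the threshold and its F1 collapses. Tuning the decision threshold on held-out data is one way to recover F1 from a checkpoint chosen for its AUC.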

License

MIT
