Arabic Document Extractor — Qwen2.5-VL-3B + QLoRA

🏭 Purpose: Extract structured data from Arabic PDF work orders, invoices, tables, and documents for factory automation.

Model Details

Attribute	Value
Base Model	Qwen/Qwen2.5-VL-3B-Instruct
Method	QLoRA (4-bit NF4) SFT via TRL
LoRA	rank=16, alpha=32, all-linear (vision + language)
Training Recipe	Based on QARI-OCR — SOTA Arabic OCR
Hyperparameters	lr=2e-4, effective batch size=8, 2 epochs, linear LR schedule, AdamW
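As a rough illustration of what these hyperparameters imply, the sketch below computes the LoRA scaling factor (alpha / rank), the number of optimizer steps for a run, and a linearly decaying learning rate. The sample count, the absence of warmup, and all function names are assumptions made for illustration; the actual train.py may differ.

```python
import math

LORA_RANK = 16
LORA_ALPHA = 32
BASE_LR = 2e-4
EFFECTIVE_BATCH = 8  # per-device batch size x gradient accumulation steps
EPOCHS = 2


def lora_scaling(alpha: int = LORA_ALPHA, rank: int = LORA_RANK) -> float:
    """Standard LoRA scales the low-rank update by alpha / rank."""
    return alpha / rank


def total_steps(num_samples: int, batch: int = EFFECTIVE_BATCH,
                epochs: int = EPOCHS) -> int:
    """Optimizer steps over the whole run (a partial last batch counts as a step)."""
    return math.ceil(num_samples / batch) * epochs


def linear_lr(step: int, steps: int, base_lr: float = BASE_LR) -> float:
    """Linear decay from base_lr at step 0 to 0 at the final step (no warmup assumed)."""
    return base_lr * max(0.0, (steps - step) / steps)


if __name__ == "__main__":
    steps = total_steps(num_samples=4000)  # 4000 is a made-up sample count
    print(lora_scaling(), steps, linear_lr(steps // 2, steps))
```

With rank=16 and alpha=32 the adapter update is scaled by 2, which is a common pairing with a learning rate around 2e-4.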

Training Data

Dataset	Samples	Task
Misraj/Misraj-DocOCR	~thousands	Arabic document → Markdown
Misraj/KITAB_pdf_to_markdown_reviewed	~hundreds	Expert-reviewed PDF → Markdown
ahmedheakl/arocrbench_tables	~hundreds	Arabic tables → structured JSON

Capabilities

✅ Arabic OCR — Read printed Arabic text from scanned documents
✅ Structured Extraction — Extract key-value pairs as JSON from work orders
✅ Table Extraction — Convert Arabic financial/data tables to structured JSON
✅ Markdown Conversion — Convert Arabic PDFs to formatted Markdown
✅ Bilingual — Handles mixed Arabic/English documents

Quick Start

Installation

pip install transformers peft torch accelerate qwen-vl-utils Pillow

Inference

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
from qwen_vl_utils import process_vision_info
import torch

# Load base + adapter (device_map="auto" requires the accelerate package)
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "optiviseapp/arabic-doc-extractor-qwen25vl-3b")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Extract from work order
image = Image.open("work_order.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        # Prompt in English: "Extract all the data from this work order in JSON format"
        {"type": "text", "text": "استخرج جميع البيانات من أمر العمل هذا بصيغة JSON"}
    ],
}]

# Render the chat template to text, then collect the image inputs it refers to
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=2000)
result = processor.batch_decode(
    [o[len(i):] for i, o in zip(inputs.input_ids, output)],
    skip_special_tokens=True
)[0]
print(result)

Work Order Extraction Prompt (Arabic)

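استخرج جميع المعلومات من هذه الوثيقة بصيغة JSON منظمة تشمل:
- رقم_الأمر، التاريخ، القسم، الوردية
- اسم_العامل، المهمة، الأولوية، الحالة

English translation (for reference; send the Arabic text above to the model):

Extract all information from this document as structured JSON, including:
- order_number, date, department, shift
- worker_name, task, priority, status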
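The model's reply is free text, so it may wrap the JSON in a Markdown fence or add a sentence around it. A minimal sketch of parsing the reply and checking for the fields named in the prompt above follows. Treating all eight fields as required, and the function names `parse_work_order` and `missing_keys`, are assumptions for illustration.

```python
import json
import re

# Keys named in the Arabic prompt: order number, date, department, shift,
# worker name, task, priority, status. Requiring all of them is an assumption.
REQUIRED_KEYS = [
    "رقم_الأمر", "التاريخ", "القسم", "الوردية",
    "اسم_العامل", "المهمة", "الأولوية", "الحالة",
]


def parse_work_order(raw: str) -> dict:
    """Pull the JSON object out of the model's reply and parse it."""
    # Take everything from the first "{" to the last "}", which skips
    # any code fence or commentary surrounding the object.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


def missing_keys(data: dict) -> list:
    """Return the expected keys that the model did not produce."""
    return [key for key in REQUIRED_KEYS if key not in data]
```

A page whose `missing_keys` list is non-empty can be routed to manual review instead of being passed downstream.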

Training

Run Training

pip install transformers trl torch datasets trackio accelerate peft bitsandbytes qwen-vl-utils

# Set your HF token
export HF_TOKEN=your_token_here

# Run training (needs a GPU with 16-24 GB of VRAM; see Hardware Requirements)
python train.py

Via HF Jobs

huggingface-cli jobs run train.py \
  --hardware a10g-large \
  --timeout 6h \
  --dependencies transformers trl torch datasets trackio accelerate peft bitsandbytes qwen-vl-utils

Hardware Requirements

Stage	GPU VRAM	Recommended
Training (QLoRA)	16-24 GB	A10G, A6000, RTX 4090
Inference (4-bit)	6-8 GB	RTX 3060+, T4
Inference (bf16)	12-16 GB	A10G, RTX 4090
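For a sense of where these numbers come from, the sketch below estimates the memory taken by the weights alone at each precision. The parameter count is an assumed round figure for a "3B" model, and the estimate ignores quantization metadata, activations, image tokens, and the KV cache, which is presumably why the table lists noticeably more VRAM.

```python
# Approximate parameter count for a "3B" model; the exact figure is an assumption.
NUM_PARAMS = 3.0e9

BITS_PER_PARAM = {"nf4": 4, "bf16": 16}


def weight_gib(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GiB (no activations, KV cache, or overhead)."""
    bytes_total = num_params * BITS_PER_PARAM[precision] / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    for precision in BITS_PER_PARAM:
        print(f"{precision}: {weight_gib(NUM_PARAMS, precision):.1f} GiB")
```

High-resolution pages produce many image tokens, so peak usage depends heavily on the input size as well as on the precision.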

🏗️ Factory Integration

For your factory automation platform:

1. PDF Upload → Convert pages to images (pdf2image library)
2. Extract → Run this model on each page with the work order prompt
3. Parse JSON → Feed structured data to your shift assignment system
4. Assign → Auto-assign shifts based on extracted work order fields

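import json

from pdf2image import convert_from_path  # requires poppler to be installed

# extract_from_image and assign_shifts are helpers you provide: an inference
# wrapper around the model and your own shift-assignment hook.

# Convert the uploaded PDF to one PIL image per page
pages = convert_from_path("uploaded_work_order.pdf", dpi=200)

# Extract from each page
for page_number, page in enumerate(pages, start=1):
    result = extract_from_image(model, processor, page, task="work_order")
    try:
        work_order_data = json.loads(result)
    except json.JSONDecodeError:
        # The reply was not valid JSON; skip it or route it to manual review
        print(f"Page {page_number}: could not parse JSON")
        continue
    # Feed to your shift assignment system
    assign_shifts(work_order_data)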
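The loop above calls `extract_from_image`. One possible shape for that helper, assembled from the Quick Start inference steps, is sketched here. The `TASK_PROMPTS` table, `build_messages`, and the exact signature are assumptions for illustration, not part of any published API.

```python
# Prompt per task. "work_order" reuses the Arabic prompt from Quick Start
# ("Extract all the data from this work order in JSON format").
TASK_PROMPTS = {
    "work_order": "استخرج جميع البيانات من أمر العمل هذا بصيغة JSON",
}


def build_messages(image, task: str = "work_order") -> list:
    """Build the single-turn chat message: one page image plus the task prompt."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": TASK_PROMPTS[task]},
        ],
    }]


def extract_from_image(model, processor, image, task: str = "work_order",
                       max_new_tokens: int = 2000) -> str:
    """Run the model on one PIL image and return the decoded reply text."""
    # Imported lazily so this module can be imported without qwen-vl-utils.
    from qwen_vl_utils import process_vision_info

    messages = build_messages(image, task)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, return_tensors="pt"
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is decoded.
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

Keeping the prompts in a dictionary makes it straightforward to add further tasks, such as table extraction, without touching the inference code.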

Improving Results

For best results on YOUR specific work orders:

1. Collect 100-500 annotated examples of your actual work orders with JSON ground truth
2. Add them to the training data and re-run fine-tuning
3. Use the QARI synthetic pipeline: Render your work order HTML templates → PDF → images with Arabic text variations
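An annotated example pairs a page image with the JSON you expect the model to output. The sketch below writes such pairs as chat-style records in a JSONL file. The record layout (an `image` path plus a `messages` list) and the function names are assumptions; check them against what train.py actually loads before relying on this format.

```python
import json


def make_record(image_path: str, prompt: str, ground_truth: dict) -> dict:
    """One supervised example: the page image, the prompt, and the target JSON."""
    # ensure_ascii=False keeps Arabic text readable instead of \u escapes.
    target = json.dumps(ground_truth, ensure_ascii=False)
    return {
        "image": image_path,
        "messages": [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": target},
            ]},
        ],
    }


def write_jsonl(records: list, path: str) -> None:
    """Write one JSON record per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Using the same prompt at training time and at inference time keeps the fine-tuned behavior aligned with how the model is called in production.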

Related Models & References

Model	CER	WER	Notes
QARI-OCR v0.2	0.061	0.160	SOTA open-source Arabic OCR
AIN-7B	—	0.28	Best Arabic multimodal (7B)
Baseer	—	0.25	Best doc-to-markdown
This model	TBD	TBD	Specialized for structured extraction
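To fill in the TBD entries, CER and WER can be computed as edit distance divided by reference length, over characters and words respectively. The following is a minimal sketch. It applies no text normalization (diacritics, whitespace, digit forms), whereas published benchmarks may normalize, so scores are only comparable if the same preprocessing is used.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Averaging these scores over a held-out set of your own documents gives numbers on the same scale as the table above.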

Paper: QARI-OCR (arXiv:2506.02295)
Paper: AIN (arXiv:2502.00094)
TRL Docs: SFT VLM Training