penpal-quality-assurance

A small MLX ResNet that scores a 256×256 grayscale handwriting raster on [0, 1]: 1 = legible, human-style handwriting, 0 = corrupted or illegible output. Trained to filter synthetic handwriting produced by Graves-style generative models before it's used downstream.

~36k parameters (channels 4 / 8 / 16 / 32 / 32)
Single-file safetensors weights
Apple Silicon / MLX

Inputs

Shape [B, 256, 256, 1] (MLX NHWC), float32
0.0 = background, 1.0 = ink
The renderer in render.py (or graves_handwriting_mlx.quality.render_strokes) fits each sample's stroke bounding box isotropically (one scale factor for both axes, so the aspect ratio is preserved) into the canvas with 12 px padding
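
The isotropic fit described above can be sketched as follows. This is a minimal illustration, not the code in render.py: the helper name fit_points_to_canvas is hypothetical, and centering the shorter axis inside the padded area is an assumption, since the card only states the uniform scale and the 12 px padding.

```python
CANVAS = 256   # canvas side in pixels, from the input contract
PADDING = 12   # padding in pixels, from the renderer description


def fit_points_to_canvas(points):
    """Map (x, y) points into the padded canvas using one uniform scale.

    Hypothetical helper for illustration; not taken from render.py.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y

    usable = CANVAS - 2 * PADDING                 # 232 px in each direction
    # One factor for both axes keeps the aspect ratio; the epsilon guards
    # against a degenerate bounding box (a single point).
    scale = usable / max(width, height, 1e-6)

    # Assumption: center the shorter axis inside the usable area.
    off_x = PADDING + (usable - width * scale) / 2
    off_y = PADDING + (usable - height * scale) / 2
    return [((x - min_x) * scale + off_x, (y - min_y) * scale + off_y)
            for x, y in points]
```

Because the longer side always fills the usable 232 px, a long line of text ends up with a smaller glyph height than a short word.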

Output

Raw logits. Apply mx.sigmoid for a probability in [0, 1].

Usage

With the graves-handwriting-mlx package installed:

import mlx.core as mx
from graves_handwriting_mlx.quality import QualityClassifier, render_strokes

model = QualityClassifier.from_pretrained("breitburg/penpal-quality-assurance")

# `strokes` is the project's nested word -> stroke -> point schema
image = render_strokes(strokes)                       # [256, 256, 1]
score = mx.sigmoid(model(mx.array(image)[None]))[0]   # probability in [0, 1], as an mx.array

Without the package, download the weights directly:

from huggingface_hub import hf_hub_download
weights_path = hf_hub_download("breitburg/penpal-quality-assurance", "weights.safetensors")
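
Once the file is downloaded, its contents can be inspected without any ML framework. The sketch below relies only on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names read_safetensors_header and count_parameters are hypothetical and not part of this repository.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} for a file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)   # optional free-form metadata entry
    return header


def count_parameters(header):
    """Sum the element counts of every tensor listed in the header."""
    total = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total
```

Calling count_parameters(read_safetensors_header(weights_path)) gives a rough check against the parameter count stated above; the total may differ if the file also stores non-trainable buffers such as normalization statistics. With MLX available, mx.load(weights_path) should return the same tensors as a dict of arrays.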

Training data

Real (label 1.0) and corrupted-synthetic (label 0.0) strokes are rasterized through the same renderer so the classifier cannot use rendering style as a shortcut.

Positive — real human handwriting strokes (IAM-OnDB-derived collections)
Negative — strokes generated by the Graves model with internal state corruption applied during sampling (attention κ scale, attention β floor, hidden-state Gaussian noise) in a 10 / 70 / 20 mixture of very mild / mild / gibberish corruption ranges
Mid (label 0.5) — clean samples from breitburg/penpal, which sit between the real and corrupted clusters

Loss is BCE-with-logits over the soft {0.0, 0.5, 1.0} labels.
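
To make the effect of the soft labels concrete, here is a minimal pure-Python sketch of binary cross-entropy on a single logit. It uses the standard numerically stable form and is not the project's training code; the function name bce_with_logits is chosen for illustration.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy for one logit and a target in [0, 1].

    Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))]:
        max(z, 0) - z*t + log(1 + exp(-|z|))
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# A hard label rewards a confident logit: the loss approaches 0.
loss_real = bce_with_logits(10.0, 1.0)

# A 0.5 label is minimized at logit 0 (probability 0.5), where the loss
# equals log(2) rather than 0; moving in either direction costs more.
loss_mid_at_zero = bce_with_logits(0.0, 0.5)
loss_mid_pushed = bce_with_logits(2.0, 0.5)
```

Under this loss, samples labelled 0.5 pull the predicted probability toward 0.5 instead of toward either extreme.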

Evaluation

Distribution of scores on 500 random rows from each source:

Source	Mean	Median	p10	p25	p75	p90	≥0.3	≥0.5	≥0.7	≥0.9
held-out real handwriting	0.675	0.669	0.390	0.500	0.881	0.969	96.4 %	75.0 %	46.0 %	22.8 %
`breitburg/penpal` (clean synthetic)	0.418	0.396	0.321	0.352	0.452	0.529	100 %	13.8 %	3.0 %	0.6 %

The lowest-scoring penpal rows are genuinely degraded; the highest-scoring rows look indistinguishable from real handwriting. A residual length / scale bias exists: longer texts render smaller and tend to score lower. This is acceptable for filtering, but worth knowing.
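
For anyone re-running this evaluation on their own rows, the columns of the table above could be computed from a list of sigmoid scores roughly as follows. This is a sketch under assumptions: summarize_scores is a hypothetical helper, and the percentile interpolation used for the published numbers is not stated, so the p10 to p90 values may differ slightly from another implementation.

```python
import statistics


def summarize_scores(scores, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Summarize probabilities in [0, 1] into one table row (needs >= 2 scores)."""
    # 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    row = {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "p10": cuts[9],
        "p25": cuts[24],
        "p75": cuts[74],
        "p90": cuts[89],
    }
    for t in thresholds:
        # Percentage of rows whose score is at or above the threshold.
        row[f">={t}"] = 100.0 * sum(s >= t for s in scores) / len(scores)
    return row
```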

Suggested thresholds

0.3 — lenient: keeps essentially all of penpal, drops only the obvious failures
0.5 — balanced: drops ~86 % of penpal, keeps 75 % of real
0.7 — strict: keeps only confidently human-looking rows (~46 % of real)
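
Applying one of these thresholds to raw model outputs might look like the sketch below. It is framework-free so it can run anywhere; the names THRESHOLDS and keep_indices are illustrative, and the inclusive comparison follows the ≥ columns of the evaluation table.

```python
import math

# Preset names and values taken from the list above.
THRESHOLDS = {"lenient": 0.3, "balanced": 0.5, "strict": 0.7}


def sigmoid(z):
    """Numerically stable logistic function for a single float."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def keep_indices(logits, mode="balanced"):
    """Return the indices of rows whose probability meets the chosen threshold."""
    threshold = THRESHOLDS[mode]
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]


# Example with made-up logits, one per generated sample.
example_logits = [-2.0, -0.5, 0.0, 1.0, 3.0]
kept = keep_indices(example_logits, mode="strict")
```

Inside an MLX pipeline the same comparison can be done on the array returned by mx.sigmoid; the pure-Python version is only meant to show the thresholding step.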

Files

weights.safetensors — trained parameters
config.json — architecture widths and input contract
model.py — QualityClassifier / BasicBlock reference implementation
render.py — render_strokes for stroke → 256×256 raster

License

MIT.
