penpal-quality-assurance

A small MLX ResNet that scores a 256×256 grayscale handwriting raster on [0, 1]: 1 = legible, human-style handwriting, 0 = corrupted or illegible output. Trained to filter synthetic handwriting produced by Graves-style generative models before it's used downstream.

  • ~36k parameters (channels 4 / 8 / 16 / 32 / 32)
  • Single-file safetensors weights
  • Apple Silicon / MLX

Inputs

  • Shape [B, 256, 256, 1] (MLX NHWC), float32
  • 0.0 = background, 1.0 = ink
  • The renderer in render.py (or graves_handwriting_mlx.quality.render_strokes) fits each stroke bbox isotropically into the canvas with 12 px padding
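
The isotropic fit can be sketched in a few lines of plain Python. This only illustrates the geometry: `fit_to_canvas` is a hypothetical helper, and centering the drawing on the canvas is an assumption. The actual render.py may align, flip, or rasterize strokes differently.

```python
def fit_to_canvas(points, size=256, pad=12):
    """Map (x, y) points into a size x size canvas, keeping aspect ratio.

    Hypothetical helper, not the project's render_strokes. Centering the
    scaled bounding box is an assumption.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y

    # One scale for both axes (isotropic): the longer side fills the
    # area left after reserving `pad` pixels on every edge.
    usable = size - 2 * pad
    scale = usable / max(width, height, 1e-9)

    # Center the scaled bounding box on the canvas (assumed).
    off_x = (size - width * scale) / 2
    off_y = (size - height * scale) / 2
    return [((x - min_x) * scale + off_x, (y - min_y) * scale + off_y)
            for x, y in points]


corners = fit_to_canvas([(0, 0), (100, 50)])
```

Under these assumptions a 100 × 50 bounding box maps to roughly x in [12, 244] and y in [70, 186]: the wider axis spans the padded canvas and the shorter one is scaled by the same factor.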

Output

Raw logits. Apply mx.sigmoid for a probability in [0, 1].
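
Because the sigmoid is monotonic, a probability threshold can also be applied directly to the raw logit. The sketch below uses only the standard library to show that equivalence; `sigmoid` and `logit` are illustrative helpers, not part of the package, and the example logit is made up.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function for a single float."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p):
    """Inverse of sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Thresholding the probability at t is the same test as thresholding
# the raw logit at logit(t), so the sigmoid can be skipped when only a
# keep/drop decision is needed.
raw = 0.4  # example logit, made up for illustration
t = 0.7
keep_by_probability = sigmoid(raw) >= t
keep_by_logit = raw >= logit(t)
```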

Usage

With the graves-handwriting-mlx package installed:

import mlx.core as mx
from graves_handwriting_mlx.quality import QualityClassifier, render_strokes

model = QualityClassifier.from_pretrained("breitburg/penpal-quality-assurance")

# `strokes` is the project's nested word -> stroke -> point schema
image = render_strokes(strokes)                       # [256, 256, 1]
score = mx.sigmoid(model(mx.array(image)[None]))[0]   # float in [0, 1]

Without the package, download the weights directly:

from huggingface_hub import hf_hub_download
weights_path = hf_hub_download("breitburg/penpal-quality-assurance", "weights.safetensors")

Training data

Real (label 1.0) and corrupted-synthetic (label 0.0) strokes are rasterized through the same renderer so the classifier cannot use rendering style as a shortcut.

  • Positive: real human handwriting strokes (IAM-OnDB-derived collections)
  • Negative: strokes generated by the Graves model with internal state corruption applied during sampling (attention κ scale, attention β floor, hidden-state Gaussian noise) in a 10 / 70 / 20 mixture of very mild / mild / gibberish corruption ranges
  • Mid (label 0.5): clean samples from breitburg/penpal, which sit between the real and corrupted clusters

Loss is BCE-with-logits over the soft {0.0, 0.5, 1.0} labels.
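
A minimal standard-library sketch of BCE-with-logits on a soft target shows how a 0.5 label pulls the prediction toward a probability of 0.5. This is not the training code, which this card does not include; `bce_with_logits` is an illustrative helper.

```python
import math


def bce_with_logits(z, y):
    """Binary cross-entropy on a raw logit z with a target y in [0, 1].

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)), which
    equals -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# With a soft label of 0.5 the loss is smallest at z = 0 (p = 0.5),
# so "mid" samples are pushed toward the middle of the score range.
mid_losses = {z: bce_with_logits(z, 0.5) for z in (-2.0, 0.0, 2.0)}
```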

Evaluation

Distribution of scores on 500 random rows from each source:

| Source | Mean | Median | p10 | p25 | p75 | p90 | ≥0.3 | ≥0.5 | ≥0.7 | ≥0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| held-out real handwriting | 0.675 | 0.669 | 0.390 | 0.500 | 0.881 | 0.969 | 96.4 % | 75.0 % | 46.0 % | 22.8 % |
| breitburg/penpal (clean synthetic) | 0.418 | 0.396 | 0.321 | 0.352 | 0.452 | 0.529 | 100 % | 13.8 % | 3.0 % | 0.6 % |

The lowest-scoring penpal rows are genuinely degraded; the highest-scoring rows look indistinguishable from real handwriting. A residual length / scale bias exists (longer texts render smaller and tend to score lower); this is acceptable for filtering, but worth knowing.

Suggested thresholds

  • 0.3: lenient, keeps essentially all of penpal, drops only the obvious failures
  • 0.5: balanced, drops ~86 % of penpal, keeps 75 % of real
  • 0.7: strict, keeps only confidently human-looking rows (~46 % of real)
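
Applying a threshold is a simple filter once scores are probabilities. The sketch below assumes scores have already been passed through a sigmoid and treats the threshold as inclusive, in line with the thresholded columns in the Evaluation section. `filter_rows` and the sample scores are made up for illustration.

```python
def filter_rows(rows, scores, threshold=0.5):
    """Keep rows whose score is at or above `threshold`.

    Returns the kept rows and the fraction kept, which can be compared
    against the keep rates reported in the Evaluation section.
    """
    if len(rows) != len(scores):
        raise ValueError("rows and scores must have the same length")
    kept = [row for row, s in zip(rows, scores) if s >= threshold]
    keep_rate = len(kept) / len(rows) if rows else 0.0
    return kept, keep_rate


# Made-up scores, only to show the call shape.
rows = ["a", "b", "c", "d"]
scores = [0.25, 0.41, 0.50, 0.93]
kept, keep_rate = filter_rows(rows, scores, threshold=0.5)
```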

Files

  • weights.safetensors: trained parameters
  • config.json: architecture widths and input contract
  • model.py: QualityClassifier / BasicBlock reference implementation
  • render.py: render_strokes for stroke → 256×256 raster

License

MIT.
