Cross-Person Generalization in Egocentric Video + IMU Activity Recognition

A benchmark study scaling from a 13K-clip pilot to 500K clips from the Egocentric-1M dataset: ~1,000 workers and ~3,750 hours of paired egocentric video and 200 Hz IMU.

Author: Shubham Rasal
Status: 🚧 Proposal — Experiments Pending

TL;DR

How do egocentric video + IMU activity recognition models fail when moving from seen to unseen workers? We benchmark 11+ models across IMU-only, video-only, and multimodal setups using 5-fold cross-person evaluation at a scale 50-80× larger than any prior ego+IMU study.
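
One way to set up a 5-fold cross-person evaluation is to assign whole workers, rather than individual clips, to folds, so that no worker appears in both train and test. The sketch below is a minimal pure-Python illustration under that assumption; the function name `cross_person_folds`, the `worker_ids` input, and the toy worker IDs are hypothetical and are not taken from the benchmark code.

```python
import random
from collections import defaultdict


def cross_person_folds(worker_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no worker is on both sides.

    worker_ids: one worker identifier per clip (hypothetical input format).
    """
    # Group clip indices by worker so a worker's clips always move together.
    clips_by_worker = defaultdict(list)
    for idx, worker in enumerate(worker_ids):
        clips_by_worker[worker].append(idx)

    # Shuffle workers deterministically, then deal them round-robin into folds.
    workers = sorted(clips_by_worker)
    random.Random(seed).shuffle(workers)
    fold_workers = [workers[i::n_folds] for i in range(n_folds)]

    for held_out in fold_workers:
        held_out_set = set(held_out)
        test_idx = [i for w in held_out for i in clips_by_worker[w]]
        train_idx = [i for w in workers if w not in held_out_set
                     for i in clips_by_worker[w]]
        yield sorted(train_idx), sorted(test_idx)


# Toy example: 10 made-up workers with 3 clips each.
toy_worker_ids = [f"w{n:02d}" for n in range(10) for _ in range(3)]
folds = list(cross_person_folds(toy_worker_ids))
for train_idx, test_idx in folds:
    train_workers = {toy_worker_ids[i] for i in train_idx}
    test_workers = {toy_worker_ids[i] for i in test_idx}
    # The defining property of a cross-person split.
    assert train_workers.isdisjoint(test_workers)
```

A within-person baseline would instead split each worker's clips across train and test; comparing the two is what exposes the cross-person gap.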

Research Questions

  1. Cross-person gap: How much does accuracy degrade from within-person → cross-person evaluation?
  2. Modality robustness: Which modality (video vs. IMU vs. fused) degrades less under person shift?
  3. Distillation transfer: Does video→IMU distillation (COMODO) improve cross-person robustness?
  4. Hard case prediction: Can IMU statistics predict which workers will cause generalization failure?
  5. Self-supervised pre-training: Does EVI-MAE pre-training on all workers close the cross-person gap?
  6. Scaling laws (new at 500K): How does cross-person accuracy scale from 13K → 50K → 100K → 250K → 500K clips?
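
As a rough illustration of research questions 1 and 4, the cross-person gap can be read as within-person accuracy minus cross-person accuracy, and per-worker accuracy on held-out workers shows which individuals drive the drop. The sketch below assumes that definition; the helper names and all labels, predictions, and worker IDs are invented toy values, not results.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_worker_accuracy(y_true, y_pred, worker_ids):
    """Accuracy computed separately for each held-out worker."""
    grouped = defaultdict(lambda: ([], []))
    for t, p, w in zip(y_true, y_pred, worker_ids):
        grouped[w][0].append(t)
        grouped[w][1].append(p)
    return {w: accuracy(ts, ps) for w, (ts, ps) in grouped.items()}


# Toy predictions (made up) using the three labels from the dataset table.
within_true = ["working", "working", "break", "not-on-person", "working"]
within_pred = ["working", "working", "break", "not-on-person", "break"]

cross_true = ["working", "break", "break", "not-on-person", "working"]
cross_pred = ["working", "working", "break", "working", "working"]
cross_workers = ["w07", "w07", "w07", "w09", "w09"]

within_acc = accuracy(within_true, within_pred)
cross_acc = accuracy(cross_true, cross_pred)

# Assumed definition: positive values mean accuracy drops on unseen workers.
gap = within_acc - cross_acc
per_worker = per_worker_accuracy(cross_true, cross_pred, cross_workers)

print(f"within={within_acc:.2f} cross={cross_acc:.2f} gap={gap:.2f}")
print(per_worker)
```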

Dataset

| | Pilot (Now) | Full Scale |
|---|---|---|
| Workers | 100 | ~1,000+ |
| Clips | 12,997 | 500,000 |
| Duration | ~97.5 hours | ~3,750 hours |
| Size | 1.22 TiB | ~47 TiB |
| IMU | 200 Hz acc + gyro | 200 Hz acc + gyro |
| Labels | working / break / not-on-person | working / break / not-on-person |
| Domain | Industrial / workplace | Industrial / workplace |
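
The figures above imply an average clip length and per-clip footprint that should match between the pilot and the full-scale set. The short check below derives them from the table values only; the variable and function names are illustrative, and the per-clip numbers are averages, not a statement about how clips are actually segmented.

```python
SECONDS_PER_HOUR = 3600
MIB_PER_TIB = 1024 * 1024
IMU_RATE_HZ = 200  # from the dataset table

# Values copied from the dataset table.
scales = {
    "pilot": {"clips": 12_997, "hours": 97.5, "size_tib": 1.22},
    "full": {"clips": 500_000, "hours": 3_750, "size_tib": 47},
}


def clip_stats(clips, hours, size_tib):
    """Average per-clip duration, storage, and IMU samples per channel."""
    seconds = hours * SECONDS_PER_HOUR / clips
    return {
        "seconds_per_clip": seconds,
        "mib_per_clip": size_tib * MIB_PER_TIB / clips,
        "imu_samples_per_channel": seconds * IMU_RATE_HZ,
    }


stats = {name: clip_stats(**row) for name, row in scales.items()}
for name, s in stats.items():
    print(
        f"{name}: {s['seconds_per_clip']:.1f} s/clip, "
        f"{s['mib_per_clip']:.1f} MiB/clip, "
        f"{s['imu_samples_per_channel']:.0f} IMU samples/channel"
    )
```

Both scales come out at roughly 27 seconds and a little under 100 MiB per clip, so the full-scale estimates are consistent with a linear extrapolation from the pilot.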

Baselines

| Category | Models |
|---|---|
| IMU-only | DeepConvLSTM, Attend & Discriminate, CrossHAR, MOMENT-small, Mantis |
| Video-only | TimeSformer-Base, VideoMAE ViT-B, SlowFast R50 |
| Distillation | COMODO (MOMENT), COMODO (Mantis), IMU2CLIP |
| Self-supervised | EVI-MAE, IMU-only MAE |

Compute Budget

| Resource | Hours | Cost |
|---|---|---|
| CPU | 160h | $96 |
| A10G GPU (24 GB) | 1,705h | $3,410 |
| A100x4 GPU (4× 80 GB) | 150h | $2,400 |
| Subtotal | 2,015h | $5,906 |
| 30% contingency | | +$1,772 |
| Total | | ~$7,678 |
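
The budget arithmetic can be reproduced from the table rows. The snippet below is only a sanity check on those numbers; the implied hourly rates are derived by dividing cost by hours and are not quoted provider prices.

```python
# (resource, hours, cost in USD), copied from the compute budget table.
budget = [
    ("CPU", 160, 96),
    ("A10G GPU (24 GB)", 1705, 3410),
    ("A100x4 GPU (4x 80 GB)", 150, 2400),
]
CONTINGENCY_RATE = 0.30

subtotal_hours = sum(hours for _, hours, _ in budget)
subtotal_cost = sum(cost for _, _, cost in budget)
contingency = round(subtotal_cost * CONTINGENCY_RATE)
total_cost = subtotal_cost + contingency

# Implied hourly rate per resource (derived, not an official price list).
implied_rates = {name: cost / hours for name, hours, cost in budget}

print(f"Subtotal: {subtotal_hours}h, ${subtotal_cost}")
print(f"Contingency: ${contingency}")
print(f"Total: ${total_cost}")
for name, rate in implied_rates.items():
    print(f"{name}: ${rate:.2f}/h")
```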

Storage: ~5.4 TiB active (238 GiB IMU + 22 GiB embeddings + 4.8 TiB selective video + 100 GiB checkpoints)

Timeline

16 weeks: two experimental stages followed by write-up:

  • Weeks 1-4: Pilot validation on 13K clips (fast iteration, pipeline debug)
  • Weeks 5-14: Full-scale experiments on 500K clips
  • Weeks 15-16: Synthesis and paper write-up

See full proposal for detailed week-by-week breakdown.

License

MIT
