Cross-Person Generalization in Egocentric Video + IMU Activity Recognition
A benchmark study scaling from a 13K-clip pilot to 500K clips from the Egocentric-1M dataset: ~1,000 workers and ~3,750 hours of paired egocentric video and 200 Hz IMU.
Author: Shubham Rasal
Status: 🚧 Proposal — Experiments Pending
TL;DR
How do egocentric video + IMU activity recognition models fail when moving from seen to unseen workers? We benchmark 11+ models across IMU-only, video-only, and multimodal setups using 5-fold cross-person evaluation at a scale 50-80× larger than any prior ego+IMU study.
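The defining property of cross-person evaluation is that no worker contributes clips to both the training and the test side of a fold. The following is a minimal sketch of one way to build such folds, not the proposal's actual pipeline. The function name `cross_person_folds`, the per-clip worker-ID list, and the greedy balancing rule are all assumptions made for illustration.

```python
from collections import defaultdict


def cross_person_folds(clip_workers, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no worker appears on both sides.

    clip_workers: one worker ID per clip (a hypothetical metadata field).
    Workers are assigned to folds greedily, largest first, to balance clip counts.
    """
    clips_by_worker = defaultdict(list)
    for idx, worker in enumerate(clip_workers):
        clips_by_worker[worker].append(idx)
    if len(clips_by_worker) < n_folds:
        raise ValueError("need at least one worker per fold")

    folds = [[] for _ in range(n_folds)]
    # Largest workers first; each one goes to the currently smallest fold.
    ordered = sorted(clips_by_worker, key=lambda w: (-len(clips_by_worker[w]), str(w)))
    for worker in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(clips_by_worker[worker])

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_folds) if j != k for i in folds[j])
        yield train_idx, test_idx


# Toy example: 10 workers with 3 clips each (made-up IDs, not real data).
workers = [f"w{i:02d}" for i in range(10) for _ in range(3)]
splits = list(cross_person_folds(workers))
for train_idx, test_idx in splits:
    train_workers = {workers[i] for i in train_idx}
    test_workers = {workers[i] for i in test_idx}
    assert train_workers.isdisjoint(test_workers)  # unseen workers only
```

A within-person baseline would instead split each worker's own clips across train and test, so comparing the two protocols on the same model isolates the effect of person shift.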
Research Questions
- Cross-person gap: How much does accuracy degrade from within-person → cross-person evaluation?
- Modality robustness: Which modality (video vs. IMU vs. fused) degrades less under person shift?
- Distillation transfer: Does video→IMU distillation (COMODO) improve cross-person robustness?
- Hard case prediction: Can IMU statistics predict which workers will cause generalization failure?
- Self-supervised pre-training: Does EVI-MAE pre-training on all workers close the cross-person gap?
- Scaling laws (new at 500K): How does cross-person accuracy scale from 13K → 50K → 100K → 250K → 500K clips?
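For the scaling question, one common design is to make the training subsets nested, so that each larger scale contains every clip of the smaller ones and accuracy differences reflect added data rather than a different sample. The sketch below assumes this design; the proposal does not state how subsets are drawn, and sampling by clip rather than by worker is also an assumption. `nested_subsets` and the integer clip IDs are hypothetical.

```python
import random

# Clip counts taken from the scaling question above.
SCALE_STEPS = [13_000, 50_000, 100_000, 250_000, 500_000]


def nested_subsets(clip_ids, sizes=SCALE_STEPS, seed=0):
    """Return {size: clip IDs}, where each smaller subset is a prefix of each larger one."""
    order = list(clip_ids)
    if max(sizes) > len(order):
        raise ValueError("largest subset exceeds the available clips")
    random.Random(seed).shuffle(order)  # one fixed shuffle shared by all scales
    return {n: order[:n] for n in sorted(sizes)}


# Placeholder IDs 0..499,999 stand in for real clip identifiers.
subsets = nested_subsets(range(500_000))
```

Fixing the seed keeps the subsets reproducible across runs, so every model is trained on identical data at each scale.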
Dataset
| | Pilot (Now) | Full Scale |
|---|---|---|
| Workers | 100 | ~1,000+ |
| Clips | 12,997 | 500,000 |
| Duration | ~97.5 hours | ~3,750 hours |
| Size | 1.22 TiB | ~47 TiB |
| IMU | 200 Hz acc + gyro | 200 Hz acc + gyro |
| Labels | working / break / not-on-person | working / break / not-on-person |
| Domain | Industrial / workplace | Industrial / workplace |
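The full-scale figures in the table are consistent with a linear extrapolation from the pilot. The short check below assumes that full-scale clips have the same mean length and storage footprint as pilot clips, which the proposal does not state explicitly. The constants are copied from the table and the variable names are illustrative.

```python
PILOT_CLIPS = 12_997
PILOT_HOURS = 97.5
PILOT_TIB = 1.22
FULL_CLIPS = 500_000

# Per-clip averages implied by the pilot column.
seconds_per_clip = PILOT_HOURS * 3600 / PILOT_CLIPS
tib_per_clip = PILOT_TIB / PILOT_CLIPS

# Linear extrapolation to the full-scale clip count.
full_hours = FULL_CLIPS * seconds_per_clip / 3600
full_tib = FULL_CLIPS * tib_per_clip

print(f"mean clip length:   {seconds_per_clip:.1f} s")
print(f"projected duration: {full_hours:,.0f} h")
print(f"projected size:     {full_tib:.1f} TiB")
```

The pilot implies a mean clip length of about 27 seconds, and the projections land close to the ~3,750 hours and ~47 TiB listed for the full-scale set.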
Baselines
| Category | Models |
|---|---|
| IMU-only | DeepConvLSTM, Attend & Discriminate, CrossHAR, MOMENT-small, Mantis |
| Video-only | TimeSformer-Base, VideoMAE ViT-B, SlowFast R50 |
| Distillation | COMODO (MOMENT), COMODO (Mantis), IMU2CLIP |
| Self-supervised | EVI-MAE, IMU-only MAE |
Compute Budget
| Resource | Hours | Cost |
|---|---|---|
| CPU | 160h | $96 |
| A10G GPU (24 GB) | 1,705h | $3,410 |
| A100x4 GPU (4× 80 GB) | 150h | $2,400 |
| Subtotal | 2,015h | $5,906 |
| 30% contingency | | +$1,772 |
| Total | | ~$7,678 |
Storage: ~5.2 TiB active (238 GiB IMU + 22 GiB embeddings + 4.8 TiB selective video + 100 GiB checkpoints)
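The budget totals can be reproduced directly from the line items in the table above. In the sketch below the hours and costs are copied from the table. The hourly rates are derived from those figures rather than quoted from any provider's price list, and rounding the contingency to the nearest dollar is an assumption.

```python
# (resource, hours, cost in USD), copied from the compute budget table
LINE_ITEMS = [
    ("CPU", 160, 96),
    ("A10G GPU (24 GB)", 1705, 3410),
    ("A100x4 GPU (4x 80 GB)", 150, 2400),
]
CONTINGENCY_RATE = 0.30

total_hours = sum(hours for _, hours, _ in LINE_ITEMS)
subtotal = sum(cost for _, _, cost in LINE_ITEMS)
contingency = round(subtotal * CONTINGENCY_RATE)  # rounded to whole dollars
total = subtotal + contingency

# Hourly rates implied by the table (derived, not quoted prices).
implied_rates = {name: cost / hours for name, hours, cost in LINE_ITEMS}

for name, rate in implied_rates.items():
    print(f"{name}: ${rate:.2f}/h")
print(f"subtotal: {total_hours} h, ${subtotal}")
print(f"contingency: ${contingency}, total: ${total}")
```

Keeping the line items in one structure makes it easy to re-run the totals if the GPU-hour estimates change after the pilot stage.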
Timeline
16 weeks in three stages:
- Weeks 1-4: Pilot validation on 13K clips (fast iteration, pipeline debug)
- Weeks 5-14: Full-scale experiments on 500K clips
- Weeks 15-16: Synthesis and paper write-up
See the full proposal for a detailed week-by-week breakdown.
Key References
- COMODO — Video→IMU distillation
- EVI-MAE — Multimodal masked autoencoder
- Generalizable HAR Survey — Cross-person/device/position taxonomy
- IMU-Video OOD HAR — Cross-modal SSL for OOD transfer
- MoBind — Hierarchical IMU-pose alignment
License
MIT