Cross-Person Generalization in Egocentric Video + IMU Activity Recognition

A benchmark study scaling from a 13K-clip pilot to 500K clips from the Egocentric-1M dataset: ~1,000 workers and ~3,750 hours of paired egocentric video and 200Hz IMU.

Author: Shubham Rasal
Status: 🚧 Proposal — Experiments Pending

TL;DR

How do egocentric video + IMU activity recognition models fail when moving from seen to unseen workers? We plan to benchmark 11+ models across IMU-only, video-only, and multimodal setups using 5-fold cross-person evaluation, at a scale 50-80× larger than any prior ego+IMU study.
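The 5-fold cross-person evaluation mentioned above can be sketched as a worker-level split: every clip from a given worker lands in exactly one test fold, so test workers are never seen during training. This is a minimal sketch, not the benchmark's actual split code; the function name `cross_person_folds`, the round-robin assignment, and the toy worker IDs are assumptions made for illustration.

```python
import random

def cross_person_folds(worker_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no worker appears on both sides.

    worker_ids: one worker ID per clip, in clip order.
    """
    workers = sorted(set(worker_ids))
    random.Random(seed).shuffle(workers)
    # Deal workers round-robin so fold sizes differ by at most one worker.
    fold_of = {w: i % n_folds for i, w in enumerate(workers)}
    for k in range(n_folds):
        train = [i for i, w in enumerate(worker_ids) if fold_of[w] != k]
        test = [i for i, w in enumerate(worker_ids) if fold_of[w] == k]
        yield train, test

# Hypothetical toy index: 10 workers with 3 clips each (not real data).
clip_workers = [f"w{i:02d}" for i in range(10) for _ in range(3)]
folds = list(cross_person_folds(clip_workers))
```

scikit-learn's `GroupKFold` offers comparable grouped splitting when workers are passed as `groups`; the hand-rolled version is shown only to make the no-overlap constraint explicit.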

Research Questions

Cross-person gap: How much does accuracy degrade from within-person → cross-person evaluation?
Modality robustness: Which modality (video vs. IMU vs. fused) degrades less under person shift?
Distillation transfer: Does video→IMU distillation (COMODO) improve cross-person robustness?
Hard case prediction: Can IMU statistics predict which workers will cause generalization failure?
Self-supervised pre-training: Does EVI-MAE pre-training on all workers close the cross-person gap?
Scaling laws (new at 500K): How does cross-person accuracy scale from 13K → 50K → 100K → 250K → 500K clips?
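One way to make the cross-person gap in the first question concrete is to score the same model on evaluation clips from workers seen in training and on clips from unseen workers, then take the difference. The sketch below assumes plain accuracy as the metric and a flat list of `(worker_id, true_label, predicted_label)` records; both choices, the function names, and the toy records are illustrative assumptions, not results or the study's fixed protocol.

```python
def accuracy(pairs):
    """Fraction of (true, predicted) label pairs that match."""
    pairs = list(pairs)
    return sum(t == p for t, p in pairs) / len(pairs)

def cross_person_gap(records, train_workers):
    """Return (within_person_acc, cross_person_acc, gap).

    records: evaluation clips as (worker_id, true_label, predicted_label).
    train_workers: workers whose other clips were used for training.
    """
    train_workers = set(train_workers)
    seen = [(t, p) for w, t, p in records if w in train_workers]
    unseen = [(t, p) for w, t, p in records if w not in train_workers]
    within, cross = accuracy(seen), accuracy(unseen)
    return within, cross, within - cross

# Hypothetical toy predictions, made up purely to exercise the function.
records = [
    ("w1", "working", "working"),
    ("w1", "break", "break"),
    ("w2", "working", "working"),
    ("w2", "break", "working"),
    ("w3", "working", "break"),
    ("w3", "not-on-person", "not-on-person"),
    ("w3", "break", "working"),
    ("w3", "working", "working"),
]
within, cross, gap = cross_person_gap(records, train_workers={"w1", "w2"})
```

A positive gap means the model does worse on unseen workers; computing it per modality gives a direct comparison for the modality-robustness question.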

Dataset

Property	Pilot (Now)	Full Scale
Workers	100	~1,000+
Clips	12,997	500,000
Duration	~97.5 hours	~3,750 hours
Size	1.22 TiB	~47 TiB
IMU	200Hz acc + gyro	200Hz acc + gyro
Labels	working / break / not-on-person	working / break / not-on-person
Domain	Industrial / workplace	Industrial / workplace
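The full-scale figures in the table are consistent with a linear extrapolation from the pilot. The arithmetic below is a quick sanity check using only the numbers in the table; the average clip length and per-clip IMU sample count are derived here and are not stated in the source.

```python
PILOT_CLIPS = 12_997
PILOT_HOURS = 97.5
PILOT_TIB = 1.22
FULL_CLIPS = 500_000
IMU_HZ = 200

# Average clip length implied by the pilot: about 27.0 seconds.
clip_seconds = PILOT_HOURS * 3600 / PILOT_CLIPS

# Linear extrapolation to 500K clips: about 3,751 hours and about 46.9 TiB.
full_hours = FULL_CLIPS * clip_seconds / 3600
full_tib = PILOT_TIB * FULL_CLIPS / PILOT_CLIPS

# IMU samples per sensor axis in one clip at 200Hz, assuming 27 s clips: 5,400.
samples_per_clip = round(clip_seconds) * IMU_HZ
```

The extrapolation assumes the full-scale clips share the pilot's average duration and bitrate, which the matching table values suggest but do not guarantee.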

Baselines

Category	Models
IMU-only	DeepConvLSTM, Attend & Discriminate, CrossHAR, MOMENT-small, Mantis
Video-only	TimeSformer-Base, VideoMAE ViT-B, SlowFast R50
Distillation	COMODO (MOMENT), COMODO (Mantis), IMU2CLIP
Self-supervised	EVI-MAE, IMU-only MAE

Compute Budget

Resource	Hours	Cost
CPU	160h	$96
A10G GPU (24 GB)	1,705h	$3,410
A100x4 GPU (4× 80 GB)	150h	$2,400
Subtotal	2,015h	$5,906
30% contingency		+$1,772
Total		~$7,678
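The budget table can be re-derived from its own rows. The check below uses only the hours and costs listed above; the implied hourly rates are computed from those two columns and are not quoted prices from any provider.

```python
# (hours, cost in USD) per resource, copied from the table.
budget = {
    "CPU": (160, 96),
    "A10G GPU (24 GB)": (1705, 3410),
    "A100x4 GPU (4x 80 GB)": (150, 2400),
}

total_hours = sum(hours for hours, _ in budget.values())   # 2,015 h
subtotal = sum(cost for _, cost in budget.values())        # $5,906
contingency = round(0.30 * subtotal)                       # $1,772
total = subtotal + contingency                             # $7,678

# Implied hourly rates (derived): CPU 0.60, A10G 2.00, A100x4 16.00 USD/h.
rates = {name: cost / hours for name, (hours, cost) in budget.items()}
```

The A10G line accounts for most of both the hours and the cost, so changes to the number of A10G runs move the total more than anything else.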

Storage: ~5.2 TiB active (238 GiB IMU + 22 GiB embeddings + 4.8 TiB selective video + 100 GiB checkpoints)

Timeline

16 weeks, in two experimental stages plus a write-up phase:

Weeks 1-4: Pilot validation on 13K clips (fast iteration, pipeline debug)
Weeks 5-14: Full-scale experiments on 500K clips
Weeks 15-16: Synthesis and paper write-up

See the full proposal for a detailed week-by-week breakdown.

Key References

COMODO — Video→IMU distillation
EVI-MAE — Multimodal masked autoencoder
Generalizable HAR Survey — Cross-person/device/position taxonomy
IMU-Video OOD HAR — Cross-modal SSL for OOD transfer
MoBind — Hierarchical IMU-pose alignment

License

MIT

Papers for ShubhamRasal/cross-person-ego-imu-benchmark

MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment (arXiv:2602.19004, published Feb 22)
Towards Generalizable Human Activity Recognition: A Survey (arXiv:2508.12213, published Aug 17, 2025)
Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning (arXiv:2507.13482, published Jul 17, 2025)
COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition (arXiv:2503.07259, published Mar 10, 2025)
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition (arXiv:2407.06628, published Jul 9, 2024)