Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study at 500K Scale

Author: Shubham Rasal — shubhamrasal.com
Date: April 2026
Status: Proposal Draft

1. Motivation & Problem Statement

Egocentric video and body-worn IMU sensors are the sensing backbone of AR glasses, smartwatches, and workplace wearables. Activity recognition from these modalities has made rapid progress — EVI-MAE reaches 87.96 mAP on CMU-MMAC, COMODO achieves 59.1% top-1 on Ego4D with IMU alone, and MoBind achieves state-of-the-art cross-modal retrieval. But all of these results are reported on within-distribution splits where training and test data come from the same population of users.

The real-world failure mode is cross-person generalization. When you train on workers A–H and deploy to worker J, performance degrades — often catastrophically. The recent survey on generalizable HAR (Cai et al., 2025) formalizes this as:

U_train ∩ U_test = ∅, p(X, y | U_train) ≠ p(X, y | U_test)

This is the setting that matters for any commercial wearable deployment, yet it remains systematically understudied because existing ego+IMU datasets have too few subjects (CMU-MMAC: 12, WEAR: 18, MMEA: unknown) to support meaningful cross-person evaluation at scale.

The Opportunity

We will use 500K clips from the Egocentric-1M dataset — drawing from a pool of ~1,000 workers performing industrial/workplace tasks. This is orders of magnitude larger than any existing ego+IMU benchmark and enables studies that are statistically impossible on current datasets.

Additionally, we have early access to a 13K-clip, 100-worker pilot subset (the Egocentric Native Compression dataset) for rapid prototyping and pipeline validation before scaling to 500K.

Property	Pilot (Available Now)	Full Scale (500K)
Workers	100	~1,000+ (estimated)
Clips	12,997	500,000
Duration	~97.5 hours	~3,750 hours
Total size	1.22 TiB	~47 TiB
IMU data	6.18 GiB	~238 GiB
5s windows	~65K	~2.5M
IMU rate	200 Hz (acc + gyro)	200 Hz (acc + gyro)
Activity labels	working / break / not-on-person	working / break / not-on-person
Domain	Industrial / workplace	Industrial / workplace

From preliminary exploration of the pilot dataset, I've observed consistent differences in motion patterns across workers performing similar tasks — exactly the distribution shift that drives generalization failures. With ~1,000 workers and 500K clips, we can study this at a scale that no prior work has approached.

2. Research Questions

How much does cross-person distribution shift degrade ego+IMU activity recognition?
Quantify the gap between within-person and cross-person accuracy across IMU-only, video-only, and multimodal setups.
Which modality is more robust to person shift — video or IMU?
IMU captures body dynamics (person-specific); video captures visual scene (potentially person-invariant). Which degrades less under cross-person shift?
Does video→IMU distillation improve cross-person robustness?
COMODO distills video knowledge into IMU encoders for same-distribution settings. Does the distilled representation also transfer better to unseen workers?
Can we predict which workers will be "hard cases" before deployment?
Use IMU statistics (spectral features, signal energy, temporal patterns) to predict which unseen workers will cause generalization failure — enabling targeted data collection or adaptation.
Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?
EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
What are the scaling laws for cross-person generalization? (New at 500K scale)
How does downstream cross-person accuracy change as we increase training data from 13K → 50K → 100K → 250K → 500K clips? Is there a data scaling law for cross-person robustness?

3. Dataset Overview

3.1 Pilot Dataset (Available Now)

Source: gs://build-ai-egocentric-native-compression/

Metric	Value
Total workers	100 (selected from 150, 103 complete)
Total clips	12,997
Total duration	~97.5 hours
Total size	1.22 TiB (1,240 GiB video + 6.18 GiB IMU)
Video per clip	~100 MiB MP4, ~27 seconds
IMU per clip	~500 KiB JSONL, ~5,400 samples at 200Hz

Pilot worker statistics (from index.csv):

Statistic	Clips per Worker
Min	76
Q1 (25th percentile)	95
Median	129.5
Q3 (75th percentile)	164
Max	203
Mean ± Std	130.0 ± 36.3

Activity state distribution (per-worker metadata):

State	Mean Fraction	Std	Min	Max
Working	0.843	0.110	0.250	0.957
Taking break	0.123	0.110	0.014	0.721
Not on person	0.034	0.016	0.000	0.058
Worn (working + break)	0.966	0.016	0.942	1.000

3.2 Full Dataset (500K Target)

Metric	Estimated Value
Total workers	~1,000+
Total clips	500,000
Total duration	~3,750 hours (~156 days)
Video data	~46.9 TiB
IMU data	~238 GiB
5s windows	~2,500,000
IMU window tensor (float32)	~56 GiB

3.3 Notable Pilot Workers (Hard Cases)

Worker	Working %	Break %	Off-body %	Clips	Notes
worker_039	25.0%	72.1%	2.9%	135	Extreme outlier — mostly on break
worker_085	37.3%	57.5%	5.2%	149	Majority break time
worker_019	57.0%	41.2%	1.8%	110	Near-equal work/break split
worker_087	62.1%	32.5%	5.3%	164	High break + high off-body

At 500K scale, with roughly 10× as many workers as the pilot, we expect a wider range of behavior: more outliers, richer distribution tails, and harder generalization challenges.

3.4 Curation

First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
Audio stripped from all videos
IMU format: JSONL with t_us (microseconds), acc [x,y,z] (m/s²), gyro [x,y,z] (rad/s)
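
A minimal sketch of reading one clip's IMU file in the format described above. The field names `t_us`, `acc`, and `gyro` come from the format line; the function name `parse_imu_jsonl` and the two sample records are illustrative assumptions, not the project's actual loader or real data.

```python
import json

def parse_imu_jsonl(lines):
    """Parse JSONL IMU records into (timestamps_us, samples).

    Each sample is [acc_x, acc_y, acc_z, gyro_x, gyro_y, gyro_z].
    Blank lines are skipped.
    """
    timestamps_us, samples = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        timestamps_us.append(int(record["t_us"]))
        samples.append([float(v) for v in record["acc"]] +
                       [float(v) for v in record["gyro"]])
    return timestamps_us, samples

# Hypothetical two-sample excerpt, 5,000 microseconds apart (200 Hz).
example = [
    '{"t_us": 0, "acc": [0.1, 9.8, 0.0], "gyro": [0.01, 0.0, -0.02]}',
    '{"t_us": 5000, "acc": [0.2, 9.7, 0.1], "gyro": [0.0, 0.01, -0.01]}',
]
t_us, imu = parse_imu_jsonl(example)
```

In practice `lines` would be an open file handle, so a clip is parsed in a single streaming pass.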

4. Experimental Design

4.1 Two-Stage Approach: Pilot → Full Scale

Stage 1 (Weeks 1-4): Pilot on 13K clips

Validate all pipelines, debug models, establish baseline numbers
Run all experiments on 100-worker pilot for fast iteration (~hours per experiment)
Identify which methods are worth scaling

Stage 2 (Weeks 5-14): Scale to 500K clips

Run the winning methods at full scale
Execute the data scaling experiment (unique contribution at this scale)
Full cross-person benchmark at ~1,000-worker scale

4.2 Cross-Person Evaluation Protocol

Primary protocol: 5-Fold Leave-20%-Workers-Out (L20%WO)

Split workers into 5 non-overlapping folds (20% each). For each fold:

Train: 60% of workers
Validation: 20% of workers
Test: 20% of workers

	Pilot (100 workers)	Full (est. ~1,000 workers)
Train	60 workers (~7,800 clips)	~600 workers (~300K clips)
Val	20 workers (~2,600 clips)	~200 workers (~100K clips)
Test	20 workers (~2,600 clips)	~200 workers (~100K clips)

Report mean ± std across 5 folds.
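
The primary protocol could be sketched as follows. This is one possible realization, not a fixed specification: the function name `worker_folds`, the seed, and the choice of the next fold as the validation set are assumptions made for illustration.

```python
import random

def worker_folds(worker_ids, n_folds=5, seed=0):
    """Yield (train, val, test) worker lists; no worker appears in two of them."""
    ids = sorted(worker_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds  # assumption: the next fold serves as validation
        test = folds[k]
        val = folds[val_k]
        train = [w for j, fold in enumerate(folds)
                 if j not in (k, val_k) for w in fold]
        yield train, val, test

workers = [f"worker_{i:03d}" for i in range(1, 101)]  # pilot-sized example
splits = list(worker_folds(workers))
```

Because the split is made over worker IDs rather than clips, every clip from a given worker lands in exactly one of train, validation, or test within a fold.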

Secondary protocols:

Leave-1-Worker-Out (L1WO): On a representative subset of 100 workers drawn from the full ~1,000. For each worker in turn, train on the other 99 and test on the held-out one (100 runs). Measures per-worker difficulty.
Worker scaling curve: Train on {100, 200, 400, 600, 800} workers, always test on the same held-out 200. Answers: "how many workers do you need?"
Data scaling curve: Train best model at {13K, 50K, 100K, 250K, 500K} clips. Answers: "how much data do you need?"

4.3 Task Definition

3-class activity state classification:

Working (84.3% of clips in pilot)
Taking break (12.3%)
Not on person (3.4%)

Window protocol (following COMODO/Ego4D standard):

5-second non-overlapping windows
IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
Video: 10 FPS × 5s = 50 frames, resized to 224×224
Per 27s clip: 5 windows → 2.5M total windows at 500K scale
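
As a sketch of the window protocol above, the following cuts one clip's IMU array into non-overlapping 5-second windows. The function name `window_imu` and the decision to drop the trailing partial window are assumptions; the rates and shapes follow the numbers in this section.

```python
import numpy as np

IMU_RATE_HZ = 200
WINDOW_SECONDS = 5
WINDOW_LEN = IMU_RATE_HZ * WINDOW_SECONDS  # 1,000 timesteps

def window_imu(samples):
    """Cut a [T, 6] IMU array into non-overlapping [n, 1000, 6] windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    samples = np.asarray(samples, dtype=np.float32)
    n_windows = samples.shape[0] // WINDOW_LEN
    usable = samples[: n_windows * WINDOW_LEN]
    return usable.reshape(n_windows, WINDOW_LEN, samples.shape[1])

clip = np.zeros((5400, 6), dtype=np.float32)  # one ~27 s clip at 200 Hz
windows = window_imu(clip)  # shape (5, 1000, 6); the last 400 samples are dropped
```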

4.4 Metrics

Metric	Why
Macro F1	Primary metric. Accounts for class imbalance (3.4% off-body would be invisible in accuracy)
Accuracy @1	Standard comparison with COMODO/EVI-MAE baselines
Per-worker F1 variance	σ across workers — directly measures generalization stability
Generalization gap	Δ between within-person and cross-person F1
Per-class F1	Breakdown across working / break / not-on-person
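
To illustrate why Macro F1 is the primary metric, here is a small dependency-free sketch. The label strings, the helper names, and the 84/12/4 toy label mix are assumptions chosen to resemble the pilot class balance; they are not results.

```python
from collections import defaultdict

CLASSES = ("working", "break", "not_on_person")

def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of per-class F1. A class with no true positives,
    including one absent from both y_true and y_pred, scores 0 here."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def per_worker_f1(worker_ids, y_true, y_pred):
    """Macro F1 computed separately on each worker's windows."""
    grouped = defaultdict(lambda: ([], []))
    for w, t, p in zip(worker_ids, y_true, y_pred):
        grouped[w][0].append(t)
        grouped[w][1].append(p)
    return {w: macro_f1(t, p) for w, (t, p) in grouped.items()}

# A majority-class predictor on a pilot-like label mix (84 / 12 / 4).
y_true = ["working"] * 84 + ["break"] * 12 + ["not_on_person"] * 4
y_pred = ["working"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.84
score = macro_f1(y_true, y_pred)                                      # about 0.304
```

The always-"working" predictor looks acceptable on accuracy but scores roughly 0.30 on Macro F1. The per-worker variance and the generalization gap can then be computed from the values returned by `per_worker_f1`. A real evaluation would need to decide how to treat classes that a given worker never exhibits.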

4.5 Baseline Comparisons

Experiment A: IMU-Only Models (5 baselines)

Model	Type	Params	Reference
DeepConvLSTM	Supervised CNN+LSTM	~2M	Ordóñez & Roggen, 2016
Attend & Discriminate	Supervised attention	~3M	Abedin et al., 2021
CrossHAR	Cross-dataset transfer	~5M	Hong et al., 2024
MOMENT-small	Pre-trained time-series foundation	~40M	Goswami et al., 2024
Mantis	Pre-trained time-series	~15M	Liu et al., 2024

All evaluated in both within-person (random split) and cross-person (L20%WO) settings. The gap between the two is the headline result.

Experiment B: Video-Only Models (3 baselines)

Model	Type	Params	Reference
TimeSformer-Base	Transformer (K400 pretrained)	121M	Bertasius et al., 2021
VideoMAE ViT-B	MAE pre-trained Transformer	86M	Tong et al., 2022
SlowFast R50	Dual-pathway CNN	34M	Feichtenhofer et al., 2019

Experiment C: Cross-Modal Distillation

Method	Setup	Reference
COMODO (MOMENT student)	Frozen TimeSformer → MOMENT IMU encoder	Chen et al., 2025
COMODO (Mantis student)	Frozen TimeSformer → Mantis IMU encoder	Chen et al., 2025
IMU2CLIP	Joint video-IMU-text embedding	Moon et al., 2023

Key question: Does the distilled IMU model generalize better to unseen workers than the supervised IMU model?

Experiment D: Self-Supervised Pre-training

Method	Setup	Reference
EVI-MAE	Joint video+IMU MAE pre-training → fine-tune	Zhang et al., 2024
IMU-only MAE	IMU spectrogram MAE → fine-tune	Adapted from EVI-MAE

Pre-train on all workers (unlabeled, self-supervised), then fine-tune on train workers with labels, evaluate on held-out test workers. Compare with same architecture trained from scratch.

Experiment E: Hard Case Analysis

For a representative subset of workers, compute:

IMU statistics: RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
Temporal statistics: autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
Model performance: per-worker F1 from L1WO

Then:

Correlate IMU statistics with model difficulty (per-worker F1)
Cluster workers by IMU feature similarity → do clusters predict generalization difficulty?
Predict difficulty: train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
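
A sketch of how a few of the IMU statistics listed above might be computed for a single window. The function name, the use of acceleration magnitude for the spectral features, and the finite-difference jerk estimate are assumptions; signal entropy and the temporal statistics are omitted for brevity. The test signal is synthetic.

```python
import numpy as np

def imu_window_stats(window, rate_hz=200):
    """Summary statistics for one [T, 6] window (acc_xyz then gyro_xyz)."""
    window = np.asarray(window, dtype=np.float64)
    acc, gyro = window[:, :3], window[:, 3:]
    acc_mag = np.linalg.norm(acc, axis=1)

    # Magnitude spectrum of the mean-removed acceleration magnitude.
    spectrum = np.abs(np.fft.rfft(acc_mag - acc_mag.mean()))
    freqs = np.fft.rfftfreq(acc_mag.size, d=1.0 / rate_hz)
    total = spectrum.sum()

    jerk = np.diff(acc, axis=0) * rate_hz  # finite-difference derivative, m/s^3
    return {
        "rms_acc": float(np.sqrt(np.mean(acc_mag ** 2))),
        "gyro_var": float(gyro.var(axis=0).sum()),
        "spectral_centroid_hz": float((freqs * spectrum).sum() / total) if total > 0 else 0.0,
        "dominant_freq_hz": float(freqs[np.argmax(spectrum)]),
        "jerk_mag": float(np.linalg.norm(jerk, axis=1).mean()),
    }

# Synthetic window: gravity on z plus a 2 Hz oscillation, no rotation.
t = np.arange(1000) / 200.0
window = np.zeros((1000, 6))
window[:, 2] = 9.81 + np.sin(2 * np.pi * 2.0 * t)
stats = imu_window_stats(window)  # dominant_freq_hz is 2.0 for this signal
```

Per-worker features could then be formed by averaging these values over all of a worker's windows before correlating them with per-worker F1.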

Experiment F: Data Scaling Laws (New at 500K)

Train the best-performing model from Experiments A-D at increasing data scales:

Scale	Clips	Est. Workers	Est. Hours
13K	12,997	100	97.5h
50K	50,000	~385	375h
100K	100,000	~770	750h
250K	250,000	~1,000	1,875h
500K	500,000	~1,000	3,750h

Measure: cross-person Macro F1 vs. training data size. Is there a scaling law? Does it plateau?
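
One simple way to test for a scaling law is a log-log least-squares fit. The sketch below assumes the form err(N) = a · N^(-b) with err = 1 - Macro F1 and no irreducible-error term; both the functional form and the function name are assumptions. The error values are generated from a made-up curve purely to check that the fit recovers its parameters; they are not measurements.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(err) = log(a) - b * log(N); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    a = math.exp(y_bar - slope * x_bar)
    return a, -slope

clip_counts = [12_997, 50_000, 100_000, 250_000, 500_000]
synthetic_err = [2.0 * n ** -0.25 for n in clip_counts]  # made-up curve, not a result
a, b = fit_power_law(clip_counts, synthetic_err)
```

A plateau would show up as systematic positive residuals at the largest scales, in which case a form with an additive floor would fit better than a pure power law.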

5. Technical Approach

5.1 IMU Preprocessing Pipeline

Raw JSONL (200Hz, acc+gyro)
    → Parse, validate timestamps, interpolate gaps
    → 5-second non-overlapping windows (1000 × 6)
    → Two representations:
        (a) Raw time series [1000, 6] for Mantis/MOMENT/DeepConvLSTM
        (b) STFT spectrogram [160, 128, 6] for EVI-MAE/IMU-MAE
    → Per-channel z-normalization (computed on training set only)
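
The final normalization step, with statistics fitted on training workers only, might look like the sketch below. The function names and the `eps` guard are assumptions, and the arrays are random placeholders rather than real IMU data.

```python
import numpy as np

def fit_channel_stats(train_windows):
    """Per-channel mean and std over all training windows and timesteps."""
    flat = train_windows.reshape(-1, train_windows.shape[-1])
    return flat.mean(axis=0), flat.std(axis=0)

def z_normalize(windows, mu, sigma, eps=1e-8):
    """Apply training-set statistics to any split (train, val, or test)."""
    return (windows - mu) / (sigma + eps)

rng = np.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=(8, 1000, 6)).astype(np.float32)
held_out = rng.normal(loc=3.0, scale=2.0, size=(2, 1000, 6)).astype(np.float32)

mu, sigma = fit_channel_stats(train)          # fitted on training workers only
train_z = z_normalize(train, mu, sigma)
held_out_z = z_normalize(held_out, mu, sigma)  # reuses training statistics
```

Reusing `mu` and `sigma` for held-out workers keeps test-worker statistics out of the model, which matters for a cross-person protocol.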

At 500K scale: 2.5M windows × 6,000 floats = ~56 GiB processed tensor. Fits comfortably in RAM on a cpu-upgrade instance. Preprocessing takes ~80h CPU but is embarrassingly parallel.

Already built: I have the 200Hz windowing and video sync pipeline from earlier work on the pilot dataset.

5.2 Video Preprocessing Pipeline

MP4 (27s clips, ~100 MiB each)
    → Decode at 10 FPS → 270 frames per clip
    → 5-second windows → 50 frames per window
    → Resize to 224×224, normalize (ImageNet stats)
    → Pre-compute video encoder embeddings (store to disk)

At 500K scale: Pre-computing video embeddings is critical. We do NOT download all 47 TiB of video. Instead:

Stream video clips on-demand during embedding extraction
TimeSformer-B produces 768-dim vectors → 2.5M windows × 768 × 4 bytes ≈ 7.2 GiB per model. Easily stored.
Embedding extraction: ~200h on A10G per encoder (parallelizable across 4 GPUs → ~50h wall-clock)
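
The storage figures for processed IMU windows and video embeddings follow from simple arithmetic, sketched below. It assumes float32 storage and 768-dimensional outputs for all three video encoders, as stated in the storage table; the variable names are illustrative.

```python
GIB = 2 ** 30
FLOAT32_BYTES = 4
N_WINDOWS = 500_000 * 5  # 5 windows per ~27 s clip

# Processed IMU tensor: 1,000 timesteps x 6 channels per window.
imu_tensor_gib = N_WINDOWS * 1000 * 6 * FLOAT32_BYTES / GIB

# One 768-dim embedding per window, per video encoder.
embedding_gib_per_encoder = N_WINDOWS * 768 * FLOAT32_BYTES / GIB
embedding_gib_three_encoders = 3 * embedding_gib_per_encoder

print(f"IMU windows:           {imu_tensor_gib:.1f} GiB")                # 55.9 GiB
print(f"Embeddings (1 model):  {embedding_gib_per_encoder:.2f} GiB")     # 7.15 GiB
print(f"Embeddings (3 models): {embedding_gib_three_encoders:.1f} GiB")  # 21.5 GiB
```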

5.3 COMODO Distillation Protocol

Following Chen et al. (2025) exactly:

Video teacher: TimeSformer-Base, frozen, K400-pretrained. MLP projector (hidden=2048, out=128)
IMU student: MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
Loss: Cross-entropy on similarity distribution with dynamic FIFO instance queue
Hyperparameters: 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
Hardware: A10G GPU (24 GB VRAM) — scales linearly with data, ~12h per student per fold at 500K
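
The loss line above can be read as follows: teacher and student each produce a similarity distribution over a queue of stored teacher embeddings, and the student is trained to match the teacher's distribution. The NumPy sketch below is a forward-only illustration of that reading, not COMODO's reference implementation. The function names, the queue size, and the choice to store video embeddings in the queue are assumptions.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def distillation_loss(video_emb, imu_emb, queue, tau_v=0.1, tau_x=0.05):
    """Cross-entropy between teacher and student similarity distributions."""
    v, x, q = l2_normalize(video_emb), l2_normalize(imu_emb), l2_normalize(queue)
    teacher = softmax(v @ q.T / tau_v)  # target from the frozen video teacher
    log_student = np.log(softmax(x @ q.T / tau_x) + 1e-12)
    return float(-(teacher * log_student).sum(axis=1).mean())

def update_queue(queue, video_emb):
    """FIFO update: append the newest teacher embeddings, drop the oldest."""
    return np.concatenate([queue, video_emb], axis=0)[video_emb.shape[0]:]

rng = np.random.default_rng(0)
queue = rng.normal(size=(64, 128))  # hypothetical queue of 64 projected embeddings
video = rng.normal(size=(8, 128))   # projected teacher outputs (out=128)
imu = rng.normal(size=(8, 128))     # projected student outputs (out=128)

loss = distillation_loss(video, imu, queue)
queue = update_queue(queue, video)
```

In training, only the IMU student and its projector would receive gradients from this loss. The sketch omits autograd entirely and shows only how the quantities relate.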

5.4 EVI-MAE Protocol

Following Zhang et al. (2024):

Video encoder: ViT-Base (patch 16×16), 90% masking ratio
IMU encoder: Transformer on STFT spectrogram patches (16×16), 75% masking ratio
Fusion: Single-layer Transformer
Pre-training loss: MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
Pre-train: 100 epochs on all workers (fewer epochs should suffice than at pilot scale, since each epoch sees ~38× more data)
Fine-tune: 50 epochs on train split with labels
Hardware: A100x4 (4× 80GB) for pre-training (~6 days), A10G for fine-tuning

6. Timeline (16 Weeks)

Stage 1: Pilot Validation (Weeks 1-4)

Week	Phase	Deliverable	Compute
1	Data preparation	IMU windowing, video embedding extraction, EDA notebook on pilot	CPU: 12h, A10G: 10h
2	IMU baselines (pilot)	5 IMU models × 5 folds on 13K clips. Validate pipeline, establish baselines	A10G: 20h
3	Video + COMODO (pilot)	Video baselines + COMODO distillation on 13K clips	A10G: 30h
4	EVI-MAE (pilot) + triage	Self-supervised pre-training on pilot. Select methods to scale	A10G: 40h

Stage 2: Full Scale (Weeks 5-14)

Week	Phase	Deliverable	Compute
5-6	500K data ingestion	Download 238 GiB IMU. Stream-extract video embeddings for 500K clips (3 models)	CPU: 88h, A10G: 600h
7-8	IMU baselines (500K)	5 IMU models × 5 folds at full scale	A10G: 250h
9-10	Video + COMODO (500K)	Video baselines + COMODO distillation at full scale	A10G: 240h
11-12	EVI-MAE (500K)	Self-supervised pre-training on all workers (~6 days on A100x4) + fine-tuning	A100x4: 150h, A10G: 40h
13	Hard case analysis	Per-worker difficulty on 100-worker subset, IMU feature correlation	A10G: 100h, CPU: 40h
14	Scaling experiments	Data scaling curve (13K→500K), worker scaling curve (100→800)	A10G: 375h

Stage 3: Synthesis (Weeks 15-16)

Week	Phase	Deliverable	Compute
15	Analysis	Compile results, statistical significance tests, generate figures	CPU: 20h
16	Write-up	Paper draft (8-page main + appendix)	—

Total wall-clock: 16 weeks
Active compute window: Weeks 1-14 (14 weeks of experiments)

7. Compute Budget

7.1 Summary

Resource	Hours	Unit Cost	Total
CPU (preprocessing, analysis)	160h	$0.60/h	$96
A10G GPU (24 GB)	1,705h	$2.00/h	$3,410
A100x4 GPU (4× 80 GB)	150h	$16.00/h	$2,400
Subtotal	2,015h		$5,906
30% contingency			+$1,772
Total			~$7,678
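
The summary totals can be reproduced from the hourly rates and hour counts in the table. The dictionary keys below are illustrative names; the numbers are taken directly from the rows above.

```python
RATES_USD_PER_HOUR = {"cpu": 0.60, "a10g": 2.00, "a100x4": 16.00}
HOURS = {"cpu": 160, "a10g": 1705, "a100x4": 150}
CONTINGENCY = 0.30

total_hours = sum(HOURS.values())
subtotal = sum(HOURS[r] * RATES_USD_PER_HOUR[r] for r in HOURS)
contingency = CONTINGENCY * subtotal
total = subtotal + contingency

print(f"{total_hours}h, subtotal ${subtotal:,.0f}, "
      f"contingency ${contingency:,.0f}, total ${total:,.0f}")
# 2015h, subtotal $5,906, contingency $1,772, total $7,678
```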

7.2 Phase Breakdown

Phase	Resource	Hours	Cost	Notes
1. Pilot validation	A10G + CPU	100h + 12h	$207	Weeks 1-4, fast iteration
2. Data ingestion + embedding extraction	CPU + A10G	88h + 600h	$1,253	Parallelizable across 4 GPUs
3. IMU baselines (500K)	A10G	250h	$500	5 models × 5 folds
4. Video + COMODO (500K)	A10G	240h	$480	3 video models + 2 distillation students
5. EVI-MAE pre-training	A100x4	150h	$2,400	~6 days continuous
5b. EVI-MAE fine-tuning	A10G	40h	$80	5 folds
6. Hard case analysis	A10G + CPU	100h + 40h	$224	L1WO on 100-worker subset
7. Scaling experiments	A10G	375h	$750	5 scale points × 5 folds
8. Synthesis	CPU	20h	$12	Figures, stats

7.3 Storage Requirements

Data	Size	Notes
Raw IMU data (all 500K clips)	~238 GiB	Download once, keep in project storage
Processed IMU windows (float32)	~56 GiB	2.5M windows × [1000, 6]
Video embeddings (3 models)	~22 GiB	2.5M windows × 768-dim × 3 encoders
Selective raw video (10% for EVI-MAE)	~4.8 TiB	Stream during pre-training; cache hot subset
STFT spectrograms (for EVI-MAE)	~200 GiB	2.5M × [160, 128, 6]
Model checkpoints & logs	~100 GiB	~50 experiments × ~2 GiB each
Total active storage	~5.4 TiB

Key insight: We do NOT need all 47 TiB of video. For 80% of experiments (IMU baselines, COMODO, hard case analysis), only the 238 GiB of IMU data + 22 GiB of pre-computed video embeddings are needed. Only EVI-MAE pre-training requires raw video, and we stream it rather than pre-downloading.

7.4 Comparison: Pilot vs. Full Scale

	Pilot (13K)	Full (500K)	Scale Factor
Compute hours	273h	2,015h	7.4×
Cost	$654	$7,678	11.7×
Timeline	10 weeks	16 weeks	1.6×
Storage	284 GiB	5.4 TiB	19×

The cost scales sub-linearly with data (11.7× cost for 38.5× more data) because: (a) embedding extraction amortizes across all experiments, (b) fewer epochs needed at larger data scale, (c) pilot experiments avoid wasting compute on methods that don't work.

8. Expected Contributions

8.1 Empirical Contributions

First cross-person generalization study on ego+IMU at 1,000-worker, 500K-clip scale. Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown), Ego4D (mixed, not controlled). With ~1,000 workers, this is roughly 55-83× the subject count of WEAR and CMU-MMAC.
Modality robustness comparison under person shift. No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
Data scaling laws for cross-person generalization. First systematic study of how cross-person accuracy scales with training data (13K → 500K). This has direct practical value: how much data do you need to collect?
Worker difficulty prediction from IMU statistics. If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented.
Cross-modal distillation under distribution shift. COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?

8.2 Benchmark Contribution

Standardized splits (5-fold L20%WO) and evaluation protocol for 1,000-worker ego+IMU
Baseline results for 11+ models across 3 modality configurations at 500K scale
Per-worker difficulty scores for hard case analysis
Data scaling curves with fitted power laws
Open-source evaluation code and pre-computed features

8.3 Practical Contribution

Guidelines for "how many workers to collect" (worker scaling curve)
Guidelines for "how many clips to collect" (data scaling curve)
Method for flagging hard-to-generalize deployment scenarios
Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video

9. Relation to Existing Work

Paper	What They Did	What We Add
COMODO (Chen et al., 2025)	Video→IMU distillation on Ego4D/EgoExo4D/MMEA	Test distillation under cross-person shift; ~1,000 workers, 500K clips
EVI-MAE (Zhang et al., 2024)	Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.)	Scale to ~1,000 workers; evaluate cross-person transfer of MAE representations
Generalizable HAR Survey (Cai et al., 2025)	Taxonomy of cross-person/device/position HAR	Provide the large-scale empirical study their survey explicitly calls for
IMU-Video OOD HAR (Cheshmi et al., 2025)	Cross-modal SSL for OOD transfer	Study OOD as cross-person (same task, different people) rather than cross-domain
ColloSSL (Jain et al., 2022)	Multi-device contrastive SSL for HAR	Apply cross-device ideas to cross-person setting with paired video+IMU
UniMTS (Zhang et al., 2024)	Text-aligned IMU pre-training for zero-shot HAR	Potential extension: text-conditioned cross-person generalization
MIAM (Mehta et al., 2025)	Industrial assembly monitoring (RGB+depth+IMU, 8 volunteers)	Same domain but ~1,000 workers vs. 8; formal cross-person protocol

10. Risk Mitigation

Risk	Likelihood	Mitigation
500K dataset access delayed	Medium	Start with 13K pilot (Weeks 1-4). All pipelines and pilot results are independently valuable
Video streaming bottleneck at 47 TiB	Medium	Pre-compute embeddings (only 22 GiB). Only EVI-MAE needs raw video; use selective 10% subset
3-class task too easy (>95% accuracy)	Medium — "working" is 84% majority class	Use macro-F1 (penalizes majority bias); explore finer-grained temporal segmentation at scale
EVI-MAE training unstable on A100x4	Low	Validated on pilot first (Week 4). Follow published hyperparameters exactly; reduce lr if needed
Compute budget overrun	Medium	30% contingency ($1,772). Pilot phase identifies which methods to scale — skip underperformers
Insufficient cross-person variance	Very Low (already observed in pilot)	~1,000 workers should provide far more diversity than 100

11. About the Researcher

Shubham Rasal — shubhamrasal.com

I've already worked with this dataset in its earlier form (the compression challenge). Specifically:

Built the IMU (200Hz) windowing and video synchronization pipeline
Explored frequency-domain and temporal features from the IMU signals
Observed the cross-worker motion pattern differences that motivate this proposal
Profiled the full dataset structure (100 workers, 12,997 clips, 1.22 TiB)

This prior work means I can move directly to model training on the pilot without pipeline development overhead, and scale to 500K with confidence in the data pipeline.

12. Key References

Chen, B. et al. "COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric HAR." arXiv:2503.07259 (2025).
Zhang, M. et al. "Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition." arXiv:2407.06628 (2024).
Cai, Y. et al. "Towards Generalizable Human Activity Recognition: A Survey." arXiv:2508.12213 (2025).
Cheshmi, S. et al. "Improving OOD HAR via IMU-Video Cross-modal Representation Learning." arXiv:2507.13482 (2025).
Moon, S. et al. "IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors." arXiv:2210.14395 (2023).
Zhang, X. et al. "UniMTS: Unified Pre-training for Motion Time Series." arXiv:2410.19818 (2024).
Mehta, N. et al. "A Multimodal Dataset for Enhancing Industrial Task Monitoring." arXiv:2501.05936 (2025).
Jain, Y. et al. "ColloSSL: Collaborative Self-Supervised Learning for HAR." arXiv:2202.00758 (2022).
Grauman, K. et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." arXiv:2110.07058 (2022).
Grauman, K. et al. "Ego-Exo4D: Understanding Skilled Human Activity." arXiv:2311.18259 (2023).

End of proposal.