	# Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study at 500K Scale

	Author: Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
	Date: April 2026
	Status: Proposal Draft

	---

	## 1. Motivation & Problem Statement

	Egocentric video and body-worn IMU sensors are the sensing backbone of AR glasses, smartwatches, and workplace wearables. Activity recognition from these modalities has made rapid progress — EVI-MAE reaches 87.96 mAP on CMU-MMAC, COMODO achieves 59.1% top-1 on Ego4D with IMU alone, and MoBind achieves state-of-the-art cross-modal retrieval. But all of these results are reported on within-distribution splits where training and test data come from the same population of users.

	The real-world failure mode is cross-person generalization. When you train on workers A–H and deploy to worker J, performance degrades — often catastrophically. The recent survey on generalizable HAR (Cai et al., 2025) formalizes this as:

	> U_train ∩ U_test = ∅, p(X, y \| U_train) ≠ p(X, y \| U_test)

	This is the setting that matters for any commercial wearable deployment, yet it remains systematically understudied because existing ego+IMU datasets have too few subjects (CMU-MMAC: 12, WEAR: 18, MMEA: unknown) to support meaningful cross-person evaluation at scale.

	### The Opportunity

	We will use 500K clips from the Egocentric-1M dataset — drawing from a pool of ~1,000 workers performing industrial/workplace tasks. This is orders of magnitude larger than any existing ego+IMU benchmark and enables studies that are statistically impossible on current datasets.

	Additionally, we have early access to a 13K-clip, 100-worker pilot subset (the Egocentric Native Compression dataset) for rapid prototyping and pipeline validation before scaling to 500K.

	\| Property \| Pilot (Available Now) \| Full Scale (500K) \|
	\|----------\|----------------------\|-------------------\|
	\| Workers \| 100 \| ~1,000+ (estimated) \|
	\| Clips \| 12,997 \| 500,000 \|
	\| Duration \| ~97.5 hours \| ~3,750 hours \|
	\| Total size \| 1.22 TiB \| ~47 TiB \|
	\| IMU data \| 6.18 GiB \| ~238 GiB \|
	\| 5s windows \| ~65K \| ~2.5M \|
	\| IMU rate \| 200 Hz (acc + gyro) \| 200 Hz (acc + gyro) \|
	\| Activity labels \| working / break / not-on-person \| working / break / not-on-person \|
	\| Domain \| Industrial / workplace \| Industrial / workplace \|

	From preliminary exploration of the pilot dataset, I've observed consistent differences in motion patterns across workers performing similar tasks — exactly the distribution shift that drives generalization failures. With ~1,000 workers and 500K clips, we can study this at a scale that no prior work has approached.

	---

	## 2. Research Questions

	1. How much does cross-person distribution shift degrade ego+IMU activity recognition?
	Quantify the gap between within-person and cross-person accuracy across IMU-only, video-only, and multimodal setups.

	2. Which modality is more robust to person shift — video or IMU?
	IMU captures body dynamics (person-specific); video captures visual scene (potentially person-invariant). Which degrades less under cross-person shift?

	3. Does video→IMU distillation improve cross-person robustness?
	COMODO distills video knowledge into IMU encoders for same-distribution settings. Does the distilled representation also transfer better to unseen workers?

	4. Can we predict which workers will be "hard cases" before deployment?
	Use IMU statistics (spectral features, signal energy, temporal patterns) to predict which unseen workers will cause generalization failure — enabling targeted data collection or adaptation.

	5. Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?
	EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?

	6. What are the scaling laws for cross-person generalization? (New at 500K scale)
	How does downstream cross-person accuracy change as we increase training data from 13K → 50K → 100K → 250K → 500K clips? Is there a data scaling law for cross-person robustness?

	---

	## 3. Dataset Overview

	### 3.1 Pilot Dataset (Available Now)

	Source: `gs://build-ai-egocentric-native-compression/`

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Total workers \| 100 (selected from 150, 103 complete) \|
	\| Total clips \| 12,997 \|
	\| Total duration \| ~97.5 hours \|
	\| Total size \| 1.22 TiB (1,240 GiB video + 6.18 GiB IMU) \|
	\| Video per clip \| ~100 MiB MP4, ~27 seconds \|
	\| IMU per clip \| ~500 KiB JSONL, ~5,400 samples at 200Hz \|

	Pilot worker statistics (from `index.csv`):

	\| Statistic \| Clips per Worker \|
	\|-----------\|-----------------\|
	\| Min \| 76 \|
	\| Q1 (25th percentile) \| 95 \|
	\| Median \| 129.5 \|
	\| Q3 (75th percentile) \| 164 \|
	\| Max \| 203 \|
	\| Mean ± Std \| 130.0 ± 36.3 \|

	Activity state distribution (per-worker metadata):

	\| State \| Mean Fraction \| Std \| Min \| Max \|
	\|-------\|---------------\|-----\|-----\|-----\|
	\| Working \| 0.843 \| 0.110 \| 0.250 \| 0.957 \|
	\| Taking break \| 0.123 \| 0.110 \| 0.014 \| 0.721 \|
	\| Not on person \| 0.034 \| 0.016 \| 0.000 \| 0.058 \|
	\| Worn (working + break) \| 0.966 \| 0.016 \| 0.942 \| 1.000 \|

	### 3.2 Full Dataset (500K Target)

	\| Metric \| Estimated Value \|
	\|--------\|-----------------\|
	\| Total workers \| ~1,000+ \|
	\| Total clips \| 500,000 \|
	\| Total duration \| ~3,750 hours (~156 days) \|
	\| Video data \| ~46.9 TiB \|
	\| IMU data \| ~238 GiB \|
	\| 5s windows \| ~2,500,000 \|
	\| IMU window tensor (float32) \| ~56 GiB \|
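The full-scale estimates follow from linear extrapolation of the pilot figures. A minimal sketch of that arithmetic, assuming every clip is 27 seconds long and yields five full 5-second windows (the function and constant names are illustrative, not part of any existing pipeline):

```python
# Pilot figures come from the tables above; names here are illustrative only.
PILOT_CLIPS = 12_997
PILOT_TOTAL_TIB = 1.22
PILOT_IMU_GIB = 6.18
CLIP_SECONDS = 27
WINDOW_SECONDS = 5
IMU_HZ = 200
IMU_CHANNELS = 6


def extrapolate(target_clips: int) -> dict:
    """Linearly scale pilot statistics to a target clip count."""
    scale = target_clips / PILOT_CLIPS
    # Assumption: only full windows are kept, so a 27 s clip gives 5 windows.
    windows = target_clips * (CLIP_SECONDS // WINDOW_SECONDS)
    tensor_bytes = windows * WINDOW_SECONDS * IMU_HZ * IMU_CHANNELS * 4  # float32
    return {
        "scale": scale,
        "hours": target_clips * CLIP_SECONDS / 3600,
        "total_tib": PILOT_TOTAL_TIB * scale,
        "imu_gib": PILOT_IMU_GIB * scale,
        "windows": windows,
        "imu_tensor_gib": tensor_bytes / 2**30,
    }


est = extrapolate(500_000)
for key, value in est.items():
    print(f"{key}: {value:,.1f}")
```

These are estimates only; actual clip lengths and per-worker clip counts in the full dataset may differ from the pilot.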

	### 3.3 Notable Pilot Workers (Hard Cases)

	\| Worker \| Working % \| Break % \| Off-body % \| Clips \| Notes \|
	\|--------\|-----------\|---------\|------------\|-------\|-------\|
	\| worker_039 \| 25.0% \| 72.1% \| 2.9% \| 135 \| Extreme outlier — mostly on break \|
	\| worker_085 \| 37.3% \| 57.5% \| 5.2% \| 149 \| Majority break time \|
	\| worker_019 \| 57.0% \| 41.2% \| 1.8% \| 110 \| Near-equal work/break split \|
	\| worker_087 \| 62.1% \| 32.5% \| 5.3% \| 164 \| High break + high off-body \|

	At 500K scale the worker pool is roughly 10× larger, so we expect a wider range of behavior: more outliers, richer distribution tails, and harder generalization challenges.

	### 3.4 Curation

	- First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
	- Audio stripped from all videos
	- IMU format: JSONL with `t_us` (microseconds), `acc` [x,y,z] (m/s²), `gyro` [x,y,z] (rad/s)
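As an illustration of the IMU format, the sketch below parses records into timestamps and six-channel samples. It assumes one JSON object per line with exactly the keys `t_us`, `acc`, and `gyro`; the sample values are made up for the example and are not taken from the dataset.

```python
import json


def parse_imu_jsonl(lines):
    """Parse IMU records into (timestamps_us, samples).

    Each sample is ordered [acc_x, acc_y, acc_z, gyro_x, gyro_y, gyro_z].
    """
    timestamps, samples = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        acc, gyro = record["acc"], record["gyro"]
        if len(acc) != 3 or len(gyro) != 3:
            raise ValueError(f"expected 3-axis acc and gyro, got {record}")
        timestamps.append(int(record["t_us"]))
        samples.append([float(v) for v in acc] + [float(v) for v in gyro])
    return timestamps, samples


# Made-up records at a 5,000 µs spacing (200 Hz).
example = [
    '{"t_us": 0, "acc": [0.1, 9.8, 0.0], "gyro": [0.0, 0.01, 0.0]}',
    '{"t_us": 5000, "acc": [0.2, 9.7, 0.1], "gyro": [0.0, 0.02, 0.0]}',
]
t_us, imu = parse_imu_jsonl(example)
print(t_us, imu[0])
```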

	---

	## 4. Experimental Design

	### 4.1 Two-Stage Approach: Pilot → Full Scale

	Stage 1 (Weeks 1-4): Pilot on 13K clips
	- Validate all pipelines, debug models, establish baseline numbers
	- Run all experiments on 100-worker pilot for fast iteration (~hours per experiment)
	- Identify which methods are worth scaling

	Stage 2 (Weeks 5-14): Scale to 500K clips
	- Run the winning methods at full scale
	- Execute the data scaling experiment (unique contribution at this scale)
	- Full cross-person benchmark at ~1,000-worker scale

	### 4.2 Cross-Person Evaluation Protocol

	Primary protocol: 5-Fold Leave-20%-Workers-Out (L20%WO)

	Split workers into 5 non-overlapping folds (20% each). For each fold:
	- Train: 60% of workers
	- Validation: 20% of workers
	- Test: 20% of workers

	\| \| Pilot (100 workers) \| Full (est. ~1,000 workers) \|
	\|---\|---\|---\|
	\| Train \| 60 workers (~7,800 clips) \| ~600 workers (~300K clips) \|
	\| Val \| 20 workers (~2,600 clips) \| ~200 workers (~100K clips) \|
	\| Test \| 20 workers (~2,600 clips) \| ~200 workers (~100K clips) \|

	Report mean ± std across 5 folds.
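One way to build these splits is to shuffle worker IDs once and rotate the roles of the five folds. This is a sketch under assumptions: the choice of the "next" fold as validation, the seed, and the helper names are illustrative and not prescribed by the protocol.

```python
import random


def make_worker_folds(worker_ids, n_folds=5, seed=0):
    """Partition worker IDs into n_folds disjoint folds of near-equal size."""
    ids = sorted(worker_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def split_for_fold(folds, k):
    """Fold k is the test set, the next fold is validation, the rest is train."""
    n = len(folds)
    val_idx = (k + 1) % n
    train = [w for i, fold in enumerate(folds) if i not in (k, val_idx) for w in fold]
    return train, folds[val_idx], folds[k]


workers = [f"worker_{i:03d}" for i in range(1, 101)]  # pilot-sized example
folds = make_worker_folds(workers)
train, val, test = split_for_fold(folds, 0)
print(len(train), len(val), len(test))
```

Splitting by worker ID rather than by clip is what keeps the train and test populations disjoint, as the cross-person setting requires.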

	Secondary protocols:
	- Leave-1-Worker-Out (L1WO): On a representative subset of 100 workers drawn from the full ~1,000, train on 99 and test on the remaining one, repeated once per worker (100 runs). Measures per-worker difficulty.
	- Worker scaling curve: Train on {100, 200, 400, 600, 800} workers, always test on the same held-out 200. Answers: "how many workers do you need?"
	- Data scaling curve: Train best model at {13K, 50K, 100K, 250K, 500K} clips. Answers: "how much data do you need?"
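For the worker scaling curve, one reasonable design is to make the training subsets nested, so that each larger set contains every smaller one and differences between points are not driven by resampling. The proposal does not specify this; the sketch below is an assumption, and `nested_worker_subsets` is a hypothetical helper.

```python
import random


def nested_worker_subsets(train_pool, sizes, seed=0):
    """Return {size: worker list} where each subset is a prefix of one fixed shuffle."""
    order = sorted(train_pool)
    random.Random(seed).shuffle(order)
    if max(sizes) > len(order):
        raise ValueError("requested more workers than the training pool contains")
    return {n: order[:n] for n in sorted(sizes)}


# Illustrative pool: the ~800 workers left after holding out 200 for testing.
pool = [f"worker_{i:04d}" for i in range(800)]
subsets = nested_worker_subsets(pool, [100, 200, 400, 600, 800])
print({n: len(ws) for n, ws in subsets.items()})
```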

	### 4.3 Task Definition

	3-class activity state classification:
	- Working (84.3% mean per-worker fraction in the pilot)
	- Taking break (12.3%)
	- Not on person (3.4%)

	Window protocol (following COMODO/Ego4D standard):
	- 5-second non-overlapping windows
	- IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
	- Video: 10 FPS × 5s = 50 frames, resized to 224×224
	- Per 27s clip: ~5 windows → ~2.5M total windows at 500K scale
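The IMU side of the window protocol can be sketched with a single reshape. The sketch assumes the trailing partial window (about 2 seconds of a 27-second clip) is dropped, which is consistent with the "~5 windows" figure but is not stated explicitly above.

```python
import numpy as np


def window_imu(samples, rate_hz=200, window_s=5):
    """Split a [T, C] IMU array into non-overlapping [n_windows, rate*window, C] windows."""
    samples = np.asarray(samples, dtype=np.float32)
    win = rate_hz * window_s
    n_windows = samples.shape[0] // win  # trailing partial window is discarded
    return samples[: n_windows * win].reshape(n_windows, win, samples.shape[1])


# A synthetic 27 s clip: 5,400 samples × 6 channels.
clip = np.arange(5400 * 6, dtype=np.float32).reshape(5400, 6)
windows = window_imu(clip)
print(windows.shape)  # (5, 1000, 6)
```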

	### 4.4 Metrics

	\| Metric \| Why \|
	\|--------\|-----\|
	\| Macro F1 \| Primary metric. Accounts for class imbalance (3.4% off-body would be invisible in accuracy) \|
	\| Accuracy @1 \| Standard comparison with COMODO/EVI-MAE baselines \|
	\| Per-worker F1 variance \| σ across workers — directly measures generalization stability \|
	\| Generalization gap \| Δ between within-person and cross-person F1 \|
	\| Per-class F1 \| Breakdown across working/break/not-worn \|
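The sketch below shows how these metrics could be computed with NumPy alone. It assumes integer class labels 0, 1, 2 and scores a class with no true or predicted instances as 0; the helper names are illustrative. The example uses a synthetic label vector with an 84/12/4 class mix to show why accuracy alone is misleading here.

```python
import numpy as np


def macro_f1(y_true, y_pred, n_classes=3):
    """Unweighted mean of per-class F1; an absent class scores 0 (an assumption)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


def per_worker_f1(y_true, y_pred, worker_ids, n_classes=3):
    """Macro F1 computed separately for each worker."""
    y_true, y_pred, worker_ids = map(np.asarray, (y_true, y_pred, worker_ids))
    return {
        str(w): macro_f1(y_true[worker_ids == w], y_pred[worker_ids == w], n_classes)
        for w in np.unique(worker_ids)
    }


def generalization_gap(within_f1, cross_f1):
    """Difference between within-person and cross-person F1."""
    return within_f1 - cross_f1


# Synthetic labels: a model that always predicts "working" (class 0).
y_true = np.array([0] * 84 + [1] * 12 + [2] * 4)
majority = np.zeros_like(y_true)
accuracy = float(np.mean(majority == y_true))
print(f"accuracy={accuracy:.2f}, macro_f1={macro_f1(y_true, majority):.2f}")
```

The per-worker F1 variance is then the standard deviation of the values returned by `per_worker_f1`. Note that a worker with no windows of some class is penalized under the zero-score convention, so the handling of absent classes should be fixed before comparing workers.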

	### 4.5 Baseline Comparisons

	#### Experiment A: IMU-Only Models (5 baselines)

	\| Model \| Type \| Params \| Reference \|
	\|-------\|------\|--------\|-----------\|
	\| DeepConvLSTM \| Supervised CNN+LSTM \| ~2M \| Ordóñez & Roggen, 2016 \|
	\| Attend & Discriminate \| Supervised attention \| ~3M \| Abedin et al., 2021 \|
	\| CrossHAR \| Cross-dataset transfer \| ~5M \| Hong et al., 2024 \|
	\| MOMENT-small \| Pre-trained time-series foundation \| ~40M \| Goswami et al., 2024 \|
	\| Mantis \| Pre-trained time-series \| ~15M \| Liu et al., 2024 \|

	All evaluated in both within-person (random split) and cross-person (L20%WO) settings. The gap between the two is the headline result.

	#### Experiment B: Video-Only Models (3 baselines)

	\| Model \| Type \| Params \| Reference \|
	\|-------\|------\|--------\|-----------\|
	\| TimeSformer-Base \| Transformer (K400 pretrained) \| 121M \| Bertasius et al., 2021 \|
	\| VideoMAE ViT-B \| MAE pre-trained Transformer \| 86M \| Tong et al., 2022 \|
	\| SlowFast R50 \| Dual-pathway CNN \| 34M \| Feichtenhofer et al., 2019 \|

	#### Experiment C: Cross-Modal Distillation

	\| Method \| Setup \| Reference \|
	\|--------\|-------\|-----------\|
	\| COMODO (MOMENT student) \| Frozen TimeSformer → MOMENT IMU encoder \| Chen et al., 2025 \|
	\| COMODO (Mantis student) \| Frozen TimeSformer → Mantis IMU encoder \| Chen et al., 2025 \|
	\| IMU2CLIP \| Joint video-IMU-text embedding \| Moon et al., 2023 \|

	Key question: Does the distilled IMU model generalize better to unseen workers than the supervised IMU model?

	#### Experiment D: Self-Supervised Pre-training

	\| Method \| Setup \| Reference \|
	\|--------\|-------\|-----------\|
	\| EVI-MAE \| Joint video+IMU MAE pre-training → fine-tune \| Zhang et al., 2024 \|
	\| IMU-only MAE \| IMU spectrogram MAE → fine-tune \| Adapted from EVI-MAE \|

	Pre-train on all workers (unlabeled, self-supervised), then fine-tune on train workers with labels, evaluate on held-out test workers. Compare with same architecture trained from scratch.

	#### Experiment E: Hard Case Analysis

	For a representative subset of workers, compute:
	- IMU statistics: RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
	- Temporal statistics: autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
	- Model performance: per-worker F1 from L1WO

	Then:
	1. Correlate IMU statistics with model difficulty (per-worker F1)
	2. Cluster workers by IMU feature similarity → do clusters predict generalization difficulty?
	3. Predict difficulty: train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
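A few of the listed IMU statistics can be sketched per window as follows. The exact definitions are assumptions made for illustration (for example, the spectrum is taken over the mean-removed acceleration magnitude, and jerk is a first difference scaled by the sample rate); the analysis may settle on different ones.

```python
import numpy as np


def imu_window_stats(window, rate_hz=200):
    """Summary statistics for one [T, 6] window (acc_xyz followed by gyro_xyz)."""
    window = np.asarray(window, dtype=np.float64)
    acc, gyro = window[:, :3], window[:, 3:]
    acc_mag = np.linalg.norm(acc, axis=1)
    # Magnitude spectrum of the mean-removed acceleration magnitude.
    spectrum = np.abs(np.fft.rfft(acc_mag - acc_mag.mean()))
    freqs = np.fft.rfftfreq(acc_mag.size, d=1.0 / rate_hz)
    total = spectrum.sum()
    jerk = np.diff(acc, axis=0) * rate_hz  # finite-difference derivative, m/s^3
    return {
        "rms_acc": float(np.sqrt(np.mean(acc_mag ** 2))),
        "gyro_var": float(gyro.var(axis=0).sum()),
        "spectral_centroid_hz": float((freqs * spectrum).sum() / total) if total > 0 else 0.0,
        "dominant_freq_hz": float(freqs[np.argmax(spectrum)]),
        "mean_jerk": float(np.linalg.norm(jerk, axis=1).mean()),
    }


# Synthetic window: gravity on z plus a 2 Hz oscillation, no rotation.
t = np.arange(1000) / 200.0
synthetic = np.zeros((1000, 6))
synthetic[:, 2] = 9.81 + np.sin(2 * np.pi * 2.0 * t)
stats = imu_window_stats(synthetic)
print(stats)
```

Per-worker features would then be aggregates (for example, the mean over a worker's windows), which can be correlated with per-worker F1 using `np.corrcoef`.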

	#### Experiment F: Data Scaling Laws (New at 500K)

	Train the best-performing model from Experiments A-D at increasing data scales:

	\| Scale \| Clips \| Est. Workers \| Est. Hours \|
	\|-------\|-------\|-------------\|------------\|
	\| 13K \| 12,997 \| 100 \| 97.5h \|
	\| 50K \| 50,000 \| ~385 \| 375h \|
	\| 100K \| 100,000 \| ~770 \| 750h \|
	\| 250K \| 250,000 \| ~1,000 \| 1,875h \|
	\| 500K \| 500,000 \| ~1,000 \| 3,750h \|

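	Measure: cross-person Macro F1 vs. training data size. Is there a scaling law? Does it plateau? Because the estimated worker count also grows from the 13K to the 250K point, this curve should be read alongside the worker scaling curve from Section 4.2 to separate the effect of more clips from the effect of more workers.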
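A simple way to test for a scaling law is to fit error ≈ a · N^(−b) by linear regression in log-log space. The sketch below assumes "error" means 1 − Macro F1 and uses synthetic values generated from a known power law purely to check the fitting routine; none of these numbers are experimental results.

```python
import numpy as np


def fit_power_law(n_clips, error):
    """Fit error = a * n_clips**(-b) by least squares on log-transformed data."""
    slope, intercept = np.polyfit(np.log(n_clips), np.log(error), deg=1)
    return float(np.exp(intercept)), float(-slope)


# The five planned scale points; error values are synthetic, NOT measured.
n = np.array([12_997, 50_000, 100_000, 250_000, 500_000], dtype=float)
synthetic_error = 2.0 * n ** -0.25
a, b = fit_power_law(n, synthetic_error)
print(f"a={a:.3f}, b={b:.3f}")
```

A pure power law cannot represent a plateau. If the measured curve flattens, a form with an irreducible term such as a · N^(−b) + c would need a nonlinear fit instead.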

	---

	## 5. Technical Approach

	### 5.1 IMU Preprocessing Pipeline

	```
	Raw JSONL (200Hz, acc+gyro)
	→ Parse, validate timestamps, interpolate gaps
	→ 5-second non-overlapping windows (1000 × 6)
	→ Two representations:
	(a) Raw time series [1000, 6] for Mantis/MOMENT/DeepConvLSTM
	(b) STFT spectrogram [160, 128, 6] for EVI-MAE/IMU-MAE
	→ Per-channel z-normalization (computed on training set only)
	```

	At 500K scale: 2.5M windows × 6,000 floats = ~56 GiB processed tensor. Fits comfortably in RAM on a cpu-upgrade instance. Preprocessing takes ~80h CPU but is embarrassingly parallel.

	Already built: I have the 200Hz windowing and video sync pipeline from earlier work on the pilot dataset.
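The gap-interpolation and normalization steps of the pipeline above could look roughly like this. It is a sketch, not the existing pipeline: linear interpolation onto a uniform 200 Hz grid and the helper names are assumptions.

```python
import numpy as np


def resample_uniform(t_us, samples, rate_hz=200):
    """Linearly interpolate [T, C] samples onto a uniform grid, filling timestamp gaps."""
    t_us = np.asarray(t_us, dtype=np.float64)
    samples = np.asarray(samples, dtype=np.float64)
    if np.any(np.diff(t_us) <= 0):
        raise ValueError("timestamps must be strictly increasing")
    step_us = 1e6 / rate_hz
    grid = np.arange(t_us[0], t_us[-1] + 0.5 * step_us, step_us)
    columns = [np.interp(grid, t_us, samples[:, c]) for c in range(samples.shape[1])]
    return grid, np.stack(columns, axis=1)


def fit_zscore(train_windows):
    """Per-channel mean and std over [N, T, C] training windows only."""
    mean = train_windows.mean(axis=(0, 1))
    std = train_windows.std(axis=(0, 1))
    return mean, np.where(std > 0, std, 1.0)  # guard against constant channels


def apply_zscore(windows, mean, std):
    """Normalize any split with statistics fitted on the training split."""
    return (windows - mean) / std


# One missing sample at t = 10,000 µs is filled by interpolation.
grid, resampled = resample_uniform([0, 5000, 15000], [[0.0] * 6, [1.0] * 6, [3.0] * 6])

rng = np.random.default_rng(0)
train_windows = rng.normal(loc=3.0, scale=2.0, size=(4, 1000, 6))
mean, std = fit_zscore(train_windows)
normalized = apply_zscore(train_windows, mean, std)
```

Validation and test windows are normalized with the same `mean` and `std`, so no statistics from held-out workers leak into training.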

	### 5.2 Video Preprocessing Pipeline

	```
	MP4 (27s clips, ~100 MiB each)
	→ Decode at 10 FPS → 270 frames per clip
	→ 5-second windows → 50 frames per window
	→ Resize to 224×224, normalize (ImageNet stats)
	→ Pre-compute video encoder embeddings (store to disk)
	```

	At 500K scale: Pre-computing video embeddings is critical. We do NOT download all 47 TiB of video. Instead:
	- Stream video clips on-demand during embedding extraction
	- TimeSformer-B produces 768-dim vectors → 2.5M windows × 768 × 4 bytes ≈ 7.2 GiB per model. Easily stored.
	- Embedding extraction: ~200h on A10G (parallelizable across 4 GPUs → ~50h wall-clock)

	### 5.3 COMODO Distillation Protocol

	Following Chen et al. (2025) exactly:
	- Video teacher: TimeSformer-Base, frozen, K400-pretrained. MLP projector (hidden=2048, out=128)
	- IMU student: MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
	- Loss: Cross-entropy on similarity distribution with dynamic FIFO instance queue
	- Hyperparameters: 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
	- Hardware: A10G GPU (24 GB VRAM) — scales linearly with data, ~12h per student per fold at 500K
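The loss described above can be read as a cross-entropy between the teacher's and the student's similarity distributions over a queue of stored embeddings. The NumPy sketch below is one interpretation of that description, not the COMODO implementation: it assumes the queue holds past video (teacher) embeddings, that τ_v applies to the teacher and τ_x to the student, and that embeddings are L2-normalized before comparison.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def distillation_loss(video_emb, imu_emb, queue, tau_v=0.1, tau_x=0.05):
    """Cross-entropy between teacher and student similarity distributions over the queue."""
    v, x, q = l2_normalize(video_emb), l2_normalize(imu_emb), l2_normalize(queue)
    p_teacher = np.exp(log_softmax(v @ q.T / tau_v))   # frozen teacher: target distribution
    log_p_student = log_softmax(x @ q.T / tau_x)       # trainable student
    return float(-(p_teacher * log_p_student).sum(axis=1).mean())


def update_queue(queue, new_video_emb):
    """FIFO update: append the newest embeddings and drop the oldest, keeping size fixed."""
    return np.concatenate([queue, new_video_emb], axis=0)[len(new_video_emb):]


# Random stand-ins for projected 128-dim embeddings (batch of 8, queue of 64).
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(8, 128))
imu_emb = rng.normal(size=(8, 128))
queue = rng.normal(size=(64, 128))
loss = distillation_loss(video_emb, imu_emb, queue)
queue = update_queue(queue, video_emb)
print(f"loss={loss:.3f}, queue shape={queue.shape}")
```

In a real training loop the student term would be computed in an autodiff framework so that gradients flow only into the IMU encoder and its projector.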

	### 5.4 EVI-MAE Protocol

	Following Zhang et al. (2024):
	- Video encoder: ViT-Base (patch 16×16), 90% masking ratio
	- IMU encoder: Transformer on STFT spectrogram patches (16×16), 75% masking ratio
	- Fusion: Single-layer Transformer
	- Pre-training loss: MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
	- Pre-train: 100 epochs on all workers (fewer epochs needed — each epoch sees 38× more data)
	- Fine-tune: 50 epochs on train split with labels
	- Hardware: A100x4 (4× 80GB) for pre-training (~6 days), A10G for fine-tuning
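The masking step can be illustrated with uniform random patch sampling. This sketch assumes a [160, 128] spectrogram split into 16×16 patches (a 10 × 8 grid, 80 patches per channel) and uniform random selection; the published method may structure its masking differently.

```python
import numpy as np


def random_mask(n_patches, mask_ratio, rng):
    """Return sorted indices of visible and masked patches for one sample."""
    n_keep = int(round(n_patches * (1 - mask_ratio)))
    perm = rng.permutation(n_patches)
    return np.sort(perm[:n_keep]), np.sort(perm[n_keep:])


rng = np.random.default_rng(0)
imu_patches = (160 // 16) * (128 // 16)  # 80 patches per spectrogram channel
keep, masked = random_mask(imu_patches, mask_ratio=0.75, rng=rng)
print(len(keep), len(masked))  # 20 visible, 60 masked
```

Only the visible patches are passed to the encoder; the decoder reconstructs the masked ones, which is where the reconstruction loss is applied.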

	---

	## 6. Timeline (16 Weeks)

	### Stage 1: Pilot Validation (Weeks 1-4)

	\| Week \| Phase \| Deliverable \| Compute \|
	\|------\|-------\|-------------\|---------\|
	\| 1 \| Data preparation \| IMU windowing, video embedding extraction, EDA notebook on pilot \| CPU: 12h, A10G: 10h \|
	\| 2 \| IMU baselines (pilot) \| 5 IMU models × 5 folds on 13K clips. Validate pipeline, establish baselines \| A10G: 20h \|
	\| 3 \| Video + COMODO (pilot) \| Video baselines + COMODO distillation on 13K clips \| A10G: 30h \|
	\| 4 \| EVI-MAE (pilot) + triage \| Self-supervised pre-training on pilot. Select methods to scale \| A10G: 40h \|

	### Stage 2: Full Scale (Weeks 5-14)

	\| Week \| Phase \| Deliverable \| Compute \|
	\|------\|-------\|-------------\|---------\|
	\| 5-6 \| 500K data ingestion \| Download 238 GiB IMU. Stream-extract video embeddings for 500K clips (3 models) \| CPU: 88h, A10G: 600h \|
	\| 7-8 \| IMU baselines (500K) \| 5 IMU models × 5 folds at full scale \| A10G: 250h \|
	\| 9-10 \| Video + COMODO (500K) \| Video baselines + COMODO distillation at full scale \| A10G: 240h \|
	\| 11-12 \| EVI-MAE (500K) \| Self-supervised pre-training on all workers (~6 days on A100x4) + fine-tuning \| A100x4: 150h, A10G: 40h \|
	\| 13 \| Hard case analysis \| Per-worker difficulty on 100-worker subset, IMU feature correlation \| A10G: 100h, CPU: 40h \|
	\| 14 \| Scaling experiments \| Data scaling curve (13K→500K), worker scaling curve (100→800) \| A10G: 375h \|

	### Stage 3: Synthesis (Weeks 15-16)

	\| Week \| Phase \| Deliverable \| Compute \|
	\|------\|-------\|-------------\|---------\|
	\| 15 \| Analysis \| Compile results, statistical significance tests, generate figures \| CPU: 20h \|
	\| 16 \| Write-up \| Paper draft (8-page main + appendix) \| — \|

	Total wall-clock: 16 weeks
	Active compute window: Weeks 1-14 (14 weeks of experiments)

	---

	## 7. Compute Budget

	### 7.1 Summary

	\| Resource \| Hours \| Unit Cost \| Total \|
	\|----------\|-------\|-----------\|-------\|
	\| CPU (preprocessing, analysis) \| 160h \| $0.60/h \| $96 \|
	\| A10G GPU (24 GB) \| 1,705h \| $2.00/h \| $3,410 \|
	\| A100x4 GPU (4× 80 GB) \| 150h \| $16.00/h \| $2,400 \|
	\| Subtotal \| 2,015h \| \| $5,906 \|
	\| 30% contingency \| \| \| +$1,772 \|
	\| Total \| \| \| ~$7,678 \|
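The summary figures can be reproduced from the hourly rates and hour totals in the table. A small sketch of that arithmetic (variable names are illustrative):

```python
# Hourly rates (USD) and total hours, copied from the summary table.
RATES = {"CPU": 0.60, "A10G": 2.00, "A100x4": 16.00}
HOURS = {"CPU": 160, "A10G": 1705, "A100x4": 150}
CONTINGENCY = 0.30

line_items = {name: HOURS[name] * RATES[name] for name in HOURS}
subtotal = sum(line_items.values())
contingency = subtotal * CONTINGENCY
total = subtotal + contingency

for name, cost in line_items.items():
    print(f"{name}: {HOURS[name]}h -> ${cost:,.0f}")
print(f"subtotal ${subtotal:,.0f}, contingency ${contingency:,.0f}, total ${total:,.0f}")
```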

	### 7.2 Phase Breakdown

	\| Phase \| Resource \| Hours \| Cost \| Notes \|
	\|-------\|----------\|-------\|------\|-------\|
	\| 1. Pilot validation \| A10G + CPU \| 100h + 12h \| $207 \| Weeks 1-4, fast iteration \|
	\| 2. Data ingestion + embedding extraction \| CPU + A10G \| 88h + 600h \| $1,253 \| Parallelizable across 4 GPUs \|
	\| 3. IMU baselines (500K) \| A10G \| 250h \| $500 \| 5 models × 5 folds \|
	\| 4. Video + COMODO (500K) \| A10G \| 240h \| $480 \| 3 video models + 2 distillation students \|
	\| 5. EVI-MAE pre-training \| A100x4 \| 150h \| $2,400 \| ~6 days continuous \|
	\| 5b. EVI-MAE fine-tuning \| A10G \| 40h \| $80 \| 5 folds \|
	\| 6. Hard case analysis \| A10G + CPU \| 100h + 40h \| $224 \| L1WO on 100-worker subset \|
	\| 7. Scaling experiments \| A10G \| 375h \| $750 \| 5 scale points × 5 folds \|
	\| 8. Synthesis \| CPU \| 20h \| $12 \| Figures, stats \|

	### 7.3 Storage Requirements

	\| Data \| Size \| Notes \|
	\|------\|------\|-------\|
	\| Raw IMU data (all 500K clips) \| ~238 GiB \| Download once, keep in project storage \|
	\| Processed IMU windows (float32) \| ~56 GiB \| 2.5M windows × [1000, 6] \|
	\| Video embeddings (3 models) \| ~22 GiB \| 2.5M windows × 768-dim × 3 encoders \|
	\| Selective raw video (10% for EVI-MAE) \| ~4.8 TiB \| Stream during pre-training; cache hot subset \|
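	\| STFT spectrograms (for EVI-MAE) \| ~200 GiB \| 2.5M × [160, 128, 6]; uncompressed float32 at this shape is ~1.1 TiB, so this budget requires reduced precision, compression, or computing spectrograms on the fly \|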
	\| Model checkpoints & logs \| ~100 GiB \| ~50 experiments × ~2 GiB each \|
	\| Total active storage \| ~5.4 TiB \| \|

	Key insight: We do NOT need all 47 TiB of video. For 80% of experiments (IMU baselines, COMODO, hard case analysis), only the 238 GiB of IMU data + 22 GiB of pre-computed video embeddings are needed. Only EVI-MAE pre-training requires raw video, and we stream it rather than pre-downloading.

	### 7.4 Comparison: Pilot vs. Full Scale

	\| \| Pilot (13K) \| Full (500K) \| Scale Factor \|
	\|---\|---\|---\|---\|
	\| Compute hours (CPU + GPU) \| 273h \| 2,015h \| 7.4× \|
	\| Cost \| $654 \| $7,678 \| 11.7× \|
	\| Timeline \| 10 weeks \| 16 weeks \| 1.6× \|
	\| Storage \| 284 GiB \| 5.4 TiB \| 19× \|

	The cost scales sub-linearly with data (11.7× cost for 38.5× more data) because: (a) embedding extraction amortizes across all experiments, (b) fewer epochs needed at larger data scale, (c) pilot experiments avoid wasting compute on methods that don't work.

	---

	## 8. Expected Contributions

	### 8.1 Empirical Contributions

	1. First cross-person generalization study on ego+IMU at 1,000-worker, 500K-clip scale. Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown), Ego4D (mixed, not controlled). With ~1,000 workers, that is roughly 55× the subject count of WEAR and 83× that of CMU-MMAC.

	2. Modality robustness comparison under person shift. No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.

	3. Data scaling laws for cross-person generalization. First systematic study of how cross-person accuracy scales with training data (13K → 500K). This has direct practical value: how much data do you need to collect?

	4. Worker difficulty prediction from IMU statistics. If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented.

	5. Cross-modal distillation under distribution shift. COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?

	### 8.2 Benchmark Contribution

	- Standardized splits (5-fold L20%WO) and evaluation protocol for 1,000-worker ego+IMU
	- Baseline results for 11+ models across 3 modality configurations at 500K scale
	- Per-worker difficulty scores for hard case analysis
	- Data scaling curves with fitted power laws
	- Open-source evaluation code and pre-computed features

	### 8.3 Practical Contribution

	- Guidelines for "how many workers to collect" (worker scaling curve)
	- Guidelines for "how many clips to collect" (data scaling curve)
	- Method for flagging hard-to-generalize deployment scenarios
	- Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video

	---

	## 9. Relation to Existing Work

	\| Paper \| What They Did \| What We Add \|
	\|-------\|---------------\|-------------\|
	\| COMODO (Chen et al., 2025) \| Video→IMU distillation on Ego4D/EgoExo4D/MMEA \| Test distillation under cross-person shift; ~1,000 workers, 500K clips \|
	\| EVI-MAE (Zhang et al., 2024) \| Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) \| Scale to ~1,000 workers; evaluate cross-person transfer of MAE representations \|
	\| Generalizable HAR Survey (Cai et al., 2025) \| Taxonomy of cross-person/device/position HAR \| Provide the large-scale empirical study their survey explicitly calls for \|
	\| IMU-Video OOD HAR (Cheshmi et al., 2025) \| Cross-modal SSL for OOD transfer \| Study OOD as cross-person (same task, different people) rather than cross-domain \|
	\| ColloSSL (Jain et al., 2022) \| Multi-device contrastive SSL for HAR \| Apply cross-device ideas to cross-person setting with paired video+IMU \|
	\| UniMTS (Zhang et al., 2024) \| Text-aligned IMU pre-training for zero-shot HAR \| Potential extension: text-conditioned cross-person generalization \|
	\| MIAM (Mehta et al., 2025) \| Industrial assembly monitoring (RGB+depth+IMU, 8 volunteers) \| Same domain but ~1,000 workers vs. 8; formal cross-person protocol \|

	---

	## 10. Risk Mitigation

	\| Risk \| Likelihood \| Mitigation \|
	\|------\|-----------\|------------\|
	\| 500K dataset access delayed \| Medium \| Start with 13K pilot (Weeks 1-4). All pipelines and pilot results are independently valuable \|
	\| Video streaming bottleneck at 47 TiB \| Medium \| Pre-compute embeddings (only 22 GiB). Only EVI-MAE needs raw video; use selective 10% subset \|
	\| 3-class task too easy (>95% accuracy) \| Medium — "working" is 84% majority class \| Use macro-F1 (penalizes majority bias); explore finer-grained temporal segmentation at scale \|
	\| EVI-MAE training unstable on A100x4 \| Low \| Validated on pilot first (Week 4). Follow published hyperparameters exactly; reduce lr if needed \|
	\| Compute budget overrun \| Medium \| 30% contingency ($1,772). Pilot phase identifies which methods to scale — skip underperformers \|
	\| Insufficient cross-person variance \| Very Low (already observed in pilot) \| ~1,000 workers should provide far more diversity than 100 \|

	---

	## 11. About the Researcher

	Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)

	I've already worked with this dataset in its earlier form (the compression challenge). Specifically:
	- Built the IMU (200Hz) windowing and video synchronization pipeline
	- Explored frequency-domain and temporal features from the IMU signals
	- Observed the cross-worker motion pattern differences that motivate this proposal
	- Profiled the full dataset structure (100 workers, 12,997 clips, 1.22 TiB)

	This prior work means I can move directly to model training on the pilot without pipeline development overhead, and scale to 500K with confidence in the data pipeline.

	---

	## 12. Key References

	1. Chen, B. et al. "COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric HAR." arXiv:2503.07259 (2025).
	2. Zhang, M. et al. "Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition." arXiv:2407.06628 (2024).
	3. Cai, Y. et al. "Towards Generalizable Human Activity Recognition: A Survey." arXiv:2508.12213 (2025).
	4. Cheshmi, S. et al. "Improving OOD HAR via IMU-Video Cross-modal Representation Learning." arXiv:2507.13482 (2025).
	5. Moon, S. et al. "IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors." arXiv:2210.14395 (2023).
	6. Zhang, X. et al. "UniMTS: Unified Pre-training for Motion Time Series." arXiv:2410.19818 (2024).
	7. Mehta, N. et al. "A Multimodal Dataset for Enhancing Industrial Task Monitoring." arXiv:2501.05936 (2025).
	8. Jain, Y. et al. "ColloSSL: Collaborative Self-Supervised Learning for HAR." arXiv:2202.00758 (2022).
	9. Grauman, K. et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." arXiv:2110.07058 (2022).
	10. Grauman, K. et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." arXiv:2311.18259 (2023).

	---

	End of proposal.