| # Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study at 500K Scale | |
| **Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com) | |
| **Date:** April 2026 | |
| **Status:** Proposal Draft | |
| --- | |
| ## 1. Motivation & Problem Statement | |
| Egocentric video and body-worn IMU sensors are the sensing backbone of AR glasses, smartwatches, and workplace wearables. Activity recognition from these modalities has made rapid progress — EVI-MAE reaches 87.96 mAP on CMU-MMAC, COMODO achieves 59.1% top-1 on Ego4D with IMU alone, and MoBind achieves state-of-the-art cross-modal retrieval. But all of these results are reported on **within-distribution** splits where training and test data come from the same population of users. | |
| **The real-world failure mode is cross-person generalization.** A model trained on workers A–H and deployed on an unseen worker J can lose a large share of its accuracy. The recent survey on generalizable HAR (Cai et al., 2025) formalizes this setting as: | |
| > U_train ∩ U_test = ∅, p(X, y | U_train) ≠ p(X, y | U_test) | |
| This is the setting that matters for any commercial wearable deployment, yet it remains systematically understudied because existing ego+IMU datasets have too few subjects (CMU-MMAC: 12, WEAR: 18, MMEA: unknown) to support meaningful cross-person evaluation at scale. | |
| ### The Opportunity | |
| We will use **500K clips from the Egocentric-1M dataset**, drawn from a pool of ~1,000 workers performing industrial/workplace tasks. That is roughly 50-80× the subject count of CMU-MMAC (12) or WEAR (18), enough to hold out hundreds of unseen workers for testing, which current ego+IMU datasets cannot support. | |
| Additionally, we have early access to a **13K-clip, 100-worker pilot subset** (the Egocentric Native Compression dataset) for rapid prototyping and pipeline validation before scaling to 500K. | |
| | Property | Pilot (Available Now) | Full Scale (500K) | | |
| |----------|----------------------|-------------------| | |
| | Workers | 100 | ~1,000+ (estimated) | | |
| | Clips | 12,997 | **500,000** | | |
| | Duration | ~97.5 hours | **~3,750 hours** | | |
| | Total size | 1.22 TiB | **~47 TiB** | | |
| | IMU data | 6.18 GiB | **~238 GiB** | | |
| | 5s windows | ~65K | **~2.5M** | | |
| | IMU rate | 200 Hz (acc + gyro) | 200 Hz (acc + gyro) | | |
| | Activity labels | working / break / not-on-person | working / break / not-on-person | | |
| | Domain | Industrial / workplace | Industrial / workplace | | |
| From preliminary exploration of the pilot dataset, I've observed **consistent differences in motion patterns across workers performing similar tasks** — exactly the distribution shift that drives generalization failures. With ~1,000 workers and 500K clips, we can study this at a scale that no prior work has approached. | |
| --- | |
| ## 2. Research Questions | |
| 1. **How much does cross-person distribution shift degrade ego+IMU activity recognition?** | |
| Quantify the gap between within-person and cross-person accuracy across IMU-only, video-only, and multimodal setups. | |
| 2. **Which modality is more robust to person shift — video or IMU?** | |
| IMU captures body dynamics (person-specific); video captures visual scene (potentially person-invariant). Which degrades less under cross-person shift? | |
| 3. **Does video→IMU distillation improve cross-person robustness?** | |
| COMODO distills video knowledge into IMU encoders for same-distribution settings. Does the distilled representation also transfer better to unseen workers? | |
| 4. **Can we predict which workers will be "hard cases" before deployment?** | |
| Use IMU statistics (spectral features, signal energy, temporal patterns) to predict which unseen workers will cause generalization failure — enabling targeted data collection or adaptation. | |
| 5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?** | |
| EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases? | |
| 6. **What are the scaling laws for cross-person generalization?** *(New at 500K scale)* | |
| How does downstream cross-person accuracy change as we increase training data from 13K → 50K → 100K → 250K → 500K clips? Is there a data scaling law for cross-person robustness? | |
| --- | |
| ## 3. Dataset Overview | |
| ### 3.1 Pilot Dataset (Available Now) | |
| **Source:** `gs://build-ai-egocentric-native-compression/` | |
| | Metric | Value | | |
| |--------|-------| | |
| | Total workers | 100 (selected from 150 candidates; 103 had complete data) | | |
| | Total clips | 12,997 | | |
| | Total duration | ~97.5 hours | | |
| | Total size | 1.22 TiB (1,240 GiB video + 6.18 GiB IMU) | | |
| | Video per clip | ~100 MiB MP4, ~27 seconds | | |
| | IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz | | |
| **Pilot worker statistics (from `index.csv`):** | |
| | Statistic | Clips per Worker | | |
| |-----------|-----------------| | |
| | Min | 76 | | |
| | Q1 (25th percentile) | 95 | | |
| | Median | 129.5 | | |
| | Q3 (75th percentile) | 164 | | |
| | Max | 203 | | |
| | Mean ± Std | 130.0 ± 36.3 | | |
| **Activity state distribution (per-worker metadata):** | |
| | State | Mean Fraction | Std | Min | Max | | |
| |-------|---------------|-----|-----|-----| | |
| | Working | 0.843 | 0.110 | 0.250 | 0.957 | | |
| | Taking break | 0.123 | 0.110 | 0.014 | 0.721 | | |
| | Not on person | 0.034 | 0.016 | 0.000 | 0.058 | | |
| | Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 | | |
| ### 3.2 Full Dataset (500K Target) | |
| | Metric | Estimated Value | | |
| |--------|-----------------| | |
| | Total workers | ~1,000+ | | |
| | Total clips | **500,000** | | |
| | Total duration | **~3,750 hours (~156 days)** | | |
| | Video data | **~46.9 TiB** | | |
| | IMU data | **~238 GiB** | | |
| | 5s windows | **~2,500,000** | | |
| | IMU window tensor (float32) | **~56 GiB** | | |
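The full-scale estimates in the table above follow from scaling the pilot totals linearly by clip count. The sketch below reproduces that arithmetic, assuming per-clip size and duration at 500K match the pilot. The function and constant names are illustrative and not part of any existing pipeline.

```python
# Pilot figures quoted in this proposal.
PILOT_CLIPS = 12_997
PILOT_HOURS = 97.5
PILOT_TOTAL_TIB = 1.22
PILOT_IMU_GIB = 6.18

TARGET_CLIPS = 500_000
WINDOWS_PER_CLIP = 5            # a ~27 s clip holds five full 5 s windows
FLOATS_PER_WINDOW = 1_000 * 6   # 200 Hz x 5 s x 6 channels
BYTES_PER_FLOAT32 = 4


def extrapolate(target_clips: int = TARGET_CLIPS) -> dict:
    """Linearly scale pilot totals to a target clip count."""
    factor = target_clips / PILOT_CLIPS
    windows = target_clips * WINDOWS_PER_CLIP
    return {
        "scale_factor": factor,
        "hours": PILOT_HOURS * factor,
        "total_tib": PILOT_TOTAL_TIB * factor,
        "imu_gib": PILOT_IMU_GIB * factor,
        "windows": windows,
        "imu_tensor_gib": windows * FLOATS_PER_WINDOW * BYTES_PER_FLOAT32 / 2**30,
    }


if __name__ == "__main__":
    for name, value in extrapolate().items():
        print(f"{name}: {value:,.1f}")
```

If the full dataset turns out to have longer clips or a different bitrate than the pilot, only the four pilot constants need to change.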
| ### 3.3 Notable Pilot Workers (Hard Cases) | |
| | Worker | Working % | Break % | Off-body % | Clips | Notes | | |
| |--------|-----------|---------|------------|-------|-------| | |
| | worker_039 | 25.0% | 72.1% | 2.9% | 135 | Extreme outlier — mostly on break | | |
| | worker_085 | 37.3% | 57.5% | 5.2% | 149 | Majority break time | | |
| | worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split | | |
| | worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body | | |
| At 500K scale, with roughly 10× as many workers, we expect a wider range of behavior: more outliers, heavier distribution tails, and harder generalization cases. | |
| ### 3.4 Curation | |
| - First and last 2 clips trimmed per worker (removes startup/shutdown artifacts) | |
| - Audio stripped from all videos | |
| - IMU format: JSONL with `t_us` (microseconds), `acc` [x,y,z] (m/s²), `gyro` [x,y,z] (rad/s) | |
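A minimal sketch of reading the IMU format described above. It assumes one JSON object per line with exactly the `t_us`, `acc`, and `gyro` keys. The helper names are hypothetical, and the real files may carry extra fields that this parser ignores.

```python
import json
from typing import Iterable, List, Tuple

# (timestamp in microseconds, [ax, ay, az, gx, gy, gz])
Sample = Tuple[int, List[float]]


def parse_imu_lines(lines: Iterable[str]) -> List[Sample]:
    """Parse JSONL IMU records into chronologically sorted samples."""
    samples: List[Sample] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        acc, gyro = list(record["acc"]), list(record["gyro"])
        if len(acc) != 3 or len(gyro) != 3:
            raise ValueError(f"expected 3-axis acc and gyro, got {record!r}")
        samples.append((int(record["t_us"]), [float(v) for v in acc + gyro]))
    samples.sort(key=lambda s: s[0])  # enforce chronological order
    return samples


def max_gap_us(samples: List[Sample]) -> int:
    """Largest spacing between consecutive timestamps (5,000 us nominal at 200 Hz)."""
    return max((b[0] - a[0] for a, b in zip(samples, samples[1:])), default=0)
```

A gap well above 5,000 µs flags dropped samples that would need interpolation before windowing.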
| --- | |
| ## 4. Experimental Design | |
| ### 4.1 Two-Stage Approach: Pilot → Full Scale | |
| **Stage 1 (Weeks 1-4): Pilot on 13K clips** | |
| - Validate all pipelines, debug models, establish baseline numbers | |
| - Run all experiments on 100-worker pilot for fast iteration (~hours per experiment) | |
| - Identify which methods are worth scaling | |
| **Stage 2 (Weeks 5-14): Scale to 500K clips** | |
| - Run the winning methods at full scale | |
| - Execute the data scaling experiment (unique contribution at this scale) | |
| - Full cross-person benchmark at ~1,000-worker scale | |
| ### 4.2 Cross-Person Evaluation Protocol | |
| **Primary protocol: 5-Fold Leave-20%-Workers-Out (L20%WO)** | |
| Split workers into 5 non-overlapping folds (20% each). For each fold: | |
| - **Train:** 60% of workers | |
| - **Validation:** 20% of workers | |
| - **Test:** 20% of workers | |
| | | Pilot (100 workers) | Full (est. ~1,000 workers) | | |
| |---|---|---| | |
| | Train | 60 workers (~7,800 clips) | ~600 workers (~300K clips) | | |
| | Val | 20 workers (~2,600 clips) | ~200 workers (~100K clips) | | |
| | Test | 20 workers (~2,600 clips) | ~200 workers (~100K clips) | | |
| Report mean ± std across 5 folds. | |
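The split protocol above can be sketched as follows. Workers (not clips) are shuffled and dealt into five disjoint groups, and each fold uses one group for test, the next for validation, and the remaining three for training. Rotating the validation group this way is an assumption of this sketch, as is the function name.

```python
import random
from typing import Dict, List


def make_worker_folds(
    worker_ids: List[str], n_folds: int = 5, seed: int = 0
) -> List[Dict[str, List[str]]]:
    """Build n_folds train/val/test splits in which no worker appears twice."""
    shuffled = sorted(worker_ids)           # sort first so the seed fully determines the order
    random.Random(seed).shuffle(shuffled)
    # Deal workers round-robin into n_folds disjoint groups (20% each for 5 folds).
    groups = [shuffled[i::n_folds] for i in range(n_folds)]

    splits = []
    for k in range(n_folds):
        val_idx = (k + 1) % n_folds         # the next group serves as validation
        train = [
            w
            for j, group in enumerate(groups)
            if j not in (k, val_idx)
            for w in group
        ]
        splits.append({"train": train, "val": groups[val_idx], "test": groups[k]})
    return splits
```

Because every clip inherits its worker's assignment, no window from a test worker can leak into training, and each worker is tested exactly once across the five folds.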
| **Secondary protocols:** | |
| - **Leave-1-Worker-Out (L1WO):** On a representative subset of 100 workers (from the full 1,000), train on 99 and test on the held-out 1, repeated for each of the 100 workers. Measures per-worker difficulty. | |
| - **Worker scaling curve:** Train on {100, 200, 400, 600, 800} workers, always test on the same held-out 200. Answers: "how many workers do you need?" | |
| - **Data scaling curve:** Train best model at {13K, 50K, 100K, 250K, 500K} clips. Answers: "how much data do you need?" | |
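One way to build the worker scaling curve is with nested training subsets, so that each larger set contains every smaller one and differences between scale points are not confounded by which workers were drawn. The nesting is a design choice of this sketch, not something the protocol above prescribes, and the function name is hypothetical.

```python
import random
from typing import Dict, List, Sequence


def nested_worker_subsets(
    train_pool: Sequence[str],
    sizes: Sequence[int] = (100, 200, 400, 600, 800),
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Return {size: workers} where each subset is a prefix of one shuffled order."""
    order = sorted(train_pool)
    random.Random(seed).shuffle(order)
    # Skip any requested size that exceeds the available pool.
    return {n: order[:n] for n in sizes if n <= len(order)}
```

The held-out 200 test workers are excluded from `train_pool` before calling this, so the same test set is reused at every scale point.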
| ### 4.3 Task Definition | |
| **3-class activity state classification:** | |
| - **Working** (84.3% mean per-worker fraction in pilot) | |
| - **Taking break** (12.3%) | |
| - **Not on person** (3.4%) | |
| **Window protocol** (following COMODO/Ego4D standard): | |
| - 5-second non-overlapping windows | |
| - IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz) | |
| - Video: 10 FPS × 5s = 50 frames, resized to 224×224 | |
| - Per 27s clip: ~5 windows → **~2.5M total windows at 500K scale** | |
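The window protocol above can be sketched with NumPy as a trim-and-reshape. The sketch assumes the clip has already been resampled onto a uniform 200 Hz grid, and the function name is illustrative.

```python
import numpy as np

SAMPLE_RATE_HZ = 200
WINDOW_SECONDS = 5
WINDOW_LEN = SAMPLE_RATE_HZ * WINDOW_SECONDS  # 1,000 timesteps


def window_imu(samples: np.ndarray, window_len: int = WINDOW_LEN) -> np.ndarray:
    """Cut a [T, C] IMU array into non-overlapping [N, window_len, C] windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    if samples.ndim != 2:
        raise ValueError("expected a 2-D array of shape [T, channels]")
    n_windows = samples.shape[0] // window_len
    trimmed = samples[: n_windows * window_len]
    return trimmed.reshape(n_windows, window_len, samples.shape[1])
```

A ~27 s clip with about 5,400 samples yields five windows and discards the final ~2 s, which is where the "~5 windows per clip" figure comes from.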
| ### 4.4 Metrics | |
| | Metric | Why | | |
| |--------|-----| | |
| | **Macro F1** | Primary metric. Accounts for class imbalance (3.4% off-body would be invisible in accuracy) | | |
| | **Top-1 accuracy** | Standard comparison with COMODO/EVI-MAE baselines | | |
| | **Per-worker F1 variance** | σ across workers — directly measures generalization stability | | |
| | **Generalization gap** | Δ between within-person and cross-person F1 | | |
| | **Per-class F1** | Breakdown across working / break / not-on-person | | |
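The sketch below shows how the metrics in this table could be computed without external dependencies. Label strings and function names are assumptions. For per-worker scores the sketch averages only over classes present in that worker's ground truth, which is one possible convention rather than a settled part of the protocol.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Sequence

CLASSES = ("working", "break", "not_on_person")


def macro_f1(
    y_true: Sequence[str], y_pred: Sequence[str], classes: Sequence[str] = CLASSES
) -> float:
    """Unweighted mean of per-class F1; a class with no TP, FP, or FN scores 0."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def per_worker_f1(
    workers: Sequence[str], y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, float]:
    """Macro F1 per worker, over the classes that worker actually exhibits."""
    grouped = defaultdict(lambda: ([], []))
    for w, t, p in zip(workers, y_true, y_pred):
        grouped[w][0].append(t)
        grouped[w][1].append(p)
    return {w: macro_f1(t, p, classes=sorted(set(t))) for w, (t, p) in grouped.items()}


def f1_spread(scores: Dict[str, float]) -> float:
    """Population standard deviation of per-worker F1."""
    return pstdev(scores.values())


def generalization_gap(within_person_f1: float, cross_person_f1: float) -> float:
    """Positive values mean the model is worse on unseen workers."""
    return within_person_f1 - cross_person_f1
```

A classifier that always predicts "working" on an 80/20 working/break sample reaches 80% accuracy but a macro F1 of about 0.30, which is why macro F1 is the primary metric here.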
| ### 4.5 Baseline Comparisons | |
| #### Experiment A: IMU-Only Models (5 baselines) | |
| | Model | Type | Params | Reference | | |
| |-------|------|--------|-----------| | |
| | DeepConvLSTM | Supervised CNN+LSTM | ~2M | Ordóñez & Roggen, 2016 | | |
| | Attend & Discriminate | Supervised attention | ~3M | Abedin et al., 2021 | | |
| | CrossHAR | Cross-dataset transfer | ~5M | Hong et al., 2024 | | |
| | MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 | | |
| | Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 | | |
| All evaluated in both **within-person** (random split) and **cross-person** (L20%WO) settings. The gap between the two is the headline result. | |
| #### Experiment B: Video-Only Models (3 baselines) | |
| | Model | Type | Params | Reference | | |
| |-------|------|--------|-----------| | |
| | TimeSformer-Base | Transformer (K400 pretrained) | 121M | Bertasius et al., 2021 | | |
| | VideoMAE ViT-B | MAE pre-trained Transformer | 86M | Tong et al., 2022 | | |
| | SlowFast R50 | Dual-pathway CNN | 34M | Feichtenhofer et al., 2019 | | |
| #### Experiment C: Cross-Modal Distillation | |
| | Method | Setup | Reference | | |
| |--------|-------|-----------| | |
| | COMODO (MOMENT student) | Frozen TimeSformer → MOMENT IMU encoder | Chen et al., 2025 | | |
| | COMODO (Mantis student) | Frozen TimeSformer → Mantis IMU encoder | Chen et al., 2025 | | |
| | IMU2CLIP | Joint video-IMU-text embedding | Moon et al., 2023 | | |
| **Key question:** Does the distilled IMU model generalize better to unseen workers than the supervised IMU model? | |
| #### Experiment D: Self-Supervised Pre-training | |
| | Method | Setup | Reference | | |
| |--------|-------|-----------| | |
| | EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 | | |
| | IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE | | |
| Pre-train on **all workers** (unlabeled, self-supervised), then fine-tune on train workers with labels, evaluate on held-out test workers. Compare with same architecture trained from scratch. | |
| #### Experiment E: Hard Case Analysis | |
| For a representative subset of workers, compute: | |
| - **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency | |
| - **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity | |
| - **Model performance:** per-worker F1 from L1WO | |
| Then: | |
| 1. **Correlate** IMU statistics with model difficulty (per-worker F1) | |
| 2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty? | |
| 3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment? | |
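A few of the IMU statistics listed above can be sketched with NumPy as follows. The exact feature definitions (for example, computing spectral features on the mean-removed acceleration magnitude) are assumptions of this sketch rather than a fixed specification, and the function names are illustrative.

```python
import numpy as np


def imu_features(window: np.ndarray, rate_hz: float = 200.0) -> dict:
    """Summary statistics for one [T, 6] window laid out as acc_xyz + gyro_xyz."""
    acc_mag = np.linalg.norm(window[:, :3], axis=1)
    detrended = acc_mag - acc_mag.mean()              # remove gravity / DC offset
    power = np.abs(np.fft.rfft(detrended)) ** 2       # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(detrended), d=1.0 / rate_hz)
    total = power.sum()
    jerk = np.diff(acc_mag) * rate_hz                 # finite-difference derivative
    return {
        "rms_acc": float(np.sqrt(np.mean(acc_mag ** 2))),
        "gyro_var": float(window[:, 3:].var(axis=0).sum()),
        "dominant_freq_hz": float(freqs[np.argmax(power)]) if total > 0 else 0.0,
        "spectral_centroid_hz": float((freqs * power).sum() / total) if total > 0 else 0.0,
        "jerk_rms": float(np.sqrt(np.mean(jerk ** 2))),
    }


def pearson(feature_values, f1_scores) -> float:
    """Pearson correlation between a per-worker feature and per-worker F1."""
    return float(np.corrcoef(feature_values, f1_scores)[0, 1])
```

Averaging each feature over a worker's windows gives one vector per worker, which can then be correlated with that worker's L1WO F1 or passed to a clustering or regression model.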
| #### Experiment F: Data Scaling Laws *(New at 500K)* | |
| Train the best-performing model from Experiments A-D at increasing data scales: | |
| | Scale | Clips | Est. Workers | Est. Hours | | |
| |-------|-------|-------------|------------| | |
| | 13K | 12,997 | 100 | 97.5h | | |
| | 50K | 50,000 | ~385 | 375h | | |
| | 100K | 100,000 | ~770 | 750h | | |
| | 250K | 250,000 | ~1,000 | 1,875h | | |
| | 500K | 500,000 | ~1,000 | 3,750h | | |
| Measure: cross-person Macro F1 vs. training data size. Is there a scaling law? Does it plateau? | |
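One simple way to test for a scaling law is to fit a pure power law, error ≈ a · N^(−b), by least squares in log-log space, where error is 1 − Macro F1. The sketch below assumes that functional form with no irreducible-error floor, which real results may violate. The numbers in the usage example are synthetic and not measurements.

```python
import numpy as np


def fit_power_law(n_clips, error):
    """Fit error ~= a * n^(-b) via linear regression on log(n) vs log(error)."""
    slope, intercept = np.polyfit(np.log(n_clips), np.log(error), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_error(n_clips, a: float, b: float):
    """Evaluate the fitted power law at new data sizes."""
    return a * np.asarray(n_clips, dtype=float) ** (-b)


if __name__ == "__main__":
    # Synthetic illustration only: an exact power law with a=2.0, b=0.3.
    n = np.array([12_997, 50_000, 100_000, 250_000, 500_000], dtype=float)
    err = 2.0 * n ** -0.3
    a, b = fit_power_law(n, err)
    print(f"a={a:.3f}, b={b:.3f}")
```

If the measured curve flattens at large N, the residuals of this fit will grow systematically at the 250K and 500K points, which is itself evidence of a plateau.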
| --- | |
| ## 5. Technical Approach | |
| ### 5.1 IMU Preprocessing Pipeline | |
| ``` | |
| Raw JSONL (200Hz, acc+gyro) | |
| → Parse, validate timestamps, interpolate gaps | |
| → 5-second non-overlapping windows (1000 × 6) | |
| → Two representations: | |
| (a) Raw time series [1000, 6] for Mantis/MOMENT/DeepConvLSTM | |
| (b) STFT spectrogram [160, 128, 6] for EVI-MAE/IMU-MAE | |
| → Per-channel z-normalization (computed on training set only) | |
| ``` | |
| **At 500K scale:** 2.5M windows × 6,000 float32 values × 4 bytes ≈ 56 GiB of processed tensors. Fits comfortably in RAM on a cpu-upgrade instance. Preprocessing takes ~80h of CPU time but is embarrassingly parallel. | |
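The last pipeline step, per-channel z-normalization with statistics from the training set only, can be sketched as below. The function names are illustrative, and the epsilon floor for near-constant channels is an added safeguard rather than part of the described pipeline.

```python
import numpy as np


def fit_channel_stats(train_windows: np.ndarray, eps: float = 1e-8):
    """Per-channel mean and std over all training windows and timesteps.

    train_windows has shape [N, T, C]; both returned arrays have shape [C].
    """
    mean = train_windows.mean(axis=(0, 1))
    std = train_windows.std(axis=(0, 1))
    return mean, np.maximum(std, eps)  # floor avoids division by zero


def apply_znorm(windows: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Normalize any split (train, val, or test) with the training statistics."""
    return (windows - mean) / std
```

Validation and test workers are normalized with the same `mean` and `std`, so no statistic from a held-out worker influences the model input.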
| *Already built:* I have the 200Hz windowing and video sync pipeline from earlier work on the pilot dataset. | |
| ### 5.2 Video Preprocessing Pipeline | |
| ``` | |
| MP4 (27s clips, ~100 MiB each) | |
| → Decode at 10 FPS → 270 frames per clip | |
| → 5-second windows → 50 frames per window | |
| → Resize to 224×224, normalize (ImageNet stats) | |
| → Pre-compute video encoder embeddings (store to disk) | |
| ``` | |
| **At 500K scale:** Pre-computing video embeddings is critical. We do NOT download all 47 TiB of video. Instead: | |
| - Stream video clips on-demand during embedding extraction | |
| - TimeSformer-B produces 768-dim vectors → 2.5M windows × 768 × 4 bytes ≈ **7.2 GiB** per model. Easily stored. | |
| - Embedding extraction: ~200h on A10G per encoder (~600h for all three); parallelizable across 4 GPUs → ~50h wall-clock per encoder | |
| ### 5.3 COMODO Distillation Protocol | |
| Following Chen et al. (2025) exactly: | |
| - **Video teacher:** TimeSformer-Base, frozen, K400-pretrained. MLP projector (hidden=2048, out=128) | |
| - **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128) | |
| - **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue | |
| - **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05 | |
| - **Hardware:** A10G GPU (24 GB VRAM) — scales linearly with data, ~12h per student per fold at 500K | |
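The loss in the list above can be illustrated with a NumPy sketch. The teacher and student each produce a softmax distribution over similarities to a FIFO queue of embeddings, and the student is trained to match the teacher under cross-entropy. This is a simplified reading, not the reference implementation. It assumes the queue stores teacher embeddings and that the current batch is enqueued before the loss is computed, and the class and function names are hypothetical.

```python
from collections import deque

import numpy as np


class InstanceQueue:
    """Fixed-size FIFO of embedding vectors; the oldest entries are evicted first."""

    def __init__(self, max_size: int):
        self._items = deque(maxlen=max_size)

    def enqueue(self, embeddings: np.ndarray) -> None:
        self._items.extend(np.asarray(embeddings))  # iterates over rows

    def as_array(self) -> np.ndarray:
        return np.stack(list(self._items))


def _l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def _log_softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(video_emb, imu_emb, queue, tau_v: float = 0.1, tau_x: float = 0.05) -> float:
    """Cross-entropy between teacher and student similarity distributions over the queue."""
    v, x, q = _l2_normalize(video_emb), _l2_normalize(imu_emb), _l2_normalize(queue)
    p_teacher = np.exp(_log_softmax(v @ q.T / tau_v))   # [B, K]
    log_p_student = _log_softmax(x @ q.T / tau_x)       # [B, K]
    return float(-(p_teacher * log_p_student).sum(axis=1).mean())
```

In training, only the student branch receives gradients. The frozen teacher supplies the target distribution and the queue contents.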
| ### 5.4 EVI-MAE Protocol | |
| Following Zhang et al. (2024): | |
| - **Video encoder:** ViT-Base (patch 16×16), 90% masking ratio | |
| - **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio | |
| - **Fusion:** Single-layer Transformer | |
| - **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05) | |
| - **Pre-train:** 100 epochs on all workers (fewer epochs needed — each epoch sees 38× more data) | |
| - **Fine-tune:** 50 epochs on train split with labels | |
| - **Hardware:** **A100x4 (4× 80GB)** for pre-training (~6 days), A10G for fine-tuning | |
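The masking ratios above translate into concrete patch counts. The sketch below draws a uniform random mask over a 16×16 patch grid. It assumes a 224×224 frame for video and the [160, 128] spectrogram shape from Section 5.1 for IMU, and it treats each frame independently. The published method may mask differently (for example, with temporally consistent tubes), so this is an illustration only, and the function names are hypothetical.

```python
import numpy as np


def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch x patch tiles in a height x width grid."""
    return (height // patch) * (width // patch)


def random_patch_mask(n_patches: int, mask_ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Boolean mask with True for masked (hidden) patches, chosen uniformly at random."""
    n_masked = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.permutation(n_patches)[:n_masked]] = True
    return mask
```

Under these assumptions a 160×128 spectrogram has 80 patches with 60 hidden at 75%, and a 224×224 frame has 196 patches with 176 hidden at 90%, so the encoder sees only 20 patches per frame.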
| --- | |
| ## 6. Timeline (16 Weeks) | |
| ### Stage 1: Pilot Validation (Weeks 1-4) | |
| | Week | Phase | Deliverable | Compute | | |
| |------|-------|-------------|---------| | |
| | **1** | Data preparation | IMU windowing, video embedding extraction, EDA notebook on pilot | CPU: 12h, A10G: 10h | | |
| | **2** | IMU baselines (pilot) | 5 IMU models × 5 folds on 13K clips. Validate pipeline, establish baselines | A10G: 20h | | |
| | **3** | Video + COMODO (pilot) | Video baselines + COMODO distillation on 13K clips | A10G: 30h | | |
| | **4** | EVI-MAE (pilot) + triage | Self-supervised pre-training on pilot. Select methods to scale | A10G: 40h | | |
| ### Stage 2: Full Scale (Weeks 5-14) | |
| | Week | Phase | Deliverable | Compute | | |
| |------|-------|-------------|---------| | |
| | **5-6** | 500K data ingestion | Download 238 GiB IMU. Stream-extract video embeddings for 500K clips (3 models) | CPU: 88h, A10G: 600h | | |
| | **7-8** | IMU baselines (500K) | 5 IMU models × 5 folds at full scale | A10G: 250h | | |
| | **9-10** | Video + COMODO (500K) | Video baselines + COMODO distillation at full scale | A10G: 240h | | |
| | **11-12** | EVI-MAE (500K) | Self-supervised pre-training on all workers (~6 days on A100x4) + fine-tuning | A100x4: 150h, A10G: 40h | | |
| | **13** | Hard case analysis | Per-worker difficulty on 100-worker subset, IMU feature correlation | A10G: 100h, CPU: 40h | | |
| | **14** | Scaling experiments | Data scaling curve (13K→500K), worker scaling curve (100→800) | A10G: 375h | | |
| ### Stage 3: Synthesis (Weeks 15-16) | |
| | Week | Phase | Deliverable | Compute | | |
| |------|-------|-------------|---------| | |
| | **15** | Analysis | Compile results, statistical significance tests, generate figures | CPU: 20h | | |
| | **16** | Write-up | Paper draft (8-page main + appendix) | — | | |
| **Total wall-clock:** 16 weeks | |
| **Active compute window:** Weeks 1-14 (14 weeks of experiments) | |
| --- | |
| ## 7. Compute Budget | |
| ### 7.1 Summary | |
| | Resource | Hours | Unit Cost | Total | | |
| |----------|-------|-----------|-------| | |
| | CPU (preprocessing, analysis) | 160h | $0.60/h | $96 | | |
| | A10G GPU (24 GB) | 1,705h | $2.00/h | $3,410 | | |
| | A100x4 GPU (4× 80 GB) | 150h | $16.00/h | $2,400 | | |
| | **Subtotal** | **2,015h** | | **$5,906** | | |
| | **30% contingency** | | | **+$1,772** | | |
| | **Total** | | | **~$7,678** | | |
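The summary table can be re-derived from the hourly rates and hour totals it lists. The sketch below is a small check of that arithmetic, with illustrative dictionary keys and function name.

```python
RATES_USD_PER_HOUR = {"cpu": 0.60, "a10g": 2.00, "a100x4": 16.00}
HOURS = {"cpu": 160, "a10g": 1_705, "a100x4": 150}
CONTINGENCY = 0.30


def budget(hours=HOURS, rates=RATES_USD_PER_HOUR, contingency=CONTINGENCY) -> dict:
    """Line items, subtotal, contingency, and total in USD."""
    line_items = {name: hours[name] * rates[name] for name in hours}
    subtotal = sum(line_items.values())
    return {
        "line_items": line_items,
        "total_hours": sum(hours.values()),
        "subtotal": subtotal,
        "contingency": subtotal * contingency,
        "total": subtotal * (1 + contingency),
    }


if __name__ == "__main__":
    result = budget()
    print(f"subtotal ${result['subtotal']:,.0f}, total ${result['total']:,.0f}")
```

Changing a single rate or hour figure and re-running shows how sensitive the total is. For example, the A100x4 line alone accounts for roughly 40% of the subtotal.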
| ### 7.2 Phase Breakdown | |
| | Phase | Resource | Hours | Cost | Notes | | |
| |-------|----------|-------|------|-------| | |
| | **1. Pilot validation** | A10G + CPU | 100h + 12h | $207 | Weeks 1-4, fast iteration | | |
| | **2. Data ingestion + embedding extraction** | CPU + A10G | 88h + 600h | $1,253 | Parallelizable across 4 GPUs | | |
| | **3. IMU baselines (500K)** | A10G | 250h | $500 | 5 models × 5 folds | | |
| | **4. Video + COMODO (500K)** | A10G | 240h | $480 | 3 video models + 2 distillation students | | |
| | **5. EVI-MAE pre-training** | A100x4 | 150h | $2,400 | ~6 days continuous | | |
| | **5b. EVI-MAE fine-tuning** | A10G | 40h | $80 | 5 folds | | |
| | **6. Hard case analysis** | A10G + CPU | 100h + 40h | $224 | L1WO on 100-worker subset | | |
| | **7. Scaling experiments** | A10G | 375h | $750 | 5 scale points × 5 folds | | |
| | **8. Synthesis** | CPU | 20h | $12 | Figures, stats | | |
| ### 7.3 Storage Requirements | |
| | Data | Size | Notes | | |
| |------|------|-------| | |
| | Raw IMU data (all 500K clips) | ~238 GiB | Download once, keep in project storage | | |
| | Processed IMU windows (float32) | ~56 GiB | 2.5M windows × [1000, 6] | | |
| | Video embeddings (3 models) | ~22 GiB | 2.5M windows × 768-dim × 3 encoders | | |
| | Selective raw video (10% for EVI-MAE) | ~4.8 TiB | Stream during pre-training; cache hot subset | | |
| | STFT spectrograms (for EVI-MAE) | ~200 GiB | 2.5M × [160, 128, 6]; uncompressed float32 would be ~1.1 TiB, so this figure assumes reduced precision or compression | | |
| | Model checkpoints & logs | ~100 GiB | ~50 experiments × ~2 GiB each | | |
| | **Total active storage** | **~5.4 TiB** | | | |
| **Key insight:** We do NOT need all 47 TiB of video. For 80% of experiments (IMU baselines, COMODO, hard case analysis), only the **238 GiB of IMU data + 22 GiB of pre-computed video embeddings** are needed. Only EVI-MAE pre-training requires raw video, and we stream it rather than pre-downloading. | |
| ### 7.4 Comparison: Pilot vs. Full Scale | |
| | | Pilot (13K) | Full (500K) | Scale Factor | | |
| |---|---|---|---| | |
| | Compute-hours (CPU + GPU) | 273h | 2,015h | 7.4× | | |
| | Cost | $654 | $7,678 | 11.7× | | |
| | Timeline | 10 weeks | 16 weeks | 1.6× | | |
| | Storage | 284 GiB | 5.4 TiB | 19× | | |
| Cost scales sub-linearly with data (11.7× the cost for 38.5× the data) because: (a) embedding extraction is amortized across all experiments, (b) fewer epochs are needed at larger data scale, and (c) pilot experiments avoid spending compute on methods that don't work. | |
| --- | |
| ## 8. Expected Contributions | |
| ### 8.1 Empirical Contributions | |
| 1. **First cross-person generalization study on ego+IMU at 1,000-worker, 500K-clip scale.** Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown), Ego4D (mixed, not controlled). We are **50-80× larger** in subject count. | |
| 2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset. | |
| 3. **Data scaling laws for cross-person generalization.** First systematic study of how cross-person accuracy scales with training data (13K → 500K). This has direct practical value: how much data do you need to collect? | |
| 4. **Worker difficulty prediction from IMU statistics.** If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented. | |
| 5. **Cross-modal distillation under distribution shift.** COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people? | |
| ### 8.2 Benchmark Contribution | |
| - Standardized splits (5-fold L20%WO) and evaluation protocol for 1,000-worker ego+IMU | |
| - Baseline results for 11+ models across 3 modality configurations at 500K scale | |
| - Per-worker difficulty scores for hard case analysis | |
| - Data scaling curves with fitted power laws | |
| - Open-source evaluation code and pre-computed features | |
| ### 8.3 Practical Contribution | |
| - Guidelines for "how many workers to collect" (worker scaling curve) | |
| - Guidelines for "how many clips to collect" (data scaling curve) | |
| - Method for flagging hard-to-generalize deployment scenarios | |
| - Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video | |
| --- | |
| ## 9. Relation to Existing Work | |
| | Paper | What They Did | What We Add | | |
| |-------|---------------|-------------| | |
| | **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift; ~1,000 workers, 500K clips | | |
| | **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to ~1,000 workers; evaluate cross-person transfer of MAE representations | | |
| | **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the large-scale empirical study their survey explicitly calls for | | |
| | **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain | | |
| | **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU | | |
| | **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization | | |
| | **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU, 8 volunteers) | Same domain but ~1,000 workers vs. 8; formal cross-person protocol | | |
| --- | |
| ## 10. Risk Mitigation | |
| | Risk | Likelihood | Mitigation | | |
| |------|-----------|------------| | |
| | 500K dataset access delayed | Medium | Start with 13K pilot (Weeks 1-4). All pipelines and pilot results are independently valuable | | |
| | Video streaming bottleneck at 47 TiB | Medium | Pre-compute embeddings (only 22 GiB). Only EVI-MAE needs raw video; use selective 10% subset | | |
| | 3-class task too easy (>95% accuracy) | Medium — "working" is 84% majority class | Use macro-F1 (penalizes majority bias); explore finer-grained temporal segmentation at scale | | |
| | EVI-MAE training unstable on A100x4 | Low | Validated on pilot first (Week 4). Follow published hyperparameters exactly; reduce lr if needed | | |
| | Compute budget overrun | Medium | 30% contingency ($1,772). Pilot phase identifies which methods to scale — skip underperformers | | |
| | Insufficient cross-person variance | Very Low (already observed in pilot) | With ~10× as many workers as the pilot, the full set should span considerably more behavioral diversity | | |
| --- | |
| ## 11. About the Researcher | |
| **Shubham Rasal** — [shubhamrasal.com](https://shubhamrasal.com) | |
| I've already worked with this dataset in its earlier form (the compression challenge). Specifically: | |
| - Built the **IMU (200Hz) windowing and video synchronization pipeline** | |
| - Explored **frequency-domain and temporal features** from the IMU signals | |
| - Observed the cross-worker motion pattern differences that motivate this proposal | |
| - Profiled the full dataset structure (100 workers, 12,997 clips, 1.22 TiB) | |
| This prior work means I can move directly to model training on the pilot without pipeline development overhead, and scale to 500K with confidence in the data pipeline. | |
| --- | |
| ## 12. Key References | |
| 1. Chen, B. et al. "COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric HAR." arXiv:2503.07259 (2025). | |
| 2. Zhang, M. et al. "Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition." arXiv:2407.06628 (2024). | |
| 3. Cai, Y. et al. "Towards Generalizable Human Activity Recognition: A Survey." arXiv:2508.12213 (2025). | |
| 4. Cheshmi, S. et al. "Improving OOD HAR via IMU-Video Cross-modal Representation Learning." arXiv:2507.13482 (2025). | |
| 5. Moon, S. et al. "IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors." arXiv:2210.14395 (2023). | |
| 6. Zhang, X. et al. "UniMTS: Unified Pre-training for Motion Time Series." arXiv:2410.19818 (2024). | |
| 7. Mehta, N. et al. "A Multimodal Dataset for Enhancing Industrial Task Monitoring." arXiv:2501.05936 (2025). | |
| 8. Jain, Y. et al. "ColloSSL: Collaborative Self-Supervised Learning for HAR." arXiv:2202.00758 (2022). | |
| 9. Grauman, K. et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." arXiv:2110.07058 (2022). | |
| 10. Grauman, K. et al. "Ego-Exo4D: Understanding Skilled Human Activity." arXiv:2311.18259 (2023). | |
| --- | |
| *End of proposal.* | |