ShubhamRasal
/

cross-person-ego-imu-benchmark


ShubhamRasal committed on Apr 26

Commit

cac9a3f

verified ·

1 Parent(s): f3b1201

Scale proposal to 500K datapoints from Egocentric-1M dataset — updated budgets, timeline, storage, new scaling experiment


PROPOSAL.md +194 -91


@@ -1,4 +1,4 @@
-# Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study on the Egocentric-1M Dataset
 **Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
 **Date:** April 2026
@@ -18,18 +18,23 @@ This is the setting that matters for any commercial wearable deployment, yet it
 ### The Opportunity
-The **Egocentric Native Compression** dataset provides a unique opportunity to study this problem properly:
-| Property | Value | Why It Matters |
-|----------|-------|----------------|
-| Workers | **100** (distinct individuals) | Largest cross-person ego+IMU study possible |
-| Clips | **12,997** | Statistical power for fine-grained analysis |
-| Duration | **~97.5 hours** | 3× larger than CMU-MMAC + WEAR combined |
-| IMU rate | **200 Hz** (acc + gyro) | Matches Ego4D protocol; richer spectral content than 50-60Hz datasets |
-| Activity labels | working / break / not-on-person | Free supervision per worker (from metadata) |
-| Domain | Industrial / workplace | Underrepresented in ego research (most work uses cooking/sports) |
-From preliminary exploration of the dataset, I've observed **consistent differences in motion patterns across workers performing similar tasks** — exactly the distribution shift that drives generalization failures. With 100 workers, we can finally quantify this properly.
 ---
@@ -50,13 +55,16 @@ From preliminary exploration of the dataset, I've observed **consistent differen
 5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?**
    EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
 ---
 ## 3. Dataset Overview
-**Source:** `gs://build-ai-egocentric-native-compression/`
-### 3.1 Scale
 | Metric | Value |
 |--------|-------|
@@ -67,7 +75,7 @@ From preliminary exploration of the dataset, I've observed **consistent differen
 | Video per clip | ~100 MiB MP4, ~27 seconds |
 | IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz |
-### 3.2 Worker Distribution
 | Statistic | Clips per Worker |
 |-----------|-----------------|
@@ -78,7 +86,7 @@ From preliminary exploration of the dataset, I've observed **consistent differen
 | Max | 203 |
 | Mean ± Std | 130.0 ± 36.3 |
-### 3.3 Activity State Distribution (per-worker metadata)
 | State | Mean Fraction | Std | Min | Max |
 |-------|---------------|-----|-----|-----|
@@ -87,7 +95,19 @@ From preliminary exploration of the dataset, I've observed **consistent differen
 | Not on person | 0.034 | 0.016 | 0.000 | 0.058 |
 | Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 |
-### 3.4 Notable Workers (Hard Cases)
 | Worker | Working % | Break % | Off-body % | Clips | Notes |
 |--------|-----------|---------|------------|-------|-------|
@@ -96,9 +116,9 @@ From preliminary exploration of the dataset, I've observed **consistent differen
 | worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split |
 | worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body |
-These workers represent the distribution shift challenge: models trained on high-working-fraction workers will fail on these "break-heavy" workers.
-### 3.5 Curation
 - First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
 - Audio stripped from all videos
@@ -108,37 +128,54 @@ These workers represent the distribution shift challenge: models trained on high
 ## 4. Experimental Design
-### 4.1 Cross-Person Evaluation Protocol
-**Primary protocol: 5-Fold Leave-20-Workers-Out (L20WO)**
-Split 100 workers into 5 non-overlapping folds of 20 workers each. For each fold:
-- **Train:** 60 workers (~7,800 clips, ~58.5h)
-- **Validation:** 20 workers (~2,600 clips, ~19.5h)
-- **Test:** 20 workers (~2,600 clips, ~19.5h)
-Report mean ± std across 5 folds. This is more rigorous than single-split evaluation and accounts for fold variance.
 **Secondary protocols:**
-- **Leave-1-Worker-Out (L1WO):** Train on 99 workers, test on 1. Repeat for all 100. Measures per-worker difficulty.
-- **Scaling curve:** Train on {10, 20, 40, 60, 80} workers, always test on the same held-out 20. Answers: "how many workers do you need?"
-### 4.2 Task Definition
-**3-class activity state classification** from the per-worker metadata:
-- **Working** (84.3% of clips)
 - **Taking break** (12.3%)
 - **Not on person** (3.4%)
-This is a natural, practically-relevant task with class imbalance — reflecting real deployment conditions.
 **Window protocol** (following COMODO/Ego4D standard):
 - 5-second non-overlapping windows
 - IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
 - Video: 10 FPS × 5s = 50 frames, resized to 224×224
-- Per 27s clip: ~5 windows → **~64,985 total windows**
-### 4.3 Metrics
 | Metric | Why |
 |--------|-----|
@@ -148,7 +185,7 @@ This is a natural, practically-relevant task with class imbalance — reflecting
 | **Generalization gap** | Δ between within-person and cross-person F1 |
 | **Per-class F1** | Breakdown across working/break/not-worn |
-### 4.4 Baseline Comparisons
 #### Experiment A: IMU-Only Models (5 baselines)
@@ -160,7 +197,7 @@ This is a natural, practically-relevant task with class imbalance — reflecting
 | MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 |
 | Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 |
-All evaluated in both **within-person** (random split) and **cross-person** (L20WO) settings. The gap between the two is the headline result.
 #### Experiment B: Video-Only Models (3 baselines)
@@ -187,11 +224,11 @@ All evaluated in both **within-person** (random split) and **cross-person** (L20
 | EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 |
 | IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE |
-Pre-train on **all 100 workers** (unlabeled, self-supervised), then fine-tune on 60 train workers with labels, evaluate on 20 test workers. Compare with same architecture trained from scratch.
 #### Experiment E: Hard Case Analysis
-For each of 100 workers, compute:
 - **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
 - **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
 - **Model performance:** per-worker F1 from L1WO
@@ -201,6 +238,20 @@ Then:
 2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty?
 3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
 ---
 ## 5. Technical Approach
@@ -217,7 +268,9 @@ Raw JSONL (200Hz, acc+gyro)
     → Per-channel z-normalization (computed on training set only)
 ```
-*Already built:* I have the 200Hz windowing and video sync pipeline from earlier work on this dataset.
 ### 5.2 Video Preprocessing Pipeline
@@ -226,10 +279,13 @@ MP4 (27s clips, ~100 MiB each)
     → Decode at 10 FPS → 270 frames per clip
     → 5-second windows → 50 frames per window
     → Resize to 224×224, normalize (ImageNet stats)
-    → For COMODO: pre-compute TimeSformer embeddings (store to disk)
 ```
-**Storage optimization:** Pre-compute and cache video embeddings rather than re-encoding each epoch. TimeSformer-B produces 768-dim vectors → 65K windows × 768 × 4 bytes = ~200 MB. Trivial to store.
 ### 5.3 COMODO Distillation Protocol
@@ -238,7 +294,7 @@ Following Chen et al. (2025) exactly:
 - **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
 - **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue
 - **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
-- **Hardware:** Single A10G GPU (24 GB VRAM)
 ### 5.4 EVI-MAE Protocol
@@ -247,56 +303,97 @@ Following Zhang et al. (2024):
 - **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio
 - **Fusion:** Single-layer Transformer
 - **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
-- **Pre-train:** 200 epochs on all 100 workers (unlabeled)
 - **Fine-tune:** 50 epochs on train split with labels
-- **Hardware:** A100-80GB for pre-training, A10G for fine-tuning
 ---
-## 6. Timeline (10 Weeks)
 | Week | Phase | Deliverable | Compute |
 |------|-------|-------------|---------|
-| **1** | Data preparation | IMU windowing, video frame extraction, EDA notebook, worker statistics dashboard | CPU: 12h |
-| **2** | IMU baselines | 5 IMU-only models × 5 folds = 25 runs. Within-person vs. cross-person gap quantified | T4: 50h |
-| **3** | Video baselines | Pre-compute video embeddings for all clips. 3 video models × 5 folds | A10G: 50h |
-| **4** | Video baselines + COMODO | Complete video experiments. Start COMODO distillation (2 students × 5 folds) | A10G: 60h |
-| **5** | EVI-MAE pre-training | Self-supervised pre-training on all 100 workers | A100: 36h |
-| **6** | EVI-MAE fine-tuning | Fine-tune + evaluate EVI-MAE across 5 folds | A10G: 15h |
-| **7** | Hard case analysis | Per-worker difficulty analysis, IMU feature extraction, correlation study | A10G: 15h + CPU: 10h |
-| **8** | Ablations & scaling curves | Worker scaling curve (10→80 train workers). Ablations on window size, IMU rate | A10G: 30h |
-| **9** | Synthesis & visualization | Compile all results, generate figures, statistical significance tests | CPU: 10h |
-| **10** | Write-up | Paper draft (8-page main + appendix) | — |
-**Total wall-clock:** 10 weeks
-**Active compute window:** Weeks 2-8 (7 weeks of experiments)
 ---
 ## 7. Compute Budget
 | Resource | Hours | Unit Cost | Total |
 |----------|-------|-----------|-------|
-| T4 GPU (16 GB) | 50h | $0.60/h | $30 |
-| A10G GPU (24 GB) | 155h | $2.00/h | $310 |
-| A100 GPU (80 GB) | 36h | $4.00/h | $144 |
-| CPU (preprocessing) | 32h | $0.60/h | $19 |
-| **Subtotal** | **273h** | | **$503** |
-| **30% contingency** | | | **+$151** |
-| **Total** | | | **~$654** |
-### Storage Requirements
-| Data | Size | Where |
-|------|------|-------|
-| IMU data (all workers) | 6.2 GiB | Download once, keep in project |
-| Video embeddings (pre-computed) | ~200 MiB | Generated in Week 3, reused everywhere |
-| Video frames (1 FPS samples for EDA) | ~7 GiB | Generated in Week 1 |
-| Full video (selective, 20 workers for EVI-MAE) | ~250 GiB | Download in Week 5 for MAE pre-training |
-| Model checkpoints & logs | ~20 GiB | Throughout |
-| **Total active storage** | **~284 GiB** | |
-**Note:** We do NOT need to download the full 1.22 TiB. For IMU experiments (Phases 2, 4), only 6.2 GiB of IMU data is needed. For video experiments (Phase 3), we pre-compute embeddings by streaming video in batches. Only Phase 5 (EVI-MAE) requires raw video, and we can work with a subset of workers.
 ---
@@ -304,24 +401,28 @@ Following Zhang et al. (2024):
 ### 8.1 Empirical Contributions
-1. **First systematic cross-person generalization study on ego+IMU at 100-worker scale.** Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown). We are 5-8× larger.
 2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
-3. **Worker difficulty prediction from IMU statistics.** If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented.
-4. **Cross-modal distillation under distribution shift.** COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?
 ### 8.2 Benchmark Contribution
-- Standardized splits (5-fold L20WO) and evaluation protocol for 100-worker ego+IMU
-- Baseline results for 11+ models across 3 modality configurations
 - Per-worker difficulty scores for hard case analysis
 - Open-source evaluation code and pre-computed features
 ### 8.3 Practical Contribution
-- Guidelines for "how many workers to collect" (scaling curve)
 - Method for flagging hard-to-generalize deployment scenarios
 - Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video
@@ -331,13 +432,13 @@ Following Zhang et al. (2024):
 | Paper | What They Did | What We Add |
 |-------|---------------|-------------|
-| **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift; 100 workers (vs. mixed populations) |
-| **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to 100 workers; evaluate cross-person transfer of MAE representations |
-| **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the missing large-scale empirical study their survey calls for |
 | **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain |
 | **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU |
 | **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization |
-| **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU) | Same domain but 100 workers vs. 8; formal cross-person protocol |
 ---
@@ -345,11 +446,12 @@ Following Zhang et al. (2024):
 | Risk | Likelihood | Mitigation |
 |------|-----------|------------|
-| Video files unreadable (`.gstmp` format) | Low — already confirmed as playable MP4 in prior exploration | Fall back to IMU-only experiments (still novel at this scale) |
-| 3-class task too easy (>95% accuracy) | Medium — "working" is 84% majority class | Switch to finer-grained temporal segmentation or use macro-F1 to penalize majority-class bias |
-| EVI-MAE training unstable at scale | Low | Follow published hyperparameters exactly; ablate learning rate |
-| Insufficient cross-person variance | Low — already observed in prior exploration | IMU statistics analysis (Phase 6) will quantify; 100 workers provides ample variance |
-| Compute overrun | Medium | Contingency budget (+30%); IMU-only experiments need minimal compute |
 ---
@@ -361,8 +463,9 @@ I've already worked with this dataset in its earlier form (the compression chall
 - Built the **IMU (200Hz) windowing and video synchronization pipeline**
 - Explored **frequency-domain and temporal features** from the IMU signals
 - Observed the cross-worker motion pattern differences that motivate this proposal
-This prior work means I can move directly to model training without pipeline development overhead.
 ---

+# Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study at 500K Scale
 **Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
 **Date:** April 2026
 ### The Opportunity
+We will use **500K clips from the Egocentric-1M dataset**, drawn from a pool of ~1,000 workers performing industrial/workplace tasks. By subject count this is roughly 50-80× larger than the ego+IMU datasets used in prior cross-person studies (12-18 subjects), which makes analyses feasible that are statistically underpowered on those datasets.
+Additionally, we have early access to a **13K-clip, 100-worker pilot subset** (the Egocentric Native Compression dataset) for rapid prototyping and pipeline validation before scaling to 500K.
+| Property | Pilot (Available Now) | Full Scale (500K) |
+|----------|----------------------|-------------------|
+| Workers | 100 | ~1,000+ (estimated) |
+| Clips | 12,997 | **500,000** |
+| Duration | ~97.5 hours | **~3,750 hours** |
+| Total size | 1.22 TiB | **~47 TiB** |
+| IMU data | 6.18 GiB | **~238 GiB** |
+| 5s windows | ~65K | **~2.5M** |
+| IMU rate | 200 Hz (acc + gyro) | 200 Hz (acc + gyro) |
+| Activity labels | working / break / not-on-person | working / break / not-on-person |
+| Domain | Industrial / workplace | Industrial / workplace |
+From preliminary exploration of the pilot dataset, I've observed **consistent differences in motion patterns across workers performing similar tasks** — exactly the distribution shift that drives generalization failures. With ~1,000 workers and 500K clips, we can study this at a scale that no prior work has approached.
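The full-scale column in the table above is a linear extrapolation of the pilot's per-clip averages. A minimal sketch of that arithmetic, assuming per-clip duration and size at 500K match the pilot (the constant and key names are illustrative, not part of any dataset tooling):

```python
# Linear extrapolation of the pilot statistics to the 500K-clip target.
# Assumption: per-clip duration and size at full scale match the pilot averages.
PILOT_CLIPS = 12_997
PILOT_TOTAL_TIB = 1.22
PILOT_IMU_GIB = 6.18
CLIP_SECONDS = 27        # approximate clip length
WINDOWS_PER_CLIP = 5     # 5 s non-overlapping windows per ~27 s clip
TARGET_CLIPS = 500_000

scale = TARGET_CLIPS / PILOT_CLIPS
projection = {
    "scale_factor": scale,
    "duration_hours": TARGET_CLIPS * CLIP_SECONDS / 3600,
    "total_tib": PILOT_TOTAL_TIB * scale,
    "imu_gib": PILOT_IMU_GIB * scale,
    "windows": TARGET_CLIPS * WINDOWS_PER_CLIP,
}

for name, value in projection.items():
    print(f"{name}: {value:,.1f}")
```

If the full dataset's clip length or bitrate differs from the pilot, only the constants at the top need to change.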
 ---
 5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?**
    EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
+6. **What are the scaling laws for cross-person generalization?** *(New at 500K scale)*
+   How does downstream cross-person accuracy change as we increase training data from 13K → 50K → 100K → 250K → 500K clips? Is there a data scaling law for cross-person robustness?
 ---
 ## 3. Dataset Overview
+### 3.1 Pilot Dataset (Available Now)
+**Source:** `gs://build-ai-egocentric-native-compression/`
 | Metric | Value |
 |--------|-------|
 | Video per clip | ~100 MiB MP4, ~27 seconds |
 | IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz |
+**Pilot worker statistics (from `index.csv`):**
 | Statistic | Clips per Worker |
 |-----------|-----------------|
 | Max | 203 |
 | Mean ± Std | 130.0 ± 36.3 |
+**Activity state distribution (per-worker metadata):**
 | State | Mean Fraction | Std | Min | Max |
 |-------|---------------|-----|-----|-----|
 | Not on person | 0.034 | 0.016 | 0.000 | 0.058 |
 | Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 |
+### 3.2 Full Dataset (500K Target)
+| Metric | Estimated Value |
+|--------|-----------------|
+| Total workers | ~1,000+ |
+| Total clips | **500,000** |
+| Total duration | **~3,750 hours (~156 days)** |
+| Video data | **~46.9 TiB** |
+| IMU data | **~238 GiB** |
+| 5s windows | **~2,500,000** |
+| IMU window tensor (float32) | **~56 GiB** |
+### 3.3 Notable Pilot Workers (Hard Cases)
 | Worker | Working % | Break % | Off-body % | Clips | Notes |
 |--------|-----------|---------|------------|-------|-------|
 | worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split |
 | worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body |
+At 500K scale with ~1,000 workers (10× the pilot's worker count), we expect substantially more behavioral diversity: more outliers, richer distribution tails, and harder generalization challenges.
+### 3.4 Curation
 - First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
 - Audio stripped from all videos
 ## 4. Experimental Design
+### 4.1 Two-Stage Approach: Pilot → Full Scale
+**Stage 1 (Weeks 1-4): Pilot on 13K clips**
+- Validate all pipelines, debug models, establish baseline numbers
+- Run all experiments on 100-worker pilot for fast iteration (~hours per experiment)
+- Identify which methods are worth scaling
+**Stage 2 (Weeks 5-14): Scale to 500K clips**
+- Run the winning methods at full scale
+- Execute the data scaling experiment (unique contribution at this scale)
+- Full cross-person benchmark at ~1,000-worker scale
+### 4.2 Cross-Person Evaluation Protocol
+**Primary protocol: 5-Fold Leave-20%-Workers-Out (L20%WO)**
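+Split workers into 5 non-overlapping folds (20% each). In each of the 5 rotations, one fold is held out for testing, one for validation, and the remaining three are used for training: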
+- **Train:** 60% of workers
+- **Validation:** 20% of workers
+- **Test:** 20% of workers
+| | Pilot (100 workers) | Full (est. ~1,000 workers) |
+|---|---|---|
+| Train | 60 workers (~7,800 clips) | ~600 workers (~300K clips) |
+| Val | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
+| Test | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
+Report mean ± std across 5 folds.
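A worker-level split of this kind can be sketched as follows. This is a minimal illustration, not the benchmark's released split code; in particular, the choice of fold `(k + 1) % n` as the validation fold is an assumption, since the proposal does not say which fold plays that role.

```python
import random


def make_worker_folds(worker_ids, n_folds=5, seed=0):
    """Shuffle worker IDs and deal them into n_folds disjoint groups."""
    ids = sorted(worker_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def rotate_splits(folds):
    """Yield (train, val, test) worker lists, one rotation per fold.

    Assumption: fold k is the test set and fold (k + 1) % n is validation.
    """
    n = len(folds)
    for k in range(n):
        held_out = (k, (k + 1) % n)
        train = [w for j, f in enumerate(folds) if j not in held_out for w in f]
        yield train, folds[held_out[1]], folds[held_out[0]]


# Illustrative IDs; the real naming scheme may differ.
workers = [f"worker_{i:03d}" for i in range(100)]
splits = list(rotate_splits(make_worker_folds(workers)))

for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Splitting by worker ID rather than by clip is what keeps every clip from a given person in exactly one of train, validation, or test.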
 **Secondary protocols:**
+- **Leave-1-Worker-Out (L1WO):** Run on a representative subset of 100 workers drawn from the full ~1,000: train on 99, test on the remaining 1, and repeat for all 100. Measures per-worker difficulty.
+- **Worker scaling curve:** Train on {100, 200, 400, 600, 800} workers, always test on the same held-out 200. Answers: "how many workers do you need?"
+- **Data scaling curve:** Train best model at {13K, 50K, 100K, 250K, 500K} clips. Answers: "how much data do you need?"
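For the worker scaling curve, one way to keep the comparison clean is to fix the held-out test workers once and make the training sets nested, so each larger set contains every smaller one. The sketch below assumes that nesting (the proposal does not state it) and uses hypothetical worker IDs.

```python
import random


def nested_worker_subsets(worker_ids, n_test=200,
                          sizes=(100, 200, 400, 600, 800), seed=0):
    """Return a fixed test set and nested training subsets of the given sizes."""
    ids = sorted(worker_ids)
    random.Random(seed).shuffle(ids)
    test, pool = ids[:n_test], ids[n_test:]
    if max(sizes) > len(pool):
        raise ValueError("largest training size exceeds the available worker pool")
    # Prefixes of one shuffled pool are nested by construction.
    return test, {n: pool[:n] for n in sizes}


# Hypothetical IDs for a ~1,000-worker pool.
all_workers = [f"w{i:04d}" for i in range(1000)]
test_workers, train_subsets = nested_worker_subsets(all_workers)

for n, subset in train_subsets.items():
    print(n, len(subset))
```

With nested subsets, a change in F1 between two points on the curve reflects the added workers rather than a different draw of the smaller set.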
+### 4.3 Task Definition
+**3-class activity state classification:**
+- **Working** (84.3% of clips in pilot)
 - **Taking break** (12.3%)
 - **Not on person** (3.4%)
 **Window protocol** (following COMODO/Ego4D standard):
 - 5-second non-overlapping windows
 - IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
 - Video: 10 FPS × 5s = 50 frames, resized to 224×224
+- Per 27s clip: ~5 windows → **~2.5M total windows at 500K scale**
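The window counts above follow from simple index arithmetic. A minimal sketch, assuming a trailing partial window is dropped (the proposal only says "~5 windows" per ~27 s clip); `window_bounds` is an illustrative helper name:

```python
def window_bounds(n_samples, rate_hz=200, window_s=5):
    """Return [start, end) sample indices of non-overlapping windows.

    Assumption: a trailing partial window is discarded.
    """
    size = rate_hz * window_s
    return [(s, s + size) for s in range(0, n_samples - size + 1, size)]


imu_bounds = window_bounds(5_400)               # ~27 s of IMU at 200 Hz
video_bounds = window_bounds(270, rate_hz=10)   # ~27 s of video at 10 FPS

print(len(imu_bounds), imu_bounds[0], imu_bounds[-1])
print(len(video_bounds), video_bounds[0], video_bounds[-1])
```

Using the same window index for both modalities keeps each IMU window aligned with its 50 video frames, provided the two streams share a common start time.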
+### 4.4 Metrics
 | Metric | Why |
 |--------|-----|
 | **Generalization gap** | Δ between within-person and cross-person F1 |
 | **Per-class F1** | Breakdown across working/break/not-worn |
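To illustrate why macro-averaged F1 is used alongside accuracy on this imbalanced task, the sketch below scores a trivial majority-class predictor on synthetic labels that follow the pilot class mix (84.3% / 12.3% / 3.4%). The labels are synthetic and the function names are illustrative.

```python
def per_class_f1(y_true, y_pred, labels):
    """F1 per class, defined as 0.0 when a class has no TP, FP, or FN."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


LABELS = ["working", "break", "not_on_person"]
# Synthetic labels matching the pilot class mix (843 / 123 / 34 per 1,000).
y_true = ["working"] * 843 + ["break"] * 123 + ["not_on_person"] * 34
y_pred = ["working"] * 1000   # trivial majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, LABELS)
print(f"accuracy={accuracy:.3f}  macro_f1={score:.3f}")
```

The majority predictor reaches high accuracy but scores zero F1 on two of the three classes, so its macro F1 is low. The generalization gap is then the within-person macro F1 minus the cross-person macro F1 for the same model.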
+### 4.5 Baseline Comparisons
 #### Experiment A: IMU-Only Models (5 baselines)
 | MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 |
 | Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 |
+All evaluated in both **within-person** (random split) and **cross-person** (L20%WO) settings. The gap between the two is the headline result.
 #### Experiment B: Video-Only Models (3 baselines)
 | EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 |
 | IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE |
+Pre-train on **all workers** (unlabeled, self-supervised), then fine-tune on train workers with labels, evaluate on held-out test workers. Compare with same architecture trained from scratch.
 #### Experiment E: Hard Case Analysis
+For a representative subset of workers, compute:
 - **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
 - **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
 - **Model performance:** per-worker F1 from L1WO
 2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty?
 3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
+#### Experiment F: Data Scaling Laws *(New at 500K)*
+Train the best-performing model from Experiments A-D at increasing data scales:
+| Scale | Clips | Est. Workers | Est. Hours |
+|-------|-------|-------------|------------|
+| 13K | 12,997 | 100 | 97.5h |
+| 50K | 50,000 | ~385 | 375h |
+| 100K | 100,000 | ~770 | 750h |
+| 250K | 250,000 | ~1,000 | 1,875h |
+| 500K | 500,000 | ~1,000 | 3,750h |
+Measure: cross-person Macro F1 vs. training data size. Is there a scaling law? Does it plateau?
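One common way to test for a scaling law is to fit a power law, error ≈ a · N^(−b), by least squares in log-log space. The sketch below assumes that functional form and defines error as 1 − macro F1; both are assumptions, and the data points are synthetic values generated from a known power law, not experimental results.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope


SIZES = [12_997, 50_000, 100_000, 250_000, 500_000]
# Synthetic errors from error = 2.0 * N^-0.2; NOT experimental results.
synthetic_errors = [2.0 * n ** -0.2 for n in SIZES]

a, b = fit_power_law(SIZES, synthetic_errors)
print(f"a={a:.3f}  b={b:.3f}")
```

A pure power law cannot represent a plateau. If the measured curve flattens, a form with an irreducible error term (error ≈ a · N^(−b) + c) would need a nonlinear fit instead.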
 ---
 ## 5. Technical Approach
     → Per-channel z-normalization (computed on training set only)
 ```
+**At 500K scale:** 2.5M windows × 6,000 floats = ~56 GiB processed tensor. Fits comfortably in RAM on a cpu-upgrade instance. Preprocessing takes ~80h CPU but is embarrassingly parallel.
+*Already built:* I have the 200Hz windowing and video sync pipeline from earlier work on the pilot dataset.
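The "computed on training set only" constraint in the pipeline above matters for cross-person evaluation: statistics that include test workers would leak information about them. A minimal pure-Python sketch of the idea, with illustrative function names; at 2.5M windows a vectorized array library would be needed in practice.

```python
import statistics


def fit_channel_stats(train_windows):
    """Per-channel (mean, std) over every timestep of every TRAINING window.

    train_windows: list of windows; each window is a list of timesteps;
    each timestep is a list of channel values (e.g. 6 for acc_xyz + gyro_xyz).
    """
    n_channels = len(train_windows[0][0])
    stats = []
    for c in range(n_channels):
        values = [step[c] for window in train_windows for step in window]
        std = statistics.pstdev(values)
        # Guard against constant channels to avoid division by zero.
        stats.append((statistics.fmean(values), std if std > 0 else 1.0))
    return stats


def normalize(window, stats):
    """Apply training-set statistics to any window (train, val, or test)."""
    return [[(v - m) / s for v, (m, s) in zip(step, stats)] for step in window]


# Tiny two-channel example; the second channel is constant in training.
train = [[[0.0, 10.0], [2.0, 10.0]]]
stats = fit_channel_stats(train)
test_window = normalize([[3.0, 11.0]], stats)
print(stats, test_window)
```

The statistics are fitted once per fold on that fold's training workers and then reused unchanged for validation and test windows.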
 ### 5.2 Video Preprocessing Pipeline
     → Decode at 10 FPS → 270 frames per clip
     → 5-second windows → 50 frames per window
     → Resize to 224×224, normalize (ImageNet stats)
+    → Pre-compute video encoder embeddings (store to disk)
 ```
+**At 500K scale:** Pre-computing video embeddings is critical. We do NOT download all 47 TiB of video. Instead:
+- Stream video clips on-demand during embedding extraction
+- TimeSformer-B produces 768-dim vectors → 2.5M windows × 768 × 4 bytes ≈ 7.68 GB (**~7.2 GiB**) per model. Easily stored.
+- Embedding extraction: ~200h on A10G per video encoder (~600h for 3 encoders); parallelizable across 4 GPUs → ~50h wall-clock per encoder
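The embedding storage figure can be reproduced with a short calculation. This sketch assumes one float32 vector of 768 dimensions per 5-second window and no container or index overhead; `embedding_gib` is an illustrative helper name.

```python
GIB = 2 ** 30


def embedding_gib(n_windows, dim=768, bytes_per_value=4, n_models=1):
    """Raw size in GiB of per-window embeddings, ignoring file-format overhead."""
    return n_windows * dim * bytes_per_value * n_models / GIB


per_model = embedding_gib(2_500_000)
three_models = embedding_gib(2_500_000, n_models=3)
pilot = embedding_gib(65_000)

print(f"per model: {per_model:.2f} GiB")
print(f"three models: {three_models:.2f} GiB")
print(f"pilot: {pilot * 1024:.0f} MiB")
```

Storing the vectors as float16 would halve these figures, at the cost of checking that downstream accuracy is unaffected.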
 ### 5.3 COMODO Distillation Protocol
 - **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
 - **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue
 - **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
+- **Hardware:** A10G GPU (24 GB VRAM). Training time scales roughly linearly with data: ~12h per student per fold at 500K
 ### 5.4 EVI-MAE Protocol
 - **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio
 - **Fusion:** Single-layer Transformer
 - **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
+- **Pre-train:** 100 epochs on all workers (down from 200 in the pilot-only plan; each epoch sees ~38× more data)
 - **Fine-tune:** 50 epochs on train split with labels
+- **Hardware:** **A100x4 (4× 80GB)** for pre-training (~6 days), A10G for fine-tuning
 ---
+## 6. Timeline (16 Weeks)
+### Stage 1: Pilot Validation (Weeks 1-4)
+| Week | Phase | Deliverable | Compute |
+|------|-------|-------------|---------|
+| **1** | Data preparation | IMU windowing, video embedding extraction, EDA notebook on pilot | CPU: 12h, A10G: 10h |
+| **2** | IMU baselines (pilot) | 5 IMU models × 5 folds on 13K clips. Validate pipeline, establish baselines | A10G: 20h |
+| **3** | Video + COMODO (pilot) | Video baselines + COMODO distillation on 13K clips | A10G: 30h |
+| **4** | EVI-MAE (pilot) + triage | Self-supervised pre-training on pilot. Select methods to scale | A10G: 40h |
+### Stage 2: Full Scale (Weeks 5-14)
 | Week | Phase | Deliverable | Compute |
 |------|-------|-------------|---------|
+| **5-6** | 500K data ingestion | Download 238 GiB IMU. Stream-extract video embeddings for 500K clips (3 models) | CPU: 88h, A10G: 600h |
+| **7-8** | IMU baselines (500K) | 5 IMU models × 5 folds at full scale | A10G: 250h |
+| **9-10** | Video + COMODO (500K) | Video baselines + COMODO distillation at full scale | A10G: 240h |
+| **11-12** | EVI-MAE (500K) | Self-supervised pre-training on all workers (~6 days on A100x4) + fine-tuning | A100x4: 150h, A10G: 40h |
+| **13** | Hard case analysis | Per-worker difficulty on 100-worker subset, IMU feature correlation | A10G: 100h, CPU: 40h |
+| **14** | Scaling experiments | Data scaling curve (13K→500K), worker scaling curve (100→800) | A10G: 375h |
+### Stage 3: Synthesis (Weeks 15-16)
+| Week | Phase | Deliverable | Compute |
+|------|-------|-------------|---------|
+| **15** | Analysis | Compile results, statistical significance tests, generate figures | CPU: 20h |
+| **16** | Write-up | Paper draft (8-page main + appendix) | — |
+**Total wall-clock:** 16 weeks
+**Active experiment window:** Weeks 1-14 (14 weeks of experiments; Week 15 adds ~20h of CPU-only analysis)
 ---
 ## 7. Compute Budget
+### 7.1 Summary
 | Resource | Hours | Unit Cost | Total |
 |----------|-------|-----------|-------|
+| CPU (preprocessing, analysis) | 160h | $0.60/h | $96 |
+| A10G GPU (24 GB) | 1,705h | $2.00/h | $3,410 |
+| A100x4 GPU (4× 80 GB) | 150h | $16.00/h | $2,400 |
+| **Subtotal** | **2,015h** | | **$5,906** |
+| **30% contingency** | | | **+$1,772** |
+| **Total** | | | **~$7,678** |
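The summary totals can be checked with a few lines of arithmetic. The hours and hourly rates below are copied from the table above; the dictionary names are illustrative.

```python
# Hourly rates (USD) and planned hours, taken from the summary table.
RATES = {"cpu": 0.60, "a10g": 2.00, "a100x4": 16.00}
HOURS = {"cpu": 160, "a10g": 1_705, "a100x4": 150}
CONTINGENCY = 0.30

line_items = {name: HOURS[name] * RATES[name] for name in HOURS}
subtotal = sum(line_items.values())
contingency = subtotal * CONTINGENCY
total = subtotal + contingency

for name, cost in line_items.items():
    print(f"{name:>7}: {HOURS[name]:>5}h  ${cost:,.0f}")
print(f"subtotal: ${subtotal:,.0f}  contingency: ${contingency:,.0f}  total: ${total:,.0f}")
```

Note that the A100x4 row counts node-hours: 150h on a 4-GPU node corresponds to 600 individual GPU-hours.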
+### 7.2 Phase Breakdown
+| Phase | Resource | Hours | Cost | Notes |
+|-------|----------|-------|------|-------|
+| **1. Pilot validation** | A10G + CPU | 100h + 12h | $207 | Weeks 1-4, fast iteration |
+| **2. Data ingestion + embedding extraction** | CPU + A10G | 88h + 600h | $1,253 | Parallelizable across 4 GPUs |
+| **3. IMU baselines (500K)** | A10G | 250h | $500 | 5 models × 5 folds |
+| **4. Video + COMODO (500K)** | A10G | 240h | $480 | 3 video models + 2 distillation students |
+| **5. EVI-MAE pre-training** | A100x4 | 150h | $2,400 | ~6 days continuous |
+| **5b. EVI-MAE fine-tuning** | A10G | 40h | $80 | 5 folds |
+| **6. Hard case analysis** | A10G + CPU | 100h + 40h | $224 | L1WO on 100-worker subset |
+| **7. Scaling experiments** | A10G | 375h | $750 | 5 scale points × 5 folds |
+| **8. Synthesis** | CPU | 20h | $12 | Figures, stats |
+### 7.3 Storage Requirements
+| Data | Size | Notes |
+|------|------|-------|
+| Raw IMU data (all 500K clips) | ~238 GiB | Download once, keep in project storage |
+| Processed IMU windows (float32) | ~56 GiB | 2.5M windows × [1000, 6] |
+| Video embeddings (3 models) | ~22 GiB | 2.5M windows × 768-dim × 3 encoders |
+| Selective raw video (10% for EVI-MAE) | ~4.8 TiB | Stream during pre-training; cache hot subset |
+| STFT spectrograms (for EVI-MAE) | ~200 GiB | 2.5M × [160, 128, 6] |
+| Model checkpoints & logs | ~100 GiB | ~50 experiments × ~2 GiB each |
+| **Total active storage** | **~5.4 TiB** | |
+**Key insight:** We do NOT need all 47 TiB of video. For most experiments (IMU baselines, COMODO, hard case analysis), only the **238 GiB of IMU data + 22 GiB of pre-computed video embeddings** are needed. Only EVI-MAE pre-training requires raw video, and we stream it rather than pre-downloading.
+### 7.4 Comparison: Pilot vs. Full Scale
+| | Pilot (13K) | Full (500K) | Scale Factor |
+|---|---|---|---|
+| Compute hours (GPU + CPU) | 273h | 2,015h | 7.4× |
+| Cost | $654 | $7,678 | 11.7× |
+| Timeline | 10 weeks | 16 weeks | 1.6× |
+| Storage | 284 GiB | 5.4 TiB | 19× |
+The cost scales sub-linearly with data (11.7× cost for 38.5× more data) because: (a) embedding extraction amortizes across all experiments, (b) fewer epochs needed at larger data scale, (c) pilot experiments avoid wasting compute on methods that don't work.
 ---
 ### 8.1 Empirical Contributions
+1. **First cross-person generalization study on ego+IMU at 1,000-worker, 500K-clip scale.** Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown), Ego4D (mixed, not controlled). We are **50-80× larger** in subject count.
 2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
+3. **Data scaling laws for cross-person generalization.** First systematic study of how cross-person accuracy scales with training data (13K → 500K). This has direct practical value: how much data do you need to collect?
+4. **Worker difficulty prediction from IMU statistics.** If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented.
+5. **Cross-modal distillation under distribution shift.** COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?
 ### 8.2 Benchmark Contribution
+- Standardized splits (5-fold L20%WO) and evaluation protocol for 1,000-worker ego+IMU
+- Baseline results for 11+ models across 3 modality configurations at 500K scale
 - Per-worker difficulty scores for hard case analysis
+- Data scaling curves with fitted power laws
 - Open-source evaluation code and pre-computed features
 ### 8.3 Practical Contribution
+- Guidelines for "how many workers to collect" (worker scaling curve)
+- Guidelines for "how many clips to collect" (data scaling curve)
 - Method for flagging hard-to-generalize deployment scenarios
 - Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video
 | Paper | What They Did | What We Add |
 |-------|---------------|-------------|
+| **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift; ~1,000 workers, 500K clips |
+| **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to ~1,000 workers; evaluate cross-person transfer of MAE representations |
+| **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the large-scale empirical study their survey explicitly calls for |
 | **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain |
 | **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU |
 | **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization |
+| **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU, 8 volunteers) | Same domain but ~1,000 workers vs. 8; formal cross-person protocol |
 ---
 | Risk | Likelihood | Mitigation |
 |------|-----------|------------|
+| 500K dataset access delayed | Medium | Start with 13K pilot (Weeks 1-4). All pipelines and pilot results are independently valuable |
+| Video streaming bottleneck at 47 TiB | Medium | Pre-compute embeddings (only 22 GiB). Only EVI-MAE needs raw video; use selective 10% subset |
+| 3-class task too easy (>95% accuracy) | Medium — "working" is 84% majority class | Use macro-F1 (penalizes majority bias); explore finer-grained temporal segmentation at scale |
+| EVI-MAE training unstable on A100x4 | Low | Validated on pilot first (Week 4). Follow published hyperparameters exactly; reduce lr if needed |
+| Compute budget overrun | Medium | 30% contingency ($1,772). Pilot phase identifies which methods to scale — skip underperformers |
+| Insufficient cross-person variance | Very Low (already observed in pilot) | ~1,000 workers should provide far more diversity than 100; the IMU statistics analysis will quantify it |
 ---
 - Built the **IMU (200Hz) windowing and video synchronization pipeline**
 - Explored **frequency-domain and temporal features** from the IMU signals
 - Observed the cross-worker motion pattern differences that motivate this proposal
+- Profiled the full pilot dataset structure (100 workers, 12,997 clips, 1.22 TiB)
+This prior work means I can move directly to model training on the pilot without pipeline development overhead, and scale to 500K with confidence in the data pipeline.
 ---