Commit cac9a3f · ShubhamRasal · verified · 1 Parent(s): f3b1201

Scale proposal to 500K datapoints from Egocentric-1M dataset — updated budgets, timeline, storage, new scaling experiment

Files changed (1): PROPOSAL.md (+194 -91)
@@ -1,4 +1,4 @@
1
- # Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study on the Egocentric-1M Dataset
2
 
3
  **Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
4
  **Date:** April 2026
@@ -18,18 +18,23 @@ This is the setting that matters for any commercial wearable deployment, yet it
18
 
19
  ### The Opportunity
20
 
21
- The **Egocentric Native Compression** dataset provides a unique opportunity to study this problem properly:
22
 
23
- | Property | Value | Why It Matters |
24
- |----------|-------|----------------|
25
- | Workers | **100** (distinct individuals) | Largest cross-person ego+IMU study possible |
26
- | Clips | **12,997** | Statistical power for fine-grained analysis |
27
- | Duration | **~97.5 hours** | 3× larger than CMU-MMAC + WEAR combined |
28
- | IMU rate | **200 Hz** (acc + gyro) | Matches Ego4D protocol; richer spectral content than 50-60Hz datasets |
29
- | Activity labels | working / break / not-on-person | Free supervision per worker (from metadata) |
30
- | Domain | Industrial / workplace | Underrepresented in ego research (most work uses cooking/sports) |
31
 
32
- From preliminary exploration of the dataset, I've observed **consistent differences in motion patterns across workers performing similar tasks** — exactly the distribution shift that drives generalization failures. With 100 workers, we can finally quantify this properly.
 
 
 
 
 
 
 
 
 
 
 
 
33
 
34
  ---
35
 
@@ -50,13 +55,16 @@ From preliminary exploration of the dataset, I've observed **consistent differen
50
  5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?**
51
  EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
52
 
 
 
 
53
  ---
54
 
55
  ## 3. Dataset Overview
56
 
57
- **Source:** `gs://build-ai-egocentric-native-compression/`
58
 
59
- ### 3.1 Scale
60
 
61
  | Metric | Value |
62
  |--------|-------|
@@ -67,7 +75,7 @@ From preliminary exploration of the dataset, I've observed **consistent differen
67
  | Video per clip | ~100 MiB MP4, ~27 seconds |
68
  | IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz |
69
 
70
- ### 3.2 Worker Distribution
71
 
72
  | Statistic | Clips per Worker |
73
  |-----------|-----------------|
@@ -78,7 +86,7 @@ From preliminary exploration of the dataset, I've observed **consistent differen
78
  | Max | 203 |
79
  | Mean ± Std | 130.0 ± 36.3 |
80
 
81
- ### 3.3 Activity State Distribution (per-worker metadata)
82
 
83
  | State | Mean Fraction | Std | Min | Max |
84
  |-------|---------------|-----|-----|-----|
@@ -87,7 +95,19 @@ From preliminary exploration of the dataset, I've observed **consistent differen
87
  | Not on person | 0.034 | 0.016 | 0.000 | 0.058 |
88
  | Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 |
89
 
90
- ### 3.4 Notable Workers (Hard Cases)
 
 
 
 
 
 
 
 
 
 
 
 
91
 
92
  | Worker | Working % | Break % | Off-body % | Clips | Notes |
93
  |--------|-----------|---------|------------|-------|-------|
@@ -96,9 +116,9 @@ From preliminary exploration of the dataset, I've observed **consistent differen
96
  | worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split |
97
  | worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body |
98
 
99
- These workers represent the distribution shift challenge: models trained on high-working-fraction workers will fail on these "break-heavy" workers.
100
 
101
- ### 3.5 Curation
102
 
103
  - First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
104
  - Audio stripped from all videos
@@ -108,37 +128,54 @@ These workers represent the distribution shift challenge: models trained on high
108
 
109
  ## 4. Experimental Design
110
 
111
- ### 4.1 Cross-Person Evaluation Protocol
 
 
 
 
 
 
 
 
 
 
112
 
113
- **Primary protocol: 5-Fold Leave-20-Workers-Out (L20WO)**
114
 
115
- Split 100 workers into 5 non-overlapping folds of 20 workers each. For each fold:
116
- - **Train:** 60 workers (~7,800 clips, ~58.5h)
117
- - **Validation:** 20 workers (~2,600 clips, ~19.5h)
118
- - **Test:** 20 workers (~2,600 clips, ~19.5h)
119
 
120
- Report mean ± std across 5 folds. This is more rigorous than single-split evaluation and accounts for fold variance.
 
 
 
 
 
 
 
 
 
 
 
121
 
122
  **Secondary protocols:**
123
- - **Leave-1-Worker-Out (L1WO):** Train on 99 workers, test on 1. Repeat for all 100. Measures per-worker difficulty.
124
- - **Scaling curve:** Train on {10, 20, 40, 60, 80} workers, always test on the same held-out 20. Answers: "how many workers do you need?"
 
125
 
126
- ### 4.2 Task Definition
127
 
128
- **3-class activity state classification** from the per-worker metadata:
129
- - **Working** (84.3% of clips)
130
  - **Taking break** (12.3%)
131
  - **Not on person** (3.4%)
132
 
133
- This is a natural, practically-relevant task with class imbalance — reflecting real deployment conditions.
134
-
135
  **Window protocol** (following COMODO/Ego4D standard):
136
  - 5-second non-overlapping windows
137
  - IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
138
  - Video: 10 FPS × 5s = 50 frames, resized to 224×224
139
- - Per 27s clip: ~5 windows → **~64,985 total windows**
140
 
141
- ### 4.3 Metrics
142
 
143
  | Metric | Why |
144
  |--------|-----|
@@ -148,7 +185,7 @@ This is a natural, practically-relevant task with class imbalance — reflecting
148
  | **Generalization gap** | Δ between within-person and cross-person F1 |
149
  | **Per-class F1** | Breakdown across working/break/not-worn |
150
 
151
- ### 4.4 Baseline Comparisons
152
 
153
  #### Experiment A: IMU-Only Models (5 baselines)
154
 
@@ -160,7 +197,7 @@ This is a natural, practically-relevant task with class imbalance — reflecting
160
  | MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 |
161
  | Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 |
162
 
163
- All evaluated in both **within-person** (random split) and **cross-person** (L20WO) settings. The gap between the two is the headline result.
164
 
165
  #### Experiment B: Video-Only Models (3 baselines)
166
 
@@ -187,11 +224,11 @@ All evaluated in both **within-person** (random split) and **cross-person** (L20
187
  | EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 |
188
  | IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE |
189
 
190
- Pre-train on **all 100 workers** (unlabeled, self-supervised), then fine-tune on 60 train workers with labels, evaluate on 20 test workers. Compare with same architecture trained from scratch.
191
 
192
  #### Experiment E: Hard Case Analysis
193
 
194
- For each of 100 workers, compute:
195
  - **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
196
  - **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
197
  - **Model performance:** per-worker F1 from L1WO
@@ -201,6 +238,20 @@ Then:
201
  2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty?
202
  3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
203
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
204
  ---
205
 
206
  ## 5. Technical Approach
@@ -217,7 +268,9 @@ Raw JSONL (200Hz, acc+gyro)
217
  → Per-channel z-normalization (computed on training set only)
218
  ```
219
 
220
- *Already built:* I have the 200Hz windowing and video sync pipeline from earlier work on this dataset.
 
 
221
 
222
  ### 5.2 Video Preprocessing Pipeline
223
 
@@ -226,10 +279,13 @@ MP4 (27s clips, ~100 MiB each)
226
  → Decode at 10 FPS → 270 frames per clip
227
  → 5-second windows → 50 frames per window
228
  → Resize to 224×224, normalize (ImageNet stats)
229
- For COMODO: pre-compute TimeSformer embeddings (store to disk)
230
  ```
231
 
232
- **Storage optimization:** Pre-compute and cache video embeddings rather than re-encoding each epoch. TimeSformer-B produces 768-dim vectors → 65K windows × 768 × 4 bytes = ~200 MB. Trivial to store.
 
 
 
233
 
234
  ### 5.3 COMODO Distillation Protocol
235
 
@@ -238,7 +294,7 @@ Following Chen et al. (2025) exactly:
238
  - **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
239
  - **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue
240
  - **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
241
- - **Hardware:** Single A10G GPU (24 GB VRAM)
242
 
243
  ### 5.4 EVI-MAE Protocol
244
 
@@ -247,56 +303,97 @@ Following Zhang et al. (2024):
247
  - **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio
248
  - **Fusion:** Single-layer Transformer
249
  - **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
250
- - **Pre-train:** 200 epochs on all 100 workers (unlabeled)
251
  - **Fine-tune:** 50 epochs on train split with labels
252
- - **Hardware:** A100-80GB for pre-training, A10G for fine-tuning
253
 
254
  ---
255
 
256
- ## 6. Timeline (10 Weeks)
 
 
 
 
 
 
 
 
 
 
 
257
 
258
  | Week | Phase | Deliverable | Compute |
259
  |------|-------|-------------|---------|
260
- | **1** | Data preparation | IMU windowing, video frame extraction, EDA notebook, worker statistics dashboard | CPU: 12h |
261
- | **2** | IMU baselines | 5 IMU-only models × 5 folds = 25 runs. Within-person vs. cross-person gap quantified | T4: 50h |
262
- | **3** | Video baselines | Pre-compute video embeddings for all clips. 3 video models × 5 folds | A10G: 50h |
263
- | **4** | Video baselines + COMODO | Complete video experiments. Start COMODO distillation (2 students × 5 folds) | A10G: 60h |
264
- | **5** | EVI-MAE pre-training | Self-supervised pre-training on all 100 workers | A100: 36h |
265
- | **6** | EVI-MAE fine-tuning | Fine-tune + evaluate EVI-MAE across 5 folds | A10G: 15h |
266
- | **7** | Hard case analysis | Per-worker difficulty analysis, IMU feature extraction, correlation study | A10G: 15h + CPU: 10h |
267
- | **8** | Ablations & scaling curves | Worker scaling curve (10→80 train workers). Ablations on window size, IMU rate | A10G: 30h |
268
- | **9** | Synthesis & visualization | Compile all results, generate figures, statistical significance tests | CPU: 10h |
269
- | **10** | Write-up | Paper draft (8-page main + appendix) | |
270
-
271
- **Total wall-clock:** 10 weeks
272
- **Active compute window:** Weeks 2-8 (7 weeks of experiments)
 
 
 
273
 
274
  ---
275
 
276
  ## 7. Compute Budget
277
 
 
 
278
  | Resource | Hours | Unit Cost | Total |
279
  |----------|-------|-----------|-------|
280
- | T4 GPU (16 GB) | 50h | $0.60/h | $30 |
281
- | A10G GPU (24 GB) | 155h | $2.00/h | $310 |
282
- | A100 GPU (80 GB) | 36h | $4.00/h | $144 |
283
- | CPU (preprocessing) | 32h | $0.60/h | $19 |
284
- | **Subtotal** | **273h** | | **$503** |
285
- | **30% contingency** | | | **+$151** |
286
- | **Total** | | | **~$654** |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
287
 
288
- ### Storage Requirements
289
 
290
- | Data | Size | Where |
291
- |------|------|-------|
292
- | IMU data (all workers) | 6.2 GiB | Download once, keep in project |
293
- | Video embeddings (pre-computed) | ~200 MiB | Generated in Week 3, reused everywhere |
294
- | Video frames (1 FPS samples for EDA) | ~7 GiB | Generated in Week 1 |
295
- | Full video (selective, 20 workers for EVI-MAE) | ~250 GiB | Download in Week 5 for MAE pre-training |
296
- | Model checkpoints & logs | ~20 GiB | Throughout |
297
- | **Total active storage** | **~284 GiB** | |
298
 
299
- **Note:** We do NOT need to download the full 1.22 TiB. For IMU experiments (Phases 2, 4), only 6.2 GiB of IMU data is needed. For video experiments (Phase 3), we pre-compute embeddings by streaming video in batches. Only Phase 5 (EVI-MAE) requires raw video, and we can work with a subset of workers.
300
 
301
  ---
302
 
@@ -304,24 +401,28 @@ Following Zhang et al. (2024):
304
 
305
  ### 8.1 Empirical Contributions
306
 
307
- 1. **First systematic cross-person generalization study on ego+IMU at 100-worker scale.** Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown). We are 5-8× larger.
308
 
309
  2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
310
 
311
- 3. **Worker difficulty prediction from IMU statistics.** If successful, this enables proactive data collection: target workers whose motion patterns are underrepresented.
 
 
312
 
313
- 4. **Cross-modal distillation under distribution shift.** COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?
314
 
315
  ### 8.2 Benchmark Contribution
316
 
317
- - Standardized splits (5-fold L20WO) and evaluation protocol for 100-worker ego+IMU
318
- - Baseline results for 11+ models across 3 modality configurations
319
  - Per-worker difficulty scores for hard case analysis
 
320
  - Open-source evaluation code and pre-computed features
321
 
322
  ### 8.3 Practical Contribution
323
 
324
- - Guidelines for "how many workers to collect" (scaling curve)
 
325
  - Method for flagging hard-to-generalize deployment scenarios
326
  - Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video
327
 
@@ -331,13 +432,13 @@ Following Zhang et al. (2024):
331
 
332
  | Paper | What They Did | What We Add |
333
  |-------|---------------|-------------|
334
- | **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift; 100 workers (vs. mixed populations) |
335
- | **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to 100 workers; evaluate cross-person transfer of MAE representations |
336
- | **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the missing large-scale empirical study their survey calls for |
337
  | **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain |
338
  | **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU |
339
  | **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization |
340
- | **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU) | Same domain but 100 workers vs. 8; formal cross-person protocol |
341
 
342
  ---
343
 
@@ -345,11 +446,12 @@ Following Zhang et al. (2024):
345
 
346
  | Risk | Likelihood | Mitigation |
347
  |------|-----------|------------|
348
- | Video files unreadable (`.gstmp` format) | Low (already confirmed as playable MP4 in prior exploration) | Fall back to IMU-only experiments (still novel at this scale) |
349
- | 3-class task too easy (>95% accuracy) | Medium ("working" is the 84% majority class) | Switch to finer-grained temporal segmentation or use macro-F1 to penalize majority-class bias |
350
- | EVI-MAE training unstable at scale | Low | Follow published hyperparameters exactly; ablate learning rate |
351
- | Insufficient cross-person variance | Low (already observed in prior exploration) | IMU statistics analysis (Phase 6) will quantify; 100 workers provides ample variance |
352
- | Compute overrun | Medium | Contingency budget (+30%); IMU-only experiments need minimal compute |
 
353
 
354
  ---
355
 
@@ -361,8 +463,9 @@ I've already worked with this dataset in its earlier form (the compression chall
361
  - Built the **IMU (200Hz) windowing and video synchronization pipeline**
362
  - Explored **frequency-domain and temporal features** from the IMU signals
363
  - Observed the cross-worker motion pattern differences that motivate this proposal
 
364
 
365
- This prior work means I can move directly to model training without pipeline development overhead.
366
 
367
  ---
368
 
 
1
+ # Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study at 500K Scale
2
 
3
  **Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
4
  **Date:** April 2026
 
18
 
19
  ### The Opportunity
20
 
21
+ We will use **500K clips from the Egocentric-1M dataset**, drawn from a pool of ~1,000 workers performing industrial/workplace tasks. By subject count this is roughly 50-80× larger than the ego+IMU benchmarks used in prior work (CMU-MMAC: 12 subjects; WEAR: 18), and it enables studies that are statistically underpowered on current datasets.
22
 
23
+ Additionally, we have early access to a **13K-clip, 100-worker pilot subset** (the Egocentric Native Compression dataset) for rapid prototyping and pipeline validation before scaling to 500K.
 
 
 
 
 
 
 
24
 
25
+ | Property | Pilot (Available Now) | Full Scale (500K) |
26
+ |----------|----------------------|-------------------|
27
+ | Workers | 100 | ~1,000+ (estimated) |
28
+ | Clips | 12,997 | **500,000** |
29
+ | Duration | ~97.5 hours | **~3,750 hours** |
30
+ | Total size | 1.22 TiB | **~47 TiB** |
31
+ | IMU data | 6.18 GiB | **~238 GiB** |
32
+ | 5s windows | ~65K | **~2.5M** |
33
+ | IMU rate | 200 Hz (acc + gyro) | 200 Hz (acc + gyro) |
34
+ | Activity labels | working / break / not-on-person | working / break / not-on-person |
35
+ | Domain | Industrial / workplace | Industrial / workplace |
36
+
37
+ From preliminary exploration of the pilot dataset, I've observed **consistent differences in motion patterns across workers performing similar tasks** — exactly the distribution shift that drives generalization failures. With ~1,000 workers and 500K clips, we can study this at a scale that no prior work has approached.
38
 
39
  ---
40
 
 
55
  5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?**
56
  EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
57
 
58
+ 6. **What are the scaling laws for cross-person generalization?** *(New at 500K scale)*
59
+ How does downstream cross-person accuracy change as we increase training data from 13K → 50K → 100K → 250K → 500K clips? Is there a data scaling law for cross-person robustness?
60
+
61
  ---
62
 
63
  ## 3. Dataset Overview
64
 
65
+ ### 3.1 Pilot Dataset (Available Now)
66
 
67
+ **Source:** `gs://build-ai-egocentric-native-compression/`
68
 
69
  | Metric | Value |
70
  |--------|-------|
 
75
  | Video per clip | ~100 MiB MP4, ~27 seconds |
76
  | IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz |
77
 
78
+ **Pilot worker statistics (from `index.csv`):**
79
 
80
  | Statistic | Clips per Worker |
81
  |-----------|-----------------|
 
86
  | Max | 203 |
87
  | Mean ± Std | 130.0 ± 36.3 |
88
 
89
+ **Activity state distribution (per-worker metadata):**
90
 
91
  | State | Mean Fraction | Std | Min | Max |
92
  |-------|---------------|-----|-----|-----|
 
95
  | Not on person | 0.034 | 0.016 | 0.000 | 0.058 |
96
  | Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 |
97
 
98
+ ### 3.2 Full Dataset (500K Target)
99
+
100
+ | Metric | Estimated Value |
101
+ |--------|-----------------|
102
+ | Total workers | ~1,000+ |
103
+ | Total clips | **500,000** |
104
+ | Total duration | **~3,750 hours (~156 days)** |
105
+ | Video data | **~46.9 TiB** |
106
+ | IMU data | **~238 GiB** |
107
+ | 5s windows | **~2,500,000** |
108
+ | IMU window tensor (float32) | **~56 GiB** |
109
+
110
+ ### 3.3 Notable Pilot Workers (Hard Cases)
111
 
112
  | Worker | Working % | Break % | Off-body % | Clips | Notes |
113
  |--------|-----------|---------|------------|-------|-------|
 
116
  | worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split |
117
  | worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body |
118
 
119
+ At 500K scale with ~1,000 workers, we expect 10× more behavioral diversity: more outliers, richer distribution tails, and harder generalization challenges.
120
 
121
+ ### 3.4 Curation
122
 
123
  - First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
124
  - Audio stripped from all videos
 
128
 
129
  ## 4. Experimental Design
130
 
131
+ ### 4.1 Two-Stage Approach: Pilot → Full Scale
132
+
133
+ **Stage 1 (Weeks 1-4): Pilot on 13K clips**
134
+ - Validate all pipelines, debug models, establish baseline numbers
135
+ - Run all experiments on 100-worker pilot for fast iteration (~hours per experiment)
136
+ - Identify which methods are worth scaling
137
+
138
+ **Stage 2 (Weeks 5-14): Scale to 500K clips**
139
+ - Run the winning methods at full scale
140
+ - Execute the data scaling experiment (unique contribution at this scale)
141
+ - Full cross-person benchmark at ~1,000-worker scale
142
 
143
+ ### 4.2 Cross-Person Evaluation Protocol
144
 
145
+ **Primary protocol: 5-Fold Leave-20%-Workers-Out (L20%WO)**
 
 
 
146
 
147
+ Split workers into 5 non-overlapping folds (20% each). For each fold:
148
+ - **Train:** 60% of workers
149
+ - **Validation:** 20% of workers
150
+ - **Test:** 20% of workers
151
+
152
+ | | Pilot (100 workers) | Full (est. ~1,000 workers) |
153
+ |---|---|---|
154
+ | Train | 60 workers (~7,800 clips) | ~600 workers (~300K clips) |
155
+ | Val | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
156
+ | Test | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
157
+
158
+ Report mean ± std across 5 folds.
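
The worker-level split above can be sketched as follows. This is a minimal illustration, not the proposal's actual split code: the function names, the seed, and the rule that the fold after the test fold serves as validation are all assumptions made for the example.

```python
import random

def worker_folds(worker_ids, n_folds=5, seed=0):
    """Shuffle workers once and deal them into n_folds disjoint folds."""
    ids = sorted(worker_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def split_for_fold(folds, k):
    """Fold k is test, fold k+1 is validation (assumed rule), the rest is train."""
    n = len(folds)
    val_idx = (k + 1) % n
    test = folds[k]
    val = folds[val_idx]
    train = [w for i, f in enumerate(folds) if i not in (k, val_idx) for w in f]
    return train, val, test

# Pilot-sized example with hypothetical worker IDs.
workers = [f"worker_{i:03d}" for i in range(100)]
folds = worker_folds(workers)
train, val, test = split_for_fold(folds, 0)
print(len(train), len(val), len(test))  # 60 20 20
```

Splitting by worker ID rather than by clip is what keeps every clip from a given person on one side of the train/val/test boundary.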
159
 
160
  **Secondary protocols:**
161
+ - **Leave-1-Worker-Out (L1WO):** On a representative subset of 100 workers (from the full 1,000). Train on 99 and test on 1, repeated for each of the 100. Measures per-worker difficulty.
162
+ - **Worker scaling curve:** Train on {100, 200, 400, 600, 800} workers, always test on the same held-out 200. Answers: "how many workers do you need?"
163
+ - **Data scaling curve:** Train best model at {13K, 50K, 100K, 250K, 500K} clips. Answers: "how much data do you need?"
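
One way to build the worker scaling curve is with nested training subsets, so that each larger set contains the smaller ones and only the number of workers changes between points. The sketch below assumes that nesting (the proposal does not specify it) and uses hypothetical worker IDs and a hypothetical helper name.

```python
import random

def nested_train_subsets(train_pool, sizes, seed=0):
    """Shuffle once, then take prefixes so smaller sets nest inside larger ones."""
    pool = sorted(train_pool)
    random.Random(seed).shuffle(pool)
    return {n: pool[:n] for n in sorted(sizes)}

workers = [f"worker_{i:04d}" for i in range(1000)]  # hypothetical IDs
held_out = workers[:200]    # fixed test workers, never used for training
train_pool = workers[200:]  # 800 candidates for training
subsets = nested_train_subsets(train_pool, [100, 200, 400, 600, 800])
print({n: len(ws) for n, ws in subsets.items()})
```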
164
 
165
+ ### 4.3 Task Definition
166
 
167
+ **3-class activity state classification:**
168
+ - **Working** (84.3% of clips in pilot)
169
  - **Taking break** (12.3%)
170
  - **Not on person** (3.4%)
171
 
 
 
172
  **Window protocol** (following COMODO/Ego4D standard):
173
  - 5-second non-overlapping windows
174
  - IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
175
  - Video: 10 FPS × 5s = 50 frames, resized to 224×224
176
+ - Per 27s clip: ~5 windows → **~2.5M total windows at 500K scale**
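
The IMU side of the window protocol can be sketched in plain Python. Dropping the trailing partial window is an assumption, chosen because it matches the "~5 windows per 27s clip" figure; the function name is illustrative.

```python
IMU_HZ = 200
WINDOW_S = 5
CHANNELS = 6  # acc_xyz + gyro_xyz

def window_imu(samples, hz=IMU_HZ, window_s=WINDOW_S):
    """Split per-sample rows into non-overlapping windows; drop the remainder."""
    size = hz * window_s
    n_windows = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(n_windows)]

# A dummy ~27 s clip: 5,400 samples of 6 channels each.
clip = [[0.0] * CHANNELS for _ in range(5400)]
windows = window_imu(clip)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 5 1000 6
print(len(windows) * 500_000)  # 2500000
```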
177
 
178
+ ### 4.4 Metrics
179
 
180
  | Metric | Why |
181
  |--------|-----|
 
185
  | **Generalization gap** | Δ between within-person and cross-person F1 |
186
  | **Per-class F1** | Breakdown across working/break/not-worn |
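
To illustrate why per-class and macro-averaged F1 matter under this class imbalance, here is a small dependency-free sketch. The 84/12/4 label counts are a rounded stand-in for the pilot class fractions, not measured predictions, and the function names are illustrative.

```python
def per_class_f1(y_true, y_pred, classes):
    """F1 per class from true-positive, false-positive and false-negative counts."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)

def generalization_gap(within_f1, cross_f1):
    """Within-person F1 minus cross-person F1."""
    return within_f1 - cross_f1

CLASSES = ["working", "break", "not_on_person"]
y_true = ["working"] * 84 + ["break"] * 12 + ["not_on_person"] * 4
y_pred = ["working"] * 100  # majority-class predictor
print(round(macro_f1(y_true, y_pred, CLASSES), 3))  # 0.304
```

A predictor that always outputs "working" reaches 84% accuracy on these counts but only about 0.30 macro F1, which is why accuracy alone would hide failures on the minority classes.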
187
 
188
+ ### 4.5 Baseline Comparisons
189
 
190
  #### Experiment A: IMU-Only Models (5 baselines)
191
 
 
197
  | MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 |
198
  | Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 |
199
 
200
+ All evaluated in both **within-person** (random split) and **cross-person** (L20%WO) settings. The gap between the two is the headline result.
201
 
202
  #### Experiment B: Video-Only Models (3 baselines)
203
 
 
224
  | EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 |
225
  | IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE |
226
 
227
+ Pre-train on **all workers** (unlabeled, self-supervised), then fine-tune on train workers with labels, evaluate on held-out test workers. Compare with same architecture trained from scratch.
228
 
229
  #### Experiment E: Hard Case Analysis
230
 
231
+ For a representative subset of workers, compute:
232
  - **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
233
  - **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
234
  - **Model performance:** per-worker F1 from L1WO
 
238
  2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty?
239
  3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
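
Two of the per-worker IMU statistics listed above could be computed as in the sketch below. The exact definitions are assumptions: RMS is taken over the acceleration magnitude, and jerk is a finite-difference derivative of acceleration at the 200 Hz sample rate. The proposal does not fix either definition.

```python
import math

def rms_acceleration(acc):
    """Root-mean-square of the acceleration magnitude over (x, y, z) samples."""
    return math.sqrt(sum(x * x + y * y + z * z for x, y, z in acc) / len(acc))

def mean_jerk_magnitude(acc, hz=200):
    """Mean magnitude of the finite-difference derivative of acceleration."""
    dt = 1.0 / hz
    mags = [math.dist(a, b) / dt for a, b in zip(acc, acc[1:])]
    return sum(mags) / len(mags)

# A stationary sensor: constant acceleration, hence zero jerk.
still = [(0.0, 0.0, 1.0)] * 10
print(rms_acceleration(still), mean_jerk_magnitude(still))  # 1.0 0.0
```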
240
 
241
+ #### Experiment F: Data Scaling Laws *(New at 500K)*
242
+
243
+ Train the best-performing model from Experiments A-D at increasing data scales:
244
+
245
+ | Scale | Clips | Est. Workers | Est. Hours |
246
+ |-------|-------|-------------|------------|
247
+ | 13K | 12,997 | 100 | 97.5h |
248
+ | 50K | 50,000 | ~385 | 375h |
249
+ | 100K | 100,000 | ~770 | 750h |
250
+ | 250K | 250,000 | ~1,000 | 1,875h |
251
+ | 500K | 500,000 | ~1,000 | 3,750h |
252
+
253
+ Measure: cross-person Macro F1 vs. training data size. Is there a scaling law? Does it plateau?
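
A simple way to test for a scaling law is to fit error = a · N^(−b) by least squares in log-log space, with error taken as 1 − Macro F1. This is a sketch under that assumed functional form. The error values below are synthetic, generated from a known law to check the fit, and are not experimental results.

```python
import math

def fit_power_law(n_clips, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in n_clips]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope

scales = [12_997, 50_000, 100_000, 250_000, 500_000]
# Synthetic placeholder errors from a known law (a=2.0, b=0.25).
synthetic_errors = [2.0 * n ** -0.25 for n in scales]
a, b = fit_power_law(scales, synthetic_errors)
print(round(a, 3), round(b, 3))  # 2.0 0.25
```

A pure power law cannot represent a plateau; if the measured curve flattens, a form with an additive floor (error = a · N^(−b) + c) would be the natural next candidate.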
254
+
255
  ---
256
 
257
  ## 5. Technical Approach
 
268
  → Per-channel z-normalization (computed on training set only)
269
  ```
270
 
271
+ **At 500K scale:** 2.5M windows × 6,000 floats = ~56 GiB processed tensor. Fits comfortably in RAM on a cpu-upgrade instance. Preprocessing takes ~80h CPU but is embarrassingly parallel.
272
+
273
+ *Already built:* I have the 200Hz windowing and video sync pipeline from earlier work on the pilot dataset.
274
 
275
  ### 5.2 Video Preprocessing Pipeline
276
 
 
279
  → Decode at 10 FPS → 270 frames per clip
280
  → 5-second windows → 50 frames per window
281
  → Resize to 224×224, normalize (ImageNet stats)
282
+ Pre-compute video encoder embeddings (store to disk)
283
  ```
284
 
285
+ **At 500K scale:** Pre-computing video embeddings is critical. We do NOT download all 47 TiB of video. Instead:
286
+ - Stream video clips on-demand during embedding extraction
287
+ - TimeSformer-B produces 768-dim vectors → 2.5M windows × 768 × 4 bytes = **~7.7 GB (~7.2 GiB)** per model. Easily stored.
288
+ - Embedding extraction: ~200h on A10G (parallelizable across 4 GPUs → ~50h wall-clock)
289
 
290
  ### 5.3 COMODO Distillation Protocol
291
 
 
294
  - **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
295
  - **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue
296
  - **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
297
+ - **Hardware:** A10G GPU (24 GB VRAM) — scales linearly with data, ~12h per student per fold at 500K
298
 
299
  ### 5.4 EVI-MAE Protocol
300
 
 
303
  - **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio
304
  - **Fusion:** Single-layer Transformer
305
  - **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
306
+ - **Pre-train:** 100 epochs on all workers (fewer epochs needed — each epoch sees 38× more data)
307
  - **Fine-tune:** 50 epochs on train split with labels
308
+ - **Hardware:** **A100x4 (4× 80GB)** for pre-training (~6 days), A10G for fine-tuning
309
 
310
  ---
311
 
312
+ ## 6. Timeline (16 Weeks)
313
+
314
+ ### Stage 1: Pilot Validation (Weeks 1-4)
315
+
316
+ | Week | Phase | Deliverable | Compute |
317
+ |------|-------|-------------|---------|
318
+ | **1** | Data preparation | IMU windowing, video embedding extraction, EDA notebook on pilot | CPU: 12h, A10G: 10h |
319
+ | **2** | IMU baselines (pilot) | 5 IMU models × 5 folds on 13K clips. Validate pipeline, establish baselines | A10G: 20h |
320
+ | **3** | Video + COMODO (pilot) | Video baselines + COMODO distillation on 13K clips | A10G: 30h |
321
+ | **4** | EVI-MAE (pilot) + triage | Self-supervised pre-training on pilot. Select methods to scale | A10G: 40h |
322
+
323
+ ### Stage 2: Full Scale (Weeks 5-14)
324
 
325
  | Week | Phase | Deliverable | Compute |
326
  |------|-------|-------------|---------|
327
+ | **5-6** | 500K data ingestion | Download 238 GiB IMU. Stream-extract video embeddings for 500K clips (3 models) | CPU: 88h, A10G: 600h |
328
+ | **7-8** | IMU baselines (500K) | 5 IMU models × 5 folds at full scale | A10G: 250h |
329
+ | **9-10** | Video + COMODO (500K) | Video baselines + COMODO distillation at full scale | A10G: 240h |
330
+ | **11-12** | EVI-MAE (500K) | Self-supervised pre-training on all workers (~6 days on A100x4) + fine-tuning | A100x4: 150h, A10G: 40h |
331
+ | **13** | Hard case analysis | Per-worker difficulty on 100-worker subset, IMU feature correlation | A10G: 100h, CPU: 40h |
332
+ | **14** | Scaling experiments | Data scaling curve (13K→500K), worker scaling curve (100→800) | A10G: 375h |
333
+
334
+ ### Stage 3: Synthesis (Weeks 15-16)
335
+
336
+ | Week | Phase | Deliverable | Compute |
337
+ |------|-------|-------------|---------|
338
+ | **15** | Analysis | Compile results, statistical significance tests, generate figures | CPU: 20h |
339
+ | **16** | Write-up | Paper draft (8-page main + appendix) | — |
340
+
341
+ **Total wall-clock:** 16 weeks
342
+ **Active compute window:** Weeks 1-14 (14 weeks of experiments)
343
 
344
  ---
345
 
346
  ## 7. Compute Budget
347
 
348
+ ### 7.1 Summary
349
+
350
  | Resource | Hours | Unit Cost | Total |
351
  |----------|-------|-----------|-------|
352
+ | CPU (preprocessing, analysis) | 160h | $0.60/h | $96 |
353
+ | A10G GPU (24 GB) | 1,705h | $2.00/h | $3,410 |
354
+ | A100x4 GPU (80 GB) | 150h | $16.00/h | $2,400 |
355
+ | **Subtotal** | **2,015h** | | **$5,906** |
356
+ | **30% contingency** | | | **+$1,772** |
357
+ | **Total** | | | **~$7,678** |
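
The summary figures can be re-derived from the hourly rates and per-resource hours. The sketch below only re-does that arithmetic; the dictionary names are illustrative, and the per-phase A10G hours are copied from the timeline.

```python
RATES = {"cpu": 0.60, "a10g": 2.00, "a100x4": 16.00}  # USD per hour
HOURS = {"cpu": 160, "a10g": 1705, "a100x4": 150}
CONTINGENCY = 0.30

# A10G hours per phase: pilot, embedding extraction, IMU baselines,
# video + COMODO, EVI-MAE fine-tuning, hard case analysis, scaling experiments.
A10G_PHASE_HOURS = [100, 600, 250, 240, 40, 100, 375]

subtotal = sum(HOURS[r] * RATES[r] for r in HOURS)
total = subtotal * (1 + CONTINGENCY)
print(sum(HOURS.values()), round(subtotal), round(total))  # 2015 5906 7678
print(sum(A10G_PHASE_HOURS))  # 1705
```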
358
+
359
+ ### 7.2 Phase Breakdown
360
+
361
+ | Phase | Resource | Hours | Cost | Notes |
362
+ |-------|----------|-------|------|-------|
363
+ | **1. Pilot validation** | CPU + A10G | 12h + 100h | $207 | Weeks 1-4, fast iteration |
364
+ | **2. Data ingestion + embedding extraction** | CPU + A10G | 88h + 600h | $1,253 | Parallelizable across 4 GPUs |
365
+ | **3. IMU baselines (500K)** | A10G | 250h | $500 | 5 models × 5 folds |
366
+ | **4. Video + COMODO (500K)** | A10G | 240h | $480 | 3 video models + 2 distillation students |
367
+ | **5. EVI-MAE pre-training** | A100x4 | 150h | $2,400 | ~6 days continuous |
368
+ | **5b. EVI-MAE fine-tuning** | A10G | 40h | $80 | 5 folds |
369
+ | **6. Hard case analysis** | A10G + CPU | 100h + 40h | $224 | L1WO on 100-worker subset |
370
+ | **7. Scaling experiments** | A10G | 375h | $750 | 5 scale points × 5 folds |
371
+ | **8. Synthesis** | CPU | 20h | $12 | Figures, stats |
372
+
373
+ ### 7.3 Storage Requirements
374
+
375
+ | Data | Size | Notes |
376
+ |------|------|-------|
377
+ | Raw IMU data (all 500K clips) | ~238 GiB | Download once, keep in project storage |
378
+ | Processed IMU windows (float32) | ~56 GiB | 2.5M windows × [1000, 6] |
379
+ | Video embeddings (3 models) | ~22 GiB | 2.5M windows × 768-dim × 3 encoders |
380
+ | Selective raw video (10% for EVI-MAE) | ~4.8 TiB | Stream during pre-training; cache hot subset |
381
+ | STFT spectrograms (for EVI-MAE) | ~200 GiB | 2.5M × [160, 128, 6] |
382
+ | Model checkpoints & logs | ~100 GiB | ~50 experiments × ~2 GiB each |
383
+ | **Total active storage** | **~5.4 TiB** | |
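
The three smaller storage rows follow directly from the per-clip and per-window sizes stated earlier. The sketch below re-derives them under those stated sizes (~500 KiB of IMU JSONL per clip, [1000, 6] float32 windows, one 768-dim float32 embedding per window); variable names are illustrative.

```python
GIB = 2 ** 30

def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB

CLIPS = 500_000
WINDOWS = CLIPS * 5                    # ~5 five-second windows per 27 s clip
raw_imu = CLIPS * 500 * 1024           # ~500 KiB JSONL per clip
imu_windows = WINDOWS * 1000 * 6 * 4   # [1000, 6] float32 per window
embeddings = WINDOWS * 768 * 4         # one 768-dim float32 vector per window

print(round(gib(raw_imu)))             # 238
print(round(gib(imu_windows)))         # 56
print(round(gib(embeddings), 1))       # 7.2 per encoder
print(round(gib(3 * embeddings), 1))   # 21.5 for three encoders
```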
384
 
385
+ **Key insight:** We do NOT need all 47 TiB of video. For 80% of experiments (IMU baselines, COMODO, hard case analysis), only the **238 GiB of IMU data + 22 GiB of pre-computed video embeddings** are needed. Only EVI-MAE pre-training requires raw video, and we stream it rather than pre-downloading.
386
 
387
+ ### 7.4 Comparison: Pilot vs. Full Scale
388
+
389
+ | | Pilot (13K) | Full (500K) | Scale Factor |
390
+ |---|---|---|---|
391
+ | GPU-hours | 273h | 2,015h | 7.4× |
392
+ | Cost | $654 | $7,678 | 11.7× |
393
+ | Timeline | 10 weeks | 16 weeks | 1.6× |
394
+ | Storage | 284 GiB | 5.4 TiB | 19× |
395
 
396
+ The cost scales sub-linearly with data (11.7× cost for ~38× more data) because: (a) embedding extraction is amortized across all experiments, (b) fewer epochs are needed at larger data scale, and (c) pilot experiments avoid wasting compute on methods that don't work.
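
As a rough check on the sub-linear claim, the two ratios can be turned into an implied scaling exponent. This is a sketch assuming cost follows a simple power law in clip count, which is a simplification; the clip and cost figures are the ones quoted in this proposal.

```python
import math

pilot_clips, full_clips = 12_997, 500_000
pilot_cost, full_cost = 654, 7_678  # USD, from the comparison table

data_ratio = full_clips / pilot_clips
cost_ratio = full_cost / pilot_cost

# If cost ~ clips**k, then k = log(cost_ratio) / log(data_ratio).
exponent = math.log(cost_ratio) / math.log(data_ratio)

print(f"data ratio: {data_ratio:.1f}x")
print(f"cost ratio: {cost_ratio:.1f}x")
print(f"implied exponent k: {exponent:.2f}")
```

An exponent below 1 is what "sub-linear" means here; a value of 1 would correspond to cost growing in proportion to data.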
397
 
398
  ---
399
 
 
401
 
402
  ### 8.1 Empirical Contributions
403
 
404
+ 1. **First cross-person generalization study on ego+IMU at 1,000-worker, 500K-clip scale.** Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown), Ego4D (mixed, not controlled). We are **50-80× larger** in subject count.
405
 
406
  2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
407
 
408
+ 3. **Data scaling laws for cross-person generalization.** First systematic study of how cross-person accuracy scales with training data (13K → 500K clips). This has direct practical value: how much data do you need to collect?
409
+
410
+ 4. **Worker difficulty prediction from IMU statistics.** If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented.
411
 
412
+ 5. **Cross-modal distillation under distribution shift.** COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?
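
For the data scaling laws in contribution 3, one common way to fit a curve is least squares in log-log space. The sketch below assumes a pure power law `error = a * n**(-b)` with no irreducible-error floor, and uses synthetic placeholder points generated from a known law; they are not measured results, and the five scale points are illustrative.

```python
import numpy as np


def fit_power_law(n_clips, error):
    """Fit error ~= a * n**(-b) by linear least squares on log-transformed data."""
    slope, intercept = np.polyfit(np.log(n_clips), np.log(error), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Hypothetical scale points (clip counts); the proposal only states there are five.
n_clips = np.array([13_000, 50_000, 125_000, 250_000, 500_000], dtype=float)

# Synthetic errors from a known law, used only to show the fit recovers it.
true_a, true_b = 2.0, 0.25
error = true_a * n_clips ** (-true_b)

a, b = fit_power_law(n_clips, error)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

With real measurements, the fit should be repeated per fold so that the exponent comes with a spread rather than a single point estimate.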
413
 
414
  ### 8.2 Benchmark Contribution
415
 
416
+ - Standardized splits (5-fold L20%WO) and evaluation protocol for 1,000-worker ego+IMU
417
+ - Baseline results for 11+ models across 3 modality configurations at 500K scale
418
  - Per-worker difficulty scores for hard case analysis
419
+ - Data scaling curves with fitted power laws
420
  - Open-source evaluation code and pre-computed features
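
The 5-fold L20%WO protocol amounts to partitioning workers, not clips, into five disjoint groups and holding out one group per fold. A minimal sketch, assuming each clip carries a worker ID; the function names and the `worker_0001`-style IDs are hypothetical:

```python
import random


def worker_folds(worker_ids, n_folds=5, seed=0):
    """Partition unique workers into n_folds disjoint held-out groups."""
    workers = sorted(set(worker_ids))
    random.Random(seed).shuffle(workers)  # fixed seed keeps the splits reproducible
    return [set(workers[i::n_folds]) for i in range(n_folds)]


def split_clips(clip_workers, held_out):
    """Return (train, test) clip indices; no worker appears on both sides."""
    train = [i for i, w in enumerate(clip_workers) if w not in held_out]
    test = [i for i, w in enumerate(clip_workers) if w in held_out]
    return train, test


# Toy example: 1,000 hypothetical workers with 3 clips each.
clip_workers = [f"worker_{w:04d}" for w in range(1000) for _ in range(3)]
folds = worker_folds(clip_workers)
train_idx, test_idx = split_clips(clip_workers, folds[0])
print(len(folds[0]), len(train_idx), len(test_idx))
```

Splitting on worker IDs rather than clip indices is what prevents the same person from leaking into both train and test.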
421
 
422
  ### 8.3 Practical Contribution
423
 
424
+ - Guidelines for "how many workers to collect" (worker scaling curve)
425
+ - Guidelines for "how many clips to collect" (data scaling curve)
426
  - Method for flagging hard-to-generalize deployment scenarios
427
  - Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video
428
 
 
432
 
433
  | Paper | What They Did | What We Add |
434
  |-------|---------------|-------------|
435
+ | **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift; ~1,000 workers, 500K clips |
436
+ | **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to ~1,000 workers; evaluate cross-person transfer of MAE representations |
437
+ | **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the large-scale empirical study their survey explicitly calls for |
438
  | **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain |
439
  | **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU |
440
  | **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization |
441
+ | **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU, 8 volunteers) | Same domain but ~1,000 workers vs. 8; formal cross-person protocol |
442
 
443
  ---
444
 
 
446
 
447
  | Risk | Likelihood | Mitigation |
448
  |------|-----------|------------|
449
+ | 500K dataset access delayed | Medium | Start with 13K pilot (Weeks 1-4). All pipelines and pilot results are independently valuable |
450
+ | Video streaming bottleneck at 47 TiB | Medium | Pre-compute embeddings (only 22 GiB). Only EVI-MAE needs raw video; use selective 10% subset |
451
+ | 3-class task too easy (>95% accuracy) | Medium — "working" is 84% majority class | Use macro-F1 (penalizes majority bias); explore finer-grained temporal segmentation at scale |
452
+ | EVI-MAE training unstable on A100x4 | Low | Validated on pilot first (Week 4). Follow published hyperparameters exactly; reduce lr if needed |
453
+ | Compute budget overrun | Medium | 30% contingency ($1,772). Pilot phase identifies which methods to scale — skip underperformers |
454
+ | Insufficient cross-person variance | Very Low — already confirmed in pilot | 1,000 workers should provide far more diversity than the 100 in the pilot |
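
To illustrate why macro-F1 is the safer metric for the 3-class task, the sketch below scores a trivial classifier that always predicts "working". The 84% majority share comes from the risk table; the 10%/6% split of the two minority classes is an assumption and does not affect the result.

```python
def f1_for_class(y_true, y_pred, label):
    """Per-class F1; returns 0.0 when the class is never correctly predicted."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = ["working", "break", "not_on_person"]
# 84% majority class as stated; the minority split below is assumed.
y_true = ["working"] * 84 + ["break"] * 10 + ["not_on_person"] * 6
y_pred = ["working"] * 100  # trivial majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)

print(f"accuracy: {accuracy:.2f}")
print(f"macro-F1: {macro_f1:.2f}")
```

The majority predictor looks strong on accuracy but scores roughly 0.30 on macro-F1, because the two classes it never predicts each contribute an F1 of zero.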
455
 
456
  ---
457
 
 
463
  - Built the **IMU (200Hz) windowing and video synchronization pipeline**
464
  - Explored **frequency-domain and temporal features** from the IMU signals
465
  - Observed the cross-worker motion pattern differences that motivate this proposal
466
+ - Profiled the full pilot dataset structure (100 workers, 12,997 clips, 1.22 TiB)
467
 
468
+ This prior work means I can move directly to model training on the pilot without pipeline development overhead, and scale to 500K with confidence in the data pipeline.
469
 
470
  ---
471