Scale proposal to 500K datapoints from Egocentric-1M dataset — updated budgets, timeline, storage, new scaling experiment
PROPOSAL.md changed (+194 -91)
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
# Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study
|
| 2 |
|
| 3 |
**Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
|
| 4 |
**Date:** April 2026
|
|
@@ -18,18 +18,23 @@ This is the setting that matters for any commercial wearable deployment, yet it
|
|
| 18 |
|
| 19 |
### The Opportunity
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|----------|-------|----------------|
|
| 25 |
-
| Workers | **100** (distinct individuals) | Largest cross-person ego+IMU study possible |
|
| 26 |
-
| Clips | **12,997** | Statistical power for fine-grained analysis |
|
| 27 |
-
| Duration | **~97.5 hours** | 3× larger than CMU-MMAC + WEAR combined |
|
| 28 |
-
| IMU rate | **200 Hz** (acc + gyro) | Matches Ego4D protocol; richer spectral content than 50-60Hz datasets |
|
| 29 |
-
| Activity labels | working / break / not-on-person | Free supervision per worker (from metadata) |
|
| 30 |
-
| Domain | Industrial / workplace | Underrepresented in ego research (most work uses cooking/sports) |
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
---
|
| 35 |
|
|
@@ -50,13 +55,16 @@ From preliminary exploration of the dataset, I've observed **consistent differen
|
|
| 50 |
5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?**
|
| 51 |
EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
|
| 52 |
|
| 53 |
---
|
| 54 |
|
| 55 |
## 3. Dataset Overview
|
| 56 |
|
| 57 |
-
|
| 58 |
|
| 59 |
-
|
| 60 |
|
| 61 |
| Metric | Value |
|
| 62 |
|--------|-------|
|
|
@@ -67,7 +75,7 @@ From preliminary exploration of the dataset, I've observed **consistent differen
|
|
| 67 |
| Video per clip | ~100 MiB MP4, ~27 seconds |
|
| 68 |
| IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz |
|
| 69 |
|
| 70 |
-
|
| 71 |
|
| 72 |
| Statistic | Clips per Worker |
|
| 73 |
|-----------|-----------------|
|
|
@@ -78,7 +86,7 @@ From preliminary exploration of the dataset, I've observed **consistent differen
|
|
| 78 |
| Max | 203 |
|
| 79 |
| Mean ± Std | 130.0 ± 36.3 |
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
| State | Mean Fraction | Std | Min | Max |
|
| 84 |
|-------|---------------|-----|-----|-----|
|
|
@@ -87,7 +95,19 @@ From preliminary exploration of the dataset, I've observed **consistent differen
|
|
| 87 |
| Not on person | 0.034 | 0.016 | 0.000 | 0.058 |
|
| 88 |
| Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 |
|
| 89 |
|
| 90 |
-
### 3.
|
| 91 |
|
| 92 |
| Worker | Working % | Break % | Off-body % | Clips | Notes |
|
| 93 |
|--------|-----------|---------|------------|-------|-------|
|
|
@@ -96,9 +116,9 @@ From preliminary exploration of the dataset, I've observed **consistent differen
|
|
| 96 |
| worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split |
|
| 97 |
| worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body |
|
| 98 |
|
| 99 |
-
|
| 100 |
|
| 101 |
-
### 3.
|
| 102 |
|
| 103 |
- First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
|
| 104 |
- Audio stripped from all videos
|
|
@@ -108,37 +128,54 @@ These workers represent the distribution shift challenge: models trained on high
|
|
| 108 |
|
| 109 |
## 4. Experimental Design
|
| 110 |
|
| 111 |
-
### 4.1
|
| 112 |
|
| 113 |
-
|
| 114 |
|
| 115 |
-
|
| 116 |
-
- **Train:** 60 workers (~7,800 clips, ~58.5h)
|
| 117 |
-
- **Validation:** 20 workers (~2,600 clips, ~19.5h)
|
| 118 |
-
- **Test:** 20 workers (~2,600 clips, ~19.5h)
|
| 119 |
|
| 120 |
-
|
| 121 |
|
| 122 |
**Secondary protocols:**
|
| 123 |
-
- **Leave-1-Worker-Out (L1WO):** Train on 99
|
| 124 |
-
- **
|
|
|
|
| 125 |
|
| 126 |
-
### 4.
|
| 127 |
|
| 128 |
-
**3-class activity state classification**
|
| 129 |
-
- **Working** (84.3% of clips)
|
| 130 |
- **Taking break** (12.3%)
|
| 131 |
- **Not on person** (3.4%)
|
| 132 |
|
| 133 |
-
This is a natural, practically-relevant task with class imbalance — reflecting real deployment conditions.
|
| 134 |
-
|
| 135 |
**Window protocol** (following COMODO/Ego4D standard):
|
| 136 |
- 5-second non-overlapping windows
|
| 137 |
- IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
|
| 138 |
- Video: 10 FPS × 5s = 50 frames, resized to 224×224
|
| 139 |
-
- Per 27s clip: ~5 windows → **~
|
| 140 |
|
| 141 |
-
### 4.
|
| 142 |
|
| 143 |
| Metric | Why |
|
| 144 |
|--------|-----|
|
|
@@ -148,7 +185,7 @@ This is a natural, practically-relevant task with class imbalance — reflecting
|
|
| 148 |
| **Generalization gap** | Δ between within-person and cross-person F1 |
|
| 149 |
| **Per-class F1** | Breakdown across working/break/not-worn |
|
| 150 |
|
| 151 |
-
### 4.
|
| 152 |
|
| 153 |
#### Experiment A: IMU-Only Models (5 baselines)
|
| 154 |
|
|
@@ -160,7 +197,7 @@ This is a natural, practically-relevant task with class imbalance — reflecting
|
|
| 160 |
| MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 |
|
| 161 |
| Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 |
|
| 162 |
|
| 163 |
-
All evaluated in both **within-person** (random split) and **cross-person** (
|
| 164 |
|
| 165 |
#### Experiment B: Video-Only Models (3 baselines)
|
| 166 |
|
|
@@ -187,11 +224,11 @@ All evaluated in both **within-person** (random split) and **cross-person** (L20
|
|
| 187 |
| EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 |
|
| 188 |
| IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE |
|
| 189 |
|
| 190 |
-
Pre-train on **all
|
| 191 |
|
| 192 |
#### Experiment E: Hard Case Analysis
|
| 193 |
|
| 194 |
-
For
|
| 195 |
- **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
|
| 196 |
- **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
|
| 197 |
- **Model performance:** per-worker F1 from L1WO
|
|
@@ -201,6 +238,20 @@ Then:
|
|
| 201 |
2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty?
|
| 202 |
3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
|
| 203 |
|
|
| 204 |
---
|
| 205 |
|
| 206 |
## 5. Technical Approach
|
|
@@ -217,7 +268,9 @@ Raw JSONL (200Hz, acc+gyro)
|
|
| 217 |
→ Per-channel z-normalization (computed on training set only)
|
| 218 |
```
|
| 219 |
|
| 220 |
-
*
|
|
|
|
|
|
|
| 221 |
|
| 222 |
### 5.2 Video Preprocessing Pipeline
|
| 223 |
|
|
@@ -226,10 +279,13 @@ MP4 (27s clips, ~100 MiB each)
|
|
| 226 |
→ Decode at 10 FPS → 270 frames per clip
|
| 227 |
→ 5-second windows → 50 frames per window
|
| 228 |
→ Resize to 224×224, normalize (ImageNet stats)
|
| 229 |
-
→
|
| 230 |
```
|
| 231 |
|
| 232 |
-
**
|
| 233 |
|
| 234 |
### 5.3 COMODO Distillation Protocol
|
| 235 |
|
|
@@ -238,7 +294,7 @@ Following Chen et al. (2025) exactly:
|
|
| 238 |
- **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
|
| 239 |
- **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue
|
| 240 |
- **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
|
| 241 |
-
- **Hardware:**
|
| 242 |
|
| 243 |
### 5.4 EVI-MAE Protocol
|
| 244 |
|
|
@@ -247,56 +303,97 @@ Following Zhang et al. (2024):
|
|
| 247 |
- **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio
|
| 248 |
- **Fusion:** Single-layer Transformer
|
| 249 |
- **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
|
| 250 |
-
- **Pre-train:**
|
| 251 |
- **Fine-tune:** 50 epochs on train split with labels
|
| 252 |
-
- **Hardware:**
|
| 253 |
|
| 254 |
---
|
| 255 |
|
| 256 |
-
## 6. Timeline (
|
| 257 |
|
| 258 |
| Week | Phase | Deliverable | Compute |
|
| 259 |
|------|-------|-------------|---------|
|
| 260 |
-
| **
|
| 261 |
-
| **
|
| 262 |
-
| **
|
| 263 |
-
| **
|
| 264 |
-
| **
|
| 265 |
-
| **
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
|
| 270 |
-
|
| 271 |
-
**
|
| 272 |
-
**
|
| 273 |
|
| 274 |
---
|
| 275 |
|
| 276 |
## 7. Compute Budget
|
| 277 |
|
|
|
|
|
|
|
| 278 |
| Resource | Hours | Unit Cost | Total |
|
| 279 |
|----------|-------|-----------|-------|
|
| 280 |
-
|
|
| 281 |
-
| A10G GPU (24 GB) |
|
| 282 |
-
|
|
| 283 |
-
|
|
| 284 |
-
| **
|
| 285 |
-
| **
|
| 286 |
-
|
|
| 287 |
|
| 288 |
-
|
| 289 |
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
|
|
| 293 |
-
|
|
| 294 |
-
|
|
| 295 |
-
|
|
| 296 |
-
|
|
| 297 |
-
|
|
| 298 |
|
| 299 |
-
|
| 300 |
|
| 301 |
---
|
| 302 |
|
|
@@ -304,24 +401,28 @@ Following Zhang et al. (2024):
|
|
| 304 |
|
| 305 |
### 8.1 Empirical Contributions
|
| 306 |
|
| 307 |
-
1. **First
|
| 308 |
|
| 309 |
2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
|
| 310 |
|
| 311 |
-
3. **
|
|
|
|
|
|
|
| 312 |
|
| 313 |
-
|
| 314 |
|
| 315 |
### 8.2 Benchmark Contribution
|
| 316 |
|
| 317 |
-
- Standardized splits (5-fold
|
| 318 |
-
- Baseline results for 11+ models across 3 modality configurations
|
| 319 |
- Per-worker difficulty scores for hard case analysis
|
|
|
|
| 320 |
- Open-source evaluation code and pre-computed features
|
| 321 |
|
| 322 |
### 8.3 Practical Contribution
|
| 323 |
|
| 324 |
-
- Guidelines for "how many workers to collect" (scaling curve)
|
|
|
|
| 325 |
- Method for flagging hard-to-generalize deployment scenarios
|
| 326 |
- Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video
|
| 327 |
|
|
@@ -331,13 +432,13 @@ Following Zhang et al. (2024):
|
|
| 331 |
|
| 332 |
| Paper | What They Did | What We Add |
|
| 333 |
|-------|---------------|-------------|
|
| 334 |
-
| **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift;
|
| 335 |
-
| **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to
|
| 336 |
-
| **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the
|
| 337 |
| **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain |
|
| 338 |
| **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU |
|
| 339 |
| **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization |
|
| 340 |
-
| **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU) | Same domain but
|
| 341 |
|
| 342 |
---
|
| 343 |
|
|
@@ -345,11 +446,12 @@ Following Zhang et al. (2024):
|
|
| 345 |
|
| 346 |
| Risk | Likelihood | Mitigation |
|
| 347 |
|------|-----------|------------|
|
| 348 |
-
|
|
| 349 |
-
|
|
| 350 |
-
|
|
| 351 |
-
|
|
| 352 |
-
| Compute overrun | Medium |
|
|
|
|
| 353 |
|
| 354 |
---
|
| 355 |
|
|
@@ -361,8 +463,9 @@ I've already worked with this dataset in its earlier form (the compression chall
|
|
| 361 |
- Built the **IMU (200Hz) windowing and video synchronization pipeline**
|
| 362 |
- Explored **frequency-domain and temporal features** from the IMU signals
|
| 363 |
- Observed the cross-worker motion pattern differences that motivate this proposal
|
|
|
|
| 364 |
|
| 365 |
-
This prior work means I can move directly to model training without pipeline development overhead.
|
| 366 |
|
| 367 |
---
|
| 368 |
|
|
|
|
| 1 |
+
# Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study at 500K Scale
|
| 2 |
|
| 3 |
**Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
|
| 4 |
**Date:** April 2026
|
|
|
|
| 18 |
|
| 19 |
### The Opportunity
|
| 20 |
|
| 21 |
+
We will use **500K clips from the Egocentric-1M dataset**, drawn from a pool of ~1,000 workers performing industrial/workplace tasks. By subject count, this is more than 50× larger than the ego+IMU datasets used in prior cross-person work (CMU-MMAC: 12 subjects; WEAR: 18 subjects), and it provides enough held-out workers per fold to measure cross-person variance with statistical power those datasets cannot offer.
|
| 22 |
|
| 23 |
+
Additionally, we have early access to a **13K-clip, 100-worker pilot subset** (the Egocentric Native Compression dataset) for rapid prototyping and pipeline validation before scaling to 500K.
|
| 24 |
|
| 25 |
+
| Property | Pilot (Available Now) | Full Scale (500K) |
|
| 26 |
+
|----------|----------------------|-------------------|
|
| 27 |
+
| Workers | 100 | ~1,000+ (estimated) |
|
| 28 |
+
| Clips | 12,997 | **500,000** |
|
| 29 |
+
| Duration | ~97.5 hours | **~3,750 hours** |
|
| 30 |
+
| Total size | 1.22 TiB | **~47 TiB** |
|
| 31 |
+
| IMU data | 6.18 GiB | **~238 GiB** |
|
| 32 |
+
| 5s windows | ~65K | **~2.5M** |
|
| 33 |
+
| IMU rate | 200 Hz (acc + gyro) | 200 Hz (acc + gyro) |
|
| 34 |
+
| Activity labels | working / break / not-on-person | working / break / not-on-person |
|
| 35 |
+
| Domain | Industrial / workplace | Industrial / workplace |
|
| 36 |
+
|
| 37 |
+
From preliminary exploration of the pilot dataset, I've observed **consistent differences in motion patterns across workers performing similar tasks** — exactly the distribution shift that drives generalization failures. With ~1,000 workers and 500K clips, we can study this at a scale that no prior work has approached.
|
| 38 |
|
| 39 |
---
|
| 40 |
|
|
|
|
| 55 |
5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?**
|
| 56 |
EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
|
| 57 |
|
| 58 |
+
6. **What are the scaling laws for cross-person generalization?** *(New at 500K scale)*
|
| 59 |
+
How does downstream cross-person accuracy change as we increase training data from 13K → 50K → 100K → 250K → 500K clips? Is there a data scaling law for cross-person robustness?
|
| 60 |
+
|
| 61 |
---
|
| 62 |
|
| 63 |
## 3. Dataset Overview
|
| 64 |
|
| 65 |
+
### 3.1 Pilot Dataset (Available Now)
|
| 66 |
|
| 67 |
+
**Source:** `gs://build-ai-egocentric-native-compression/`
|
| 68 |
|
| 69 |
| Metric | Value |
|
| 70 |
|--------|-------|
|
|
|
|
| 75 |
| Video per clip | ~100 MiB MP4, ~27 seconds |
|
| 76 |
| IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz |
|
| 77 |
|
| 78 |
+
**Pilot worker statistics (from `index.csv`):**
|
| 79 |
|
| 80 |
| Statistic | Clips per Worker |
|
| 81 |
|-----------|-----------------|
|
|
|
|
| 86 |
| Max | 203 |
|
| 87 |
| Mean ± Std | 130.0 ± 36.3 |
|
| 88 |
|
| 89 |
+
**Activity state distribution (per-worker metadata):**
|
| 90 |
|
| 91 |
| State | Mean Fraction | Std | Min | Max |
|
| 92 |
|-------|---------------|-----|-----|-----|
|
|
|
|
| 95 |
| Not on person | 0.034 | 0.016 | 0.000 | 0.058 |
|
| 96 |
| Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 |
|
| 97 |
|
| 98 |
+
### 3.2 Full Dataset (500K Target)
|
| 99 |
+
|
| 100 |
+
| Metric | Estimated Value |
|
| 101 |
+
|--------|-----------------|
|
| 102 |
+
| Total workers | ~1,000+ |
|
| 103 |
+
| Total clips | **500,000** |
|
| 104 |
+
| Total duration | **~3,750 hours (~156 days)** |
|
| 105 |
+
| Video data | **~46.9 TiB** |
|
| 106 |
+
| IMU data | **~238 GiB** |
|
| 107 |
+
| 5s windows | **~2,500,000** |
|
| 108 |
+
| IMU window tensor (float32) | **~56 GiB** |
|
| 109 |
+
|
| 110 |
+
### 3.3 Notable Pilot Workers (Hard Cases)
|
| 111 |
|
| 112 |
| Worker | Working % | Break % | Off-body % | Clips | Notes |
|
| 113 |
|--------|-----------|---------|------------|-------|-------|
|
|
|
|
| 116 |
| worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split |
|
| 117 |
| worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body |
|
| 118 |
|
| 119 |
+
With roughly 10× more workers at 500K scale (~1,000 vs. 100), we expect a wider range of behavior: more outliers, heavier distribution tails, and harder generalization cases.
|
| 120 |
|
| 121 |
+
### 3.4 Curation
|
| 122 |
|
| 123 |
- First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
|
| 124 |
- Audio stripped from all videos
|
|
|
|
| 128 |
|
| 129 |
## 4. Experimental Design
|
| 130 |
|
| 131 |
+
### 4.1 Two-Stage Approach: Pilot → Full Scale
|
| 132 |
+
|
| 133 |
+
**Stage 1 (Weeks 1-4): Pilot on 13K clips**
|
| 134 |
+
- Validate all pipelines, debug models, establish baseline numbers
|
| 135 |
+
- Run all experiments on 100-worker pilot for fast iteration (~hours per experiment)
|
| 136 |
+
- Identify which methods are worth scaling
|
| 137 |
+
|
| 138 |
+
**Stage 2 (Weeks 5-14): Scale to 500K clips**
|
| 139 |
+
- Run the winning methods at full scale
|
| 140 |
+
- Execute the data scaling experiment (unique contribution at this scale)
|
| 141 |
+
- Full cross-person benchmark at ~1,000-worker scale
|
| 142 |
|
| 143 |
+
### 4.2 Cross-Person Evaluation Protocol
|
| 144 |
|
| 145 |
+
**Primary protocol: 5-Fold Leave-20%-Workers-Out (L20%WO)**
|
| 146 |
|
| 147 |
+
Split workers into 5 non-overlapping folds (20% each). For each fold:
|
| 148 |
+
- **Train:** 60% of workers
|
| 149 |
+
- **Validation:** 20% of workers
|
| 150 |
+
- **Test:** 20% of workers
|
| 151 |
+
|
| 152 |
+
| | Pilot (100 workers) | Full (est. ~1,000 workers) |
|
| 153 |
+
|---|---|---|
|
| 154 |
+
| Train | 60 workers (~7,800 clips) | ~600 workers (~300K clips) |
|
| 155 |
+
| Val | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
|
| 156 |
+
| Test | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
|
| 157 |
+
|
| 158 |
+
Report mean ± std across 5 folds.
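A minimal sketch of how the worker-level folds could be generated, assuming each clip carries a worker ID. The function name `worker_folds`, the rotation rule (the fold after the test fold serves as validation), and the `worker_###` IDs are illustrative assumptions, not part of the protocol text above.

```python
import numpy as np

def worker_folds(worker_ids, n_folds=5, seed=0):
    """Yield (train, val, test) arrays of worker IDs, one tuple per fold.

    Splitting is done over unique workers, never over clips, so no worker
    appears in more than one partition of a given fold.
    """
    rng = np.random.default_rng(seed)
    workers = rng.permutation(np.unique(worker_ids))
    folds = np.array_split(workers, n_folds)
    for k in range(n_folds):
        val_k = (k + 1) % n_folds  # assumption: next fold is used for validation
        test = folds[k]
        val = folds[val_k]
        train = np.concatenate(
            [folds[j] for j in range(n_folds) if j not in (k, val_k)]
        )
        yield train, val, test

# Hypothetical pilot-sized example: 100 workers -> 60 / 20 / 20 per fold
pilot_workers = np.array([f"worker_{i:03d}" for i in range(100)])
splits = list(worker_folds(pilot_workers))
```

Clips are then assigned to a partition by looking up their worker ID, which keeps all windows from one person on the same side of the split.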
|
| 159 |
|
| 160 |
**Secondary protocols:**
|
| 161 |
+
- **Leave-1-Worker-Out (L1WO):** On a representative subset of 100 workers drawn from the full ~1,000, train on 99 and test on the held-out worker, repeated once for each of the 100 workers. Measures per-worker difficulty.
|
| 162 |
+
- **Worker scaling curve:** Train on {100, 200, 400, 600, 800} workers, always test on the same held-out 200. Answers: "how many workers do you need?"
|
| 163 |
+
- **Data scaling curve:** Train best model at {13K, 50K, 100K, 250K, 500K} clips. Answers: "how much data do you need?"
|
| 164 |
|
| 165 |
+
### 4.3 Task Definition
|
| 166 |
|
| 167 |
+
**3-class activity state classification:**
|
| 168 |
+
- **Working** (84.3% of clips in pilot)
|
| 169 |
- **Taking break** (12.3%)
|
| 170 |
- **Not on person** (3.4%)
|
| 171 |
|
|
|
|
|
|
|
| 172 |
**Window protocol** (following COMODO/Ego4D standard):
|
| 173 |
- 5-second non-overlapping windows
|
| 174 |
- IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
|
| 175 |
- Video: 10 FPS × 5s = 50 frames, resized to 224×224
|
| 176 |
+
- Per 27s clip: ~5 windows → **~2.5M total windows at 500K scale**
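The window arithmetic above can be sketched with NumPy. This assumes the IMU samples for one clip are already loaded as a `[T, 6]` array; the function name `window_imu` and the zero-filled placeholder clip are illustrative only.

```python
import numpy as np

IMU_HZ = 200
WINDOW_S = 5
WINDOW_LEN = IMU_HZ * WINDOW_S  # 1,000 timesteps per window

def window_imu(samples):
    """Split a [T, C] IMU array into non-overlapping [n, 1000, C] windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    n = samples.shape[0] // WINDOW_LEN
    return samples[: n * WINDOW_LEN].reshape(n, WINDOW_LEN, samples.shape[1])

# Placeholder for one ~27 s clip (~5,400 samples at 200 Hz, 6 channels)
clip = np.zeros((5400, 6), dtype=np.float32)
windows = window_imu(clip)  # 5 full windows; the last ~2 s are discarded
```

Dropping the partial tail is what yields ~5 windows per 27-second clip rather than 5.4.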
|
| 177 |
|
| 178 |
+
### 4.4 Metrics
|
| 179 |
|
| 180 |
| Metric | Why |
|
| 181 |
|--------|-----|
|
|
|
|
| 185 |
| **Generalization gap** | Δ between within-person and cross-person F1 |
|
| 186 |
| **Per-class F1** | Breakdown across working/break/not-worn |
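To make the per-class F1 and generalization-gap rows concrete, here is a small NumPy sketch. The class encoding (0 = working, 1 = break, 2 = not worn), the function names, and the toy labels are assumptions for illustration.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes=3):
    """Return an array with one F1 score per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return np.array(scores)

def macro_f1(y_true, y_pred, n_classes=3):
    """Unweighted mean of the per-class F1 scores."""
    return float(per_class_f1(y_true, y_pred, n_classes).mean())

def generalization_gap(within_person_f1, cross_person_f1):
    """Positive values mean performance drops on unseen workers."""
    return within_person_f1 - cross_person_f1

# Toy check: a model that always predicts the majority class
y_true = np.array([0] * 84 + [1] * 12 + [2] * 4)
y_pred = np.zeros_like(y_true)
majority_macro_f1 = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.84 while macro-F1 is only about 0.30, which is why an unweighted per-class average is the safer headline number under this class imbalance.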
|
| 187 |
|
| 188 |
+
### 4.5 Baseline Comparisons
|
| 189 |
|
| 190 |
#### Experiment A: IMU-Only Models (5 baselines)
|
| 191 |
|
|
|
|
| 197 |
| MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 |
|
| 198 |
| Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 |
|
| 199 |
|
| 200 |
+
All evaluated in both **within-person** (random split) and **cross-person** (L20%WO) settings. The gap between the two is the headline result.
|
| 201 |
|
| 202 |
#### Experiment B: Video-Only Models (3 baselines)
|
| 203 |
|
|
|
|
| 224 |
| EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 |
|
| 225 |
| IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE |
|
| 226 |
|
| 227 |
+
Pre-train on **all workers** (unlabeled, self-supervised), then fine-tune on train workers with labels, evaluate on held-out test workers. Compare with same architecture trained from scratch.
|
| 228 |
|
| 229 |
#### Experiment E: Hard Case Analysis
|
| 230 |
|
| 231 |
+
For a representative subset of workers, compute:
|
| 232 |
- **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
|
| 233 |
- **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
|
| 234 |
- **Model performance:** per-worker F1 from L1WO
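A sketch of how a few of the IMU statistics listed above might be computed for one 5-second window. The channel order (accelerometer in columns 0-2, gyroscope in 3-5), the use of acceleration magnitude for the spectral features, and the function name are assumptions; entropy and the temporal statistics are omitted for brevity.

```python
import numpy as np

def imu_window_stats(window, hz=200):
    """Compute summary statistics for one [T, 6] IMU window."""
    acc, gyro = window[:, :3], window[:, 3:]
    acc_mag = np.linalg.norm(acc, axis=1)

    rms_acc = float(np.sqrt(np.mean(acc_mag ** 2)))
    gyro_var = float(np.var(gyro, axis=0).mean())

    # Jerk: finite-difference derivative of acceleration, in units per second
    jerk = np.diff(acc, axis=0) * hz
    jerk_mag = float(np.linalg.norm(jerk, axis=1).mean())

    # Power spectrum of the mean-removed acceleration magnitude
    power = np.abs(np.fft.rfft(acc_mag - acc_mag.mean())) ** 2
    freqs = np.fft.rfftfreq(acc_mag.size, d=1.0 / hz)
    total = power.sum()
    centroid = float((freqs * power).sum() / total) if total > 0 else 0.0
    dominant = float(freqs[np.argmax(power)])

    return {
        "rms_acc": rms_acc,
        "gyro_var": gyro_var,
        "jerk_mag": jerk_mag,
        "spectral_centroid_hz": centroid,
        "dominant_freq_hz": dominant,
    }

# Synthetic window: constant offset plus a 2 Hz oscillation on one accelerometer axis
t = np.arange(1000) / 200.0
demo = np.zeros((1000, 6))
demo[:, 0] = 1.0 + 0.5 * np.sin(2 * np.pi * 2.0 * t)
stats = imu_window_stats(demo)
```

Per-worker features would then be aggregates (for example, means) of these window-level values across all of a worker's windows.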
|
|
|
|
| 238 |
2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty?
|
| 239 |
3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
|
| 240 |
|
| 241 |
+
#### Experiment F: Data Scaling Laws *(New at 500K)*
|
| 242 |
+
|
| 243 |
+
Train the best-performing model from Experiments A-D at increasing data scales:
|
| 244 |
+
|
| 245 |
+
| Scale | Clips | Est. Workers | Est. Hours |
|
| 246 |
+
|-------|-------|-------------|------------|
|
| 247 |
+
| 13K | 12,997 | 100 | 97.5h |
|
| 248 |
+
| 50K | 50,000 | ~385 | 375h |
|
| 249 |
+
| 100K | 100,000 | ~770 | 750h |
|
| 250 |
+
| 250K | 250,000 | ~1,000 | 1,875h |
|
| 251 |
+
| 500K | 500,000 | ~1,000 | 3,750h |
|
| 252 |
+
|
| 253 |
+
Measure: cross-person Macro F1 vs. training data size. Is there a scaling law? Does it plateau?
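One simple way to test for a scaling law is a log-log linear fit. The sketch below assumes the error (1 − macro-F1) follows `a * N**(-b)`; that functional form is an assumption to be checked against the data, and the error values here are synthetic, generated from a known power law purely to verify the fitting code.

```python
import numpy as np

def fit_power_law(n_clips, error):
    """Least-squares fit of log(error) = log(a) - b * log(n). Returns (a, b)."""
    slope, intercept = np.polyfit(np.log(n_clips), np.log(error), deg=1)
    return float(np.exp(intercept)), float(-slope)

# The five scale points from the table above
n_clips = np.array([12_997, 50_000, 100_000, 250_000, 500_000], dtype=float)

# SYNTHETIC errors from a known power law (a=2.0, b=0.25); not real results
synthetic_error = 2.0 * n_clips ** -0.25

a, b = fit_power_law(n_clips, synthetic_error)
```

A pure power law cannot represent a plateau, so if the measured curve flattens, a form with an irreducible-error offset (`c + a * N**(-b)`) would be the natural next model to fit.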
|
| 254 |
+
|
| 255 |
---
|
| 256 |
|
| 257 |
## 5. Technical Approach
|
|
|
|
| 268 |
→ Per-channel z-normalization (computed on training set only)
|
| 269 |
```
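The final z-normalization step in the pipeline above can be sketched as follows. The key point is that the statistics come from training workers only and are then reused unchanged for validation and test workers. Function names and the random demo arrays are illustrative assumptions.

```python
import numpy as np

def fit_channel_stats(train_windows):
    """Per-channel mean and std over all training windows and timesteps.

    train_windows has shape [N, T, C]; the result has shape [C].
    """
    mean = train_windows.mean(axis=(0, 1))
    std = train_windows.std(axis=(0, 1))
    # Guard against constant channels to avoid division by zero
    return mean, np.where(std > 0, std, 1.0)

def z_normalize(windows, mean, std):
    """Apply training-set statistics to any split (train, val, or test)."""
    return (windows - mean) / std

# Demo with random data standing in for real IMU windows
rng = np.random.default_rng(0)
train = rng.normal(loc=[0, 1, 2, 3, 4, 5], scale=2.0, size=(32, 1000, 6))
test = rng.normal(loc=[0, 1, 2, 3, 4, 5], scale=2.0, size=(8, 1000, 6))

mean, std = fit_channel_stats(train)
train_norm = z_normalize(train, mean, std)
test_norm = z_normalize(test, mean, std)  # no statistics computed on test data
```

At full scale the mean and variance would more likely be accumulated in chunks rather than over one in-memory array, but the train-only rule is the same.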
|
| 270 |
|
| 271 |
+
**At 500K scale:** 2.5M windows × 6,000 float32 values × 4 bytes ≈ 56 GiB of processed tensors, which fits in RAM on a cpu-upgrade instance. Preprocessing takes ~80 CPU-hours but is embarrassingly parallel.
|
| 272 |
+
|
| 273 |
+
*Already built:* I have the 200Hz windowing and video sync pipeline from earlier work on the pilot dataset.
|
| 274 |
|
| 275 |
### 5.2 Video Preprocessing Pipeline
|
| 276 |
|
|
|
|
| 279 |
→ Decode at 10 FPS → 270 frames per clip
|
| 280 |
→ 5-second windows → 50 frames per window
|
| 281 |
→ Resize to 224×224, normalize (ImageNet stats)
|
| 282 |
+
→ Pre-compute video encoder embeddings (store to disk)
|
| 283 |
```
|
| 284 |
|
| 285 |
+
**At 500K scale:** Pre-computing video embeddings is critical. We do NOT download all 47 TiB of video. Instead:
|
| 286 |
+
- Stream video clips on-demand during embedding extraction
|
| 287 |
+
- TimeSformer-B produces 768-dim vectors → 2.5M windows × 768 × 4 bytes ≈ **7.2 GiB** per model, which is easy to store.
|
| 288 |
+
- Embedding extraction: ~200 GPU-hours per encoder on A10G (~600h for all three encoders; parallelizable across 4 GPUs → ~50h wall-clock per encoder)
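The storage estimates for the processed IMU tensor and the video embeddings follow from simple arithmetic, sketched here. It assumes exactly 5 windows per clip, float32 storage, and 768-dim output for all three encoders; the variable names are illustrative.

```python
GIB = 2 ** 30
FLOAT32_BYTES = 4

n_windows = 500_000 * 5  # ~5 full windows per clip

# Processed IMU tensor: [n_windows, 1000 timesteps, 6 channels]
imu_tensor_gib = n_windows * 1_000 * 6 * FLOAT32_BYTES / GIB

# One 768-dim embedding per window, per video encoder
embedding_gib_per_encoder = n_windows * 768 * FLOAT32_BYTES / GIB
embedding_gib_three_encoders = 3 * embedding_gib_per_encoder

print(f"IMU windows:      {imu_tensor_gib:.1f} GiB")
print(f"Embeddings (x1):  {embedding_gib_per_encoder:.1f} GiB")
print(f"Embeddings (x3):  {embedding_gib_three_encoders:.1f} GiB")
```

Both results are tiny next to the ~47 TiB of raw video, which is the reason for extracting embeddings once and training on them.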
|
| 289 |
|
| 290 |
### 5.3 COMODO Distillation Protocol
|
| 291 |
|
|
|
|
| 294 |
- **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
|
| 295 |
- **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue
|
| 296 |
- **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
|
| 297 |
+
- **Hardware:** A10G GPU (24 GB VRAM) — scales linearly with data, ~12h per student per fold at 500K
|
| 298 |
|
| 299 |
### 5.4 EVI-MAE Protocol
|
| 300 |
|
|
|
|
| 303 |
- **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio
|
| 304 |
- **Fusion:** Single-layer Transformer
|
| 305 |
- **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
|
| 306 |
+
- **Pre-train:** 100 epochs on all workers (fewer epochs needed — each epoch sees 38× more data)
|
| 307 |
- **Fine-tune:** 50 epochs on train split with labels
|
| 308 |
+
- **Hardware:** **A100x4 (4× 80GB)** for pre-training (~6 days), A10G for fine-tuning
|
| 309 |
|
| 310 |
---
|
| 311 |
|
| 312 |
+
## 6. Timeline (16 Weeks)
|
| 313 |
+
|
| 314 |
+
### Stage 1: Pilot Validation (Weeks 1-4)
|
| 315 |
+
|
| 316 |
+
| Week | Phase | Deliverable | Compute |
|
| 317 |
+
|------|-------|-------------|---------|
|
| 318 |
+
| **1** | Data preparation | IMU windowing, video embedding extraction, EDA notebook on pilot | CPU: 12h, A10G: 10h |
|
| 319 |
+
| **2** | IMU baselines (pilot) | 5 IMU models × 5 folds on 13K clips. Validate pipeline, establish baselines | A10G: 20h |
|
| 320 |
+
| **3** | Video + COMODO (pilot) | Video baselines + COMODO distillation on 13K clips | A10G: 30h |
|
| 321 |
+
| **4** | EVI-MAE (pilot) + triage | Self-supervised pre-training on pilot. Select methods to scale | A10G: 40h |
|
| 322 |
+
|
| 323 |
+
### Stage 2: Full Scale (Weeks 5-14)
|
| 324 |
|
| 325 |
| Week | Phase | Deliverable | Compute |
|
| 326 |
|------|-------|-------------|---------|
|
| 327 |
+
| **5-6** | 500K data ingestion | Download 238 GiB IMU. Stream-extract video embeddings for 500K clips (3 models) | CPU: 88h, A10G: 600h |
|
| 328 |
+
| **7-8** | IMU baselines (500K) | 5 IMU models × 5 folds at full scale | A10G: 250h |
|
| 329 |
+
| **9-10** | Video + COMODO (500K) | Video baselines + COMODO distillation at full scale | A10G: 240h |
|
| 330 |
+
| **11-12** | EVI-MAE (500K) | Self-supervised pre-training on all workers (~6 days on A100x4) + fine-tuning | A100x4: 150h, A10G: 40h |
|
| 331 |
+
| **13** | Hard case analysis | Per-worker difficulty on 100-worker subset, IMU feature correlation | A10G: 100h, CPU: 40h |
|
| 332 |
+
| **14** | Scaling experiments | Data scaling curve (13K→500K), worker scaling curve (100→800) | A10G: 375h |
|
| 333 |
+
|
| 334 |
+
### Stage 3: Synthesis (Weeks 15-16)
|
| 335 |
+
|
| 336 |
+
| Week | Phase | Deliverable | Compute |
|
| 337 |
+
|------|-------|-------------|---------|
|
| 338 |
+
| **15** | Analysis | Compile results, statistical significance tests, generate figures | CPU: 20h |
|
| 339 |
+
| **16** | Write-up | Paper draft (8-page main + appendix) | — |
|
| 340 |
+
|
| 341 |
+
**Total wall-clock:** 16 weeks
|
| 342 |
+
**Active compute window:** Weeks 1-14 (14 weeks of experiments)
|
| 343 |
|
| 344 |
---
|
| 345 |
|
| 346 |
## 7. Compute Budget
|
| 347 |
|
| 348 |
+
### 7.1 Summary
|
| 349 |
+
|
| 350 |
| Resource | Hours | Unit Cost | Total |
|
| 351 |
|----------|-------|-----------|-------|
|
| 352 |
+
| CPU (preprocessing, analysis) | 160h | $0.60/h | $96 |
|
| 353 |
+
| A10G GPU (24 GB) | 1,705h | $2.00/h | $3,410 |
|
| 354 |
+
| A100x4 GPU (4× 80 GB) | 150h | $16.00/h | $2,400 |
|
| 355 |
+
| **Subtotal** | **2,015h** | | **$5,906** |
|
| 356 |
+
| **30% contingency** | | | **+$1,772** |
|
| 357 |
+
| **Total** | | | **~$7,678** |
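The summary totals can be reproduced from the hours and unit costs in the table. This is a plain arithmetic check with illustrative variable names; the rates and hours are copied from the rows above.

```python
# USD per hour and planned hours, as listed in the summary table
RATES = {"cpu": 0.60, "a10g": 2.00, "a100x4": 16.00}
HOURS = {"cpu": 160, "a10g": 1705, "a100x4": 150}

total_hours = sum(HOURS.values())
subtotal = sum(HOURS[name] * RATES[name] for name in HOURS)
contingency = 0.30 * subtotal
total = subtotal + contingency

print(f"Hours:       {total_hours}")
print(f"Subtotal:    ${subtotal:,.0f}")
print(f"Contingency: ${contingency:,.0f}")
print(f"Total:       ${total:,.0f}")
```

Keeping the budget in a small script like this makes it easy to re-derive the totals if a phase is dropped after the pilot triage.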
|
| 358 |
+
|
| 359 |
+
### 7.2 Phase Breakdown
|
| 360 |
+
|
| 361 |
+
| Phase | Resource | Hours | Cost | Notes |
|
| 362 |
+
|-------|----------|-------|------|-------|
|
| 363 |
+
| **1. Pilot validation** | A10G + CPU | 100h + 12h | $207 | Weeks 1-4, fast iteration |
|
| 364 |
+
| **2. Data ingestion + embedding extraction** | CPU + A10G | 88h + 600h | $1,253 | Parallelizable across 4 GPUs |
|
| 365 |
+
| **3. IMU baselines (500K)** | A10G | 250h | $500 | 5 models × 5 folds |
|
| 366 |
+
| **4. Video + COMODO (500K)** | A10G | 240h | $480 | 3 video models + 2 distillation students |
|
| 367 |
+
| **5. EVI-MAE pre-training** | A100x4 | 150h | $2,400 | ~6 days continuous |
|
| 368 |
+
| **5b. EVI-MAE fine-tuning** | A10G | 40h | $80 | 5 folds |
|
| 369 |
+
| **6. Hard case analysis** | A10G + CPU | 100h + 40h | $224 | L1WO on 100-worker subset |
|
| 370 |
+
| **7. Scaling experiments** | A10G | 375h | $750 | 5 scale points × 5 folds |
|
| 371 |
+
| **8. Synthesis** | CPU | 20h | $12 | Figures, stats |
|
| 372 |
+
|
| 373 |
+
### 7.3 Storage Requirements
|
| 374 |
+
|
| 375 |
+
| Data | Size | Notes |
|
| 376 |
+
|------|------|-------|
|
| 377 |
+
| Raw IMU data (all 500K clips) | ~238 GiB | Download once, keep in project storage |
|
| 378 |
+
| Processed IMU windows (float32) | ~56 GiB | 2.5M windows × [1000, 6] |
|
| 379 |
+
| Video embeddings (3 models) | ~22 GiB | 2.5M windows × 768-dim × 3 encoders |
|
| 380 |
+
| Selective raw video (10% for EVI-MAE) | ~4.8 TiB | Stream during pre-training; cache hot subset |
|
| 381 |
+
| STFT spectrograms (for EVI-MAE) | ~200 GiB | 2.5M × [160, 128, 6] |
|
| 382 |
+
| Model checkpoints & logs | ~100 GiB | ~50 experiments × ~2 GiB each |
|
| 383 |
+
| **Total active storage** | **~5.4 TiB** | |
|
| 384 |
|
| 385 |
+
**Key insight:** We do NOT need all 47 TiB of video. For 80% of experiments (IMU baselines, COMODO, hard case analysis), only the **238 GiB of IMU data + 22 GiB of pre-computed video embeddings** are needed. Only EVI-MAE pre-training requires raw video, and we stream it rather than pre-downloading.
|
| 386 |
|
| 387 |
+
### 7.4 Comparison: Pilot vs. Full Scale
|
| 388 |
+
|
| 389 |
+
| | Pilot (13K) | Full (500K) | Scale Factor |
|
| 390 |
+
|---|---|---|---|
|
| 391 |
+
| Compute hours | 273h | 2,015h | 7.4× |
|
| 392 |
+
| Cost | $654 | $7,678 | 11.7× |
|
| 393 |
+
| Timeline | 10 weeks | 16 weeks | 1.6× |
|
| 394 |
+
| Storage | 284 GiB | 5.4 TiB | 19× |
|
| 395 |
|
| 396 |
+
The cost scales sub-linearly with data (11.7× cost for 38.5× more data) because: (a) embedding extraction amortizes across all experiments, (b) fewer epochs needed at larger data scale, (c) pilot experiments avoid wasting compute on methods that don't work.
|
| 397 |
|
| 398 |
---
|
| 399 |
|
|
|
|
| 401 |
|
| 402 |
### 8.1 Empirical Contributions
|
| 403 |
|
| 404 |
+
1. **First cross-person generalization study on ego+IMU at 1,000-worker, 500K-clip scale.** Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown), Ego4D (mixed, not controlled). At ~1,000 workers, this study has roughly **55-83× more subjects** than WEAR and CMU-MMAC, respectively.
|
| 405 |
|
| 406 |
2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
|
| 407 |
|
| 408 |
+
3. **Data scaling laws for cross-person generalization.** First systematic study of how cross-person accuracy scales with training data (13K → 500K). This has direct practical value: how much data do you need to collect?
|
| 409 |
+
|
| 410 |
+
4. **Worker difficulty prediction from IMU statistics.** If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented.
|
| 411 |
|
| 412 |
+
5. **Cross-modal distillation under distribution shift.** COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?
|
| 413 |
|
| 414 |
### 8.2 Benchmark Contribution
|
| 415 |
|
| 416 |
+
- Standardized splits (5-fold L20%WO) and evaluation protocol for 1,000-worker ego+IMU
|
| 417 |
+
- Baseline results for 11+ models across 3 modality configurations at 500K scale
|
| 418 |
- Per-worker difficulty scores for hard case analysis
|
| 419 |
+
- Data scaling curves with fitted power laws
|
| 420 |
- Open-source evaluation code and pre-computed features
|
| 421 |
|
| 422 |
### 8.3 Practical Contribution
|
| 423 |
|
| 424 |
+
- Guidelines for "how many workers to collect" (worker scaling curve)
|
| 425 |
+
- Guidelines for "how many clips to collect" (data scaling curve)
|
| 426 |
- Method for flagging hard-to-generalize deployment scenarios
|
| 427 |
- Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video
|
| 428 |
|
|
|
|
| 432 |
|
| 433 |
| Paper | What They Did | What We Add |
|
| 434 |
|-------|---------------|-------------|
|
| 435 |
+
| **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift; ~1,000 workers, 500K clips |
|
| 436 |
+
| **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to ~1,000 workers; evaluate cross-person transfer of MAE representations |
|
| 437 |
+
| **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the large-scale empirical study their survey explicitly calls for |
|
| 438 |
| **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain |
|
| 439 |
| **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU |
|
| 440 |
| **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization |
|
| 441 |
+
| **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU, 8 volunteers) | Same domain but ~1,000 workers vs. 8; formal cross-person protocol |
|
| 442 |
|
| 443 |
---
|
| 444 |
|
|
|
|
| 446 |
|
| 447 |
| Risk | Likelihood | Mitigation |
|
| 448 |
|------|-----------|------------|
|
| 449 |
+
| 500K dataset access delayed | Medium | Start with 13K pilot (Weeks 1-4). All pipelines and pilot results are independently valuable |
|
| 450 |
+
| Video streaming bottleneck at 47 TiB | Medium | Pre-compute embeddings (only 22 GiB). Only EVI-MAE needs raw video; use selective 10% subset |
|
| 451 |
+
| 3-class task too easy (>95% accuracy) | Medium — "working" is 84% majority class | Use macro-F1 (penalizes majority bias); explore finer-grained temporal segmentation at scale |
|
| 452 |
+
| EVI-MAE training unstable on A100x4 | Low | Validated on pilot first (Week 4). Follow published hyperparameters exactly; reduce lr if needed |
|
| 453 |
+
| Compute budget overrun | Medium | 30% contingency ($1,772). Pilot phase identifies which methods to scale — skip underperformers |
|
| 454 |
+
| Insufficient cross-person variance | Very Low (already confirmed in pilot) | ~1,000 workers should provide far more diversity than the pilot's 100 |
|
| 455 |
|
| 456 |
---
|
| 457 |
|
|
|
|
| 463 |
- Built the **IMU (200Hz) windowing and video synchronization pipeline**
|
| 464 |
- Explored **frequency-domain and temporal features** from the IMU signals
|
| 465 |
- Observed the cross-worker motion pattern differences that motivate this proposal
|
| 466 |
+
- Profiled the pilot dataset structure (100 workers, 12,997 clips, 1.22 TiB)
|
| 467 |
|
| 468 |
+
This prior work means I can move directly to model training on the pilot without pipeline development overhead, and scale to 500K with confidence in the data pipeline.
|
| 469 |
|
| 470 |
---
|
| 471 |
|