# Cross-Person Generalization in Egocentric Video + IMU Activity Recognition: A Benchmark Study at 500K Scale
**Author:** Shubham Rasal — [shubhamrasal.com](https://shubhamrasal.com)
**Date:** April 2026
**Status:** Proposal Draft
---
## 1. Motivation & Problem Statement
Egocentric video and body-worn IMU sensors are the sensing backbone of AR glasses, smartwatches, and workplace wearables. Activity recognition from these modalities has made rapid progress — EVI-MAE reaches 87.96 mAP on CMU-MMAC, COMODO achieves 59.1% top-1 on Ego4D with IMU alone, and MoBind achieves state-of-the-art cross-modal retrieval. But all of these results are reported on **within-distribution** splits where training and test data come from the same population of users.
**The real-world failure mode is cross-person generalization.** When you train on workers A–H and deploy to worker J, performance degrades — often catastrophically. The recent survey on generalizable HAR (Cai et al., 2025) formalizes this as:
> U_train ∩ U_test = ∅, p(X, y | U_train) ≠ p(X, y | U_test)
This is the setting that matters for any commercial wearable deployment, yet it remains systematically understudied because existing ego+IMU datasets have too few subjects (CMU-MMAC: 12, WEAR: 18, MMEA: unknown) to support meaningful cross-person evaluation at scale.
### The Opportunity
We will use **500K clips from the Egocentric-1M dataset** — drawing from a pool of ~1,000 workers performing industrial/workplace tasks. This is orders of magnitude larger than any existing ego+IMU benchmark and enables studies that are statistically impossible on current datasets.
Additionally, we have early access to a **13K-clip, 100-worker pilot subset** (the Egocentric Native Compression dataset) for rapid prototyping and pipeline validation before scaling to 500K.
| Property | Pilot (Available Now) | Full Scale (500K) |
|----------|----------------------|-------------------|
| Workers | 100 | ~1,000+ (estimated) |
| Clips | 12,997 | **500,000** |
| Duration | ~97.5 hours | **~3,750 hours** |
| Total size | 1.22 TiB | **~47 TiB** |
| IMU data | 6.18 GiB | **~238 GiB** |
| 5s windows | ~65K | **~2.5M** |
| IMU rate | 200 Hz (acc + gyro) | 200 Hz (acc + gyro) |
| Activity labels | working / break / not-on-person | working / break / not-on-person |
| Domain | Industrial / workplace | Industrial / workplace |
From preliminary exploration of the pilot dataset, I've observed **consistent differences in motion patterns across workers performing similar tasks** — exactly the distribution shift that drives generalization failures. With ~1,000 workers and 500K clips, we can study this at a scale that no prior work has approached.
---
## 2. Research Questions
1. **How much does cross-person distribution shift degrade ego+IMU activity recognition?**
Quantify the gap between within-person and cross-person accuracy across IMU-only, video-only, and multimodal setups.
2. **Which modality is more robust to person shift — video or IMU?**
IMU captures body dynamics (person-specific); video captures visual scene (potentially person-invariant). Which degrades less under cross-person shift?
3. **Does video→IMU distillation improve cross-person robustness?**
COMODO distills video knowledge into IMU encoders for same-distribution settings. Does the distilled representation also transfer better to unseen workers?
4. **Can we predict which workers will be "hard cases" before deployment?**
Use IMU statistics (spectral features, signal energy, temporal patterns) to predict which unseen workers will cause generalization failure — enabling targeted data collection or adaptation.
5. **Does self-supervised pre-training on all workers (unlabeled) close the cross-person gap?**
EVI-MAE-style pre-training uses unlabeled data from all workers. Does this shared representation space reduce person-specific biases?
6. **What are the scaling laws for cross-person generalization?** *(New at 500K scale)*
How does downstream cross-person accuracy change as we increase training data from 13K → 50K → 100K → 250K → 500K clips? Is there a data scaling law for cross-person robustness?
---
## 3. Dataset Overview
### 3.1 Pilot Dataset (Available Now)
**Source:** `gs://build-ai-egocentric-native-compression/`
| Metric | Value |
|--------|-------|
| Total workers | 100 (selected from 150, 103 complete) |
| Total clips | 12,997 |
| Total duration | ~97.5 hours |
| Total size | 1.22 TiB (1,240 GiB video + 6.18 GiB IMU) |
| Video per clip | ~100 MiB MP4, ~27 seconds |
| IMU per clip | ~500 KiB JSONL, ~5,400 samples at 200Hz |
**Pilot worker statistics (from `index.csv`):**
| Statistic | Clips per Worker |
|-----------|-----------------|
| Min | 76 |
| Q1 (25th percentile) | 95 |
| Median | 129.5 |
| Q3 (75th percentile) | 164 |
| Max | 203 |
| Mean ± Std | 130.0 ± 36.3 |
**Activity state distribution (per-worker metadata):**
| State | Mean Fraction | Std | Min | Max |
|-------|---------------|-----|-----|-----|
| Working | 0.843 | 0.110 | 0.250 | 0.957 |
| Taking break | 0.123 | 0.110 | 0.014 | 0.721 |
| Not on person | 0.034 | 0.016 | 0.000 | 0.058 |
| Worn (working + break) | 0.966 | 0.016 | 0.942 | 1.000 |
### 3.2 Full Dataset (500K Target)
| Metric | Estimated Value |
|--------|-----------------|
| Total workers | ~1,000+ |
| Total clips | **500,000** |
| Total duration | **~3,750 hours (~156 days)** |
| Video data | **~46.9 TiB** |
| IMU data | **~238 GiB** |
| 5s windows | **~2,500,000** |
| IMU window tensor (float32) | **~56 GiB** |
### 3.3 Notable Pilot Workers (Hard Cases)
| Worker | Working % | Break % | Off-body % | Clips | Notes |
|--------|-----------|---------|------------|-------|-------|
| worker_039 | 25.0% | 72.1% | 2.9% | 135 | Extreme outlier — mostly on break |
| worker_085 | 37.3% | 57.5% | 5.2% | 149 | Majority break time |
| worker_019 | 57.0% | 41.2% | 1.8% | 110 | Near-equal work/break split |
| worker_087 | 62.1% | 32.5% | 5.3% | 164 | High break + high off-body |
At 500K scale, the worker pool grows roughly 10× (from 100 to ~1,000), so we expect correspondingly more behavioral diversity: more outliers, richer distribution tails, and harder generalization challenges.
### 3.4 Curation
- First and last 2 clips trimmed per worker (removes startup/shutdown artifacts)
- Audio stripped from all videos
- IMU format: JSONL with `t_us` (microseconds), `acc` [x,y,z] (m/s²), `gyro` [x,y,z] (rad/s)
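
The IMU format above can be read with the standard library alone. The following is a minimal sketch, assuming each JSONL line holds one sample with exactly the `t_us`, `acc`, and `gyro` keys listed; the helper names `parse_imu_jsonl` and `median_rate_hz` are hypothetical, and the demo records are synthetic rather than taken from the dataset.

```python
import json

def parse_imu_jsonl(lines):
    """Parse IMU JSONL records into timestamps and 6-channel samples."""
    t_us, samples = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        acc, gyro = rec["acc"], rec["gyro"]
        if len(acc) != 3 or len(gyro) != 3:
            raise ValueError(f"expected 3-axis acc and gyro, got {rec}")
        t_us.append(int(rec["t_us"]))
        # Channel order: acc_x, acc_y, acc_z, gyro_x, gyro_y, gyro_z
        samples.append([float(v) for v in acc] + [float(v) for v in gyro])
    return t_us, samples

def median_rate_hz(t_us):
    """Estimate the sampling rate from the median inter-sample interval."""
    deltas = sorted(b - a for a, b in zip(t_us, t_us[1:]))
    return 1e6 / deltas[len(deltas) // 2]

# Synthetic records spaced 5,000 us apart (200 Hz); values are illustrative only.
demo = [
    json.dumps({"t_us": i * 5000, "acc": [0.0, 0.0, 9.81], "gyro": [0.0, 0.0, 0.0]})
    for i in range(4)
]
t_us, samples = parse_imu_jsonl(demo)
rate = median_rate_hz(t_us)
```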
---
## 4. Experimental Design
### 4.1 Two-Stage Approach: Pilot → Full Scale
**Stage 1 (Weeks 1-4): Pilot on 13K clips**
- Validate all pipelines, debug models, establish baseline numbers
- Run all experiments on 100-worker pilot for fast iteration (~hours per experiment)
- Identify which methods are worth scaling
**Stage 2 (Weeks 5-14): Scale to 500K clips**
- Run the winning methods at full scale
- Execute the data scaling experiment (unique contribution at this scale)
- Full cross-person benchmark at ~1,000-worker scale
### 4.2 Cross-Person Evaluation Protocol
**Primary protocol: 5-Fold Leave-20%-Workers-Out (L20%WO)**
Split workers into 5 non-overlapping folds (20% each). For each fold:
- **Train:** 60% of workers
- **Validation:** 20% of workers
- **Test:** 20% of workers
| | Pilot (100 workers) | Full (est. ~1,000 workers) |
|---|---|---|
| Train | 60 workers (~7,800 clips) | ~600 workers (~300K clips) |
| Val | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
| Test | 20 workers (~2,600 clips) | ~200 workers (~100K clips) |
Report mean ± std across 5 folds.
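
The essential property of this protocol is that splitting happens at the worker level, so no worker contributes windows to more than one of train, validation, and test. A minimal sketch follows; the function names are hypothetical, and the rule that fold `k + 1` serves as validation when fold `k` is the test set is an assumption, since the protocol above only fixes the 60/20/20 proportions.

```python
import random

def worker_folds(worker_ids, n_folds=5, seed=0):
    """Shuffle workers once and deal them into n_folds disjoint groups."""
    ids = sorted(worker_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def l20wo_splits(worker_ids, n_folds=5, seed=0):
    """Yield (train, val, test) worker lists, one triple per fold.

    Assumption: fold k is the test set and fold (k + 1) % n_folds is validation.
    """
    folds = worker_folds(worker_ids, n_folds, seed)
    for k in range(n_folds):
        val_idx = (k + 1) % n_folds
        test = folds[k]
        val = folds[val_idx]
        train = [w for i, f in enumerate(folds) if i not in (k, val_idx) for w in f]
        yield train, val, test

workers = [f"worker_{i:03d}" for i in range(100)]
splits = list(l20wo_splits(workers))
```

Clips and windows are then assigned to a split by looking up their worker ID, which keeps every test worker fully unseen during training.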
**Secondary protocols:**
- **Leave-1-Worker-Out (L1WO):** On a representative subset of 100 workers drawn from the full ~1,000: train on 99, test on the remaining worker, and repeat for each of the 100. Measures per-worker difficulty.
- **Worker scaling curve:** Train on {100, 200, 400, 600, 800} workers, always test on the same held-out 200. Answers: "how many workers do you need?"
- **Data scaling curve:** Train best model at {13K, 50K, 100K, 250K, 500K} clips. Answers: "how much data do you need?"
### 4.3 Task Definition
**3-class activity state classification:**
- **Working** (84.3% mean per-worker fraction in pilot)
- **Taking break** (12.3%)
- **Not on person** (3.4%)
**Window protocol** (following COMODO/Ego4D standard):
- 5-second non-overlapping windows
- IMU: 200Hz × 5s = 1,000 timesteps × 6 channels (acc_xyz + gyro_xyz)
- Video: 10 FPS × 5s = 50 frames, resized to 224×224
- Per 27s clip: ~5 windows → **~2.5M total windows at 500K scale**
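
The window arithmetic can be sketched as below. Dropping the trailing partial window (about 2 s of a 27 s clip) is an assumption that matches the "~5 windows" figure; the function names are hypothetical.

```python
def window_starts(n_samples, rate_hz=200, window_s=5):
    """Start indices of non-overlapping windows; a trailing partial window is dropped."""
    size = rate_hz * window_s
    return list(range(0, n_samples - size + 1, size))

def split_windows(samples, rate_hz=200, window_s=5):
    """Cut a clip (list of 6-channel samples) into fixed-length windows."""
    size = rate_hz * window_s
    return [samples[s:s + size] for s in window_starts(len(samples), rate_hz, window_s)]

clip = [[0.0] * 6 for _ in range(5400)]  # placeholder for a 27 s clip at 200 Hz
windows = split_windows(clip)
```

The same start indices, scaled from 200 Hz to 10 FPS, select the matching 50-frame video window, which keeps the two modalities aligned.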
### 4.4 Metrics
| Metric | Why |
|--------|-----|
| **Macro F1** | Primary metric. Accounts for class imbalance (3.4% off-body would be invisible in accuracy) |
| **Top-1 accuracy** | Standard comparison with COMODO/EVI-MAE baselines |
| **Per-worker F1 variance** | σ across workers — directly measures generalization stability |
| **Generalization gap** | Δ between within-person and cross-person F1 |
| **Per-class F1** | Breakdown across working / break / not-on-person |
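
To make the metric definitions concrete, here is a dependency-free sketch of macro F1, the generalization gap, and the per-worker spread. The class label strings and function names are hypothetical, and scoring a class with no true or predicted instances as 0.0 is an assumed convention.

```python
from statistics import mean, pstdev

CLASSES = ("working", "break", "not_on_person")

def f1_per_class(y_true, y_pred, classes=CLASSES):
    """Per-class F1 = 2TP / (2TP + FP + FN); 0.0 when the class never appears."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return mean(list(f1_per_class(y_true, y_pred, classes).values()))

def generalization_gap(within_f1, cross_f1):
    """Within-person F1 minus cross-person F1."""
    return within_f1 - cross_f1

def per_worker_f1_std(preds_by_worker):
    """preds_by_worker maps worker id -> (y_true, y_pred) for that worker."""
    return pstdev([macro_f1(t, p) for t, p in preds_by_worker.values()])

# Synthetic example: a model that always predicts the majority class.
y_true = ["working"] * 8 + ["break", "not_on_person"]
y_pred = ["working"] * 10
majority_macro_f1 = macro_f1(y_true, y_pred)
```

On this synthetic example the always-"working" predictor is right on 8 of 10 windows, yet its macro F1 is only about 0.30, which is why macro F1 rather than accuracy is the primary metric.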
### 4.5 Baseline Comparisons
#### Experiment A: IMU-Only Models (5 baselines)
| Model | Type | Params | Reference |
|-------|------|--------|-----------|
| DeepConvLSTM | Supervised CNN+LSTM | ~2M | Ordóñez & Roggen, 2016 |
| Attend & Discriminate | Supervised attention | ~3M | Abedin et al., 2021 |
| CrossHAR | Cross-dataset transfer | ~5M | Hong et al., 2024 |
| MOMENT-small | Pre-trained time-series foundation | ~40M | Goswami et al., 2024 |
| Mantis | Pre-trained time-series | ~15M | Liu et al., 2024 |
All evaluated in both **within-person** (random split, so the same workers appear in both train and test) and **cross-person** (L20%WO) settings. The gap between the two is the headline result.
#### Experiment B: Video-Only Models (3 baselines)
| Model | Type | Params | Reference |
|-------|------|--------|-----------|
| TimeSformer-Base | Transformer (K400 pretrained) | 121M | Bertasius et al., 2021 |
| VideoMAE ViT-B | MAE pre-trained Transformer | 86M | Tong et al., 2022 |
| SlowFast R50 | Dual-pathway CNN | 34M | Feichtenhofer et al., 2019 |
#### Experiment C: Cross-Modal Distillation
| Method | Setup | Reference |
|--------|-------|-----------|
| COMODO (MOMENT student) | Frozen TimeSformer → MOMENT IMU encoder | Chen et al., 2025 |
| COMODO (Mantis student) | Frozen TimeSformer → Mantis IMU encoder | Chen et al., 2025 |
| IMU2CLIP | Joint video-IMU-text embedding | Moon et al., 2023 |
**Key question:** Does the distilled IMU model generalize better to unseen workers than the supervised IMU model?
#### Experiment D: Self-Supervised Pre-training
| Method | Setup | Reference |
|--------|-------|-----------|
| EVI-MAE | Joint video+IMU MAE pre-training → fine-tune | Zhang et al., 2024 |
| IMU-only MAE | IMU spectrogram MAE → fine-tune | Adapted from EVI-MAE |
Pre-train on **all workers** (unlabeled, self-supervised), then fine-tune on train workers with labels, evaluate on held-out test workers. Compare with same architecture trained from scratch.
#### Experiment E: Hard Case Analysis
For a representative subset of workers, compute:
- **IMU statistics:** RMS acceleration, gyro variance, spectral centroid, signal entropy, jerk magnitude, dominant frequency
- **Temporal statistics:** autocorrelation decay, stationarity (ADF test), transition rate between high/low activity
- **Model performance:** per-worker F1 from L1WO
Then:
1. **Correlate** IMU statistics with model difficulty (per-worker F1)
2. **Cluster** workers by IMU feature similarity → do clusters predict generalization difficulty?
3. **Predict difficulty:** train a regression model (IMU stats → expected F1) — can we flag hard workers before deployment?
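
Two of the listed IMU statistics and the correlation step can be sketched as follows. This is an illustrative outline under assumptions, not the analysis code: the function names are hypothetical, jerk is approximated by a first-order finite difference, and the inputs in the demo are synthetic.

```python
import math

def rms_acceleration(acc):
    """Root-mean-square of the acceleration vector magnitude (m/s^2)."""
    return math.sqrt(sum(x * x + y * y + z * z for x, y, z in acc) / len(acc))

def mean_jerk_magnitude(acc, rate_hz=200):
    """Mean magnitude of the finite-difference derivative of acceleration (m/s^3)."""
    jerks = [math.dist(a, b) * rate_hz for a, b in zip(acc, acc[1:])]
    return sum(jerks) / len(jerks)

def pearson_r(xs, ys):
    """Pearson correlation between a per-worker statistic and per-worker F1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic two-sample signal, only to exercise the functions.
demo_acc = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
demo_rms = rms_acceleration(demo_acc)
demo_jerk = mean_jerk_magnitude(demo_acc)
```

In the actual analysis, `xs` would hold one statistic per worker and `ys` the per-worker F1 from L1WO; a strong correlation would suggest that statistic is a candidate feature for the difficulty regressor.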
#### Experiment F: Data Scaling Laws *(New at 500K)*
Train the best-performing model from Experiments A-D at increasing data scales:
| Scale | Clips | Est. Workers | Est. Hours |
|-------|-------|-------------|------------|
| 13K | 12,997 | 100 | 97.5h |
| 50K | 50,000 | ~385 | 375h |
| 100K | 100,000 | ~770 | 750h |
| 250K | 250,000 | ~1,000 | 1,875h |
| 500K | 500,000 | ~1,000 | 3,750h |
Measure: cross-person Macro F1 vs. training data size. Is there a scaling law? Does it plateau?
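
One way to test for a scaling law is to fit a power law to the error, err(N) = a · N^(-b) with err = 1 - Macro F1, by least squares in log-log space. The sketch below assumes that functional form, which the proposal does not commit to; the error values are synthetic, generated from a known power law purely to check that the fit recovers it. A plateau would show up as systematic misfit at the largest scales and would call for an added constant term, which this sketch does not model.

```python
import math

def fit_power_law(n_clips, errors):
    """Least-squares fit of log(err) = log(a) - b * log(N); returns (a, b)."""
    xs = [math.log(n) for n in n_clips]
    ys = [math.log(e) for e in errors]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return math.exp(intercept), -slope

scales = [12_997, 50_000, 100_000, 250_000, 500_000]
# Synthetic errors from a known power law (not experimental results).
true_a, true_b = 2.0, 0.25
synthetic_err = [true_a * n ** -true_b for n in scales]
a, b = fit_power_law(scales, synthetic_err)
```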
---
## 5. Technical Approach
### 5.1 IMU Preprocessing Pipeline
```
Raw JSONL (200Hz, acc+gyro)
→ Parse, validate timestamps, interpolate gaps
→ 5-second non-overlapping windows (1000 × 6)
→ Two representations:
(a) Raw time series [1000, 6] for Mantis/MOMENT/DeepConvLSTM
(b) STFT spectrogram [160, 128, 6] for EVI-MAE/IMU-MAE
→ Per-channel z-normalization (computed on training set only)
```
**At 500K scale:** 2.5M windows × 6,000 float32 values × 4 bytes ≈ 56 GiB of processed tensors. Fits comfortably in RAM on a cpu-upgrade instance. Preprocessing takes ~80h CPU but is embarrassingly parallel.
*Already built:* I have the 200Hz windowing and video sync pipeline from earlier work on the pilot dataset.
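
The final normalization step matters for cross-person evaluation: statistics must come from training workers only, otherwise information about test workers leaks into the inputs. A minimal pure-Python sketch of that step, with hypothetical function names and a tiny synthetic window (a real pipeline would use vectorized array operations):

```python
import math

def channel_stats(train_windows):
    """Per-channel mean and std over every timestep of every training window."""
    n_ch = len(train_windows[0][0])
    count = sum(len(w) for w in train_windows)
    means = [sum(s[c] for w in train_windows for s in w) / count for c in range(n_ch)]
    stds = [
        math.sqrt(sum((s[c] - means[c]) ** 2 for w in train_windows for s in w) / count)
        for c in range(n_ch)
    ]
    return means, stds

def z_normalize(window, means, stds, eps=1e-8):
    """Apply training-set statistics to any window (train, val, or test)."""
    return [[(v - m) / (s + eps) for v, m, s in zip(sample, means, stds)]
            for sample in window]

# One tiny 2-channel training window, illustrative only.
train = [[[0.0, 10.0], [2.0, 10.0]]]
means, stds = channel_stats(train)
normalized = z_normalize([[3.0, 10.0]], means, stds)
```

Under the 5-fold protocol the statistics are recomputed per fold, since the set of training workers changes.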
### 5.2 Video Preprocessing Pipeline
```
MP4 (27s clips, ~100 MiB each)
→ Decode at 10 FPS → 270 frames per clip
→ 5-second windows → 50 frames per window
→ Resize to 224×224, normalize (ImageNet stats)
→ Pre-compute video encoder embeddings (store to disk)
```
**At 500K scale:** Pre-computing video embeddings is critical. We do NOT download all 47 TiB of video. Instead:
- Stream video clips on-demand during embedding extraction
- TimeSformer-B produces 768-dim vectors → 2.5M windows × 768 × 4 bytes ≈ 7.68 GB (**~7.2 GiB**) per model. Easily stored.
- Embedding extraction: ~200h on A10G (parallelizable across 4 GPUs → ~50h wall-clock)
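
The storage estimates in this section follow from count × shape × bytes per value. A small helper makes the arithmetic easy to re-run when a shape or dtype changes; `tensor_gib` is a hypothetical name and float32 is assumed throughout.

```python
from math import prod

GIB = 2 ** 30

def tensor_gib(n_items, shape, bytes_per_value=4):
    """Size in GiB of n_items dense arrays of the given shape (float32 by default)."""
    return n_items * prod(shape) * bytes_per_value / GIB

N_WINDOWS = 2_500_000
imu_gib = tensor_gib(N_WINDOWS, (1000, 6))   # raw IMU windows
embed_gib = tensor_gib(N_WINDOWS, (768,))    # one video encoder's embeddings
print(f"IMU windows: {imu_gib:.1f} GiB, embeddings per encoder: {embed_gib:.1f} GiB")
```

Halving `bytes_per_value` models float16 storage, which is a cheap lever if disk or RAM becomes tight.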
### 5.3 COMODO Distillation Protocol
Following Chen et al. (2025) exactly:
- **Video teacher:** TimeSformer-Base, frozen, K400-pretrained. MLP projector (hidden=2048, out=128)
- **IMU student:** MOMENT-small or Mantis, trainable. MLP projector (hidden=2048, out=128)
- **Loss:** Cross-entropy on similarity distribution with dynamic FIFO instance queue
- **Hyperparameters:** 20 epochs, batch 128, lr=3e-4, τ_v=0.1, τ_x=0.05
- **Hardware:** A10G GPU (24 GB VRAM) — scales linearly with data, ~12h per student per fold at 500K
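
The loss bullet can be illustrated with a small pure-Python sketch: each embedding is turned into a softmax distribution over its cosine similarities to a FIFO queue of stored instances, and the student is trained to match the teacher's distribution. This is a generic sketch of that style of objective, not COMODO's implementation. Filling the queue with teacher embeddings, and pairing τ_v with the video teacher and τ_x with the IMU student, are assumptions.

```python
import math
from collections import deque

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_distribution(embedding, queue, temperature):
    """Softmax over cosine similarities between one embedding and each queue entry."""
    e = l2_normalize(embedding)
    sims = [sum(a * b for a, b in zip(e, q)) / temperature for q in queue]
    return softmax(sims)

def distillation_loss(video_emb, imu_emb, queue, tau_v=0.1, tau_x=0.05):
    """Cross-entropy between teacher (video) and student (IMU) distributions."""
    p_teacher = similarity_distribution(video_emb, queue, tau_v)
    p_student = similarity_distribution(imu_emb, queue, tau_x)
    return -sum(pt * math.log(ps + 1e-12) for pt, ps in zip(p_teacher, p_student))

# FIFO queue of normalized embeddings; maxlen evicts the oldest entry when full.
queue = deque(maxlen=4)
for v in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]):
    queue.append(l2_normalize(v))

aligned = distillation_loss([0, 1, 0], [0, 1, 0], queue)
misaligned = distillation_loss([0, 1, 0], [0, 0, 1], queue)
```

In this toy setup `aligned` is smaller than `misaligned`, reflecting that the loss rewards a student whose neighborhood structure matches the teacher's. In training, gradients would flow only through the student branch because the teacher is frozen.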
### 5.4 EVI-MAE Protocol
Following Zhang et al. (2024):
- **Video encoder:** ViT-Base (patch 16×16), 90% masking ratio
- **IMU encoder:** Transformer on STFT spectrogram patches (16×16), 75% masking ratio
- **Fusion:** Single-layer Transformer
- **Pre-training loss:** MAE reconstruction (α=1, β=10) + contrastive (γ=0.01, τ=0.05)
- **Pre-train:** 100 epochs on all workers (fewer epochs needed — each epoch sees 38× more data)
- **Fine-tune:** 50 epochs on train split with labels
- **Hardware:** **A100x4 (4× 80GB)** for pre-training (~6 days), A10G for fine-tuning
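
To give a sense of what the masking ratios mean in patch counts, the sketch below draws a random mask for one IMU spectrogram channel and one video frame. Masking each video frame independently is an assumption made for illustration; the published method may mask across time differently, and `random_mask` is a hypothetical helper.

```python
import random

def random_mask(n_patches, mask_ratio, rng):
    """Return (visible, masked) patch index lists for one sample."""
    n_masked = round(n_patches * mask_ratio)
    order = list(range(n_patches))
    rng.shuffle(order)
    return sorted(order[n_masked:]), sorted(order[:n_masked])

rng = random.Random(0)
# A [160, 128] spectrogram cut into 16x16 patches gives 10 * 8 = 80 patches per channel.
imu_visible, imu_masked = random_mask((160 // 16) * (128 // 16), 0.75, rng)
# A 224x224 frame cut into 16x16 patches gives 14 * 14 = 196 patches.
vid_visible, vid_masked = random_mask((224 // 16) ** 2, 0.90, rng)
```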
---
## 6. Timeline (16 Weeks)
### Stage 1: Pilot Validation (Weeks 1-4)
| Week | Phase | Deliverable | Compute |
|------|-------|-------------|---------|
| **1** | Data preparation | IMU windowing, video embedding extraction, EDA notebook on pilot | CPU: 12h, A10G: 10h |
| **2** | IMU baselines (pilot) | 5 IMU models × 5 folds on 13K clips. Validate pipeline, establish baselines | A10G: 20h |
| **3** | Video + COMODO (pilot) | Video baselines + COMODO distillation on 13K clips | A10G: 30h |
| **4** | EVI-MAE (pilot) + triage | Self-supervised pre-training on pilot. Select methods to scale | A10G: 40h |
### Stage 2: Full Scale (Weeks 5-14)
| Week | Phase | Deliverable | Compute |
|------|-------|-------------|---------|
| **5-6** | 500K data ingestion | Download 238 GiB IMU. Stream-extract video embeddings for 500K clips (3 models) | CPU: 88h, A10G: 600h |
| **7-8** | IMU baselines (500K) | 5 IMU models × 5 folds at full scale | A10G: 250h |
| **9-10** | Video + COMODO (500K) | Video baselines + COMODO distillation at full scale | A10G: 240h |
| **11-12** | EVI-MAE (500K) | Self-supervised pre-training on all workers (~6 days on A100x4) + fine-tuning | A100x4: 150h, A10G: 40h |
| **13** | Hard case analysis | Per-worker difficulty on 100-worker subset, IMU feature correlation | A10G: 100h, CPU: 40h |
| **14** | Scaling experiments | Data scaling curve (13K→500K), worker scaling curve (100→800) | A10G: 375h |
### Stage 3: Synthesis (Weeks 15-16)
| Week | Phase | Deliverable | Compute |
|------|-------|-------------|---------|
| **15** | Analysis | Compile results, statistical significance tests, generate figures | CPU: 20h |
| **16** | Write-up | Paper draft (8-page main + appendix) | — |
**Total wall-clock:** 16 weeks
**Active compute window:** Weeks 1-14 for GPU experiments (14 weeks); the Week 15 analysis needs CPU only
---
## 7. Compute Budget
### 7.1 Summary
| Resource | Hours | Unit Cost | Total |
|----------|-------|-----------|-------|
| CPU (preprocessing, analysis) | 160h | $0.60/h | $96 |
| A10G GPU (24 GB) | 1,705h | $2.00/h | $3,410 |
| A100x4 GPU (4× 80 GB) | 150h | $16.00/h | $2,400 |
| **Subtotal** | **2,015h** | | **$5,906** |
| **30% contingency** | | | **+$1,772** |
| **Total** | | | **~$7,678** |
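
The summary totals can be re-derived from the hours and unit costs in the table, which is useful when an experiment is added or dropped. A short check using only the table's figures (the variable names are hypothetical):

```python
RATES = {"cpu": 0.60, "a10g": 2.00, "a100x4": 16.00}   # USD per hour
HOURS = {"cpu": 160, "a10g": 1705, "a100x4": 150}
CONTINGENCY = 0.30

costs = {name: HOURS[name] * RATES[name] for name in HOURS}
subtotal = sum(costs.values())
contingency = subtotal * CONTINGENCY
total = subtotal + contingency
print(f"subtotal ${subtotal:,.0f}, contingency ${contingency:,.0f}, total ${total:,.0f}")
```

The A10G line dominates the hours while the A100x4 line costs roughly 8× more per hour, so cutting EVI-MAE pre-training epochs is the single largest lever on the budget.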
### 7.2 Phase Breakdown
| Phase | Resource | Hours | Cost | Notes |
|-------|----------|-------|------|-------|
| **1. Pilot validation** | A10G + CPU | 100h + 12h | $207 | Weeks 1-4, fast iteration |
| **2. Data ingestion + embedding extraction** | CPU + A10G | 88h + 600h | $1,253 | Parallelizable across 4 GPUs |
| **3. IMU baselines (500K)** | A10G | 250h | $500 | 5 models × 5 folds |
| **4. Video + COMODO (500K)** | A10G | 240h | $480 | 3 video models + 2 distillation students |
| **5. EVI-MAE pre-training** | A100x4 | 150h | $2,400 | ~6 days continuous |
| **5b. EVI-MAE fine-tuning** | A10G | 40h | $80 | 5 folds |
| **6. Hard case analysis** | A10G + CPU | 100h + 40h | $224 | L1WO on 100-worker subset |
| **7. Scaling experiments** | A10G | 375h | $750 | 5 scale points × 5 folds |
| **8. Synthesis** | CPU | 20h | $12 | Figures, stats |
### 7.3 Storage Requirements
| Data | Size | Notes |
|------|------|-------|
| Raw IMU data (all 500K clips) | ~238 GiB | Download once, keep in project storage |
| Processed IMU windows (float32) | ~56 GiB | 2.5M windows × [1000, 6] |
| Video embeddings (3 models) | ~22 GiB | 2.5M windows × 768-dim × 3 encoders |
| Selective raw video (10% for EVI-MAE) | ~4.8 TiB | Stream during pre-training; cache hot subset |
| STFT spectrograms (for EVI-MAE) | ~1.1 TiB | 2.5M × [160, 128, 6] at float32 |
| Model checkpoints & logs | ~100 GiB | ~50 experiments × ~2 GiB each |
| **Total active storage** | **~6.3 TiB** | |
**Key insight:** We do NOT need all 47 TiB of video. For 80% of experiments (IMU baselines, COMODO, hard case analysis), only the **238 GiB of IMU data + 22 GiB of pre-computed video embeddings** are needed. Only EVI-MAE pre-training requires raw video, and we stream it rather than pre-downloading.
### 7.4 Comparison: Pilot vs. Full Scale
| | Pilot (13K) | Full (500K) | Scale Factor |
|---|---|---|---|
| Compute hours | 273h | 2,015h | 7.4× |
| Cost | $654 | $7,678 | 11.7× |
| Timeline | 10 weeks | 16 weeks | 1.6× |
| Storage | 284 GiB | 6.3 TiB | 23× |
The cost scales sub-linearly with data (11.7× cost for 38.5× more data) because: (a) embedding extraction amortizes across all experiments, (b) fewer epochs needed at larger data scale, (c) pilot experiments avoid wasting compute on methods that don't work.
---
## 8. Expected Contributions
### 8.1 Empirical Contributions
1. **First cross-person generalization study on ego+IMU at 1,000-worker, 500K-clip scale.** Prior work: CMU-MMAC (12 subjects), WEAR (18), MMEA (unknown), Ego4D (mixed, not controlled). With ~1,000 workers, we are **roughly 55-83× larger** in subject count than CMU-MMAC and WEAR.
2. **Modality robustness comparison under person shift.** No prior work has compared video vs. IMU vs. multimodal degradation under controlled cross-person shift on the same dataset.
3. **Data scaling laws for cross-person generalization.** First systematic study of how cross-person accuracy scales with training data (13K → 500K). This has direct practical value: how much data do you need to collect?
4. **Worker difficulty prediction from IMU statistics.** If successful, this enables proactive data collection — target workers whose motion patterns are underrepresented.
5. **Cross-modal distillation under distribution shift.** COMODO was validated on same-distribution splits. We test the open question: does distilled knowledge transfer across people?
### 8.2 Benchmark Contribution
- Standardized splits (5-fold L20%WO) and evaluation protocol for 1,000-worker ego+IMU
- Baseline results for 11+ models across 3 modality configurations at 500K scale
- Per-worker difficulty scores for hard case analysis
- Data scaling curves with fitted power laws
- Open-source evaluation code and pre-computed features
### 8.3 Practical Contribution
- Guidelines for "how many workers to collect" (worker scaling curve)
- Guidelines for "how many clips to collect" (data scaling curve)
- Method for flagging hard-to-generalize deployment scenarios
- Evidence for whether IMU-only deployment (privacy-preserving) sacrifices cross-person robustness vs. video
---
## 9. Relation to Existing Work
| Paper | What They Did | What We Add |
|-------|---------------|-------------|
| **COMODO** (Chen et al., 2025) | Video→IMU distillation on Ego4D/EgoExo4D/MMEA | Test distillation under cross-person shift; ~1,000 workers, 500K clips |
| **EVI-MAE** (Zhang et al., 2024) | Multimodal MAE on CMU-MMAC (12 subj.) and WEAR (18 subj.) | Scale to ~1,000 workers; evaluate cross-person transfer of MAE representations |
| **Generalizable HAR Survey** (Cai et al., 2025) | Taxonomy of cross-person/device/position HAR | Provide the large-scale empirical study their survey explicitly calls for |
| **IMU-Video OOD HAR** (Cheshmi et al., 2025) | Cross-modal SSL for OOD transfer | Study OOD as cross-person (same task, different people) rather than cross-domain |
| **ColloSSL** (Jain et al., 2022) | Multi-device contrastive SSL for HAR | Apply cross-device ideas to cross-person setting with paired video+IMU |
| **UniMTS** (Zhang et al., 2024) | Text-aligned IMU pre-training for zero-shot HAR | Potential extension: text-conditioned cross-person generalization |
| **MIAM** (Mehta et al., 2025) | Industrial assembly monitoring (RGB+depth+IMU, 8 volunteers) | Same domain but ~1,000 workers vs. 8; formal cross-person protocol |
---
## 10. Risk Mitigation
| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| 500K dataset access delayed | Medium | Start with 13K pilot (Weeks 1-4). All pipelines and pilot results are independently valuable |
| Video streaming bottleneck at 47 TiB | Medium | Pre-compute embeddings (only 22 GiB). Only EVI-MAE needs raw video; use selective 10% subset |
| 3-class task too easy (>95% accuracy) | Medium — "working" is 84% majority class | Use macro-F1 (penalizes majority bias); explore finer-grained temporal segmentation at scale |
| EVI-MAE training unstable on A100x4 | Low | Validated on pilot first (Week 4). Follow published hyperparameters exactly; reduce lr if needed |
| Compute budget overrun | Medium | 30% contingency ($1,772). Pilot phase identifies which methods to scale — skip underperformers |
| Insufficient cross-person variance | Very Low (already observed in pilot) | ~1,000 workers should provide far more diversity than 100 |
---
## 11. About the Researcher
**Shubham Rasal** — [shubhamrasal.com](https://shubhamrasal.com)
I've already worked with this dataset in its earlier form (the compression challenge). Specifically:
- Built the **IMU (200Hz) windowing and video synchronization pipeline**
- Explored **frequency-domain and temporal features** from the IMU signals
- Observed the cross-worker motion pattern differences that motivate this proposal
- Profiled the full dataset structure (100 workers, 12,997 clips, 1.22 TiB)
This prior work means I can move directly to model training on the pilot without pipeline development overhead, and scale to 500K with confidence in the data pipeline.
---
## 12. Key References
1. Chen, B. et al. "COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric HAR." arXiv:2503.07259 (2025).
2. Zhang, M. et al. "Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition." arXiv:2407.06628 (2024).
3. Cai, Y. et al. "Towards Generalizable Human Activity Recognition: A Survey." arXiv:2508.12213 (2025).
4. Cheshmi, S. et al. "Improving OOD HAR via IMU-Video Cross-modal Representation Learning." arXiv:2507.13482 (2025).
5. Moon, S. et al. "IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors." arXiv:2210.14395 (2023).
6. Zhang, X. et al. "UniMTS: Unified Pre-training for Motion Time Series." arXiv:2410.19818 (2024).
7. Mehta, N. et al. "A Multimodal Dataset for Enhancing Industrial Task Monitoring." arXiv:2501.05936 (2025).
8. Jain, Y. et al. "ColloSSL: Collaborative Self-Supervised Learning for HAR." arXiv:2202.00758 (2022).
9. Grauman, K. et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." arXiv:2110.07058 (2022).
10. Grauman, K. et al. "Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives." arXiv:2311.18259 (2023).
---
*End of proposal.*