Seeing Without Eyes: 4D Human–Scene Understanding from Wearable IMUs

University of Illinois at Urbana-Champaign
* indicates equal contribution

arXiv 2026
Teaser image

We propose IMU-to-4D, a large foundation model that jointly reasons over human motion, activity descriptions, and 3D scene layouts purely from wearable IMU signals.

Abstract

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, whose goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes inertial signals from a few sensors in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.

Method Overview


(Left) Motion Tokenizer. The root trajectory is chunked into fixed-length windows and normalized; the normalized chunks are VQ-quantized, and the normalization parameters μ and σ are separately quantized via non-uniform binning, together yielding compact discrete root tokens. Body poses are continuously encoded via an MLP to produce body tokens. (Right) Multi-modal Transformer. Inertial signals from earbuds, a smartphone, and a watch are encoded and fed into a unified transformer, which jointly decodes human motion via bidirectional attention, and motion descriptions and scene layout autoregressively via causal attention.
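To make the tokenizer concrete, below is a minimal sketch of the chunk-and-normalize step, assuming a hypothetical window length and precomputed bin edges. The VQ codebook lookup for the normalized chunks is omitted, and none of the names or hyperparameters here are the paper's actual implementation.

import torch

def tokenize_root(root_traj, window=16, mu_bins=None, sigma_bins=None):
    """Sketch: chunk the root trajectory, normalize per chunk, bin-quantize mu/sigma.

    root_traj: (T, 3) root positions; mu_bins/sigma_bins: 1-D sorted bin edges.
    """
    T = root_traj.shape[0] // window * window
    chunks = root_traj[:T].reshape(-1, window, 3)      # fixed-length windows
    mu = chunks.mean(dim=1, keepdim=True)              # per-chunk mean
    sigma = chunks.std(dim=1, keepdim=True) + 1e-6     # per-chunk std
    normed = (chunks - mu) / sigma                     # would go to the VQ encoder
    # Non-uniform binning: quantize mu and sigma against precomputed bin edges.
    mu_tokens = torch.bucketize(mu.squeeze(1), mu_bins)
    sigma_tokens = torch.bucketize(sigma.squeeze(1), sigma_bins)
    return normed, mu_tokens, sigma_tokens

# Example with uniform edges (real bins would be non-uniform and data-driven):
traj = torch.randn(128, 3).cumsum(dim=0)
normed, mu_tok, sig_tok = tokenize_root(
    traj, mu_bins=torch.linspace(-10, 10, 63), sigma_bins=torch.linspace(0, 5, 63))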

Qualitative Comparison

We compare our method against the baseline across three evaluation tracks. The baseline is a modular pipeline that chains MobilePoser (IMU → motion), MotionGPT3 (motion → text), and Summon (motion → scene). All visualizations are on held-out test sets.

IMU → Motion  (Real-World IMU Dataset)

Ground Truth Ours MobilePoser IMUPoser
Following prior work, we first train on synthetic IMU data and then fine-tune on real-world IMU readings to adapt to real sensor noise.

DIP-IMU Dataset

Sample 0
Sample 1
Sample 2

IMUPoser Dataset

Sample 3
Sample 4
Sample 5

IMU → Motion & Text  (HumanML Dataset)

Ground Truth (Motion) Ours (Motion) MobilePoser (Motion) Ground Truth (Text) Ours (Text) MotionGPT3 (Text)

The starting locations of our prediction and MobilePoser's are shifted slightly for clarity. MotionGPT3 takes MobilePoser's motion predictions as input.

GT: A person jumps to the right, then to the left.
Ours: He jumps to the left a short distance.
MGPT3: A person jumps in place.
GT: Someone walks and jumps in a chaotic manner before kicking the ground.
Ours: A person hops around, alternating between different spots.
MGPT3: A person walks forward then turns around and walks back.
GT: A person performs various stretches by bending knees and moving from side to side, then by pulling the right elbow back with the left arm.
Ours: A person starts with his arms apart, brings them together, and then moves right to left, and completes with a left-handed swing technique.
MGPT3: A person leans forward and then looks back to the left.
GT: A person quickly crawls backwards on their front side using both hands and legs.
Ours: They are crawling backward on the room's surface.
MGPT3: A person is crawling forward and moving his legs straight and forward.
GT: They move their arms in a rhythmic up-and-down pattern.
Ours: The person is doing upper body exercises with their arms.
MGPT3: A person stands still and then releases their right hand.
GT: A person executes a backhand shot in tennis.
Ours: The person takes a mock shot or two uneven hand movements.
MGPT3: A person is stooping to the right then turns his left and walks backwards.

IMU → Motion & Text & Scene  (HUMOTO Dataset)

Ground Truth (Motion) Ours (Motion) MobilePoser (Motion) Ground Truth (Text) Ours (Text) MotionGPT3 (Text)

We visualize ground-truth objects and objects predicted by our method in the first frame. Starting locations are shifted slightly for clarity. Summon takes MobilePoser motion predictions as input.

Summon is tailored to a limited set of interaction types (e.g., sitting and lying down), which restricts its ability to generalize to diverse scene layouts.
GT: The subject places trash can with both hands on ground. The subject picks trash from ground. The subject throws trash in trash can.
Ours: The subject bends body toward the ground, grasps a trash can with both hands, and then lifts it.
MGPT3: The subject places trash can with both hands on ground. The subject picks trash from the ground and drops it into a trash can.
GT: The subject walks around while carrying a side plate in their right hand.
Ours: The subject holds a spatula in their right hand while walking back and forth.
MGPT3: A person walks back and forth while carrying a mixing bowl in their right hand.
GT: The subject pulls the dining chair towards the subject. The subject moves the dining chair away from the subject.
Ours: The subject lifts the working chair. The subject holds with both hands. The subject puts working chair with both hands down on ground.
MGPT3: The subject lifts a serving bowl with both hands, holds it for 30 seconds, and then places it back on the table.
GT: The subject grasps and lifts the deep plate with right hand. The subject shakes the deep plate. The subject puts the deep plate on the table.
Ours: The subject uses a wok turner to scoop eggs from a frying pan, lifting the pan with their other hand, and places the eggs onto a side plate.
MGPT3: The subject picks up a frying pan and places it in front of the frying pan. The subject stirs the food into the frying pan with right hand.
GT: A person removes the cap from a vacuum flask, dispenses soap inside, shakes it to wash the interior, and then grabs a sponge to continue cleaning.
Ours: The subject puts soap from a dispenser into a mug, then uses a sponge to wash the inside and outside of the mug over a sink.
MGPT3: The subject dispenses soap from a sink into a drawer tray, then places the tray in a storage bowl.

Note: assets provided in the original dataset have broken textures, leading to rendering artifacts.

GT: The subject organizes and transfers cooking utensils into the draw organizer tray.
Ours: The subject organizes utensils by moving a turner, a soup ladle, a knife, and a peeler from a table into a drawer organizer tray.
MGPT3: The subject takes a serving bowl, transfers a frying pan, and then places them on a table.

Applications


Relocalization (IMU → Motion + Absolute Location)

Imagine an everyday setting where a person performs many activities within the same environment, such as their home. Can our model infer the 3D scene layout and human–scene interactions from IMU signals collected during these activities?

We refer to this as 3D relocalization. The task is non-trivial because IMU signals capture only motion relative to the first reading and contain no absolute positional information; the model must implicitly encode the scene structure and reason about where each activity plausibly occurs.

To evaluate this, we fine-tune our model on a subset of ParaHome sequences, where motions are captured within a fixed 3D scene. Given new IMU signals, the goal is to recover the person's initial global translation and orientation within the scene.
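Concretely, relocalization amounts to predicting a rigid transform that places the first-frame-relative trajectory into the scene frame. The sketch below shows this composition; R0, t0, and the function name are illustrative assumptions, not our model's actual interface.

import torch

def to_scene_frame(rel_traj, R0, t0):
    """Place a first-frame-relative root trajectory into the scene frame.

    rel_traj: (T, 3) trajectory relative to the first IMU reading.
    R0: (3, 3) predicted initial orientation; t0: (3,) predicted translation.
    Applies x_scene = R0 @ x_rel + t0 for every frame.
    """
    return rel_traj @ R0.T + t0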

Relocalization illustration
Ground Truth Ours
Sample 1
Sample 2

Dynamic Object Predictions

Our method can be extended to jointly predict dynamic objects and human motions from IMU signals on the OMOMO dataset.

Ground Truth Ours
Sample 1
Sample 2

Ablation Studies

We ablate the choice of autoregressive (causal) and bidirectional attention mechanisms for decoding motion tokens. While autoregressive decoding supports online and streaming settings, it is susceptible to error accumulation over long horizons; bidirectional attention leverages the full context for higher accuracy at the cost of requiring the complete input sequence upfront.
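The difference between the two regimes comes down to the attention mask used when decoding motion tokens; the construction below is an illustrative sketch, not our exact implementation.

import torch

def motion_attention_mask(seq_len, causal):
    """Boolean attention mask (True = attend) for motion-token decoding."""
    if causal:
        # Autoregressive: token i attends only to tokens 0..i, which supports
        # streaming but lets errors accumulate over long horizons.
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Bidirectional: every motion token attends to the full sequence,
    # which requires the complete input upfront.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)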

Bidirectional attention over motion tokens produces smoother trajectories than autoregressive decoding.
Ground Truth Ours (Bidirectional) Ours (Autoregressive)

DIP-IMU

Sample 0
Sample 1
Sample 2

IMUPoser

Sample 0
Sample 1
Sample 2

Limitations and Failure Cases


Drifting

Human motion may drift over time due to the accumulation of prediction errors, a limitation shared by all IMU→motion methods. We believe incorporating loop-closure mechanisms could alleviate this issue, and leave this direction for future work.

Ground Truth Ours MobilePoser

Hallucinations & Penetrations in Predicted Scenes

IMU-to-scene prediction is highly ill-posed, as multiple scene layouts can correspond to the same IMU signals or motion sequences. We observe that objects predicted by our approach may not always match the ground truth; this ambiguity could be mitigated with larger-scale datasets or stronger priors about scene layouts. Additionally, predicted objects may exhibit interpenetrations, since our model does not explicitly enforce non-penetration constraints; incorporating such penalties, sketched below, could help alleviate this problem.

Ground Truth Ours
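As one possible remedy, a soft penalty could discourage body-object interpenetration, assuming each predicted object exposes a signed-distance function (negative inside the object). This is a hedged sketch under that assumption, not a component of our current model.

import torch

def penetration_loss(body_points, object_sdf):
    """Penalize sampled body-surface points that fall inside an object.

    body_points: (N, 3) points sampled on the predicted body surface.
    object_sdf: callable mapping (N, 3) points to (N,) signed distances,
                negative inside the object (hypothetical interface).
    """
    sdf = object_sdf(body_points)
    return torch.relu(-sdf).mean()  # only penetrating points (sdf < 0) contribute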

Acknowledgements


This project is supported by the Amazon Illinois AICE Center Research Grant. Hao-Yu Hsu is supported by the Amazon AI PhD Scholarship. We thank the NCSA for providing computing resources. We also thank Prof. Romit Roy Choudhury, Dr. Robinson Piramuthu, and Dr. Gunnar Sigurdsson for their helpful discussions.

BibTeX

@article{hsu2026imu4d,
  title={Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs},
  author={Hsu, Hao-Yu and Cheng, Tianhang and Wen, Jing and Schwing, Alexander G and Wang, Shenlong},
  journal={arXiv preprint},
  year={2026}
}