In-N-On

Scaling Egocentric manipulation with in-the-wild and on-task data

Xiongyi Cai*, Ri-Zhao Qiu*†, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, Xiaolong Wang

(* equal contribution, † project lead)

University of California, San Diego

In-N-On is a training recipe that splits egocentric human data into two categories, in-the-wild and on-task, enabling zero-shot language following, few-shot learning, and improved robustness through targeted on-task data.

[pdf] [arxiv] [code] [model] [data]

Abstract.

Egocentric videos are a valuable and scalable data source for learning manipulation policies. However, due to significant data heterogeneity, most existing approaches use human data for simple pre-training, which does not unlock its full potential. This paper provides a recipe for collecting and using egocentric data by dividing human data into two categories: in-the-wild and on-task. We first curate a dataset, PH^SD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. We further adopt domain adaptation techniques to bridge the gap between humans and humanoids. Empirically, we show that Human0 achieves several novel properties, including following language instructions seen only in human data, few-shot learning, and improved robustness using on-task data.
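As a rough illustration of the flow matching objective named in the abstract, the sketch below uses a straight-line path from Gaussian noise to a target action vector. The choice of a linear path, the function names (interpolate, target_velocity, flow_matching_loss, sample_action), and the Euler sampler are assumptions made for illustration; the page does not specify which flow matching variant Human0 uses, and language and image conditioning are omitted here.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Velocity of the straight path; it does not depend on t."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predict_velocity, action, rng):
    """Mean squared error between predicted and target velocity at a random t.

    predict_velocity(x_t, t) is a hypothetical stand-in for the policy network.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)

def sample_action(predict_velocity, dim, rng, steps=10):
    """Euler-integrate the velocity field from noise (t=0) toward t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real policy, predict_velocity would be a network that also receives the egocentric observation and the language instruction; here it is left as a plain callable so the objective and the sampler can be read in isolation.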

Egocentric Data.

The egocentric dataset PH^SD contains over 20 hours of on-task data, collected on H1 and G1 humanoids as well as from Aria and Apple Vision Pro devices.

Egocentric Data w/ keypoints.

Aria Glasses

Apple Vision Pro

Retargeting from Egocentric Data.

Real-World Demo.

1. Zero-shot Language Following.

Seen language instruction

Unseen language instruction

Seen language instruction

Unseen language instruction

Seen language instruction

Unseen language instruction

2. Improved On-task Performance.

Seen language instruction

Unseen language instruction

3. 1-shot learning.

4. Object generalization.

Approach.

Figure 1: Human0 leverages large-scale egocentric human-humanoid data from the PH^SD dataset for pre-training and post-training, enabling strong instruction following on unseen tasks, few-shot execution, and improved on-task performance.

Figure 2: Our two-stage training pipeline pre-trains on large-scale in-the-wild human and robot data and post-trains on task-aligned demonstrations, using a domain-adversarial discriminator to learn embodiment-invariant representations for effective human-to-robot transfer.
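One common way to train a domain-adversarial discriminator is a gradient reversal layer: the discriminator minimizes a human-vs-robot classification loss, while the encoder receives the negated gradient and is pushed toward features the discriminator cannot separate. The sketch below shows this for a single scalar feature. It is an assumption that a gradient reversal layer is the mechanism used here (the caption only names a domain-adversarial discriminator), and the scalar encoder, the 0/1 domain labels, and the name adversarial_grads are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adversarial_grads(w, u, x, domain, lam=1.0):
    """Gradients for one sample with a gradient reversal layer.

    w: scalar encoder weight, feature = w * x (stand-in for a real encoder)
    u: scalar discriminator weight, logit = u * feature
    domain: 0.0 for human, 1.0 for robot (hypothetical labeling)
    lam: strength of the reversed gradient
    Returns (loss, grad_u, grad_w); grad_w already has its sign flipped.
    """
    feature = w * x
    p = sigmoid(u * feature)
    eps = 1e-12  # guards against log(0)
    loss = -(domain * math.log(p + eps)
             + (1.0 - domain) * math.log(1.0 - p + eps))
    dlogit = p - domain               # d(loss)/d(logit) for binary cross-entropy
    grad_u = dlogit * feature         # discriminator: descend to separate domains
    grad_feature = dlogit * u         # gradient arriving at the encoder output
    grad_w = -lam * grad_feature * x  # reversal: encoder is pushed to confuse it
    return loss, grad_u, grad_w
```

Applying a plain gradient descent step to both weights then improves the discriminator while moving the encoder in the direction that raises the discriminator's loss. In a full pipeline the encoder would also receive the gradient of the policy loss, which this sketch leaves out.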

BibTeX

@article{InNOn2025,
  title   = {In-N-On: Scaling Egocentric manipulation with in-the-wild and on-task data},
  author  = {Cai, Xiongyi and Qiu, Ri-Zhao and Chen, Geng and Wei, Lai and Liu, Isabella and Huang, Tianshu and Cheng, Xuxin and Wang, Xiaolong},
  journal = {},
  year    = {2025}
}
