GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

1Toyota Research Institute, 2UC Berkeley, 3Princeton University

GHIL-Glue provides a simple way to effectively "glue together" language-conditioned video prediction models with low-level control policies for hierarchical imitation learning, significantly improving the performance of existing methods.

Abstract

Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. GHIL-Glue achieves a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.

GHIL-Glue generalizes zero-shot to unseen robot table-top manipulation tasks and shows robustness to distractor objects in cluttered scenes. The top frames show subgoal images generated by the video prediction model, and the bottom frames show the RGB observations obtained by executing the low-level goal-reaching policy.

Hierarchical Control

GHIL-Glue can be applied to existing hierarchical imitation learning methods that use video prediction models. GHIL-Glue filters out generated subgoal images that do not make task progress and trains low-level goal reaching policies to be robust to hallucinated artifacts in generated subgoal images.
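The subgoal-filtering idea can be sketched in a few lines: sample several candidate subgoal images from the video prediction model, score each with a learned task-progress classifier, and hand only the best-scoring candidate to the low-level policy. The sketch below is illustrative, not the paper's implementation; `score_fn` stands in for a hypothetical progress classifier, and the function names are our own.

```python
import numpy as np

def select_subgoal(candidates, score_fn):
    """Pick the candidate subgoal image with the highest task-progress score.

    candidates: list of H x W x 3 image arrays sampled from a
        language-conditioned video prediction model.
    score_fn: hypothetical classifier mapping an image to a scalar
        estimate of task progress (higher is better).
    """
    scores = [score_fn(img) for img in candidates]
    # Keep only the most promising subgoal; the rest are filtered out.
    return candidates[int(np.argmax(scores))]
```

In practice the scoring model would be conditioned on the current observation and the language instruction as well, but the filtering step reduces to this argmax over candidate subgoals.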


Improving subgoal quality

GHIL-Glue reduces dithering behavior in hierarchical policy methods and makes policies robust to hallucinated artifacts.


Zero-shot generalization

Applying GHIL-Glue to existing hierarchical policy methods achieves state-of-the-art zero-shot generalization performance on both real and simulated robotics benchmarks.


Quantitative Results

GHIL-Glue achieves state-of-the-art performance on the CALVIN benchmark as well as in 5 physical environments based on the Bridge V2 robot platform.


Appendix

For additional ablations and implementation details, please see the Appendix.

BibTeX

@article{hatch2024ghilglue,
  author    = {Hatch, Kyle B. and Balakrishna, Ashwin and Mees, Oier and Nair, Suraj and Park, Seohong and Wulfe, Blake and Itkina, Masha and Eysenbach, Benjamin and Levine, Sergey and Kollar, Thomas and Burchfiel, Benjamin},
  title     = {GHIL-Glue: Hierarchical Control with Filtered Subgoal Images},
  journal   = {Under Review},
  year      = {2024},
}