GHIL-Glue: Hierarchical Control with Filtered Subgoal Images

1Toyota Research Institute, 2UC Berkeley, 3Princeton University

GHIL-Glue provides a simple way to effectively "glue together" language-conditioned video prediction models with low-level control policies for hierarchical imitation learning, significantly improving the performance of existing methods.

Abstract

Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. GHIL-Glue achieves a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.

GHIL-Glue generalizes zero-shot to unseen robot table-top manipulation tasks and shows robustness to distractor objects in cluttered scenes. The top frames show subgoal images generated by the video prediction model, and the bottom frames show the RGB observations obtained by executing the low-level goal-reaching policy.

Hierarchical Control

GHIL-Glue can be applied to existing hierarchical imitation learning methods that use video prediction models. GHIL-Glue filters out generated subgoal images that do not make task progress and trains low-level goal reaching policies to be robust to hallucinated artifacts in generated subgoal images.
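The subgoal-filtering idea can be sketched in a few lines: sample several candidate subgoal images from the video prediction model, score each with a learned task-progress classifier, and hand only the best-scoring candidate to the low-level policy. The sketch below is illustrative, not the paper's implementation; `score_fn` stands in for a hypothetical progress classifier, and the function names are our own.

```python
import numpy as np

def select_subgoal(candidates, score_fn):
    """Pick the candidate subgoal image with the highest task-progress score.

    candidates: list of H x W x 3 image arrays sampled from a
        language-conditioned video prediction model.
    score_fn: hypothetical classifier mapping an image to a scalar
        estimate of task progress (higher is better).
    """
    scores = [score_fn(img) for img in candidates]
    # Keep only the most promising subgoal; the rest are filtered out.
    return candidates[int(np.argmax(scores))]
```

In practice the scoring model would be conditioned on the current observation and the language instruction as well, but the filtering step reduces to this argmax over candidate subgoals.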


Improving subgoal quality

GHIL-Glue reduces dithering behavior in hierarchical policy methods and makes policies robust to hallucinated artifacts.


Zero-shot generalization

Applying GHIL-Glue to existing hierarchical policy methods achieves state-of-the-art zero-shot generalization performance on both real and simulated robotics benchmarks.


Quantitative Results

GHIL-Glue achieves state-of-the-art performance on the CALVIN benchmark as well as in 5 physical environments based on the Bridge V2 robot platform.


Appendix

For additional ablations and implementation details, please see the Appendix.

BibTeX

@article{hatch2024ghilglue,
  author    = {Hatch, Kyle B. and Balakrishna, Ashwin and Mees, Oier and Nair, Suraj and Park, Seohong and Wulfe, Blake and Itkina, Masha and Eysenbach, Benjamin and Levine, Sergey and Kollar, Thomas and Burchfiel, Benjamin},
  title     = {GHIL-Glue: Hierarchical Control with Filtered Subgoal Images},
  journal   = {Under Review},
  year      = {2024},
}