The robot is tasked with following a language instruction in an unseen environment to reach a final target (left). We introduce Language-Inferred Factor Graph for Instruction Following (LIFGIF) (right), a novel method that uses an LLM to infer a factor graph from the instruction offline, then optimizes the graph against actual observations at every step online during navigation. The robot selects the next waypoint from the inferred graph and moves toward it using a local planner.
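For readers who want the control flow at a glance, here is a toy, self-contained sketch of that offline-then-online loop. Every class, function name, and number below is a hypothetical stand-in for the components named above (the LLM inference, the per-step optimization, and the local planner), not LIFGIF's actual API.

```python
# Toy sketch of the LIFGIF loop: infer a graph from the instruction offline,
# then re-optimize it and chase waypoints online. All names are illustrative.
import numpy as np

class InferredGraph:
    """Stub for the LLM-inferred graph: an ordered list of waypoint estimates."""
    def __init__(self, waypoint_estimates):
        self.waypoints = [np.asarray(w, dtype=float) for w in waypoint_estimates]
        self.idx = 0

    def optimize(self, detections):
        # Stand-in for per-step factor-graph optimization: replace the active
        # waypoint's coarse language-derived estimate with a matching detection.
        for label, position in detections:
            if label == self.idx:
                self.waypoints[self.idx] = np.asarray(position, dtype=float)

    def current_waypoint(self):
        return self.waypoints[self.idx]

    def advance_if_reached(self, robot_pos, tol=0.3):
        # Move on to the next waypoint once close enough; True when finished.
        if np.linalg.norm(robot_pos - self.current_waypoint()) < tol:
            self.idx += 1
        return self.idx >= len(self.waypoints)

def follow(graph, robot_pos, sense, step=0.5):
    """Drive toward the graph's current waypoint until the last one is reached."""
    done = False
    while not done:
        graph.optimize(sense(robot_pos))          # online graph refinement
        direction = graph.current_waypoint() - robot_pos
        robot_pos = robot_pos + step * direction / max(np.linalg.norm(direction), 1e-9)
        done = graph.advance_if_reached(robot_pos)
    return robot_pos

# Toy run: two true landmark positions; "sensing" reveals any within 2.5 m.
truth = [np.array([2.0, 0.0]), np.array([2.0, 2.0])]
graph = InferredGraph([[1.5, 0.0], [1.5, 1.5]])  # coarse language-derived guesses
sense = lambda pos: [(i, p) for i, p in enumerate(truth)
                     if np.linalg.norm(p - pos) < 2.5]
print(follow(graph, np.array([0.0, 0.0]), sense))  # ends near [2., 2.]
```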

Abstract

Large-scale scenes such as multi-floor homes can be robustly and efficiently mapped with a 3D graph of landmarks estimated jointly with robot poses in a factor graph, a technique commonly used in commercial robots such as drones and robot vacuums. In this work, we propose Language-Inferred Factor Graph for Instruction Following (LIFGIF), a zero-shot method to ground natural language instructions in such a map. LIFGIF also includes a policy for following natural language navigation instructions in a novel environment while the map is constructed, enabling robust navigation performance in the physical world. To evaluate LIFGIF, we present Object-Centric VLN (OC-VLN), a new dataset for evaluating the grounding of object-centric natural language navigation instructions. We compare against two state-of-the-art zero-shot baselines from the related tasks of Object Goal Navigation and Vision-and-Language Navigation, demonstrating that LIFGIF outperforms both across all our evaluation metrics on OC-VLN. Finally, we successfully demonstrate the effectiveness of LIFGIF for zero-shot object-centric instruction following in the real world on a Boston Dynamics Spot robot.

Method

Our LIFGIF method stems from a key insight: language instructions not only guide the robot's navigation but also encode crucial spatial information about the environment's layout. Even before making any observations, these instructions give the robot a preliminary, albeit highly uncertain, map of the environment. For instance, the instruction "Move forward until you see a chair" implies the presence of a chair somewhere along the forward direction (x-axis) relative to the robot's current position, even though the exact distance to the chair is unknown. By representing this spatial information as a factor graph, we can integrate it as a prior into a traditional factor-graph-based SLAM system. As the robot observes landmarks mentioned in the instructions, the uncertainty associated with those landmarks shrinks substantially, helping the robot localize itself within the context of the instructions during navigation. This effectively bridges the gap between linguistic guidance and spatial awareness.
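To make this concrete, here is a minimal sketch (not the paper's implementation) of how such a language-derived prior can be encoded with the GTSAM factor-graph library: the chair from the instruction above becomes a loose 2D point prior ahead of the robot, and a single simulated detection then tightens it. The landmark coordinates, noise sigmas, and bearing-range measurement are illustrative assumptions.

```python
# Minimal GTSAM sketch: a loose language-derived landmark prior, tightened by
# one observation. Numbers and noise models are illustrative, not the paper's.
import numpy as np
import gtsam
from gtsam.symbol_shorthand import X, L  # X: robot poses, L: landmarks

graph = gtsam.NonlinearFactorGraph()

# Anchor the starting pose at the origin.
pose_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
graph.add(gtsam.PriorFactorPose2(X(0), gtsam.Pose2(0, 0, 0), pose_noise))

# Language prior: "a chair somewhere ahead" -> a point prior centered a few
# meters along +x, very uncertain in x (unknown distance), tighter laterally.
lang_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([5.0, 1.0]))
graph.add(gtsam.PriorFactorPoint2(L(0), gtsam.Point2(3.0, 0.0), lang_noise))

# Later, an actual detection of the chair from pose X(0): a bearing-range
# measurement with much lower noise than the language prior.
obs_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05, 0.2]))
graph.add(gtsam.BearingRangeFactor2D(X(0), L(0),
                                     gtsam.Rot2.fromDegrees(2.0), 4.1,
                                     obs_noise))

initial = gtsam.Values()
initial.insert(X(0), gtsam.Pose2(0, 0, 0))
initial.insert(L(0), gtsam.Point2(3.0, 0.0))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
marginals = gtsam.Marginals(graph, result)
print("chair estimate:", result.atPoint2(L(0)))
print("chair covariance:\n", marginals.marginalCovariance(L(0)))
```

Comparing the printed marginal covariance with and without the bearing-range factor shows the effect described above: the landmark's uncertainty collapses from meters to centimeters once it is actually observed, while the language prior alone still constrains it enough to guide the search.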


Robot demonstrations on a Boston Dynamics Spot


"Move forward to the bicycle. Turn right, then move to the chair. Turn left, and stop near the potted plant."

"Move to the bicycle, turn left. move forward and stop at the tv."

"Move forward to the chair. Turn right and move towards the bench. Stop there."

Evaluating in simulation


"Move forward to the brown couch. Stop at the fireplace."