
Existing Vision-Language Models (VLMs) produce long-horizon trajectory waypoints or direct control actions based on their perception input and a language prompt. However, these VLMs are not explicitly aware of the constraints imposed by the scene or the kinematics of the vehicle, so the generated trajectories or control inputs are often unsafe and/or infeasible. In this paper, we introduce LeGo-Drive†, which addresses these issues. Our key idea is to use the VLM only to predict a goal location from the given language command and perception input; this goal is then fed to a downstream differentiable trajectory optimizer with learnable components. We train the VLM and the trajectory optimizer end-to-end using a loss function that captures the ego-vehicle's ability to reach the predicted goal while satisfying safety and kinematic constraints. During back-propagation, the gradients flow through the optimization layer, making the VLM aware of the planner's capabilities and leading to more feasible goal predictions. We compare our end-to-end approach with a decoupled framework, in which the planner is used only at inference time to drive to the VLM-predicted goal location, and report a goal-reaching success rate of 81%. We demonstrate the versatility of LeGo-Drive† across diverse driving scenarios and navigation commands, highlighting its potential for practical deployment in autonomous vehicles.
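To make the end-to-end coupling concrete, the following is a minimal PyTorch-style sketch of one training step. The module names `vlm`, `planner`, and the helper `constraint_residuals` are hypothetical placeholders, and the unweighted loss sum is illustrative; this is a sketch of the idea, not the paper's exact implementation.

```python
import torch

def training_step(vlm, planner, image, command, optimizer):
    # The VLM predicts a 2D goal location from the camera image
    # and the language command (shapes assumed: goal is (B, 2)).
    goal = vlm(image, command)

    # The differentiable planner unrolls its inner optimization to
    # produce a trajectory toward the goal (assumed shape (B, T, 2));
    # because it is a differentiable layer, gradients pass through it.
    trajectory = planner(goal)

    # Goal-reaching term: distance of the trajectory endpoint to the goal.
    goal_loss = torch.norm(trajectory[:, -1] - goal, dim=-1).mean()

    # Feasibility term: assumed planner output summarizing collision
    # and kinematic constraint violations along the trajectory.
    residual_loss = planner.constraint_residuals(trajectory).mean()

    loss = goal_loss + residual_loss
    optimizer.zero_grad()
    loss.backward()   # gradients reach the VLM through the planner layer
    optimizer.step()
    return loss.item()
```

The key design point this sketch captures is that the VLM's goal head is supervised through the planner: a goal that the optimizer cannot reach safely produces a large loss, pushing the VLM toward feasible predictions.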
Given the front-view RGB image, we employ GPT-4V [29] to determine the best course of action from a set of candidate driving maneuvers. GPT-4V correctly identifies an obstruction ahead and recommends switching to the left lane to continue moving forward. Given this recommended command, our pipeline predicts a collision-free goal point and an optimized trajectory.
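A minimal sketch of such a maneuver-selection query is shown below, using the OpenAI Python SDK's chat-completions interface with an image input; the model name, prompt wording, and maneuver list are illustrative assumptions, not the exact ones used in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment
MANEUVERS = ["keep lane", "switch to left lane",
             "switch to right lane", "stop"]

def recommend_maneuver(image_path: str) -> str:
    # Encode the front-view RGB image as a base64 data URL.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any GPT-4V-class vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "You are driving. Choose the best maneuver from: "
                         + ", ".join(MANEUVERS)
                         + ". Answer with exactly one option."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # The returned text (e.g. "switch to left lane") becomes the
    # language command fed to the goal-prediction pipeline.
    return response.choices[0].message.content.strip()
```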
A compound command such as "Turn right and stop by the food stall on the right" is decomposed into atomic commands: "Turn right" and "Stop by the food stall on the right".
These atomic commands are then executed sequentially.
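An illustrative sketch of this decompose-then-execute loop is given below. The conjunction-based split and the `drive_to_goal` callback are hypothetical simplifications; the split only conveys the idea of breaking a compound command into sequentially executed atomic ones.

```python
def decompose(command: str) -> list[str]:
    # Split on the coordinating conjunction "and" into atomic commands,
    # capitalizing each fragment for readability.
    parts = [p.strip() for p in command.split(" and ")]
    return [p[0].upper() + p[1:] for p in parts if p]

def execute_sequentially(command: str, drive_to_goal) -> None:
    # `drive_to_goal` is a hypothetical callback that predicts a goal
    # for one atomic command and runs the planner until it is reached.
    for atomic in decompose(command):
        drive_to_goal(atomic)

print(decompose("Turn right and stop by the food stall on the right"))
# ['Turn right', 'Stop by the food stall on the right']
```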