Bidirectional Decoding

Improving Action Chunking via Closed-Loop Resampling

Stanford University

Abstract

Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. However, its effects on learned policies remain puzzling: some studies highlight its importance for achieving strong performance, while others observe detrimental effects. In this paper, we first dissect the role of action chunking by analyzing the divergence between the learner and the demonstrator. We find that longer action chunks enable a policy to better capture temporal dependencies by taking into account more past states and actions within the chunk. However, this advantage comes at the cost of exacerbating errors in stochastic environments due to fewer observations of recent states. To address this, we propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop operations. BID samples multiple predictions at each time step and searches for the optimal one based on two criteria: (i) backward coherence, which favors samples aligned with previous decisions, and (ii) forward contrast, which favors samples close to outputs of a stronger policy and distant from those of a weaker policy. By coupling decisions within and across action chunks, BID enhances temporal consistency over extended sequences while enabling adaptive replanning in stochastic environments. Experimental results show that BID substantially outperforms conventional closed-loop operations of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.

Analysis

Why action chunking does not work in noisy/dynamic settings

In action chunking, the agent predicts the joint distribution of a sequence of actions conditioned on the context and then executes all or part of the sequence without replanning. In our analysis, we consider two models with equal context length \( c \) and investigate the effect of using a longer action horizon. Consider a shorter action chunk of length \( h \) and a longer one of length \( h + d \). The longer action chunk benefits from remembering more past states but suffers from not observing the most recent states, as illustrated below:

Gray shadows are observed contexts; darker indicates higher importance.
Hatched areas denote executed actions.

The expected losses of the two agents, \(\pi_h\) and \(\pi_{h+d}\), with respect to the expert \( \pi^* \) are related by the following inequality: $$ \alpha_f - \epsilon_b(1 - P_b^{2d}) \leq \min_{\pi_{h+d}} \mathbb{E}_{G} \left[ \mathcal{L}(\pi_{h+d}, \pi^*) | C \right] - \min_{\pi_{h}} \mathbb{E}_{G} \left[\mathcal{L}(\pi_{h}, \pi^*) | C \right] \leq - \alpha_b + \epsilon_f(1 - P_f^{2d}). $$ Intuitively, the advantage of each policy stems from the additional information it has access to (i.e., \( \alpha_f \) for \( \pi_h \) and \( \alpha_b \) for \( \pi_{h+d} \)), while the disadvantage is bounded by the maximum divergence arising from inferring the missing information incorrectly (i.e., \(\epsilon_b(1 - P_b^{2d})\) and \(\epsilon_f(1 - P_f^{2d})\)). Our theoretical analysis provides two important takeaways:

Takeaway 1: In a near-deterministic environment, the optimal longer action horizon policy outperforms the optimal shorter action horizon policy. More formally, if \(a_t\) is temporally dependent on at least one state in \(\{s_{t-h-c-d:t-h-c-1} \}\) and \(\epsilon_{f}\) is finite, $$ \min_{\pi_{h+d}} \mathbb{E}_{G} \left[ \mathcal{L}(\pi_{h+d}, \pi^*)| C \right] < \min_{\pi_{h}} \mathbb{E}_{G} \left[\mathcal{L}(\pi_{h}, \pi^*)| C \right]$$

Takeaway 2: In a highly stochastic environment, the optimal shorter action horizon policy outperforms the optimal longer action horizon policy. More formally, if temporal dependency decreases over the number of time steps, then $$ \min_{\pi_{h}} \mathbb{E}_{G} \left[\mathcal{L}(\pi_{h}, \pi^*)| C \right] < \min_{\pi_{h+d}} \mathbb{E}_{G} \left[ \mathcal{L}(\pi_{h+d}, \pi^*)| C \right] $$

Takeaway 3: This analysis motivates us to use closed-loop operations to maximize reactivity to environment stochasticity, while increasing temporal consistency through resampling techniques, as sketched below.
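To make the contrast concrete, here is a minimal Python sketch of vanilla open-loop chunking versus closed-loop operation with resampling. It is an illustration only, assuming a hypothetical policy.sample(obs, n) that returns n candidate action chunks of shape (n, h, action_dim) and an env whose step(action) returns (obs, done); the scoring criterion used to pick a chunk is left abstract here and made concrete in the Method section.

import numpy as np

def open_loop_rollout(env, policy, max_steps=300):
    """Vanilla action chunking: predict a chunk, execute it entirely, then replan."""
    obs = env.reset()
    for _ in range(max_steps):
        chunk = policy.sample(obs, n=1)[0]            # (h, action_dim)
        for action in chunk:                          # no intermediate replanning
            obs, done = env.step(action)
            if done:
                return obs
    return obs

def closed_loop_resampling_rollout(env, policy, score_fn, n=16, max_steps=300):
    """Closed-loop operation with resampling: replan every step, but couple decisions
    across steps by scoring each candidate chunk against the previously chosen plan."""
    obs = env.reset()
    prev_chunk = None
    for _ in range(max_steps):
        candidates = policy.sample(obs, n=n)                      # (n, h, action_dim)
        scores = [score_fn(c, prev_chunk) for c in candidates]    # lower is better
        prev_chunk = candidates[int(np.argmin(scores))]
        obs, done = env.step(prev_chunk[0])                       # execute only the first action
        if done:
            return obs
    return obs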

Method

Bidirectional Decoding

Our hypothesis suggests that while the probability of any pair of samples sharing the same latent strategy is low, the likelihood of finding a consistent pair among a large number of samples is significantly higher. This motivates us to solve the closed-loop action chunking problem by identifying the optimal action within a batch of plans at each time step, $$ a^* = \arg \min_{a \in \mathcal{A}} \mathcal{L}_B(a) + \mathcal{L}_F(a) $$ where \( \mathcal{L}_B \) and \( \mathcal{L}_F \) are two criteria measuring temporal dependency with respect to the backward decision and the forward plan. To ensure temporal coherence, we reference the action chunk selected at the previous time step, \( \{ \hat{a}_{t-1}, \cdots, \hat{a}_{t+h-2} \} \), and minimize the weighted sum of Euclidean distances across the \( h - 1 \) overlapping steps: $$ \mathcal{L}_B = \sum_{\tau=0}^{h-2} \rho^\tau \left\| a_{t+\tau} - {\hat a}_{t+\tau} \right\|_2. $$ This encourages consistent latent strategies across time steps, while allowing for gradual adaptation to unforeseen environment dynamics, as sketched below.
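For concreteness, here is a minimal numpy sketch of the backward coherence criterion under the indexing above. The (h, action_dim) array layout and the default decay rate rho are illustrative assumptions, not the reference implementation.

import numpy as np

def backward_coherence(candidate, prev_chunk, rho=0.5):
    """L_B: decay-weighted distance between a candidate chunk and the plan chosen at t-1.

    candidate:  (h, action_dim) chunk sampled at time t, covering steps t .. t+h-1
    prev_chunk: (h, action_dim) chunk selected at time t-1, covering steps t-1 .. t+h-2
    The two plans overlap on the h-1 steps t .. t+h-2.
    """
    if prev_chunk is None:                      # first time step: no previous decision
        return 0.0
    h = candidate.shape[0]
    diff = candidate[: h - 1] - prev_chunk[1:]  # align the overlapping steps t .. t+h-2
    weights = rho ** np.arange(h - 1)           # with rho < 1, sooner steps weigh more
    return float(np.sum(weights * np.linalg.norm(diff, axis=-1)))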
A robust policy should predict far enough ahead to capture temporal dependencies in demonstrations. To ensure this, we compare each candidate plan with two reference sets: one from a well-trained model as the stronger policy, and another from an early underfitted checkpoint or a model with a shorter prediction horizon as the weaker policy. The forward objective minimizes the average distance to positive samples from the strong policy and maximizes the average distance to negative samples from the weak policy: $$ \mathcal{L}_F = \frac{1}{N}\left(\sum_{a^{+} \in \mathcal{A}^{+}} \sum_{\tau=0}^{l} \left\| a^{(t)}_{t+\tau} - a^{+}_{t+\tau} \right\|_2 - \sum_{a^{-} \in \mathcal{A}^{-}} \sum_{\tau=0}^{l} \left\| a^{(t)}_{t+\tau} - a^{-}_{t+\tau} \right\|_2\right),$$ where \( \mathcal{A}^{+} = \mathcal{A} \setminus \{a\} \) is the positive set predicted by the strong policy \( \pi \) and \( \mathcal{A}^{-} \) is the negative set predicted by the weak policy \( \pi' \).
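Below is a matching sketch of the forward contrast term and the resulting selection rule, continuing the assumptions above (numpy arrays of shape (h, action_dim); weak_samples drawn from the weaker reference policy). Averaging within each reference set stands in for the 1/N normalization in the equation.

import numpy as np

def forward_contrast(candidate, strong_samples, weak_samples, l=None):
    """L_F: stay close to samples of the strong policy, far from samples of the weak policy.

    candidate:      (h, action_dim)
    strong_samples: (N_pos, h, action_dim), the positive set (all strong samples except the candidate)
    weak_samples:   (N_neg, h, action_dim), the negative set from the weak policy
    l:              number of leading steps to compare (defaults to the full horizon)
    """
    l = candidate.shape[0] if l is None else l
    pos = np.linalg.norm(candidate[None, :l] - strong_samples[:, :l], axis=-1).sum(1).mean()
    neg = np.linalg.norm(candidate[None, :l] - weak_samples[:, :l], axis=-1).sum(1).mean()
    return float(pos - neg)

def bid_select(candidates, prev_chunk, weak_samples, rho=0.5):
    """a* = argmin_a L_B(a) + L_F(a) over a batch of chunks sampled by the strong policy."""
    candidates = np.asarray(candidates)                   # (N, h, action_dim)
    scores = []
    for i, c in enumerate(candidates):
        strong_others = np.delete(candidates, i, axis=0)  # exclude the candidate from its own positive set
        scores.append(backward_coherence(c, prev_chunk, rho)
                      + forward_contrast(c, strong_others, weak_samples))
    return candidates[int(np.argmin(scores))]

In a rollout, bid_select plays the role of the abstract scoring step in the closed-loop sketch above: at each time step, sample a batch of chunks from the strong policy and a reference batch from the weak policy, select a*, execute its first action, and cache a* as prev_chunk for the next step.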

Real World Experiments: Dynamic Moving Objects

Vanilla Open-Loop

Cannot react to stochasticity in the environment: the gripper closes before reaching the cup.

Vanilla Closed-Loop

Reacts to stochasticity but cannot execute a long-term plan consistently, resulting in jittery behavior.

EMA Closed-Loop

Cannot react to environment stochasticity quickly enough and fails to grasp the cup firmly.

BID Closed-Loop

Reacts to environment stochasticity and carries out a consistent long-term plan.

We use a pretrained diffusion policy and compare the performance of BID with random sampling in open loop, random sampling in closed loop, and Exponential Moving Average (EMA) sampling. The first task is to pick up a moving cup whose initial position is fixed and place it on a nearby saucer. The cup is pulled with a string until the gripper grasps it. BID consistently outperforms the other methods, achieving over a 2x improvement in success rate.

Vanilla Open-Loop
(static environment)

BID Closed-Loop
(static environment)

Vanilla Open-Loop
(dynamic environment)

BID Closed-Loop
(dynamic environment)

Our next task for the robot is to drop a toy into a plastic cup. In the dynamic setting, this cup is moved by hand as the robot carries out the task. Notably, open-loop action decoding often struggles with precision even in the static setting. In both static and dynamic settings, BID achieves over a 2x improvement in success rate compared to open-loop random sampling.

Simulation Experiments: Stochastic Action Noises

In simulation, we evaluate BID on the Push-T, RoboMimic, and 4-Object Franka Kitchen tasks. Below, we provide sample behaviors of the four methods on the Push-T, Franka Kitchen, Square, and ToolHang tasks in a stochastic environment.

Vanilla Open-Loop

Fails to react to the stochasticity in the environment.

Vanilla Closed-Loop

Reacts to stochasticity but lacks a long-term plan.

EMA Closed-Loop

Fails to balance reactivity and long-term planning.

BID Closed-Loop

Reacts to the stochasticity and carries out a consistent long-term plan.

Vanilla Open-Loop

Fails to react to stochasticity, causing failure modes such as an inability to grab objects.

Vanilla Closed-Loop

Adapts to stochasticity but produces slow and jittery trajectories.

EMA Closed-Loop

Lacks long-term planning, e.g., aiming for one control knob but then switching to another midway. Adapts to stochasticity but produces slow and jittery trajectories.

BID Closed-Loop

Adapts to the stochasticity and carries out a consistent long-term plan. BID is also faster and smoother than EMA.

Vanilla Open-Loop

Fails to react to stochasticity, causing a lack of precision.

Vanilla Closed-Loop

Adapts to stochasticity but produces slow trajectories and imprecise actions.

EMA Closed-Loop

Adapts to stochasticity but produces slow trajectories and imprecise actions.

BID Closed-Loop

Adapts to the stochasticity and carries out a consistent long-term plan.

Vanilla Open-Loop

Fails to react to stochasticity, causing it to get stuck in a failure mode.

Vanilla Closed-Loop

Adapts to stochasticity but produces slow and jittery trajectories.

EMA Closed-Loop

Cannot adapt to stochasticity quickly enough, repeatedly failing to grasp the square properly.

BID Closed-Loop

Adapts to the stochasticity and carries out a consistent long-term plan. BID is also faster and smoother than other methods.

Results

We evaluate BID on the Push-T, RoboMimic, and 4-Object Franka Kitchen tasks. While existing inference methods offer some benefits for closed-loop operations, they either lack robustness or are highly sensitive to the decay rate. BID consistently achieves substantial gains across all tasks, surpassing the vanilla baseline by over 32% in relative improvement.

Comparison of methods

We also evaluate two critical properties of BID: scalability with increasing batch sizes and compatibility with existing inference methods. Notably, BID benefits significantly from large batch sizes, demonstrating strong potential for test-time scaling. Moreover, when combined with EMA, BID boosts the relative performance gain from 32% to 46%, exhibiting a complementary effect with existing methods.

Scalability with increasing batch sizes and compatibility with existing inference methods

In the real world, we first consider a task in which the robot must deliver an object held in its gripper into a cup held by a human. This task mirrors real-world scenarios in which robots interact with a dynamic environment, accommodating moving objects and agents. In the second task, the robot must pick up a moving cup and place it on a nearby saucer. BID achieves over a 2x improvement in success rate compared to all other methods in the stochastic setting, while matching the performance of the best alternative in the static one.

Real-world experiment results

BibTeX

@article{liu2024bidirectional,
  author    = {Liu, Yuejiang and Hamid, Jubayer Ibn and Xie, Annie and Lee, Yoonho and Du, Max and Finn, Chelsea},
  title     = {Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling},
  journal   = {arXiv preprint arXiv:2408.17355},
  year      = {2024},
}