ACL 2026

Libra-VLA: Achieving Learning Equilibrium
via Asynchronous Coarse-to-Fine Dual-System

Yifei Wei1,2  Linqing Zhong1,2  Yi Liu2  Yuxiang Lu2  Xindong He2  Maoqing Yao2*  Guanghui Ren2*
1Beihang University  2AgiBot  * Corresponding authors
Paper (coming soon) arXiv (coming soon)
Comparison of action generation paradigms. (a) Discrete autoregressive approaches discretize actions into massive bins. (b) Continuous diffusion approaches directly predict continuous signals. (c) Our proposed Libra-VLA operates in a hybrid action space, where discrete coarse bins representing macro-intents serve as anchors for continuous fine actions, naturally aligning with the inherent hierarchical characteristics.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation by grounding high-level semantic instructions into executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, attempting to map high-level visual-linguistic features directly to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, where complex actions can be naturally modeled in a Hybrid Action Space, decomposing into discrete macro-directional reaching and continuous micro-pose alignment. Bypassing this structural hierarchy widens the semantic-actuation gap, forcing models to close it through cross-modal alignment learned from massive-scale robotic pre-training.

To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. Our core insight is twofold: we explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy. Specifically, our architecture integrates: (1) a Semantic Planner, which focuses on macro-intent, predicting discrete action tokens that guide the robot's general direction, and (2) an Action Refiner, which conditions on the coarse intent to generate high-frequency continuous actions for precise pose alignment.

Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to the granularity of action decomposition. We demonstrate that Libra-VLA achieves its best performance exactly when the learning difficulty is balanced between the two sub-systems. Combined with the asynchronous design, our approach offers a scalable, data-efficient, and responsive solution for open-world manipulation.


Method

Libra-VLA Architecture
Architectural Overview of Libra-VLA. (Left) System 2: Semantic Planner runs at low frequency, predicting discrete macro-directional intents. (Right) System 1: Action Refiner runs at high frequency, synthesizing continuous micro-pose alignments. The two systems are bridged via an asynchronous execution strategy managed by an intent buffer.

We factorize the joint policy distribution into two conditional stages. The Semantic Planner (System 2) leverages a pre-trained VLM to predict coarse directional tokens via a Parallel Coarse-Action Head, trained with Cross-Entropy loss. To align physical control with the VLM's semantic space, we discretize normalized continuous actions into N uniform bins. These discrete tokens serve as coarse actions representing macro-directional intents rather than precise kinematics, naturally aligning with the VLM's semantic reasoning capabilities.
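As a concrete illustration, the uniform binning can be sketched as follows. This is a minimal sketch: the bin count N=10, the [-1, 1] normalization range, and the function names are assumptions for illustration, not the released implementation.

```python
import numpy as np

def to_coarse_tokens(actions, n_bins=10):
    """Quantize normalized continuous actions in [-1, 1] into N uniform bins.

    The resulting integer tokens stand in for the coarse macro-intents
    predicted by the Semantic Planner (bin count and range are assumptions).
    """
    idx = np.floor((np.clip(actions, -1.0, 1.0) + 1.0) / 2.0 * n_bins)
    return np.clip(idx, 0, n_bins - 1).astype(np.int64)

def to_bin_center(tokens, n_bins=10):
    """Decode a coarse token to its bin center, the anchor for fine refinement."""
    return (np.asarray(tokens) + 0.5) / n_bins * 2.0 - 1.0
```

With N=10, each token covers a 0.2-wide slice of the normalized range, so a token pins down only the macro-direction and leaves fine pose alignment to the refiner.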

The Action Refiner (System 1) operates as a conditional diffusion policy that refines the coarse intent into executable precise motions. To furnish fine-grained visual representations for precise actuation while achieving structural decoupling, the Fine-Action Head is augmented with an independent visual encoder to extract geometric features. The Fine-Action Head conditions on the composite input of noisy actions, geometric features, and macro-intent embeddings to iteratively reverse the diffusion process.
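The iterative refinement loop can be sketched abstractly as follows. This is a toy Euler-integration (flow-matching-style) stand-in for the conditional denoising process, not the paper's implementation; `velocity_net` abstracts the Fine-Action Head, and conditioning on geometric features is omitted for brevity.

```python
import numpy as np

def refine(coarse_anchor, velocity_net, steps=10, action_dim=7, seed=0):
    """Iteratively refine Gaussian noise into a continuous action,
    conditioned on the coarse macro-intent anchor (toy sketch)."""
    x = np.random.default_rng(seed).standard_normal(action_dim)  # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        # The real head also conditions on geometric features from its own encoder.
        x = x + dt * velocity_net(x, k * dt, coarse_anchor)
    return x
```

The key point is the conditioning signature: the network sees the noisy action, the step, and the macro-intent, so the coarse anchor constrains the search space the refiner has to cover.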

The asynchronous execution strategy further decouples inference costs: the Semantic Planner predicts an extended macro-horizon L_macro = M × H_chunk in a single pass, filling an intent buffer (a FIFO queue). The Action Refiner then operates at high frequency by consuming buffered intents, significantly reducing average inference latency.
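A minimal sketch of the buffer protocol (class and method names are hypothetical; the source specifies only a FIFO queue filled once per planner pass and drained each refiner cycle):

```python
from collections import deque

class IntentBuffer:
    """FIFO bridge between the low-frequency planner and high-frequency refiner."""

    def __init__(self):
        self.queue = deque()

    def fill(self, intents):
        """Semantic Planner: push M * H_chunk coarse intents in one pass."""
        self.queue.extend(intents)

    def pop_chunk(self, h_chunk):
        """Action Refiner: consume one chunk of intents per control cycle."""
        return [self.queue.popleft() for _ in range(min(h_chunk, len(self.queue)))]

    def needs_replan(self, h_chunk):
        """Trigger the planner again when fewer than one chunk remains."""
        return len(self.queue) < h_chunk
```

Because one planner pass is amortized over M chunks, the refiner blocks on the slow system only once every M control cycles.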


Experiments

LIBERO Benchmark

| Methods | Action Space | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| CoT-VLA | Discrete | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| WorldVLA | Discrete | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| OpenVLA | Discrete | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| π0-FAST | Discrete | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| DD-VLA | Discrete | 97.2 | 98.6 | 97.4 | 92.0 | 96.3 |
| Diffusion Policy | Continuous | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| Octo | Continuous | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| DreamVLA | Continuous | 97.5 | 94.0 | 89.5 | 89.5 | 92.6 |
| GO-1 | Continuous | 96.2 | 97.8 | 96.0 | 89.2 | 94.8 |
| GR00T-N1 | Continuous | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| F1 | Continuous | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 |
| GE-Act | Continuous | 98.2 | 97.6 | 95.8 | 94.4 | 96.5 |
| π0 | Continuous | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 |
| π0.5 | Continuous | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| Libra-VLA (Ours) | Hybrid | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 |
Best results in bold, second-best underlined.

LIBERO-Plus Benchmark

Zero-Shot Transfer

| Methods | Action Space | Camera | Robot | Language | Light | Background | Noise | Layout | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| WorldVLA | Discrete | 0.1 | 27.9 | 41.6 | 43.7 | 17.1 | 10.9 | 38.0 | 25.0 |
| OpenVLA | Discrete | 0.8 | 3.5 | 23.0 | 8.1 | 34.8 | 15.2 | 28.5 | 15.6 |
| NORA | Discrete | 2.2 | 37.0 | 65.1 | 45.7 | 58.6 | 12.8 | 62.1 | 39.0 |
| UniVLA | Continuous | 1.8 | 46.2 | 69.6 | 69.0 | 81.0 | 21.2 | 31.9 | 42.9 |
| π0-FAST | Discrete | 65.1 | 21.6 | 61.0 | 73.2 | 73.2 | 74.4 | 68.8 | 61.6 |
| OpenVLA-OFT | Continuous | 56.4 | 31.9 | 79.5 | 88.7 | 93.3 | 75.8 | 74.2 | 69.6 |
| Libra-VLA (Ours) | Hybrid | 60.7 | 47.8 | 88.5 | 97.3 | 94.5 | 94.9 | 79.9 | 79.1 |

Supervised Fine-Tuning

| Methods | Action Space | Camera | Robot | Language | Light | Background | Noise | Layout | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| π0* | Continuous | 79.6 | 21.1 | 72.5 | 84.7 | 86.2 | 68.3 | 69.4 | 67.4 |
| π0.5* | Continuous | 70.3 | 41.7 | 81.1 | 97.3 | 94.6 | 71.8 | 84.9 | 75.7 |
| OpenVLA-OFT+ | Continuous | 92.8 | 30.3 | 85.8 | 94.9 | 93.9 | 89.3 | 77.6 | 79.6 |
| Libra-VLA (Ours) | Hybrid | 94.5 | 41.8 | 83.2 | 95.3 | 94.3 | 93.7 | 75.3 | 82.3 |
* denotes results reproduced for fair comparison. Best results in bold, second-best underlined.

Ablation: Intent Granularity

Performance follows an inverted-U curve relative to bin granularity (N). The learning equilibrium is achieved where coarse tokens provide sufficient guidance without overwhelming the planner.

Figure: Learning equilibrium. Sweeping the coarse action bin number N from coarse to fine, the system degenerates toward pure diffusion when N is too small (the fast, continuous Action Refiner does all the work) and toward pure autoregression when N is too large (the slow, discrete Semantic Planner does all the work). At the Libra Point, learning difficulty is evenly distributed: the Semantic Planner provides sufficiently informative macro-directional guidance ("where to go"), while the Action Refiner focuses on precise local refinement ("how to interact"). Average success rate peaks at the Libra Point (N=10) for both Libra-VLA (Full) and Libra-Refinement; full numbers in the table below.
| VE | Refine | Bin (N) | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| ✗ | ✓ | 2 | 95.8 | 97.6 | 95.4 | 80.2 | 92.3 |
| ✗ | ✓ | 10 | 98.6 | 98.6 | 96.0 | 87.0 | 95.1 |
| ✗ | ✓ | 50 | 96.0 | 96.4 | 78.2 | 79.4 | 87.5 |
| ✗ | ✓ | 100 | 94.0 | 95.0 | 70.8 | 75.6 | 83.9 |
| ✓ | ✓ | 2 | 97.0 | 98.6 | 38.6 | 81.8 | 79.0 |
| ✓ | ✓ | 10 | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 |
| ✓ | ✓ | 50 | 96.8 | 98.4 | 95.0 | 89.2 | 94.9 |
| ✓ | ✓ | 100 | 95.4 | 96.8 | 92.8 | 90.4 | 93.9 |

Ablation: Architectural Effectiveness

The Coarse-to-Fine paradigm (Refine) is the primary performance driver, while the independent Visual Encoder (VE) provides further gains by decoupling fine-grained geometric features from the VLM's semantic representation.

| Model | VE | Refine | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| Libra-Base | ✗ | ✗ | 95.8 | 95.4 | 86.0 | 76.0 | 88.3 |
| Libra-VE | ✓ | ✗ | 94.8 | 94.6 | 69.2 | 89.4 | 87.0 |
| Libra-Refinement | ✗ | ✓ | 98.6 | 98.6 | 96.0 | 87.0 | 95.1 |
| Full (Ours) | ✓ | ✓ | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 |

Asynchronous Execution

Increasing the Horizon Expansion Factor (M) substantially reduces inference latency while maintaining high success rates, thanks to the spatial tolerance of coarse quantization. Baseline (Libra-Base) inference latency: 220 ms on an RTX 4090.

| Factor (M) | Spatial | Object | Goal | Long | Avg. | Latency (ms) | Reduction |
|---|---|---|---|---|---|---|---|
| 2 | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 | 122 | 44.5% |
| 3 | 97.6 | 99.2 | 94.2 | 93.8 | 96.2 | 112 | 49.1% |
| 4 | 97.4 | 99.8 | 93.2 | 93.8 | 96.1 | 107 | 51.4% |
| 5 | 97.8 | 98.8 | 92.0 | 92.4 | 95.3 | 104 | 52.7% |
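The reported latencies are consistent with a simple amortization model, average latency ≈ T_refine + T_plan / M, where one planner pass is shared across M refiner chunks. The component costs below (T_plan ≈ 60 ms, T_refine ≈ 92 ms) are inferred by fitting this model to the table above, not reported values.

```python
def amortized_latency(t_plan, t_refine, m):
    """Average per-chunk latency when one planner pass covers m refiner chunks."""
    return t_refine + t_plan / m

# Inferred (not reported) component costs in ms; the fit reproduces the table.
T_PLAN, T_REFINE = 60.0, 92.0
latencies = [round(amortized_latency(T_PLAN, T_REFINE, m)) for m in range(2, 6)]
# latencies matches the Latency column for M = 2..5.
```

The model also explains the diminishing returns: the planner term shrinks as 1/M, so latency asymptotes toward the refiner's own cost.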

Real-World Experiments

Evaluated on AgiBot G1 robot platform across three long-horizon tasks.

Real-World Tasks
Real-world tasks: Wipe Stain, Pour Water, Make Sandwich
| Methods | Wipe Stain | Pour Water | Make Sandwich | Avg. |
|---|---|---|---|---|
| GO-1 | 66.7 | 33.3 | 33.3 | 44.4 |
| π0 | 75.0 | 50.0 | 33.3 | 52.8 |
| Libra-VLA (Ours) | 75.0 | 83.3 | 50.0 | 69.4 |
Success rates (%) on real-world long-horizon tasks evaluated on AgiBot G1.

Citation

Coming soon.