ACL 2026

Libra-VLA: Achieving Learning Equilibrium
via Asynchronous Coarse-to-Fine Dual-System

Yifei Wei1,2  Linqing Zhong1,2  Yi Liu2  Yuxiang Lu2  Xindong He2  Maoqing Yao2*  Guanghui Ren2*
1Beihang University  2AgiBot  * Corresponding authors
Comparison of action generation paradigms. (a) Discrete autoregressive approaches discretize actions into a large number of bins. (b) Continuous diffusion approaches directly predict continuous signals. (c) Our proposed Libra-VLA operates in a hybrid action space, where discrete coarse bins representing macro-intents serve as anchors for continuous fine actions, naturally aligning with the inherent hierarchy of robotic manipulation.

Abstract

Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation: they ground high-level semantic instructions in executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, mapping visual-linguistic features directly to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, in which complex actions can be naturally modeled in a Hybrid Action Space that decomposes into discrete macro-directional reaching and continuous micro-pose alignment. Ignoring this hierarchy severely widens the semantic-actuation gap and imposes a heavy representational burden when grounding high-level semantics in continuous actions.

To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy.

The Semantic Planner predicts discrete action tokens capturing macro-directional intent, while the Action Refiner conditions on coarse intent to generate high-frequency continuous actions for precise alignment.

Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to action decomposition granularity, peaking exactly when the learning difficulty is balanced between the two sub-systems. With the asynchronous design, our approach offers a scalable, robust, and responsive solution for open-world manipulation.


Method

Libra-VLA Architecture
Architectural Overview of Libra-VLA. (Left) System 2: Semantic Planner runs at low frequency, predicting discrete macro-directional intents. (Right) System 1: Action Refiner runs at high frequency, synthesizing continuous micro-pose alignments. The two systems are bridged via an asynchronous execution strategy managed by an intent buffer.

We factorize the joint policy distribution into two conditional stages. The Semantic Planner (System 2) leverages a pre-trained VLM to predict coarse directional tokens via a Parallel Coarse-Action Head, trained with Cross-Entropy loss. To align physical control with the VLM's semantic space, we discretize normalized continuous actions into N uniform bins. These discrete tokens serve as coarse actions representing macro-directional intents rather than precise kinematics, naturally aligning with the VLM's semantic reasoning capabilities.
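The uniform binning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the normalization range [-1, 1] and the function names `discretize` and `bin_center` are assumptions introduced here, since the page only states that normalized actions are split into N uniform bins.

```python
def discretize(action, n_bins):
    """Map each normalized action dimension to a coarse bin index.

    Assumes every dimension is normalized to [-1, 1] (an assumption;
    the source only says the actions are normalized).
    """
    tokens = []
    for a in action:
        a = min(max(a, -1.0), 1.0)            # clamp out-of-range values
        idx = int((a + 1.0) / 2.0 * n_bins)   # uniform bins over [-1, 1]
        tokens.append(min(idx, n_bins - 1))   # a == 1.0 falls in the last bin
    return tokens


def bin_center(token, n_bins):
    """Return the center of a bin, i.e. the coarse anchor value it stands for."""
    width = 2.0 / n_bins
    return -1.0 + (token + 0.5) * width
```

With N = 10, each token covers an interval of width 0.2, so a coarse token pins down the direction of motion but leaves the exact value inside the bin to the Action Refiner.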

The Action Refiner (System 1) operates as a conditional diffusion policy that refines the coarse intent into executable precise motions. To furnish fine-grained visual representations for precise actuation while achieving structural decoupling, the Fine-Action Head is augmented with an independent visual encoder to extract geometric features. The Fine-Action Head conditions on the composite input of noisy actions, geometric features, and macro-intent embeddings to iteratively reverse the diffusion process.

The asynchronous execution strategy further decouples inference costs: the Semantic Planner predicts an extended macro-horizon L_macro = M × H_chunk in a single pass, filling an intent buffer (FIFO queue). The Action Refiner then operates at high frequency by consuming buffered intents, significantly reducing average inference latency.
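The buffering logic can be sketched with a FIFO queue. This is a simplified, single-threaded illustration under stated assumptions: the class name `AsyncController`, the callable signatures of `planner` and `refiner`, and the rule that each refiner call consumes exactly H_chunk intents are all hypothetical and not taken from the paper.

```python
from collections import deque


class AsyncController:
    """Toy intent buffer: the planner refills it rarely, the refiner drains it often."""

    def __init__(self, planner, refiner, m, h_chunk):
        self.planner = planner        # hypothetical: planner(obs, n) -> list of n coarse intents
        self.refiner = refiner        # hypothetical: refiner(obs, intents) -> list of fine actions
        self.h_chunk = h_chunk
        self.l_macro = m * h_chunk    # L_macro = M x H_chunk
        self.buffer = deque()         # FIFO intent buffer
        self.planner_calls = 0

    def step(self, obs):
        # Invoke the slow planner only when the buffer cannot serve a full chunk.
        if len(self.buffer) < self.h_chunk:
            self.buffer.extend(self.planner(obs, self.l_macro))
            self.planner_calls += 1
        # The fast refiner consumes the oldest H_chunk intents on every step.
        intents = [self.buffer.popleft() for _ in range(self.h_chunk)]
        return self.refiner(obs, intents)
```

In this sketch the planner runs once every M control steps, so its cost is amortized over M refiner calls; that amortization is the mechanism behind the latency reduction described above.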


Experiments

The model is trained from scratch. All results are obtained without large-scale robot-data pretraining.

LIBERO Benchmark

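| Methods | Action Space | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| CoT-VLA | Discrete | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| WorldVLA | Discrete | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |
| OpenVLA | Discrete | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| π0-FAST | Discrete | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| DD-VLA | Discrete | 97.2 | 98.6 | 97.4 | 92.0 | 96.3 |
| Diffusion Policy | Continuous | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| Octo | Continuous | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| DreamVLA | Continuous | 97.5 | 94.0 | 89.5 | 89.5 | 92.6 |
| GO-1 | Continuous | 96.2 | 97.8 | 96.0 | 89.2 | 94.8 |
| GR00T-N1 | Continuous | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| F1 | Continuous | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 |
| GE-Act | Continuous | 98.2 | 97.6 | 95.8 | 94.4 | 96.5 |
| π0 | Continuous | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 |
| π0.5 | Continuous | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| Libra-VLA (Ours) | Hybrid | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 |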
Best results in bold, second-best underlined.

LIBERO-Plus Benchmark

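| Methods | Action Space | Camera | Robot | Language | Light | Background | Noise | Layout | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Zero-Shot Transfer | | | | | | | | | |
| WorldVLA | Discrete | 0.1 | 27.9 | 41.6 | 43.7 | 17.1 | 10.9 | 38.0 | 25.0 |
| OpenVLA | Discrete | 0.8 | 3.5 | 23.0 | 8.1 | 34.8 | 15.2 | 28.5 | 15.6 |
| NORA | Discrete | 2.2 | 37.0 | 65.1 | 45.7 | 58.6 | 12.8 | 62.1 | 39.0 |
| UniVLA | Continuous | 1.8 | 46.2 | 69.6 | 69.0 | 81.0 | 21.2 | 31.9 | 42.9 |
| π0-FAST | Discrete | 65.1 | 21.6 | 61.0 | 73.2 | 73.2 | 74.4 | 68.8 | 61.6 |
| OpenVLA-OFT | Continuous | 56.4 | 31.9 | 79.5 | 88.7 | 93.3 | 75.8 | 74.2 | 69.6 |
| Libra-VLA (Ours) | Hybrid | 68.9 | 48.8 | 92.7 | 97.9 | 93.4 | 86.3 | 77.5 | 79.5 |
| Supervised Fine-Tuning | | | | | | | | | |
| π0* | Continuous | 79.6 | 21.1 | 72.5 | 84.7 | 86.2 | 68.3 | 69.4 | 67.4 |
| π0.5* | Continuous | 70.3 | 41.7 | 81.1 | 97.3 | 94.6 | 71.8 | 84.9 | 75.7 |
| OpenVLA-OFT+ | Continuous | 92.8 | 30.3 | 85.8 | 94.9 | 93.9 | 89.3 | 77.6 | 79.6 |
| Libra-VLA (Ours) | Hybrid | 94.5 | 41.8 | 83.2 | 95.3 | 94.3 | 93.7 | 75.3 | 82.3 |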
* denotes results reproduced for fair comparison. Best results in bold, second-best underlined.

Ablation: Intent Granularity

Performance follows an inverted-U curve relative to bin granularity (N), peaking at N=10 for both Libra-Refinement (95.1% average) and the full model (97.2% average). The learning equilibrium is achieved where coarse tokens provide sufficient guidance without overwhelming the planner.

Figure: Learning Equilibrium. The diagram balances the Action Refiner (System 1 · Fast · Continuous) against the Semantic Planner (System 2 · Slow · Discrete) along the intent-granularity axis (coarse action bin number), which runs from coarse to fine. At the coarse extreme the model degenerates to diffusion; at the fine extreme it degenerates to autoregression.
Libra Point: Learning difficulty is evenly distributed. The Semantic Planner provides sufficiently informative macro-directional guidance ("where to go"), while the Action Refiner focuses on precise local refinement ("how to interact").
Chart: Avg. Success Rate (%) versus Intent Granularity (N = 2, 10, 50, 100) for Libra-VLA (Full) and Libra-Refinement, with the Libra Point marked. The plotted values are listed in the table below.
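| VE | Refine | Bin (N) | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| ✗ | ✓ | 2 | 95.8 | 97.6 | 95.4 | 80.2 | 92.3 |
| ✗ | ✓ | 10 | 98.6 | 98.6 | 96.0 | 87.0 | 95.1 |
| ✗ | ✓ | 50 | 96.0 | 96.4 | 78.2 | 79.4 | 87.5 |
| ✗ | ✓ | 100 | 94.0 | 95.0 | 70.8 | 75.6 | 83.9 |
| ✓ | ✓ | 2 | 97.0 | 98.6 | 38.6 | 81.8 | 79.0 |
| ✓ | ✓ | 10 | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 |
| ✓ | ✓ | 50 | 96.8 | 98.4 | 95.0 | 89.2 | 94.9 |
| ✓ | ✓ | 100 | 95.4 | 96.8 | 92.8 | 90.4 | 93.9 |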

Ablation: Architectural Effectiveness

The Coarse-to-Fine paradigm (Refine) is the primary performance driver, while the independent Visual Encoder (VE) provides further gains through structure and feature decoupling.

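| Model | VE | Refine | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| Libra-Base | ✗ | ✗ | 95.8 | 95.4 | 86.0 | 76.0 | 88.3 |
| Libra-VE | ✓ | ✗ | 94.8 | 94.6 | 69.2 | 89.4 | 87.0 |
| Libra-Refinement | ✗ | ✓ | 98.6 | 98.6 | 96.0 | 87.0 | 95.1 |
| Full (Ours) | ✓ | ✓ | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 |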

Ablation: Training Strategy

Our dynamic curriculum strategy balances stable early-stage convergence and late-stage robustness. Pure Teacher Forcing leaves the refiner over-reliant on perfect coarse inputs, while No Teacher Forcing destabilizes early optimization with noisy planner predictions. The dynamic curriculum starts from ground-truth anchors and gradually switches to predicted ones, yielding the best performance across all suites.

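| Training Strategy | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Pure Teacher Forcing | 96.6 | 99.2 | 95.8 | 92.4 | 96.0 |
| No Teacher Forcing | 97.4 | 99.0 | 95.0 | 90.4 | 95.5 |
| Dynamic Curriculum (Ours) | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 |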
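One way to realize a curriculum that starts from ground-truth anchors and gradually switches to predicted ones is scheduled sampling with a decaying teacher-forcing probability. The sketch below is an assumption-laden illustration, not the authors' schedule: the linear ramp, the `ramp_steps` value, and the function names are hypothetical. Only the 5k-step switch point is taken from the training-convergence discussion on this page.

```python
import random


def teacher_forcing_prob(step, switch_start=5_000, ramp_steps=2_000):
    """Probability of feeding ground-truth coarse anchors to the refiner.

    switch_start follows the 5k-step switch mentioned on this page;
    the linear ramp and ramp_steps are hypothetical choices.
    """
    if step < switch_start:
        return 1.0
    return max(0.0, 1.0 - (step - switch_start) / ramp_steps)


def pick_anchor(gt_tokens, predicted_tokens, step, rng=random):
    """Choose which coarse anchors condition the refiner at this training step."""
    use_ground_truth = rng.random() < teacher_forcing_prob(step)
    return gt_tokens if use_ground_truth else predicted_tokens
```

Setting the probability to a constant 1.0 recovers Pure Teacher Forcing, and a constant 0.0 recovers No Teacher Forcing, the two baselines in the table above.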

Training Convergence

The coarse-to-fine workload decoupling substantially accelerates optimization. At one-third of the training budget (10k steps), Libra-VLA already surpasses the monolithic Libra-Base by 16.3 points on average (88.4% vs. 72.1%), with the largest gap on long-horizon tasks (82.2% vs. 54.4%). The continuous-action MSE loss curves below confirm this advantage: Libra-VLA's fine-action loss drops to ~0.01 while Libra-Base remains at ~0.07. The short-lived bump at step 5k corresponds to the dynamic curriculum switching from ground-truth to predicted coarse anchors; the refiner quickly recovers, exhibiting error-correction behavior.

| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Libra-Base | 81.0 | 86.8 | 66.0 | 54.4 | 72.1 |
| Libra-VLA (Ours) | 97.2 | 99.2 | 75.0 | 82.2 | 88.4 |
Intermediate success rates (%) at 10,000 training steps on LIBERO, averaged over 500 rollouts per suite.
Figure (a): Training loss, linear scale. Continuous-action MSE loss on LIBERO. Libra-VLA converges markedly faster than the monolithic baseline.
Figure (b): Training loss, log scale. Reveals the low-loss regime where the linear view saturates, and the transient rise at step 5k caused by the dynamic curriculum switch.
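The per-suite gaps at 10,000 steps can be recomputed directly from the intermediate success rates reported in this section. The snippet is a plain arithmetic check; the variable names are illustrative.

```python
# Success rates (%) at 10,000 training steps, copied from the table in this section.
libra_base = {"Spatial": 81.0, "Object": 86.8, "Goal": 66.0, "Long": 54.4, "Avg.": 72.1}
libra_vla = {"Spatial": 97.2, "Object": 99.2, "Goal": 75.0, "Long": 82.2, "Avg.": 88.4}

# Gap in percentage points, rounded to one decimal to match the table precision.
gaps = {suite: round(libra_vla[suite] - libra_base[suite], 1) for suite in libra_base}

# Suite (excluding the average column) where Libra-VLA leads by the most.
largest_gap_suite = max((s for s in gaps if s != "Avg."), key=gaps.get)
```

The average gap comes out at 16.3 points, and the long-horizon suite shows the widest margin at 27.8 points, consistent with the claim in the text.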

Asynchronous Execution

Increasing the Horizon Expansion Factor (M) substantially reduces inference latency, from 122 ms at M=2 to 104 ms at M=5, while the average success rate stays above 95%, thanks to the spatial tolerance of coarse quantization. Latency reductions are reported relative to the Libra-Base baseline, whose inference latency is 220 ms on an RTX 4090.

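| Factor (M) | Spatial | Object | Goal | Long | Avg. | Latency (ms) | Reduction |
|---|---|---|---|---|---|---|---|
| 2 | 98.6 | 99.4 | 98.0 | 92.8 | 97.2 | 122 | 44.5% |
| 3 | 97.6 | 99.2 | 94.2 | 93.8 | 96.2 | 112 | 49.1% |
| 4 | 97.4 | 99.8 | 93.2 | 93.8 | 96.1 | 107 | 51.4% |
| 5 | 97.8 | 98.8 | 92.0 | 92.4 | 95.3 | 104 | 52.7% |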
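The Reduction column is consistent with measuring each latency against the 220 ms Libra-Base baseline. The short check below reproduces it; the formula is inferred from the reported numbers rather than stated on the page, and the names are illustrative.

```python
BASELINE_MS = 220.0  # Libra-Base inference latency on RTX 4090, as reported above

# Latency (ms) for each Horizon Expansion Factor M, copied from the table.
latencies_ms = {2: 122.0, 3: 112.0, 4: 107.0, 5: 104.0}


def reduction_pct(latency_ms, baseline_ms=BASELINE_MS):
    """Relative latency reduction versus the baseline, in percent (one decimal)."""
    return round((1.0 - latency_ms / baseline_ms) * 100.0, 1)


reductions = {m: reduction_pct(ms) for m, ms in latencies_ms.items()}
```

Going from M=2 to M=5 saves a further 18 ms, so most of the gain is already realized at the smallest expansion factor.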

Real-World Experiments

We evaluate Libra-VLA on the AgiBot G1 robot platform across three long-horizon tasks.

Figure: Real-world tasks: Wipe Stain, Pour Water, Make Sandwich.
Chart: Success Rate (%) per task and on average for GO-1, π0, and Libra-VLA. The plotted values are listed in the table below.
| Methods | Wipe Stain | Pour Water | Make Sandwich | Avg. |
|---|---|---|---|---|
| GO-1 | 66.7 | 33.3 | 33.3 | 44.4 |
| π0 | 75.0 | 50.0 | 33.3 | 52.8 |
| Libra-VLA (Ours) | 75.0 | 83.3 | 50.0 | 69.4 |
Success rates (%) on real-world long-horizon tasks evaluated on AgiBot G1.

Citation

@misc{wei2026libravlaachievinglearningequilibrium,
      title={Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System},
      author={Yifei Wei and Linqing Zhong and Yi Liu and Yuxiang Lu and Xindong He and Maoqing Yao and Guanghui Ren},
      year={2026},
      eprint={2604.24921},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2604.24921},
}