Libra-VLA | Project Page

Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation: they ground high-level semantic instructions in executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, mapping visual-linguistic features directly to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, in which complex actions can be naturally modeled in a Hybrid Action Space that decomposes into discrete macro-directional reaching and continuous micro-pose alignment. Ignoring this hierarchy widens the semantic-actuation gap and imposes a heavy representational burden on grounding high-level semantics in continuous actions.

To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy.

The Semantic Planner predicts discrete action tokens capturing macro-directional intent, while the Action Refiner conditions on coarse intent to generate high-frequency continuous actions for precise alignment.
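The split between a discrete coarse intent and a continuous refinement can be illustrated with a minimal sketch. It assumes uniform binning of each action dimension over a normalized range [-1, 1]; the function names, the range, and the default of 10 bins are illustrative assumptions, not the authors' implementation. The bin index plays the role of the coarse token, and the offset from the bin center is what a refiner would still have to produce.

```python
import math

def decompose(action, num_bins=10, low=-1.0, high=1.0):
    """Split each action dimension into a coarse bin index and a fine residual.

    Hypothetical helper: uniform bins over [low, high] are an assumption.
    """
    width = (high - low) / num_bins
    tokens, residuals = [], []
    for value in action:
        # Coarse part: index of the uniform bin that contains the value.
        index = int(math.floor((value - low) / width))
        index = min(max(index, 0), num_bins - 1)
        # Fine part: offset from the bin center, at most half a bin width.
        center = low + (index + 0.5) * width
        tokens.append(index)
        residuals.append(value - center)
    return tokens, residuals

def recompose(tokens, residuals, num_bins=10, low=-1.0, high=1.0):
    """Rebuild the continuous action from bin indices and residuals."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width + r for t, r in zip(tokens, residuals)]

action = [0.37, -0.82, 0.05]
tokens, residuals = decompose(action)
restored = recompose(tokens, residuals)
```

With more bins the residual range shrinks while the token vocabulary grows, which is one way to picture how the bin count shifts difficulty between the two sub-systems.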

Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to action decomposition granularity, peaking exactly when the learning difficulty is balanced between the two sub-systems. With the asynchronous design, our approach offers a scalable, robust, and responsive solution for open-world manipulation.

Experiments

The model is trained from scratch. All results are obtained without large-scale robot-data pretraining.

LIBERO Benchmark

Methods	Action Space	Spatial	Object	Goal	Long	Avg.
CoT-VLA	Discrete	87.5	91.6	87.6	69.0	81.1
WorldVLA	Discrete	87.6	96.2	83.4	60.0	81.8
OpenVLA	Discrete	84.7	88.4	79.2	53.7	76.5
π₀-FAST	Discrete	96.4	96.8	88.6	60.2	85.5
DD-VLA	Discrete	97.2	98.6	97.4	92.0	96.3
Diffusion Policy	Continuous	78.3	92.5	68.3	50.5	72.4
Octo	Continuous	78.9	85.7	84.6	51.1	75.1
DreamVLA	Continuous	97.5	94.0	89.5	89.5	92.6
GO-1	Continuous	96.2	97.8	96.0	89.2	94.8
GR00T-N1	Continuous	94.4	97.6	93.0	90.6	93.9
F1	Continuous	98.2	97.8	95.4	91.3	95.7
GE-Act	Continuous	98.2	97.6	95.8	94.4	96.5
π₀	Continuous	96.8	98.8	95.8	85.2	94.1
π₀.₅	Continuous	98.8	98.2	98.0	92.4	96.9
Libra-VLA (Ours)	Hybrid	98.6	99.4	98.0	92.8	97.2

Best results in bold, second-best underlined.

LIBERO-Plus Benchmark

Methods	Action Space	Camera	Robot	Language	Light	Background	Noise	Layout	Avg.
Zero-Shot Transfer
WorldVLA	Discrete	0.1	27.9	41.6	43.7	17.1	10.9	38.0	25.0
OpenVLA	Discrete	0.8	3.5	23.0	8.1	34.8	15.2	28.5	15.6
NORA	Discrete	2.2	37.0	65.1	45.7	58.6	12.8	62.1	39.0
UniVLA	Continuous	1.8	46.2	69.6	69.0	81.0	21.2	31.9	42.9
π₀-FAST	Discrete	65.1	21.6	61.0	73.2	73.2	74.4	68.8	61.6
OpenVLA-OFT	Continuous	56.4	31.9	79.5	88.7	93.3	75.8	74.2	69.6
Libra-VLA (Ours)	Hybrid	68.9	48.8	92.7	97.9	93.4	86.3	77.5	79.5
Supervised Fine-Tuning
π₀*	Continuous	79.6	21.1	72.5	84.7	86.2	68.3	69.4	67.4
π₀.₅*	Continuous	70.3	41.7	81.1	97.3	94.6	71.8	84.9	75.7
OpenVLA-OFT+	Continuous	92.8	30.3	85.8	94.9	93.9	89.3	77.6	79.6
Libra-VLA (Ours)	Hybrid	94.5	41.8	83.2	95.3	94.3	93.7	75.3	82.3

* denotes results reproduced for fair comparison. Best results in bold, second-best underlined.

Ablation: Intent Granularity

Performance follows an inverted-U curve relative to bin granularity (N), peaking at N = 10 both with and without the independent Visual Encoder (VE). The learning equilibrium is reached where coarse tokens provide sufficient guidance without overwhelming the planner.

Figure: Learning Equilibrium between the Action Refiner (System 1 · Fast · Continuous) and the Semantic Planner (System 2 · Slow · Discrete). Axis: Coarse Action Bin Number (Intent Granularity, coarse to fine). Marked point: Libra Point.

Learning difficulty is evenly distributed. The Semantic Planner provides sufficiently informative macro-directional guidance ("where to go"), while the Action Refiner focuses on precise local refinement ("how to interact").

VE	Refine	Bin (N)	Spatial	Object	Goal	Long	Avg.
✗	✓	2	95.8	97.6	95.4	80.2	92.3
✗	✓	10	98.6	98.6	96.0	87.0	95.1
✗	✓	50	96.0	96.4	78.2	79.4	87.5
✗	✓	100	94.0	95.0	70.8	75.6	83.9
✓	✓	2	97.0	98.6	38.6	81.8	79.0
✓	✓	10	98.6	99.4	98.0	92.8	97.2
✓	✓	50	96.8	98.4	95.0	89.2	94.9
✓	✓	100	95.4	96.8	92.8	90.4	93.9
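The inverted-U shape can be checked directly against the average column of the table above. The snippet below is only a reading aid: the helper name is hypothetical, and the numbers are copied from the table rather than recomputed.

```python
# Average LIBERO success rate (%) per bin count N, copied from the table above.
avg_by_bins = {
    "without VE": {2: 92.3, 10: 95.1, 50: 87.5, 100: 83.9},
    "with VE": {2: 79.0, 10: 97.2, 50: 94.9, 100: 93.9},
}

def is_inverted_u(scores):
    """True if scores strictly rise to one interior peak and then strictly fall."""
    values = [scores[n] for n in sorted(scores)]
    peak = values.index(max(values))
    rising = all(a < b for a, b in zip(values[:peak], values[1:peak + 1]))
    falling = all(a > b for a, b in zip(values[peak:], values[peak + 1:]))
    return 0 < peak < len(values) - 1 and rising and falling

# Bin count with the highest average in each configuration.
best_bins = {name: max(scores, key=scores.get) for name, scores in avg_by_bins.items()}
shape_ok = {name: is_inverted_u(scores) for name, scores in avg_by_bins.items()}
```

Both configurations peak at N = 10 and decline on either side, although the drop at N = 2 with VE is driven mostly by the Goal suite (38.6).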

Ablation: Architectural Effectiveness

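The Coarse-to-Fine paradigm (Refine) is the primary performance driver, raising the average from 88.3 to 95.1. The independent Visual Encoder (VE) provides a further gain to 97.2 through structure and feature decoupling when combined with Refine; on its own it does not improve the average over Libra-Base (87.0 vs. 88.3).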

Model	VE	Refine	Spatial	Object	Goal	Long	Avg.
Libra-Base	✗	✗	95.8	95.4	86.0	76.0	88.3
Libra-VE	✓	✗	94.8	94.6	69.2	89.4	87.0
Libra-Refinement	✗	✓	98.6	98.6	96.0	87.0	95.1
Full (Ours)	✓	✓	98.6	99.4	98.0	92.8	97.2
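The contribution of each component can be read off the table as simple differences between averages. The sketch below does this arithmetic; the dictionary layout and variable names are illustrative, and the values are the reported averages.

```python
# Reported average success rate (%), keyed by (uses VE, uses Refine).
avg = {
    (False, False): 88.3,  # Libra-Base
    (True, False): 87.0,   # Libra-VE
    (False, True): 95.1,   # Libra-Refinement
    (True, True): 97.2,    # Full (Ours)
}

# Effect of adding Refine, with the VE setting held fixed.
refine_gain_without_ve = round(avg[(False, True)] - avg[(False, False)], 1)
refine_gain_with_ve = round(avg[(True, True)] - avg[(True, False)], 1)

# Effect of adding VE, with the Refine setting held fixed.
ve_gain_without_refine = round(avg[(True, False)] - avg[(False, False)], 1)
ve_gain_with_refine = round(avg[(True, True)] - avg[(False, True)], 1)
```

Refine adds 6.8 points without VE and 10.2 points with it, while VE changes the average by -1.3 points without Refine and +2.1 points with it.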

Ablation: Training Strategy

Our dynamic curriculum strategy balances stable early-stage convergence and late-stage robustness. Pure Teacher Forcing leaves the refiner over-reliant on perfect coarse inputs, while No Teacher Forcing destabilizes early optimization with noisy planner predictions. The dynamic curriculum starts from ground-truth anchors and gradually switches to predicted ones, yielding the best performance across all suites.
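One way to picture such a curriculum is a schedule for the probability of conditioning the refiner on the ground-truth coarse anchor. The sketch below is an assumption-heavy illustration, not the authors' schedule: the linear decay, the function names, and the `switch_steps` value are hypothetical, and the default start of 5,000 steps merely mirrors the switch point mentioned in the Training Convergence section.

```python
import random

def ground_truth_prob(step, switch_start=5_000, switch_steps=2_000):
    """Probability of feeding the ground-truth coarse anchor at a training step.

    Hypothetical schedule: pure teacher forcing until switch_start, then a
    linear decay to zero over switch_steps.
    """
    if step < switch_start:
        return 1.0
    progress = (step - switch_start) / switch_steps
    return max(0.0, 1.0 - progress)

def pick_anchor(step, ground_truth, predicted, rng=random):
    """Choose which coarse anchor conditions the refiner for this sample."""
    if rng.random() < ground_truth_prob(step):
        return ground_truth
    return predicted

early = pick_anchor(0, "ground-truth", "predicted")      # always ground truth
late = pick_anchor(10_000, "ground-truth", "predicted")  # always predicted
```

Under this sketch, Pure Teacher Forcing corresponds to a probability fixed at 1 and No Teacher Forcing to a probability fixed at 0.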

Training Strategy	Spatial	Object	Goal	Long	Avg.
Pure Teacher Forcing	96.6	99.2	95.8	92.4	96.0
No Teacher Forcing	97.4	99.0	95.0	90.4	95.5
Dynamic Curriculum (Ours)	98.6	99.4	98.0	92.8	97.2

Training Convergence

The coarse-to-fine workload decoupling substantially accelerates optimization. At one-third of the training budget (10k steps), Libra-VLA already surpasses the monolithic Libra-Base by 16.3 points on average (88.4 vs. 72.1), with the largest gap on long-horizon tasks. The continuous-action MSE loss curves below confirm this advantage: Libra-VLA's fine-action loss drops to ~0.01 while Libra-Base remains at ~0.07. The short-lived bump at step 5k corresponds to the dynamic curriculum switching from ground-truth to predicted coarse anchors; the refiner quickly recovers, exhibiting error-correction behavior.

Method	Spatial	Object	Goal	Long	Avg.
Libra-Base	81.0	86.8	66.0	54.4	72.1
Libra-VLA (Ours)	97.2	99.2	75.0	82.2	88.4

Intermediate success rates (%) at 10,000 training steps on LIBERO, averaged over 500 rollouts per suite.
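The per-suite gaps behind the 16.3-point figure follow directly from the table. The snippet below reproduces the arithmetic with illustrative variable names; the average gap uses the reported Avg. column.

```python
suites = ["Spatial", "Object", "Goal", "Long"]
libra_base = [81.0, 86.8, 66.0, 54.4]  # success rate (%) at 10k steps
libra_vla = [97.2, 99.2, 75.0, 82.2]   # success rate (%) at 10k steps

# Per-suite advantage of Libra-VLA over Libra-Base, in percentage points.
gaps = {s: round(v - b, 1) for s, b, v in zip(suites, libra_base, libra_vla)}

# Gap between the reported averages (88.4 vs. 72.1).
avg_gap = round(88.4 - 72.1, 1)

# Suite with the largest advantage.
largest_gap_suite = max(gaps, key=gaps.get)
```

The gaps are 16.2 (Spatial), 12.4 (Object), 9.0 (Goal), and 27.8 (Long) points, so the long-horizon suite accounts for the largest difference.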

(a) Linear scale. Continuous-action MSE loss on LIBERO. Libra-VLA converges markedly faster than the monolithic baseline.

(b) Log scale. Reveals the low-loss regime where the linear view saturates, and the transient rise at step 5k caused by the dynamic curriculum switch.

Asynchronous Execution

Increasing the Horizon Expansion Factor (M) substantially reduces inference latency while maintaining high success rates, thanks to the spatial tolerance of coarse quantization. Baseline (Libra-Base) inference latency is 220 ms on an RTX 4090; the Reduction column is measured relative to this baseline.

Factor (M)	Spatial	Object	Goal	Long	Avg.	Latency (ms)	Reduction
2	98.6	99.4	98.0	92.8	97.2	122	44.5%
3	97.6	99.2	94.2	93.8	96.2	112	49.1%
4	97.4	99.8	93.2	93.8	96.1	107	51.4%
5	97.8	98.8	92.0	92.4	95.3	104	52.7%
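The Reduction column can be reproduced from the latencies and the 220 ms Libra-Base baseline. The snippet below is a small check of that arithmetic; the variable names are illustrative.

```python
BASELINE_MS = 220.0  # Libra-Base inference latency reported above

# Measured latency (ms) per Horizon Expansion Factor M, from the table above.
latency_ms = {2: 122, 3: 112, 4: 107, 5: 104}

# Relative latency reduction in percent, rounded to one decimal place.
reduction_pct = {
    m: round(100.0 * (BASELINE_MS - t) / BASELINE_MS, 1)
    for m, t in latency_ms.items()
}
```

The computed values (44.5, 49.1, 51.4, and 52.7 percent) match the table. Most of the saving is already obtained at M = 2; raising M from 2 to 5 removes a further 18 ms.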

Real-World Experiments

Methods	Wipe Stain	Pour Water	Make Sandwich	Avg.
GO-1	66.7	33.3	33.3	44.4
π₀	75.0	50.0	33.3	52.8
Libra-VLA (Ours)	75.0	83.3	50.0	69.4

Libra-VLA: Achieving Learning Equilibrium
via Asynchronous Coarse-to-Fine Dual-System
