Ordered Action Tokenization
OAT is an ordered, prefix-decodable action representation built for compactness, total decodability, and high autoregressive modelability.
Accepted to RSS 2026; this page expands the original article with native web controls.
1Harvard University · 2Stanford University
Action tokens are not just an implementation detail.
Autoregressive robot policies need a way to turn continuous control signals into discrete symbols. That conversion decides how many steps the policy must generate, whether every generated sequence can be executed, and how predictable the next token is. OAT argues that a good action tokenizer should optimize all three at once.
- The problem: binning is valid but too long, FAST is compact but can be non-decodable, and learned latent tokenizers are compact but often have weak autoregressive modelability.
- The method: OAT learns a small sequence of discrete register tokens and trains them with a coarse-to-fine modelability priority.
- The result: policies can trade inference cost for action fidelity by choosing how many high-modelability tokens to generate, while still decoding a valid action chunk.
Five terms used throughout this page.
- Action chunk
- A short future sequence of robot actions predicted at once, then partially executed before re-planning.
- Tokenization
- The map from continuous actions to discrete symbols that an autoregressive policy can model.
- Detokenization
- The reverse map from generated symbols back to executable continuous robot actions.
- Autoregressive policy
- A policy that predicts the next action token conditioned on observations and previously generated tokens.
- Modelability
- How easy the token distribution is for a generative model to learn, sample, and use downstream.
1. Why should we care about action tokenization?
Discrete action tokens are becoming an increasingly important design choice in modern robot learning systems. Recent examples: RDT-2 employs vector-quantized (VQ) action tokens in its stage-1 training; TRI's LBM/VLA leverages FAST and VQ-style tokenizations; and the winning solution of the BEHAVIOR 2025 Challenge integrated FAST tokens in both training and inference.
Across these systems, action tokenization plays a particularly critical role during pre-training, where discrete tokens provide a structured, scalable interface between high-capacity sequence models and continuous robot control. As a result, the choice of action tokenizer increasingly shapes not only efficiency, but also what kinds of behaviors models can learn and generalize.
2. An overlooked axis: Modelability.
Classical theories like the rate-distortion tradeoff focus on balancing compression rate and reconstruction fidelity. In the era of GenAI, we argue that a third axis, modelability, is crucial and often overlooked: how difficult it is for a generative model to capture the distribution of a representation. Poorly structured representations may be compact and accurate, yet fundamentally hard to model.
This is the central distinction: a tokenizer can reconstruct actions well and still be a poor interface for policy learning. If the token stream has low autoregressive modelability, or is sparse and high-entropy, the model pays that cost at every next-token prediction step. The representation is not merely a storage format; it is the learning problem the policy actually sees.
Shorter action codes reduce autoregressive depth and latency.
Continuous robot control still needs enough precision for contact-rich execution.
Token order should make next-token prediction easier, not merely possible.
For robot control, the tokenizer is useful only if the downstream policy can reliably model its tokens. Low reconstruction error matters, but it does not guarantee a token sequence with stable left-to-right structure.
3. Three properties we seek.
We argue that an effective action tokenizer for autoregressive policies should satisfy three key properties:
- (P.1) Reasonable compression. The representation should compress action chunks enough to enable efficient sequence modeling, but not so aggressively that too much information is lost.
- (P.2) Total decodability. The detokenization mapping should be a well-defined total function: every token sequence in the discrete token space must decode to a valid action chunk. This is essential because policies may generate arbitrary token sequences at inference time. If decoding is only partially defined, invalid tokens can lead to undefined behavior or catastrophic failures during execution.
- (P.3) Predictive ordering. Token sequences should admit a meaningful left-to-right causal structure aligned with next-token prediction. This structure is critical for modelability, allowing autoregressive models to learn stable, predictable token dynamics.
In the remainder of this article, we examine each of these properties in turn.
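The totality requirement (P.2) lends itself to a simple fuzz test: decode many arbitrary token sequences and check that each yields a bounded, fixed-shape action chunk. The sketch below is illustrative only; `detokenize`, the vocabulary size, and the chunk shape are assumptions, not the paper's API.

```python
import numpy as np

# Hypothetical, totally-decodable detokenizer (a stand-in, not OAT's decoder).
VOCAB, CHUNK_LEN, ACT_DIM = 256, 16, 7

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map token ids to bin centers, then tile into a fixed-size chunk."""
    centers = (tokens + 0.5) / VOCAB * 2.0 - 1.0     # ids -> (-1, 1)
    return np.resize(centers, (CHUNK_LEN, ACT_DIM))  # always a full chunk

# Fuzz test: every arbitrary-length token sequence must decode.
rng = np.random.default_rng(0)
for _ in range(1000):
    seq = rng.integers(0, VOCAB, size=int(rng.integers(1, 9)))
    chunk = detokenize(seq)
    assert chunk.shape == (CHUNK_LEN, ACT_DIM)   # executable shape
    assert np.all(np.abs(chunk) <= 1.0)          # within action bounds
```

A tokenizer passing this kind of fuzz test can safely consume any sequence a policy might sample at inference time.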
Which tokenizer satisfies the desiderata?
Select a tokenizer family to compare compression, total decodability, and autoregressive modelability.
Binning is universally decodable but produces long, flat token sequences that are hard for autoregressive policies to model efficiently.
4. What's missing in today's action tokens?
Each existing tokenizer family misses the target in a different way. The practical failure mode depends on which desideratum it gives up.
Every generated token decodes, but the policy must generate hundreds of flat dimension-time tokens.
The frequency structure helps next-token prediction, but arbitrary BPE sequences may not decode.
A neural decoder makes outputs valid, but the token sequence has weak autoregressive modelability.
Tokens are learned as a progressive sequence, so early predictions carry coarse motion structure.
Binning. The most common scheme is per-dimension, per-timestamp binning. While simple, it scales poorly: long horizons and high-dimensional actions can produce hundreds of tokens per chunk, dramatically slowing training and inference and increasing latency. More importantly, such long, flat sequences have poor modelability across dimensions: knowing a[t, 1...i] offers little help in predicting a[t, i+1], making binning poorly aligned with autoregressive generation.
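For concreteness, per-dimension, per-timestep binning can be sketched in a few lines. The 256-bin vocabulary and the 32-step, 7-DoF chunk are illustrative assumptions; the point is the token count, 32 × 7 = 224, which matches why these sequences get long.

```python
import numpy as np

# Minimal sketch of per-dimension, per-timestep binning; actions are
# assumed normalized to [-1, 1] and the bin count is illustrative.
BINS = 256

def bin_tokenize(chunk: np.ndarray) -> np.ndarray:
    """One token per (timestep, dimension): quantize to uniform bins."""
    ids = np.floor((chunk + 1.0) / 2.0 * BINS).astype(int)
    return np.clip(ids, 0, BINS - 1).ravel()

def bin_detokenize(tokens: np.ndarray, horizon: int, dim: int) -> np.ndarray:
    """Invert by mapping each id back to its bin center."""
    centers = (tokens + 0.5) / BINS * 2.0 - 1.0
    return centers.reshape(horizon, dim)

chunk = np.random.default_rng(0).uniform(-1, 1, size=(32, 7))
tokens = bin_tokenize(chunk)
print(len(tokens))  # 224 tokens for a single 32-step, 7-DoF chunk
recon = bin_detokenize(tokens, 32, 7)
assert np.max(np.abs(recon - chunk)) <= 1.0 / BINS  # half-bin error bound
```

Reconstruction is accurate to half a bin width, but the policy must emit all 224 tokens, one autoregressive step each, before anything can execute.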
Frequency-domain transform. Frequency-based methods such as FAST achieve high information density (P.1) and impose a low-to-high frequency structure (P.3), where early tokens capture global trajectory structure and later tokens refine details. However, FAST violates P.2 (total decodability). Because Byte Pair Encoding (BPE) produces variable-length sequences, arbitrary token sequences may not decode into a valid fixed-size frequency representation, leading to undefined behavior and runtime failures. We refer readers to the appendix of our paper and the discussion on Hugging Face for further details.
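The failure mode can be illustrated with a toy BPE-like vocabulary (this is not FAST itself; the vocabulary and coefficient count are made up). Each token expands to a variable-length run of quantized coefficients, and the inverse transform needs exactly a fixed number of them, so some sequences simply cannot decode.

```python
# Toy illustration of partial decodability: tokens expand to
# variable-length coefficient runs, but the fixed-size inverse
# transform needs exactly N_COEFF coefficients.
N_COEFF = 8
# Hypothetical BPE-like vocabulary: token id -> run of coefficients.
MERGES = {0: [0], 1: [1], 2: [0, 0], 3: [1, 0, 1], 4: [0, 1, 0, 0]}

def try_decode(tokens):
    """Return coefficients if the sequence is decodable, else None."""
    coeffs = [c for t in tokens for c in MERGES[t]]
    if len(coeffs) != N_COEFF:   # wrong count: no valid action chunk exists
        return None
    return coeffs                # a real system would run an inverse DCT here

print(try_decode([4, 4]))      # 8 coefficients -> decodes
print(try_decode([3, 3, 3]))   # 9 coefficients -> None, undecodable
```

A policy sampling freely over token ids can land on sequences like the second one, which is exactly the runtime hazard described above.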
Vanilla Latents. Learned encoder-decoder latent tokenizers can achieve strong compression (P.1), and neural decoders ensure total decodability (P.2). However, the resulting token spaces often have weak autoregressive modelability: the token positions do not provide a stable left-to-right structure for next-token prediction. This makes them poorly aligned with policies that rely on meaningful left-to-right structure (P.3) for stable generation.
In summary, existing approaches each satisfy subsets of the desiderata, but none simultaneously achieve compression, total decodability, and high autoregressive modelability.
5. Ordered Action Tokenization
We introduce OAT, a learned autoencoder framework that discretizes action chunks into an ordered sequence of tokens. OAT encodes actions using transformer-based register tokens, discretizes the resulting latents with FSQ, and reconstructs actions via a conditional decoder. To improve autoregressive modelability, we combine causal attention over register tokens with nested dropout during training. Together, these design choices encourage a high-modelability latent representation in which earlier tokens capture coarse, global structure and later tokens refine details.
- Summarize the action chunk. A transformer encoder reads the continuous action sequence and writes the important temporal information into a fixed set of register tokens.
- Discretize the registers. Finite scalar quantization turns the register latents into discrete tokens that an autoregressive policy can predict.
- Force a left-to-right structure. Causal attention makes later registers depend on earlier registers, aligning the representation with next-token generation.
- Train with missing tails. Nested dropout randomly masks later tokens during tokenizer training, so early tokens must carry the highest-priority information.
- Decode back to control. A conditional decoder maps the generated token prefix back into a continuous action chunk for execution.
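The quantization and masking steps above can be sketched schematically in numpy. The transformer encoder and conditional decoder are stubbed out, and the FSQ level counts and register count are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

LEVELS = np.array([7, 7, 7])   # FSQ levels per latent dim (illustrative)
N_REG = 8                      # number of register tokens (illustrative)

def fsq(z: np.ndarray) -> np.ndarray:
    """Finite scalar quantization: bound each dim, then round to a grid."""
    half = (LEVELS - 1) / 2.0
    return np.round(np.tanh(z) * half) / half   # 7 grid values per dim

def nested_dropout_mask(n_reg: int, rng) -> np.ndarray:
    """Keep a random-length prefix of registers; zero out the tail."""
    keep = int(rng.integers(1, n_reg + 1))
    return (np.arange(n_reg) < keep).astype(float)

rng = np.random.default_rng(0)
registers = rng.normal(size=(N_REG, 3))          # stub for encoder output
quantized = fsq(registers)
masked = quantized * nested_dropout_mask(N_REG, rng)[:, None]
# A conditional decoder (stubbed out here) reconstructs the chunk from
# `masked`, so early registers must carry the coarsest, most useful bits.
assert np.allclose(quantized * 3.0, np.round(quantized * 3.0))  # on-grid
```

Because the reconstruction loss is backpropagated through randomly truncated prefixes, information that matters most migrates to the earliest registers, which is what causal attention then exposes to the policy.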
6. Why does order improve modelability?
The ordering induced by OAT admits a natural interpretation through information theory. Shannon showed that the optimal code length for an event scales with the negative logarithm of its probability, -log p: frequent patterns require fewer bits, while rare events demand more representational capacity. Action chunks follow a similarly skewed distribution: most trajectories share common coarse structure, whereas fine-grained deviations occur infrequently.
From this perspective, OAT learns a form of progressive coding. Early tokens capture high-probability, globally shared motion patterns, while later tokens encode increasingly rare residual details. This ordering emerges naturally from nested dropout: because the decoder must reconstruct actions from partial prefixes, the tokenizer is incentivized to allocate information in decreasing order of frequency and importance. As a result, longer prefixes yield monotonic reconstruction improvement, and token order aligns closely with autoregressive next-token prediction, without manually assigning physical features to particular token positions.
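The progressive-coding intuition can be checked on a toy signal: reconstructing a trajectory from growing prefixes of its energy-sorted cosine coefficients gives monotonically non-increasing error. The basis and signal here are illustrative stand-ins, not OAT's learned code.

```python
import numpy as np

T = 64
t = np.arange(T)
traj = np.sin(2 * np.pi * t / T) + 0.1 * np.sin(10 * np.pi * t / T)

# Orthonormal DCT-II-style basis; coefficients sorted by energy so the
# "prefix" always holds the most informative terms first.
basis = np.cos(np.pi * (t[:, None] + 0.5) * np.arange(T)[None, :] / T)
basis /= np.linalg.norm(basis, axis=0)
coeffs = basis.T @ traj
order = np.argsort(-np.abs(coeffs))

errors = []
for k in [1, 2, 4, 8]:                    # growing prefix lengths
    keep = order[:k]
    recon = basis[:, keep] @ coeffs[keep]
    errors.append(float(np.linalg.norm(recon - traj)))
assert all(a >= b for a, b in zip(errors, errors[1:]))  # monotone improvement
```

OAT learns an analogous allocation end-to-end rather than fixing a cosine basis, but the same monotonicity is what prefix decoding relies on.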
7. A by-product: prefix-based decoding.
An autoregressive policy trained on OAT need not run to completion. Because any prefix of an OAT token sequence can be detokenized into a valid action chunk, OAT supports prefix-based execution and enables an anytime trade-off between computation and performance. Short prefixes yield fast but coarse predictions, while longer prefixes produce more refined actions at higher computational cost. This flexibility arises naturally from the ordered tokenization and requires no changes to the policy architecture or training objective, distinguishing OAT from prior tokenizers that rely on fixed-length detokenization.
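A minimal sketch of the resulting anytime loop, with a stand-in decoder; the budget logic, shapes, and `decode_prefix` behavior are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

CHUNK_LEN, ACT_DIM, MAX_TOK = 16, 7, 8

def decode_prefix(tokens: np.ndarray) -> np.ndarray:
    """Toy stand-in decoder: any non-empty prefix yields a full chunk."""
    return float(np.mean(tokens)) * np.ones((CHUNK_LEN, ACT_DIM))

def anytime_act(generate_next, budget_tokens: int) -> np.ndarray:
    """Generate up to `budget_tokens` tokens, then decode whatever we have."""
    tokens = []
    for _ in range(max(1, min(budget_tokens, MAX_TOK))):
        tokens.append(generate_next(tokens))   # one autoregressive step
    return decode_prefix(np.array(tokens))

# A tighter budget still yields a valid (if coarser) chunk.
fast_chunk = anytime_act(lambda toks: float(len(toks)), budget_tokens=2)
full_chunk = anytime_act(lambda toks: float(len(toks)), budget_tokens=8)
assert fast_chunk.shape == full_chunk.shape == (CHUNK_LEN, ACT_DIM)
```

The controller can set `budget_tokens` per control cycle, trading refinement for latency without retraining anything.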
Decode fewer tokens, keep the action valid.
OAT makes every prefix executable. More tokens refine the trajectory, but the policy can stop early when latency matters.
One token decodes a complete action chunk, but the reconstruction is coarse and visibly offset from the ground truth.
Each prefix decodes to a complete action chunk. Green points are ground-truth waypoints; red points are the full chunk reconstructed from the selected prefix. More prefix tokens reduce the red-green error and increase fine-grained fidelity.
Interactive MeshCat Visualization
Visualization of reconstructed action chunks using increasing numbers of decoded tokens. Earlier tokens capture the coarse, global structure of the motion, and additional tokens progressively refine fine-grained details, producing trajectories that increasingly match the ground truth. All trajectories are generated by the same model.
8. Experiments
We evaluate OAT across more than 20 tasks spanning four simulation benchmarks (LIBERO, RoboMimic, MetaWorld, and RoboCasa) and real-world robot execution. The results compare success rate, autoregressive depth and latency, modelability ablations, and real-world task success, showing how ordered, high-modelability prefix decoding translates into both stronger policies and more flexible inference.
- Look for monotonicity. OAT1, OAT2, OAT4, and OAT8 should improve as more high-modelability tokens are decoded.
- Compare equal-depth methods. OAT8 and QueST both use eight tokens, so differences are mostly about token structure rather than token count.
- Separate latency from success. Shorter prefixes expose the speed/performance trade-off, while the ablation tests whether the modelability objective is doing real work.
OAT Is Superior
OAT consistently outperforms prior action tokenization schemes and matches or exceeds the strongest baselines, while additionally enabling prefix-based decoding that is unavailable to existing methods. OAT8 achieves the best performance across simulated and real-world benchmarks.
| Policy | LIBERO | RoboMimic | MetaWorld | RoboCasa | PnP Ball | Stack Cups |
|---|---|---|---|---|---|---|
| DP | 36.6 | 67.1 | 19.3 | 54.0 | 14/20 | 11/20 |
| Bin | 14.4 | 39.5 | 14.5 | 27.7 | 4/20 | 8/20 |
| FAST | 23.0 | 24.0 | 7.1 | 13.2 | 8/20 | 6/20 |
| QueST | 48.2 | 66.9 | 17.9 | 52.3 | 11/20 | 8/20 |
| OAT1 | 11.7 | 50.8 | 11.3 | 47.7 | 7/20 | 3/20 |
| OAT2 | 39.8 | 52.5 | 16.4 | 50.3 | 11/20 | 9/20 |
| OAT4 | 46.4 | 65.3 | 19.5 | 51.7 | 13/20 | 12/20 |
| OAT8 | 56.3 | 73.1 | 24.4 | 54.6 | 16/20 | 16/20 |
Compression Rate And Inference Latency
OAT enables a smooth and controllable trade-off between compression rate, inference latency, and policy performance. With full decoding, OAT and QueST have the same amount of compute per inference.
| Policy | LIBERO #Tok. | LIBERO Lat. | RoboMimic #Tok. | RoboMimic Lat. | MetaWorld #Tok. | MetaWorld Lat. | RoboCasa #Tok. | RoboCasa Lat. |
|---|---|---|---|---|---|---|---|---|
| DP | × | 42.0 | × | 38.1 | × | 37.7 | × | 35.3 |
| Bin | 224 | 517.2 | 224 | 509.5 | 128 | 306.6 | 384 | 888.3 |
| FAST | 44.2 | 114.4 | 53.1 | 142.0 | 49.8 | 129.7 | 69.7 | 166.1 |
| QueST | 8 | 27.1 | 8 | 29.6 | 8 | 31.4 | 8 | 30.2 |
| OAT1 | 1 | 10.5 | 1 | 11.3 | 1 | 15.5 | 1 | 13.5 |
| OAT2 | 2 | 13.2 | 2 | 15.3 | 2 | 17.9 | 2 | 15.8 |
| OAT4 | 4 | 17.4 | 4 | 18.4 | 4 | 22.1 | 4 | 19.8 |
| OAT8 | 8 | 27.4 | 8 | 29.9 | 8 | 31.3 | 8 | 30.0 |
Token Modelability Is The Key To Success
Across all benchmarks, removing the ordering-inducing objective leads to a consistent performance degradation. OAT×'s performance is significantly worse than OAT4 and OAT8, and in some cases falls below QueST.
| Policy | LIBERO | RoboMimic | MetaWorld | RoboCasa |
|---|---|---|---|---|
| QueST | 48.2 | 66.9 | 17.9 | 52.3 |
| OAT1 | 11.7 | 50.8 | 11.3 | 47.7 |
| OAT2 | 39.8 | 52.5 | 16.4 | 50.3 |
| OAT4 | 46.4 | 65.3 | 19.5 | 51.7 |
| OAT8 | 56.3 | 73.1 | 24.4 | 54.6 |
| OAT× | 35.2 | 61.1 | 17.6 | 48.5 |
Real-World Execution
More than 90 real-world robot execution videos cover successful and failed attempts across tasks, methods, and camera views. Reloading randomizes the initial configurations. Halting during FAST execution is primarily caused by undecodable action tokens. In such cases, the policy is instructed not to produce any action and to remain stationary for safety reasons.
Video grid: DP · Bin · FAST · QueST · OAT1 · OAT2 · OAT4 · OAT8 (videos unavailable in this version).
9. Closing thoughts and open directions.
A recurring question throughout this project has been whether action tokens are still necessary in the presence of powerful continuous models such as flow or diffusion policies. Our view is that future robotic systems will likely combine both discrete and continuous representations rather than choosing one over the other. A concrete example is the BEHAVIOR 2025 Challenge winning solution, which integrates discrete action tokens with continuous action experts.
A key capability enabled by OAT is prefix-based detokenization: actions can be decoded from variable-length token prefixes, yielding an anytime trade-off between computation and action fidelity. In this work, the autoregressive depth is fixed at deployment time. From an information-theoretic perspective, this is suboptimal. The number of tokens required to represent an action chunk should depend on its intrinsic complexity and the precision required for successful execution. Simple, predictable behaviors may admit compact representations, whereas complex, contact-rich interactions may require deeper autoregressive refinement. Estimating this action complexity online, and determining when additional tokens meaningfully reduce uncertainty, remains an open problem. We view adaptive autoregressive depth as a natural and important direction for future work, made possible precisely by OAT's ordered and prefix-decodable structure. Ultimately, we believe this estimation problem deserves a principled solution grounded in uncertainty and information, rather than ad hoc engineering heuristics.
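One speculative instantiation of such an adaptive rule, emphatically not the paper's method: keep generating tokens while the policy's next-token entropy stays above a threshold. The probability head (`next_token_probs`) and the threshold value below are hypothetical.

```python
import numpy as np

MAX_TOK, STOP_ENTROPY = 8, 0.5   # nats; threshold is an assumed tunable

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of a discrete distribution, in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def adaptive_depth(next_token_probs) -> int:
    """Generate tokens until the model is confident or the cap is hit."""
    tokens = []
    for _ in range(MAX_TOK):
        probs = next_token_probs(tokens)
        tokens.append(int(np.argmax(probs)))
        if entropy(probs) < STOP_ENTROPY:  # confident: stop refining
            break
    return len(tokens)

# A peaked head stops after one token; a flat head runs to the cap.
peaked = lambda toks: np.array([0.97, 0.01, 0.01, 0.01])
flat = lambda toks: np.full(4, 0.25)
assert adaptive_depth(peaked) == 1
assert adaptive_depth(flat) == MAX_TOK
```

A principled version would replace the fixed threshold with a calibrated estimate of how much an extra token reduces execution-relevant uncertainty, which is exactly the open problem noted above.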
Simple motions could execute from short prefixes, while contact-rich steps could request more tokens.
The policy could generate additional tokens only when uncertainty remains high or precision matters.
Ordered action tokens could act as an auxiliary supervision signal or planning abstraction.
Discrete action reasoning and diffusion or flow decoders do not need to be competing choices.
Appendix
@inproceedings{liu2026orderedactiontokenization,
title={OAT: Ordered Action Tokenization},
author={Chaoqi Liu and Xiaoshen Han and Jiawei Gao and Yue Zhao and Haonan Chen and Yilun Du},
booktitle={Proceedings of Robotics: Science and Systems},
year={2026}
}