Ordered Action Tokenization
OAT is an ordered, prefix-decodable action representation built for compactness, total decodability, and high autoregressive modelability.
Accepted to RSS 2026; this page expands the original article with native web controls.
1Harvard University · 2Stanford University
Action tokens are not just an implementation detail.
Autoregressive robot policies need a way to turn continuous control signals into discrete symbols. That conversion decides how many steps the policy must generate, whether every generated sequence can be executed, and how predictable the next token is. OAT argues that a good action tokenizer should optimize all three at once.
- The problem: binning is valid but too long, FAST is compact but can be non-decodable, and learned latent tokenizers are compact but often have weak autoregressive modelability.
- The method: OAT learns a small sequence of discrete register tokens and trains them with a coarse-to-fine modelability priority.
- The result: policies can trade inference cost for action fidelity by choosing how many high-modelability tokens to generate, while still decoding a valid action chunk.
Five terms used throughout this page.
- Action chunk
- A short future sequence of robot actions predicted at once, then partially executed before re-planning.
- Tokenization
- The map from continuous actions to discrete symbols that an autoregressive policy can model.
- Detokenization
- The reverse map from generated symbols back to executable continuous robot actions.
- Autoregressive policy
- A policy that predicts the next action token conditioned on observations and previously generated tokens.
- Modelability
- How easy the token distribution is for a generative model to learn, sample, and use downstream.
1. Why should we care about action tokenization?
Discrete action tokens are becoming an increasingly important design choice in modern robot learning systems. Recent examples: RDT-2 employs vector-quantized (VQ) action tokens in its stage-1 training; TRI's LBM/VLA leverages FAST and VQ-style tokenizations; and the winning solution of the BEHAVIOR 2025 Challenge integrated FAST tokens in both training and inference.
Across these systems, action tokenization plays a particularly critical role during pre-training, where discrete tokens provide a structured, scalable interface between high-capacity sequence models and continuous robot control. As a result, the choice of action tokenizer increasingly shapes not only efficiency, but also what kinds of behaviors models can learn and generalize.
2. An overlooked axis: Modelability.
Classical theories like the rate-distortion tradeoff focus on balancing compression rate and reconstruction fidelity. In the era of GenAI, we argue that a third axis, modelability, is crucial and often overlooked: how difficult it is for a generative model to capture the distribution of a representation. Poorly structured representations may be compact and accurate, yet fundamentally hard to model.
This is the central distinction: a tokenizer can reconstruct actions well and still be a poor interface for policy learning. If the token stream has low autoregressive modelability, or is sparse and high-entropy, the model pays that cost at every next-token prediction step. The representation is not merely a storage format; it is the learning problem the policy actually sees.
Shorter action codes reduce autoregressive depth and latency.
Continuous robot control still needs enough precision for contact-rich execution.
Token order should make next-token prediction easier, not merely possible.
For robot control, the tokenizer is useful only if the downstream policy can reliably model its tokens. Low reconstruction error matters, but it does not guarantee a token sequence with stable left-to-right structure.
3. Three properties we seek.
We argue that an effective action tokenizer for autoregressive policies should satisfy three key properties:
- (P.1) Reasonable compression. The representation should compress action chunks enough to enable efficient sequence modeling, but not so aggressively that too much information is lost.
- (P.2) Total decodability. The detokenization mapping should be a well-defined total function: every token sequence in the discrete token space must decode to a valid action chunk. This is essential because policies may generate arbitrary token sequences at inference time. If decoding is only partially defined, invalid tokens can lead to undefined behavior or catastrophic failures during execution.
- (P.3) Predictive ordering. Token sequences should admit a meaningful left-to-right causal structure aligned with next-token prediction. This structure is critical for modelability, allowing autoregressive models to learn stable, predictable token dynamics.
In the remainder of this article, we examine each of these properties in turn.
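The totality requirement (P.2) lends itself to a simple fuzz test: decode many arbitrary token sequences and check that each yields a bounded, fixed-shape action chunk. The sketch below is illustrative only; `detokenize`, the vocabulary size, and the chunk shape are assumptions, not the paper's API.

```python
import numpy as np

# Hypothetical, totally-decodable detokenizer (a stand-in, not OAT's decoder).
VOCAB, CHUNK_LEN, ACT_DIM = 256, 16, 7

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map token ids to bin centers, then tile into a fixed-size chunk."""
    centers = (tokens + 0.5) / VOCAB * 2.0 - 1.0     # ids -> (-1, 1)
    return np.resize(centers, (CHUNK_LEN, ACT_DIM))  # always a full chunk

# Fuzz test: every arbitrary-length token sequence must decode.
rng = np.random.default_rng(0)
for _ in range(1000):
    seq = rng.integers(0, VOCAB, size=int(rng.integers(1, 9)))
    chunk = detokenize(seq)
    assert chunk.shape == (CHUNK_LEN, ACT_DIM)   # executable shape
    assert np.all(np.abs(chunk) <= 1.0)          # within action bounds
```

A tokenizer passing this kind of fuzz test can safely consume any sequence a policy might sample at inference time.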
Which tokenizer satisfies the desiderata?
Select a tokenizer family to compare compression, total decodability, and autoregressive modelability.
Binning is universally decodable but produces long, flat token sequences that are hard for autoregressive policies to model efficiently.
4. What's missing in today's action tokens?
Each existing tokenizer family misses the target in a different way. The practical failure mode depends on which desideratum it gives up.
Every generated token decodes, but the policy must generate hundreds of flat dimension-time tokens.
The frequency structure helps next-token prediction, but arbitrary BPE sequences may not decode.
A neural decoder makes outputs valid, but the token sequence has weak autoregressive modelability.
Tokens are learned as a progressive sequence, so early predictions carry coarse motion structure.
Binning. The most common scheme is per-dimension, per-timestamp binning. While simple, it scales poorly: long horizons and high-dimensional actions can produce hundreds of tokens per chunk, dramatically slowing training and inference and increasing latency. More importantly, such long, flat sequences have poor modelability across dimensions: knowing a[t, 1...i] offers little help in predicting a[t, i+1], making binning poorly aligned with autoregressive generation.
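For concreteness, per-dimension, per-timestep binning can be sketched in a few lines. The 256-bin vocabulary and the 32-step, 7-DoF chunk are illustrative assumptions; the point is the token count, 32 × 7 = 224, which matches why these sequences get long.

```python
import numpy as np

# Minimal sketch of per-dimension, per-timestep binning; actions are
# assumed normalized to [-1, 1] and the bin count is illustrative.
BINS = 256

def bin_tokenize(chunk: np.ndarray) -> np.ndarray:
    """One token per (timestep, dimension): quantize to uniform bins."""
    ids = np.floor((chunk + 1.0) / 2.0 * BINS).astype(int)
    return np.clip(ids, 0, BINS - 1).ravel()

def bin_detokenize(tokens: np.ndarray, horizon: int, dim: int) -> np.ndarray:
    """Invert by mapping each id back to its bin center."""
    centers = (tokens + 0.5) / BINS * 2.0 - 1.0
    return centers.reshape(horizon, dim)

chunk = np.random.default_rng(0).uniform(-1, 1, size=(32, 7))
tokens = bin_tokenize(chunk)
print(len(tokens))  # 224 tokens for a single 32-step, 7-DoF chunk
recon = bin_detokenize(tokens, 32, 7)
assert np.max(np.abs(recon - chunk)) <= 1.0 / BINS  # half-bin error bound
```

Reconstruction is accurate to half a bin width, but the policy must emit all 224 tokens, one autoregressive step each, before anything can execute.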
Frequency-domain transform. Frequency-based methods such as FAST achieve high information density (P.1) and impose a low-to-high frequency structure (P.3), where early tokens capture global trajectory structure and later tokens refine details. However, FAST violates P.2 (total decodability). Because Byte Pair Encoding (BPE) produces variable-length sequences, arbitrary token sequences may not decode into a valid fixed-size frequency representation, leading to undefined behavior and runtime failures. We refer readers to the appendix of our paper and the discussion on Hugging Face for further details.
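The failure mode can be illustrated with a toy BPE-like vocabulary (this is not FAST itself; the vocabulary and coefficient count are made up). Each token expands to a variable-length run of quantized coefficients, and the inverse transform needs exactly a fixed number of them, so some sequences simply cannot decode.

```python
# Toy illustration of partial decodability: tokens expand to
# variable-length coefficient runs, but the fixed-size inverse
# transform needs exactly N_COEFF coefficients.
N_COEFF = 8
# Hypothetical BPE-like vocabulary: token id -> run of coefficients.
MERGES = {0: [0], 1: [1], 2: [0, 0], 3: [1, 0, 1], 4: [0, 1, 0, 0]}

def try_decode(tokens):
    """Return coefficients if the sequence is decodable, else None."""
    coeffs = [c for t in tokens for c in MERGES[t]]
    if len(coeffs) != N_COEFF:   # wrong count: no valid action chunk exists
        return None
    return coeffs                # a real system would run an inverse DCT here

print(try_decode([4, 4]))      # 8 coefficients -> decodes
print(try_decode([3, 3, 3]))   # 9 coefficients -> None, undecodable
```

A policy sampling freely over token ids can land on sequences like the second one, which is exactly the runtime hazard described above.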
Vanilla Latents. Learned encoder-decoder latent tokenizers can achieve strong compression (P.1), and neural decoders ensure total decodability (P.2). However, the resulting token spaces often have weak autoregressive modelability: the token positions do not provide a stable left-to-right structure for next-token prediction. This makes them poorly aligned with policies that rely on meaningful left-to-right structure (P.3) for stable generation.
In summary, existing approaches each satisfy subsets of the desiderata, but none simultaneously achieve compression, total decodability, and high autoregressive modelability.
5. Ordered Action Tokenization
We introduce OAT, a learned autoencoder framework that discretizes action chunks into an ordered sequence of tokens. OAT encodes actions using transformer-based register tokens, discretizes the resulting latents with FSQ, and reconstructs actions via a conditional decoder. To improve autoregressive modelability, we combine causal attention over register tokens with nested dropout during training. Together, these design choices encourage a high-modelability latent representation in which earlier tokens capture coarse, global structure and later tokens refine details.
- Summarize the action chunk. A transformer encoder reads the continuous action sequence and writes the important temporal information into a fixed set of register tokens.
- Discretize the registers. Finite scalar quantization turns the register latents into discrete tokens that an autoregressive policy can predict.
- Force a left-to-right structure. Causal attention makes later registers depend on earlier registers, aligning the representation with next-token generation.
- Train with missing tails. Nested dropout randomly masks later tokens during tokenizer training, so early tokens must carry the highest-priority information.
- Decode back to control. A conditional decoder maps the generated token prefix back into a continuous action chunk for execution.
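The quantization and masking steps above can be sketched schematically in numpy. The transformer encoder and conditional decoder are stubbed out, and the FSQ level counts and register count are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

LEVELS = np.array([7, 7, 7])   # FSQ levels per latent dim (illustrative)
N_REG = 8                      # number of register tokens (illustrative)

def fsq(z: np.ndarray) -> np.ndarray:
    """Finite scalar quantization: bound each dim, then round to a grid."""
    half = (LEVELS - 1) / 2.0
    return np.round(np.tanh(z) * half) / half   # 7 grid values per dim

def nested_dropout_mask(n_reg: int, rng) -> np.ndarray:
    """Keep a random-length prefix of registers; zero out the tail."""
    keep = int(rng.integers(1, n_reg + 1))
    return (np.arange(n_reg) < keep).astype(float)

rng = np.random.default_rng(0)
registers = rng.normal(size=(N_REG, 3))          # stub for encoder output
quantized = fsq(registers)
masked = quantized * nested_dropout_mask(N_REG, rng)[:, None]
# A conditional decoder (stubbed out here) reconstructs the chunk from
# `masked`, so early registers must carry the coarsest, most useful bits.
assert np.allclose(quantized * 3.0, np.round(quantized * 3.0))  # on-grid
```

Because the reconstruction loss is backpropagated through randomly truncated prefixes, information that matters most migrates to the earliest registers, which is what causal attention then exposes to the policy.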
6. Why does order improve modelability?
The ordering induced by OAT admits a natural interpretation through information theory. Shannon showed that the optimal code length for an event scales with the negative logarithm of its probability, -log p: frequent patterns require fewer bits, while rare events demand more representational capacity. Action chunks follow a similarly skewed distribution: most trajectories share common coarse structure, whereas fine-grained deviations occur infrequently.
From this perspective, OAT learns a form of progressive coding. Early tokens capture high-probability, globally shared motion patterns, while later tokens encode increasingly rare residual details. This ordering emerges naturally from nested dropout: because the decoder must reconstruct actions from partial prefixes, the tokenizer is incentivized to allocate information in decreasing order of frequency and importance. As a result, longer prefixes yield monotonic reconstruction improvement, and token order aligns closely with autoregressive next-token prediction, without manually assigning physical features to particular token positions.
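The progressive-coding intuition can be checked on a toy signal: reconstructing a trajectory from growing prefixes of its energy-sorted cosine coefficients gives monotonically non-increasing error. The basis and signal here are illustrative stand-ins, not OAT's learned code.

```python
import numpy as np

T = 64
t = np.arange(T)
traj = np.sin(2 * np.pi * t / T) + 0.1 * np.sin(10 * np.pi * t / T)

# Orthonormal DCT-II-style basis; coefficients sorted by energy so the
# "prefix" always holds the most informative terms first.
basis = np.cos(np.pi * (t[:, None] + 0.5) * np.arange(T)[None, :] / T)
basis /= np.linalg.norm(basis, axis=0)
coeffs = basis.T @ traj
order = np.argsort(-np.abs(coeffs))

errors = []
for k in [1, 2, 4, 8]:                    # growing prefix lengths
    keep = order[:k]
    recon = basis[:, keep] @ coeffs[keep]
    errors.append(float(np.linalg.norm(recon - traj)))
assert all(a >= b for a, b in zip(errors, errors[1:]))  # monotone improvement
```

OAT learns an analogous allocation end-to-end rather than fixing a cosine basis, but the same monotonicity is what prefix decoding relies on.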
7. A by-product: prefix-based decoding.
An autoregressive policy trained on OAT need not run to completion. Because any prefix of an OAT token sequence can be detokenized into a valid action chunk, OAT supports prefix-based execution and enables an anytime trade-off between computation and performance. Short prefixes yield fast but coarse predictions, while longer prefixes produce more refined actions at higher computational cost. This flexibility arises naturally from the ordered tokenization and requires no changes to the policy architecture or training objective, distinguishing OAT from prior tokenizers that rely on fixed-length detokenization.
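A minimal sketch of the resulting anytime loop, with a stand-in decoder; the budget logic, shapes, and `decode_prefix` behavior are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

CHUNK_LEN, ACT_DIM, MAX_TOK = 16, 7, 8

def decode_prefix(tokens: np.ndarray) -> np.ndarray:
    """Toy stand-in decoder: any non-empty prefix yields a full chunk."""
    return float(np.mean(tokens)) * np.ones((CHUNK_LEN, ACT_DIM))

def anytime_act(generate_next, budget_tokens: int) -> np.ndarray:
    """Generate up to `budget_tokens` tokens, then decode whatever we have."""
    tokens = []
    for _ in range(max(1, min(budget_tokens, MAX_TOK))):
        tokens.append(generate_next(tokens))   # one autoregressive step
    return decode_prefix(np.array(tokens))

# A tighter budget still yields a valid (if coarser) chunk.
fast_chunk = anytime_act(lambda toks: float(len(toks)), budget_tokens=2)
full_chunk = anytime_act(lambda toks: float(len(toks)), budget_tokens=8)
assert fast_chunk.shape == full_chunk.shape == (CHUNK_LEN, ACT_DIM)
```

The controller can set `budget_tokens` per control cycle, trading refinement for latency without retraining anything.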
Decode fewer tokens, keep the action valid.
OAT makes every prefix executable. More tokens refine the trajectory, but the policy can stop early when latency matters.
One token decodes a complete action chunk, but the reconstruction is coarse and visibly offset from the ground truth.
Each prefix decodes to a complete action chunk. Green points are ground-truth waypoints; red points are the full chunk reconstructed from the selected prefix. More prefix tokens reduce the red-green error and increase fine-grained fidelity.
Interactive MeshCat Visualization
Visualization of reconstructed action chunks using increasing numbers of decoded tokens. Earlier tokens capture the coarse, global structure of the motion, and additional tokens progressively refine fine-grained details, producing trajectories that increasingly match the ground truth. All trajectories are generated by the same model.
8. Experiments
We evaluate OAT across more than 20 tasks spanning four simulation benchmarks (LIBERO, RoboMimic, MetaWorld, and RoboCasa) and real-world robot execution. The results compare success rate, autoregressive depth and latency, modelability ablations, and real-world task success, showing how ordered, high-modelability prefix decoding translates into both stronger policies and more flexible inference.
- Look for monotonicity. OAT1, OAT2, OAT4, and OAT8 should improve as more high-modelability tokens are decoded.
- Compare equal-depth methods. OAT8 and QueST both use eight tokens, so differences are mostly about token structure rather than token count.
- Separate latency from success. Shorter prefixes expose the speed/performance trade-off, while the ablation tests whether the modelability objective is doing real work.
OAT Is Superior
OAT consistently outperforms prior action tokenization schemes and matches or exceeds the strongest baselines, while additionally enabling prefix-based decoding that is unavailable to existing methods. OAT8 achieves the best performance across simulated and real-world benchmarks.
| Policy | LIBERO | RoboMimic | MetaWorld | RoboCasa | PnP Ball | Stack Cups |
|---|---|---|---|---|---|---|
| DP | 36.6 | 67.1 | 19.3 | 54.0 | 14/20 | 11/20 |
| Bin | 14.4 | 39.5 | 14.5 | 27.7 | 4/20 | 8/20 |
| FAST | 23.0 | 24.0 | 7.1 | 13.2 | 8/20 | 6/20 |
| QueST | 48.2 | 66.9 | 17.9 | 52.3 | 11/20 | 8/20 |
| OAT1 | 11.7 | 50.8 | 11.3 | 47.7 | 7/20 | 3/20 |
| OAT2 | 39.8 | 52.5 | 16.4 | 50.3 | 11/20 | 9/20 |
| OAT4 | 46.4 | 65.3 | 19.5 | 51.7 | 13/20 | 12/20 |
| OAT8 | 56.3 | 73.1 | 24.4 | 54.6 | 16/20 | 16/20 |
Compression Rate And Inference Latency
OAT enables a smooth and controllable trade-off between compression rate, inference latency, and policy performance. With full decoding, OAT and QueST have the same amount of compute per inference.
| Policy | LIBERO #Tok. | LIBERO Lat. | RoboMimic #Tok. | RoboMimic Lat. | MetaWorld #Tok. | MetaWorld Lat. | RoboCasa #Tok. | RoboCasa Lat. |
|---|---|---|---|---|---|---|---|---|
| DP | × | 42.0 | × | 38.1 | × | 37.7 | × | 35.3 |
| Bin | 224 | 517.2 | 224 | 509.5 | 128 | 306.6 | 384 | 888.3 |
| FAST | 44.2 | 114.4 | 53.1 | 142.0 | 49.8 | 129.7 | 69.7 | 166.1 |
| QueST | 8 | 27.1 | 8 | 29.6 | 8 | 31.4 | 8 | 30.2 |
| OAT1 | 1 | 10.5 | 1 | 11.3 | 1 | 15.5 | 1 | 13.5 |
| OAT2 | 2 | 13.2 | 2 | 15.3 | 2 | 17.9 | 2 | 15.8 |
| OAT4 | 4 | 17.4 | 4 | 18.4 | 4 | 22.1 | 4 | 19.8 |
| OAT8 | 8 | 27.4 | 8 | 29.9 | 8 | 31.3 | 8 | 30.0 |
Token Modelability Is The Key To Success
Across all benchmarks, removing the ordering-inducing objective leads to a consistent performance degradation. OAT×'s performance is significantly worse than OAT4 and OAT8, and in some cases falls below QueST.
| Policy | LIBERO | RoboMimic | MetaWorld | RoboCasa |
|---|---|---|---|---|
| QueST | 48.2 | 66.9 | 17.9 | 52.3 |
| OAT1 | 11.7 | 50.8 | 11.3 | 47.7 |
| OAT2 | 39.8 | 52.5 | 16.4 | 50.3 |
| OAT4 | 46.4 | 65.3 | 19.5 | 51.7 |
| OAT8 | 56.3 | 73.1 | 24.4 | 54.6 |
| OAT× | 35.2 | 61.1 | 17.6 | 48.5 |
Real-World Execution
More than 90 real-world robot execution videos cover successful and failed attempts across tasks, methods, and camera views. Reloading randomizes the initial configurations. Halting during FAST execution is primarily caused by undecodable action tokens. In such cases, the policy is instructed not to produce any action and to remain stationary for safety reasons.
Video grid: DP · Bin · FAST · QueST · OAT1 · OAT2 · OAT4 · OAT8 (videos unavailable in this version).
9. Closing thoughts and open directions.
A recurring question throughout this project has been whether action tokens are still necessary in the presence of powerful continuous models such as flow or diffusion policies. Our view is that future robotic systems will likely combine both discrete and continuous representations rather than choosing one over the other. A concrete example is the BEHAVIOR 2025 Challenge winning solution, which integrates discrete action tokens with continuous action experts.
A key capability enabled by OAT is prefix-based detokenization: actions can be decoded from variable-length token prefixes, yielding an anytime trade-off between computation and action fidelity. In this work, the autoregressive depth is fixed at deployment time. From an information-theoretic perspective, this is suboptimal. The number of tokens required to represent an action chunk should depend on its intrinsic complexity and the precision required for successful execution. Simple, predictable behaviors may admit compact representations, whereas complex, contact-rich interactions may require deeper autoregressive refinement. Estimating this action complexity online, and determining when additional tokens meaningfully reduce uncertainty, remains an open problem. We view adaptive autoregressive depth as a natural and important direction for future work, made possible precisely by OAT's ordered and prefix-decodable structure. Ultimately, we believe this estimation problem deserves a principled solution grounded in uncertainty and information, rather than ad hoc engineering heuristics.
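One speculative instantiation of such an adaptive rule, emphatically not the paper's method: keep generating tokens while the policy's next-token entropy stays above a threshold. The probability head (`next_token_probs`) and the threshold value below are hypothetical.

```python
import numpy as np

MAX_TOK, STOP_ENTROPY = 8, 0.5   # nats; threshold is an assumed tunable

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of a discrete distribution, in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def adaptive_depth(next_token_probs) -> int:
    """Generate tokens until the model is confident or the cap is hit."""
    tokens = []
    for _ in range(MAX_TOK):
        probs = next_token_probs(tokens)
        tokens.append(int(np.argmax(probs)))
        if entropy(probs) < STOP_ENTROPY:  # confident: stop refining
            break
    return len(tokens)

# A peaked head stops after one token; a flat head runs to the cap.
peaked = lambda toks: np.array([0.97, 0.01, 0.01, 0.01])
flat = lambda toks: np.full(4, 0.25)
assert adaptive_depth(peaked) == 1
assert adaptive_depth(flat) == MAX_TOK
```

A principled version would replace the fixed threshold with a calibrated estimate of how much an extra token reduces execution-relevant uncertainty, which is exactly the open problem noted above.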
Simple motions could execute from short prefixes, while contact-rich steps could request more tokens.
The policy could generate additional tokens only when uncertainty remains high or precision matters.
Ordered action tokens could act as an auxiliary supervision signal or planning abstraction.
Discrete action reasoning and diffusion or flow decoders do not need to be competing choices.
Appendix
@inproceedings{liu2026orderedactiontokenization,
title={OAT: Ordered Action Tokenization},
author={Chaoqi Liu and Xiaoshen Han and Jiawei Gao and Yue Zhao and Haonan Chen and Yilun Du},
booktitle={Proceedings of Robotics: Science and Systems},
year={2026}
}