Ordered Action Tokenization


OAT is an ordered, prefix-decodable action representation built for compactness, total decodability, and high autoregressive modelability.

RSS 2026 accepted paper, expanded from the original article with native web controls.

Chaoqi Liu1, Xiaoshen Han1, Jiawei Gao1, Yue Zhao2, Haonan Chen1, and Yilun Du1

1Harvard University · 2Stanford University

Highlights: three tokenizer desiderata · any prefix of tokens decodable at inference · superior policy performance across 20+ evaluated tasks
OAT teaser summarizing tokenizer desiderata and policy performance
Paper in one minute

Action tokens are not just an implementation detail.

Autoregressive robot policies need a way to turn continuous control signals into discrete symbols. That conversion decides how many steps the policy must generate, whether every generated sequence can be executed, and how predictable the next token is. OAT argues that a good action tokenizer should optimize all three at once.

  1. The problem: binning is valid but too long, FAST is compact but can be non-decodable, and learned latent tokenizers are compact but often have weak autoregressive modelability.
  2. The method: OAT learns a small sequence of discrete register tokens and trains them with a coarse-to-fine modelability priority.
  3. The result: policies can trade inference cost for action fidelity by choosing how many high-modelability tokens to generate, while still decoding a valid action chunk.
Small glossary

Five terms used throughout this page.

Action chunk
A short future sequence of robot actions predicted at once, then partially executed before re-planning.
Tokenization
The map from continuous actions to discrete symbols that an autoregressive policy can model.
Detokenization
The reverse map from generated symbols back to executable continuous robot actions.
Autoregressive policy
A policy that predicts the next action token conditioned on observations and previously generated tokens.
Modelability
How easy the token distribution is for a generative model to learn, sample, and use downstream.

1. Why should we care about action tokenization?

Discrete action tokens are becoming an increasingly important design choice in modern robot learning systems. Recent systems illustrate this: RDT-2 employs vector-quantized (VQ) action tokens for its stage-1 training; TRI's LBM/VLA leverages FAST and VQ-style tokenizations; and the winning solution of the BEHAVIOR 2025 Challenge integrated FAST tokens in both training and inference.

Across these systems, action tokenization plays a particularly critical role during pre-training, where discrete tokens provide a structured, scalable interface between high-capacity sequence models and continuous robot control. As a result, the choice of action tokenizer increasingly shapes not only efficiency, but also what kinds of behaviors models can learn and generalize.

2. An overlooked axis: Modelability.

Classical theories like the rate-distortion tradeoff focus on balancing compression rate and reconstruction fidelity. In the era of GenAI, we argue that a third axis, modelability, is crucial and often overlooked: how difficult it is for a generative model to capture the distribution of a representation. Poorly structured representations may be compact and accurate, yet fundamentally hard to model.

This is the central distinction: a tokenizer can reconstruct actions well and still be a poor interface for policy learning. If the token stream has low autoregressive modelability, or is sparse and high-entropy, the model pays that cost at every next-token prediction step. The representation is not merely a storage format; it is the learning problem the policy actually sees.
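As a rough illustration of this point (our toy proxy, not a metric from the paper), one can compare the next-token predictability of a structured token stream against an unstructured one using a smoothed bigram model: the lower the average next-token negative log-likelihood, the easier the left-to-right structure is to model.

```python
import numpy as np

def avg_next_token_nll(sequences, vocab_size):
    """Average next-token NLL under a Laplace-smoothed bigram model
    fit to the same token stream (a crude modelability proxy)."""
    counts = np.ones((vocab_size, vocab_size))  # Laplace smoothing
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    nll, n = 0.0, 0
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            nll -= np.log(probs[prev, nxt])
            n += 1
    return nll / n

rng = np.random.default_rng(0)
# Structured stream: each token repeats four times (predictable runs).
structured = [list(np.repeat(rng.integers(0, 8, 16), 4)) for _ in range(64)]
# Unstructured stream: i.i.d. uniform tokens of the same length.
random_seq = [list(rng.integers(0, 8, 64)) for _ in range(64)]
assert avg_next_token_nll(structured, 8) < avg_next_token_nll(random_seq, 8)
```

Both streams carry the same vocabulary and length; only the left-to-right structure differs, and that alone changes how hard the prediction problem is.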

Rate How many tokens?

Shorter action codes reduce autoregressive depth and latency.

Distortion How much action detail?

Continuous robot control still needs enough precision for contact-rich execution.

Modelability How predictable is the sequence?

Token order should make next-token prediction easier, not merely possible.

Reconstruction is not enough.

For robot control, the tokenizer is useful only if the downstream policy can reliably model its tokens. Low reconstruction error matters, but it does not guarantee a token sequence with stable left-to-right structure.

3. Three properties we seek.

We argue that an effective action tokenizer for autoregressive policies should satisfy three key properties:

  • (P.1) Reasonable compression. The representation should compress action chunks enough to enable efficient sequence modeling, but not so aggressively that too much information is lost.
  • (P.2) Total decodability. The detokenization mapping should be a well-defined total function: every token sequence in the discrete token space must decode to a valid action chunk. This is essential because policies may generate arbitrary token sequences at inference time. If decoding is only partially defined, invalid tokens can lead to undefined behavior or catastrophic failures during execution.
  • (P.3) Predictive ordering. Token sequences should admit a meaningful left-to-right causal structure aligned with next-token prediction. This structure is critical for modelability, allowing autoregressive models to learn stable, predictable token dynamics.
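To make (P.2) concrete, here is a minimal per-dimension binning (de)tokenizer; all names and ranges are illustrative. Its detokenizer is a total function: every integer id, even an out-of-range one after clipping, decodes to a valid continuous value, so arbitrary generated sequences always yield executable actions.

```python
import numpy as np

def tokenize(actions, low=-1.0, high=1.0, n_bins=256):
    """Map continuous actions in [low, high] to integer bin ids."""
    actions = np.clip(actions, low, high)
    return ((actions - low) / (high - low) * (n_bins - 1)).round().astype(int)

def detokenize(ids, low=-1.0, high=1.0, n_bins=256):
    """Total inverse: any id decodes to a bin center after clipping."""
    ids = np.clip(ids, 0, n_bins - 1)
    return low + ids / (n_bins - 1) * (high - low)

chunk = np.array([[0.1, -0.4], [0.12, -0.38]])  # (timesteps, action dims)
recon = detokenize(tokenize(chunk))
assert np.abs(recon - chunk).max() < 0.01       # within one bin width
assert detokenize(np.array([0, 255, 999])).shape == (3,)  # never undefined
```

The price of this totality, as Section 4 discusses, is a long, flat token sequence.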

In the remainder of this article, we examine each of these properties and how existing tokenizers measure up against them.

Interactive Comparison

Which tokenizer satisfies the desiderata?

Select a tokenizer family to compare compression, total decodability, and autoregressive modelability.

Example (Binning selected): Compression: Low · Total decodability: Yes · Autoregressive modelability: Low

Binning is universally decodable but produces long, flat token sequences that are hard for autoregressive policies to model efficiently.

4. What's missing in today's action tokens?

Each existing tokenizer family misses the target in a different way. The practical failure mode depends on which desideratum it gives up.

Binning Valid but slow.

Every generated token decodes, but the policy must generate hundreds of flat dimension-time tokens.

FAST Compact but partial.

The frequency structure helps next-token prediction, but arbitrary BPE sequences may not decode.

Latents Compact but low-modelability.

A neural decoder makes outputs valid, but the token sequence has weak autoregressive modelability.

OAT Compact, valid, high-modelability.

Tokens are learned as a progressive sequence, so early predictions carry coarse motion structure.

Binning. The most common scheme is per-dimension, per-timestep binning. While simple, it scales poorly: long horizons and high-dimensional actions can produce hundreds of tokens per chunk, dramatically slowing training and inference and increasing latency. More importantly, such long, flat sequences have poor modelability across dimensions: knowing a[t, 1...i] offers little help in predicting a[t, i+1], making binning poorly aligned with autoregressive generation.

Frequency-domain transform. Frequency-based methods such as FAST achieve high information density (P.1) and impose a low-to-high frequency structure (P.3), where early tokens capture global trajectory structure and later tokens refine details. However, FAST violates P.2 (total decodability). Because Byte Pair Encoding (BPE) produces variable-length sequences, arbitrary token sequences may not decode into a valid fixed-size frequency representation, leading to undefined behavior and runtime failures. We refer readers to the appendix of our paper and the discussion on Hugging Face for further details.
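A deliberately simplified sketch of this failure mode (the real pipeline applies a DCT, quantizes, and BPE-compresses; `decode_coeffs` is a hypothetical stand-in for the fixed-size inverse transform): if BPE expansion of a sampled token sequence yields the wrong coefficient count, decoding is simply undefined.

```python
K = 8  # coefficient count the fixed-size inverse transform expects

def decode_coeffs(coeffs, k=K):
    """Stand-in for a fixed-size inverse frequency transform:
    it is only defined for exactly k coefficients."""
    if len(coeffs) != k:
        raise ValueError("undecodable: wrong coefficient count")
    return list(coeffs)  # placeholder for the actual inverse transform

valid = list(range(K))
assert decode_coeffs(valid) == valid  # well-formed sequences decode fine

# A sampled BPE sequence that expands to too few coefficients:
try:
    decode_coeffs(valid[:5])
    decodable = True
except ValueError:
    decodable = False
assert not decodable  # the detokenizer is a partial function
```

A deployed policy must therefore handle undecodable outputs explicitly, e.g. by halting, which is exactly the behavior reported in the real-world FAST rollouts below.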

Vanilla Latents. Learned encoder-decoder latent tokenizers can achieve strong compression (P.1), and neural decoders ensure total decodability (P.2). However, the resulting token spaces often have weak autoregressive modelability: the token positions do not provide a stable left-to-right structure for next-token prediction. This makes them poorly aligned with policies that rely on meaningful left-to-right structure (P.3) for stable generation.

In summary, existing approaches each satisfy subsets of the desiderata, but none simultaneously achieve compression, total decodability, and high autoregressive modelability.

5. Ordered Action Tokenization

We introduce OAT, a learned autoencoder framework that discretizes action chunks into an ordered sequence of tokens. OAT encodes actions using transformer-based register tokens, discretizes the resulting latents with FSQ, and reconstructs actions via a conditional decoder. To improve autoregressive modelability, we combine causal attention over register tokens with nested dropout during training. Together, these design choices encourage a high-modelability latent representation in which earlier tokens capture coarse, global structure and later tokens refine details.

  1. Summarize the action chunk. A transformer encoder reads the continuous action sequence and writes the important temporal information into a fixed set of register tokens.
  2. Discretize the registers. Finite scalar quantization turns the register latents into discrete tokens that an autoregressive policy can predict.
  3. Force a left-to-right structure. Causal attention makes later registers depend on earlier registers, aligning the representation with next-token generation.
  4. Train with missing tails. Nested dropout randomly masks later tokens during tokenizer training, so early tokens must carry the highest-priority information.
  5. Decode back to control. A conditional decoder maps the generated token prefix back into a continuous action chunk for execution.
OAT method overview with register tokens and prefix decoding
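The FSQ step above can be sketched in a few lines (a minimal illustration; the per-channel level count and shapes are assumptions, not the paper's exact configuration). Each latent scalar is bounded and snapped to one of a fixed number of levels, so every code in the grid is reachable and decodable.

```python
import numpy as np

def fsq(z, levels=8):
    """Finite scalar quantization: squash each scalar to (-1, 1),
    round to one of `levels` grid points, return values and codes."""
    bounded = np.tanh(z)                    # bound each channel
    half = (levels - 1) / 2.0
    idx = np.round((bounded + 1.0) * half)  # integer code in [0, levels-1]
    quantized = idx / half - 1.0            # back to the [-1, 1] grid
    return quantized, idx.astype(int)

z = np.array([0.3, -2.0, 0.0, 5.0])         # register latents (toy values)
q, codes = fsq(z)
assert codes.min() >= 0 and codes.max() <= 7  # every code is in-vocabulary
assert np.all(np.abs(q) <= 1.0)               # every code decodes validly
```

Because the codebook is an implicit grid rather than a learned dictionary, there are no dead codes and no token that fails to decode, which is what makes (P.2) hold by construction.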

6. Why does order improve modelability?

The ordering induced by OAT admits a natural interpretation through information theory. Shannon showed that the optimal code length for an event scales with the negative logarithm of its probability, - log p: frequent patterns require fewer bits, while rare events demand more representational capacity. Action chunks follow a similarly skewed distribution: most trajectories share common coarse structure, whereas fine-grained deviations occur infrequently.

From this perspective, OAT learns a form of progressive coding. Early tokens capture high-probability, globally shared motion patterns, while later tokens encode increasingly rare residual details. This ordering emerges naturally from nested dropout: because the decoder must reconstruct actions from partial prefixes, the tokenizer is incentivized to allocate information in decreasing order of frequency and importance. As a result, longer prefixes yield monotonic reconstruction improvement, and token order aligns closely with autoregressive next-token prediction, without manually assigning physical features to particular token positions.
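Nested dropout itself is easy to sketch (illustrative shapes and names): during tokenizer training, a random prefix length is sampled and every register token after it is masked, so the reconstruction loss forces the earliest tokens to carry the coarsest, highest-priority information.

```python
import numpy as np

def nested_dropout(registers, rng):
    """registers: (num_tokens, dim). Keep a random prefix, zero the tail."""
    n = registers.shape[0]
    keep = rng.integers(1, n + 1)   # prefix length sampled from {1, ..., n}
    masked = registers.copy()
    masked[keep:] = 0.0             # drop everything after the prefix
    return masked, keep

rng = np.random.default_rng(0)
regs = rng.normal(size=(8, 4))      # 8 register tokens, 4-dim latents
masked, keep = nested_dropout(regs, rng)
assert np.allclose(masked[:keep], regs[:keep])  # prefix is preserved
assert np.all(masked[keep:] == 0.0)             # tail is dropped
```

In training, the decoder would reconstruct the full action chunk from `masked`, averaging over many sampled prefix lengths; that averaging is what induces the coarse-to-fine priority over token positions.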

7. A by-product: prefix-based decoding.

An autoregressive policy trained on OAT tokens need not decode to completion. Because any prefix of an OAT token sequence can be detokenized into a valid action chunk, OAT supports prefix-based execution and enables an anytime trade-off between computation and performance. Short prefixes yield fast but coarse predictions, while longer prefixes produce more refined actions at higher computational cost. This flexibility arises naturally from the ordered tokenization and requires no changes to the policy architecture or training objective, distinguishing OAT from prior tokenizers that rely on fixed-length detokenization.
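The prefix-execution pattern can be sketched as follows (the `policy` and `detokenize` interfaces are hypothetical; the toy stubs stand in for real models): generate only k tokens, then decode, and any k yields a complete chunk.

```python
class ToyPolicy:
    """Toy stand-in for an autoregressive policy head."""
    def next_token(self, obs, prefix):
        return len(prefix)  # deterministic dummy token stream

def toy_detokenize(tokens):
    """Toy stand-in for a prefix decoder: any prefix yields a
    complete (here: 16-step, 1-D) action chunk."""
    return [sum(tokens) / (i + 1) for i in range(16)]

def act(policy, detok, obs, k):
    tokens = []
    for _ in range(k):                 # generate only k of the 8 tokens
        tokens.append(policy.next_token(obs, tokens))
    return detok(tokens)               # valid chunk for any prefix length

for k in (1, 2, 4, 8):
    chunk = act(ToyPolicy(), toy_detokenize, obs=None, k=k)
    assert len(chunk) == 16            # every prefix decodes to a full chunk
```

With a real OAT decoder, larger k would refine the same chunk rather than change its length, which is exactly the anytime trade-off described above.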

Prefix Lab

Decode fewer tokens, keep the action valid.

OAT makes every prefix executable. More tokens refine the trajectory, but the policy can stop early when latency matters.

With a single prefix token, the decoder still produces a complete action chunk, but the reconstruction is coarse and visibly offset from the ground truth.

Each prefix decodes to a complete action chunk. Green points are ground-truth waypoints; red points are the full chunk reconstructed from the selected prefix. More prefix tokens reduce the red-green error and increase fine-grained fidelity.

Interactive MeshCat Visualization

Visualization of reconstructed action chunks using increasing numbers of decoded tokens. Earlier tokens capture the coarse, global structure of the motion, and additional tokens progressively refine fine-grained details, producing trajectories that increasingly match the ground truth. All trajectories are generated by the same model.

  • 1 token: MSE = 0.592
  • 2 tokens: MSE = 0.446
  • 4 tokens: MSE = 0.038
  • 8 tokens: MSE = 0.009
  • ground truth (reference)

8. Experiments

We evaluate OAT across more than 20 tasks spanning four simulation benchmarks (LIBERO, RoboMimic, MetaWorld, and RoboCasa) and real-world robot execution. The results compare success rate, autoregressive depth and latency, modelability ablations, and real-world task success, showing how ordered, high-modelability prefix decoding translates into both stronger policies and more flexible inference.

How to read the results
  • Look for monotonicity. OAT1, OAT2, OAT4, and OAT8 should improve as more high-modelability tokens are decoded.
  • Compare equal-depth methods. OAT8 and QueST both use eight tokens, so differences are mostly about token structure rather than token count.
  • Separate latency from success. Shorter prefixes expose the speed/performance trade-off, while the ablation tests whether the modelability objective is doing real work.

OAT Is Superior

OAT consistently outperforms prior action tokenization schemes and matches or exceeds the strongest baselines, while additionally enabling prefix-based decoding that is unavailable to existing methods. OAT8 achieves the best performance across simulated and real-world benchmarks.

Simulation success rate across manipulation benchmarks, with real-world success over 20 independent trials.
Policy   LIBERO   RoboMimic   MetaWorld   RoboCasa   PnP Ball   Stack Cups
DP       36.6     67.1        19.3        54.0       14/20      11/20
Bin      14.4     39.5        14.5        27.7        4/20       8/20
FAST     23.0     24.0         7.1        13.2        8/20       6/20
QueST    48.2     66.9        17.9        52.3       11/20       8/20
OAT1     11.7     50.8        11.3        47.7        7/20       3/20
OAT2     39.8     52.5        16.4        50.3       11/20       9/20
OAT4     46.4     65.3        19.5        51.7       13/20      12/20
OAT8     56.3     73.1        24.4        54.6       16/20      16/20

Compression Rate And Inference Latency

OAT enables a smooth and controllable trade-off between compression rate, inference latency, and policy performance. With full decoding, OAT and QueST have the same amount of compute per inference.

Token count and policy inference latency in milliseconds.
         LIBERO          RoboMimic       MetaWorld       RoboCasa
Policy   #Tok.   Lat.    #Tok.   Lat.    #Tok.   Lat.    #Tok.   Lat.
DP       ×       42.0    ×       38.1    ×       37.7    ×       35.3
Bin      224     517.2   224     509.5   128     306.6   384     888.3
FAST     44.2    114.4   53.1    142.0   49.8    129.7   69.7    166.1
QueST    8       27.1    8       29.6    8       31.4    8       30.2
OAT1     1       10.5    1       11.3    1       15.5    1       13.5
OAT2     2       13.2    2       15.3    2       17.9    2       15.8
OAT4     4       17.4    4       18.4    4       22.1    4       19.8
OAT8     8       27.4    8       29.9    8       31.3    8       30.0

Token Modelability Is The Key To Success

Across all benchmarks, removing the ordering-inducing objective (denoted OAT×) leads to consistent performance degradation: OAT× performs significantly worse than OAT4 and OAT8, and in some cases falls below QueST.

Ordering ablation success rate.
Policy   LIBERO   RoboMimic   MetaWorld   RoboCasa
QueST    48.2     66.9        17.9        52.3
OAT1     11.7     50.8        11.3        47.7
OAT2     39.8     52.5        16.4        50.3
OAT4     46.4     65.3        19.5        51.7
OAT8     56.3     73.1        24.4        54.6
OAT×     35.2     61.1        17.6        48.5

Real-World Execution

More than 90 real-world robot execution videos cover successful and failed attempts across tasks, methods, and camera views. Reloading randomizes the initial configurations. Halting during FAST execution is primarily caused by undecodable action tokens. In such cases, the policy is instructed not to produce any action and to remain stationary for safety reasons.

Videos are grouped by method (DP, Bin, FAST, QueST, OAT1, OAT2, OAT4, OAT8), each showing a success execution and a failure execution.

9. Closing thoughts and open directions.

A recurring question throughout this project has been whether action tokens are still necessary in the presence of powerful continuous models such as flow or diffusion policies. Our view is that future robotic systems will likely combine both discrete and continuous representations rather than choosing one over the other. A concrete example is the BEHAVIOR 2025 Challenge winning solution, which integrates discrete action tokens with continuous action experts.

A key capability enabled by OAT is prefix-based detokenization: actions can be decoded from variable-length token prefixes, yielding an anytime trade-off between computation and action fidelity. In this work, the autoregressive depth is fixed at deployment time. From an information-theoretic perspective, this is suboptimal. The number of tokens required to represent an action chunk should depend on its intrinsic complexity and the precision required for successful execution. Simple, predictable behaviors may admit compact representations, whereas complex, contact-rich interactions may require deeper autoregressive refinement. Estimating this action complexity online, and determining when additional tokens meaningfully reduce uncertainty, remains an open problem. We view adaptive autoregressive depth as a natural and important direction for future work, made possible precisely by OAT's ordered and prefix-decodable structure. Ultimately, we believe this estimation problem deserves a principled solution grounded in uncertainty and information, rather than ad hoc engineering heuristics.
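One way such an adaptive-depth stopping rule could be prototyped (our sketch, not part of the paper; `toy_decode` is a stand-in for a real prefix decoder): keep decoding tokens while the decoded chunk is still changing, and stop once an extra token changes the action by less than a tolerance.

```python
def adaptive_decode(decode_prefix, tokens, tol=1e-2):
    """Decode progressively longer prefixes; stop when the marginal
    change from one more token falls below `tol`."""
    prev = decode_prefix(tokens[:1])
    for k in range(2, len(tokens) + 1):
        cur = decode_prefix(tokens[:k])
        if max(abs(a - b) for a, b in zip(cur, prev)) < tol:
            return prev, k - 1          # the extra token barely helped
        prev = cur
    return prev, len(tokens)

def toy_decode(prefix):
    """Toy progressive decoder: later tokens add geometrically
    shrinking detail, mimicking coarse-to-fine refinement."""
    return [sum(t * 0.5 ** i for i, t in enumerate(prefix))] * 4

chunk, depth = adaptive_decode(toy_decode, tokens=[1.0] * 8, tol=0.05)
assert depth < 8   # stops before full depth on an "easy" chunk
```

A principled version would replace the fixed tolerance with an uncertainty or information criterion, which is precisely the open problem noted above.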

Adaptive depth Stop when the action is good enough.

Simple motions could execute from short prefixes, while contact-rich steps could request more tokens.

Uncertainty Use confidence to allocate compute.

The policy could generate additional tokens only when uncertainty remains high or precision matters.

VLA systems Expose a discrete action interface.

Ordered action tokens could act as an auxiliary supervision signal or planning abstraction.

Hybrid control Combine tokens with continuous experts.

Discrete action reasoning and diffusion or flow decoders do not need to be competing choices.

Appendix

BibTeX
@inproceedings{liu2026orderedactiontokenization, 
    title={OAT: Ordered Action Tokenization}, 
    author={Chaoqi Liu and Xiaoshen Han and Jiawei Gao and Yue Zhao and Haonan Chen and Yilun Du},
    booktitle={Proceedings of Robotics: Science and Systems}, 
    year={2026}
}