Ordered Action Tokenization

Ordered action tokens for autoregressive robot policies: compact enough to generate, always decodable, and ordered so prefixes are useful.

RSS 2026 accepted paper, expanded from the original Article with native web controls.

Chaoqi Liu¹, Xiaoshen Han¹, Jiawei Gao¹, Yue Zhao², Haonan Chen¹, and Yilun Du¹

¹Harvard University · ²Stanford University

OAT teaser summarizing tokenizer desiderata and policy performance

Compact: short action-token sequences

Total: all supported prefixes decode

Ordered: prefixes carry coarse motion

Anytime: trade tokens for latency

20+ simulation and real tasks

Paper in one minute

Action tokenization defines the policy's prediction problem.

An autoregressive robot policy does not directly predict continuous control. It predicts tokens for a short future sequence of robot actions, or action chunk, then trusts the detokenizer to recover executable continuous control. That makes tokenization a learning problem: a representation can reconstruct well and still be slow, invalid under sampling, or difficult to predict left-to-right.

The problem: existing tokenizers expose a three-way tension between compression, total decodability, and autoregressive predictability.
The method: OAT learns ordered discrete register tokens and trains every prefix to decode into a complete action chunk.
The result: token count becomes a runtime budget, with fewer tokens for low-latency coarse control and more tokens when the task needs precision.

1. The tokenizer defines what the policy has to predict.

Discrete action tokens are becoming an increasingly important design choice in modern robot learning systems: RDT-2 employs vector-quantized (VQ) action tokens for its stage-1 training; TRI's LBM/VLA leverages FAST and VQ-style tokenizations; and the winning solution of the BEHAVIOR 2025 Challenge integrated FAST tokens in training and inference.

In all of these systems, the policy sees symbols before it sees control. The tokenizer determines sequence length, the set of samples that can be decoded safely, and the left-to-right structure the model must learn. It is not a preprocessing detail; it is the prediction problem.

2. Reconstruction error is not enough.

Classical rate-distortion theory balances compression rate against reconstruction fidelity. For generative robot policies, we argue that a third axis, modelability, is crucial and often overlooked: how difficult it is for a generative model to capture the distribution of a representation. Poorly structured representations may be compact and accurate, yet fundamentally hard to model.

This is the central distinction: a tokenizer can reconstruct actions well and still be poor for policy learning. If the token stream has low autoregressive modelability, or is sparse and high-entropy, the model pays that cost at every next-token prediction step.

Rate: How many tokens?

Shorter action codes reduce autoregressive depth and latency.

Distortion: How much action detail?

Continuous robot control still needs enough precision for contact-rich execution.

Modelability: How predictable is the sequence?

Token order should make next-token prediction easier, not merely possible.

Reconstruction is not enough.

For robot control, the tokenizer is useful only if the downstream policy can reliably model its tokens. Low reconstruction error matters, but it does not guarantee a token sequence with stable left-to-right structure.

3. A useful action tokenizer must be compact, total, and ordered.

The design target is not a single metric. An action tokenizer for autoregressive policies has to satisfy three requirements at the same time:

(P.1) Reasonable compression. The representation should compress action chunks enough to enable efficient sequence modeling, but not so aggressively that too much information is lost.
(P.2) Total decodability. The detokenization mapping should be a well-defined total function: every token sequence in the discrete token space must decode to a valid action chunk. This is essential because policies may generate arbitrary token sequences at inference time. If decoding is only partially defined, invalid tokens can lead to undefined behavior or catastrophic failures during execution.
(P.3) Predictive ordering. Token sequences should admit a meaningful left-to-right causal structure aligned with next-token prediction. This structure is critical for modelability, allowing autoregressive models to learn stable, predictable token dynamics.

Where does each tokenizer give up?

Read each tokenizer family against the three requirements: compression, total decodability, and autoregressive modelability. Binning, for example, scores as follows.

Compression: Low

Total decodability: Yes

Autoregressive modelability: Low

Binning is universally decodable but produces long, flat token sequences that are hard for autoregressive policies to model efficiently.

4. Prior tokenizers each give up one requirement.

Before OAT, each major tokenizer family misses the target in a different way. The OAT row shows the design point we want: compact, total, and ordered.

Binning: Valid but slow.

Every generated token decodes, but the policy must generate hundreds of flat dimension-time tokens.

FAST: Compact but partial.

The frequency structure helps next-token prediction, but arbitrary BPE sequences may not decode.

Latents: Compact but low-modelability.

A neural decoder makes outputs valid, but the token sequence has weak autoregressive modelability.

OAT: Compact, valid, high-modelability.

Tokens are learned as a progressive sequence, so early predictions carry coarse motion structure.

Binning. The most common scheme is per-dimension, per-timestep binning. While simple, it scales poorly: long horizons and high-dimensional actions can produce hundreds of tokens per chunk, which slows training and inference and increases latency. More importantly, such long, flat sequences have poor modelability across dimensions: knowing the earlier coordinates at a timestep offers little help in predicting the next one, making binning poorly aligned with autoregressive generation.
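To make the token count concrete, here is a minimal sketch of per-dimension, per-timestep binning. The bin count (256), the action range [-1, 1], and the chunk shape (32 steps, 7 dimensions) are illustrative assumptions, not values taken from the paper. The sketch shows that any token sequence of the right length decodes, but that the sequence length grows as horizon times action dimension.

```python
import numpy as np

def bin_encode(actions, num_bins=256, low=-1.0, high=1.0):
    """Map a (horizon, action_dim) chunk to one token per scalar entry."""
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)            # values in [0, 1]
    tokens = np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)
    return tokens.reshape(-1)                          # flat, length horizon * action_dim

def bin_decode(tokens, horizon, action_dim, num_bins=256, low=-1.0, high=1.0):
    """Map tokens back to bin centers; defined for every token in [0, num_bins)."""
    tokens = np.asarray(tokens, dtype=np.int64)
    centers = low + (tokens + 0.5) * (high - low) / num_bins
    return centers.reshape(horizon, action_dim)

# Assumed chunk shape: 32 timesteps x 7 action dimensions -> 224 tokens.
rng = np.random.default_rng(0)
chunk = rng.uniform(-1.0, 1.0, size=(32, 7))
tokens = bin_encode(chunk)
recon = bin_decode(tokens, horizon=32, action_dim=7)
print(tokens.size, float(np.max(np.abs(recon - chunk))))
```

Reconstruction error is bounded by half a bin width, so distortion is low and decoding is total, but the policy still has to generate every one of those tokens sequentially.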

Frequency-domain transform. Frequency-based methods such as FAST achieve high information density (P.1) and impose a low-to-high frequency structure (P.3), where early tokens capture global trajectory structure and later tokens refine details. However, FAST violates P.2 (total decodability). Because Byte Pair Encoding (BPE) produces variable-length sequences, arbitrary token sequences may not decode into a valid fixed-size frequency representation, leading to undefined behavior and runtime failures. See the paper appendix and the discussion on Hugging Face for further details.
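The decodability issue can be illustrated with a toy variable-length code. This is not FAST's actual implementation: the `EXPANSION` table below is a hypothetical stand-in for a BPE vocabulary in which each token expands to a different number of quantized coefficients. It only shows why a fixed-size target makes decoding a partial function.

```python
import numpy as np

# Hypothetical merge table: token id -> run of base symbols (quantized coefficients).
EXPANSION = {
    0: (0,),
    1: (1,),
    2: (0, 0),
    3: (1, 0, 1),
}

def expand(tokens):
    """Undo the merges: concatenate the base symbols each token stands for."""
    flat = []
    for t in tokens:
        flat.extend(EXPANSION[t])
    return flat

def try_decode(tokens, horizon, action_dim):
    """Return a (horizon, action_dim) coefficient matrix, or None if the
    expanded length does not match the fixed size the decoder expects."""
    flat = expand(tokens)
    if len(flat) != horizon * action_dim:
        return None
    return np.asarray(flat).reshape(horizon, action_dim)

print(try_decode([2, 2], horizon=2, action_dim=2))   # expands to 4 symbols: decodes
print(try_decode([3, 2], horizon=2, action_dim=2))   # expands to 5 symbols: None
```

A policy that samples tokens freely can land on the second kind of sequence, which is the failure mode P.2 rules out.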

Vanilla latents. Learned encoder-decoder latent tokenizers can achieve strong compression (P.1), and neural decoders ensure total decodability (P.2). However, the resulting token sequences often have weak autoregressive modelability: token positions do not provide a stable left-to-right structure for next-token prediction. This violates P.3 and leaves them poorly aligned with policies that generate tokens autoregressively.

The gap is not that prior tokenizers are weak in every way. It is that none simultaneously makes the sequence short, makes the whole token space executable, and gives the policy an easy left-to-right prediction problem.

5. OAT learns ordered registers that decode from any prefix.

OAT is a learned tokenizer for action chunks. It writes each chunk into a fixed set of register latents, discretizes those registers with finite scalar quantization (FSQ), and decodes generated tokens back into continuous control. The key design choice is order: register attention is causal, and nested dropout trains the decoder to reconstruct from partial prefixes.

Summarize the action chunk. A transformer encoder reads the continuous action sequence and writes the important temporal information into a fixed set of register tokens.
Discretize the registers. Finite scalar quantization turns the register latents into discrete tokens that an autoregressive policy can predict.
Force a left-to-right structure. Causal attention makes later registers depend on earlier registers, aligning the representation with next-token generation.
Train with missing tails. Nested dropout randomly masks later tokens during tokenizer training, so early tokens must carry the highest-priority information.
Decode back to control. A conditional decoder maps the generated token prefix back into a continuous action chunk for execution.
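The discretization step in the list above can be sketched as follows. This is a simplified illustration of finite scalar quantization, not the paper's code: it assumes odd level counts per latent dimension (the hypothetical `[5, 5, 5]` gives a 125-entry vocabulary) and omits the straight-through gradient used during training.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Bound each latent dimension with tanh, then round to an integer grid.
    With L (odd) levels, codes lie in {-(L-1)/2, ..., (L-1)/2}."""
    half = (np.asarray(levels) - 1) / 2.0
    return np.round(np.tanh(z) * half).astype(np.int64)

def codes_to_index(codes, levels):
    """Pack per-dimension codes into one token id (mixed-radix number)."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2                       # shift to 0..L-1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

def index_to_codes(index, levels):
    """Inverse of codes_to_index: every id in [0, prod(levels)) is valid."""
    levels = np.asarray(levels)
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    digits = (np.asarray(index)[..., None] // basis) % levels
    return digits - (levels - 1) // 2

levels = [5, 5, 5]                           # assumed, for illustration only
z = np.array([[0.0, 10.0, -10.0]])           # one register latent
codes = fsq_quantize(z, levels)
token = codes_to_index(codes, levels)
print(codes, token, index_to_codes(token, levels))
```

Because the codebook is an implicit grid, every token id maps back to a valid latent, which is what lets the decoder remain a total function over the token space.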

OAT method overview with register tokens and prefix decoding

6. Ordered tokens turn action coding into progressive refinement.

The ordering induced by OAT admits a natural interpretation through information theory. Shannon showed that the optimal code length for an event scales with the negative logarithm of its probability, so frequent patterns require fewer bits, while rare events demand more representational capacity. Action chunks follow a similarly skewed distribution: most trajectories share common coarse structure, whereas fine-grained deviations occur infrequently.
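As a brief worked illustration of the code-length statement above (standard information theory, not a derivation from the paper), let $p(x)$ denote the probability of an event $x$ and $\ell^{*}(x)$ its ideal code length in bits:

```latex
\ell^{*}(x) = -\log_2 p(x),
\qquad
\mathbb{E}\big[\ell^{*}(X)\big] = -\sum_{x} p(x)\,\log_2 p(x) = H(X).
```

For instance, a pattern with $p(x) = 1/2$ needs 1 bit, while one with $p(x) = 1/64$ needs 6 bits. The analogy to action tokens is informal: common coarse motions are cheap to describe, and rare fine corrections cost more.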

From this perspective, OAT learns a progressive code. Early tokens capture high-probability, globally shared motion patterns; later tokens encode increasingly rare residual details. Nested dropout makes this pressure explicit: every short prefix has to reconstruct the action, so information is allocated in decreasing order of usefulness.
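One way to picture the nested dropout pressure described above is as a random prefix mask applied during tokenizer training. The sketch below assumes the kept prefix length is drawn uniformly from 1 to the number of registers and that dropped registers are replaced by zeros; both are illustrative choices, and the paper may use a different sampling distribution or mask embedding.

```python
import numpy as np

def nested_dropout_mask(batch_size, num_tokens, rng):
    """Boolean mask of shape (batch_size, num_tokens): True for kept positions.
    Each row keeps a prefix of random length k in {1, ..., num_tokens}."""
    keep = rng.integers(1, num_tokens + 1, size=batch_size)
    positions = np.arange(num_tokens)
    return positions[None, :] < keep[:, None]

def apply_mask(register_latents, mask, fill_value=0.0):
    """Replace dropped registers (the masked tail) with a fill value."""
    return np.where(mask[..., None], register_latents, fill_value)

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 3))          # (batch, registers, latent dim), assumed sizes
mask = nested_dropout_mask(4, 8, rng)
masked = apply_mask(latents, mask)
print(mask.sum(axis=1))                        # kept prefix length per sample
```

Since the first register survives every mask and the last survives only the longest one, the reconstruction loss pushes the most useful information toward the front of the sequence.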

Order is a modeling signal.

The first token is not an arbitrary latent slot; it is trained to carry the highest-priority control information.

Longer prefixes refine the same action.

Additional tokens reduce residual error instead of replacing the action represented by the prefix.

7. Every prefix is an executable action budget.

Because OAT trains the decoder on masked suffixes, a policy does not have to finish the token sequence before acting. A prefix can be padded, detokenized, and executed as a complete action chunk. In practice, token count becomes a runtime budget: one or two tokens for fast coarse control, more tokens when the task needs precision.
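The padding step mentioned above might look like the following sketch. `MASK_ID` and `pad_prefix` are hypothetical names introduced for illustration; the actual padding token and decoder interface are not specified here.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for "token not generated"

def pad_prefix(prefix, num_tokens, mask_id=MASK_ID):
    """Pad a generated prefix to the full register length.
    Returns the padded token array and a boolean validity mask."""
    prefix = np.asarray(prefix, dtype=np.int64)
    if not 1 <= prefix.size <= num_tokens:
        raise ValueError("prefix length must be between 1 and num_tokens")
    padded = np.full(num_tokens, mask_id, dtype=np.int64)
    padded[: prefix.size] = prefix
    valid = np.arange(num_tokens) < prefix.size
    return padded, valid

# A policy that stopped after two tokens still yields a full-length decoder input.
padded, valid = pad_prefix([7, 3], num_tokens=8)
print(padded, valid)
```

The decoder then receives the same input shape regardless of how early generation stopped, which is what makes stopping a pure latency decision.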

Stop early, keep the action valid.

OAT makes every prefix executable. More tokens refine the trajectory, but the policy can stop early when latency matters.

Legend: decoded prefix (red), ground truth (green).

With 1 prefix token: one token decodes a complete action chunk, but the reconstruction is coarse and visibly offset from the ground truth.

Each prefix decodes to a complete action chunk. Green points are ground-truth waypoints; red points are the full chunk reconstructed from the selected prefix. More prefix tokens reduce the red-green error and increase fine-grained fidelity.

8. OAT improves success, latency, and real execution.

We evaluate OAT across more than 20 tasks spanning four simulation benchmarks (LIBERO, RoboMimic, MetaWorld, and RoboCasa) and real-world robot execution. The experiments test whether ordered prefixes are only a nice representation, or whether they produce better closed-loop policies.

Result map

Success: Does the full ordered representation beat existing tokenizers?

OAT₈ is best in every reported success column.

Latency: Can shorter prefixes buy speed while staying executable?

OAT₁, OAT₂, and OAT₄ reduce sequential depth while preserving valid decoding.

Ablation: Is ordering the reason?

The ordering ablation drops below OAT₄ and OAT₈, showing that causal registers and nested dropout do real policy work.

Full OAT wins across simulation and real tasks.

OAT₈ achieves the best reported success in every simulation benchmark and both real-world tasks, while preserving the prefix execution option unavailable to fixed tokenizers.

Simulation success rate across manipulation benchmarks, with real-world success over 20 independent trials.
Policy	LIBERO	RoboMimic	MetaWorld	RoboCasa	PnP Ball	Stack Cups
DP	36.6	67.1	19.3	54.0	14/20	11/20
Bin	14.4	39.5	14.5	27.7	4/20	8/20
FAST	23.0	24.0	7.1	13.2	8/20	6/20
QueST	48.2	66.9	17.9	52.3	11/20	8/20
OAT₁	11.7	50.8	11.3	47.7	7/20	3/20
OAT₂	39.8	52.5	16.4	50.3	11/20	9/20
OAT₄	46.4	65.3	19.5	51.7	13/20	12/20
OAT₈	56.3	73.1	24.4	54.6	16/20	16/20

Full OAT matches QueST's eight-token latency, but shorter prefixes are much faster.

With full decoding, OAT and QueST have comparable autoregressive depth. The difference is that OAT can stop at one, two, or four tokens when latency matters.

Token count and policy inference latency in milliseconds.
Policy	LIBERO #Tok.	LIBERO Lat.	RoboMimic #Tok.	RoboMimic Lat.	MetaWorld #Tok.	MetaWorld Lat.	RoboCasa #Tok.	RoboCasa Lat.
DP	×	42.0	×	38.1	×	37.7	×	35.3
Bin	224	517.2	224	509.5	128	306.6	384	888.3
FAST	44.2	114.4	53.1	142.0	49.8	129.7	69.7	166.1
QueST	8	27.1	8	29.6	8	31.4	8	30.2
OAT₁	1	10.5	1	11.3	1	15.5	1	13.5
OAT₂	2	13.2	2	15.3	2	17.9	2	15.8
OAT₄	4	17.4	4	18.4	4	22.1	4	19.8
OAT₈	8	27.4	8	29.9	8	31.3	8	30.0
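Read as a runtime budget, the table supports a simple lookup: pick the longest prefix whose measured latency fits a deadline. The sketch below uses the LIBERO latencies from the table; the function name and the deadline values are illustrative assumptions.

```python
# LIBERO policy inference latency in milliseconds, keyed by prefix length
# (values copied from the OAT rows of the table above).
LIBERO_LATENCY_MS = {1: 10.5, 2: 13.2, 4: 17.4, 8: 27.4}

def largest_prefix_within(budget_ms, latency_ms=LIBERO_LATENCY_MS):
    """Return the longest prefix length whose latency fits the budget,
    or None if even a single token is too slow."""
    feasible = [k for k, lat in latency_ms.items() if lat <= budget_ms]
    return max(feasible) if feasible else None

for budget in (10.0, 15.0, 20.0, 30.0):
    print(budget, largest_prefix_within(budget))
```

Under a 20 ms deadline this selects four tokens, trading some success rate for roughly 10 ms of latency relative to the full eight-token decode.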

Ordering is doing the policy work.

Across the four simulation benchmarks, removing the ordering-inducing objective (OAT_×) causes a consistent degradation. OAT_× scores below both OAT₄ and OAT₈ on every benchmark, and it also falls below QueST on all four, though only narrowly on MetaWorld (17.6 vs. 17.9).

Ordering ablation success rate.
Policy	LIBERO	RoboMimic	MetaWorld	RoboCasa
QueST	48.2	66.9	17.9	52.3
OAT₁	11.7	50.8	11.3	47.7
OAT₂	39.8	52.5	16.4	50.3
OAT₄	46.4	65.3	19.5	51.7
OAT₈	56.3	73.1	24.4	54.6
OAT_×	35.2	61.1	17.6	48.5

The same action becomes sharper as the prefix grows.

These MeshCat reconstructions show the mechanism behind the prefix budget: early tokens recover the coarse motion, while additional tokens refine the residual details. All trajectories are generated by the same tokenizer and decoder.

1 token

MSE = 0.592

2 tokens

MSE = 0.446

4 tokens

9. Token order turns representation design into a control decision.

OAT's central point is not that discrete tokens replace continuous controllers. It is that when robot policies use action tokens, token order becomes part of the control problem. Compactness, total decodability, and prefix modelability should be designed together; once they are, token count becomes a runtime budget instead of a fixed preprocessing choice.

This also changes the natural next question. In this work, autoregressive depth is fixed at deployment time. Future policies could decide depth online: simple motions may execute from short prefixes, while contact-rich steps may request more tokens before acting. That makes adaptive token budgets a concrete direction enabled by ordered, prefix-decodable action representations.

The follow-up page, "Tokens in the Era of VLAs", takes up several of these questions directly: block-wise autoregressive generation, ordered-token supervision for VLA cotraining, and hybrid control where a continuous expert executes the final action.

Adaptive depth: Stop when the action is good enough.

Simple motions could execute from short prefixes, while contact-rich steps could request more tokens.

Uncertainty: Use confidence to allocate compute.

The policy could generate additional tokens only when uncertainty remains high or precision matters.

VLA systems: Use ordered tokens as supervision.

They can provide action-aware targets even when a continuous expert executes the final control.

Hybrid control: Combine tokens with continuous experts.

Discrete action reasoning and diffusion or flow decoders do not need to be competing choices.

Appendix

@misc{liu2026oatorderedactiontokenization,
      title={OAT: Ordered Action Tokenization},
      author={Chaoqi Liu and Xiaoshen Han and Jiawei Gao and Yue Zhao and Haonan Chen and Yilun Du},
      year={2026},
      eprint={2602.04215},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.04215},
}
