Praxis: A Controlled Laboratory for Vision-Language-Action Policy Research

Praxis turns VLA studies into policy cells: choose the objective, action representation, backbone, and evaluator, then leave behind an auditable record of what changed.

Chaoqi Liu1,2

1Harvard University   2Stanford University

Technical report, June 10, 2026.

Abstract

Vision-language-action (VLA) policy research now spans pretrained multimodal backbones, discrete action tokenizers, continuous control objectives, and broad simulated benchmarks. Many comparisons, however, remain benchmark-centric: objective, backbone, tokenizer, decoding rule, normalization state, serving path, and benchmark adapter can change together before a rollout score is reported. Such scores are useful for tracking capability, but weak evidence for what caused a difference. We present Praxis, a research substrate for controlled VLA ablation. praxis-vla represents each policy as a matrix cell over objective, action representation, and vision-language model (VLM) backbone, so a study can name the coordinate it changes. praxis-eval moves benchmark semantics, execution mode, dependency isolation, and evaluation records outside policy families. This artifact paper reports no new architecture or leaderboard result; it defines notation, contracts, and executable boundaries for attributing differences in VLA experiments.

  • Praxis: coordinate system over objectives, VLM backbones, and action codecs
  • praxis-vla: policy construction as explicit cells
  • praxis-eval: shared local and remote evaluation boundary
  • 36 cells: public release across objectives and backbones
  • 3 objectives: BAR, KI, and continuous flow matching
  • 6 codecs: Bin, FAST, QueST, ACodec, OAT, and bOAT
  • 3 backbones: SmolVLM, Qwen3-VL, and PaliGemma2

Praxis in one minute

VLA work needs a lab, not another private stack.

A rollout score often bundles the objective, tokenizer, backbone, simulator adapter, normalization state, and serving path into one number. Praxis makes those choices visible before training or evaluation starts.

  1. The problem: many VLA comparisons replace too much at once, so the measured difference is hard to attribute.
  2. The method: praxis-vla defines policy cells, while praxis-eval fixes the reset/act boundary that benchmark drivers use.
  3. The point: a new study can add one objective, codec, backbone, or benchmark without forking the whole lab.

Why Praxis exists

Benchmarks track capability. Praxis tracks the comparison.

VLA systems are now full pipelines. If one paper swaps action tokens, VLM family, processor logic, action normalization, and rollout glue together, it may compare two pipelines rather than one research idea.

Praxis changes the unit of discussion from a named system to a policy cell. The cell states the objective, action representation, VLM backbone, and evaluator interface that are in play.

  • Policy cell. A declared combination of objective, action representation, backbone, and evaluator binding.
  • Study axis. The part of the cell a study intends to move, such as OAT versus bOAT or BAR versus KI.
  • Evaluation boundary. The narrow reset/act protocol that keeps simulator semantics benchmark-owned.
  • Artifact record. The metrics, configs, action specs, logs, and runtime metadata needed to inspect the result.
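
The policy-cell and study-axis definitions above can be made concrete with a small record type. The following is a minimal sketch, not the praxis-vla API: the `PolicyCell` class, its field names, the `study_axes` helper, and the `"example-benchmark"` evaluator name are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class PolicyCell:
    """Hypothetical record of one policy cell (names are illustrative)."""
    objective: str          # e.g. "BAR", "KI", or "FM"
    codec: Optional[str]    # action representation; None for tokenless FM cells
    backbone: str           # e.g. "SmolVLM", "Qwen3-VL", "PaliGemma2"
    evaluator: str          # evaluator binding (placeholder value below)


def study_axes(a: PolicyCell, b: PolicyCell) -> list[str]:
    """Return the names of the coordinates on which two cells differ."""
    return [f.name for f in fields(PolicyCell)
            if getattr(a, f.name) != getattr(b, f.name)]


baseline = PolicyCell("BAR", "OAT", "SmolVLM", "example-benchmark")
variant = PolicyCell("BAR", "bOAT", "SmolVLM", "example-benchmark")
confounded = PolicyCell("KI", "bOAT", "Qwen3-VL", "example-benchmark")

print(study_axes(baseline, variant))     # ['codec']
print(study_axes(baseline, confounded))  # ['objective', 'codec', 'backbone']
```

A comparison whose `study_axes` result has exactly one entry moves a single coordinate; a longer list signals that several pipeline choices changed together.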

Two halves

Praxis is praxis-vla plus praxis-eval.

The brand is intentionally bigger than a single repository. praxis-vla owns the policy side of the matrix. praxis-eval owns the benchmark-facing protocol. Together they let a cell be built, run, and inspected.

Figure: Policy-cell assembly flow from objective, action codec, VLM adapter, and evaluation binding into shared interfaces. Local choices enter through objective, action codec, VLM adapter, and evaluation binding; supported cells carry shared interfaces for policy I/O, action specs, prefix state, artifacts, environment setup, and runtime.

  • Policy side (praxis-vla). Registers objectives, action representations, VLM adapters, policy I/O, and saved artifacts.
  • Evaluation side (praxis-eval). Runs local or remote policies through benchmark drivers with a shared reset/act interface.
  • Brand promise: change one thing at a time. Keep the surrounding protocol visible so the result reads as a study, not just a score.

Policy design matrix

A VLA policy becomes a cell before it becomes a claim.

Tokenized policies occupy objective, action-codec, and backbone axes. Tokenless flow-matching policies occupy the objective-backbone plane. The current release is concrete, and the axes are open-ended by design.

Figure: Praxis policy-cell geometry with objective layers, action-codec axis, and VLM-backbone axis. The public count multiplies across backbones: 18 BAR cells, 15 KI cells, and 3 FM cells. Axis ellipses show where future objectives, codecs, and backbones enter.
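
The cell counts above can be checked with a few lines of arithmetic. This sketch assumes BAR pairs every listed codec with every backbone, which matches the reported 18. For KI, only the total of 15 is reported; the even split across backbones is an assumption, and the text does not say which codec KI omits, so the code leaves that unnamed.

```python
from itertools import product

# Axis values as listed in the release summary.
BACKBONES = ["SmolVLM", "Qwen3-VL", "PaliGemma2"]
CODECS = ["Bin", "FAST", "QueST", "ACodec", "OAT", "bOAT"]

# Assumption: BAR pairs every codec with every backbone (consistent with 18).
bar_cells = list(product(["BAR"], CODECS, BACKBONES))

# FM is tokenless, so its cells have no codec coordinate.
fm_cells = list(product(["FM"], BACKBONES))

# KI: only the total is reported. Assumption: it splits evenly over backbones.
KI_CELLS_REPORTED = 15
ki_codecs_per_backbone, remainder = divmod(KI_CELLS_REPORTED, len(BACKBONES))

total_cells = len(bar_cells) + KI_CELLS_REPORTED + len(fm_cells)
print(len(bar_cells), KI_CELLS_REPORTED, len(fm_cells), total_cells)  # 18 15 3 36
```

Under the even-split assumption, KI covers five of the six codecs on each backbone.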
  • Objective axis: where does action generation live? BAR predicts action-token blocks, KI supervises VLM context with tokens, and FM predicts continuous action chunks.
  • Action axis: what language does the policy model? Bin, FAST, QueST, ACodec, OAT, and bOAT expose different rate, distortion, and modelability tradeoffs.
  • Backbone axis: which VLM context feeds control? SmolVLM, Qwen3-VL, and PaliGemma2 are treated as swappable policy inputs with adapter-owned preprocessing.
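
The rate and distortion tradeoff on the action axis can be illustrated with the simplest possible tokenizer. The sketch below assumes that a "Bin" style codec means uniform per-dimension binning of actions normalized to [-1, 1]; the page does not define the codec, so the `encode`, `decode`, and `max_error` functions are illustrative stand-ins rather than the Praxis implementation.

```python
import math


def encode(x: float, n_bins: int, lo: float = -1.0, hi: float = 1.0) -> int:
    """Map a normalized action value to a bin index in [0, n_bins - 1]."""
    x = min(max(x, lo), hi)                   # clip to the valid range
    idx = int((x - lo) / (hi - lo) * n_bins)  # uniform bin index
    return min(idx, n_bins - 1)               # x == hi falls in the last bin


def decode(idx: int, n_bins: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Return the center of the given bin."""
    width = (hi - lo) / n_bins
    return lo + (idx + 0.5) * width


def max_error(n_bins: int, samples: int = 1001) -> float:
    """Worst round-trip error over an evenly spaced grid on [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(x - decode(encode(x, n_bins), n_bins)) for x in xs)


for n_bins in (16, 64, 256):
    bits = math.log2(n_bins)  # rate: bits per action dimension
    print(f"bins={n_bins:4d} bits={bits:.0f} max_error={max_error(n_bins):.5f}")
```

More bins cost more bits per dimension and shrink the reconstruction error, which is bounded by half a bin width. Modelability, the third tradeoff named above, is not captured by this toy.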

Policy cells

Each objective isolates a different control hypothesis.

The policy-cell notation is bookkeeping with teeth. It states what a comparison intends to vary before training, loading, serving, or rollout begins.

Figure: BAR, FM, and KI objective families, showing whether the VLM lane or the action-expert lane owns action generation. BAR makes the VLM generate scheduled action-token blocks. FM keeps final control in a continuous expert. KI uses token supervision for VLM context, then routes control through a stop-gradient flow expert.

  • BAR. Generate action tokens in scheduled blocks, then detokenize them into executable control.
  • KI. Use action tokens as VLM supervision while a continuous expert owns final action generation.
  • FM. Remove the discrete-token axis and predict continuous action chunks directly.

Evaluation boundary

Evaluation is a protocol, not hidden glue.

A policy is meaningful only relative to the observations it receives, the action convention it must satisfy, the tasks selected for rollout, and the simulator runtime that executes those actions. praxis-eval keeps those responsibilities separated.

Figure: praxis-eval boundary separating benchmark-owned evaluation from policy-owned execution. Benchmark drivers own rollout semantics, metrics, and artifacts. Policies receive declared observations and return actions through the same local or remote protocol.

  • Benchmark-owned semantics. Task selection, simulator stepping, metrics, and records stay inside the driver.
  • Policy-owned execution. The policy exposes reset and act behavior without taking over the evaluator.
  • Structured records. Each rollout can carry metrics, artifacts, metadata, and runtime details for later inspection.
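
The split of responsibilities listed above can be sketched as a small reset/act protocol with a toy driver. This is a hedged illustration, not the praxis-eval interface: the `Policy` protocol signature, the `RolloutRecord` fields, the one-dimensional toy environment, and the `run_episode` driver are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class Policy(Protocol):
    """Hypothetical policy-facing boundary: reset per episode, act per step."""
    def reset(self, task: str) -> None: ...
    def act(self, observation: dict[str, Any]) -> list[float]: ...


@dataclass
class RolloutRecord:
    """Structured result that the driver, not the policy, produces."""
    task: str
    steps: int
    success: bool
    metadata: dict[str, Any] = field(default_factory=dict)


class TowardZeroPolicy:
    """Toy policy: output the action that cancels the observed position."""
    def reset(self, task: str) -> None:
        self.task = task

    def act(self, observation: dict[str, Any]) -> list[float]:
        return [-observation["position"]]


def run_episode(policy: Policy, task: str, max_steps: int = 5) -> RolloutRecord:
    """Toy driver: owns stepping, the success metric, and the record."""
    policy.reset(task)
    position = 1.0  # toy one-dimensional simulator state
    for step in range(1, max_steps + 1):
        action = policy.act({"position": position, "instruction": task})
        position += action[0]  # simulator stepping stays in the driver
        if abs(position) < 1e-6:
            return RolloutRecord(task, step, True, {"final_position": position})
    return RolloutRecord(task, max_steps, False, {"final_position": position})


record = run_episode(TowardZeroPolicy(), "move to origin")
print(record)
```

Because the driver only calls `reset` and `act`, the same loop could in principle talk to a local object or a remote client that implements the same two methods.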

Extension path

New research should extend an axis, not fork the lab.

Praxis is not a closed catalog. It is a structure for adding new objectives, action representations, backbones, and benchmarks while preserving shared interfaces where possible.

  1. Add an objective. Declare the supervision signal, inference procedure, rollout-state needs, and compatible action representations.
  2. Add an action representation. Declare token space, decode semantics, prefix behavior, schedules, and validity assumptions.
  3. Add a VLM backbone. Localize processor, prompt format, image handling, prefix state, hidden geometry, and cache behavior.
  4. Add a benchmark. Keep simulator semantics in a driver and expose policy-facing observations, actions, metrics, and records.
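
One way to picture "extend an axis, not fork the lab" is a per-axis registry, where a new entry is added without touching existing ones. The sketch below is an assumption about how such an extension point could look; `REGISTRY`, `register`, `build`, and the `"toy-identity"` codec are hypothetical names and do not come from praxis-vla or praxis-eval.

```python
from typing import Any, Callable

# Hypothetical axes, mirroring the four extension steps above.
AXES = ("objective", "codec", "backbone", "benchmark")
REGISTRY: dict[str, dict[str, Callable[..., Any]]] = {axis: {} for axis in AXES}


def register(axis: str, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a factory to one axis without editing the others."""
    if axis not in REGISTRY:
        raise KeyError(f"unknown axis: {axis}")

    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        if name in REGISTRY[axis]:
            raise ValueError(f"{axis} '{name}' is already registered")
        REGISTRY[axis][name] = factory
        return factory

    return decorator


def build(axis: str, name: str, **kwargs: Any) -> Any:
    """Look up a registered factory and construct the component."""
    return REGISTRY[axis][name](**kwargs)


@register("codec", "toy-identity")
def make_identity_codec() -> dict[str, Callable[[list[float]], list[float]]]:
    """Toy action representation whose tokens are the actions themselves."""
    return {"encode": lambda actions: list(actions),
            "decode": lambda tokens: list(tokens)}


codec = build("codec", "toy-identity")
print(codec["decode"](codec["encode"]([0.1, -0.2])))  # [0.1, -0.2]
```

A real extension would also declare the items each step lists, such as token space, decode semantics, and prefix behavior for a codec; the toy factory omits them for brevity.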

Appendix

Readers interested in using Praxis should start with the public docs for praxis-vla and praxis-eval.

BibTeX
@misc{liu2026praxis,
  title={Praxis: A Controlled Laboratory for Vision-Language-Action Policy Research},
  author={Liu, Chaoqi},
  year={2026},
  url={https://ordered-action-tokenization.github.io/praxis/}
}