NAC: Neural Action Codec for Vision-Language-Action Models

The University of Texas at Dallas
Figure 1: Neural audio codecs adapted for action tokenization. Top: Modern neural codecs compress raw waveforms into compact, multi-scale discrete codes, preserving coarse structures and fine temporal details. Bottom: NAC applies this approach to robot action chunks, treating actions as multi-channel 1D signals to learn a highly compressed, discrete latent space for downstream autoregressive policy learning.

Abstract

Vision-language-action (VLA) models rely on discrete action tokenizers to bridge continuous robot control and autoregressive sequence modeling, yet existing tokenizers often trade off between compression, latency, and downstream performance. We revisit this design through the lens of neural audio codecs—convolutional encoder–decoder architectures with residual vector quantization that serve as the standard front end for audio foundation models. Motivated by their success, we introduce the Neural Action Codec (NAC), which treats short robot action trajectories as multi-channel 1D signals and compresses them using a multi-scale RVQGAN architecture. We observe that audio-specific mel-spectrogram objectives are ill-suited for kinematic signals; however, by replacing them with simple time-domain and non-mel spectral reconstruction losses, audio-codec-style models can autoencode actions with high fidelity without substantial architectural changes. NAC provides a compact, ordered token space via offset codebooks, enabling standard autoregressive policies to operate over short, structured sequences. Meanwhile, a Vocos-style decoder with an ISTFT head and adversarial discriminators recovers smooth, detailed trajectories. Across LIBERO-10, RoboMimic, and a suite of real-world manipulation tasks, NAC achieves lower reconstruction error and higher success rates than binning, FAST, and prior VQ-based tokenizers at comparable or better compression rates. These results demonstrate that repurposed neural audio codecs offer a strong, practical backbone for learned action tokenization in modern VLAs.

Motivation: Actions as Signals

The action tokenizer is a critical design choice in VLA pretraining. Early methods relied on uniform per-dimension binning, producing prohibitively long token sequences for high-frequency control. Subsequent approaches like FAST used frequency-domain compression to shorten sequences. The primary difficulty lies in capturing the statistical regularities that let a policy model the underlying action distribution; a secondary challenge is latency, dictated by the compression rate and the tokenizer's decoding speed.

The audio generation domain has extensively studied and largely mitigated these same challenges. Audio and robotic actions share a continuous time-series structure but differ notably: action sequences are sampled at far lower rates (\(30\text{–}60\,\text{Hz}\)) than audio (\(16\text{–}48\,\text{kHz}\)), and are multi-channel (typically 7–14 dimensions for robot joints or end-effectors). Despite these differences, the core objective of compressing continuous spatio-temporal signals remains fundamentally similar. Our key finding is that adapting audio codecs to actions primarily requires rethinking frequency-domain objectives: once mel-frequency losses are replaced with time-domain or non-mel spectral losses, multi-scale RVQGAN models compress action sequences without major architectural changes.

Method: Neural Action Codec

NAC maps continuous action chunks to 1D signals, compresses them via a convolutional encoder and multi-scale RVQ, and decodes them with a Vocos-style decoder. We then train an autoregressive behavioral cloning policy that leverages NAC's structured token space.

Figure 2: NAC overview. A continuous action chunk is flattened into a 1D pseudo-waveform and encoded by a SEANet-style encoder. Multi-scale residual vector quantization (MRVQ) compresses the latent into discrete codes at progressively finer temporal resolutions. A Vocos-style decoder with an ISTFT head reconstructs the action chunk, with a DAC discriminator providing an adversarial signal. The policy then models the resulting offset code sequence autoregressively for downstream control.
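As a rough illustration of the residual quantization step named in the overview, the sketch below implements plain, single-scale residual vector quantization in pure Python. It is not NAC's implementation: the codebooks are random rather than learned, the sizes `DIM`, `STAGES`, and `SIZE` are toy values chosen for illustration, and the multi-scale temporal resampling is omitted entirely.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = List[Vector]


def nearest_index(codebook: Codebook, vec: Sequence[float]) -> int:
    """Index of the codeword with the smallest squared L2 distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec: Sequence[float], codebooks: Sequence[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Sum the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy configuration (assumed for illustration, not NAC's settings).
DIM, STAGES, SIZE = 4, 3, 16
rng = random.Random(0)
codebooks = [
    [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(SIZE)]
    for _ in range(STAGES)
]

latent = [0.3, -0.2, 0.8, 0.1]
codes, residual = rvq_encode(latent, codebooks)
approx = rvq_decode(codes, codebooks)
```

Each stage quantizes what the earlier stages left unexplained, so `latent` always equals `approx` plus the final `residual`, and the discrete representation is just the list `codes`.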

Tokenizer

Behavioral Cloning Policy

We train an autoregressive policy on top of the frozen NAC tokenizer. To provide structured generation, NAC uses offset codebooks: with \(n_q\) scales, the policy vocabulary size is \(|\mathcal{V}| = n_q \times V_{\text{bins}} + 1\), accounting for a beginning-of-sequence token. All tokens for scale 0 are predicted first, followed sequentially by scale 1, and so forth. At inference, the policy generates tokens with this fixed layout, partitions them into per-scale segments, recovers code indices via modulo arithmetic, and decodes through the frozen detokenizer—executing the first few steps in a receding-horizon fashion.
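The offset layout described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the released implementation: the number of scales, the per-scale token counts in `SCALE_LENGTHS`, and the choice to place the beginning-of-sequence token in the last vocabulary slot are all hypothetical. Only the vocabulary-size formula and the scale-0-first ordering come from the text.

```python
from typing import List, Sequence

V_BINS = 1024                 # codes per scale (assumed equal for every scale)
SCALE_LENGTHS = (1, 2, 3, 6)  # hypothetical tokens per scale, coarse to fine
N_Q = len(SCALE_LENGTHS)
VOCAB_SIZE = N_Q * V_BINS + 1  # |V| = n_q * V_bins + 1
BOS = N_Q * V_BINS             # assumption: BOS occupies the last slot


def to_policy_tokens(codes_per_scale: Sequence[Sequence[int]]) -> List[int]:
    """Flatten per-scale code indices into one offset sequence, scale 0 first."""
    tokens = [BOS]
    for scale, codes in enumerate(codes_per_scale):
        assert len(codes) == SCALE_LENGTHS[scale]
        for code in codes:
            assert 0 <= code < V_BINS
            tokens.append(scale * V_BINS + code)  # shift into this scale's range
    return tokens


def from_policy_tokens(tokens: Sequence[int]) -> List[List[int]]:
    """Partition generated tokens into per-scale segments and undo the offsets."""
    body = list(tokens[1:]) if tokens and tokens[0] == BOS else list(tokens)
    assert len(body) == sum(SCALE_LENGTHS)
    codes_per_scale, start = [], 0
    for scale, length in enumerate(SCALE_LENGTHS):
        segment = body[start:start + length]
        # Every token in this segment must carry this scale's offset.
        assert all(t // V_BINS == scale for t in segment)
        codes_per_scale.append([t % V_BINS for t in segment])
        start += length
    return codes_per_scale


codes = [[7], [0, 1023], [5, 5, 512], [1, 2, 3, 4, 5, 6]]
tokens = to_policy_tokens(codes)
recovered = from_policy_tokens(tokens)
```

Because each scale owns a disjoint range of token ids, the integer quotient `t // V_BINS` identifies the scale and `t % V_BINS` recovers the code index, which also gives a cheap validity check on generated sequences.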

Experiments

We compare NAC against continuous-control and discrete-token baselines—Bin, Diffusion Policy, FAST, VQ-VLA, and OAT—spanning naive binning, diffusion-based control, hand-designed compression, and learned tokenization. All policy comparisons share observation inputs, action horizons, and training protocols, differing only in action parameterization. We evaluate on LIBERO-10 and RoboMimic in simulation and a suite of real-world manipulation tasks.

Figure 3: Simulation environments. We benchmark tokenizers across a LIBERO-10 subset and RoboMimic to assess downstream control performance.

Overall Manipulation Performance

NAC achieves the highest success rate on every benchmark. On LIBERO-10, it outperforms FAST by 11.71 points and OAT by 5.56 points; on RoboMimic the margins are 5.56 points over FAST and 2.00 points over OAT. In the real world, NAC reaches 50% average success versus 40% for both OAT and FAST, with the largest gains on tasks requiring precise, localized corrections such as grasping grapes and stacking two blocks.

Table 2: Overall manipulation performance (success rate %) across simulation and real-world environments. Simulation benchmarks were evaluated over 8 seeds with 50 trials per task. Real-world values are the average across 8 physical tasks (10 trials each).

| Environment | Bin | Diffusion Policy | FAST | VQ-VLA | OAT | NAC (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| LIBERO-10 | 3.95 ± 0.8 | 25.48 ± 1.3 | 38.02 ± 1.3 | 10.85 ± 1.85 | 44.17 ± 1.2 | 49.73 ± 1.0 |
| RoboMimic | 7.56 ± 1.05 | 27.25 ± 1.87 | 28.38 ± 2.37 | 21.44 ± 1.45 | 31.94 ± 2.15 | 33.94 ± 1.86 |
| Real World | 6.25 | 22.5 | 40.0 | 31.25 | 40.0 | 50.0 |

Can Audio Codecs Model Robot Actions?

Audio codecs are viable for action tokenization, provided audio-specific assumptions are removed. Mel-spectrogram training collapses downstream performance to nearly zero, while simple signal-domain or non-mel frequency objectives produce strong policies (MSE yields the strongest control; spectrogram loss the best reconstruction MSE). Removing the discriminator causes complete downstream failure, and replacing the ISTFT head with a linear decoder worsens both reconstruction and policy success.

Table 1a: Reconstruction loss.

| Recon. Loss | Perf. (%) | MSE |
| --- | --- | --- |
| L1 | 44.78 ± 2.48 | 0.002 ± 0.005 |
| MSE | 49.2 ± 1.54 | 0.0008 ± 0.0007 |
| DCT | 47.85 ± 1.18 | 0.0007 ± 0.0008 |
| Mel Spec. | 0 ± 0.11 | 0.038 ± 0.026 |
| Spectrogram | 48.3 ± 2.92 | 0.0002 ± 0.001 |

Table 1b: Discriminator.

| Discriminator | Perf. (%) | MSE |
| --- | --- | --- |
| DAC | 49.45 ± 2.02 | 0.0005 ± 0.0018 |
| MPD | 46.28 ± 1.48 | 0.0005 ± 0.0007 |
| MRD | 45.68 ± 1.72 | 0.0004 ± 0.0006 |
| None | 0 | 0.35 ± 0.12 |

Table 1c: Decoder head.

| Tokenizer Head | Perf. (%) | MSE |
| --- | --- | --- |
| ISTFT | 48.3 ± 2.92 | 0.0002 ± 0.001 |
| Linear | 42.1 ± 1.54 | 0.0006 ± 0.001 |

Table 1: Action tokenization ablations on LIBERO-10. Performance (%) is downstream policy task success rate; MSE is reconstruction error on 14,000 validation action chunks.
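To make the distinction between time-domain and non-mel frequency-domain objectives concrete, here is a small pure-Python sketch of an MSE loss and a DCT-coefficient loss on a multi-channel action chunk. The exact loss formulations used in the ablation are not given in this section, so the unnormalized DCT-II, the L1 distance on coefficients, and the toy `pred` and `target` chunks are assumptions made for illustration only.

```python
import math
from typing import List, Sequence

Chunk = Sequence[Sequence[float]]  # indexed as [channel][timestep]


def mse_loss(pred: Chunk, target: Chunk) -> float:
    """Mean squared error over every channel and timestep."""
    diffs = [(p - t) ** 2 for pc, tc in zip(pred, target) for p, t in zip(pc, tc)]
    return sum(diffs) / len(diffs)


def dct2(signal: Sequence[float]) -> List[float]:
    """Unnormalized DCT-II of a 1D signal (O(N^2), fine for short chunks)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_l1_loss(pred: Chunk, target: Chunk) -> float:
    """Mean absolute error between per-channel DCT coefficients."""
    diffs = [
        abs(p - t)
        for pc, tc in zip(pred, target)
        for p, t in zip(dct2(pc), dct2(tc))
    ]
    return sum(diffs) / len(diffs)


# Toy chunk: 2 action channels, 4 timesteps (values are made up).
target = [[0.0, 0.1, 0.2, 0.3], [1.0, 1.0, 1.0, 1.0]]
pred = [[0.0, 0.1, 0.2, 0.4], [1.0, 1.0, 1.0, 1.0]]

time_loss = mse_loss(pred, target)
freq_loss = dct_l1_loss(pred, target)
```

Neither objective assumes a perceptual frequency scale: both treat every timestep or frequency bin uniformly, unlike a mel filterbank, which is designed around human hearing.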

Compression & Latency

NAC compresses each action chunk to 12 tokens, matching the best learned tokenizers while reducing token count by nearly 19× relative to Bin (224 tokens) and 3× relative to FAST (36 tokens). Though slower than hand-designed tokenizers, NAC's 3.5 ms reconstruction time is roughly 3× faster than VQ-VLA's 11.0 ms and stays within a practical range for real-time control.

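Table 3: Compression and latency statistics for different tokenizers. Recon denotes total reconstruction time on a single NVIDIA RTX 4090 GPU.

| Method | Params (M) | Tokens | Codebook size \|K\| | # of Bits | Enc (ms) | Dec (ms) | Recon (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bin | 0.002 | 224 | 1024 | 2240 | 0.045 | 0.039 | 0.079 |
| OAT | 65.207 | 12 | 1024 | 120 | 0.931 | 2.392 | 3.347 |
| VQ-VLA | 65.556 | 12 | 1024 | 120 | 7.086 | 4.045 | 11.049 |
| FAST | 0.000 | 36 | 1024 | 360 | 0.170 | 0.110 | 0.290 |
| NAC (Ours) | 63.006 | 12 | 1024 | 120 | 1.270 | 2.183 | 3.536 |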
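
The bit counts and compression ratios above follow from simple arithmetic. The sketch below assumes that the "# of Bits" column is the token count times log2 of the codebook size, which is consistent with every row of the table. The dictionary names are illustrative only.

```python
import math

CODEBOOK_SIZE = 1024  # |K|, shared by every tokenizer in the table
TOKENS = {"Bin": 224, "FAST": 36, "OAT": 12, "VQ-VLA": 12, "NAC": 12}


def bits_per_chunk(num_tokens: int, codebook_size: int = CODEBOOK_SIZE) -> int:
    """Bits needed to store one action chunk as fixed-width token indices."""
    return num_tokens * int(math.log2(codebook_size))


bits = {name: bits_per_chunk(n) for name, n in TOKENS.items()}

# Token-count reduction of NAC relative to each tokenizer.
reduction_vs_nac = {name: n / TOKENS["NAC"] for name, n in TOKENS.items()}

for name in TOKENS:
    print(f"{name:7s} tokens={TOKENS[name]:4d} bits={bits[name]:5d} "
          f"ratio_vs_NAC={reduction_vs_nac[name]:.2f}x")
```

With 10 bits per token, 224 tokens for Bin gives 2240 bits and 12 tokens for NAC gives 120 bits, so the ratio 224 / 12 ≈ 18.67 is the "nearly 19×" figure and 36 / 12 = 3 is the reduction relative to FAST.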

Reconstruction Quality

The video below compares the reconstruction fidelity of each tokenizer—how faithfully decoded trajectories track the ground-truth action signal across methods.

Tokenizer reconstruction quality compared across Bin, FAST, VQ-VLA, OAT, and NAC.

Real-World Manipulation

We evaluated all methods on 8 physical manipulation tasks spanning fine grasping, object placement, and deformable control, with 10 trials per task. NAC's compressed token space transfers effectively to physical control, outperforming alternatives on average.

Figure 4: Real-world evaluation tasks. Outlines indicate initial start states (red) and successfully completed states (green).
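Table 6: Real-world manipulation performance (success rate %) across 8 tasks with 10 trials per task.

| Task | Bin | Diffusion | FAST | VQ-VLA | OAT | NAC (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| Weighing | 50 | 40 | 80 | 90 | 40 | 90 |
| Grapes | 0 | 30 | 80 | 30 | 80 | 100 |
| Marker | 0 | 30 | 60 | 30 | 50 | 50 |
| Two Blocks | 0 | 0 | 0 | 0 | 0 | 30 |
| Three Blocks | 0 | 0 | 0 | 0 | 10 | 0 |
| Chess | 0 | 0 | 10 | 0 | 40 | 10 |
| Place Stone | 0 | 0 | 10 | 50 | 30 | 40 |
| Fold Towel | 0 | 80 | 80 | 50 | 70 | 80 |
| Total | 6.25 | 22.5 | 40 | 31.25 | 40 | 50 |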
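
The Total row is the unweighted mean of the eight per-task success rates, which can be checked with a few lines of Python. The per-task values below are read from the table above. The variable names are illustrative, and the equal weighting reflects the fact that every task has the same number of trials.

```python
from statistics import mean

METHODS = ["Bin", "Diffusion", "FAST", "VQ-VLA", "OAT", "NAC"]

# Success rate (%) per task, in METHODS order, as reported in the table.
SUCCESS = {
    "Weighing":     [50, 40, 80, 90, 40, 90],
    "Grapes":       [0, 30, 80, 30, 80, 100],
    "Marker":       [0, 30, 60, 30, 50, 50],
    "Two Blocks":   [0, 0, 0, 0, 0, 30],
    "Three Blocks": [0, 0, 0, 0, 10, 0],
    "Chess":        [0, 0, 10, 0, 40, 10],
    "Place Stone":  [0, 0, 10, 50, 30, 40],
    "Fold Towel":   [0, 80, 80, 50, 70, 80],
}

# Average each method's column over the 8 tasks.
totals = {
    method: mean(row[i] for row in SUCCESS.values())
    for i, method in enumerate(METHODS)
}

for method, value in totals.items():
    print(f"{method:10s} {value:.2f}")
```

The per-task breakdown also shows where the averages come from: NAC is the only method with any success on Two Blocks, while OAT is ahead on Chess and Three Blocks.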

The video below shows downstream policy rollouts, comparing closed-loop task execution across all tokenizers.

Downstream policy performance compared across Bin, FAST, VQ-VLA, OAT, and NAC.

Citation

If you find this work useful, please consider citing our paper:

@misc{jawaid2026nac,
  title={NAC: Neural Action Codec for Vision-Language-Action Models},
  author={Ahad Jawaid and Yu Xiang},
  year={2026},
  institution={The University of Texas at Dallas}
}