BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

Seyed Rohollah Hosseyni°, Ali Ahmad Rahmani‡, Seyed Jamal Seyedmohammadi†, Sanaz Seyedin°, Arash Mohammadi†

°Amirkabir University of Technology (AUT), ‡Iran University of Science and Technology (IUST), †Concordia Institute of Information Systems Engineering (CIISE)

arXiv | Code

Performance Comparison with VQ-VAE-based Baseline Motion Models

BAD outperforms the baseline motion models MMM and T2M-GPT at a similar model size and with similar design choices. The low FID score shows that BAD captures the sequential flow of information while simultaneously modeling rich bidirectional dependencies in complex motion sequences, indicating that the generated motions are natural and realistic. For text-motion consistency, BAD also improves the R-Precision and MM-Dist metrics.
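For reference, FID compares the Gaussian statistics of real and generated motion features. Below is a minimal sketch of the standard formula; the pretrained motion feature extractor that produces `real_feats` and `gen_feats` is assumed and not shown.

```python
# Minimal FID sketch (NumPy/SciPy). `real_feats` and `gen_feats` are
# hypothetical (N, D) arrays of motion features from a pretrained
# motion encoder; the encoder itself is not part of this sketch.
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```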


Performance Comparison with RVQ-VAE-based Motion Models

MoMask and BAMM use an advanced VQ-VAE based on Residual Vector Quantization (RVQ) as their motion tokenizer. RVQ significantly improves the motion tokenizer and, consequently, the overall framework: the reconstruction FID (rFID) of RVQ-VAE on the HumanML3D dataset is 0.019 (Table 2 of the MoMask paper), while the rFID of our simple VQ-VAE is 0.085. Despite using this simpler tokenizer, BAD outperforms BAMM and achieves an FID very close to MoMask's, with comparable text-motion consistency (R-Precision and MM-Dist).
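The difference between the two tokenizers is easy to see in code. Below is an illustrative sketch (not our implementation; shapes and names are placeholders) contrasting plain VQ with residual VQ, where each additional codebook quantizes the error left by the previous ones.

```python
# Sketch of plain VQ vs. residual VQ (RVQ) quantization.
import torch

def vq_quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Plain VQ: map each latent to its nearest codebook entry."""
    # z: (T, D) latents, codebook: (K, D)
    dists = torch.cdist(z, codebook)   # (T, K) pairwise distances
    idx = dists.argmin(dim=-1)         # nearest code per frame
    return codebook[idx]

def rvq_quantize(z: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """RVQ: successive codebooks quantize the residual error of the previous ones."""
    residual, quantized = z, torch.zeros_like(z)
    for cb in codebooks:
        q = vq_quantize(residual, cb)
        quantized = quantized + q
        residual = residual - q        # the next layer sees what is still missing
    return quantized
```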

RVQ-VAE Comparison

The following table compares BAD with MoMask and BAMM on four temporal editing tasks on the HumanML3D dataset. The results show that BAD outperforms both models on these tasks in terms of FID.

Temporal Editing

Contribution: Addressing the Limitations of Autoregressive and Mask-Based Motion Models with a Novel Framework

We introduce the Bidirectional Autoregressive Diffusion (BAD) framework, a new pretraining strategy for sequence modeling that combines the strengths of autoregressive and mask-based models. BAD features a novel corruption (diffusion) technique for discrete data based on permutation operations, together with a hybrid attention mask that mixes permuted causal attention and bidirectional attention to balance causal and bidirectional dependencies. Because a simple VQ-VAE suffices for motion tokenization, our approach reduces complexity while delivering results comparable to advanced RVQ-VAE-based models.
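As a minimal illustration of the permutation-based corruption, the sketch below masks the first k tokens of a randomly sampled order; the mask id and ratio are placeholders rather than values from our implementation.

```python
# Hedged sketch of permutation-based corruption for discrete tokens:
# sample a random order, then replace the first k tokens *in that order*
# with a [MASK] id. MASK_ID and the ratio are illustrative placeholders.
import torch

MASK_ID = 512  # hypothetical mask-token id outside the codebook range

def corrupt(tokens: torch.Tensor, mask_ratio: float) -> tuple[torch.Tensor, torch.Tensor]:
    T = tokens.numel()
    order = torch.randperm(T)           # random corruption/decoding order
    k = max(1, int(mask_ratio * T))     # how many tokens to absorb
    corrupted = tokens.clone()
    corrupted[order[:k]] = MASK_ID      # mask the first k positions in the order
    return corrupted, order             # the order also drives the attention mask
```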

Overall Architecture of Our Motion Model

The overall framework for our text-to-motion generation model consists of two components: (a) Motion Tokenizer: We use a simple VQ-VAE-based motion tokenizer, similar to T2M-GPT and MMM. The motion tokenizer transforms a continuous raw 3D motion sequence into a sequence of discrete motion tokens. (b) Conditional Mask-Based Transformer: We use an architecture and design choices inspired by MMM, but our pretraining and corruption strategies differ significantly from MMM's. We corrupt the input in a random order and use a hybrid attention mask, consisting of a permuted causal attention mask and a bidirectional attention mask, allowing the model to capture both causal and bidirectional dependencies effectively. Two examples of different attention masks are shown in the image below.
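The hybrid mask can be sketched as follows. This is an illustrative reading of the description above (True = query may attend to key), not our exact implementation: every token may attend to all unmasked tokens, while attention among the remaining tokens follows the sampled permutation order.

```python
# Hedged sketch of a hybrid attention mask combining permuted causal
# attention with bidirectional attention to unmasked tokens.
import torch

def hybrid_attention_mask(order: torch.Tensor, is_masked: torch.Tensor) -> torch.Tensor:
    T = order.numel()
    rank = torch.empty_like(order)
    rank[order] = torch.arange(T)      # rank[i] = position of token i in the order
    # Permuted-causal part: i may see j if j comes no later than i in the order.
    causal = rank.unsqueeze(1) >= rank.unsqueeze(0)        # (T, T) boolean
    # Bidirectional part: every token may see the unmasked tokens.
    bidir = (~is_masked).unsqueeze(0).expand(T, T)
    return causal | bidir
```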

Sampling: Due to the permutational nature of our procedure, different types of sampling can be used; we demonstrate two. (1) Order-Agnostic Autoregressive Sampling (OAAS): we start by creating mask tokens in a random order. Decoding begins from the first mask token, which can attend to all others. As decoding continues, the attention mask is updated so that at step p each token attends to T-p+1 mask tokens (where T is the sequence length) plus the already-unmasked tokens, until all tokens are decoded. (2) Confidence-Based Sampling (CBS): CBS also starts with randomly ordered mask tokens. During decoding, high-confidence predictions are kept, while low-confidence ones are re-masked and reprocessed, so the sequence is built from the most reliable predictions.
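A minimal sketch of the CBS loop is given below. `model` is a stand-in for the conditional transformer (taking token ids and returning per-position logits), and the re-masking schedule is illustrative; the permutation bookkeeping for the attention mask is omitted.

```python
# Hedged sketch of Confidence-Based Sampling (CBS): fill every masked slot,
# keep the most confident predictions, re-mask the rest, and repeat.
import torch

@torch.no_grad()
def cbs_sample(model, T: int, steps: int = 10, mask_id: int = 512) -> torch.Tensor:
    tokens = torch.full((T,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)                     # (T, vocab), stand-in call
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # confidence and argmax per slot
        masked = tokens == mask_id
        tokens[masked] = pred[masked]              # fill every masked slot
        if step < steps - 1:
            # Re-mask the least confident of the freshly filled slots.
            n_remask = int(T * (1 - (step + 1) / steps))
            conf = conf.masked_fill(~masked, float("inf"))  # never re-mask kept tokens
            if n_remask > 0:
                idx = conf.topk(n_remask, largest=False).indices
                tokens[idx] = mask_id
    return tokens
```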

Framework and Attention Masks

Comparison with SOTA

Text to Motion 1:

"a person jauntily skips forward"

BAD (ours)

Ground Truth

MMM

MoMask

T2M-GPT

MDM

Text to Motion 2:

"A person is performing lunges"

BAD (ours)

Ground Truth

BAMM

MoMask

T2M-GPT

MDM

Text to Motion 3:

"the person was pushed but did not fall"

BAD (ours)

Ground Truth

BAMM

MoMask

T2M-GPT

MDM

Motion Editing

Motion Temporal Inpainting (Motion In-betweening):

Generating the middle 50% of the motion from the text “a person walks forward then turns around and takes long jumps.”, conditioned on the first 25% and last 25% of the motion of “a man walks in a clockwise circle an then sits down.”

BAD

Motion Temporal Outpainting:

Generating the first 25% and last 25% of the motion from the text “a person walks forward then turns around and takes long jumps.”, conditioned on the middle 50% of the motion of “a man walks in a clockwise circle an then sits down.”

BAD

Motion Temporal Prefix Editing:

Generating the last 50% of the motion from the text "a person walks around in an "s" shape.", conditioned on the first 50% of the motion of "a man is walking forward and jumps after."

BAD

Motion Temporal Suffix Editing:

Generating the first 50% of the motion from the text “a person walks forward and jumps over an object, then turns around to jump over it again and walk back.”, conditioned on the last 50% of the motion of “a person slowly walked forward and made a circle.”

BAD
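All four editing tasks above reduce to the same mechanism: keep the conditioning tokens fixed and mask the region to be generated, then run the usual sampler. The sketch below illustrates the masking patterns; the helper name and layout are illustrative, not our actual code.

```python
# Hedged sketch: the four temporal-editing tasks as masking patterns.
import torch

def edit_mask(T: int, task: str) -> torch.Tensor:
    """True = position to generate, False = position kept from the source motion."""
    m = torch.zeros(T, dtype=torch.bool)
    q1, q3 = T // 4, 3 * T // 4
    if task == "inpainting":      # generate the middle 50%
        m[q1:q3] = True
    elif task == "outpainting":   # generate the first and last 25%
        m[:q1] = True
        m[q3:] = True
    elif task == "prefix":        # keep the first half, generate the rest
        m[T // 2:] = True
    elif task == "suffix":        # keep the last half, generate the start
        m[:T // 2] = True
    return m
```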

Long Sequence Generation:

Generating a long motion sequence by combining multiple motions, as follows: 'a person runs forward and jumps.', 'a person crawls.', 'a person does a cart wheel.', 'a person walks forward up stairs and then climbs down.', 'a person sits on the chair and then steps up.' (One possible chaining scheme is sketched after the clip below.)

BAD
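One way to realize such chaining (not necessarily our exact procedure) is to condition each new clip on the tail of the previous one, i.e. to apply the prefix-editing pattern repeatedly. In the sketch below, `generate` is a hypothetical stand-in for the text-conditioned sampler, and the clip length and overlap are illustrative.

```python
# Hedged sketch of chaining prompts into one long motion by conditioning
# each clip on the tail of the previous clip.
import torch

def long_sequence(generate, prompts: list[str], clip_len: int = 196, overlap: int = 32):
    motion_tokens = []
    prefix = None
    for text in prompts:
        clip = generate(text, length=clip_len, prefix=prefix)  # (clip_len,) token ids
        # The new clip regenerates the overlap in context, so keep it only once.
        motion_tokens.append(clip if prefix is None else clip[overlap:])
        prefix = clip[-overlap:]                               # condition the next clip
    return torch.cat(motion_tokens)
```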

More Results

Text to Motion:

BAD

the man is throwing his right hand.

BAD

a person walks forwards, sits.
