BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation
Seyed Rohollah Hosseyni°, Ali Ahmad Rahmani‡, Seyed Jamal Seyedmohammadi†, Sanaz Seyedin°, Arash Mohammadi†
°Amirkabir University of Technology (AUT), ‡Iran University of Science and Technology (IUST), †Concordia Institute of Information Systems Engineering (CIISE)
Performance Comparison with VQ-VAE-based Baseline Motion Models
BAD outperforms the baseline motion models MMM and T2M-GPT while using a similar model size and similar design choices. The low FID demonstrates BAD's ability to capture the sequential flow of information while simultaneously modeling rich bidirectional dependencies in complex motion sequences, indicating that the generated motions are natural and realistic. For text-motion consistency, BAD further improves the R-Precision and MM-Dist metrics.
Performance Comparison with RVQ-VAE-based Motion Models
MoMask and BAMM use an advanced VQ-VAE based on Residual Vector Quantization (RVQ) as their motion tokenizer. RVQ-VAE significantly improves the motion tokenizer and, consequently, the overall framework. For example, the reconstruction FID (rFID) of RVQ-VAE on the HumanML3D dataset is 0.019 (Table 2 of the MoMask paper), while the rFID of our simple VQ-VAE on the same dataset is 0.085. Despite using a simple VQ-VAE as its motion tokenizer, BAD outperforms BAMM and achieves an FID very close to MoMask's, while obtaining comparable text-motion consistency (R-Precision and MM-Dist).
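To make the gap concrete, below is a minimal sketch of the residual-quantization idea (not MoMask's actual implementation): each codebook layer quantizes the error left by the previous one, so the summed reconstruction is far more accurate than a single-codebook VQ. The shapes, codebook sizes, and plain nearest-neighbor lookup are illustrative assumptions.

```python
import torch

def residual_quantize(z, codebooks):
    """Quantize latents with a stack of codebooks: each layer encodes the
    residual error left by the previous one, so the summed reconstruction
    is far more accurate than a single-codebook VQ."""
    residual = z
    quantized = torch.zeros_like(z)
    codes = []
    for cb in codebooks:                     # cb: (codebook_size, dim)
        dists = torch.cdist(residual, cb)    # (num_latents, codebook_size)
        idx = dists.argmin(dim=-1)           # nearest entry per latent
        q = cb[idx]
        codes.append(idx)
        quantized = quantized + q
        residual = residual - q              # pass the error down the stack
    return quantized, codes

# illustrative shapes: 6 residual layers, 512 entries, 512-d latents
codebooks = [torch.randn(512, 512) for _ in range(6)]
z = torch.randn(49, 512)
z_q, codes = residual_quantize(z, codebooks)
```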
The following table compares BAD with MoMask and BAMM on four temporal editing tasks on the HumanML3D dataset. The results show that BAD outperforms MoMask and BAMM on these tasks in terms of FID.
Contribution: Addressing the Limitations of Autoregressive and Mask-Based Motion Models with a Novel Framework
We introduce the Bidirectional Autoregressive Diffusion (BAD) framework, a new pretraining strategy for sequence modeling that combines the strengths of autoregressive and mask-based models. BAD features a novel corruption (diffusion) technique for discrete data based on permutation operations, and employs a hybrid attention mask that combines permuted causal attention and bidirectional attention to balance causal and bidirectional dependencies. By using a simple VQ-VAE for motion tokenization, our approach reduces complexity while delivering results comparable to advanced RVQ-VAE-based models.
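As an illustration of the permutation-based corruption, the sketch below samples a random generation order and replaces the tail of that order with a mask token; `MASK_ID`, the sequence length, and the amount of masking are illustrative assumptions, and the actual corruption schedule may differ.

```python
import torch

MASK_ID = 512  # hypothetical id of a [MASK] token appended to the codebook

def permute_and_corrupt(tokens, num_mask):
    """Sample a random generation order, then corrupt the sequence by
    replacing the last `num_mask` tokens in that order with [MASK]."""
    T = tokens.size(0)
    order = torch.randperm(T)                  # random autoregressive order
    corrupted = tokens.clone()
    corrupted[order[T - num_mask:]] = MASK_ID  # corrupt the tail of the order
    return corrupted, order

tokens = torch.randint(0, 512, (49,))          # a discrete motion-token sequence
corrupted, order = permute_and_corrupt(tokens, num_mask=20)
```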
Overall Architecture of Our Motion Model
The overall framework for our text-to-motion generation model consists of two components: (a) Motion Tokenizer: We use a simple VQ-VAE-based motion tokenizer, similar to T2M-GPT and MMM. The motion tokenizer transforms a continuous raw 3D motion sequence into a sequence of discrete motion tokens. (b) Conditional Mask-Based Transformer: We adopt an architecture and design choices inspired by MMM, but our pretraining and corruption strategies differ significantly from MMM's. We corrupt the input in a random order and use a hybrid attention mask, consisting of a permuted causal attention mask and a bidirectional attention mask, allowing the model to capture both causal and bidirectional dependencies effectively. Two examples of different attention masks are displayed in the following image.
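To make the hybrid mask concrete, here is a minimal sketch that builds a boolean attention mask from a masking pattern and a sampled order: unmasked tokens are visible to every query (the bidirectional part), while mask tokens are visible to one another along the permutation (the permuted-causal part). The exact visibility rule used here is a simplified illustration, not the exact mask used in training.

```python
import torch

def hybrid_attention_mask(is_masked, order):
    """Build a (T, T) boolean attention mask (True = may attend).
    Unmasked keys are visible to all queries (bidirectional part);
    a masked query sees the mask tokens not yet decoded in the
    sampled order (permuted-causal part)."""
    T = is_masked.size(0)
    rank = torch.empty(T, dtype=torch.long)
    rank[order] = torch.arange(T)                      # decoding rank of each position
    # bidirectional part: unmasked keys visible to every query
    allow = (~is_masked).unsqueeze(0).expand(T, T).clone()
    # permuted-causal part: masked query at rank p sees masked keys
    # with rank >= p, i.e. the still-masked tokens (itself included)
    remaining = rank.unsqueeze(1) <= rank.unsqueeze(0)
    allow |= remaining & is_masked.unsqueeze(1) & is_masked.unsqueeze(0)
    return allow

is_masked = torch.rand(8) < 0.5
mask = hybrid_attention_mask(is_masked, torch.randperm(8))
```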
Sampling: Due to the permutation-based nature of our procedure, different types of sampling can be used; we demonstrate two. (1) Order-Agnostic Autoregressive Sampling (OAAS): OAAS starts by creating mask tokens in a random order. Decoding begins from the first mask token, which can attend to all the others. As the process continues, the attention mask is updated, allowing the token at step p to attend to the remaining T − p + 1 mask tokens and the already-decoded tokens, until all tokens are decoded. (2) Confidence-Based Sampling (CBS): CBS also starts with randomly ordered mask tokens. During decoding, high-confidence predictions are kept, while low-confidence ones are re-masked and reprocessed, ensuring the sequence benefits from the most reliable predictions.
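The following is a minimal sketch of both samplers under an assumed interface in which `model(tokens)` returns per-position logits of shape (T, vocab); `MASK_ID` and the linear re-masking schedule in CBS are illustrative assumptions.

```python
import torch

MASK_ID = 512  # hypothetical [MASK] id, matching the corruption sketch above

def oaas(model, T):
    """OAAS: decode one mask token per step, following a random order."""
    tokens = torch.full((T,), MASK_ID)
    order = torch.randperm(T)                          # random decoding order
    for pos in order:
        probs = model(tokens)[pos].softmax(-1)         # (vocab,) for this slot
        tokens[pos] = torch.multinomial(probs, 1).item()
    return tokens

def cbs(model, T, steps=10):
    """CBS: predict all masked slots, keep the most confident predictions,
    and re-mask the rest (a linear schedule is assumed here)."""
    tokens = torch.full((T,), MASK_ID)
    for step in range(steps):
        masked = tokens == MASK_ID
        conf, pred = model(tokens).softmax(-1).max(-1)
        tokens = torch.where(masked, pred, tokens)     # fill every masked slot
        conf[~masked] = float("inf")                   # decoded tokens stay fixed
        n_remask = int(masked.sum() * (1 - (step + 1) / steps))
        if n_remask > 0:
            low = conf.topk(n_remask, largest=False).indices
            tokens[low] = MASK_ID                      # least confident re-masked
    return tokens
```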
Comparison with SOTA
Text to Motion 1:
"a person jauntily skips forward"
BAD (ours)
Ground Truth
MMM
MoMask
T2M-GPT
MDM
Text to Motion 2:
"A person is performing lunges"
BAD (ours)
Ground Truth
BAMM
MoMask
T2M-GPT
MDM
Text to Motion 3:
"the person was pushed but did not fall"
BAD (ours)
Ground Truth
BAMM
MoMask
T2M-GPT
MDM
Motion Editing
Motion Temporal Inpainting (Motion In-betweening):
Generating the middle 50% of the motion based on the text "a person walks forward then turns around and takes long jumps.", conditioned on the first 25% and last 25% of the motion of "a man walks in a clockwise circle an then sits down." (a token-level sketch of this setup follows below)
BAD
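For reference, a minimal token-level sketch of the in-betweening setup, assuming a `MASK_ID` token: the first and last quarters of the source token sequence are kept, and the middle half is masked for the transformer to fill in under the new prompt. Boundary handling is an assumption.

```python
import torch

MASK_ID = 512  # hypothetical [MASK] id

def inbetween_tokens(src_tokens, keep_prefix=0.25, keep_suffix=0.25):
    """Keep the first and last quarters of the source token sequence and
    mask the middle 50%, which the transformer then fills in under the
    new text prompt."""
    T = src_tokens.size(0)
    out = src_tokens.clone()
    lo, hi = int(T * keep_prefix), int(T * (1 - keep_suffix))
    out[lo:hi] = MASK_ID       # middle half becomes [MASK]
    return out
```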
Motion Temporal Outpainting:
Generating the first 25% and last 25% of the motion based on the text "a person walks forward then turns around and takes long jumps.", conditioned on the middle 50% of the motion of "a man walks in a clockwise circle an then sits down."
BAD
Motion Temporal Prefix Editing:
Generating the last 50% of the motion based on the text "a person walks around in an "s" shape.", conditioned on the first 50% of the motion of "a man is walking forward and jumps after."
BAD
Motion Temporal Suffix Editing:
Generating the first 50% of the motion based on the text "a person walks forward and jumps over an object, then turns around to jump over it again and walk back.", conditioned on the last 50% of the motion of "a person slowly walked forward and made a circle."
BAD
Long Sequence Generation:
Generating a long motion sequence by combining multiple motions as follows: 'a person runs forward and jumps.', 'a person crawls.', 'a person does a cart wheel.', 'a person walks forward up stairs and then climbs down.', 'a person sits on the chair and then steps up.' (a sketch of one way to chain prompts follows below)
BAD
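One plausible way to chain prompts (an assumption for illustration, not necessarily the exact procedure used here) is to condition each new segment on the tail of the previous one and concatenate the newly generated parts; `generate` below is an assumed interface that fills [MASK] positions given a text prompt.

```python
import torch

MASK_ID = 512  # hypothetical [MASK] id

def long_sequence(generate, prompts, seg_len=49, overlap=10):
    """Chain prompts by conditioning each new segment on the tail of the
    previous one, then concatenating the newly generated parts."""
    full, prev_tail = [], None
    for text in prompts:
        seg = torch.full((seg_len,), MASK_ID)
        if prev_tail is not None:
            seg[:overlap] = prev_tail          # condition on preceding motion
        seg = generate(seg, text)              # assumed: fills [MASK] positions
        full.append(seg if prev_tail is None else seg[overlap:])
        prev_tail = seg[-overlap:]
    return torch.cat(full)
```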
More Results
Text to Motion:
"the man is throwing his right hand."
BAD
"a person walks forwards, sits."
BAD