BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

Seyed Rohollah Hosseyni°, Ali Ahmad Rahmani‡, Seyed Jamal Seyedmohammadi†, Sanaz Seyedin°, Arash Mohammadi†

°Amirkabir University of Technology (AUT), ‡Iran University of Science and Technology (IUST), †Concordia Institute of Information Systems Engineering (CIISE)

arXiv | Code

Performance Comparison with VQ-VAE-based Baseline Motion Models

BAD outperforms the baseline motion models MMM and T2M-GPT at a similar model size and with similar design choices. The low FID score shows that BAD captures the sequential flow of information while simultaneously modeling rich bidirectional dependencies in complex motion sequences, indicating that the generated motions are natural and realistic. For text-motion consistency, BAD also improves the R-Precision and MM-Dist metrics.
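For reference, FID compares the Gaussian statistics of real and generated motion features. Below is a minimal sketch of the standard formula; the pretrained motion feature extractor that produces `real_feats` and `gen_feats` is assumed and not shown.

```python
# Minimal FID sketch (NumPy/SciPy). `real_feats` and `gen_feats` are
# hypothetical (N, D) arrays of motion features from a pretrained
# motion encoder; the encoder itself is not part of this sketch.
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)   # matrix square root of the covariance product
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```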


Performance Comparison with RVQ-VAE-based Motion Models

MoMask and BAMM use an advanced VQ-VAE based on Residual Vector Quantization (RVQ) as their motion tokenizer. RVQ significantly improves the motion tokenizer and, consequently, the overall framework: the reconstruction FID (rFID) of RVQ-VAE on the HumanML3D dataset is 0.019 (Table 2 of the MoMask paper), while the rFID of our simple VQ-VAE is 0.085. Despite using this simpler tokenizer, BAD outperforms BAMM and achieves an FID very close to MoMask's, with comparable text-motion consistency (R-Precision and MM-Dist).
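The difference between the two tokenizers is easy to see in code. Below is an illustrative sketch (not our implementation; shapes and names are placeholders) contrasting plain VQ with residual VQ, where each additional codebook quantizes the error left by the previous ones.

```python
# Sketch of plain VQ vs. residual VQ (RVQ) quantization.
import torch

def vq_quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Plain VQ: map each latent to its nearest codebook entry."""
    # z: (T, D) latents, codebook: (K, D)
    dists = torch.cdist(z, codebook)   # (T, K) pairwise distances
    idx = dists.argmin(dim=-1)         # nearest code per frame
    return codebook[idx]

def rvq_quantize(z: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """RVQ: successive codebooks quantize the residual error of the previous ones."""
    residual, quantized = z, torch.zeros_like(z)
    for cb in codebooks:
        q = vq_quantize(residual, cb)
        quantized = quantized + q
        residual = residual - q        # the next layer sees what is still missing
    return quantized
```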

RVQ-VAE Comparison

The following table compares BAD with MoMask and BAMM on four temporal editing tasks on the HumanML3D dataset. The results show that BAD outperforms both models on these tasks in terms of FID.

Temporal Editing

Contribution: Addressing the Limitations of Autoregressive and Mask-Based Motion Models with a Novel Framework

We introduce the Bidirectional Autoregressive Diffusion (BAD) framework, a new pretraining strategy for sequence modeling that combines the strengths of autoregressive and mask-based models. BAD features a novel corruption (diffusion) technique for discrete data based on permutation operations, together with a hybrid attention mask that mixes permuted causal attention and bidirectional attention to balance causal and bidirectional dependencies. Because a simple VQ-VAE suffices for motion tokenization, our approach reduces complexity while delivering results comparable to advanced RVQ-VAE-based models.
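As a minimal illustration of the permutation-based corruption, the sketch below masks the first k tokens of a randomly sampled order; the mask id and ratio are placeholders rather than values from our implementation.

```python
# Hedged sketch of permutation-based corruption for discrete tokens:
# sample a random order, then replace the first k tokens *in that order*
# with a [MASK] id. MASK_ID and the ratio are illustrative placeholders.
import torch

MASK_ID = 512  # hypothetical mask-token id outside the codebook range

def corrupt(tokens: torch.Tensor, mask_ratio: float) -> tuple[torch.Tensor, torch.Tensor]:
    T = tokens.numel()
    order = torch.randperm(T)           # random corruption/decoding order
    k = max(1, int(mask_ratio * T))     # how many tokens to absorb
    corrupted = tokens.clone()
    corrupted[order[:k]] = MASK_ID      # mask the first k positions in the order
    return corrupted, order             # the order also drives the attention mask
```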

Overall Architecture of Our Motion Model

The overall framework for our text-to-motion generation model consists of two components: (a) Motion Tokenizer: We use a simple VQ-VAE-based motion tokenizer, similar to T2M-GPT and MMM. The motion tokenizer transforms a continuous raw 3D motion sequence into a sequence of discrete motion tokens. (b) Conditional Mask-Based Transformer: We use an architecture and design choices inspired by MMM, but our pretraining and corruption strategies differ significantly from MMM's. We corrupt the input in a random order and use a hybrid attention mask, consisting of a permuted causal attention mask and a bidirectional attention mask, allowing the model to capture both causal and bidirectional dependencies effectively. Two examples of different attention masks are shown in the image below.
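The hybrid mask can be sketched as follows. This is an illustrative reading of the description above (True = query may attend to key), not our exact implementation: every token may attend to all unmasked tokens, while attention among the remaining tokens follows the sampled permutation order.

```python
# Hedged sketch of a hybrid attention mask combining permuted causal
# attention with bidirectional attention to unmasked tokens.
import torch

def hybrid_attention_mask(order: torch.Tensor, is_masked: torch.Tensor) -> torch.Tensor:
    T = order.numel()
    rank = torch.empty_like(order)
    rank[order] = torch.arange(T)      # rank[i] = position of token i in the order
    # Permuted-causal part: i may see j if j comes no later than i in the order.
    causal = rank.unsqueeze(1) >= rank.unsqueeze(0)        # (T, T) boolean
    # Bidirectional part: every token may see the unmasked tokens.
    bidir = (~is_masked).unsqueeze(0).expand(T, T)
    return causal | bidir
```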

Sampling: Due to the permutational nature of our procedure, different types of sampling can be used; we demonstrate two. (1) Order-Agnostic Autoregressive Sampling (OAAS): we start by creating mask tokens in a random order. Decoding begins from the first mask token, which can attend to all others. As decoding continues, the attention mask is updated so that at step p each token attends to T-p+1 mask tokens (where T is the sequence length) plus the already-unmasked tokens, until all tokens are decoded. (2) Confidence-Based Sampling (CBS): CBS also starts with randomly ordered mask tokens. During decoding, high-confidence predictions are kept, while low-confidence ones are re-masked and reprocessed, so the sequence is built from the most reliable predictions.
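A minimal sketch of the CBS loop is given below. `model` is a stand-in for the conditional transformer (taking token ids and returning per-position logits), and the re-masking schedule is illustrative; the permutation bookkeeping for the attention mask is omitted.

```python
# Hedged sketch of Confidence-Based Sampling (CBS): fill every masked slot,
# keep the most confident predictions, re-mask the rest, and repeat.
import torch

@torch.no_grad()
def cbs_sample(model, T: int, steps: int = 10, mask_id: int = 512) -> torch.Tensor:
    tokens = torch.full((T,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)                     # (T, vocab), stand-in call
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # confidence and argmax per slot
        masked = tokens == mask_id
        tokens[masked] = pred[masked]              # fill every masked slot
        if step < steps - 1:
            # Re-mask the least confident of the freshly filled slots.
            n_remask = int(T * (1 - (step + 1) / steps))
            conf = conf.masked_fill(~masked, float("inf"))  # never re-mask kept tokens
            if n_remask > 0:
                idx = conf.topk(n_remask, largest=False).indices
                tokens[idx] = mask_id
    return tokens
```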

Framework and Attention Masks

Comparison with SOTA

Text to Motion 1:

"a person jauntily skips forward"

BAD (ours)

Ground Truth

MMM

MoMask

T2M-GPT

MDM

Text to Motion 2:

"A person is performing lunges"

BAD (ours)

Ground Truth

BAMM

MoMask

T2M-GPT

MDM

Text to Motion 3:

"the person was pushed but did not fall"

BAD (ours)

Ground Truth

BAMM

MoMask

T2M-GPT

MDM

Motion Editing

Motion Temporal Inpainting (Motion In-betweening):

Generating the middle 50% of the motion from the text “a person walks forward then turns around and takes long jumps.”, conditioned on the first 25% and last 25% of the motion of “a man walks in a clockwise circle an then sits down.”

BAD

Motion Temporal Outpainting:

Generating the first 25% and last 25% of the motion from the text “a person walks forward then turns around and takes long jumps.”, conditioned on the middle 50% of the motion of “a man walks in a clockwise circle an then sits down.”

BAD

Motion Temporal Prefix Editing:

Generating the last 50% of the motion from the text "a person walks around in an "s" shape.", conditioned on the first 50% of the motion of "a man is walking forward and jumps after."

BAD

Motion Temporal Suffix Editing:

Generating the first 50% of the motion from the text “a person walks forward and jumps over an object, then turns around to jump over it again and walk back.”, conditioned on the last 50% of the motion of “a person slowly walked forward and made a circle.”

BAD
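All four editing tasks above reduce to the same mechanism: keep the conditioning tokens fixed and mask the region to be generated, then run the usual sampler. The sketch below illustrates the masking patterns; the helper name and layout are illustrative, not our actual code.

```python
# Hedged sketch: the four temporal-editing tasks as masking patterns.
import torch

def edit_mask(T: int, task: str) -> torch.Tensor:
    """True = position to generate, False = position kept from the source motion."""
    m = torch.zeros(T, dtype=torch.bool)
    q1, q3 = T // 4, 3 * T // 4
    if task == "inpainting":      # generate the middle 50%
        m[q1:q3] = True
    elif task == "outpainting":   # generate the first and last 25%
        m[:q1] = True
        m[q3:] = True
    elif task == "prefix":        # keep the first half, generate the rest
        m[T // 2:] = True
    elif task == "suffix":        # keep the last half, generate the start
        m[:T // 2] = True
    return m
```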

Long Sequence Generation:

Generating a long motion sequence by combining multiple motions, as follows: 'a person runs forward and jumps.', 'a person crawls.', 'a person does a cart wheel.', 'a person walks forward up stairs and then climbs down.', 'a person sits on the chair and then steps up.' (One possible chaining scheme is sketched after the clip below.)

BAD
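One way to realize such chaining (not necessarily our exact procedure) is to condition each new clip on the tail of the previous one, i.e. to apply the prefix-editing pattern repeatedly. In the sketch below, `generate` is a hypothetical stand-in for the text-conditioned sampler, and the clip length and overlap are illustrative.

```python
# Hedged sketch of chaining prompts into one long motion by conditioning
# each clip on the tail of the previous clip.
import torch

def long_sequence(generate, prompts: list[str], clip_len: int = 196, overlap: int = 32):
    motion_tokens = []
    prefix = None
    for text in prompts:
        clip = generate(text, length=clip_len, prefix=prefix)  # (clip_len,) token ids
        # The new clip regenerates the overlap in context, so keep it only once.
        motion_tokens.append(clip if prefix is None else clip[overlap:])
        prefix = clip[-overlap:]                               # condition the next clip
    return torch.cat(motion_tokens)
```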

More Results

Text to Motion:

BAD

the man is throwing his right hand.

BAD

a person walks forwards, sits.
