JOURNAL ARTICLE

Controllable Motion Synthesis and Reconstruction with Autoregressive Diffusion Models

Abstract

Data-driven, controllable human motion synthesis and prediction are active research areas with applications in interactive media and social robotics. Key challenges include generating diverse motions consistent with past observations and coping with imperfect input poses. This paper introduces MoDiff, an autoregressive probabilistic diffusion model over motion sequences conditioned on control contexts from other modalities. Our model integrates a cross-modal Transformer encoder with a Transformer-based decoder, which we find effective at capturing temporal correlations in the motion and control modalities. We also introduce a new data dropout method based on the diffusion forward process, which provides richer data representations and more robust generation. We demonstrate the superior performance of MoDiff on controllable locomotion synthesis against two baselines, and show that diffusion data dropout enables robust synthesis and reconstruction of high-fidelity motion close to recorded data.
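The "data dropout based on the diffusion forward process" mentioned above can be sketched as follows: instead of zeroing features as in standard dropout, clean poses are corrupted by sampling from the closed-form forward distribution q(x_t | x_0). This is an illustrative reconstruction under common DDPM conventions, not the paper's implementation; the function name, schedule, and shapes are assumptions.

```python
import numpy as np

def diffusion_dropout(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I).

    Corrupting clean pose features this way plays the role of dropout:
    the model sees progressively noisier inputs as t grows. All names
    here are illustrative assumptions, not the paper's API.
    """
    rng = np.random.default_rng(rng)
    alpha_bar = np.cumprod(1.0 - betas)[t]          # cumulative product abar_t
    noise = rng.standard_normal(x0.shape)           # epsilon ~ N(0, I)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Usage: corrupt a (frames, features) pose sequence at a chosen step.
betas = np.linspace(1e-4, 0.02, 1000)  # common linear schedule (assumption)
poses = np.ones((60, 63))              # e.g. 60 frames, 21 joints x 3D
noisy = diffusion_dropout(poses, t=500, betas=betas, rng=0)
```

Larger t pushes x_t toward pure Gaussian noise, so sampling t per training example exposes the model to a spectrum of corruption levels rather than a single fixed dropout rate.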

Keywords:
Motion synthesis, autoregressive models, diffusion models, dropout (neural networks), motion capture, cross-modal Transformer, machine learning, computer vision

Metrics

Cited by: 3
FWCI (Field-Weighted Citation Impact): 0.75
References: 36
Citation Normalized Percentile: 0.68

Topics

Human Motion and Animation (Physical Sciences → Engineering → Control and Systems Engineering)
Human Pose and Action Recognition (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Vision and Imaging (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)