Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align textual descriptions with human motion at the sequence level, neglecting the internal semantic structure of each modality. However, both motion descriptions and motion sequences can be naturally decomposed into smaller, semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework that achieves fine-grained text–motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text–Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves over a strong baseline on two widely used datasets, achieving a TOP1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
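To make the segment-level contrastive alignment concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired text-segment and motion-segment embeddings. It is an illustration under stated assumptions, not the loss used in SegMo: the function name `segment_contrastive_loss`, the temperature value, and the choice of a symmetric objective are all hypothetical, and the encoders that would produce the embeddings are omitted.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def segment_contrastive_loss(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE over N paired segment embeddings (hypothetical helper).

    text_emb, motion_emb: arrays of shape (N, D); row i of each is assumed
    to describe the same atomic action. The temperature is an assumed value.
    """
    # L2-normalise so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature            # (N, N) similarity matrix
    idx = np.arange(len(t))                   # matching pairs lie on the diagonal
    loss_t2m = -_log_softmax(logits, axis=1)[idx, idx].mean()  # text -> motion
    loss_m2t = -_log_softmax(logits, axis=0)[idx, idx].mean()  # motion -> text
    return 0.5 * (loss_t2m + loss_m2t)
```

In such a setup, the same similarity matrix could also be ranked row-wise or column-wise at inference time, which is one way a shared embedding space lends itself to retrieval-style tasks.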
Left: The Residual VQ-VAE encodes a continuous motion sequence into discrete motion tokens, comprising base tokens and residual tokens, which are generated by the Mask Transformer and the Residual Transformer, respectively. Right: The Mask Transformer predicts the masked base tokens conditioned on the textual description. To achieve segment-level fine-grained alignment, we introduce a Text Segment Extraction module and a Motion Segment Extraction module, which extract text and motion segments, respectively, and align them through the Fine-grained Text–Motion Alignment module.
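The split into base tokens and residual tokens can be illustrated with a generic residual quantization sketch. This is a textbook form of residual vector quantization, not the SegMo implementation: the function name `residual_quantize`, the codebook shapes, and the nearest-neighbour lookup are assumptions, and the VQ-VAE encoder and decoder are left out.

```python
import numpy as np

def residual_quantize(latents, codebooks):
    """Quantize latents layer by layer (hypothetical helper).

    latents:   (T, D) array of continuous per-frame motion latents.
    codebooks: list of (K, D) arrays, one per quantization layer.
    Returns tokens of shape (num_layers, T) and the (T, D) reconstruction.
    tokens[0] plays the role of base tokens, tokens[1:] of residual tokens.
    """
    residual = latents.astype(float).copy()
    recon = np.zeros_like(residual)
    tokens = []
    for codebook in codebooks:
        # Squared distance from every residual vector to every code: (T, K).
        dist = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)     # nearest code index per time step
        quantized = codebook[idx]
        tokens.append(idx)
        recon += quantized            # accumulate the approximation
        residual -= quantized         # pass what is left to the next layer
    return np.stack(tokens), recon
```

Under this view, a generator only needs to predict the first row of tokens to obtain a coarse motion, while later rows refine it.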
@article{bowen2026segmo,
author = {Bowen Dang and Lin Wu and Xiaohang Yang and Zheng Yuan and Zhixiang Chen},
title = {SegMo: Segment-aligned Text to 3D Human Motion Generation},
journal = {WACV},
year = {2026},
}