MoRAG - Multi-Fusion Retrieval Augmented Generation for Human Motion

Sai Shashank Kalakonda¹, Shubh Maheshwari², Ravi Kiran Sarvadevabhatla¹

¹IIIT Hyderabad ²University of California San Diego

MoRAG improves the quality of generation and retrieval tasks across various text descriptions.

Abstract

We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation.

The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models.

MoRAG-Diffuse (Generation)

MoRAG (Retrieval)

Prompt Strategy

We prompt large language models (LLMs) to generate part-specific motion descriptions, which are then used to retrieve corresponding motion sequences for each part. These sequences are composed to create a full-body motion sequence that aligns with the input text.
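The prompt-then-retrieve flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the part names in `PARTS`, the prompt wording, the `part: description` response format, and the cosine-similarity ranking in `top_k` are all assumptions made for the example, and the embeddings are taken as given arrays.

```python
import numpy as np

# Hypothetical part names; the source does not list the exact parts used.
PARTS = ("torso", "hands", "legs")

def build_prompt(text: str) -> str:
    """Ask an LLM for one line of motion description per body part."""
    lines = [
        f'Describe the motion of each body part for: "{text}".',
        "Answer with one line per part in the form `part: description`.",
    ]
    lines += [f"- {p}" for p in PARTS]
    return "\n".join(lines)

def parse_response(response: str) -> dict:
    """Parse `part: description` lines, ignoring lines for unknown parts."""
    out = {}
    for line in response.splitlines():
        name, sep, desc = line.partition(":")
        name = name.strip().lower()
        if sep and name in PARTS:
            out[name] = desc.strip()
    return out

def top_k(query: np.ndarray, db: np.ndarray, k: int = 1) -> np.ndarray:
    """Indices of the k database rows most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]
```

In this sketch, each parsed part description would be embedded by a text encoder of your choice and passed to `top_k` against that part's own motion database, which yields one retrieved sequence per part for the composition step.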

Prompt Examples

We illustrate the LLM-generated part-specific outputs for text descriptions alongside their corresponding top-1 retrieval results to demonstrate the effectiveness of our prompt strategy. The HumanML3D ID of each retrieved motion is indicated with the # symbol.

Significance of Position

Specifically prompting for the position of body parts provides their global orientation, leading to improved motion retrieval.

Issue with left-right parts retrieval

We avoid retrieving separate motions for left and right body parts to prevent asynchronous movements in the composed full-body motion sequence.

Text Robustness

Using large language models (LLMs) in MoRAG's pipeline lets it handle text variations such as spelling errors, rephrasing, and substitution, so motion sequences are retrieved more accurately.

Spell Error

Rephrasing

Substitution

Spatial Composition

The composition workflow combines the part-specific motions, R_part, which are retrieved from their respective part-specific databases using the part-specific descriptions generated by the large language model (LLM).
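One way to picture this composition is as copying each part's joints from its retrieved motion into a single full-body sequence. The sketch below is illustrative only: it assumes motions are stored as arrays of shape (frames, joints, 3) on a 22-joint skeleton, that `PART_JOINTS` is a plausible but hypothetical joint split, and that sequences are simply cropped to the shortest length. The actual workflow may differ on all three points.

```python
import numpy as np

# Illustrative joint groups for a 22-joint skeleton; the exact split is an assumption.
# Left and right limbs share a group, so they always come from the same motion.
PART_JOINTS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "hands": [13, 14, 16, 17, 18, 19, 20, 21],
    "legs": [1, 2, 4, 5, 7, 8, 10, 11],
}

def compose(retrieved: dict) -> np.ndarray:
    """Copy each part's joints from its retrieved motion into one sequence."""
    # Crop to the shortest retrieved motion so every part covers all frames.
    n_frames = min(m.shape[0] for m in retrieved.values())
    n_joints = sum(len(j) for j in PART_JOINTS.values())
    full = np.zeros((n_frames, n_joints, 3))
    for part, joints in PART_JOINTS.items():
        full[:, joints] = retrieved[part][:n_frames, joints]
    return full
```

Calling `compose` with a dictionary that maps each part name to its retrieved motion array returns one array whose joints are drawn part by part from the different sources. Swapping in other top-ranked retrievals for any part gives a different composed sample.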

Quantitative Analysis

We compare the text-to-motion generation results of our method with those of state-of-the-art diffusion-based methods on the HumanML3D dataset. Our method achieves better semantic relevance, diversity, and multimodality. The best and second-best results are indicated separately.

References: MDM[5]; MotionDiffuse[6]; MLD[2]; ReMoDiffuse[7]; FineMoGen[8].

Qualitative Analysis

We present qualitative analysis across three key aspects: (1) Generalizability, (2) Zero-shot performance, and (3) Diversity, for both MoRAG (retrieval model) and MoRAG-Diffuse (generative model).

MoRAG

We provide baseline comparisons for generalizability and zero-shot capabilities against TMR++.

Generalizability

Zero-shot

Diversity

MoRAG-Diffuse

We provide baseline comparisons for generalizability and zero-shot capabilities against ReMoDiffuse.

Generalizability

Zero-shot

Diversity

BibTeX

@InProceedings{MoRAG,
  author    = {Kalakonda, Sai Shashank and Maheshwari, Shubh and Sarvadevabhatla, Ravi Kiran},
  title     = {MoRAG - Multi-Fusion Retrieval Augmented Generation for Human Motion},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2025},
}