Open-Vocabulary 3D Affordance Understanding via Functional Text Enhancement and Multilevel Representation Alignment

University of Glasgow
ACMMM 2025

Aff3DFunc demo.

Abstract

Understanding 3D affordance is essential for agents to interact effectively with real-world environments, encompassing tasks such as manipulation and navigation. Existing methods typically support open-vocabulary queries through label-based language descriptions but often suffer from under-generalization and insufficient discrimination in their representations. Affordance understanding, however, requires constructing a coherent semantic landscape from fragmented linguistic expressions, one that maintains intra-class diversity while minimizing inter-class overlap. To overcome these challenges, we introduce Aff3DFunc, a framework designed to strengthen the alignment between affordance and 3D geometry. It begins with a Functional Text Enhancement (FTE) module grounded in the Information Bottleneck (IB) principle, which enriches affordance semantics by maximizing both relevance and diversity. A dual-encoder architecture then extracts embeddings from point clouds and text. To bridge the modality gap, we further propose a multilevel representation alignment strategy that incorporates supervised contrastive learning, reinforcing semantic-geometric correspondence in a part-to-whole manner. Extensive experiments demonstrate that our approach significantly enhances the understanding of affordance complexity, and the learned representations adapt well to diverse text queries, particularly in zero-shot settings. Real-world robot validation further confirms that our method improves affordance understanding and enables more fine-grained manipulation tasks.

Video

Method Overview


The proposed Aff3DFunc framework comprises: (a) a Point Cloud Encoder that extracts geometric features from the input point clouds; (b) a Text Encoder, in which the FTE module enriches affordance semantics with fine-grained descriptions; (c) Representation Alignment, which aligns the multimodal embeddings with cross-entropy and supervised contrastive losses across multiple levels; and (d) Cross Attention, which enhances the geometric features through point-wise relationship modeling with Multi-Head Attention.
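
To make the alignment in (c) concrete, the following is a minimal PyTorch sketch of a supervised contrastive loss applied to pooled geometric embeddings and their FTE-enriched text embeddings. It is an illustration only: the function name sup_con_loss, the temperature value, and the batch layout are assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def sup_con_loss(features, labels, temperature=0.07):
    # features: (N, D) embeddings (e.g., pooled point-cloud features and text
    # embeddings stacked into one batch); labels: (N,) affordance class ids.
    # Samples sharing a label are treated as positives.
    features = F.normalize(features, dim=-1)
    sim = features @ features.t() / temperature
    self_mask = torch.eye(len(features), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-similarity
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    loss = -pos_log_prob / pos_mask.sum(dim=1).clamp(min=1)
    return loss[pos_mask.any(dim=1)].mean()                # anchors with >= 1 positive

# Example: pull per-object geometric embeddings toward the text embeddings of
# the affordances they support (illustrative shapes and labels).
geo = torch.randn(8, 512)                                  # pooled point-cloud features
txt = torch.randn(8, 512)                                  # FTE-enriched text embeddings
labels = torch.tensor([0, 1, 2, 0, 1, 2, 3, 3])
loss = sup_con_loss(torch.cat([geo, txt], dim=0), torch.cat([labels, labels]))

During training, a term of this kind would be combined with the cross-entropy supervision mentioned in (c).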


Safety-critical Manipulation

We conducted real-world experiments using a Unitree GO2 mobile platform equipped with a 6-DoF D1 robotic arm and a parallel gripper. The experiments focus on manipulation tasks that require precise affordance understanding, such as distinguishing between a knife's handle (graspable) and its blade (hazardous).

t-SNE Visualizations of Learned Geometric Embeddings


We compute the class centers and use kNN to select the N/2 points nearest to each center, from which we construct convex hulls. As the visualizations show, our method yields more distinct category boundaries and greater intra-class spread, reflecting the effect of Functional Text Enhancement in improving both separability and diversity.
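
For reference, the following is a minimal Python sketch of that procedure: project the learned geometric embeddings with t-SNE, locate each class center, keep the N/2 points of that class nearest to the center, and draw their convex hull. It assumes scikit-learn, SciPy, and Matplotlib; the function and variable names are illustrative, not taken from the released code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
from sklearn.manifold import TSNE

def plot_tsne_hulls(embeddings, labels, perplexity=30.0, seed=0):
    # embeddings: (M, D) learned geometric features; labels: (M,) affordance class ids.
    xy = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(embeddings)
    for cls in np.unique(labels):
        pts = xy[labels == cls]
        center = pts.mean(axis=0)
        # Keep the N/2 points of this class nearest to its center.
        dists = np.linalg.norm(pts - center, axis=1)
        kept = pts[np.argsort(dists)[: max(3, len(pts) // 2)]]   # a hull needs >= 3 points
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=str(cls))
        hull = ConvexHull(kept)
        ring = np.append(hull.vertices, hull.vertices[0])        # close the polygon
        plt.plot(kept[ring, 0], kept[ring, 1], linewidth=1)
    plt.legend(markerscale=3, title="affordance class")
    plt.show()

# Example with random data standing in for the learned embeddings.
plot_tsne_hulls(np.random.randn(400, 512), np.random.randint(0, 5, size=400))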

BibTeX

@inproceedings{lin2025aff3dfunc,
  author    = {Lin Wu and Wei Wei and Peizhuo Yu and Jianglin Lan},
  title     = {Open-Vocabulary 3D Affordance Understanding via Functional Text Enhancement and Multilevel Representation Alignment},
  booktitle = {Proceedings of the ACM International Conference on Multimedia (ACM MM)},
  year      = {2025},
}