Open-Vocabulary 3D Affordance Understanding via Functional Text Enhancement and Multilevel Representation Alignment

University of Glasgow
ACMMM 2025

Aff3DFunc demo.

Abstract

Understanding 3D affordance is essential for agents to interact effectively with real-world environments, encompassing tasks such as manipulation and navigation. Existing methods typically support open-vocabulary queries through label-based language descriptions but often suffer from under-generalization and insufficient discrimination in their representations. Affordance understanding, however, requires constructing a coherent semantic landscape from fragmented linguistic expressions, one that maintains intra-class diversity while minimizing inter-class overlap. To overcome these challenges, we introduce Aff3DFunc, a framework designed to strengthen the alignment between affordance and 3D geometry. It begins with a Functional Text Enhancement (FTE) module grounded in the Information Bottleneck (IB) principle, which enriches affordance semantics by maximizing both relevance and diversity. A dual-encoder architecture then extracts embeddings from point clouds and text. To bridge the modality gap, we further propose a multilevel representation alignment strategy that incorporates supervised contrastive learning, reinforcing semantic-geometric correspondence in a part-to-whole manner. Extensive experiments demonstrate that our approach significantly enhances the understanding of affordance complexity, and the learned representations adapt well to diverse text queries, particularly in zero-shot settings. Real-world robot validation further confirms that our method improves affordance understanding and enables more fine-grained manipulation tasks.

Video

Method Overview


The proposed Aff3DFunc framework comprises: (a) a Point Cloud Encoder that extracts geometric features from the input point clouds; (b) a Text Encoder, in which the FTE module enriches affordance semantics with fine-grained descriptions; (c) Representation Alignment, which aligns the multimodal embeddings with cross-entropy and supervised contrastive losses across multiple levels; and (d) Cross Attention, which enhances the geometric features through point-wise relationship modeling with Multi-Head Attention.
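
To make the alignment in (c) concrete, the following is a minimal PyTorch sketch of a supervised contrastive loss applied to pooled geometric embeddings and their FTE-enriched text embeddings. It is an illustration only: the function name sup_con_loss, the temperature value, and the batch layout are assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def sup_con_loss(features, labels, temperature=0.07):
    # features: (N, D) embeddings (e.g., pooled point-cloud features and text
    # embeddings stacked into one batch); labels: (N,) affordance class ids.
    # Samples sharing a label are treated as positives.
    features = F.normalize(features, dim=-1)
    sim = features @ features.t() / temperature
    self_mask = torch.eye(len(features), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-similarity
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    loss = -pos_log_prob / pos_mask.sum(dim=1).clamp(min=1)
    return loss[pos_mask.any(dim=1)].mean()                # anchors with >= 1 positive

# Example: pull per-object geometric embeddings toward the text embeddings of
# the affordances they support (illustrative shapes and labels).
geo = torch.randn(8, 512)                                  # pooled point-cloud features
txt = torch.randn(8, 512)                                  # FTE-enriched text embeddings
labels = torch.tensor([0, 1, 2, 0, 1, 2, 3, 3])
loss = sup_con_loss(torch.cat([geo, txt], dim=0), torch.cat([labels, labels]))

During training, a term of this kind would be combined with the cross-entropy supervision mentioned in (c).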


Safety-critical Manipulation

We conducted real-world experiments using a Unitree GO2 mobile platform equipped with a 6-DoF D1 robotic arm and a parallel gripper. The experiments focus on manipulation tasks that require precise affordance understanding, such as distinguishing between a knife's handle (graspable) and its blade (hazardous).

t-SNE Visualizations of Learned Geometric Embeddings


We compute the class centers and use kNN to select the N/2 points nearest to each center, from which we construct convex hulls. As the visualizations show, our method yields more distinct category boundaries and greater intra-class spread, reflecting the effect of Functional Text Enhancement in improving both separability and diversity.
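
For reference, the following is a minimal Python sketch of that procedure: project the learned geometric embeddings with t-SNE, locate each class center, keep the N/2 points of that class nearest to the center, and draw their convex hull. It assumes scikit-learn, SciPy, and Matplotlib; the function and variable names are illustrative, not taken from the released code.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
from sklearn.manifold import TSNE

def plot_tsne_hulls(embeddings, labels, perplexity=30.0, seed=0):
    # embeddings: (M, D) learned geometric features; labels: (M,) affordance class ids.
    xy = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(embeddings)
    for cls in np.unique(labels):
        pts = xy[labels == cls]
        center = pts.mean(axis=0)
        # Keep the N/2 points of this class nearest to its center.
        dists = np.linalg.norm(pts - center, axis=1)
        kept = pts[np.argsort(dists)[: max(3, len(pts) // 2)]]   # a hull needs >= 3 points
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=str(cls))
        hull = ConvexHull(kept)
        ring = np.append(hull.vertices, hull.vertices[0])        # close the polygon
        plt.plot(kept[ring, 0], kept[ring, 1], linewidth=1)
    plt.legend(markerscale=3, title="affordance class")
    plt.show()

# Example with random data standing in for the learned embeddings.
plot_tsne_hulls(np.random.randn(400, 512), np.random.randint(0, 5, size=400))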

BibTeX

@inproceedings{lin2025aff3dfunc,
  author    = {Lin Wu and Wei Wei and Peizhuo Yu and Jianglin Lan},
  title     = {Open-Vocabulary 3D Affordance Understanding via Functional Text Enhancement and Multilevel Representation Alignment},
  booktitle = {Proceedings of the ACM International Conference on Multimedia (ACM MM)},
  year      = {2025},
}