Authors:
(1) Samson Yu, Dept. of Computer Science, National University of Singapore (samson.yu@u.nus.edu);
(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;
(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;
(4) Jiafei Duan, University of Washington;
(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute (harold@comp.nus.edu.sg).
Table of Links
- Abstract and I. Introduction
- II. Related Work
- III. PhysiClear - Tactile and Physical Understanding Training & Evaluation Suite
- IV. Octopi - Vision-Language Property-Guided Physical Reasoning
- V. Experimental Setup
- VI. Experimental Results
- VII. Ablations
- VIII. Conclusion and Discussion, Acknowledgements, and References
- Appendix for Octopi: Object Property Reasoning with Large Tactile-Language Models
- APPENDIX A: ANNOTATION DETAILS
- APPENDIX B: OBJECT DETAILS
- APPENDIX C: PROPERTY STATISTICS
- APPENDIX D: SAMPLE VIDEO STATISTICS
- APPENDIX E: ENCODER ANALYSIS
- APPENDIX F: PG-INSTRUCTBLIP AVOCADO PROPERTY PREDICTION
III. PHYSICLEAR - TACTILE AND PHYSICAL UNDERSTANDING TRAINING & EVALUATION SUITE
This section describes PHYSICLEAR, which comprises a tactile dataset with physical property and object-part annotations, along with a training and evaluation suite.
A. Physical Property Selection
In this work, we focus on three object properties: hardness, roughness and bumpiness. We list each property’s description and categories in Table I. Briefly, hardness is characterized by the extent of surface deformation when subjected to pressure; roughness pertains to the texture of the surface; and bumpiness describes the prominence of surface protrusions. The hardness of an object correlates with its compliance and thermal characteristics. In contrast, roughness and bumpiness are attributes influenced by the surface’s friction coefficient [10].
The selection of hardness, roughness, and bumpiness as physical attributes in our research is grounded in their relevance for physical reasoning [43, 20, 38, 10, 5, 26]. Generally, static physical properties of objects are categorized into geometric (e.g., size), material (e.g., hardness), and affective (e.g., comfort) [41]. Our study predominantly addresses material properties, as we deemed geometric and affective properties too challenging to ascertain using the GelSight. The choice of these specific properties was also informed by the data collection methodology [27], tailored to the limitations and strengths of the GelSight sensor, including considerations for its sensitivity and durability.
B. Dataset Collection & Annotation
To facilitate the grounding of our physical reasoning on tactile inputs, we collected a dataset of 74 everyday objects, totalling 408 tactile videos, each paired with a corresponding video showing the object as the data was collected. These objects were selected to span our three selected properties, with variations across object types and materials. Detailed comparisons between PHYSICLEAR and existing GelSight datasets can be found in Table II.
The GelSight data was collected by hand to mitigate the risk of damaging the sensors and because of the difficulty of securing different parts of irregularly shaped objects while performing the required sampling motions. For each selected object, we captured up to seven tactile videos for each distinct region identified by a human evaluator. This process involved a two-step procedure: first pressing the GelSight sensor against the object to capture pressure readings, then rotating the sensor to acquire shear readings. Each video generated from a single GelSight sensor reading constitutes an individual data point within our dataset.
Annotations of the physical properties were carried out by three independent annotators, with the average score used as the final annotation for each data point. Annotators were provided with both the tactile videos and the objects. Each property has three categories, and annotators were given the following guidelines for labeling each property:
• Hardness: The label soft is for objects that are compressible with little force, moderately hard for objects that are compressible with moderate force, and hard for objects that are incompressible even with a large pressing force.
• Roughness: smooth is for objects that present minimal or no resistance when we slide a finger across their surface, slightly rough for objects with slight resistance, and rough for objects with significant resistance.
• Bumpiness: no bumps is for objects with no visible protrusions on their surface, small bumps for objects with protrusions smaller than ≈ 1/4 of the tactile image upon contact, and big bumps for objects with protrusions larger than ≈ 1/4 of the tactile image.
This process yielded over 1,200 annotations and we observed high inter-annotator agreement scores (ICC3k of 0.894 (hardness), 0.979 (roughness), and 0.792 (bumpiness)). For reference, a score above 0.75 is considered good or excellent reliability. The dataset was subsequently divided into three distinct subsets (training, validation, and testing) following an 80-10-10 split. This division resulted in 60 objects for training and 7 objects each for validation and testing.
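As an illustration of the agreement score reported above, the following is a minimal sketch of ICC(3,k) in the Shrout and Fleiss formulation (two-way mixed effects, consistency, average of k raters). It assumes each category is encoded as an ordinal score (0, 1, 2); that encoding, the function names, and the example ratings are assumptions for illustration and are not taken from the dataset.

```python
def icc3k(ratings):
    """ICC(3,k) for a table with one row per data point and one column per annotator.

    Computed from two-way ANOVA mean squares as (MS_rows - MS_error) / MS_rows.
    """
    n = len(ratings)        # number of data points
    k = len(ratings[0])     # number of annotators
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


def average_scores(ratings):
    """Average the annotators' scores for each data point."""
    return [sum(row) / len(row) for row in ratings]


# Made-up hardness scores (0 = soft, 1 = moderately hard, 2 = hard);
# each row is one data point rated by three annotators.
example = [[0, 0, 1], [1, 1, 1], [2, 2, 2], [0, 1, 0]]
agreement = icc3k(example)
final_annotations = average_scores(example)
```

With perfect agreement the error term vanishes and the score equals 1; disagreement between annotators lowers it.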
C. Training & Evaluation Suite
PHYSICLEAR’s training and evaluation suite comprises five physical reasoning tasks (Table III). All five tasks use tactile data and natural language instructions as inputs (Table IV). Since the tactile data is in video form, we follow prior LVLM work and represent it as a sequence of frames: X1, ..., XN. We further detail each task’s motivation, setup, and evaluation details, and whether it is used for training [T] and/or evaluation [E], below:
Property Comparison (PC) [T, E]. Given two tactile videos, each of a different object, a specified physical property, and its comparative adjective, determine whether the comparative adjective accurately describes the two videos. From a training perspective, this task helps a model distinguish between the various descriptions of physical properties, thereby aligning its comprehension of physical characteristics with our defined categories of hardness, roughness, and bumpiness. This alignment ability may improve a model’s ability to interpret and reason about the physical world in a manner consistent with human understanding.
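The ground-truth answer for a PC sample can be derived from the category labels of the two videos. The sketch below assumes the question is read as "the first video is ADJECTIVE than the second" and that categories are ordered as listed in the annotation guidelines; the adjective vocabulary and the function name `pc_answer` are hypothetical and not taken from the dataset.

```python
# Category order follows the annotation guidelines (lowest to highest).
PROPERTY_LEVELS = {
    "hardness": ["soft", "moderately hard", "hard"],
    "roughness": ["smooth", "slightly rough", "rough"],
    "bumpiness": ["no bumps", "small bumps", "big bumps"],
}

# Hypothetical comparative adjectives: (more of the property, less of it).
COMPARATIVES = {
    "hardness": ("harder", "softer"),
    "roughness": ("rougher", "smoother"),
    "bumpiness": ("bumpier", "less bumpy"),
}


def pc_answer(label_a, label_b, prop, adjective):
    """Return True if 'video A is <adjective> than video B' holds."""
    levels = PROPERTY_LEVELS[prop]
    more, less = COMPARATIVES[prop]
    rank_a, rank_b = levels.index(label_a), levels.index(label_b)
    if adjective == more:
        return rank_a > rank_b
    if adjective == less:
        return rank_a < rank_b
    raise ValueError(f"unknown adjective for {prop}: {adjective}")


is_harder = pc_answer("hard", "soft", "hardness", "harder")
is_softer = pc_answer("hard", "soft", "hardness", "softer")
```

Covering both polarities of each adjective pair means the same pair of videos can yield a true and a false statement.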
Property Superlative Selection (PSS) [T, E]. For three tactile videos, each of a different object, and a specified physical property and its superlative adjective (e.g. hardest for the hardness property), choose the video that the superlative adjective best describes. This task is similar to the PC task and helps the LLM align its physical understanding with that of our physical property descriptions. Furthermore, since prior work has shown that LLMs might perform differently when the polarity of the comparative adjective changes [52], this task seeks to enhance the LLM’s resilience to various comparative descriptions of physical properties.
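A PSS answer can be derived in a similar way from three category labels. This is a minimal sketch under the same assumptions as before: the ordering follows the annotation guidelines, while the superlative vocabulary, the function name `pss_answer`, and the choice to treat tied candidates as ambiguous are illustrative assumptions rather than the dataset's actual construction rules.

```python
# Category order follows the annotation guidelines (lowest to highest).
PROPERTY_LEVELS = {
    "hardness": ["soft", "moderately hard", "hard"],
    "roughness": ["smooth", "slightly rough", "rough"],
    "bumpiness": ["no bumps", "small bumps", "big bumps"],
}

# Hypothetical superlative adjectives: (most of the property, least of it).
SUPERLATIVES = {
    "hardness": ("hardest", "softest"),
    "roughness": ("roughest", "smoothest"),
    "bumpiness": ("bumpiest", "least bumpy"),
}


def pss_answer(labels, prop, superlative):
    """Return the index of the video the superlative describes, or None on a tie."""
    levels = PROPERTY_LEVELS[prop]
    most, least = SUPERLATIVES[prop]
    if superlative not in (most, least):
        raise ValueError(f"unknown superlative for {prop}: {superlative}")
    ranks = [levels.index(label) for label in labels]
    target = max(ranks) if superlative == most else min(ranks)
    winners = [i for i, rank in enumerate(ranks) if rank == target]
    # A tie means no single video is best described by the superlative.
    return winners[0] if len(winners) == 1 else None


hardest_index = pss_answer(["soft", "hard", "moderately hard"], "hardness", "hardest")
softest_index = pss_answer(["soft", "hard", "moderately hard"], "hardness", "softest")
```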
Property-object Matching (POM) [T, E]. This task requires matching physical properties to objects: given three tactile videos (each featuring a different object) and three specified objects, the goal is to correctly associate each video with an object. This helps to align a model’s existing knowledge of object properties with our haptic perception, as our annotations are based on human touch and serve as the reference for the physical properties and their labels.
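To make the matching setup concrete, the sketch below scores a predicted video-to-object assignment and computes the chance rate of a fully correct assignment. With three videos and three objects there are 3! = 6 possible one-to-one assignments, so a uniformly random guess is fully correct one time in six. The scoring functions and object names are hypothetical; the metric used in the evaluation suite may differ.

```python
from itertools import permutations


def pom_exact_match(predicted, truth):
    """True only if every video is matched to its correct object."""
    return list(predicted) == list(truth)


def pom_per_video_accuracy(predicted, truth):
    """Fraction of videos matched to the correct object."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def chance_exact_match(candidates):
    """Probability that a uniformly random one-to-one assignment is fully correct."""
    orderings = list(permutations(candidates))
    return 1 / len(orderings)


# Hypothetical object names; position i is the object assigned to video i.
truth = ["sponge", "tennis ball", "steel spoon"]
predicted = ["sponge", "steel spoon", "tennis ball"]

exact = pom_exact_match(predicted, truth)
accuracy = pom_per_video_accuracy(predicted, truth)
chance = chance_exact_match(truth)
```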
Property Scenario Reasoning (PSR) [E]. We provide two tactile videos, each showcasing a different object, along with a real-world scenario that relies on one or more of our defined physical properties. The task is to choose the video that represents the object whose physical properties best meet the scenario’s demands. This approach allows us to assess a model’s physical reasoning capabilities. Details of the scenarios are presented in Table V.
This paper is available on arXiv under the CC BY 4.0 DEED license.