Authors:
(1) Samson Yu, Dept. of Computer Science, National University of Singapore (samson.yu@u.nus.edu);
(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;
(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;
(4) Jiafei Duan, University of Washington;
(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute (harold@comp.nus.edu.sg).
Table of Links
- Abstract and I. Introduction
- II. Related Work
- III. PhysiClear - Tactile and Physical Understanding Training & Evaluation Suite
- IV. Octopi - Vision-Language Property-Guided Physical Reasoning
- V. Experimental Setup
- VI. Experimental Results
- VII. Ablations
- VIII. Conclusion and Discussion, Acknowledgements, and References
- Appendix for Octopi: Object Property Reasoning with Large Tactile-Language Models
- APPENDIX A: ANNOTATION DETAILS
- APPENDIX B: OBJECT DETAILS
- APPENDIX C: PROPERTY STATISTICS
- APPENDIX D: SAMPLE VIDEO STATISTICS
- APPENDIX E: ENCODER ANALYSIS
- APPENDIX F: PG-INSTRUCTBLIP AVOCADO PROPERTY PREDICTION
II. RELATED WORK
In this section, we briefly review prior work on tactile representation learning with the GelSight sensor, large vision-language models (LVLMs), and language/vision-guided physical reasoning. There has been significant work on tactile-based manipulation and physical reasoning, and we refer readers to the relevant survey papers [37, 14, 66, 47, 28] for more information on these topics.
Tactile Representation Learning with GelSight. Tactile representation learning has advanced significantly in recent years, as robotic manipulation often requires more precision than vision alone can provide [44]. Among the available tactile sensors, vision-based sensors have gained popularity due to their high-resolution image outputs and versatility. In particular, the GelSight sensor has been used in recent work [32, 59, 60, 24, 61] for inferring physical properties (e.g., hardness, texture, and liquid volume) and for manipulating objects [48]. A key benefit of the GelSight is that its image outputs can be easily processed by modern deep learning methods [24]. As a result, popular vision algorithms have been applied to tactile representation learning with GelSight [62, 8]. In our work, we exploit recent advances in tactile representation learning to extend the capabilities of LVLMs to reason about vision-based tactile input.
Large Vision-Language Models. Recent advancements in LLMs have spurred a significant increase in efforts to integrate vision models with LLMs, exemplified by Flamingo [1], BLIP2 [29], and MiniGPT-v2 [9]. These Large Vision-Language Models (LVLMs) have shown remarkable effectiveness in leveraging web-scale image-text data for image-based reasoning, benefiting a range of applications from robotics [7, 15] to medical imaging [42]. Very recent work develops LVLMs that can process video content [30, 36], enabling reasoning over dynamic visual information, or that integrate multimodal sensory data [64].
Physical Reasoning with Language and Vision as Context. The exploration of physical reasoning in conjunction with language predates the emergence of LLMs. Early studies focused on assessing model proficiency in physical reasoning. For example, the PIQA [6] benchmark evaluates models on physical common sense, whereas PROST [2] examines their understanding of physical reasoning concepts. Subsequent advancements in language grounding have led to works such as CLEVRER [58], PIP [13], SPACE [12], and Phys101 [54], which investigate the acquisition of physical reasoning skills from visual inputs.
In the emerging LLM era, research has focused on object-centric physical reasoning in LLMs. This involves evaluating various LLMs for their physical reasoning capabilities, e.g., NEWTON [52], and employing Vision-Language Models (VLMs) to predict physical properties that are then used to facilitate reasoning, as demonstrated in physically-grounded VLMs [17]. Unlike previous studies that primarily address physical reasoning through the integration of vision and language, OCTOPI stands out as one of the first models capable of processing tactile images alongside language instructions to enable physical reasoning. There has been very recent work [22] that uses simulated tactile inputs with LLMs, but we focus on real tactile data. Concurrent work [16, 57] also explores real-world tactile data, but our work features physical property annotations, a test suite comprising scenario reasoning tasks, and experiments using OCTOPI to evaluate the utility of physical property inference.
This paper is available on arXiv under the CC BY 4.0 DEED license.