This AI Knows What It’s Touching—Because Scientists Tuned Its Senses

13 Jun 2025

Authors:

(1) Samson Yu, Dept. of Computer Science, National University of Singapore (samson.yu@u.nus.edu);

(2) Kelvin Lin, Dept. of Computer Science, National University of Singapore;

(3) Anxing Xiao, Dept. of Computer Science, National University of Singapore;

(4) Jiafei Duan, University of Washington;

(5) Harold Soh, Dept. of Computer Science, National University of Singapore and NUS Smart Systems Institute (harold@comp.nus.edu.sg).

VII. ABLATIONS

In this section, we describe ablation studies to examine (i) the impact of the encoder’s learned representations on physical property prediction and (ii) the influence of end-to-end finetuning data quantity on physical reasoning. For the following sections, we report test accuracy on unseen objects.

A. Ablation: The Impact of Encoder Fine-tuning

We use vision-based tactile inputs in this work, and pre-trained vision foundation models such as CLIP have shown impressive performance on vision tasks. To test whether additional fine-tuning improves the pre-trained CLIP encoder's representations for physical property prediction from tactile images, we conducted ablation experiments comparing two OCTOPI versions: one trained with the off-the-shelf CLIP encoder and one trained with the fine-tuned CLIP encoder.
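To make the comparison protocol concrete, here is a minimal sketch of how the two encoder variants could be scored on the same unseen-object test set. The property names and the "combined" metric (all properties correct for an object) are assumptions for illustration, and the labels and predictions below are toy placeholders, not the paper's data.

```python
# Hypothetical property set; the paper reports roughness, bumpiness,
# and a combined accuracy, so hardness is assumed as the third property.
PROPERTIES = ("hardness", "roughness", "bumpiness")

def accuracy(preds, labels):
    """Fraction of exact matches between predictions and labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def evaluate(variant_preds, labels):
    """Per-property accuracy plus 'combined' (all properties correct)."""
    scores = {p: accuracy([pr[p] for pr in variant_preds],
                          [lb[p] for lb in labels]) for p in PROPERTIES}
    hits = [all(pr[p] == lb[p] for p in PROPERTIES)
            for pr, lb in zip(variant_preds, labels)]
    scores["combined"] = sum(hits) / len(hits)
    return scores

# Toy test set and predictions from one encoder variant.
labels = [{"hardness": "hard", "roughness": "rough", "bumpiness": "smooth"},
          {"hardness": "soft", "roughness": "smooth", "bumpiness": "bumpy"}]
frozen = [{"hardness": "hard", "roughness": "smooth", "bumpiness": "smooth"},
          {"hardness": "soft", "roughness": "smooth", "bumpiness": "bumpy"}]

print(evaluate(frozen, labels))
```

Running the same `evaluate` on both variants' predictions yields the per-property and combined accuracies that the ablation tables compare.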

TABLE XI. CLIP Fine-tuning Ablation Results on Physical Understanding Tasks. Using a fine-tuned CLIP improves OCTOPI’s performance in physical understanding tasks for both OCTOPI-7b and OCTOPI-13b.

TABLE XII. End-to-end Fine-tuning Physical Property Prediction Result Comparisons. End-to-end fine-tuning with LoRA generally improves physical property prediction accuracies.

In Table X, our Object Property Description results show that OCTOPI-7b trained with a fine-tuned CLIP encoder outperforms one trained with an unmodified CLIP encoder by 7.90% on combined accuracy. Similarly, OCTOPI-13b with the fine-tuned CLIP visual encoder performs better on the combined, roughness, and bumpiness predictions, with combined accuracy 5.26% higher. This suggests that fine-tuning CLIP improves its learned representations for physical property prediction in an end-to-end LVLM.

We further tested both OCTOPI versions on physical understanding tasks, with results in Table XI. For OCTOPI-7b, the version trained with a fine-tuned CLIP encoder performs better across the three physical understanding tasks (by 17.72% on PC, 32.00% on PSS, and 8.03% on POM). OCTOPI-13b with the fine-tuned CLIP encoder likewise performs better, suggesting that encoder fine-tuning generally helps both physical understanding and physical reasoning performance. Further encoder analysis can be found in Appendix E.

B. Ablation: The Impact of End-to-end Fine-tuning

Table XII shows OCTOPI's performance on the property prediction task before and after end-to-end fine-tuning with LoRA. For both OCTOPI-7b and OCTOPI-13b, the fine-tuned variants generally performed better; OCTOPI-13b in particular improves sharply, with gains across all properties. Our results suggest that end-to-end fine-tuning improves physical property prediction accuracy. As with property prediction, we observed that fine-tuning with LoRA also improves OCTOPI's performance on physical understanding tasks (Table XIII).
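The appeal of LoRA for end-to-end fine-tuning is that only a low-rank update to each frozen weight matrix is trained. A minimal plain-Python sketch of that update follows; the dimensions and rank are toy values chosen for illustration, not the configuration used for OCTOPI.

```python
import random

def matmul(A, B):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

random.seed(0)
d, r = 8, 2  # toy sizes; r << d is the point of LoRA

# Frozen pre-trained weight W (d x d); LoRA never updates it.
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

# Only A (r x d) and B (d x r) are trained; B starts at zero so the
# adapted layer initially equals the frozen one.
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]

def adapted(x):
    """(W + B @ A) @ x, without materializing the d x d update."""
    col = [[v] for v in x]
    wx = matmul(W, col)
    bax = matmul(B, matmul(A, col))
    return [wx[i][0] + bax[i][0] for i in range(d)]

x = [random.gauss(0, 1) for _ in range(d)]
base = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]

# With B = 0 the adapter is a no-op, so training starts from the
# pre-trained model's behavior.
assert all(abs(a - b) < 1e-9 for a, b in zip(adapted(x), base))

# Trainable parameters: 2*r*d for LoRA vs d*d for full fine-tuning.
print(2 * r * d, "LoRA params vs", d * d, "full fine-tuning params")
```

Because the update B @ A has rank at most r, the number of trainable parameters grows linearly rather than quadratically in d, which is what makes end-to-end fine-tuning of a 7b or 13b LLM tractable.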

TABLE XIII. End-to-end Fine-tuning Physical Understanding Result Comparisons. End-to-end fine-tuning for physical understanding tasks significantly improves physical understanding for both OCTOPI-7b and OCTOPI-13b.

This paper is available on arxiv under CC BY 4.0 DEED license.