Vision and touch are two fundamental sensory
modalities for robots, offering complementary information
that enhances perception and manipulation tasks. Previous
research has attempted to jointly learn visual-tactile repre-
sentations to extract more meaningful information. However,
these approaches often fuse modalities through direct combination, such as
feature addition or concatenation, which tends to result in poor
feature integration. In this paper,
we propose ConViTac, a visual-tactile representation learning
network designed to enhance the alignment of features during
fusion using contrastive representations. Our key contribution
is a Contrastive Embedding Conditioning (CEC) mechanism
that leverages a contrastive encoder pretrained through
self-supervised contrastive learning to project visual and
tactile inputs into unified latent embeddings. These
embeddings are used to couple visual-tactile feature fusion
through cross-modal attention, aiming to align the unified
representations and enhance performance on downstream
tasks. We conduct extensive experiments to demonstrate the
superiority of ConViTac over current state-of-the-art methods in
real-world settings and the effectiveness of our proposed CEC mechanism,
which improves accuracy by up to 12.0% on material classification
and grasping prediction tasks.
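
As a concrete illustration of the CEC mechanism summarized above, the PyTorch-style sketch below shows how unified embeddings from a pretrained contrastive encoder could condition cross-modal attention over modality-specific features. All module names, tensor shapes, and hyperparameters (e.g. ContrastiveEncoder, d_model, the classification head size) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Contrastive Embedding Conditioning (CEC), under assumed
# module names and shapes; this is not the authors' released code.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Condition modality features on unified contrastive embeddings."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, d) modality-specific tokens
        # cond:  (B, M, d) unified contrastive embeddings used as keys/values
        out, _ = self.attn(query=feats, key=cond, value=cond)
        return self.norm(feats + out)

class ConViTacSketch(nn.Module):
    def __init__(self, vis_backbone, tac_backbone, contrastive_encoder,
                 d_model: int = 256, num_classes: int = 10):
        super().__init__()
        self.vis_backbone = vis_backbone                # visual feature extractor
        self.tac_backbone = tac_backbone                # tactile feature extractor
        self.contrastive_encoder = contrastive_encoder  # pretrained via contrastive learning
        self.fuse_v = CrossModalAttention(d_model)
        self.fuse_t = CrossModalAttention(d_model)
        self.head = nn.Linear(2 * d_model, num_classes)

    def forward(self, img: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        f_v = self.vis_backbone(img)        # (B, N, d) visual features
        f_t = self.tac_backbone(tactile)    # (B, N, d) tactile features
        with torch.no_grad():               # keep the pretrained contrastive encoder fixed
            z = self.contrastive_encoder(img, tactile)  # (B, M, d) unified embeddings
        f_v = self.fuse_v(f_v, z)           # condition visual features on z
        f_t = self.fuse_t(f_t, z)           # condition tactile features on z
        fused = torch.cat([f_v.mean(dim=1), f_t.mean(dim=1)], dim=-1)
        return self.head(fused)             # e.g. material classes or grasp outcome
```

In this sketch the unified embeddings serve as keys and values in the attention step, so both visual and tactile feature streams are pulled toward the same contrastive latent space before fusion; whether the encoder is frozen or fine-tuned is a design choice not specified in the abstract.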