Recent Papers

2025

Surformer v2: A Multimodal Classifier for Surface Understanding from Touch and Vision

Multimodal surface material classification plays a critical role in advancing tactile perception for robotic manipulation and interaction. In this paper, we present Surformer v2, an enhanced multi-modal classification architecture designed to integrate visual and tactile sensory streams through a late(decision level) fusion mechanism. Building on our earlier Surformer v1 framework, which employed handcrafted feature extraction followed by mid-level fusion architecture with multi-head cross-attention layers, Surformer v2 integrates the feature extraction process within the model itself and shifts to late fusion.

Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers

In early childhood education, accurately detecting behavioral and collaborative engagement is essential for fostering meaningful learning experiences. This paper presents an AI-driven approach that leverages Vision Transformers (ViTs) to automatically classify children’s engagement using visual cues such as gaze direction, interaction, and peer collaboration.

Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features

Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and Principal Component Analysis (PCA)-reduced visual embeddings extracted via ResNet 50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features.