VISAPP 2026 Abstracts


Area 1 - 3D Vision, Motion, Robotics, Application & Systems

Full Papers
Paper Nr: 43
Title:

Gamifying Social Robotics: The Impact of Connect 4 on Human-Robot Interaction

Authors:

Giuseppe De Simone, Luca Greco, Alessia Saggese and Mario Vento

Abstract: In recent years, Social Assistive Robots (SARs) have emerged as a promising solution to address the needs of various populations, offering innovative ways to support physical and mental well-being through health monitoring, companionship, and cognitive stimulation. Within this context, we propose gamifying a social robotic architecture by introducing a casual game, specifically Connect Four, to stimulate cognitive abilities and encourage social interaction. In addition to tablet-based interaction, we introduce an interaction mode based on visual gestures. This solution has been implemented and integrated into the Pepper robotic platform. Extensive experimentation was conducted with 40 participants in real-world settings, demonstrating the effectiveness of the proposed solution. Notably, the median score for overall satisfaction was 5 out of 5. Furthermore, the results unexpectedly revealed a user preference for gesture-based interaction.

Paper Nr: 44
Title:

3D Shrinkage Estimation for Printed Ceramics

Authors:

Ulugbek Alibekov, Nicholas Baraghini, Tamas Magyar, Pablo Eugui, Laurin Ginner and Nicole Brosch

Abstract: Dimensional deviations caused by anisotropic shrinkage during thermal post-processing remain a critical barrier in ceramic additive manufacturing. Existing compensation methods often rely on manual, iterative adjustments, limiting efficiency and scalability. We present an automated framework that estimates per-axis shrinkage from paired 3D scans of a printed part in its green and sintered states. Our acquisition system performs high-resolution 3D reconstructions of ceramic parts immediately after printing (green state) and post-sintering (sinter state), using simultaneous red and blue lasers for single-pass scanning. To address the challenge of aligning geometries with differing scales and deformations, we develop a robust registration algorithm tailored for cross-state point clouds. The final per-axis shrinkage estimation is performed by matching histograms of projected coordinates. Experimental evaluations show that the proposed framework achieves shrinkage estimations close to the reference solution, without manual intervention.
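The final estimation step, comparing distributions of projected coordinates per axis, can be illustrated with a simplified sketch. The helper below is hypothetical: it compares robust percentile extents rather than full histograms, which behaves equivalently for a pure anisotropic scaling between the two states:

```python
import numpy as np

def per_axis_shrinkage(green, sintered, lo=2.0, hi=98.0):
    """Estimate per-axis shrinkage factors from two point clouds.

    Toy stand-in for histogram matching: compare robust
    (percentile-based) coordinate extents per axis and return
    sintered_extent / green_extent for x, y, z.
    """
    factors = []
    for axis in range(3):
        g_ext = np.percentile(green[:, axis], hi) - np.percentile(green[:, axis], lo)
        s_ext = np.percentile(sintered[:, axis], hi) - np.percentile(sintered[:, axis], lo)
        factors.append(s_ext / g_ext)
    return np.array(factors)

# Synthetic check: shrink a random cloud anisotropically.
rng = np.random.default_rng(0)
green = rng.uniform(-1, 1, size=(5000, 3))
true_factors = np.array([0.85, 0.80, 0.90])
sintered = green * true_factors
est = per_axis_shrinkage(green, sintered)
print(est)  # close to [0.85, 0.80, 0.90]
```

In practice the two scans would first need the cross-state registration the paper describes; this sketch assumes already-aligned clouds.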

Paper Nr: 54
Title:

Consistent Multi-Lane Tracking with Temporally Recursive Spline Modeling

Authors:

Sanghyeon Lee, Donghun Kang and Min H. Kim

Abstract: Lane recognition and tracking are essential for autonomous driving, providing precise positioning and navigation data for vehicles. Existing single-image lane detection methods often falter in real-world conditions like poor lighting and occlusions. Video-based approaches, while leveraging sequential frames, typically lack continuity in lane tracking, leading to fragmented lane representations. We introduce a novel approach that addresses these challenges through temporally recursive spline modeling, a robust framework designed to maintain consistent, multi-lane tracking over time. Unlike traditional methods that limit tracking to adjacent lanes, our technique models lane trajectories as temporally recursive splines mapped in world space, capturing smooth lane continuity and enhancing long-term tracking fidelity across complex driving scenes. Our framework incorporates 2D image-based lane detections into a recursive spline model, facilitating accurate, real-time lane trajectory representation across frames. To ensure reliable lane association and continuity, we integrate a Kalman filter and an adaptive Hungarian algorithm, allowing our method to enhance baseline detectors and support consistent multi-lane tracking. Experimental results demonstrate that our temporally recursive spline modeling outperforms conventional approaches in lane detection and tracking metrics, achieving superior continuous lane recognition in challenging driving environments.
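The association step above, pairing tracked lanes with new detections via the Hungarian algorithm, can be sketched with SciPy's optimal-assignment solver. The function name, mean-distance cost, and fixed gate below are illustrative only, not the paper's adaptive formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_points, det_points, gate=2.0):
    """Match tracked lanes to detections by mean point distance.

    Cost is the mean Euclidean distance between corresponding sampled
    points of each track/detection pair; pairs above `gate` are rejected.
    Returns a list of (track_idx, det_idx) matches.
    """
    cost = np.zeros((len(track_points), len(det_points)))
    for i, t in enumerate(track_points):
        for j, d in enumerate(det_points):
            cost[i, j] = np.linalg.norm(t - d, axis=1).mean()
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= gate]

# Two tracked lanes and two detections, slightly shifted.
xs = np.linspace(0, 10, 20)
lane = lambda offset: np.stack([xs, np.full_like(xs, offset)], axis=1)
tracks = [lane(0.0), lane(3.5)]
dets = [lane(3.4), lane(0.1)]       # detections arrive in swapped order
print(associate(tracks, dets))      # [(0, 1), (1, 0)]
```

A Kalman filter would supply the predicted track points fed into this cost matrix; here they are given directly.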

Paper Nr: 63
Title:

Visual Autoregressive Modelling for Monocular Depth Estimation

Authors:

Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro and Vasileios Belagiannis

Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.

Paper Nr: 67
Title:

MH-Flow: Multi-View and Occlusion-Robust Dense Correspondence Estimation Based on Homographic Decomposition

Authors:

Thomas Vincent Chang, Simon Seibt, Kay Hartmann, Bartosz von Rymon Lipinski, Konrad Kluwak and Marc Erich Latoschik

Abstract: Dense correspondence estimation is an essential computer vision task; however, existing methods often struggle to produce accurate and reliable dense flow fields when encountering significant occlusions. This is primarily due to scenes with high depth complexity and wide baselines between input images. To address this challenge, MH-Flow is introduced, a novel approach specifically designed to generate occlusion-robust dense flow fields from two to four images. It leverages iterative dense feature matching to perform a multi-homography (MH) decomposition for an image pair. This decomposition enables effective detection and analysis of occluded and non-occluded pixel regions. Furthermore, MH-Flow incorporates a multi-view strategy and matching transitivity to extrapolate flow information from two additional input views and to enhance robustness, particularly in challenging occluded areas. Experimental results across multiple datasets demonstrate MH-Flow’s significant superiority in handling large viewpoint changes and occlusions compared to leading CNN-based approaches, achieving more accurate and complete dense flow fields.

Paper Nr: 68
Title:

A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments

Authors:

Malaz Tamim, Andrea Matic-Flierl and Karsten Roscher

Abstract: Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models (BEVDepth for camera, PointPillars for LiDAR, and DAL for camera–LiDAR fusion) and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model shows the lowest performance and is most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.

Paper Nr: 73
Title:

Place-NeRFs: A Smart Approach to Divide Large and Complex Scenes into Multiple Regions of Locally Related Views

Authors:

Jose L. Huillca, Horácio Henriques, Andre Luiz da S. Pereira, Raphael dos S. Evangelista, Christian Erik Condori, Robinson Luiz Souza Garcia, Michelle Soares Pereira Facina, Maikon Bressani, Marcos Amaral de do Almeida, Lucas Bertelli Martins, Esteban Walter Gonzales Clua and Leandro A. F. Fernandes

Abstract: We present Place-NeRFs, a scalable approach for large-scale 3D scene reconstruction that intelligently subdivides scenes into non-overlapping regions, allowing each region to be handled independently by off-the-shelf Neural Radiance Field (NeRF) models. This effectively balances reconstruction fidelity with computational resource utilization. By leveraging rough single-view depth estimation and visibility graphs, Place-NeRFs effectively groups spatially correlated photospheres. This strategy enables the generation of independent volumetric reconstructions, eliminating redundancies and enabling parallel processing. As a result, this approach significantly reduces processing time and enhances scalability during NeRF models’ training. We evaluate our approach in industrial environments and on two public datasets of multi-room scenes. Our approach proves particularly effective in challenging industrial scenarios characterized by sparse views, intricate structures, and uneven distribution of visual data. Experiments conducted in real-world environments demonstrate that Place-NeRFs overcomes traditional limitations, offering a robust and efficient solution for applications that require accurate, fast, and adaptable 3D reconstruction. Our contribution represents a significant advancement in integrating NeRFs into industrial pipelines and large-scale scenes.

Paper Nr: 97
Title:

Towards 3D Visualization of Insect Trajectories Using Stereo Event Camera Data and RGB-Based SfM Point Clouds

Authors:

Regina Pohle-Fröhlich, Tobias Bolten, Colin Gebler and Felix Lögler

Abstract: To investigate the factors influencing insect behaviour, a monitoring system is needed that can automatically record insect activity and environmental conditions over extended periods. For this purpose, a stereo setup with two event cameras is used to capture the flight paths of insects. This paper focuses on the visualization of these flight trajectories within the context of their environment. We describe the processing steps required to automatically detect reference markers in both the event data and the corresponding RGB video frames captured by a smartphone camera. A coloured point cloud of the environment is reconstructed from the video data using Structure-from-Motion and is aligned with the recorded insect flight paths. This enables novel insights into insect–flower interactions by integrating environmental context into event data visualization, allowing analyses such as tracking the sequence of an individual insect’s flower visits or quantifying the number of visitors to a single flower within a given time frame.
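Aligning the SfM point cloud with the event-camera coordinate frame from detected marker correspondences amounts to estimating a similarity transform. A generic Umeyama-style alignment sketch (a standard technique, not necessarily the authors' exact procedure) could look like:

```python
import numpy as np

def similarity_align(src, dst):
    """Umeyama similarity alignment: find s, R, t with s * R @ p + t ≈ q
    for corresponding points p in src, q in dst (both (n, 3) arrays)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    A, B = src - mu_s, dst - mu_d
    Sigma = B.T @ A / len(src)                     # cross-covariance
    U, D, Vt = np.linalg.svd(Sigma)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    S = np.diag([1.0, 1.0, d])                     # reflection guard
    R = U @ S @ Vt
    var_src = (A ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Recover a known transform from 10 synthetic "marker" positions.
rng = np.random.default_rng(1)
src = rng.normal(size=(10, 3))
th = 0.4
R0 = np.array([[np.cos(th), -np.sin(th), 0.0],
               [np.sin(th),  np.cos(th), 0.0],
               [0.0, 0.0, 1.0]])
s0, t0 = 2.5, np.array([1.0, -2.0, 0.5])
dst = s0 * src @ R0.T + t0
s, R, t = similarity_align(src, dst)
```

With at least three non-collinear marker correspondences, the closed-form solution recovers scale, rotation, and translation exactly in the noiseless case.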

Paper Nr: 99
Title:

FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking

Authors:

Martha Teiko Teye, Ori Maoz and Matthias Rottmann

Abstract: We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multi-modal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multi-modal bird’s eye view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometry and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we process sequences of bounding boxes with a temporal smoother over a moving window to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates how query-based transformer tracking methods benefit greatly from multi-modal sensor features compared to previous single-modality approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, significantly reducing identity switches while maintaining competitive accuracy. Our approach offers an efficient framework to improve transformer-based trackers to compete with other neural network-based trackers even with limited data and without pre-training.

Paper Nr: 105
Title:

Semantic Segmentation of Point Clouds for Autonomous Driving: An Approach for Imbalanced Classes

Authors:

Beatriz Pinheiro de Lemos Lopes, Matheus Leonel de Andrade, Analucia Schiaffino Morales, Iwens Gervasio Sene Junior and Lucas Araújo Pereira

Abstract: Computer vision, particularly semantic segmentation, plays a crucial role in autonomous vehicle navigation by enabling the identification of objects in complex urban environments. However, class imbalance poses a significant challenge, as frequent categories such as road or vegetation dominate datasets like SemanticKITTI, while safety-critical classes such as cyclist and motorcyclist remain underrepresented. This imbalance biases models toward majority classes, degrading recognition of minority ones and compromising safety, since autonomous systems depend on reliable detection of vulnerable road users. To address this issue, this study evaluates the impact of several loss functions on LiDAR-derived 2D projections using the RIU-Net model. Each alternative is compared to standard Cross-Entropy to assess its ability to mitigate imbalance. The results reveal that different formulations introduce distinct trade-offs: some improve minority-class performance but reduce majority-class accuracy, while others maintain a more uniform balance. Among these, losses such as Weighted Cross-Entropy achieved more consistent accuracy across classes. These findings highlight the importance of loss function choice and suggest that complementary strategies, such as augmentation or resampling, may be necessary for more robust perception in autonomous driving.
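The idea behind Weighted Cross-Entropy, upweighting rare classes so their errors are not drowned out by the majority, can be sketched in a few lines. The class counts and probabilities below are invented for illustration; inverse-frequency weighting is one common choice:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean cross-entropy. probs: (n, C) predicted class
    probabilities, labels: (n,) true class ids, weights: (C,) per-class
    weights. Plain cross-entropy is the special case of all-ones weights."""
    picked = probs[np.arange(len(labels)), labels]   # prob of true class
    w = weights[labels]
    return -(w * np.log(picked)).sum() / w.sum()

# 95% "road" (class 0), 5% "cyclist" (class 1); the model is confident
# on road but poor on the minority class.
labels = np.array([0] * 95 + [1] * 5)
probs = np.zeros((100, 2))
probs[:95] = [0.9, 0.1]
probs[95:] = [0.6, 0.4]

uniform = weighted_cross_entropy(probs, labels, np.array([1.0, 1.0]))
freq = np.bincount(labels) / len(labels)
balanced = weighted_cross_entropy(probs, labels, 1.0 / freq)
# balanced > uniform: the weighted loss is dominated by the
# poorly-predicted minority class, pushing training to fix it.
```

Inverse-frequency weights make each class contribute equally in expectation, which is exactly why minority-class recall tends to improve at some cost to majority-class accuracy, the trade-off the abstract describes.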

Paper Nr: 115
Title:

BladeR: Scalable, Shape-Guided 3D Reconstruction of Wind Turbine Blades from UAV Imagery

Authors:

Jonathan Sterckx, Michiel Vlaminck and Hiep Luong

Abstract: The structural integrity of wind turbine blades is critical for safe, efficient energy production. Accurate 3D reconstruction allows precise localization and quantification of wear and early damage detection. While drone-based inspections are standard, their high-resolution imagery poses challenges such as weak texture, lighting variability, and high computational cost. We present BladeR, a scalable pipeline for high-quality 3D reconstruction from inspection imagery. A hierarchical sparse reconstruction strategy leverages GPS/IMU data and blade shape priors to guide advanced keypoint matchers on full-resolution images. For dense reconstruction, a tiled Gaussian Splatting framework processes 45 MP images at native resolution, isolates the blade from background, and adaptively enhances detail recovery along the leading edge. Validation against commercial and open-source baselines shows significant improvements in model completeness and detail.

Paper Nr: 116
Title:

DiGS-3D: Diffusion Transformer as Unstructured 3D Gaussian Splatting Generator

Authors:

Baptiste Engel, Florian Chabot, Mohamed Chaouch and Quoc-Cuong Pham

Abstract: Generating 3D Gaussian Splatting (3D GS) representations poses the challenge of maintaining visual quality while preserving the geometry of generated objects. We introduce DiGS-3D, a method for generating 3D Gaussian Splatting representations in the primitive space using a Diffusion Transformer (DiT) architecture. While current 3D GS generative models use pre-computed proxy representations to structure the point cloud before processing, our method does not modify the original 3D GS representation after the original optimization, keeping its structure and reducing the computational overhead. We use an encoder-only transformer architecture to model both the parameters of 3D Gaussians and their relationships across the entire scene, and attain competitive metric values on the unconditional 3D GS generation task. We additionally introduce depth supervision for training 3D Gaussian Splatting generative diffusion models, which we show improves the performance of our model.

Paper Nr: 176
Title:

Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings

Authors:

Lars Beckers, Arno Waes, Aaron Van Campenhout and Toon Goedemé

Abstract: This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.

Paper Nr: 200
Title:

Accuracy Improvement of 3D Point Cloud Segmentation Using Attention with Coordinate-Based Clustering

Authors:

Konan Sasaki, Junya Komori and Kazuhiro Hotta

Abstract: Recognizing 3D point clouds is challenging due to their sparsity and high computational cost, leading many models to focus on local relationships. However, capturing global dependencies is essential for understanding geometric features and overall structures. While relative coordinates help model spatial relationships, they are sensitive to noise caused by environmental factors or sensor errors, which can significantly affect local recognition. To overcome this limitation, we propose Coordinate Cluster Attention (CCA), which integrates coordinate-based clustering with attention mechanisms. CCA provides consistent coordinate information within local regions while enhancing the model’s ability to capture global relationships. By combining clustering-derived features with attention features, CCA effectively utilizes spatial information and mitigates the influence of coordinate noise. Experimental results on two indoor datasets demonstrate the effectiveness of our method. Specifically, CCA improves mIoU by 1.92% and 1.83% on Area 5 and the 6-Fold setting of S3DIS, and by 2.22% on ScanNet V2 compared with PointNeXt. Moreover, CCA achieves a 1.1% higher mIoU than Point Transformer V3 on S3DIS Area 5. The proposed method enables more accurate segmentation because coordinate-based clustering and attention mutually complement each other’s weaknesses.

Paper Nr: 219
Title:

U3D: Unified Landmark-Displacement Framework for Real-Time Multi-Modal Emotion-Controllable 3D Facial Animation

Authors:

Laxmi Narayen Nagarajan Venkatesan, Rittik Panda, Rahulraj B. R., Dinesh Babu Jayagopi, Raj Tumuluri and Magnus Revang

Abstract: Audio-driven 3D facial animation enables natural communication for virtual avatars, digital assistants, and immersive media. However, existing systems are often limited to a single modality (meshes, blendshapes, or rigs), making them difficult to deploy across heterogeneous platforms such as VR, AR, and game engines. We present U3D, a real-time, emotion-controllable framework that unifies these modalities within a single representation. The core component, a Landmark-Displacement Variational Autoencoder (LD-VAE), learns to model motion as relative displacements from a neutral face, producing an identity-invariant stochastic latent space that captures both speech and emotion dynamics. From this shared latent space, U3D can generate consistent, expressive outputs across meshes, blendshapes, and rigs using lightweight decoders. Since standard landmarks cannot capture complex surface deformations, we introduce barycentric landmark embeddings with tunable densities (68–468 points) and analyze the density–modality optima. Experiments on multiple datasets demonstrate improved lip synchronization, expressiveness, realism, and generalization compared to existing approaches, while maintaining real-time performance (approximately 30 FPS on a single RTX 2080 Ti). U3D thus offers a unified and scalable solution for expressive 3D facial animation across multiple platforms and representation formats.

Paper Nr: 238
Title:

3DThermoScan: Simultaneous 3D Reconstruction and Temperature Measurement with a Single Camera

Authors:

Tom Fevrier, Yvain Quéau, Thierry Sentenac and Florian Bugarin

Abstract: We introduce 3DThermoScan, a proof-of-concept method employing a single camera for full 3D thermal reconstruction, i.e., joint geometry and temperature recovery. Most existing approaches rely on infrared–visible camera pairs, requiring cross-system calibration and strong assumptions on surface emissivity. Our method instead inverts a unified image formation model that accounts for both reflection and emission. This approach allows us to estimate both geometry and temperature from a single set of images captured by one camera, thereby overcoming these limitations. From images captured under controlled active lighting, 3DThermoScan integrates photometric stereo and thermoreflectometry to estimate surface normals, albedo, and radiance temperature. With Lambertian reflectance as the only optical assumption, we recover surface emissivity and true temperature from these estimates, and map them onto the 3D model. We performed preliminary experiments on synthetic and real diffuse surfaces with diverse geometric and thermal properties. Noise sensitivity tests showed the robustness of our method with errors under 5° for normals and under 1 °C for temperature. When comparing the result of our real-world experiment to the result of a commercial scanner, we found an average 3D reconstruction error of 0.1 mm, which confirmed the potential of this method.
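The photometric-stereo component rests on the classical Lambertian least-squares formulation: per pixel, intensity is albedo times the dot product of light direction and normal. A single-pixel sketch with synthetic lights (illustrative only, not the paper's calibrated setup):

```python
import numpy as np

def photometric_stereo(L, I):
    """Recover the unit normal and albedo for one pixel under the
    Lambertian model I_k = albedo * (l_k . n), given stacked light
    directions L (k, 3) and intensities I (k,). Least squares solves
    for b = albedo * n; the norm of b is the albedo."""
    b, *_ = np.linalg.lstsq(L, I, rcond=None)
    albedo = np.linalg.norm(b)
    return b / albedo, albedo

# Three known lights, known normal/albedo, noiseless intensities.
L = np.array([[0.0, 0.0, 1.0],
              [0.7, 0.0, 0.714],
              [0.0, 0.7, 0.714]])
L = L / np.linalg.norm(L, axis=1, keepdims=True)
n_true = np.array([0.1, -0.2, 1.0])
n_true = n_true / np.linalg.norm(n_true)
rho_true = 0.8
I = rho_true * np.clip(L @ n_true, 0.0, None)   # no shadows here
n, rho = photometric_stereo(L, I)
```

Three non-coplanar lights suffice in the noiseless case; real pipelines use more lights and robust fitting to handle shadows and specularities.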

Paper Nr: 245
Title:

MVTOP: Multi-View Transformer-Based Object Pose-Estimation

Authors:

Lukas Ranftl, Felix Brendel, Bertram Drost and Carsten Steger

Abstract: We present MVTOP, a novel transformer-based method for multi-view rigid object pose estimation. Through an early fusion of the view-specific features, our method can resolve pose ambiguities that would be impossible to solve with a single view or with a post-processing of single-view poses. MVTOP models the multi-view geometry via lines of sight that emanate from the respective camera centers. While the method assumes the camera interior and relative orientations are known for a particular scene, they can vary for each inference. This makes the method versatile. Using lines of sight enables MVTOP to predict the correct pose with the merged multi-view information. To show the model’s capabilities, we provide a synthetic dataset that can only be solved with such holistic multi-view approaches, since the poses in the dataset cannot be resolved from a single view. Our method outperforms single-view and all existing multi-view approaches on our dataset and achieves competitive results on the YCB-V dataset. To the best of our knowledge, no holistic multi-view method exists that can resolve such pose ambiguities reliably. Our model is end-to-end trainable and does not require any additional data, e.g., depth. The dataset will be publicly available at (https://www.mvtec.com/company/research/datasets/mvtec-mv-ball).

Paper Nr: 305
Title:

Towards AI-Assisted Kidney Visual Biopsy: Automatic Estimation of Banff Score and IFTA from the Analysis of Kidney Ultrasound Images

Authors:

Arturo Fuentes, Valentina M. Blanco-Martínez, Javier Juega-Mariño and Jorge Bernal

Abstract: The incidence and prevalence of chronic kidney disease are rising in developed countries, and its progressive nature often leads to terminal renal failure, resulting in high morbidity, mortality, and increased healthcare costs. The assessment of renal graft evolution is traditionally performed through specific scoring systems linked to kidney biopsies, which are costly, invasive, and focus on only a specific part of the kidney. The context of this work is the development and validation of a computational support system aiming at the automatic, non-invasive assessment of kidney transplant status by predicting two key histological metrics: the Banff classification for rejection and Interstitial Fibrosis and Tubular Atrophy (IFTA). In this work we introduce a first study in which features extracted from B-mode ultrasound images with already segmented anatomical structures (kidney, cortex and medulla) are analysed to estimate Banff score and IFTA. To validate this approach, a new dataset comprising real-life transplant cases was curated and annotated with pixel-wise precision using a custom-developed annotation tool. Experimental results demonstrate the potential of the proposed system as part of a "kidney visual biopsy" processing pipeline, revealing that predictive performance is strongly dependent on the specific anatomical structure analyzed and the fine-tuning strategy applied. The study indicates that the medulla and the whole kidney offer robust features for Banff and IFTA prediction, respectively.

Paper Nr: 319
Title:

MAIA-D: A Multi-Patch Anomaly Inspection Approach Based on DINOv2

Authors:

Juan F. Flores Jarrin, Fabian Sturm and Tobias Bocklet

Abstract: Reliable defect detection remains a significant challenge in flexible industrial environments, particularly in contract manufacturing characterized by high product variability. To address this, we propose MAIA-D, a Multi-patch Anomaly Inspection Approach based on DINOv2. Our system extracts patch-level embeddings from test and reference images and compares them using cosine similarity, enabling simultaneous defect classification and localization. A critical advantage of MAIA-D is that it requires no prior training and is highly generalizable across diverse objects and textures. This work presents an analysis of MAIA-D’s performance across different hyperparameter configurations, alongside a comparison involving a preprocessing module for segmentation and alignment on the MVTec AD and VisA datasets. Our approach achieves average AUC values of approximately 95.6% for MVTec AD and 88.9% for VisA.
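The core comparison, scoring each test patch by its cosine similarity to reference-patch embeddings, can be sketched as follows. The embeddings below are random stand-ins; in MAIA-D they would come from DINOv2, and the scoring rule here is a generic simplification:

```python
import numpy as np

def anomaly_scores(test_emb, ref_emb):
    """Per-patch anomaly score = 1 - max cosine similarity between each
    test-patch embedding and all reference-patch embeddings.
    test_emb: (n, d), ref_emb: (m, d). High score = likely defect."""
    t = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = t @ r.T                      # (n, m) cosine similarities
    return 1.0 - sim.max(axis=1)

# Reference patches cluster around one direction; one test patch deviates.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
ref = base + 0.01 * rng.normal(size=(16, 8))
test = base + 0.01 * rng.normal(size=(4, 8))
test[2] = -base                        # simulated defect patch
scores = anomaly_scores(test, ref)
print(scores.argmax())                 # 2: the defect patch stands out
```

Because scores are computed per patch, thresholding them yields both a defect decision (any patch above threshold) and a coarse localization (which patches), with no training required.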

Paper Nr: 326
Title:

PGT-NeRF: Physics-Guided Thermal Neural Radiance Fields

Authors:

Deepika Rani Kaliappan Mahalingam, Ajay Kumar Sigatapu, Deep Doshi and Ganesh Sistu

Abstract: Reconstructing accurate 3D thermal fields from sparse imagery is essential for applications such as predictive maintenance, energy diagnostics, and autonomous navigation in degraded environments. Existing thermal NeRF approaches, however, either depend on RGB supervision or struggle to impose meaningful structure from thermal-only input. We propose PGT-NeRF, a Neural Radiance Field framework for thermal reconstruction that introduces two lightweight, physics-inspired priors rather than attempting full physical simulation. A diffusion-motivated smoothing operator encourages spatial coherence in the temperature field, while an emissivity-inspired scaling factor captures material-dependent variation. Both operate in the thermal camera's calibrated apparent-temperature domain and serve as inductive biases, not physically exact models. Combined with an edge-aware gradient loss (Structural Similarity Gradient Loss), these priors improve robustness, denoise predictions, and preserve structural detail. Across indoor and outdoor scenes, PGT-NeRF achieves +1.5 dB PSNR and +3% SSIM over prior baselines, establishing a new benchmark for single-modality thermal view synthesis, even when compared against multi-modal baselines.
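A diffusion-motivated smoothing prior can be illustrated with explicit heat-equation steps on a toy 2D temperature grid. This is only an analogue of the idea (spatial coherence via diffusion), not the paper's actual operator:

```python
import numpy as np

def diffusion_smooth(T, alpha=0.2, steps=10):
    """Explicit heat-equation iteration on a 2D temperature grid with
    periodic boundaries: T += alpha * laplacian(T) per step. Requires
    alpha < 0.25 for numerical stability of the 4-neighbour stencil."""
    T = T.astype(float).copy()
    for _ in range(steps):
        lap = (np.roll(T, 1, axis=0) + np.roll(T, -1, axis=0)
               + np.roll(T, 1, axis=1) + np.roll(T, -1, axis=1) - 4 * T)
        T += alpha * lap
    return T

# Noisy temperature field: diffusion damps high-frequency noise while
# (with periodic boundaries) conserving the mean temperature.
rng = np.random.default_rng(0)
field = 20.0 + rng.normal(0.0, 1.0, size=(32, 32))
smooth = diffusion_smooth(field)
```

Used as a training prior rather than a post-process, such an operator penalizes temperature fields that a few diffusion steps would change a lot, i.e., spatially incoherent ones.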

Short Papers
Paper Nr: 12
Title:

3D Gaussian Splatting with Fisheye Images: Field of View Analysis and Depth-Based Initialization

Authors:

Ulas Gunes, Matias Turkulainen, Mikhail Silaev, Juho Kannala and Esa Rahtu

Abstract: We present the first evaluation of 3D Gaussian Splatting methods on real fisheye imagery with fields of view above 180°. Our study evaluates Fisheye-GS (Liao et al., 2024) and 3DGUT (Wu et al., 2025) on indoor and outdoor scenes captured with 200° fisheye cameras, with the aim of assessing the practicality of wide-angle reconstruction under severe distortion. By comparing reconstructions at 200°, 160°, and 120° field-of-view, we show that both methods achieve their best results at 160°, which balances scene coverage with image quality, while distortion at 200° degrades performance. To address the common failure of Structure-from-Motion (SfM) initialization at such wide angles, we introduce a depth-based alternative using UniK3D (Universal Camera Monocular 3D Estimation) (Piccinelli et al., 2025). This represents the first application of UniK3D to fisheye imagery beyond 200°, despite the model not being trained on such data. With the number of predicted points controlled to match SfM for fairness, UniK3D produces geometrically accurate reconstructions that rival or surpass SfM, even in challenging scenes with fog, glare, or open sky. These results demonstrate the feasibility of fisheye-based 3D Gaussian Splatting and provide a benchmark for future research on wide-angle reconstruction from sparse and distorted inputs.

Paper Nr: 19
Title:

MST-Based Ordering of Point Clouds and Curve Fitting

Authors:

Christoph Dalitz and Alexander Jongbloed

Abstract: We present a new method to fit a curve through a set of unordered points in 3D space. The method first computes the distance of each point to an end point of the longest path in the minimum spanning tree (MST). This distance is then used as a predictor in a LOWESS regression where the number of neighbors is locally chosen on the basis of a linearity index. Apart from computing a fitted curve, the method can also be used to order the points in the cloud. Compared to other approaches based on thinning, subsampling and spline interpolation, our method has the advantages that it is simple, does not depend on a good guess for parameters, and automatically yields a natural order on all points by assigning each point a hidden time value.
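The ordering step, locating the longest path in the MST and sorting points by tree distance from one of its ends, can be sketched with SciPy. This simplified version omits the LOWESS fit and linearity index:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform

def mst_order(points):
    """Order an unordered point cloud along its dominant curve: build
    the MST of the complete Euclidean graph, take the pair of nodes at
    maximal tree distance (the ends of the longest path), and sort all
    points by tree distance from one end."""
    D = squareform(pdist(points))
    mst = minimum_spanning_tree(D)
    dist = shortest_path(mst, directed=False)      # tree distances
    i, _ = np.unravel_index(np.argmax(dist), dist.shape)
    return np.argsort(dist[i])                     # order along the path

# Points sampled along a 3D curve, then shuffled; ordering recovers
# the sequence (up to direction).
t = np.linspace(0.0, 1.0, 30)
curve = np.stack([t, np.sin(2 * t), t ** 2], axis=1)
perm = np.random.default_rng(0).permutation(30)
order = mst_order(curve[perm])
```

The tree distance from one path end plays the role of the "hidden time value" mentioned in the abstract: it assigns each point a scalar parameter along the curve.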

Paper Nr: 25
Title:

Multi-View Instance Segmentation of 3D Repeating Fine-Scale Structure for Precision Mapping of Coral Reefs

Authors:

Fabio Ganovelli, Gaia Pavoni, Elena Cenni, Massimiliano Corsini, Somnath Dutta and Paolo Cignoni

Abstract: We present an automated framework for 3D instance segmentation and counting of small, repetitive structures in photogrammetric 3D models. The method combines deep learning–based 2D instance segmentation with a multi-view geometric fusion strategy, projecting 2D masks onto a reconstructed 3D mesh to identify and measure individual elements. Exploiting redundancy across multiple viewpoints enables robust segmentation even with approximate geometry, making the approach suitable for challenging environments such as underwater surveys. We apply the framework to the analysis of Cladocora caespitosa corallites, whose dense arrangement and transparent polyps complicate manual counting. Validation in controlled and in situ settings demonstrates high accuracy and substantial reductions in expert analysis time. An open-source visualization tool supports interactive inspection and metric extraction.

Paper Nr: 35
Title:

ROI-NeRFs: Hi-Fi Visualization of Objects of Interest Within a Scene by NeRFs Composition

Authors:

Quoc-Anh Bui, Gilles Rougeron, Géraldine Morin and Simone Gasparini

Abstract: Efficient and accurate 3D reconstruction is crucial for cultural heritage applications. This study addresses the challenge of visualizing objects in complex scenes at high levels of detail (LOD) using Neural Radiance Fields (NeRFs), improving visual fidelity for selected objects while maintaining computational efficiency. The proposed ROI-NeRFs framework divides the scene into a Scene NeRF, capturing the overall scene at moderate detail, and multiple Region Of Interest (ROI) NeRFs, focusing on user-defined objects. An object-focused camera selection module automatically groups relevant cameras for each NeRF during the decomposition phase, while a Ray-level Compositional Rendering technique in the composition phase integrates Scene and ROI NeRFs for simultaneous multi-object rendering. Experiments on two real-world datasets, including a complex eighteenth-century cultural heritage room, demonstrate superior performance over baseline methods, enhancing LOD in object regions, minimizing artifacts, and maintaining efficient inference.

Paper Nr: 78
Title:

Robust Multi-View Self-Calibration from Dense Matches

Authors:

Johannes Hägerlind, Bao-Long Tran, Urs Waldmann and Per-Erik Forssén

Abstract: Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components in a structure-from-motion (SfM) pipeline, and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how to best subsample the predicted correspondences from a dense matcher to leverage them in the estimation process. (2) We investigate selection criteria for how to add the views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% ours vs. 40.4% vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.

Paper Nr: 83
Title:

A Comprehensive Analysis of Monocular Depth Estimation Models Performance in Automotive Scenarios

Authors:

Frederike Hörcher, Mikel García, Martí Sánchez, Nagore Barrena and Unai Elordi

Abstract: Accurate perception of a vehicle’s surroundings is crucial for autonomous driving systems. Although LiDAR sensors provide highly precise 3D data, they are significantly more expensive than other sensors. Recent advances in Monocular Depth Estimation (MDE) have improved the accuracy and inference time of state-of-the-art methods. This work evaluates such MDE methods in the automotive domain, assessing their potential as a low-cost, higher-resolution alternative using a single-camera image. We present a comprehensive framework to evaluate the potential of state-of-the-art MDE models for 3D reconstruction in automotive scenarios. We quantitatively assess four leading models (Depth Anything V2, Depth Pro, UniDepth V2 and UniK3D) on the nuScenes dataset, analysing both 2D depth map and 3D point cloud metrics. Our findings indicate that MDE models can reconstruct automotive scene geometry with high fidelity, although accuracy decreases with distance. UniDepth V2 emerged as the most consistent and efficient model, achieving near-real-time performance. The results suggest that MDE models are a promising, cost-effective solution for near-distance perception, capable of supporting critical functionalities in Advanced Driver Assistance Systems.
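The abstract does not enumerate its 2D depth-map metrics, but MDE evaluations of this kind conventionally report AbsRel, RMSE, and the delta < 1.25 inlier ratio. A minimal sketch of these standard metrics (assuming validity is given by positive ground-truth depth; the paper's exact metric set may differ):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics: AbsRel, RMSE, delta < 1.25."""
    mask = gt > 0                          # evaluate only where GT depth exists
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)   # mean relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))  # root mean squared error
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)  # inlier ratio
    return abs_rel, rmse, delta1
```

For example, a prediction that uniformly overestimates depth by 30% yields AbsRel = 0.3 and a delta-1.25 inlier ratio of 0.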

Paper Nr: 84
Title:

Predicting Driver Maneuvers with Driver's Gaze and Vehicle Dynamics

Authors:

Farzan Heidari, Taufiq Rahman and Michael A. Bauer

Abstract: Modeling driver behavior is pivotal for the advancement of Advanced Driver Assistance Systems (ADAS), contributing to road safety and driving efficiency. The correlation between a driver’s visual behavior and their environment provides valuable insights for enhancing predictive models of driver behavior. However, estimating where a driver is looking while driving and modeling this temporal information pose unique challenges. This study focuses on leveraging the driver’s Point of Gaze (PoG) in correlation with vehicle dynamics data to predict the driver’s next maneuver. Two predictive models, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), known for their ability to handle temporal dependencies, are employed and evaluated. Additionally, we present a temporal data preparation approach to enhance the models’ ability to capture the dynamic nature of driving scenarios and improve prediction timeliness. Our experimental results demonstrate the effectiveness of using the driver’s PoG in correlation with vehicle dynamics for predicting upcoming maneuvers. Both LSTM and GRU models yielded promising results, with the GRU model exhibiting superior performance.

Paper Nr: 85
Title:

A Framework for Vehicular Instrumentation and Dataset Creation for Designing ADAS and Driver Intention Prediction Based on Eye Data

Authors:

Farzan Heidari, Farhad Dalirani, Taufiq Rahman, Daniel Singh Cheema and Michael A. Bauer

Abstract: Advanced driver assistance systems (ADAS) can have a significant impact on driving safety through timely warnings and even taking control of the vehicle in critical situations. Further advances in ADAS can move beyond sensing and reacting to approaches that incorporate analyses of the driver and their interaction with the environment, both inside and outside the vehicle. These capabilities rely on algorithms such as computer vision, scene analysis, and driver intention prediction, which in turn require rich, synchronized data from diverse sensors. In this paper, we present a scalable and multimodal framework for vehicle instrumentation and end-to-end dataset creation to facilitate the analysis and prediction of driver intention and behavior, with a specific emphasis on leveraging the driver’s eye data in conjunction with other vehicle sensor data. Our contributions include a modular hardware/software platform designed for straightforward deployment and expansion, a pipeline for high-fidelity data capture and precise temporal synchronization across heterogeneous streams, and an annotation technique for event-level labeling of driver maneuvers. Using detailed vehicle instrumentation, data collection, extraction, and annotation procedures, we have created a real-world driving dataset that forms the foundation for our subsequent analyses and modeling.

Paper Nr: 87
Title:

PM25Vision: A Large-Scale Benchmark Dataset for Visual Estimation of Air Quality

Authors:

Han Yang

Abstract: We introduce PM25Vision (PM25V), the largest and most comprehensive dataset to date for estimating air quality (specifically, PM2.5 concentrations) from street-level images. The dataset contains over 11,114 images matched with timestamped and geolocated PM2.5 readings across 3,261 AQI monitoring stations and 11 years, significantly exceeding the scale of previous benchmarks. The spatial accuracy of the dataset reaches 5 kilometers, far exceeding the city-level accuracy of many existing datasets. We describe the data collection, synchronization, and cleaning pipelines, and provide baseline model performances using CNN and transformer architectures. Our dataset is publicly available.

Paper Nr: 114
Title:

Fast-HaMeR: Boosting Hand Mesh Reconstruction Using Knowledge Distillation

Authors:

Hunain Ahmed Jillani, Ahmed Tawfik Aboukhadra, Ahmed Elhayek, Jameel Malik, Nadia Robertini and Didier Stricker

Abstract: Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that lightweight backbones only 35% the size of the original achieve 1.5x faster inference while preserving comparable quality, with an accuracy difference of only 0.4 mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under https://github.com/hunainahmedj/Fast-HaMeR.

Paper Nr: 127
Title:

Neuro-Symbolic Multi-Head Learning for Spatial Relation Classification

Authors:

Katia Abdou, Ala Mhalla and Olivier Caspary

Abstract: Learning spatial relationships from visual data requires balancing perceptual accuracy with logical consistency. We propose a neuro-symbolic framework based on Logic Tensor Networks that integrates first-order spatial axioms directly into neural network training. Our multi-head architecture factorizes spatial reasoning into specialized prediction modules for horizontal orientation, vertical orientation, depth, and relative distance, constrained by differentiable axioms encoding antisymmetry, transitivity, and mutual exclusion. Experiments on CLEVR and REPLICA datasets demonstrate that axiom-constrained learning achieves near-perfect accuracy on elementary predicates and 94% on two-relation compositions (a 12-percentage-point improvement over unconstrained baselines), while ensuring high logical consistency. Systematic ablation studies validate the contributions of logical regularization, multi-head factorization, and axiom weighting strategies.

Paper Nr: 133
Title:

Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

Authors:

Kemal Alperen Çetiner and Hazım Kemal Ekenel

Abstract: Estimating the 6D pose of objects from a single RGB image is a critical task for robotics and extended reality applications. However, state-of-the-art multi-stage methods often suffer from high latency, making them unsuitable for real-time use. In this paper, we present Yolo-Key-6D, a novel single-stage, end-to-end framework for monocular 6D pose estimation designed for both speed and accuracy. Our approach enhances a YOLO-based architecture by integrating an auxiliary head that regresses the 2D projections of an object’s 3D bounding box corners. This keypoint detection task significantly improves the network’s understanding of 3D geometry. For stable end-to-end training, we directly regress rotation using a continuous 9D representation projected to SO(3) via singular value decomposition. On the LINEMOD and LINEMOD-Occluded benchmarks, Yolo-Key-6D achieves competitive accuracy scores of 96.24% and 69.41%, respectively, on the ADD(-S) 0.1d metric, while operating in real time. Our results demonstrate that a carefully designed single-stage method can provide a practical and effective balance of performance and efficiency for real-world deployment.
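The 9D-to-SO(3) projection via singular value decomposition mentioned in the abstract is a standard construction and can be sketched as follows. This is a generic illustration of the technique, not the authors' implementation:

```python
import numpy as np

def svd_orthogonalize(m9):
    """Project a raw 9D network output onto SO(3) via SVD.

    The 9 regressed values are reshaped to a 3x3 matrix and replaced by
    the closest rotation matrix; the determinant correction keeps the
    result a proper rotation (det = +1) rather than a reflection.
    """
    m = np.asarray(m9, dtype=float).reshape(3, 3)
    u, _, vt = np.linalg.svd(m)
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

# Any 9D vector maps to a valid rotation, so the network can regress freely.
r = svd_orthogonalize(np.random.default_rng(1).normal(size=9))
```

Because the mapping is continuous and differentiable almost everywhere, it avoids the discontinuities of quaternion or Euler-angle regression during end-to-end training.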

Paper Nr: 137
Title:

R3GW: Relightable 3D Gaussians for Outdoor Scenes in the Wild

Authors:

Margherita Lea Corona, Wieland Morgenstern, Peter Eisert and Anna Hilsmann

Abstract: 3D Gaussian Splatting (3DGS) has established itself as a leading technique for 3D reconstruction and novel view synthesis of static scenes, achieving outstanding rendering quality and fast training. However, the method does not explicitly model the scene illumination, making it unsuitable for relighting tasks. Furthermore, 3DGS struggles to reconstruct scenes captured in the wild by unconstrained photo collections featuring changing lighting conditions. In this paper, we present R3GW, a novel method that learns a relightable 3DGS representation of an outdoor scene captured in the wild. Our approach separates the scene into a relightable foreground and a non-reflective background (the sky), using two distinct sets of Gaussians. R3GW models view-dependent lighting effects in the foreground reflections by combining Physically Based Rendering with the 3DGS scene representation in a varying illumination setting. We evaluate our method quantitatively and qualitatively on the NeRF-OSR dataset, offering state-of-the-art performance and enhanced support for physically-based relighting of unconstrained scenes. Our method synthesizes photorealistic novel views under arbitrary illumination conditions. Additionally, our representation of the sky mitigates depth reconstruction artifacts, improving rendering quality at the sky-foreground boundary.

Paper Nr: 140
Title:

Almost 28 Years later: Mobile Visual Search Using Large Datasets on Handheld Hardware

Authors:

Timon Kanik, Philipp Fleck, Thomas Kernbauer and Clemens Arth

Abstract: In this work, we revisit the topic of Mobile Visual Search (MVS), which has been unduly neglected in recent years. Image retrieval was sparked almost 28 years ago, and variants of MVS have been available in SDKs for mobile devices for more than 10 years, such as Vuforia, ARKit, and ARCore. However, severe limitations still hamper the practical use of MVS in modern AR applications, such as the maximum number of image targets. We investigate a modern and tunable MVS pipeline and provide an integration mechanism that leverages vendor-based features from ARKit or ARCore to overcome these limitations. Finally, we provide a Unity plugin and associated training utilities that let users build huge databases for mobile handheld usage.

Paper Nr: 144
Title:

2VA: Vessel-Aware Retinal Image Anonymization that Preserves Diabetic Retinopathy Utility

Authors:

Rishabh Shukla, Pansuriya Uttambhai Bharatbhai and Harkeerat Kaur

Abstract: Retinal images are being openly shared to build and compare automated screening systems for diabetic retinopathy (DR). However, the retinal vasculature itself is a powerful biometric, allowing re-identification across datasets and breaching patient confidentiality. We introduce a two-stage, vessel-aware anonymization (2VA) pipeline that de-structures the identity-carrying vascular pattern yet leaves intact the clinical signal necessary to evaluate DR. In Stage A, we apply stochastic, mask-conditioned micro-occlusions along the segmented vessel tree, followed by classical inpainting and intra-vessel noise mixing to disrupt local geometry without affecting surrounding pathology. In Stage B, we inpaint the full vessel mask within the retinal field of view. We instantiate a threat model in which an attacker has a gallery of originals and can use vessel-skeleton matchers. Privacy is measured through open-set verification metrics, and utility is evaluated based on DR grading performance, stability of lesion detection, and agreement across graded regions. Our proposed method decreases linkability in the presence of strong attacks while retaining diagnostic utility, and we demonstrate the advantage of the hybrid design over either individual stage. The pipeline is reproducible, compute-light, and compatible with existing clinical workflows, rendering it a practical basis for responsible data sharing.

Paper Nr: 145
Title:

Geometry-Based Differentiable Camera Placement for Optimal 3D Coverage

Authors:

A. Moraza, X. Lin, D. Mejia-Parra and N. Barrena

Abstract: Optimal camera placement is fundamental for inspection and 3D reconstruction tasks. We address this traditionally discrete problem by reformulating it into a continuous optimization task using Differentiable Rendering (DR), a technique that enables gradient computation from 3D camera poses to a final loss value. Assuming a known scene, our pipeline treats camera poses as learnable parameters. We implement a novel differentiable visibility formulation that compares observed depths from the rendered scene with the expected depth of the mesh vertices. Based on this, we define a geometry-based loss function using two specific coverage criteria, “At-Least-Once” (ALO) and “Exactly-Once” (EO), which find optimal placement configurations. Experiments conducted on diverse 3D objects and camera configurations demonstrate that our approach outperforms the SOTA baseline Neural Observation Field Guided Hybrid Optimization of Camera Placement (NeOF). Our methods yield superior quality, as demonstrated by our quantitative evaluation, and achieve lower computation times. Furthermore, the quality and efficiency of the generated poses show strong potential for future integration into trajectory planning systems.

Paper Nr: 170
Title:

Multi-Pathology Segmentation in Lumbar Spine MRI: A Comparative Deep Learning Approach

Authors:

Claudio Leite, Samuel Felipe dos Santos and Jurandy Almeida

Abstract: Low back pain is a leading cause of disability worldwide. Magnetic Resonance Imaging (MRI) is a cornerstone for diagnosis, yet deep learning methods are needed to overcome limitations such as diagnostic overlap, where a single anatomical location presents with multiple pathologies. This paper presents a comprehensive empirical study on the segmentation of multiple pathologies in lumbar intervertebral discs. We systematically compare three distinct strategies for handling diagnostic overlap: (i) binary class segmentation, a baseline that treats each pathology independently; (ii) multi-class segmentation, mapping 70 disease combinations to unique classes (non-overlapping masks); and (iii) multi-label segmentation, which uses binary channels to explicitly model the coexistence of multiple diagnoses (overlapping masks). These strategies are evaluated across five state-of-the-art architectures and four loss functions, encompassing over 200 distinct training pipelines. Our results demonstrate that the proposed multi-label segmentation strategy achieves a superior trade-off between accuracy and computational efficiency, outperforming the costly binary class approach and establishing a practical guideline for future research.

Paper Nr: 180
Title:

PreVolE: A Robust Data-Driven Framework for Text-Guided Food Volume Estimation

Authors:

Umair Haroon, Ahmad AlMughrabi, Ricardo Marques and Petia Radeva

Abstract: Accurate food volume estimation is crucial for medical nutrition management and health monitoring. However, achieving precise 3D reconstruction is difficult due to noisy images that lead to inaccurate point clouds and geometries. To address these challenges, we introduce PreVolE, a robust, data-driven pipeline that includes a novel preprocessing stage to eliminate defocus-blurred and near-duplicate images, resulting in a clearer dataset. To diminish pose ambiguity and enhance the quality of pose estimation, we leverage deep feature extraction and matching within a hierarchical localisation framework to generate more reliable and comprehensive point clouds. Our framework utilises refined point clouds and text-guided segmentation for accurate 3D mesh reconstruction. Experiments show our framework outperforms state-of-the-art methods in reconstruction fidelity and volume accuracy, reducing MAPE from 2.82% to 2.52% on MTF (absolute improvement of 0.3%) and enhancing computational efficiency by minimising redundant data. Our framework strikes a balance between accuracy and efficiency, making it a promising tool for dietary assessment in real-world scenarios.

Paper Nr: 187
Title:

Decentralized Privacy-Preserving Federated Learning of Computer Vision Models on Edge Devices

Authors:

Damian Harenčák, Lukáš Gajdošech and Martin Madaras

Abstract: Collaborative training of a machine learning model comes with a risk of sharing sensitive or private data. Federated learning offers a way of collectively training a single global model without the need to share client data, by sharing only the updated parameters from each client’s local model. A central server then aggregates the parameters from all clients and redistributes the aggregated model back to them. Recent findings have shown that even in this scenario, private data can be reconstructed using only information about model parameters. Current mitigation efforts mainly focus on reducing privacy risks on the server side, assuming that other clients will not act maliciously. In this work, we analyzed various methods for improving the privacy of client data with respect to both the server and other clients, including homomorphic encryption, gradient compression, and gradient noising, and we discuss the possible use of modified federated learning systems such as split learning, swarm learning, and fully encrypted models. We have analyzed the negative effects of gradient compression and gradient noising on the accuracy of convolutional neural networks used for classification, and we have shown the difficulty of data reconstruction in the case of segmentation networks. We have also implemented a proof of concept on the NVIDIA Jetson TX2 module used in edge devices and simulated a federated learning process.
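Gradient noising, one of the mitigations analyzed above, can be sketched as follows. The norm clipping before noising is our own assumption, borrowed from differentially private training conventions; the abstract does not specify it, and the function name is hypothetical.

```python
import numpy as np

def noisy_clipped_update(grads, clip_norm=1.0, sigma=0.1, rng=None):
    """Clip a client's gradient to a bounded norm, then add Gaussian noise.

    This limits what the server (or other clients) can reconstruct from
    the shared update, at some cost in model accuracy.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = np.concatenate([g.ravel() for g in grads])
    norm = np.linalg.norm(flat)
    scale = min(1.0, clip_norm / (norm + 1e-12))   # shrink only if too large
    return [g * scale + rng.normal(0.0, sigma, g.shape) for g in grads]

# Example: a client perturbs its local gradients before sharing them
update = noisy_clipped_update([np.ones((4, 4))], sigma=0.05,
                              rng=np.random.default_rng(0))
```

The trade-off the paper studies appears here directly: larger `sigma` makes reconstruction attacks harder but degrades the aggregated model's accuracy.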

Paper Nr: 193
Title:

MExECON: Multi-View Extended Explicit Clothed Humans Optimized via Normal Integration

Authors:

Fulden Ece Uğur, Rafael Redondo, Albert Barreiro, Stefan Hristov and Roger Marí

Abstract: This work presents MExECON, a novel pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images. Building on the single-view method ECON, MExECON extends its capabilities to leverage multiple viewpoints, improving geometry and body pose estimation. At the core of the pipeline is the proposed Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model jointly across all input views, enforcing multi-view consistency. The optimized body model serves as a low-frequency prior that guides the surface reconstruction, where geometric details are added via normal map integration. MExECON integrates normal maps from both front and back views to accurately capture fine-grained surface details such as clothing folds and hairstyles. All multi-view gains are achieved without network re-training. Experimental results show that MExECON consistently improves fidelity over the single-view baseline and achieves competitive performance compared to modern few-shot 3D reconstruction methods.

Paper Nr: 210
Title:

From Captions to Queries: A VLM Pipeline for Queryable Metadata in Synthetic Environments

Authors:

Christopher May, Abhijith Raghavan Shaji, Jörg Franke and Sebastian Reitelshöfer

Abstract: The growing demand for detailed simulated environments in robotics has increased the need for automated methods to manage 3D assets. However, current Vision-Language Model (VLM) annotation methods are designed to generate descriptive text captions, primarily for training generative models. This output is insufficient for automated simulation workflows, which cannot query assets by specific, structured attributes like object class, size, or material. Here we present a fully automated pipeline that addresses this gap by generating both natural-language descriptions and this essential, structured, queryable metadata. Our pipeline uses standardized multi-view rendering and a multi-stage VLM process to extract and consolidate asset attributes. Evaluations on 1,000 Objaverse and AI-generated assets show that our pipeline's semantic descriptions are comparable to existing captioning-focused methods, while additionally extracting structured attributes with an overall accuracy of ~76%. By embedding this structured metadata in a vector database, our pipeline enables the hybrid, similarity-based, and attribute-filtered retrieval required for scalable robotics simulation.

Paper Nr: 225
Title:

Towards Understanding 3D Vision: The Role of Gaussian Curvature

Authors:

Sherlon Almeida da Silva, Davi Geiger, Luiz Velho and Moacir Antonelli Ponti

Abstract: Recent advances in computer vision have predominantly relied on data-driven approaches that leverage deep learning and large-scale datasets. Deep neural networks have achieved remarkable success in tasks such as stereo matching and monocular depth reconstruction. However, these methods lack explicit models of 3D geometry that can be directly analyzed, transferred across modalities, or systematically modified for controlled experimentation. We investigate the role of Gaussian curvature in 3D surface modeling. Besides being invariant under changes of observer or coordinate system, Gaussian curvature, as we demonstrate using the Middlebury stereo dataset, offers a sparse and compact description of 3D surfaces. Furthermore, we show a strong correlation between the performance rank of top state-of-the-art stereo and monocular methods and low total absolute Gaussian curvature. We propose that this property can serve as a geometric prior to improve future 3D reconstruction algorithms.
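For a depth map treated as a Monge patch z = f(x, y), the total absolute Gaussian curvature used above can be computed from first and second derivatives via K = (f_xx * f_yy - f_xy^2) / (1 + f_x^2 + f_y^2)^2. A minimal finite-difference sketch (the paper's exact computation may differ):

```python
import numpy as np

def total_abs_gaussian_curvature(z):
    """Total absolute Gaussian curvature of a surface z = f(x, y).

    Uses the Monge-patch formula
    K = (f_xx * f_yy - f_xy^2) / (1 + f_x^2 + f_y^2)^2
    with unit grid spacing and finite differences.
    """
    zy, zx = np.gradient(z)          # first derivatives (rows = y, cols = x)
    zxy, zxx = np.gradient(zx)       # d/dy and d/dx of zx
    zyy, _ = np.gradient(zy)
    k = (zxx * zyy - zxy ** 2) / (1.0 + zx ** 2 + zy ** 2) ** 2
    return np.abs(k).sum()

# Sphere patch of radius 100: analytic Gaussian curvature is 1/R^2 = 1e-4
i = np.arange(-20.0, 21.0)
x, y = np.meshgrid(i, i)
total = total_abs_gaussian_curvature(np.sqrt(100.0 ** 2 - x ** 2 - y ** 2))
```

A plane has zero Gaussian curvature everywhere, while the sphere patch accumulates roughly 1e-4 per grid point, illustrating how the quantity concentrates on genuinely curved regions.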

Paper Nr: 226
Title:

Adversarial Domain Adaptation for Penile Cancer Diagnosis in Histopathological Images

Authors:

Rick Eick Vieira dos Santos, Victor José Beltrão Almajano Martinez, Marcos Gabriel Mendes Lauande and Geraldo Braz Júnior

Abstract: This study presents a deep learning approach based on Adversarial Discriminative Domain Adaptation for the diagnosis of penile cancer in histopathological images. The proposed method addresses the challenges posed by small, imbalanced, and unlabeled datasets, particularly in the target domain represented by the PCPAm dataset. Leveraging a source dataset (LC25000) and transfer learning via DenseNet-201, the model undergoes supervised pre-training followed by adversarial domain adaptation to align latent feature distributions across domains. A 10-fold cross-validation scheme was employed to ensure statistical robustness, and data augmentation was tailored to each domain to enhance generalization. Experimental results demonstrate competitive performance, with the best configuration achieving an F1-score of 96.8% and accuracy of 96.9%, matching state-of-the-art models in the literature. These findings confirm the method’s effectiveness in reducing domain shift and improving classification reliability in medical imaging tasks with limited sample sizes. This work highlights the potential of adversarial adaptation strategies in clinical applications and opens avenues for future research into new architectures and real-world deployment.

Paper Nr: 229
Title:

Estimating the 3D Position of Hidden Humans Using Reflections on Vehicle Bodies

Authors:

Hiroto Kozawa, Fumihiko Sakaue and Jun Sato

Abstract: This paper proposes a method for detecting human bodies located in the driver’s blind spots on the road and estimating their 3D positions in the scene. When driving a vehicle on the road, it is often possible to observe the 3D scene reflected on the bodies of other vehicles. These reflections contain information about blind spot areas that cannot be directly seen by the driver. We detect hidden human bodies reflected on other vehicles and estimate their 3D positions on the road. Since the vehicle body is not a flat surface and distortions occur in the reflected human body, it is not easy to estimate the 3D position of a human from its reflection on the vehicle body. We therefore propose a method that combines deep learning with geometric constraints on reflection to estimate the 3D position of hidden human bodies. We test our method using images from real driving environments and demonstrate its effectiveness.

Paper Nr: 230
Title:

3D Reconstruction of Hidden Objects from Simultaneous Recovery of Light Source and Environment

Authors:

Yuma Matsubara, Fumihiko Sakaue and Jun Sato

Abstract: This paper proposes a method to reconstruct a set of 3D light sources located in a hidden position from infrared images observed through reflective surfaces, such as walls. Observing objects that cannot be directly seen is called Non-Line-of-Sight (NLoS) imaging. Existing NLoS reconstruction methods are active: they illuminate the object with laser light via a wall and reconstruct the object from the reflected light obtained through that wall. However, these methods require expensive systems that perform laser scanning. In this research, we instead exploit the fact that objects generally emit far-infrared light and reconstruct hidden objects, such as humans, by observing infrared light from the objects reflected off surfaces such as walls. The proposed method achieves accurate light source reconstruction by performing self-supervised learning between the observed image and the reprojection image, simultaneously recovering 3D luminous objects and scene parameters. Furthermore, we combine self-supervised learning using real images and supervised learning using synthetic images to achieve accurate reconstruction.

Paper Nr: 231
Title:

Binocular Visual Distortion Correction on Display

Authors:

Yan Yingxin, Fumihiko Sakaue and Jun Sato

Abstract: This paper proposes an image display method that allows users to see clear images with their naked eyes without wearing glasses or contact lenses. While some techniques for correcting vision on the display side have been proposed previously, they only address mild refractive distortions and assume that both eyes have the same visual characteristics. Therefore, existing methods cannot provide sufficient vision correction for observers with strong refractive distortions, such as astigmatism, or those with significantly different visual characteristics between the left and right eyes. Thus, we propose a vision correction method on the display side that accommodates visual systems with different higher-order visual distortions in the left and right eyes. For this objective, higher-order refractive errors such as astigmatism are represented using Zernike polynomials, and binocular fusion is modeled for cases where the visual characteristics of the left and right eyes differ. By using this vision simulation model, we achieve optimal image display for observers with complex visual distortions.

Paper Nr: 241
Title:

GGD-Tex: Geometry-Guided Diffusion for 3D Texture Synthesis from Single RGB Image

Authors:

Jit Chatterjee, Gijs Fiten and Maria Torres Vega

Abstract: Traditional texture generation for 3D content requires multi-view capture or manual artistry, which are both expensive and time-consuming. Thus, using a single RGB image to generate high-quality textures for 3D objects could simplify the texture creation task. However, this process involves generating unseen regions while preserving both geometric and semantic consistency. Existing methods struggle with incomplete texture coverage, geometric inconsistencies, and semantic incoherence when transferring appearance from 2D images to 3D surfaces. To address these challenges, we present GGD-Tex, a novel geometry-guided diffusion framework. It comprises two stages. First, it establishes the 2D-3D correspondence by passing the RGB image through dual-branch networks for mask and depth estimation. These and UV coordinates from the input mesh generate an incomplete texture map. Second, a stable diffusion model generates the complete texture using our multi-modal cross-attention conditioning comprising image and geometric features. This enables spatially-adaptive feature fusion, allowing texture regions to selectively attend to relevant semantic, geometric, and visual information. Experiments demonstrate that our method generates high-quality textures with superior geometric alignment and visual fidelity compared to the state-of-the-art methods. In fact, GGD-Tex outperforms the leading state-of-the-art method by 9.38 points in the Fréchet Inception Distance (FID) score.

Paper Nr: 252
Title:

Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval

Authors:

Lucas Iijima, Nikolaos Giakoumoglou and Tania Stathaki

Abstract: Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on methods that require training or fine-tuning on target datasets, often struggling with substantial domain gaps and limited generalization to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method’s effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.

Paper Nr: 253
Title:

Automatic Seal Quality Inspection Using Deep Learning in Mono Material Flexible Packaging

Authors:

Tajmout Ilyas, Hamid Ladjal, Fayez Shakil Ahmed, Hammouri Hassan and Roumanet Frédéric

Abstract: The automated detection of seal flaws with thermal imaging represents one of the major problems in the manufacturing process. This paper suggests a thermal imaging based deep learning method of detecting sealing faults of mono-material flexible packaging. We compare multiple pre-trained Convolutional Neural Networks to detect more accurately and run faster. Three experimental tests are carried out to enhance the classification process. Another way of reducing model size and computational load with pruning and quantization is required to run it on edge devices. With high-performance NVIDIA GPUs, the adjusted models have an accuracy of 98.7%, precision of 80.69%, and recall of 88.89%. This method performs effectively in real-time seal defect identification and may be applied in packaging quality checks in the industrial line. The model requires tuning to operate more effectively and adapt to a variety of production circumstances and packaging.

Paper Nr: 256
Title:

Detecting 3D Line Segments for 6DoF Pose Estimation with Limited Data

Authors:

Matej Mok, Lukáš Gajdošech, Michal Mesároš, Martin Madaras and Viktor Kocur

Abstract: The task of 6DoF object pose estimation is one of the fundamental problems of 3D vision with many practical applications such as industrial automation. Traditional deep learning approaches for this task often require extensive training data or CAD models, limiting their application in real-world industrial settings where data is scarce and object instances vary. We propose a novel method for 6DoF pose estimation focused specifically on bins used in industrial settings. We exploit the cuboid geometry of bins by first detecting intermediate 3D line segments corresponding to their top edges. Our approach extends the 2D line segment detection network LeTR to operate on structured point cloud data. The detected 3D line segments are then processed using a simple geometric procedure to robustly determine the bin’s 6DoF pose. To evaluate our method, we extend an existing dataset with a newly collected and annotated dataset, which we make publicly available. We show that incorporating synthetic training data significantly improves pose estimation accuracy on real scans. Moreover, we show that our method significantly outperforms current state-of-the-art 6DoF pose estimation methods in terms of pose accuracy (3 cm translation error, 8.2° rotation error) while not requiring instance-specific CAD models during inference.
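The translation and rotation errors quoted in this abstract are the standard 6DoF pose metrics; an editor's sketch of how such errors are typically computed (not the authors' code) is:

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Standard 6DoF pose metrics: translation error (same units as t) and
    rotation error in degrees, i.e. the angle of the relative rotation
    R_gt^T @ R_est recovered from its trace."""
    t_err = np.linalg.norm(t_est - t_gt)
    cos_a = np.clip((np.trace(R_gt.T @ R_est) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.degrees(np.arccos(cos_a))

# Example: a 10-degree rotation about z and a 3 cm translation offset
a = np.radians(10.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
t_err, r_err = pose_errors(Rz, np.array([0.03, 0.0, 0.0]), np.eye(3), np.zeros(3))
```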

Paper Nr: 272
Title:

PDNet: Exploring Pruning and Distillation for Real-Time Worker Activity Recognition in Industrial Assembly

Authors:

Malek Baba, Thinesh Thiyakesan Ponbagavathi and Alina Roitberg

Abstract: In this work, we address real-time fine-grained human activity recognition (HAR) in industrial assembly and introduce PDNet, a new framework that jointly exploits structured pruning and knowledge distillation for 3D convolutional networks. We start by training a 3D-CNN backbone in a supervised manner. Then, PDNet applies channel-wise structured pruning to obtain a more compact network. In the third step, the pruned network is optimized using knowledge distillation from the large, unpruned teacher to recover the accuracy lost due to pruning. On the challenging MECCANO dataset with an X3D-M backbone, the pruned model without distillation reaches a Top-1 accuracy of 25.54%, while PDNet raises this to 36.49%, with a model size of only 11.82 MB and 3.98 GFLOPs. These results show that distillation within PDNet is highly effective in restoring pruned model performance and highlight its strong potential for resource-constrained HAR in industrial assembly.
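The teacher-to-student distillation step described here is commonly implemented as a temperature-softened KL term blended with the ordinary cross-entropy; a minimal sketch of that standard (Hinton-style) loss, not PDNet's actual implementation, is:

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend KL(teacher || student) on temperature-softened probabilities
    with cross-entropy on the hard labels; T**2 rescales the KL gradient."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * T**2 * kl + (1.0 - alpha) * ce))
```

When student and teacher logits agree, the KL term vanishes and only the supervised term remains.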

Paper Nr: 273
Title:

POM-NeRF: Perception-Oriented Modification of Neural Radiance Fields for Enhanced 3D Volumetric Rendering from Multi-View Images

Authors:

Joy Datta, Rawhatur Rabbi, Md. Shaleh Islam Tonmoy, Puja Saha and Chad Mourning

Abstract: Neural Radiance Fields (NeRF) provide an efficient framework for 3D volumetric rendering from multi-view images. The paper introduces POM-NeRF, a perception-oriented modification that incorporates residual connections, channel-wise feature recalibration, and Mish activation instead of ReLU into the standard NeRF architecture. This proposed architecture improves the quality of radiance representations and renders high-fidelity 3D scenes without significantly increasing computational cost. Across all eight scenes of the NeRF synthetic dataset, POM-NeRF consistently outperforms the baseline, yielding PSNR gains ranging from approximately 4 dB to over 12 dB. For example, performance on the Hotdog scene improves from 24.59 dB to 36.84 dB, and on the Mic scene from 27.63 dB to 37.95 dB. SSIM scores similarly rise by 0.08 to 0.18, with notable improvements such as 0.82 to 0.95 on the Lego scene and 0.74 to 0.93 on the Drums scene, demonstrating substantially enhanced perceptual fidelity. These results show that targeted architectural refinement can significantly boost reconstruction quality without materially increasing model complexity.
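The Mish activation that POM-NeRF substitutes for ReLU has a simple closed form; a one-function sketch (standard definition, not the paper's code) is:

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)) - a smooth, non-monotonic
    alternative to ReLU; softplus computed via logaddexp for stability."""
    return x * np.tanh(np.logaddexp(0.0, x))
```

For large positive inputs Mish approaches the identity (like ReLU), while for negative inputs it decays smoothly to zero instead of clipping hard.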

Paper Nr: 277
Title:

Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo Vision

Authors:

Clemens Seibold, Anna Hilsmann and Peter Eisert

Abstract: Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
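The refraction-aware constraints mentioned above build on Snell's law applied at each air-glass-water interface; a generic vector-form sketch of a single refraction event (an illustration, not the authors' pipeline) is:

```python
import numpy as np

def refract(d, n, eta1, eta2):
    """Vector form of Snell's law: refract unit direction d at an interface
    with unit normal n (pointing toward the incident side), going from
    refractive index eta1 into eta2. Returns None on total internal reflection."""
    r = eta1 / eta2
    cos_i = -np.dot(n, d)
    sin2_t = r**2 * (1.0 - cos_i**2)
    if sin2_t > 1.0:
        return None  # total internal reflection
    return r * d + (r * cos_i - np.sqrt(1.0 - sin2_t)) * n

# A ray hitting a horizontal air-water interface at 45 degrees
d = np.array([np.sin(np.radians(45.0)), 0.0, np.cos(np.radians(45.0))])
out = refract(d, np.array([0.0, 0.0, -1.0]), 1.0, 1.33)
```

Chaining two such events (air to glass, then glass to water) bends each camera ray before triangulation.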

Paper Nr: 284
Title:

Simultaneous Reconstruction of Scene Information and Boundary Surface Geometry from Images Captured through Dynamic Refractive Interfaces

Authors:

Mao Okamura, Fumihiko Sakaue and Jun Sato

Abstract: In this study, we propose a method to simultaneously reconstruct the NeRF of a scene behind a refractive surface and the shape of the refractive surface itself from a sequence of distorted images captured through dynamically changing surfaces such as water. To achieve this, we introduce a shape model capable of representing dynamic boundary surfaces. By performing ray tracing on the captured images using this model, it becomes possible to render accurate images even in scenes involving refraction. Furthermore, by optimizing the rendered images to match the input images, both the boundary surface shape and the NeRF of the target scene can be reconstructed simultaneously.

Paper Nr: 286
Title:

Light Field Representation Using a DLP Projector and Focus Tunable Lens

Authors:

Kizuku Miyazawa, Fumihiko Sakaue and Jun Sato

Abstract: This study aims to realize a more convenient and natural three-dimensional (3D) image presentation technique by investigating a method that employs a high-speed DLP projector and a focus-tunable lens (FTL). Although various 3D devices such as VR systems are widely used, achieving a more natural sense of depth requires a volumetric representation that directly presents the distribution of light rays in space. Therefore, this study focuses on the fact that the observed image depends on the presented light field and explores a method to temporally generate arbitrary light fields using a set of binary images. This approach enables the reproduction of volumetric representations using an FTL and a DLP projector.

Paper Nr: 287
Title:

Monocular Image-Based 3D Seated Posture Estimation Integrated with a Seat-Angle Adjustable Smart Chair

Authors:

Hitoshi Shimomae and Noriko Takemura

Abstract: This paper proposes a monocular image-based 3D seated posture estimation method, integrates it into a smart chair with adjustable seat angle, and evaluates the system. First, an SMPL-based 3D body model is estimated from side RGB images using HMR2, and 3D skeleton keypoints excluding limb joints (SMPL NoLimbs) are extracted. Subsequently, a graph convolutional network classifies the seated posture into three classes (good, slightly poor, poor). Experiments using a large-scale dataset comprising 30 participants demonstrated that the proposed SMPL NoLimbs configuration outperformed silhouette images, 2D keypoints (YOLO), and 3D keypoints (VideoPose3D), achieving a macro F1 score of 0.595 for the three-class classification. Furthermore, we integrated the proposed method into a smart chair with a seat surface tiltable in multiple directions and angles, constructing and evaluating an integrated recognition-intervention seated support system. An evaluation experiment with 10 subjects demonstrated that the proposed system improved seated posture for many users and contributed to reducing pain and fatigue.

Paper Nr: 310
Title:

Fasciculation Detection in Ultrasound Images for ALS Diagnosis Based on Black Box Variational Inference

Authors:

Turrnum Shahzadi, Junfeng Zhou, Kota Bokuda and Norio Tagawa

Abstract: To assess the characteristics of fasciculation, which causes irregular muscle contractions that are clinically significant for the diagnosis of amyotrophic lateral sclerosis (ALS), this study introduces a novel variational inference framework for fasciculation detection in muscle ultrasound images, where accurate estimation of the rotation center is crucial. Black-box variational inference (BBVI) is combined with Green’s theorem and the reparameterization trick to enable robust, probabilistic estimation of rotational motion. Green’s theorem is first leveraged to obtain reliable initial estimates of the rotation center, which are further refined through variational inference to model fasciculation motion as posterior distributions. By directly applying BBVI to finite, nonlinear motion, the proposed approach overcomes the limitations of traditional linearized techniques and enables uncertainty-aware inference under complex motion dynamics. Automatic differentiation is used for efficient, low-variance gradient estimation compared to REINFORCE-based methods, and a hierarchical multi-resolution strategy is adopted to avoid local optima and handle large motion. Experimental results demonstrate that, compared with our previous optical-flow-based approach, in which treating muscle motion as infinitesimal introduces measurement errors, the proposed method yields reliable fasciculation detection.
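The reparameterization trick contrasted with REINFORCE in this abstract amounts to sampling through a deterministic, differentiable transform of parameter-free noise; a minimal sketch (generic, not the authors' code) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_sample(mu, log_sigma, n):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).
    Because eps carries no parameters, gradients of expectations over z
    flow through mu and log_sigma with far lower variance than REINFORCE."""
    eps = rng.standard_normal(n)
    return mu + np.exp(log_sigma) * eps
```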

Paper Nr: 311
Title:

Neural BRDF Decomposition for Consistent Material Recovery in Inverse Rendering

Authors:

Anant Kumar and Parag Chaudhuri

Abstract: Current methods for material recovery from multi-view images yield inconsistent results for the same object when captured under different lighting, leading to erroneous novel rendering. We introduce a method to consistently recover Spatially Varying Bidirectional Reflectance Distribution Functions (SVBRDF) across multiple lighting conditions, given object geometry. Our approach uses pairs of Reflectance (RF) and Ambient Occlusion (AO) neural networks to decompose the BRDF. RF networks encode the BRDF’s parameter-dependent part, while AO networks encode the light direction-dependent part (precomputing geometry-dependent light transport). The RF networks are optimized across multiple lighting inputs. Using synthetic and real multi-lighting datasets, our method demonstrates superior consistency against baseline state-of-the-art methods and matches the best novel view rendering when only single-lighting data is available.

Paper Nr: 325
Title:

Face2Cam: A User-to-Webcam Distance Estimation Image Dataset

Authors:

Yixuan Zhai and Arryn Robbins

Abstract: Estimating the distance between a user and their device’s camera is important for tasks such as gaze-aware interaction, adaptive interface scaling, and accessibility tools. Specialized depth sensors are effective but rarely available in everyday computing devices; by contrast, webcams are ubiquitous yet lack native distance information. To advance accessible distance-aware computing, we present Face2Cam, an openly available dataset of webcam images annotated with user-to-camera distances. Thirty-six adults were photographed under controlled variation of head orientation, gaze direction, and illumination, yielding 3,240 labeled images spanning distances from 30 to 80 cm. As a baseline, we trained a lightweight convolutional neural network to classify images into discrete distance intervals, achieving promising accuracy. These results demonstrate both the feasibility of webcam-only distance estimation and the value of Face2Cam as a resource for training and benchmarking distance-aware models, while leaving room for future architectural improvements.

Paper Nr: 334
Title:

Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Authors:

Giuseppe Lando, Rosario Forte and Antonino Furnari

Abstract: We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants; hence, we investigate deployment on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.

Paper Nr: 344
Title:

A Learnable 3D Perlin-Noise Representation for Image-Domain 3D Reconstruction

Authors:

Yuchen Liu, Eiji Kamioka and Phan Xuan Tan

Abstract: We propose 3DPN, a learnable 3D scene representation based on Perlin Noise for image-domain 3D reconstruction. 3DPN models scenes as a spatially continuous volumetric field defined by optimizable gradient vectors on a regular lattice, which is rendered via differentiable volumetric rendering. We introduce a complete optimization pipeline, including automatic scene initialization from camera poses and a monotonic, gapless indexing scheme that supports both bounded and unbounded scene configurations. To improve reconstruction capacity while maintaining a compact parameterization, we adopt a mixed-resolution formulation that combines multiple noise frequencies within a single volume. Owing to its analytic structure and low parameter count, 3DPN enables efficient rendering at decent frame rates. Experiments on public benchmarks and real-world captures show that 3DPN is particularly effective for compact scenes dominated by a single object, while the unbounded configuration is better suited for larger-scale scenes with extended spatial structure.
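The gradient-on-a-lattice idea underlying Perlin noise can be illustrated in 1D (an editor's sketch of classic Perlin-style noise, not the 3DPN representation itself):

```python
import numpy as np

def perlin_1d(x, grads):
    """1D Perlin-style gradient noise: blend the contributions of the two
    nearest lattice gradients with the quintic fade 6t^5 - 15t^4 + 10t^3."""
    i = int(np.floor(x))
    t = x - i
    fade = t**3 * (t * (6.0 * t - 15.0) + 10.0)
    n0 = grads[i] * t              # contribution of the left lattice gradient
    n1 = grads[i + 1] * (t - 1.0)  # contribution of the right lattice gradient
    return n0 + (n1 - n0) * fade
```

The noise vanishes exactly at lattice points and varies smoothly in between; in 3DPN the lattice gradients are the optimizable parameters.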

Paper Nr: 15
Title:

Early Detection of Colour Blindness Using Video Games

Authors:

James O'Connell and Kenneth Dawson-Howe

Abstract: This paper presents research into the detection of colour blindness in children and young adults through the use of modified video games. Existing gamification in this area has primarily considered very young children (under 7) and so more advanced games are necessary for the attention of older children and young adults. Initial testing with 50 participants shows significant potential for the detection of colour blindness using adapted retro games.

Paper Nr: 18
Title:

A Simple and Efficient 3D Gaussians Storage Format

Authors:

Denis Chaplygin, Juho Kannala and Esa Rahtu

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled photorealistic real-time rendering of complex scenes. However, the explicit storage of millions of Gaussian primitives leads to gigabyte-scale models, which hinders sharing, transmission, and deployment on resource-constrained devices. We propose a simple yet effective storage format tailored to 3DGS that exploits spatial redundancy through property-wise organization, delta coding, and lightweight quantization. Unlike generic compression schemes or learning-based methods, our approach requires no scene-specific optimization and introduces negligible computational overhead. Experiments on standard 3DGS benchmarks show that our format achieves 50–60% storage reduction while preserving rendering quality (SSIM ≥ 0.88, PSNR within 1–2 dB of the original). Compared to generic tools such as gzip, our method achieves substantially higher compression while remaining compatible with advanced quantization strategies. This makes it a practical solution for 3D applications such as VR/AR, gaming, and digital twins, where storage efficiency and fast loading are critical. The implementation is publicly available at https://github.com/akashihi/splatpb.
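Delta coding plus lightweight quantization of a property channel can be sketched generically as follows (an editor's illustration of the general idea, not the authors' format; function names and the 12-bit default are assumptions):

```python
import numpy as np

def encode_axis(values, n_bits=12):
    """Sort one property channel, delta-code it, and quantize the deltas.

    Sorting makes neighbouring values similar, so the non-negative deltas
    are small, low-entropy integers that compress well downstream."""
    order = np.argsort(values)
    sorted_vals = values[order]
    deltas = np.diff(sorted_vals, prepend=sorted_vals[0])  # first delta is 0
    scale = max(deltas.max(), 1e-12) / (2**n_bits - 1)
    q = np.round(deltas / scale).astype(np.uint16)
    return q, scale, sorted_vals[0], order

def decode_axis(q, scale, first, order):
    """Invert the delta coding and restore the original ordering."""
    sorted_vals = first + np.cumsum(q.astype(np.float64) * scale)
    out = np.empty_like(sorted_vals)
    out[order] = sorted_vals
    return out
```

A real format would also store the permutation compactly (e.g. via a space-filling-curve ordering) rather than as raw indices.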

Paper Nr: 24
Title:

PairAnom: A Benchmark Dataset for Pairwise Fine-Grained Multi-Class Video Anomaly Detection

Authors:

Brajesh Kumar, Kamakshya Prasad Nayak, Kamalakar Vijay Thakare, Priti Pal, Manvendra Singh and Debi Prosad Dogra

Abstract: Video anomaly detection (VAD) in dynamic environmental conditions can be highly challenging. In particular, when normal and anomalous events become indistinguishable, VAD methods do not perform as expected. Moreover, existing VAD datasets are typically organized in terms of abnormal classes without any normal-abnormal pairing. We argue that pairing anomalous events with closely matching non-anomalous events is more effective when training classifiers. To accomplish this, we introduce and release PairAnom, a new VAD dataset comprising 2,132 real-world videos across eight highly important and fine-grained categories. The dataset contains six pairwise anomalous events, e.g., violent protest, firework festivity, chain/purse snatching, friendly mixing, etc. It addresses a critical gap in existing datasets and methods, which do not consistently distinguish critical anomalies from normal events exhibiting similar visual features. We also present a new baseline that performs multi-class abnormality classification by leveraging two-stream I3D features combined via PCA and transformer frameworks. We have used PairAnom in binary as well as multi-class classification settings for evaluation and benchmarked it against popular VAD methods (using binary settings) including DyAnNet (90.83%), RareAnom (87.31%), MIST (94.55%), OE-CTST (85.37%), REWARD (98.87%), etc. The baseline outperforms existing methods with an AUC of 99.24%, even with a full eight-class classification setup. Overall, PairAnom and the proposed baseline together establish the first benchmark that distinguishes between similar-looking normal and abnormal events with multi-class anomaly classification capability. It paves the way for the development and evaluation of robust VAD techniques in challenging scenarios. The dataset can be found at PairAnom.

Paper Nr: 57
Title:

GenPlan: Generation of Vector Residential Plans Based on Textual Description

Authors:

Egor Bazhenov, Stepan Kasai, Viacheslav Shalamov and Valeria Efimova

Abstract: Computer graphics is fundamental in science, industry, and digital communications. It consists mainly of vector and raster components. Raster graphics, characterized by pixel-based representations, are easy to obtain and edit, but cannot be scaled losslessly. Conversely, vector graphics use mathematical structures to define shapes and lines, rendering them scalable without resolution degradation, but they are often more complex to create and edit. For designers and architects, vector graphics are more versatile and convenient, although they require more time and computational resources to create. In this paper, we present GenPlan, a novel method for generating vector residential plans from textual descriptions. GenPlan surpasses all existing solutions by about 5% in visual quality according to the CLIPScore metric, handles right angles correctly, and offers flexible usage settings. We also present a new residential plan vectorization algorithm, which creates structured vector images from a given raster plan.

Paper Nr: 61
Title:

Talking Motion Signatures for Identity Recognition

Authors:

Ivan Samarskyi and Jan Čech

Abstract: We propose identifying individuals using only facial motion patterns during speech, independent of visual appearance. From short video clips, we extract temporal sequences of Action Units and head poses, and learn a discriminative embedding, the Talking Motion Signature (TMS), with a ResNet50 backbone. We compare a standard softmax objective to ArcFace, finding that ArcFace yields more compact, better-separated identity clusters and stronger robustness across changes in compression, resolution, and clip duration. Against baseline handcrafted correlation-based TMSs, our learning-based approach improves accuracy. We validate on two complementary datasets: a curated VoxCeleb2 subset (636 identities, 20 video clips each) and a custom TalkingCelebs set (5 identities, 100 video clips each). The model attains up to 95% Top-1 accuracy on 5 identities and 25% Top-1 (34% Top-3) on 636 identities. Proof-of-concept applications include deepfake detection and impersonation quality assessment, indicating that motion-based identity cues can complement appearance-based media forensics.
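The ArcFace objective compared against softmax here adds an additive angular margin to the target-class angle; a standard sketch of that logit computation (the published formulation, not the authors' training code) is:

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=30.0, m=0.5):
    """ArcFace: add angular margin m to the target-class angle, scale by s.

    embeddings: (N, D), weights: (C, D); both are L2-normalized so the
    logits are cosines of angles between embeddings and class centers."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    target = cos.copy()
    rows = np.arange(len(labels))
    target[rows, labels] = np.cos(theta[rows, labels] + m)
    return s * target
```

Penalizing the target angle this way forces identity clusters apart on the hypersphere, which matches the compactness the abstract reports.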

Paper Nr: 70
Title:

Inpainting-Driven Depth Completion for Safe and Adaptive Collaborative Robotics

Authors:

Ambra Vandone, Andrea Quattrini, Carlotta Pizzolato and Anna Valente

Abstract: Collaborative robotic environments require fluid interaction between robots and human operators working simultaneously. Central to this interaction is the ability of perception systems to continuously estimate the operator’s position, gesture, and cognitive load, and to ensure smooth adaptation of robot motion and planned tasks. Speed and Separation Monitoring (SSM) provides a strategy to regulate robot behaviour according to human proximity, but its effectiveness relies on the continuous availability of complete perception data. Thus, we propose a novel approach that integrates inpainting methods to recover a complete human position when occlusions are caused by the robot structure itself moving in the scene, leveraging RGB-D cameras that cover the common workspace from multiple viewpoints. Our algorithm predicts the position of missing body parts by processing RGB frames with a deep-learning model to extract human and robot regions, while occluded areas are reconstructed using inpainting techniques. Once a complete 2D representation is recovered, a monocular depth estimation model infers the overall depth map up to a scale factor. An innovative depth rescaling follows, leveraging ToF sensor measurements to obtain physically consistent metric depth values. This process enables a more faithful human-robot distance computation and allows the robot to adapt its velocity without compromising productivity. Experiments in a collaborative robotic cell confirm that the method supports seamless interaction.
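Rescaling an up-to-scale monocular depth map against sparse metric measurements can be sketched with a single robust global scale (a generic illustration under that single-scale assumption, not the authors' rescaling scheme, which may be richer):

```python
import numpy as np

def rescale_depth(rel_depth, tof_xy, tof_depth):
    """Lift an up-to-scale monocular depth map to metric depth.

    tof_xy: (M, 2) integer pixel coordinates (x, y) of sparse ToF returns;
    tof_depth: (M,) metric depths at those pixels. The global scale is the
    median ratio, which is robust to outlier returns."""
    rel = rel_depth[tof_xy[:, 1], tof_xy[:, 0]]
    scale = np.median(tof_depth / rel)
    return scale * rel_depth
```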

Paper Nr: 95
Title:

Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis

Authors:

Akshat Pandya and Bhavuk Jain

Abstract: The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D vision has spurred significant research in extending these architectures to the complex domain of 3D analysis. Yet, a core challenge arises from a fundamental dichotomy between the regular, dense grids of 2D images and the irregular, sparse nature of 3D data such as point clouds and meshes. This survey provides a comprehensive review and a unified taxonomy of adaptation strategies that bridge this gap, classifying them into three families: (1) Data-centric methods that project 3D data into 2D formats to leverage off-the-shelf 2D models, (2) Architecture-centric methods that design intrinsic 3D networks, and (3) Hybrid methods, which synergistically combine the two modeling paradigms to benefit from both rich visual priors of large 2D datasets and explicit geometric reasoning of 3D models. Through this framework, we qualitatively analyze the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. We discuss key open challenges and outline promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning (SSL) for geometric data, and the deeper integration of multi-modal signals.

Paper Nr: 98
Title:

Hierarchical Generation of Robot Programs Based on the Integration of Visual Affordance Recognition and Language Model

Authors:

Kazuki Yamada, Iori Kokabu, Fumiya Murase, Kotaro Kato, Koki Maruyama, Shuichi Akizuki and Manabu Hashimoto

Abstract: We propose a method for the automatic generation of executable robot motion programs from natural language task instructions. The proposed approach hierarchically applies a large language model (LLM) through three stages. In the first stage, the LLM is applied to the task instruction together with affordance information recognized from the scene images, such as grasping or insertion of parts, in order to predict the scene after task completion. In the second stage, the LLM estimates the motion plan required to transition to the predicted scene. In the final stage, the LLM uses this plan to generate a Python-based robot motion program that is executable on a real robot. In experiments with a real robot, the system generated robot motion programs of approximately 150 lines in response to four types of task instructions, such as “Insert the bolt into the screw hole” and “Put the white block into the white box”, and achieved an average task success rate of 68%.

Paper Nr: 103
Title:

Robust Optical Flow Computation: A Higher-Order Differential Approach

Authors:

Chanuka Algama and Kasun Amarasinghe

Abstract: In the domain of computer vision, optical flow stands as a cornerstone for unraveling dynamic visual scenes. However, accurately estimating optical flow under large nonlinear motion patterns remains an open problem. The image flow constraint is vulnerable to rapid spatial transformations, while inaccurate approximations inherent in numerical differentiation techniques can further amplify such intricacies. In response, this research proposes an innovative algorithm for optical flow computation that incorporates a second-order Taylor series approximation of the brightness constancy constraint, augmented with a perturbation-theoretic correction scheme within a differential estimation framework. By embracing this mathematical underpinning, the research seeks to extract more information about the behavior of the function under complex real-world scenarios and to estimate the motion of areas lacking texture. Although deep learning-based optical flow computation models have shown impressive results, their high resource demands, large training datasets, and black-box nature limit their interpretability and applicability in real-world scenarios, especially in cases not well-represented in the training data. The capabilities of the proposed algorithm are demonstrated by its performance on renowned optical flow benchmarks such as KITTI (2015) and Middlebury. The average endpoint error (AEE), which computes the Euclidean distance between the calculated flow field and the ground-truth flow field, is notably reduced, validating the effectiveness of the algorithm in handling complex motion patterns.
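The benefit of the second-order Taylor term over the usual first-order brightness-constancy linearization can be illustrated in 1D (an editor's sketch of the approximation idea, not the proposed algorithm):

```python
import numpy as np

def taylor_shift_errors(f, df, d2f, x, u):
    """Predict f(x + u) with first- and second-order Taylor expansions
    around x and return the absolute error of each prediction."""
    true = f(x + u)
    first = f(x) + u * df(x)                   # classic linearized constraint
    second = first + 0.5 * u**2 * d2f(x)       # added second-order term
    return abs(true - first), abs(true - second)

# A smooth 1D intensity profile I(x) = sin(x), displaced by u = 0.4
e1, e2 = taylor_shift_errors(np.sin, np.cos, lambda x: -np.sin(x), x=0.3, u=0.4)
```

For displacements that are not small, the second-order prediction tracks the true shifted intensity more closely, which is precisely the regime of large motion the abstract targets.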

Paper Nr: 108
Title:

Development of an Intelligent AI-Powered Cooking Assistant

Authors:

Manel Hentati, Raida Hentati, Ines Chaabane and Rim Ben Salah

Abstract: As individuals age, they often face a variety of health, cognitive, and mobility challenges that can significantly impact their well-being, independence, and overall quality of life. Daily tasks such as cooking may become increasingly difficult due to physical limitations, memory decline, or reduced sensory abilities. To address these challenges, this research presents an intelligent AI-powered cooking assistant designed to seamlessly integrate into kitchen environments, providing natural multimodal interaction tailored to the needs of older adults. The system, deployed on a Raspberry Pi 4, offers step-by-step recipe guidance, ingredient substitution suggestions, and real-time monitoring to enhance safety and usability. It is based on a YOLOv8s model, pre-trained and fine-tuned on a custom kitchen-specific dataset to ensure accurate detection and identification of a wide range of kitchen objects. To enable an intuitive and hands-free experience, the assistant incorporates Speech Recognition and Text-to-Speech technologies, allowing smooth voice-based interaction. Experimental evaluations demonstrate the effectiveness of the system, showing high accuracy in object recognition and robust performance in real-world kitchen environments. By promoting autonomy and reducing potential safety risks, this AI-driven solution has strong potential to support independent living for older adults, ensuring both safety and convenience in everyday cooking tasks.

Paper Nr: 113
Title:

Enhancing QDTrack with Self-Attention in Autonomous Driving Environments

Authors:

Diego Gragnaniello, Antonio Greco, Antonio Parziale and Mario Vento

Abstract: Reliable and safe navigation of self-driving cars requires multi-object tracking algorithms to estimate the trajectories of moving objects on the road. The performance of tracking algorithms can be improved by optimizing each component of the detector-tracker pipeline. A valuable method to improve detectors is exploiting attention mechanisms, which imitate how humans find salient regions in a scene. In this paper, we have integrated self-attention mechanisms into Faster R-CNN, the detector included in QDTrack, a state-of-the-art tracker that follows the tracking-by-detection paradigm. We have evaluated the performance of the enhanced multi-object tracking system on the BDD100K dataset. Results show that integrating attention mechanisms into the detector improves QDTrack tracking performance, particularly in terms of mMOTA, at the cost of increased inference time and model complexity. The results highlight an explicit accuracy–efficiency trade-off.

Paper Nr: 126
Title:

Semantic Segmentation of LiDAR Point Clouds: A Hybrid Approach Combining Geometric Processing with Deep Learning

Authors:

Matheus Leonel de Andrade, Beatriz Pinheiro de Lemos Lopes, Savio Salvarino Teles De Oliveira and Lucas Araujo Pereira

Abstract: This work investigates whether massive dimensionality reduction through geometric clustering preserves sufficient semantic information for accurate point cloud classification. The proposed pipeline combines classical geometric processing with deep learning: ground segmentation via Ground Plane Fitting (GPF), object clustering through Scan Line Run (SLR), extraction of 24-dimensional geometric feature vectors, and semantic classification using a modified PointNet architecture. Experimental validation on SemanticKITTI demonstrates that reducing point clouds from approximately 121,000 points to 1,400 clusters (98% reduction) preserves 94% semantic consistency and enables 93% classification accuracy. These results validate that geometric clustering preserves essential semantic information for effective classification, demonstrating that strategic integration of traditional geometric methods with deep learning constitutes a viable solution for efficient 3D perception.

Paper Nr: 141
Title:

Transformer-Based Framework for 3D Human Pose Estimation Using YOLO Backbone

Authors:

Miguel F. Lima, Ana Filipa Rodrigues Nogueira, Cláudia D. Rocha, Luís F. Teixeira and Hélder Oliveira

Abstract: Estimating 3D human poses from monocular videos is a crucial task for applications in healthcare, augmented reality, and robotics, yet it is challenged by occlusions and depth ambiguity. We introduce a new framework that utilises the YOLOv11 as a backbone for robust 2D keypoint detection and the TCPFormer, an innovative transformer-based architecture that leverages spatial and temporal transformers, to lift 2D poses to 3D. By integrating multi-scale attention, TCPFormer effectively captures local joint relationships and global sequence context, surpassing the accuracy of previous models. Evaluated on the MPI-INF-3DHP dataset, our approach presents an end-to-end pipeline for 3D pose estimation from image sequences, achieving superior performance compared to existing methods, with a mean per joint position error of 70.65 mm, an area under the curve of 59.05%, and 90.40% of keypoints correctly estimated within a 150 mm threshold.
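For context, the two headline metrics above can be computed as follows. This is a generic sketch of MPJPE and PCK (with the 150 mm threshold mentioned in the abstract), not the authors' evaluation code; function names are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth 3D joints, in the input units (e.g. mm).
    pred, gt: arrays of shape (n_frames, n_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, threshold=150.0):
    """Percentage of correct keypoints: fraction of joints whose error
    falls within the threshold (150 mm in the MPI-INF-3DHP protocol)."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return (errors <= threshold).mean()
```

The reported AUC is typically the area under the PCK curve as the threshold sweeps from 0 to 150 mm.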

Paper Nr: 165
Title:

Track-UGV-SYNC: A Framework for Synchronizing Sports User-Generated Videos Using Multi-Object Tracking Features

Authors:

Elton Alencar and Rosiane de Freitas

Abstract: The number of people using their mobile devices to record videos during public events, such as sports matches, has increased. During these events, multiple users often capture videos from different perspectives and then share them on video-sharing platforms like YouTube, resulting in a rich but unsynchronized collection of multi-view UGVs. Synchronizing these videos can enable multi-view analysis; however, synchronizing multi-view videos, often without any temporal reference (timestamps), represents a significant challenge. This work presents Track-UGV-SYNC, a framework for automatic synchronization of multi-view user-generated sports videos using only visual features extracted by a deep learning-based multi-object tracking (MOT) model. Unlike traditional synchronization methods that rely on timestamps or audio cues, the proposed approach aligns videos through the similarity of object trajectories (tracklets) returned by MOT-UGV, an adapted version of YOLO-World and StrongSORT. The experiments conducted with the MUVY datasets (composed of public YouTube sports UGVs) demonstrate that visual features alone can effectively estimate temporal offsets and overlapping scenes between unsynchronized videos. Results indicate that the MOT-UGV tracker yields robust visual features for temporal alignment, opening new possibilities for automated multi-perspective sports video analysis.

Paper Nr: 194
Title:

RIGATING: Radar-Image Gated Attention Fusion for Multimodal Semantic Segmentation in Autonomous Mobile Systems

Authors:

Patrick Ziegler, Jonas Mehl, Jörg Franke and Sebastian Reitelshöfer

Abstract: Autonomous mobile systems (AMS) require robust environmental perception, particularly in challenging environments where individual sensors exhibit limitations due to their inherent physical measurement principles. Therefore, this paper examines different fusion strategies from various multimodal domains for the semantic segmentation task and introduces RIGATING, a novel mid-level fusion architecture that integrates sparse radar point clouds with RGB images for semantic segmentation. Using gated attention mechanisms, our RIGATING architecture combines dual encoders, DeepLabV3+ with ResNet-101 backbone for RGB and PointNet++ for radar feature extraction, fused at high and low levels to dynamically weigh complementary information, including radar cross-section and velocity. Evaluated on the Zenseact Open Dataset (ZOD), RIGATING matches the RGB-baseline in interference-free scenarios, while demonstrating superior robustness under perturbations such as noise, blur, brightness and full sensor ablations, achieving over 70 percent IoU on radar inputs alone. These advancements enhance perception reliability for mobile systems in adverse conditions and lay the groundwork for more intelligent, human-like sensor data processing.

Paper Nr: 282
Title:

An Agentic Framework for Context-Guided Industrial Label Inspection

Authors:

Ravneet Kaur, Sai Rama Raju Penmatsa and Joydeep Acharya

Abstract: Traditional computer vision methods for industrial quality inspection rely on dedicated supervised training for each new product configuration, resulting in high development costs and limited scalability. These limitations persist because many inspection tasks, particularly defect detection, require product-specific specifications that do not generalize well. In contrast, this work examines another broad subclass of industrial inspection: verifying the labels applied to finished products. We show that this problem can be addressed using an inspection framework that can extract the underlying context of the label specifications using agentic AI without the need for dedicated supervised training for each label type. The proposed framework incorporates a Knowledge-Guided Reasoning stream that interprets multimodal documents (e.g., inspection manuals, CAD drawings) to autonomously extract the semantic context required for label verification. This contextual knowledge guides a vision-driven analytics stream that localizes, segments, and geometrically refines label regions for accurate inspection. A human-in-the-loop mechanism further refines both components by incorporating targeted feedback from subject matter experts (SMEs). The framework reduces deployment time from days to hours while improving reliability and adaptability across a wide range of label inspection problems in different industrial verticals.

Paper Nr: 306
Title:

Towards GNSS-Denied Geo-Positioning Using Search Area Refinement

Authors:

Signe Møller-Skuldbøl, Kasper Kjærsgaard Lauritsen, Maria Daliana Moga, Omkar Arun Korgaonkar, Peter Guld Leth and Andreas Møgelmose

Abstract: An alternative to GNSS is pivotal in allowing UAVs to continue tasks without reliable satellite access. This work presents two novel methods capable of geo-positioning UAVs when performing operations in GNSS-denied areas. Our two algorithms rely on different methodologies for providing the UAV geo-position, with one (FVL-SAR) relying on two neural networks, one fully convolutional and one deep NN, whilst the other (TVL-SAR) utilises the attention-only structure provided by the vision transformer. Both methods are evaluated on a subset of UAV-VisLoc and the entirety of the 106.19 km VPAIR trajectory, with FVL-SAR providing an average measurement positioning error of 32.39 meters on 62.36% of the UAV images. TVL-SAR provides an error of 39.44 meters on 67.31% of images. To ensure continuous state estimations, an EKF is used, which introduces search area refinement. The EKF is affected by long stretches of feature-sparse environments, which results in FVL-SAR achieving a mean error of 61.00 meters, while TVL-SAR achieves 79.53 meters. To the authors' knowledge, these are the first published results on mean error during flight through UAV-VisLoc and VPAIR. (Code: https://github.com/xdKazer/P7 Drone Geolocalization).

Paper Nr: 318
Title:

SkipFusion: Multimodal Process-Aware Segmentation for Steel Surface Inspection

Authors:

Lars De Pauw and Toon Goedemé

Abstract: In the production of sheet steel, surface quality is a critical factor that directly impacts functional and aesthetic product quality. While modern vision-based inspection systems achieve high accuracy, they overlook valuable contextual information from the production process itself. This paper introduces SkipFusion, a multimodal segmentation framework that fuses image data with process parameters such as temperature, thickness, and production settings to enhance steel surface defect segmentation. We systematically investigate two lightweight fusion mechanisms, namely concatenation and multiplication, inserted at multiple encoder and decoder locations within the U-Net backbone. To address instability during heterogeneous fusion training, we propose a novel skip-initialized transfer learning strategy that preserves pretrained feature integrity while enabling efficient multimodal adaptation, which outperforms all baselines in terms of accuracy and training time. Experiments on a large-scale industrial dataset of 4,184 high-resolution images with 19 defect classes demonstrate that process-aware skip-initialized fusion improves defect segmentation by up to 4.6% mIoU compared to the image-only baseline, with limited additional training time. The proposed framework offers an effective, deployable solution for real-time, process-informed industrial surface inspection.

Paper Nr: 332
Title:

AI-Driven Enhancement of Depth Map Consistency

Authors:

Jakub Kit, Dominika Klóska, Dawid Mieloch and Adrian Dziembowski

Abstract: Virtual Reality (VR) and Free-Viewpoint Television (FTV) systems rely on accurate depth maps for high-quality virtual view synthesis. However, standard depth estimation methods often suffer from temporal inconsistencies, particularly around moving objects, leading to flickering and visual artifacts. This paper proposes a novel, three-step process to enhance the consistency of depth maps. The approach involves extracting a static background using temporal median filtering, followed by AI-driven segmentation and tracking of moving objects using Detectron2, and finally, fusing independently estimated depth layers. An experimental evaluation on the Common Test Conditions (CTC) sequences for MPEG Immersive Video (MIV) demonstrates that the proposed method significantly improves temporal stability and visual quality. While objective quality gains (IV-PSNR) are modest, the method achieves substantial coding efficiency improvements, with BD-rate savings of up to 15.6% compared to the anchor. Notably, the superior quality of the generated depth maps led to their adoption by the ISO/IEC MPEG group as the new reference for the Fencing sequence.

Area 2 - Foundations & Representation Learning

Full Papers
Paper Nr: 31
Title:

GMAR++: Efficient Gradient Enhanced Attention Rollout

Authors:

Sohambhai Joita and Tatyana Ivanovska

Abstract: Interpreting Vision Transformers (ViTs) is challenging due to their reliance on multi-head self-attention and the absence of spatially localized representations. Existing attribution methods either trace attention flow or exploit gradients, but fail to jointly capture structural reasoning and spatial sensitivity. We propose GMAR++, a gradient-enhanced attention rollout method that integrates positive gradient filtering, attention–gradient interaction, and head-wise relevance weighting. GMAR++ unifies the structural interpretability of attention rollout with the spatial fidelity of gradient-based attribution, yielding sharper and more class-specific visual explanations. Experiments on Tiny-ImageNet with ViT-L/16 show that GMAR++ consistently outperforms Rollout, GMAR, and LeGrad, achieving lower Average Drop, higher Average Increase, higher Average Gain, and improved Insertion/Deletion Area Under the Curve. These results highlight the effectiveness and scalability of GMAR++ and its potential applicability to larger datasets and advanced transformer architectures. Our code is available at https://github.com/soham5498/GMAR-Plus-Efficient-Gradient-Enhanced-Attention-Rollout.git.

Paper Nr: 55
Title:

PoET: Lightweight Pose Encoder Transformer for Online Sign Language Recognition

Authors:

Julie Lascar, Jules Françoise, Michèle Gouiffès, Annelies Braffort and Diandra Fabre

Abstract: With the growing demand for accessible technological tools in the field of sign language (SL) processing, numerous recognition models have been developed in recent years. This paper presents a pose-based lightweight Transformer encoder designed for real-time isolated sign recognition. We provide a detailed description of the model’s architecture, together with a dedicated preprocessing pipeline. We present a comprehensive evaluation of our approach on widely used video datasets, including WLASL, AUTSL, and ASL Citizen, which encompass diverse environments, multiple scales, and different SLs. Our approach matches or even surpasses recent state-of-the-art methods, requiring fewer parameters, less memory, and lower computational cost. Additionally, we conduct an experiment with deaf users via an online web platform. This experiment assesses the performance of our model in a real-world context where conditions differ between training and inference. This paper highlights the strong potential of lightweight pose-based architectures for practical SL recognition systems and provides concrete technical guidance for effective implementation.

Paper Nr: 72
Title:

NeRF-Seg3D: Integrating Semantic Segmentation, 3D Reconstruction, and Camera Geometry for Enhanced Rust Analysis on Oil & Gas Platforms

Authors:

Jose L. Huillca, Raphael dos S. Evangelista, Christian Erik Condori Mamani, Horácio Brescia Macêdo Henriques, Caio Marcus Monteiro de Oliveira, João Vitor Mendonça de Moraes, Robinson Luiz Souza Garcia, Michelle Soares Pereira Facina, Maikon Bressani, Eduardo C. Vasconcellos, Lucas Bertelli Martins, Marcos Amaral de do Almeida, Raul Queiroz Feitosa, Patrick Nigri Happ, Esteban Walter Gonzales Clua and Leandro A. F. Fernandes

Abstract: Rust poses a significant threat to offshore oil & gas platforms due to the corrosive nature of the maritime environment. Degraded areas are usually assessed by trained personnel through visual inspection, a costly and subjective process. We propose an innovative method using deep learning and 3D reconstruction to automatically estimate and compute rust-affected areas per industrial object type on oil & gas production platforms. Our approach uses 360° panoramic images for object type and rust segmentation and reconstructs the 3D geometry of the scene for subsequent surface area estimation. Conducted experiments demonstrate the feasibility of the proposed method, showing highly correlated results with human assessments, thus reducing costs and minimizing subjectivity.

Paper Nr: 79
Title:

Anomaly Detection Using a Lite PatchCore for Mobile Robotic Industrial Application

Authors:

Lelio Lardon, H.Guis Vincente and Eric Busvelle

Abstract: This study presents an enhanced anomaly detection approach based on PatchCore, developed in the context of autonomous robotic inspection by Acwa Robotics. The proposed method is designed to run on an embedded single-board computer, drastically reducing memory usage while maintaining high detection performance. By integrating early stopping during coreset construction, multi-class unification, and dimensionality reduction, our method compresses high-dimensional features and simplifies memory management. Experiments on the MVTec-AD dataset show an over 3% increase in instance-level AUROC alongside a memory bank size reduction exceeding 98%. The approach also demonstrates robust performance under deteriorated conditions and on the VisA dataset.

Paper Nr: 100
Title:

A Conformal Loss Correction Approach for Label-Noise Learning

Authors:

Marcos Mota, Helio Pedrini and André Santanchè

Abstract: Label noise is a pervasive challenge in applying supervised learning to real-world datasets, particularly for deep neural networks that can easily overfit to corrupted labels. Loss correction methods, such as T-Revision, address this issue by estimating a transition matrix that models the noise process. However, in practice, these approaches typically rely on heuristics to identify anchor points, samples that are reliably clean, which can undermine their performance. In this work, we introduce CP T-Revision, a novel method that integrates conformal prediction into loss correction to provide statistically grounded anchor points and improve transition matrix initialization. Our approach leverages prediction sets to identify reliable anchor point samples and apply the reweighted loss accordingly, enhancing robustness against different types and levels of noise. We evaluated CP T-Revision on benchmark image classification datasets (MNIST, CIFAR-10, CIFAR-100) under synthetic symmetric and asymmetric noise, as well as on noisy human-annotated datasets (CIFAR-N). Experimental results demonstrate that CP T-Revision consistently outperforms T-Revision and baseline cross-entropy methods across varying noise intensities, achieving higher classification accuracy and stability. These findings highlight the potential of conformal prediction as a primary tool to advance label noise learning.

Paper Nr: 111
Title:

Domain-Aware Diffusion for Synthetic-to-Real Data Augmentation

Authors:

Salahidine Lemaachi, Anaïs Druart and Nicolas Winckler

Abstract: Synthetic data is increasingly used to train deep models when large-scale annotated real datasets are unavailable, but performance often degrades due to the domain gap between synthetic and real images. We propose a diffusion-based framework for synthetic-to-real style transfer that produces realistic images while preserving semantic structure. Our method builds on latent diffusion models with ControlNet and introduces three key ideas. First, we design a dual-control representation that fuses segmentation maps with Canny edges, ensuring both semantic layout fidelity and fine-grained detail preservation while improving efficiency by avoiding multiple control passes. Second, we introduce domain-aware prompting, where lightweight tokens ("synthetic" or "real") are added to prompts to control domain style in image translation. Third, we adopt an iterative refinement loop in which generated images with artifacts are progressively reintroduced into training, allowing the model to correct its own errors. Experiments on GTA-to-Cityscapes show that our approach reduces the domain gap, improves mean IoU, and trains significantly faster than GAN-based baselines. Our code and data are available at https://github.com/bds-ailab/syn2real.

Paper Nr: 129
Title:

Realistic Evaluation of Test-Time Adaptation: Unsupervised Model Selection

Authors:

Sebastian Cygert, Damian Sójka, Tomasz Trzciński and Bartłomiej Twardowski

Abstract: Test-Time Adaptation (TTA) has recently emerged as a promising strategy for improving model robustness under distribution shifts by adapting models during inference without access to labeled data. Due to the complexity of the task, hyperparameters play a critical role in the effectiveness of adaptation. However, the literature has largely overlooked the challenge of optimal hyperparameter selection in realistic, label-free scenarios. In this work, we address this problem by evaluating existing TTA methods using unsupervised hyperparameter selection strategies to obtain a more realistic evaluation of their performance. We show that some of the recent state-of-the-art methods exhibit inferior performance compared to the previous algorithms when using our more realistic evaluation setup. Further, we show that forgetting is still a problem in TTA as the only method that is robust to hyperparameter selection resets the model to the initial state at each step. We analyze various unsupervised selection strategies and observe that while many perform reasonably well in specific scenarios, consistent performance is only achieved when some form of supervision is introduced. Our findings highlight the need for more rigorous benchmarking practices in TTA research, including explicit reporting of model selection strategies. To support this, we publish our evaluation framework online.

Paper Nr: 150
Title:

Multimodal Embedding for Scientific Image Caption Generation

Authors:

Jose L. Huillca and Leandro A. F. Fernandes

Abstract: Image caption generation is a process that emerged from the combination of Computer Vision and Natural Language Processing techniques. The solution to this problem has typically been applied to enhance comprehension of natural images by producing short descriptions. However, scientific images found in publications differ from natural images. Furthermore, scientific images are often associated with detailed descriptions in the manuscripts. This paper presents MMICap, a deep-learning-based solution that uses multimodal inputs composed of scientific images paired with detailed texts to automatically generate captions that effectively summarize the content depicted in these images. This paper also presents a new dataset for image captioning, ElsCap, constructed from open-access articles retrieved from ScienceDirect. ElsCap contains 1,088,728 scientific images, each with its respective caption and descriptive paragraph. Experiments with the ElsCap dataset demonstrate that MMICap leverages the integration of image and text inputs to enhance the quality of generated captions. In our experiment, we used the BLEU, METEOR, ROUGE, and CIDEr metrics to compare the results produced by BLIP and LSTM networks with those produced by these networks when integrated as the backbone of the MMICap paradigm.

Paper Nr: 154
Title:

Self‑Paced Percentile Distillation for Fast and Accurate LiDAR Person Detection

Authors:

Avesalon Razvan Marian and Miron Radu

Abstract: Autonomous robots increasingly rely on 2D LiDAR, but edge deployment demands compact, real‑time detectors. We study knowledge distillation for 2D LiDAR person detection and introduce a simple, effective scheme that emphasizes hard examples while remaining stable late in training. Using a DR‑SPAAM teacher and channel‑reduced students, we combine multi‑level feature supervision with an adaptive, percentile‑guided error method. This quantile‑calibrated weighting preserves a meaningful easy–hard split as feature errors shrink, preventing over‑focalization and maintaining the balance with task losses. On DROW v2, our main student performs on par with the teacher while increasing throughput from 78.9 FPS to 111.2 FPS on an RTX A2000. We also devise a smaller variant of the student offering favourable trade‑offs between efficiency and accuracy. We show that the adaptive quantile scheme outperforms the standard knowledge-distillation techniques, closing the gap between the student and the teacher. The resulting model is lightweight and deployment‑ready, retaining the teacher’s temporal strengths while advancing the accuracy–efficiency Pareto frontier for 2D LiDAR perception.

Paper Nr: 159
Title:

ProxiVideoFriends: Revisiting Proxemics through Temporal and Social Reasoning

Authors:

Isabel Jiménez-Velasco, Rafael Muñoz-Salinas, Vicky Kalogeiton and Manuel J. Marín-Jiménez

Abstract: Proxemics, the study of personal space in human interactions, plays a pivotal role in the development of socially aware AI systems, especially in the fields of robotics and human-computer interaction. While much progress has been made in areas like action recognition, the dynamics of proxemics in real-world, interactive settings remain largely unexplored. This paper introduces ProxiVideoFriends, the first dataset designed specifically for proxemics estimation in videos, derived from the sitcom Friends. This dataset contains annotated video sequences that capture various physical contacts and social relationships, providing a valuable resource for future research. We propose a temporal multitask model that jointly predicts proxemics and social relationships, establishing the first state-of-the-art in video-based proxemics classification. By leveraging temporal dependencies, our model outperforms static-image approaches and demonstrates the power of multitask learning to improve both proxemics and relationship recognition. We also explore the impact of audio features, revealing that while they enhance relationship classification, they do not contribute to proxemics estimation, which remains primarily a visual task. Our findings highlight the importance of temporal reasoning and multitask learning in advancing proxemics classification. The ProxiVideoFriends dataset and our proposed model provide new insights into the complex dynamics of human interactions, advancing the development of AI systems capable of understanding and responding to social contexts in a natural and intuitive way.

Paper Nr: 164
Title:

Contrastive Self-Supervised Learning with Distance-Guided Positive and Negative Pair Selection

Authors:

Bruno A. Shimura, Priscila T. M. Saito and Pedro H. Bugatti

Abstract: Contrastive self-supervised learning (SSL) has emerged as a powerful paradigm for visual representation learning. However, existing methods often rely on randomly sampled positive and negative pairs, which fail to fully exploit the underlying geometric structure of the latent space. To address this limitation, we propose Distance-Guided Contrastive Learning (DGCL), a novel approach that systematically selects informative sample pairs based on pairwise distances within a manifold projected using t-distributed Stochastic Neighbor Embedding (t-SNE). In DGCL, each image in the dataset serves as an anchor from which the farthest intra-class samples (hard positives) and the nearest inter-class samples (hard negatives) are dynamically identified and used in contrastive optimization. By iteratively updating the embedding space and recomputing these relationships over successive learning cycles, the proposed method progressively refines the latent representation, enhancing both intra-class compactness and inter-class separability. Experiments on the CIFAR-10 dataset demonstrate that DGCL consistently outperforms conventional convolutional neural networks, improving classification accuracy from 34.03% with a traditionally trained ResNet50 to 70.69% with the same ResNet50 backbone under DGCL. Qualitative analyses of the evolving t-SNE projections reveal that DGCL produces embeddings with significantly higher discriminative coherence and smoother convergence dynamics. These results highlight that explicitly guiding the contrastive process through distance-based pair selection yields more robust, semantically structured representations, offering a simple yet effective enhancement to self-supervised learning pipelines in computer vision.
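As a minimal sketch of the distance-guided pair selection described above, assuming a precomputed low-dimensional projection (e.g. t-SNE) of the current embeddings; function and variable names are illustrative, not the authors':

```python
import numpy as np

def select_hard_pairs(embeddings, labels, anchor_idx):
    """For one anchor, pick the farthest same-class sample (hard positive)
    and the nearest different-class sample (hard negative) by Euclidean
    distance in the projected embedding space."""
    anchor = embeddings[anchor_idx]
    dists = np.linalg.norm(embeddings - anchor, axis=1)
    same = labels == labels[anchor_idx]
    same[anchor_idx] = False          # exclude the anchor itself
    diff = labels != labels[anchor_idx]
    # Mask out invalid candidates, then take argmax/argmin.
    hard_pos = int(np.argmax(np.where(same, dists, -np.inf)))
    hard_neg = int(np.argmin(np.where(diff, dists, np.inf)))
    return hard_pos, hard_neg
```

In the full DGCL loop described in the abstract, the projection would be recomputed after each learning cycle so the hard-pair assignments track the evolving embedding space.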

Paper Nr: 169
Title:

Diversity by Chance: Rethinking the Need for Determinantal Point Processes in Active Learning

Authors:

Brecht Deprez, Michiel Vandecappelle, Steven Lauwereins and Toon Goedemé

Abstract: Determinantal Point Processes (DPPs) provide an elegant way to sample diverse subsets and are widely used in active learning. Yet, despite their theoretical appeal, DPPs introduce significant computational complexities and rely on strong assumptions about feature embeddings. This work investigates when such structured diversity modeling is truly beneficial, and when simple random sampling suffices, through an analysis of downstream image classification. This study systematically evaluates random sampling, as well as two DPP-based sampling methods on both CIFAR-10 and GTSRB with regard to embeddings and dataset skew. Diversity of the selected subset is evaluated through quantifiable metrics, and its downstream effects are measured on image classification accuracy. Across all experiments, DPP-based methods achieve slightly higher diversity but yield insignificant improvements in downstream performance. Random sampling achieves a similar performance at only a fraction of the computational cost, suggesting that in many realistic scenarios, diversity by chance is preferable.

Paper Nr: 181
Title:

Unveiling Degradations with Deep Learning Models in Face Recognition Systems

Authors:

Leandro Dias Carneiro and Flavio de Barros Vidal

Abstract: This study investigates the real-world performance of facial recognition systems using deep learning models, particularly in challenging scenarios with suboptimal and degraded images, such as public surveillance and low-light conditions. Facial recognition systems encounter difficulties in uncontrolled environments with adverse image capture conditions, leading to consistently lower performance than in controlled settings, as observed in numerous studies. The challenges become particularly critical when these systems are employed in law enforcement for public safety policies. The primary objective is to identify degradation in facial images under simulated real-world scenarios, emphasizing uncontrollable environmental factors. The proposed methodology involves subjecting an original face image dataset to a degradation sequence. Twenty state-of-the-art convolutional neural network (CNN) models are trained to classify degradation type and intensity in face images, utilizing ten CNN architectures, including ResNet, DenseNet, VGG, Inception, Xception, and MobileNet. Training is conducted from scratch and with transfer learning using the Labeled Faces in the Wild (LFW) dataset, which undergoes various degradations. The experiment involves degrading the LFW dataset with 14 types of degradation at six intensity levels, resulting in 20 trained CNN models (ten from scratch and ten with transfer learning). Models are evaluated based on the accuracy and stability of the learning curve. The results demonstrate impressive performance, with the most efficient models achieving 94% accuracy and the least performing model reaching 71% accuracy in the overall evaluation. These findings highlight the capability of various state-of-the-art models to accurately classify facial image degradation, regardless of the specific architecture used. The study contributes to public safety, law enforcement, and social justice by enabling the identification of degradations in forensic examination images, leading to a more precise interpretation of results from facial recognition systems.

Paper Nr: 185
Title:

Exploiting Test-Time Augmentation in Federated Learning for Brain Tumor MRI Classification

Authors:

Thamara Leandra de Deus Melo, Rodrigo Moreira, Larissa Ferreira Rodrigues Moreira and André R. Backes

Abstract: Efficient brain tumor diagnosis is crucial for early treatment; however, it is challenging because of lesion variability and image complexity. We evaluated convolutional neural networks (CNNs) in a federated learning (FL) setting, comparing models trained on original versus preprocessed MRI images (resizing, grayscale conversion, normalization, filtering, and histogram equalization). Preprocessing alone yielded negligible gains; combined with test-time augmentation (TTA), it delivered consistent, statistically significant improvements in federated MRI classification (p<0.001). In practice, TTA should be the default inference strategy in FL-based medical imaging; when the computational budget permits, pairing TTA with light preprocessing provides additional reliable gains.
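As a concrete illustration of the inference strategy the abstract recommends, the sketch below averages a model's predictions over several augmented views of an input. The flip augmentations and toy model are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def tta_predict(model, image, augmentations):
    """Average a model's class-probability outputs over augmented views.

    `model` is any callable returning class scores; the specific
    augmentations (flips here) are illustrative only.
    """
    views = [aug(image) for aug in augmentations]
    views.append(image)  # always include the original view
    probs = np.stack([model(v) for v in views])
    return probs.mean(axis=0)  # consensus prediction

# Toy demo: a "model" whose output depends on horizontal orientation.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
model = lambda x: np.array([x[:, :4].mean(), x[:, 4:].mean()])
augs = [np.fliplr, np.flipud]
p = tta_predict(model, img, augs)
```

Averaging over views reduces the variance introduced by orientation-sensitive features, which is the mechanism behind TTA's reported gains.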

Paper Nr: 195
Title:

From Graphs to Images: Non-Parametric PPI Context Integration for Vision-Based Protein Function Prediction

Authors:

Yeonatan Mauhnoom, Gabriel Bianchin de Oliveira, Helio Pedrini and Zanoni Dias

Abstract: Protein function annotation is critical for biomedical discovery, yet most proteins remain unannotated. Recent protein language models provide universal sequence embeddings; however, leveraging protein–protein interaction (PPI) networks alongside these embeddings remains challenging. PPI networks are inherently sparse and noisy, especially for understudied organisms, while existing approaches typically rely on parametric graph neural networks, introducing computational complexity and interpretability challenges. We propose a fixed, non-parametric approach that contextualizes sequence embeddings via multi-hop message passing over PPI graphs, then encodes enriched representations as RGB images for vision models. This strategy avoids the computational overhead of complex GNN architectures and yields interpretable predictions grounded in explicit image features. We evaluate our method against three baselines using CAFA5 data across three Gene Ontology domains. Our approach achieves wFmax of 53.21% on BP, 62.98% on CC, and 68.97% on MF, exceeding DeepNF by more than 25 percentage points. The source code is publicly available at https://github.com/ymauhn/PPI-context-integration-for-Protein-Function-Prediction.
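The non-parametric contextualization step can be pictured as repeated neighborhood averaging over the PPI graph, with no learned weights. A minimal sketch, where the adjacency matrix, mixing coefficient `alpha`, and hop count are illustrative assumptions rather than the authors' exact scheme:

```python
import numpy as np

def propagate(X, A, hops=2, alpha=0.5):
    """Non-parametric multi-hop message passing: blend each node's
    embedding with the mean of its neighbors' embeddings at every hop."""
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.maximum(deg, 1)  # row-normalized adjacency
    for _ in range(hops):
        X = (1 - alpha) * X + alpha * (P @ X)
    return X

# Tiny PPI graph: protein 0 interacts with 1 and 2; 1 interacts with 2.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
X = np.eye(3)  # stand-in for per-protein sequence embeddings
Z = propagate(X, A)
```

Because the propagation operator is fixed, the enriched embeddings require no training and can be encoded downstream (e.g., as image channels) without the overhead of a parametric GNN.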

Paper Nr: 205
Title:

EMTA: End-to-End Multi-Task Audio-Driven Talking-Head Animation

Authors:

Laxmi Narayen Nagarajan Venkatesan, Divyam Choudhary, Sai Madhavan G, Subhajeet Lahiri, Rahulraj B. R, Rittik Panda, Dinesh Babu Jayagopi, Raj Tumuluri and Magnus Revang

Abstract: Audio-driven facial animation has advanced rapidly, yet most methods remain task-specific (e.g., reenactment, dubbing) and domain-locked (2D or 3D). This fragmentation limits flexibility across applications like telepresence and AR/VR. We introduce EMTA, an end-to-end, identity-preserving framework that unifies multiple audio-driven tasks within a single architecture. EMTA decouples audio-to-geometry from geometry-to-appearance generation: (1) Audio2Mesh predicts expressive, temporally coherent 3D facial landmarks from audio and employs a novel rotation decoder module, directly usable for 3D avatars; and (2) LaPix-GAN synthesizes photorealistic 2D frames through an identity-aware generator, a novel Audio-Gated Spatial Attention (AGSA) module, and enhanced Facial-Inpainted Landmark (FIL) sketches. EMTA achieves robust and competitive results across 3D mesh prediction, 2D reenactment, talking portraits, and dubbing tasks against baselines on multiple datasets, validating it as a flexible and practical solution (25 FPS).

Paper Nr: 213
Title:

Automatic Classification of Whole-Body PET Scans Using Deep Features

Authors:

Celso Luiz Silva Soares Filho, Darlan Bruno Pontes Quintanilha, Alison Corrêa Mendes, Anselmo Cardoso de Paiva, Alexandre Pessoa, Ramsey D. Badawi, Vivek Swarnakar and Cláudio de Souza Baptista

Abstract: Cancer remains a critical global health challenge, with the World Health Organization (WHO) estimating 20 million new cases and 9.7 million deaths in 2022. Positron Emission Tomography (PET) with 18F-FDG is a key imaging modality for diagnosing and monitoring various cancer types, such as Melanoma, Lymphoma, and Lung Cancer. However, despite the effectiveness of 3D PET scans, their high computational cost and the significant shortage of expert radiologists create a bottleneck for large-scale screening. To address these challenges, this work proposes a method that leverages Maximum Intensity Projection (MIP): deep features are extracted from coronal and sagittal MIP images of 18F-FDG PET scans, followed by an XGBoost classifier for feature selection and, finally, binary classification (cancerous vs. healthy). The approach achieved 81.19% accuracy, 80.81% F1-Score, 81.63% recall, and 80.00% precision. These results demonstrate the viability of the proposed approach, successfully balancing the computational efficiency of reduced-dimensionality representations with the discriminative capability of deep features for PET scan analysis.
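Maximum Intensity Projection itself is a one-line reduction: each pixel of the projection keeps the brightest voxel along the corresponding ray. A minimal sketch, where the axis convention for the coronal and sagittal views is an assumption for illustration:

```python
import numpy as np

def mip(volume, axis):
    """Maximum Intensity Projection: collapse a 3D volume along one
    axis by keeping the brightest voxel on each ray."""
    return volume.max(axis=axis)

# Toy PET-like volume with assumed axes (axial, coronal, sagittal).
vol = np.zeros((4, 5, 6))
vol[2, 3, 1] = 7.0  # a single "hot" voxel
coronal = mip(vol, axis=1)   # shape (4, 6)
sagittal = mip(vol, axis=2)  # shape (4, 5)
```

Reducing each 3D scan to two 2D projections is what makes the downstream feature extraction and XGBoost classification computationally cheap.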

Paper Nr: 216
Title:

Comparing Euclidean and Hyperbolic K-Means for Generalized Category Discovery

Authors:

Mohamad Dalal, Thomas B. Moeslund and Joakim Bruslund Haurum

Abstract: Hyperbolic representation learning has been widely used to extract implicit hierarchies within data, and recently it has found its way to the open-world classification task of Generalized Category Discovery (GCD). However, prior hyperbolic GCD methods only use hyperbolic geometry for representation learning and transform back to Euclidean geometry when clustering. We hypothesize this is suboptimal. Therefore, we present Hyperbolic Clustered GCD (HC-GCD), which learns embeddings in the Lorentz Hyperboloid model of hyperbolic geometry, and clusters these embeddings directly in hyperbolic space using a hyperbolic K-Means algorithm. We test our model on the Semantic Shift Benchmark datasets, and demonstrate that HC-GCD is on par with the previous state-of-the-art hyperbolic GCD method. Furthermore, we show that using hyperbolic K-Means leads to better accuracy than Euclidean K-Means. We carry out ablation studies showing that clipping the norm of the Euclidean embeddings leads to decreased accuracy in clustering unseen classes, and increased accuracy for seen classes, while the overall accuracy is dataset dependent. We also show that using hyperbolic K-Means leads to more consistent clusters when varying the label granularity.
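Clustering directly in the Lorentz model only requires replacing the Euclidean distance with the hyperboloid geodesic distance in the K-Means assignment step. A minimal sketch for curvature -1 (illustrative only, not the authors' code):

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + <x_rest, y_rest>."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_dist(x, y):
    """Geodesic distance on the hyperboloid (curvature -1)."""
    # Clamp for numerical safety: -<x,y>_L >= 1 holds in exact arithmetic.
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def lift(u):
    """Lift Euclidean points onto the hyperboloid: x0 = sqrt(1 + |u|^2)."""
    x0 = np.sqrt(1.0 + (u ** 2).sum(-1, keepdims=True))
    return np.concatenate([x0, u], axis=-1)

# Assignment step of a hyperbolic K-Means iteration:
pts = lift(np.array([[0.0, 0.0], [2.0, 0.0], [2.1, 0.1]]))
centroids = lift(np.array([[0.0, 0.0], [2.0, 0.0]]))
d = lorentz_dist(pts[:, None, :], centroids[None, :, :])
labels = d.argmin(axis=1)
```

The update step (recomputing centroids) would additionally use the Lorentzian mean projected back onto the hyperboloid, which is the part the full algorithm has to handle carefully.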

Paper Nr: 228
Title:

How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?

Authors:

Marko Putak, Thomas B. Moeslund and Joakim Bruslund Haurum

Abstract: Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an infinite number of perfectly labeled samples through a formula-driven approach, such as fractals or contours. FDSL avoids common drawbacks such as manual labor, privacy issues, and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form videos that serve as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and fractal diversity issues. The method achieves roughly 100 times faster sampling and superior downstream performance compared to other 3D fractal filtering methods.

Paper Nr: 232
Title:

Disentangled Representation Learning with Variational Autoencoders for Facial Image Analysis

Authors:

Lucas Massa, Tiago Vieira, Allan Martins, William Schwartz, Rodrigo Paes and Willy Tiengo

Abstract: This work investigates disentangled representation learning for facial image analysis using a Variational Autoencoder framework. Building upon the Disentangled VAE (DisVAE), the proposed model integrates correlation and inverse classification losses, dual supervised heads for identity and expression, and a phased training strategy to enhance subspace independence. Experiments on CK+48 and Oulu-CASIA datasets show that the phased configuration achieves the best disentanglement performance, reducing semantic leakage while preserving reconstruction quality and controllable expression interpolation. The results demonstrate that the combination of correlation-based regularization and progressive optimization effectively improves the interpretability and stability of latent representations in VAE-based models.

Paper Nr: 244
Title:

ShareDAB-DETR: Memory-Efficient Object Detection through CentralBank Parameter Sharing

Authors:

Khaled Chikh, Roberto Cavicchioli and Alessandro Capotondi

Abstract: Transformer-based object detectors achieve state-of-the-art accuracy but require substantial memory, making deployment on edge devices impractical. We introduce CentralBank, a novel parameter sharing mechanism that consolidates all transformer projection weights into a single memory-efficient table accessed through deterministic offset mapping. Applied to DAB-DETR, our approach achieves up to 84% reduction in transformer parameters (19.67M to 3.21M) and 67% reduction in training memory while retaining 99% of baseline accuracy (41.8 vs 42.2 AP on COCO). Unlike traditional compression methods, CentralBank is integrated at the architectural level, requiring no post-training optimization. Our best configuration achieves 41.8 AP with only 8.65M transformer parameters and 2.5 GB training memory, enabling practical deployment on edge devices. Extensive evaluation across five configurations reveals that larger hidden dimensions (512 vs 256) provide superior parameter efficiency, achieving an 11.2:1 parameter-to-FLOP reduction ratio. ShareDAB-DETR demonstrates that architectural parameter sharing can effectively bridge the gap between transformer performance and edge deployment constraints.

Paper Nr: 249
Title:

Early-Retry Regeneration Framework for Improving Object-Sufficiency in Text-to-Image Generation

Authors:

Shogo Ishii, Tomoaki Yamazaki, Kengo Murata, Seiya Ito and Kouzou Ohara

Abstract: Text-to-Image generation is a task of generating an image corresponding to a given textual description. Although existing Text-to-Image models can generate images with high visual fidelity, they do not always render all objects specified in the input text due to both text–image inconsistencies and the inherently probabilistic nature of the models. Consequently, users often need to repeatedly regenerate images until obtaining one that satisfies the text, resulting in additional computational cost and user inconvenience. To address this issue, we propose an early-retry regeneration framework that performs intermediate retries based on object-sufficiency judgements, reducing regeneration cost while improving object coverage. Object-sufficiency judgements are implemented in two ways: using an external model applied to rendered images, or leveraging internal representations of the generative model without requiring rendering. Experiments demonstrate that the proposed framework, using either or both judgement methods for retries, generates images with high object coverage while reducing regeneration cost.

Paper Nr: 255
Title:

Deep Ordinal Partial-Label Learning for Cost-Effective Small Diamond Quality Grading

Authors:

Arno Waes, Bram Claes and Toon Goedemé

Abstract: Accurate quality grading of diamonds - Color, Cut, Clarity, and Carat - is critical for valuation, but individual assessment of small diamonds is prohibitively expensive, leaving only coarse quality ranges available. In this work, we address diamond quality estimation under ambiguous weak supervision, introducing, to the best of our knowledge, the first formulation of deep partial label learning for ordinal data. We propose a training strategy and evaluation protocol tailored to ordinal partial labels, and show that all 4Cs can be predicted without access to an individually certified ground truth. Our proposed generalization of the quadratic weighted kappa metric enables principled evaluation under ordinal distributional predictions, while also providing a differentiable alternative promising for future research. Experiments on a large-scale image dataset demonstrate that the Jensen–Shannon Divergence effectively enforces ordinal structure in the label space. Incorporating additional ambiguous data via a sampling strategy based on label entropy proves beneficial for training performance and stability. Clarity remains the most challenging quality to estimate due to the minuscule scale of imperfections. Overall, our approach achieves a pragmatic and cost-effective small diamond grading solution and establishes a first framework for deep ordinal partial label learning, applicable to other domains with ambiguous data.
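For reference, the classical quadratic weighted kappa that the paper generalizes to distributional predictions can be computed as follows on hard ordinal labels (this is the standard formulation, not the authors' generalized metric):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights, the standard
    agreement measure for ordinal grading tasks."""
    # Observed confusion matrix.
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Quadratic penalty grows with the squared grade difference.
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2
    # Expected confusion under chance, from the marginals.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()

kappa = quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 1], n_classes=3)
```

Because the penalty is quadratic in the grade distance, predicting one grade off costs far less than predicting three grades off, which matches how ordinal grading errors are valued in practice.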

Paper Nr: 262
Title:

Ego-Exo Temporal Action Segmentation in Industrial Environments

Authors:

Alessandro Passanisi, Giovanni Maria Farinella and Francesco Ragusa

Abstract: Recognizing actions is fundamental for facilitating operator training by providing guidance on machine usage, improving safety by reducing the risk of incidents, and monitoring each step of a procedure to prevent mistakes. In this work, we address the task of Temporal Action Segmentation (TAS) in the challenging industrial domain. We extend the ENIGMA-51 dataset by adding 51 exocentric videos temporally synchronized with the existing 51 egocentric videos, enabling the study of user actions from complementary viewpoints. These videos are manually annotated with temporal boundaries and corresponding action categories. We define a taxonomy of 68 actions and annotate a total of 5,721 action instances, corresponding to 22 hours of video. Building on this dataset, we propose a benchmark for TAS in industrial environments, considering both same-domain settings (i.e., EGO→EGO, EXO→EXO) and cross-domain settings (i.e., EGO→EXO, EXO→EGO). Our experiments reveal the limitations of current state-of-the-art models when applied to realistic industrial scenarios, highlighting the need for further research on cross-domain generalization.

Short Papers
Paper Nr: 21
Title:

Breast Cancer Detection Using Multi-View Infrared Images and CNN Models

Authors:

Gustavo Pioli Resende, Renata dos Santos Melo, Henrique Coelho Fernandes and André Ricardo Backes

Abstract: Breast cancer is a major health issue worldwide. While mammography is the standard screening method, it lacks sensitivity in younger women. In this context, thermography emerges as a complementary, non-invasive, and radiation-free imaging modality capable of capturing thermal asymmetries associated with abnormal vascularization. This work investigates the application of convolutional neural networks (CNNs) to thermographic images for breast cancer detection. We compared DenseNet121, ResNet50, and EfficientNetB0 architectures using the DMR-IR dataset. A key contribution of this work is the integration of three distinct viewing angles (front, left oblique, right oblique) of the breast region, enhancing the visibility of anatomical and thermal patterns from multiple perspectives and providing richer information to the models. Our findings indicate that multi-view image inputs significantly improve model performance, with DenseNet121 achieving an F1-score of 94.74% and 100% sensitivity, highlighting its potential for supporting early detection strategies and reinforcing the viability of thermography combined with deep learning as an adjunct tool in breast cancer screening.

Paper Nr: 37
Title:

Identification of Water Tanks and Swimming Pools as Potential Dengue Breeding Sites Using Satellite Imagery

Authors:

Rafael de Camillo Masson and André Ricardo Backes

Abstract: This work develops and evaluates neural network-based methods to detect and segment potential breeding sites of the Aedes aegypti mosquito in satellite images. We explored deep learning architectures like U-Net, PSPNet, and LinkNet for segmenting water tanks and swimming pools, comparing them with Faster R-CNN. Using annotated public datasets from Belo Horizonte, we assessed models with IoU (Intersection over Union) and AP (Average Precision) metrics. Many tested models outperformed existing approaches, highlighting greater effectiveness in identifying dengue hotspots. The developed methodology is expected to support scalable monitoring for disease control.

Paper Nr: 40
Title:

Noise-Aware Named Entity Recognition for Historical VET Documents

Authors:

Alexander M. Esser and Jens Dörpinghaus

Abstract: This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies, training on noisy, clean, and artificial data, are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents. It is applied to German documents but transferable to arbitrary languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.

Paper Nr: 52
Title:

ScrewCount: A Dataset and Benchmark for Exemplar Efficiency and Text-Guided Few-Shot Object Counting

Authors:

Farnaz Delirie, Afshin Dini, Amirmasoud Molaei and Leila Sadeghi

Abstract: General object detection methods struggle to detect large numbers of small, overlapping objects, such as screws and nuts, in industrial inspection. Moreover, creating dense annotations in these applications is difficult and costly, motivating the need for few-shot object counting approaches that can generalize with minimal supervision. While methods like Learning to Count Everything and CountGD have achieved progress, the interaction between exemplar efficiency, exemplar robustness, and text guidance remains largely unexplored. In this paper, we present ScrewCount, a new dataset for dense small-object counting in manufacturing contexts. Using ScrewCount, we conduct a systematic study of exemplar selection, analyzing how the number and quality of exemplars affect few-shot counting performance. Our experiments show diminishing returns beyond a small number of exemplars and sensitivity to annotation noise. We further evaluate a text-guided counting method, examining the influence of prompt phrasing on the results. Findings reveal that while text guidance offers flexibility, results depend heavily on prompt design, which in some cases significantly degrades performance. ScrewCount establishes a benchmark for dense small-object counting and provides new insights into exemplar efficiency, robustness, and text guidance under limited annotation.

Paper Nr: 56
Title:

Graph-Based Point Cloud Surface Reconstruction Using B-Splines

Authors:

Stuti Pathak, Rhys Evans, Gunther Steenackers and Rudi Penne

Abstract: Generating continuous surfaces from discrete point cloud data is a fundamental task in several 3D vision applications. Real-world point clouds are inherently noisy due to various technical and environmental factors. Existing data-driven surface reconstruction algorithms rely heavily on ground truth normals or compute approximate normals as an intermediate step. This dependency makes them extremely unreliable for noisy point cloud datasets, even if the availability of ground truth training data is ensured, which is not always the case. B-spline reconstruction techniques provide compact surface representations of point clouds and are especially known for their smoothing properties. However, the complexity of the surfaces approximated using B-splines is directly influenced by the number and location of the spline control points. Existing spline-based modeling methods predict the locations of a fixed number of control points for a given point cloud, which makes it very difficult to match the complexity of its underlying surface. In this work, we develop a Dictionary-Guided Graph Convolutional Network-based surface reconstruction strategy where we simultaneously predict both the location and the number of control points for noisy point cloud data to generate smooth surfaces without the use of any point normals. We compare our reconstruction method with several well-known as well as recent baselines by employing widely-used evaluation metrics, and demonstrate that our method outperforms all of them both qualitatively and quantitatively. The code is available at https://github.com/stutipathak5/SPNet.

Paper Nr: 64
Title:

Beyond the Black Box: A Heuristic-Based Framework for Employing Real-World Reasoning towards Explainable and Reliable Results

Authors:

Murilo Regio, Thomas Trappenberg and Isabel Manssour

Abstract: Deep learning models excel in many tasks but often operate as black boxes, sometimes leading to results that are logically inconsistent or physically implausible. This behavior may erode the end user’s trust in the system, thereby harming its long-term usability. We propose a reasoning framework to improve the reliability and explainability of deep neural networks in computer vision applications. We achieve this by post-processing the output of a pre-trained model and refining the results through a combination of contextual features and real-world constraints. Our framework is validated through two case studies: firearm threat detection, in which we improved precision and F1-score by filtering false positives using pose and temporal consistency, and traffic accident detection, in which we more accurately identified the moment the accident started.

Paper Nr: 71
Title:

Frequency-Domain Detection of AI-Generated Images via Radial Spectrum Templates

Authors:

Yuta Sato, Takayoshi Yamashita, Hironobu Fujiyoshi and Tsubasa Hirakawa

Abstract: The rapid progress of generative models such as Generative Adversarial Networks (GANs) and diffusion models has made detecting AI-generated images increasingly challenging. Motivated by the observation that background regions often contain distinctive artifacts exploited by detectors, we investigate frequency-domain characteristics of real versus generated images. Building on prior findings, we propose a simple yet effective method that constructs a spectral template from real images and evaluates test images by measuring deviations in the radial spectrum. Unlike conventional supervised detectors trained on generated data, our approach relies only on real images and applies a threshold-based decision rule, enabling robust detection across multiple generators without retraining. Experiments on GenImage show that the proposed method achieves competitive accuracy against deep learning baselines while remaining interpretable and avoiding overfitting to specific generators. These results highlight frequency-domain statistical artifacts as a lightweight, complementary cue for fake-image detection.
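The core statistic the abstract describes, a radially averaged spectrum template built from real images plus a deviation score for test images, can be sketched as below. The binning scheme, log-power transform, and Euclidean deviation measure are illustrative assumptions, not the authors' exact choices:

```python
import numpy as np

def radial_spectrum(img, n_bins=16):
    """Radially averaged log-power spectrum of a grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.log1p(np.abs(f) ** 2)
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2)  # radial frequency per pixel
    bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
    idx = np.digitize(r.ravel(), bins) - 1
    pw = power.ravel()
    return np.array([pw[idx == k].mean() if np.any(idx == k) else 0.0
                     for k in range(n_bins)])

def deviation_score(img, template):
    """Distance of an image's radial spectrum from the real-image
    template; thresholding this score gives a decision rule."""
    return np.linalg.norm(radial_spectrum(img, len(template)) - template)

# Build the template from (toy stand-ins for) real images only.
rng = np.random.default_rng(1)
real = [rng.random((32, 32)) for _ in range(8)]
template = np.mean([radial_spectrum(im) for im in real], axis=0)
score = deviation_score(rng.random((32, 32)), template)
```

Since only real images are needed to fit the template, no retraining is required when a new generator appears; only the threshold on the score may need recalibration.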

Paper Nr: 75
Title:

QAC: Quality-Aware Clustering for Data-Centric Seismic Segmentation

Authors:

Rahma Aloui, Pranav Martini, Pandu Devarakota, Apurva Gala and Shishir K. Shah

Abstract: Although seismic data are traditionally analyzed as one-dimensional signals, their two-dimensional image representation contains rich structural and textural information that can be exploited by modern vision-based methods. Deep learning has shown promising results in seismic image segmentation; however, performance is often degraded by low-quality, noisy, or structurally complex slices that remain difficult for models to interpret. Addressing this challenge calls for data-centric strategies that improve training reliability beyond architectural refinements. In this work, we introduce Quality-Aware Clustering (QAC), a data-centric framework that integrates hierarchical clustering with no-reference image quality assessment (IQA) to automatically identify unreliable seismic slices. Candidate dendrogram cuts are evaluated using a composite criterion balancing structural coherence (Silhouette score), inter-cluster compactness (Davies–Bouldin index), and perceptual separability based on IQA variance. This IQA-guided strategy enables automatic selection of the optimal number of clusters and systematic removal of low-quality subsets. Experimental results demonstrate that the proposed framework improves segmentation accuracy and consistently outperforms common filtering heuristics. By coupling structural grouping with perceptual quality cues, QAC provides a principled and automated approach to data quality assessment, enabling more robust and reliable seismic segmentation workflows.

Paper Nr: 77
Title:

ORDINALBENCH: A Benchmark Dataset for Diagnosing Generalization Limits in Ordinal Number Understanding of Vision-Language Models

Authors:

Yusuke Tozaki and Hisashi Miyamori

Abstract: Vision-Language Models (VLMs) have advanced across multimodal benchmarks but still show clear gaps in ordinal number understanding, i.e., the ability to track relative positions and generalize to large indices. We present ORDINALBENCH, a diagnostic benchmark that standardizes ordinal number understanding as an evaluation task for VLMs. The core task is N-th object identification, defined by a starting reference and traversal rule. Task difficulty is controlled along three axes: (i) ordinal magnitude (from small numbers to extreme cases up to 300), (ii) arrangement complexity (from single loops to maze-like paths), and (iii) object count. The benchmark provides 39,000 question–answer pairs, each annotated with a ground-truth reasoning trajectory, balanced across difficulty, supporting controlled yet large-scale testing. Beyond answer-only evaluation, our framework requires models to generate structured, stepwise traces of the counting process and supplies an open evaluation toolkit measuring both final accuracy and step-level path consistency. Zero-shot evaluations of GPT-5, Gemini 2.5 Flash Lite, Qwen2.5-VL, InternVL3.5, and Molmo reveal sharp degradation under large-ordinal and complex-path conditions, highlighting weak generalization despite strong scores on standard multimodal tasks. By framing ordinal number understanding as a core target, ORDINALBENCH offers a reproducible testbed and actionable diagnostics to drive development of VLMs with stronger sequential reasoning. All data and code are available at https://ordinalbench.github.io.

Paper Nr: 90
Title:

A Data Centric Framework for Seismic Data Selection Using Latent Space Distribution Analysis

Authors:

Samiha Mirza, Apurva Gala, Pandu Devarakota and Shishir Shah

Abstract: Seismic interpretation, which includes tasks such as salt segmentation and fault identification, is a crucial step in seismic imaging workflows. Salt segmentation is essential for delineating hydrocarbon traps and building velocity models. Despite advancements in automating salt segmentation through deep learning models, challenges persist due to large variations in seismic data quality, including low signal-to-noise ratios, different acquisition and processing techniques, complex geologies and distinct geo-locations. These factors affect model generalization capability. In this work, we introduce a data-centric framework for seismic data selection based on latent space distribution analysis. By visualizing and analyzing the latent spaces of seismic datasets from diverse geological regions, we establish correlations between dataset characteristics and model performance, enabling targeted data selection to improve robustness and generalization. Furthermore, we validate the universality of our framework by extending it to lung segmentation in chest X-rays, demonstrating its applicability across non-seismic domains.

Paper Nr: 96
Title:

Identification of Collided Vehicles in Indian Traffic Accidents Using Hierarchical Deep Learning Framework

Authors:

Shrusti Porwal and Preety Singh

Abstract: In Intelligent Transportation Systems (ITS), detecting accidents accurately and identifying the involved vehicles is crucial for assessing accident severity. This paper focuses on the challenge of identifying vehicles involved in accidents on Indian roads. We introduce a hierarchical detection model where the image is analyzed for accident detection and, if an accident is detected, further processed to identify the vehicles involved in the collision. We implemented various deep learning object detectors for accident and vehicle detection, finding that YOLOv8m performed best with an F1-score of 0.824 for accident detection and 0.827 for vehicle detection. The outputs of these detection units are integrated to identify collided vehicles, achieving an F1-score of 0.716 for this task.

Paper Nr: 119
Title:

GloFM: A GLORYS Flow-Matching Emulator for Spatio-Temporal Ocean Data Assimilation

Authors:

Pierre Garcia, Théo Archambault, Dominique Béréziat and Anastase Charantonis

Abstract: Providing regular and physically consistent predictions of the ocean state is critical for numerous scientific, operational, and societal needs. Observations of the ocean surface are gathered through various remote sensing and in situ instruments, and are typically assimilated into numerical models to reconstruct the ocean state. However, this often involves millions of data points, making it computationally intensive, which suggests deep learning may be a cheaper alternative. Deterministic data-driven approaches typically learn about ocean dynamics from numerical simulations or sparse observational data. However, such methods often lack physical realism in uncertain settings. Due to mode averaging, they produce non-physical or overly simplified states. Generative models offer a promising approach to generating physically realistic ocean states. We present GloFM: a Glorys Flow-Matching emulator for spatio-temporal ocean data assimilation. Our generative model produces coherent estimates of ocean surface fields. GloFM uses flow matching to assimilate observational data for nowcasting of surface currents, sea surface height (SSH), and sea surface temperature (SST). Compared to deterministic regression-based approaches, GloFM demonstrates improved realism metrics, capturing finer-scale variability and more physically plausible ocean states.

Paper Nr: 123
Title:

Optimal Camera Placement for Dynamic Scenes via Reinforcement Learning in Virtual Environments

Authors:

Grigorios Tsipouridis, Nikolaos Dimitriou, Efstathia Martinopoulou, Iason Karakostas and Dimitrios Tzovaras

Abstract: A practical problem for applications analysing human activity is the placement of cameras so as to maximize computer vision performance for tasks such as action recognition, pose estimation and instance segmentation. Common practice follows a trial-and-error approach in which different setups are tested on site until a satisfactory configuration is reached, based on empirically measuring area coverage and the visibility of the scene's occupants. In this work we propose a Reinforcement Learning (RL) framework to automate and optimize this process, taking into account camera specifications, scene geometry and human motion. It comprises a virtual 3D representation of a monitored space as the learning environment, and an RL agent that decides on each camera's position and orientation, trying to maximize the visibility of moving human avatars in the scene. We propose and evaluate alternative visibility metrics using ray casting from each camera, defining appropriate reward functions for RL, and compare the performance of the Twin Delayed Deep Deterministic Policy Gradient (TD3) and Soft Actor Critic (SAC) algorithms. Through extensive experiments we evaluate the proposed framework in different scenes, exploring the effect of different algorithmic choices and achieving intuitive results even in complex scenes.

Paper Nr: 139
Title:

A Computational Approach for Generative Modeling and Supervised Classification: Comparative Evaluation of GAN and Diffusion Models for Histological Data Augmentation

Authors:

Bianca Lanconi de Oliveira Garcia, Guilherme Botazzo Rozendo, Thiago Leal Pozati, Thaína Aparecida Azevedo Tosta, Marcelo Zanchetta do Nascimento and Leandro Alves Neves

Abstract: Histological datasets are often limited in size due to the high cost and expertise required for manual annotation, which restricts the performance and generalization of supervised models. Artificial data generation and augmentation have therefore become essential strategies to overcome data scarcity in computational pathology. This study presents a computational approach for the generation, augmentation, and evaluation of histological images, combining state-of-the-art generative models with explainability mechanisms. We considered different generation paradigms: GANs, XGANs, and diffusion models, such as DDIM and NCSN++. In the XGAN formulation, explanations derived from explainable artificial intelligence (XAI) techniques were incorporated into the generator’s loss function to enhance the structural consistency of synthetic images. Synthetic images were applied to augment small H&E datasets of colorectal cancer, breast cancer, and liver tissue, yielding improvements in supervised classifiers. Notably, ViT accuracy increased from 81.29% to 88.40% on the colorectal dataset, with StyleGAN3 augmentation. Diffusion models achieved the lowest FID scores of 33.36 (colorectal), 34.22 (liver), and 40.86 (breast), while XGANs and StyleGAN3 demonstrated competitive KID and IS metrics, indicating both high-quality synthesis and diversity. These results highlight the effectiveness of integrating generative modeling with explainability to enhance dataset augmentation, morphological fidelity, and classifier performance in histopathology, providing a promising pathway for interpretable computational pathology workflows.

Paper Nr: 143
Title:

DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection

Authors:

Morteza Poudineh and Marc Lalonde

Abstract: Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.

Paper Nr: 160
Title:

Hybrid Ensemble of Convolutional and Transformer Models for Retinal Image Quality Evaluation

Authors:

Luís Felipe Rocha Pereira, João Marcello Mendes Moreira, João Dallyson Sousa de Almeida, Anselmo Cardoso de Paiva, Aristófanes Corrêa Silva and Elaine de Paula Fiod Costa

Abstract: The quality of retinal fundus images is a crucial factor for reliable ophthalmic diagnosis, yet it is often compromised by artifacts such as blur, uneven illumination, and poor contrast. This work proposes a deep learning-based method for automated retinal image quality assessment (RIQA) using an unweighted model averaging ensemble that combines complementary architectures: Swin Transformer V2, MobileNetV3, and ConvNeXt. The proposed ensemble leverages the strengths of convolutional and transformer-based models to enhance robustness and generalization in multiclass classification of retinal image quality (Good, Usable, Rejected). Experiments were conducted on the EyeQ dataset, using 5-fold cross-validation and weighted loss to mitigate class imbalance. The ensemble achieved an F1-Score of 0.8887 ± 0.0025 and a Kappa score of 0.8395 ± 0.0038, competitive with results reported in the literature. Grad-CAM visualizations demonstrate that the ensemble effectively focuses on clinically relevant retinal regions (optic disc, macula, and vessel structures), reinforcing its interpretability and clinical relevance. The results confirm that simple ensemble strategies can deliver competitive performance while maintaining computational efficiency, offering a robust framework for automated quality control in retinal image analysis.

Paper Nr: 161
Title:

Frame Sequence Classification in Maritime UAV Surveillance: A Comparative Study of Temporal Deep Learning Architectures

Authors:

Joniel Bastos Barreto, Filipe F. Caetano, Juliana M. F. da Silva and Carlos H. Q. Forster

Abstract: Unmanned Aerial Vehicles (UAVs) have become indispensable tools in maritime Search and Rescue (SAR) operations, enabling wide-area monitoring and rapid response. Deep learning-based vision systems, especially Convolutional Neural Networks (CNNs) and hybrid architectures, dominate current maritime image analysis pipelines for detection and recognition. However, most existing approaches treat each video frame independently, fundamentally ignoring the temporal dependencies that capture object persistence, motion, and trajectory, crucial cues in dynamic maritime environments characterized by illumination variations, sea reflections, and motion blur. This work investigates the impact of temporal modeling on maritime UAV video classification through a comparative study between frame-based and sequence-based deep learning architectures. Three representative models are evaluated: a 3D CNN (R3D-18), a CNN-LSTM hybrid, and a CNN-Transformer model, all initialized with pre-trained weights and optimized via a gradual unfreezing fine-tuning strategy. The experiments employ short temporal clips (e.g., N=10 frames) extracted from real maritime UAV footage to evaluate classification robustness under challenging sea conditions. Results demonstrate that incorporating temporal information consistently improves classification accuracy and stability compared to single-frame baselines, emphasizing the importance of sequential modeling for reliable maritime video understanding. This study contributes to the development of robust and efficient temporal-aware models, advancing the application of deep learning to UAV-based maritime surveillance and SAR scenarios.

Paper Nr: 186
Title:

The Threshold Paradox: Why Calibrating on the Test Set Introduces Bias in Anomaly Detection

Authors:

Aurélie Cools, Sédrick Stassin and Sidi Ahmed Mahmoudi

Abstract: Anomaly detection plays a central role in quality control systems within Industry 4.0. While recent deep learning approaches report promising AUROC scores, current evaluation protocols calibrate detection thresholds using test data annotations, a practice that violates unsupervised learning principles. We demonstrate through concrete examples and detailed analysis that standard benchmarks often include annotation errors, and test-calibrated methods systematically adapt to these errors, masking real defects. This creates a performance overestimation compared to deployment conditions. In this paper, we propose a hybrid approach combining an extension of Dinomaly, a self-supervised anomaly detection method, with reconstruction capabilities and a statistical Kernel Density Estimation-based threshold calibration using only normal training data. Our method achieves a recall of 98.1% on MVTec AD without test supervision, matching the state of the art while revealing annotation inconsistencies. This ensures truly unsupervised evaluation aligned with industrial constraints.

Paper Nr: 188
Title:

Separation-Aware Downsampling for Vehicle Detection

Authors:

Yupei Guo, Yota Yamamoto and Yukinobu Taniguchi

Abstract: Vehicle detection faces significant challenges due to severe occlusion and scale variation, especially in images captured by Unmanned Aerial Vehicles (UAVs) and Closed-Circuit Television (CCTV). To address these issues, we propose a novel Separation Strategy that rethinks downsampling operations in object detectors. Our approach introduces two specialized downsampling modules: Phase-Split Downsampling (PSD) for spatial separation during feature extraction, preserving fine-grained details essential for small vehicle detection; and Wavelet Transform Downsampling (WTD) for frequency separation during feature fusion, allowing distinct processing of global context and local details. By integrating these modules into a one-stage detection framework, we enhance the model’s ability to handle occlusion and scale variation effectively. Extensive experiments on UAV and CCTV datasets demonstrate that our method outperforms state-of-the-art detectors in both accuracy and efficiency. We improve mAP by up to +2-3% over the baseline on the VisDrone dataset and +6% on the MLITcctv dataset. These results validate the effectiveness of our Separation Strategy in challenging vehicle detection scenarios.

Paper Nr: 190
Title:

ControlNet-Guided Diffusion for Realistic Plant Leaf Generation

Authors:

Georgios Tsoumplekas, Mia Pham, Yannis Spyridis, Thomas Lagkas, Georgios Th. Papadopoulos, Panagiotis Sarigiannidis and Vasileios Argyriou

Abstract: Deep learning models for plant disease recognition require large and diverse datasets, yet annotated leaf imagery is costly to collect and often imbalanced for rare diseases. Synthetic augmentation can alleviate scarcity, but existing generative models frequently fail to preserve canonical leaf morphology such as contour, venation and lesion placement. To address this, we propose a structure-guided synthesis framework based on Stable Diffusion augmented with ControlNet, where segmentation masks serve as explicit morphological priors during denoising. This conditioning constrains generation to the valid leaf manifold while preserving high-frequency disease texture. Experiments on the PlantVillage dataset show that our method achieves higher fidelity and diversity than unguided diffusion and performs on par with or better than StyleGAN on quantitative metrics while producing qualitatively superior samples in terms of anatomical consistency. These results demonstrate that morphology-aware diffusion is an effective strategy for plant image generation and can serve as a reliable augmentation tool for downstream plant disease classification.

Paper Nr: 206
Title:

Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms

Authors:

Bin Zeng, Johannes Künzel, Anna Hilsmann and Peter Eisert

Abstract: Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform while arriving. While the hardware requirements are modest, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, the MOT-RailwayPlatformCrowdHead dataset (MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.

Paper Nr: 207
Title:

Dynamic Kernel Linear Attention Mechanism

Authors:

Zhiwei Guo, Qinxia Hu and Xiao Hu

Abstract: The softmax attention mechanism in Transformers delivers strong performance in computer vision but suffers from O(N²) complexity in long-sequence scenarios. While linear attention reduces complexity to O(N) via kernel approximation, it underperforms softmax attention due to attention dispersion and query magnitude loss. In this paper, a dynamic kernel module (DKM) and query-aware normalization (QN) are designed and integrated into a Dynamic Kernel Linear Attention Mechanism (DKLA). DKM generates a novel kernel function with query-specific parameters for semantic-adaptive key-value transformation, making attention more focused, while QN preserves query magnitude via a sigmoid function. On COCO 2017 object detection, DKLA achieves 46.2% APb. For semantic segmentation on ADE20K 2016, it reaches 47.0% mIoU. Ablation studies demonstrate that DKM and QN effectively improve the linear attention mechanism.

Paper Nr: 211
Title:

Inpainting of Sparse Tracks Image Satellite Using Plug and Play and Learned Prior

Authors:

Dmitrii Drozdov, Pierre Garcia, Dominique Béréziat and Anastase Charantonis

Abstract: We propose a novel method for reconstruction of high-resolution Sea Surface Height (SSH) fields from sparse along-track satellite altimetry. We explore the usage of an observation-driven deep-learning method for inpainting, with a focus on diffusion-based generative models for spatial reconstruction and data assimilation. The study includes data preparation from reanalyses and satellite observations, the definition of evaluation metrics relevant to oceanography, and comparison with baselines. We discuss model design choices, uncertainty characterization through stochastic sampling, and limitations in real-world deployment. Effectively, in this work, we develop and validate an observation-driven prior, allowing us to sample from the ground-truth distribution of SSH. By not relying on simulation results for training, we propose a step towards observation-driven Deep-Learning analysis of SSH and its uncertainties at small scales.

Paper Nr: 212
Title:

Application of GANs for Segmenting Cancerous Masses in Mammograms

Authors:

Afifa Dahmane, Yanis Hammouche and Hadja Faiza Haned Khellaf

Abstract: Early detection of breast cancer plays a pivotal role in improving patient survival rates. Mammography is the most widely used screening technique, yet accurate tumor segmentation remains challenging due to the limited availability of annotated data and the variability of image quality. In this work, we introduce a GAN-based transfer learning approach for breast tumor segmentation using limited supervision. A Pix2Pix-type Generative Adversarial Network is first trained to reconstruct mammography images, enabling the generator to learn meaningful structural and textural representations of breast tissue. The encoder weights of the trained generator are then transferred to initialize a U-Net–based segmentation model, enhancing its feature extraction capabilities and accelerating convergence. Our goal is to show that incorporating GANs into the training pipeline can make segmentation models more robust, more discriminative, and less dependent on large annotated datasets.

Paper Nr: 220
Title:

Video-Based Locomotion Analysis for Fish Health Monitoring

Authors:

Timon Palm, Clemens Seibold, Anna Hilsmann and Peter Eisert

Abstract: Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates locomotion activities from videos using multi-object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11 architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.

Paper Nr: 222
Title:

When Reality Is Not Enough: Synthesizing Single Frame Defects for Improving Defect Segmentation in Film

Authors:

Martin Korb, Stefanie Onsori-Wechtitsch and Peter Schallauer

Abstract: We improve the segmentation and restoration of semi-transparent single-frame defects (SFDs) in film by training a pixel segmenter on a mixture of synthetic and scarce real data. For synthesis, we use an image with SFDs and a manually restored version, and estimate the per-pixel defect foreground F and opacity map Λ in the compositing model I = ΛF + (1 − Λ)B. Λ is first initialized by a monochromatic least-squares baseline and then variationally refined, with Laplace smoothing for F and Λ. From the reconstructed mattes, we create an SFD pool and generate opacity-aware composites via alpha blending into clean backgrounds, with controllable sampling according to defect size and contrast categories. For segmentation, we adapt Mask2Former to a nine-channel input (three consecutive RGB frames) to utilise temporal context without motion compensation. Evaluation is conducted using overall and size-stratified Intersection over Union (IoU). Experiments indicate that opacity-aware synthesis improves the robustness of segmentation under data scarcity, while producing visually natural insertions.

Paper Nr: 227
Title:

Lightweight and Efficient Deep Learning Architecture for Retinal Vessel Segmentation

Authors:

Henda Boudegga, Rostom Kachouri and Yaroub Elloumi

Abstract: Retinal vessel segmentation is a major step in computer-aided diagnosis of various ocular and systemic diseases. While deep learning models, particularly U-Net-based encoder-decoder architectures, have demonstrated efficient performance in medical applications, they are hindered by high computational cost. These limitations prevent real-time segmentation and restrict the large-scale deployment of medical screening systems due to the required computing resources. In this study, we propose a novel efficient and lightweight network for retinal vessel segmentation. Our contribution consists of optimizing the convolutional blocks, minimizing the depth of the encoder-decoder design, and incorporating learnable upsampling strategies. The proposed model is evaluated on the DRIVE database and confirms an efficient tradeoff between accuracy and computational cost comparable to traditional U-Net models, achieving average accuracy, sensitivity and specificity of about 95%, 75.14% and 95.31%, respectively, in 0.2 seconds per fundus image.

Paper Nr: 246
Title:

Are Superpixels the Missing Ingredient for Robust ViT Pruning?

Authors:

Ahmed Soulmani, Martyna Poreba and Michal Szczepanski

Abstract: Vision Transformers (ViTs) rely on a rigid grid of square patches that fails to reflect natural image structure. To address this issue, recent studies have explored superpixel-based tokenization, which enforces spatially coherent grouping. In parallel, feature-driven patch merging has emerged as an effective technique for lowering the token count, offering a model-aware mechanism to cut computational cost. Against this backdrop, we introduce Merger, a descriptor-guided patch merging strategy that forms irregular, content-aware groupings reminiscent of superpixels, standing in clear contrast to existing methods. Under the SuiT framework, we directly compare two paradigms: Merger against widely used FastSLIC superpixelization. This allows us to examine whether pixel-level spatial coherence or feature-level affinity provides a better foundation for token pruning, particularly under aggressive compression. Our findings indicate that superpixels constitute an effective and robust pruning mechanism, outperforming patch-level merging by up to 1.6% in accuracy retention. We also find that allowing patch groups to grow with unconstrained shapes and sizes leads to heterogeneous aggregations that degrade final accuracy, highlighting the critical role of spatial constraints once patches become coarse aggregation units.

Paper Nr: 248
Title:

A PeBERT Model for Generating Keywords for Retinography Images

Authors:

Pedro Victor de A. Fonseca, João Dallyson Sousa de Almeida and Geraldo Braz Junior

Abstract: The automatic analysis of retinography images is essential for optimizing the diagnosis of eye diseases. Generating keywords is a crucial task for indexing exams and creating structured reports, but it poses significant challenges, especially given the diversity of clinical findings. This work presents an encoder-decoder architecture called PeBERT, which combines the extraction of visual features from the Perception Encoder with a BERT text decoder, in the task of multi-label image tagging using the DeepEyeNet (DEN) dataset in two different scenarios: one where there is sampling control based on the top 10 most frequent keywords in the vocabulary present in the database, and another that acts on the complete vocabulary. The proposed architecture achieves 0.52 ± 0.01 for ROUGE-L and 0.50 ± 0.02 for CIDEr in the complete vocabulary, and 0.71 ± 0.01 for both metrics in the frequency-controlled space, demonstrating superior performance compared with traditional architectures. PeBERT thus yields robust keyword outputs for retinography exams across both vocabularies.

Paper Nr: 257
Title:

Sin Loss for Class Imbalance and Hard Samples

Authors:

Reiji Saito and Kazuhiro Hotta

Abstract: Supervised multi-illumination anomaly detection is essential in industrial inspection. However, since anomalous samples are significantly fewer than normal ones, severe class imbalance inevitably occurs. As a result, models tend to overfit normal regions, leading to insufficient learning of anomalous regions and ultimately degrading anomaly detection performance. To address this problem, we propose a novel loss function called Sin loss. Sin loss allocates more learning capacity to regions where the predicted probability of the ground truth class is low, such as hard samples or minority sample classes, while suppressing learning in regions where the predicted probability of the ground truth class is high. This mechanism effectively enhances the discriminative ability for hard samples or minority sample classes while preventing overfitting. In our experiments, we evaluate the proposed method on the CSEM-MISD benchmark used for supervised multi-illumination anomaly detection and the CIFAR100 benchmark used for image classification. Although CIFAR100 is not an imbalanced dataset, we include it to verify whether Sin loss is also effective for hard samples in general classification tasks. Compared with conventional loss functions designed to address class imbalance, our method achieves state-of-the-art performance on both benchmarks.

Paper Nr: 263
Title:

GlovEgo-HOI: Bridging the Synthetic-to-Real Gap for Industrial Egocentric Human-Object Interaction Detection

Authors:

Alfio Spoto, Rosario Leonardi, Francesco Ragusa and Giovanni Maria Farinella

Abstract: Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint-Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: https://github.com/NextVisionLab/GlovEgo-HOI.

Paper Nr: 267
Title:

Beyond Accuracy: A Counterfactual Analysis of Plant Disease Classifiers

Authors:

Hajer Mejri and Sébastien Roy

Abstract: What is the causal effect of Northern Leaf Blight (NLB) lesions on an image plant disease classifier’s output? How can we assess the trustworthiness of its decisions while accounting for confounders? These are causal questions that many explainability methods, which rely purely on correlations between features rather than causation, fail to answer. This reliance on correlation may result in potentially misleading explanations. In this paper, we propose a causal inference evaluation framework called 'Beyond Accuracy' (BA), which is based on causal interventions and counterfactual reasoning to measure the causal effect (CE) of the absence or presence of lesions on the classifier’s predictions for plant disease classification. Estimating the causal effect can be challenging since it relies on the removal of lesions for its intervention, which is an open problem. To achieve this, we propose and compare two methods for concept removal: a heuristic approach that relies on pixel masking with black boxes, and a generative method that utilizes a diffusion model. We evaluate our method on the Northern Leaf Blight disease dataset. Our results show that our method helps to measure the causal effect of lesions on the classifier’s output while accounting for confounders, and highlights the extent to which lesions contribute to the decision making.

Paper Nr: 300
Title:

Explaining the Inexplicable: An Explainable AI Approach to Boost Knowledge Transfer

Authors:

Pietro Manganelli Conforti, Lorenzo Papa, Irene Amerini and Paolo Russo

Abstract: Deep learning has revolutionized performance across diverse applications but is often criticized for its opaque nature and computational complexity, complicating efforts to understand and trust model predictions. Explainability techniques have emerged to address these issues, enhancing model transparency and user trust. Simultaneously, efficiency techniques such as Knowledge Distillation offer solutions to streamline complex (teacher) models into more efficient (student) ones while preserving their predictive capabilities. Building on those research efforts, this paper presents a general framework that is able to improve well-known distillation techniques through explainability methods, boosting both the interpretability and accuracy of shallow student architectures. In more detail, by incorporating a novel loss function that aligns the explainability maps of teacher and student models, our method refines the distillation process, leading to more accurate and interpretable predictions. Our approach not only improves the performance of distilled models but also demonstrates the effective integration of explainable insights into distillation frameworks, as validated by multiple experiments on a publicly available benchmark dataset.

Paper Nr: 307
Title:

Enhanced Class-Expertise Weighted Aggregation

Authors:

Stepan Kasai, Sviatoslav Stumpf, Alexey Zabashta and Valeria Efimova

Abstract: The growing need for large, high-quality labeled datasets has made crowdsourcing a mainstream data annotation strategy, particularly in fields such as computer vision (CV), where visually labeled, manually annotated data is indispensable for training and evaluating models. However, traditional aggregation methods often assume uniform skill levels among annotators and fail to account for differences in skill between classes. Recent hybrid approaches combining model predictions with human labels demonstrate promising results, but they lack dynamic, context-sensitive weighting. To address these issues, we present Enhanced Class-Expert-Based Weighted Aggregation (EnhancedCEWA), a framework that combines annotators’ class expertise, inter-annotator agreement, and model confidence into a single aggregation scheme. EnhancedCEWA adapts the contributions of annotators and predictive models to each instance, leveraging high-confidence model predictions and robustly handling sparse annotations. We also propose refined annotator quality metrics that balance accuracy and class-specific agreement, allowing for a detailed assessment of annotator reliability. Empirical evaluation shows that EnhancedCEWA improves the quality of agreed-upon labels compared to traditional and hybrid methods, achieving accuracy of up to 0.829 on Food101N and 0.669 on Clothing1M, while achieving near-perfect Spearman correlation coefficients of 0.998 and 0.988, respectively, for assessing annotator quality. Our results highlight the advantages of dynamic class-based weighting, demonstrating that incorporating both annotator expertise and model confidence leads to more accurate and robust aggregation of crowdsourced labels.

Paper Nr: 317
Title:

CRAX: Parameter-Efficient Fine-Tuning of SAM2 for Interactive Crack Annotation

Authors:

Filip Hendrichovsky, Raphael Pickl, Lukas Bednar and Daniel Soukup

Abstract: We study how to adapt the Segment Anything Model 2 (SAM2) for interactive segmentation of thin, crack-like structures. Rather than building a fully automatic crack detector, we focus on the annotator’s perspective: how many clicks are needed to obtain boundary- and topology-faithful masks. To this end, we curate CRAX, a multi-domain corpus of 35 datasets covering surface cracks, retinal vessels, and plant roots/mycelium, together with leakage-controlled leave-one-domain-out and 5-fold leave-one-dataset-out splits tailored to interactive evaluation. On top of SAM2, we systematically compare three fine-tuning strategies: decoder-only, Low-Rank-Adaptation+decoder, and full encoder+decoder. Using a conservative click simulator and a topology-aware metric suite, we show that full fine-tuning substantially improves performance on challenging thin-structure domains, almost doubling boundary IoU and nearly halving Hausdorff distance in the hardest cross-domain setting, while consistently improving cross-dataset generalization within cracks. Most importantly, fine-tuned models reach, and often surpass, the baseline’s 9-click quality after only one to two clicks, reducing annotation effort and making SAM2 more effective for large-scale crack and crack-like labeling.

Paper Nr: 329
Title:

NEF: Neural Error Fields for Follow-up Training with Fewer Rays

Authors:

Alessandro Luchetti, Kenta Ito, Dieter Schmalstieg, Denis Kalkofen and Shohei Mori

Abstract: A Neural Radiance Field (NeRF) is capable of representing scenes by capturing view-dependent properties from a specific set of images through neural network training. The lack of a sufficient initial image set implies that an additional photographing session and further training are required to improve the final view synthesis. For this purpose, we introduce a new variant of NeRF training analysis, termed the Neural Error Field (NEF). NEF visualizes and identifies view-dependent errors to reduce the number of ray samples used in the follow-up training. NEF does not require modifications to the NeRF core and training process. We evaluate and verify the accuracy of the results achieved with NEF on several public datasets, including real and synthetic images, and bounded and unbounded scenes.

Paper Nr: 331
Title:

Synthetic Aerial Image Generation with Controllable Diffusion Models for Pinus Segmentation under Data Scarcity

Authors:

Thiago Innani Justus, Rodrigo Minetto, Gilson Giraldi and Mauren Louise Sguario

Abstract: Biological invasion by exotic species, such as Pinus, represents a critical threat to global biodiversity, with severe ecological impacts in the Campos Gerais region, Brazil. While deep learning offers promising tools for monitoring these species via remote sensing, its effectiveness is often hampered by the scarcity of high-quality labeled training data. To address this bottleneck, this work proposes a generative data augmentation methodology for aerial imagery leveraging the Seg2Sat framework, a Latent Diffusion Model based on Stable Diffusion and ControlNet. By fine-tuning the model on a limited dataset of drone imagery, we synthesized photorealistic samples conditioned by semantic segmentation masks and text prompts. The efficacy of this strategy was rigorously evaluated across seven semantic segmentation architectures, including members of the YOLO, U-Net, and Transformer families. Results demonstrate that integrating synthetic data enhanced the performance of 6 out of 7 models, achieving F1-Score gains of up to 6.1% for U-Net. Furthermore, the study reveals an architecture-dependent sensitivity to augmentation ratios: while Transformers (SegFormer) peaked with a 10% synthetic data injection, Convolutional Neural Networks (CNNs) benefited most from larger increments of 15–20%.

Paper Nr: 333
Title:

A Neuro-Symbolic Framework of Behaviors Recognition for Human Robot Interaction

Authors:

Tao Wang and Hao Tang

Abstract: Service robotics has been growing rapidly in areas like logistics, health care and retail spaces, where robots need to operate smoothly and safely in human-centered environments. Monitoring Human-Robot Interaction (HRI) in such dynamic environments requires not only robust action detection but also high-level behavior understanding. In this work, we propose a Neuro-Symbolic (NeSy) framework that combines deep-learning based perception with symbolic reasoning to recognize HRI and understand three general behaviors: approaching, following and interacting. On the perception side, we use a YOLO-based detector to detect persons and robots from multi-view cameras. We then apply NorFair, a multi-object tracking system, to associate detections with tracks and generate trajectories. These trajectories are translated into a unified 3D coordinate frame using the camera calibration. On the reasoning side, Prolog-based rules are applied to interpret human and robot tracks and infer high-level behaviors in the spatio-temporal domain. This hybrid approach couples data-driven human-robot recognition with rule-based behavior explanation. We evaluate our framework using the multi-view synthetic warehouse dataset PhysicalAI-SmartSpaces. Our work demonstrates improved precision and recall in recognizing various HRIs using the NeSy approach compared to the vision-only baseline. Integrating neural perception with symbolic reasoning provides robust explanations for each detected event, and demonstrates a promising path toward reliable human-robot behavior understanding in complex and dynamic environments.

Paper Nr: 337
Title:

Efficient Video Segmentation with Differential Networks

Authors:

Artur A. Oliveira, Mateus Espadoto, Roberto M. Cesar Jr. and Roberto Hirata Jr.

Abstract: Real-time video segmentation is computationally demanding, particularly for high-resolution and high-frame-rate streams. While many methods exploit pixel differences for optimization, we propose a novel approach that learns to predict changes in segmentation masks by modeling how a pre-trained segmentation model’s outputs evolve with input differences. By leveraging a grid-based representation of segmentation masks and selectively updating frames based on predicted changes, our method significantly reduces the frequency of expensive predictions from the base model while maintaining high accuracy. Experiments demonstrate that our approach achieves over 78% frame skipping with grid accuracy exceeding 99%, offering a scalable and efficient solution for real-world video processing tasks.
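The update policy this abstract describes can be sketched in a few lines. The sketch below is illustrative, not the authors' implementation: `expensive_model`, `change_score`, and the threshold are stand-ins for the pre-trained segmentation model, the learned change predictor, and a tuned grid-change criterion.

```python
import numpy as np

def stream_segment(frames, expensive_model, change_score, threshold=0.01):
    """Run `expensive_model` only when the predicted change exceeds
    `threshold`; otherwise reuse the cached mask (illustrative policy)."""
    cached_mask, cached_frame = None, None
    masks, skipped = [], 0
    for frame in frames:
        if cached_mask is None or change_score(cached_frame, frame) > threshold:
            cached_mask, cached_frame = expensive_model(frame), frame
        else:
            skipped += 1
        masks.append(cached_mask)
    return masks, skipped

# Toy stand-ins: the "model" thresholds the frame; the change score is mean abs diff.
model = lambda f: (f > 0.5).astype(np.uint8)
score = lambda a, b: float(np.abs(a - b).mean())

static = np.full((8, 8), 0.7)
frames = [static] * 9 + [np.full((8, 8), 0.2)]   # one real change at the end
masks, skipped = stream_segment(frames, model, score)
print(skipped)  # -> 8 of 10 frames reuse the cached mask
```

Only two frames trigger the base model here; in the paper the decision is made per grid cell by a learned differential network rather than a global pixel difference.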

Paper Nr: 341
Title:

From Synthetic Data to Real Restorations: Diffusion Model for Patient-Specific Dental Crown Completion

Authors:

Dávid Pukanec, Tibor Kubík and Michal Španěl

Abstract: We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: https://github.com/ikarus1211/VISAPP_ToothCraft.

Paper Nr: 34
Title:

Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization

Authors:

D. Faget, J. L. Lisani and M. Colom

Abstract: Planet-scale photo geolocalization involves the intricate task of estimating the geographic location depicted in an image purely based on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model’s decisions, offering deeper insights than the traditional approaches.

Paper Nr: 38
Title:

Similarity Analysis in Source Code Using 1D Signal

Authors:

Kaique Venuto Mancuzo and André Ricardo Backes

Abstract: Source code plagiarism is a recurring issue in academia, and its manual detection is a highly time-consuming task due to the large volume of assignments in programming courses. To address this, this study proposes an approach to measuring similarity between source codes using signal processing techniques, treating them as one-dimensional signals. The hypothesis is that this approach may be more resistant to obfuscation techniques than conventional methods. We explored three approaches: time-domain analysis, the Fourier Transform, and the Wavelet Transform. We evaluated these metrics on datasets containing previously identified cases of plagiarism and compared them with the MOSS and JPlag tools. The results indicate that time-domain analysis, particularly with Dynamic Time Warping (DTW) distance and Pearson correlation, was the most effective in identifying plagiarism, achieving performance comparable to traditional tools.
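The time-domain comparison can be illustrated with a minimal sketch. The signal extraction below (per-line indentation depth) is a hypothetical choice, not necessarily the authors' feature; the DTW recurrence itself is the standard O(nm) formulation.

```python
# Sketch: treat two source files as 1D signals and compare them with
# dynamic time warping. Signal choice here is illustrative only.

def to_signal(source: str) -> list[float]:
    """Hypothetical signal extraction: indentation depth per non-empty line."""
    return [float(len(line) - len(line.lstrip()))
            for line in source.splitlines() if line.strip()]

def dtw_distance(a: list[float], b: list[float]) -> float:
    """Standard DTW with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

original = "def f(x):\n    y = x + 1\n    return y\n"
renamed  = "def g(a):\n    b = a + 1\n    return b\n"   # renamed identifiers
print(dtw_distance(to_signal(original), to_signal(renamed)))  # -> 0.0
```

Renaming identifiers leaves the indentation signal untouched, which is exactly the kind of obfuscation resistance the hypothesis targets.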

Paper Nr: 76
Title:

AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark

Authors:

Xinhao Xiang, Xiao Liu, Zizhong Li, Zhuosheng Liu and Jiawei Zhang

Abstract: The rapid advancement in AI-generated video synthesis has led to a growing demand for standardized and effective evaluation metrics. Existing methods lack a unified framework for systematically categorizing methodologies, limiting comprehensive understanding and reuse. Many also suffer from fragmented design, incompatible environments, dataset-specific dependencies, and redundant processing logic. To address these challenges, we introduce AIGVE-Tool (AI-Generated Video Evaluation Toolkit), a modular and extensible unified framework for evaluating AI-generated videos. Built on a novel five-category taxonomy, it supports flexible metric integration, dataset abstraction, and configuration-driven workflows. We further propose AIGVE-Bench, a large-scale benchmark of 2,430 videos and 21,870 human ratings across nine aspects, generated by five state-of-the-art models. Experiments demonstrate AIGVE-Tool’s effectiveness in capturing model strengths and weaknesses across diverse scenarios, advancing rigorous and reproducible video evaluation. Together, the five-category taxonomy, AIGVE-Tool, and AIGVE-Bench form a unified ecosystem for standardizing, analyzing, and advancing the evaluation of AI-generated video content. We open-source the AIGVE-Tool toolkit and publish the AIGVE-Bench benchmark at: https://www.aigve.org/.

Paper Nr: 81
Title:

Recursive Memory Transformers for Scalable Analysis of Complex Vector Drawings

Authors:

Andrey Pimenov, Vera Terenteva, Sergey Muravyov and Valeria Efimova

Abstract: Recent progress in neural networks has advanced vector graphics analysis; however, existing methods cannot directly process vector images with unlimited components without rasterization. While Large Language Models (LLMs) can generate vector graphics, they remain constrained by the transformer attention window. This paper introduces two novel architectures for analyzing and processing complex vector images and demonstrates their use in similarity search and plagiarism detection. The List-based Context Recursive Memory Transformer optimizes processing efficiency by integrating multiple attention window sizes, while the History-based Context Recursive Memory Transformer enhances long-term contextual understanding. Both approaches effectively process intricate vector structures and large technical drawings containing over 10^5 entities. Experiments conducted on a dataset of 600,000 vector and plagiarized images show that the proposed models outperform raster-based methods, including Vision Transformer (ViT) and recursive BERT. The list-based model achieved 91% positive and 88% negative detection accuracy, and the history-based model reached 85% and 91%, respectively, compared to 79% and 6% for ViT. These results demonstrate an improvement of over 10% in overall accuracy and confirm the models’ capacity for scalable, context-aware vector image analysis. The approaches are promising for engineering design, retrieval, and plagiarism detection applications.

Paper Nr: 82
Title:

Towards Efficient Semantic Segmentation Deep Neural Networks: Case Study of Unstructured Pruning with DeepLabV3

Authors:

Erika-Melinda Kali and Mihai Negru

Abstract: Unstructured pruning has shown potential in compressing deep neural networks for classification tasks, but its use in semantic segmentation is still relatively unexplored. In this work, we investigate the efficiency of magnitude-based unstructured pruning (Han et al., 2015) on DeepLabV3 (Chen et al., 2017) with a ResNet-50 (He et al., 2016) backbone, using the Cityscapes (Cordts et al., 2016) dataset. We explore both per-layer and global one-shot pruning strategies and analyze their impact on segmentation performance. We further fine-tune to recover the lost performance. To examine practical applicability, we deploy the pruned models on specialized hardware and evaluate their inference speed-ups. According to our findings, global pruning provides a better sparsity-accuracy trade-off, maintaining good performance up to 60% sparsity without any fine-tuning. At 80% sparsity, after 30 fine-tuning epochs, the network has an accuracy drop of only 2.16% compared to the baseline. When deployed, the pruned models achieve speed-ups of up to 3.95x. Finally, for global pruning, we provide an analysis of the resulting sparsity distributions at the threshold where performance starts to degrade, providing insights on redundancy and layer-wise sensitivity within the network.
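Global one-shot magnitude pruning, the strategy that fares best in this study, amounts to ranking all weights across layers by absolute value and zeroing the smallest fraction. A framework-agnostic numpy sketch with illustrative layer shapes (the paper operates on DeepLabV3/ResNet-50 weights in a deep-learning framework):

```python
import numpy as np

def global_magnitude_prune(layers, sparsity):
    """One-shot global unstructured pruning: zero the smallest-magnitude
    weights across *all* layers until the target sparsity is reached."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    k = int(sparsity * all_mags.size)
    if k == 0:
        return [w.copy() for w in layers]
    threshold = np.partition(all_mags, k - 1)[k - 1]  # k-th smallest magnitude
    return [np.where(np.abs(w) <= threshold, 0.0, w) for w in layers]

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)), rng.normal(size=(16, 4))]
pruned = global_magnitude_prune(layers, sparsity=0.6)
total = sum(w.size for w in pruned)
zeros = sum(int((w == 0).sum()) for w in pruned)
print(zeros / total)  # -> 0.59375 (76 of 128 weights zeroed)
```

Because the threshold is computed over the pooled weights, layers with many small weights end up sparser than sensitive layers, which is the layer-wise sensitivity effect the paper analyzes.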

Paper Nr: 109
Title:

Focal-FusionNet: A Dual-Scale Attention Network for Accurate Small Object Detection with False Positive Suppression

Authors:

Shubham Kumar Dubey, J. V. Satyanarayana and C. Krishna Mohan

Abstract: Detecting small objects in complex and cluttered scenes remains a significant challenge for modern object detectors, particularly due to scale variation, background noise, and high false-positive rates. To address these issues, we propose Focal-FusionNet, a robust detection framework that integrates multi-scale feature fusion with context-aware attention to enhance discriminative representations for small objects. The network leverages focal feature aggregation to emphasize fine-grained spatial details while simultaneously incorporating surrounding contextual cues. In addition, a dedicated False Positive Suppression Module (FPSM) is introduced to explicitly reduce spurious detections arising from visually similar background patterns. Extensive experiments on benchmark datasets including COCO, VisDrone, and DOTA demonstrate that Focal-FusionNet consistently outperforms state-of-the-art detectors such as YOLO and RetinaNet, achieving superior detection accuracy, improved small-object recall, and enhanced robustness under noisy and densely populated scenarios.

Paper Nr: 110
Title:

PRT: A Training-Free Planar Tracking Framework via Point Correspondence and Robust RANSAC Estimation

Authors:

Yong Kim, Hojae Kim, Suhyun Kim, Hyeonji Kim, Huijin Choi and Yonghyun Park

Abstract: Planar tracking plays a crucial role in applications such as augmented reality, SLAM, and visual localization. Recent deep learning-based approaches have achieved remarkable progress by training on large-scale synthetic datasets generated from MS COCO images with artificial homography perturbations. However, these methods suffer from a significant domain gap between synthetic and real-world imagery, leading to degraded accuracy under complex conditions such as illumination changes, motion blur, or occlusion. Moreover, their reliance on supervised regression of homography parameters restricts generalization to unseen planar surfaces and camera motions. To overcome these limitations, we introduce PRT (Point-based RANSAC Tracker) — a training-free planar tracking framework that integrates point-level correspondence estimation with robust geometric reasoning. Rather than learning homography mappings from synthetic transformations, PRT leverages dense correspondences extracted from pretrained point tracking backbones and computes homographies using RANSAC and MAGSAC++. This design effectively combines the generalization power of correspondence-based representations with the robustness of probabilistic model fitting, eliminating the need for domain-specific training. Extensive experiments on POT-210 and POT-280 benchmarks demonstrate that PRT achieves state-of-the-art accuracy without any synthetic supervision, outperforming prior works such as WOFT and HDN by a significant margin. In particular, PRT exhibits strong resilience to blur, occlusion, and out-of-view scenarios, where training-based models typically fail. These results highlight the effectiveness of geometry-driven, training-free planar tracking, and suggest a promising direction for domain-agnostic visual correspondence research.
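The geometric core of such a pipeline, robust homography fitting over point correspondences, can be sketched without any learned component. The code below is not the PRT implementation: MAGSAC++ is replaced by vanilla RANSAC for brevity, and the correspondences come from a synthetic homography rather than a point-tracking backbone.

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct Linear Transform: fit a 3x3 homography from >= 4 correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def reproj_error(H, src, dst):
    """Per-point reprojection error of src mapped through H, against dst."""
    p = np.c_[src, np.ones(len(src))] @ H.T
    return np.linalg.norm(p[:, :2] / p[:, 2:3] - dst, axis=1)

def ransac_homography(src, dst, iters=200, thresh=2.0, seed=0):
    """Vanilla RANSAC over 4-point samples, then a refit on all inliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        inl = reproj_error(dlt_homography(src[idx], dst[idx]), src, dst) < thresh
        if inl.sum() > best.sum():
            best = inl
    return dlt_homography(src[best], dst[best]), best

# Synthetic check: 22 exact correspondences under a known homography + 8 outliers.
rng = np.random.default_rng(1)
H_true = np.array([[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]])
src = rng.uniform(0, 100, size=(30, 2))
hom = np.c_[src, np.ones(30)] @ H_true.T
dst = hom[:, :2] / hom[:, 2:3]
dst[22:] += rng.uniform(30, 60, size=(8, 2))  # gross outliers
H_est, inliers = ransac_homography(src, dst)
print(int(inliers.sum()))
```

The outliers are rejected by consensus alone, which is why such a scheme needs no domain-specific training; MAGSAC++ refines the same idea by marginalizing over the inlier threshold.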

Paper Nr: 121
Title:

FALQON-MST: A Fully Quantum Framework for Graph Optimization in Vision Systems

Authors:

Guilherme E. L. Pexe, Lucas A. M. Rattighieri, Leandro A. Passos, Douglas Rodrigues, Danilo S. Jodas, João P. Papa and Kelton A. P. da Costa

Abstract: Finding the minimum spanning tree (MST) of a graph is an important task in computer vision, as it enables a sparse and low-cost representation of connectivity among elements (such as superpixels, points, or regions), which is useful for tasks such as segmentation, reconstruction, and clustering. In this work, we propose and evaluate a fully quantum pipeline for computing MSTs using the FALQON algorithm, a feedback-based quantum optimization method that does not require classical optimizers. We construct a Hamiltonian formulation whose ground-state energy encodes the MST of a graph and compare different FALQON strategies: (i) time rescaling (TR-FALQON) and (ii) multi-driver configurations. To avoid domain-specific biases, we adopt graphs with random weights and show that the FALQON variants exhibit significant differences in ground-state fidelity. We discuss the relevance of this approach for computer vision problems that naturally yield graph representations, and experimental results on synthetic instances together with a small demonstrative study on image segmentation illustrate both the potential and the current limitations of the method. Our numerical simulations on randomly weighted graphs show that standard single-driver FALQON, although it reduces the expected energy, fails to concentrate amplitude in the MST solution. The multi-driver variant succeeds in redistributing probability mass toward the ground state so that the MST appears among the most probable outcomes, and TR-FALQON applied on top of the multi-driver configuration produces the best results, with faster convergence, lower final energy, and the highest solution-state probability (fidelity) in our tested instances. These improvements were observed on small synthetic graphs, underscoring both the promise of multi-driver controls with temporal rescaling and the need for further scaling and hardware validation.
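The reference MST against which such quantum pipelines are compared is cheap to compute classically. As an illustrative baseline (not the paper's code), Kruskal's algorithm with a union-find structure:

```python
def kruskal_mst(n, edges):
    """Classical reference MST via Kruskal's algorithm with union-find.
    `edges` is a list of (weight, u, v); returns (total_weight, tree_edges)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    total, tree = 0.0, []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:               # edge connects two components: keep it
            parent[ru] = rv
            total += w
            tree.append((u, v))
    return total, tree

# 4-node example: the MST keeps the cheapest edges that avoid a cycle.
edges = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 0, 2), (4.0, 2, 3)]
total, tree = kruskal_mst(4, edges)
print(total, tree)  # -> 7.0 [(0, 1), (1, 2), (2, 3)]
```

In the quantum formulation, this tree corresponds to the ground state of the problem Hamiltonian, so the classical result serves as the fidelity target.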

Paper Nr: 146
Title:

MCAMOT: Multi-Camera Assisted Multi-Object Tracking for Automated Pig Monitoring

Authors:

Qinghua Guo, Yue Sun, Patrick P. J. H. Langenhuizen, J. Elizabeth Bolhuis, Piter Bijma and Peter H. N. De With

Abstract: Reliable detection and tracking of individual pigs are essential for automated health and welfare monitoring in modern livestock farming. However, single-camera systems often suffer from occlusions and a limited field of view, leading to frequent identity switches and reduced tracking accuracy of pigs. This study proposes MCAMOT, a multi-camera assisted multi-object tracking method for pig monitoring. The method aligns synchronized multi-view frames using a homography transform, and then fuses both fields of view as multi-channel inputs within a unified one-shot detection and tracking network based on FairMOT. MCAMOT obtains an IDF1 score of 86.00% and a MOTA score of 94.53%, which reduces the number of identity switches by approximately 16% across twelve 10-min. videos, compared with single-camera-based tracking using the same network. The results demonstrate that integrating complementary views through geometric transformation and fusion effectively enhances identity consistency and robustness for automated monitoring in pig farms.

Paper Nr: 208
Title:

DINOv3 for DE-ViT: Boosting Few-Shot Object Detection with Scalable and Robust Visual Features

Authors:

Helmut Neuschmied, Werner Bailer and Martin Winter

Abstract: Few-Shot object detection (FSOD) aims to detect novel object categories from only a few labeled examples. The DE-ViT framework addresses this by leveraging self-supervised visual features from DINOv2 without requiring fine-tuning on novel classes. In this work, we adapt DE-ViT to use DINOv3 as the visual backbone, and evaluate the impact of improved self-supervised representations on few-shot detection performance. DINOv3 offers enhanced training stability, refined patch-level features, and large-scale pretraining on 1.7 billion images. We describe the adaptation process, including prototype computation and retraining considerations necessitated by the altered feature distributions and patch resolutions of DINOv3. Experiments on the COCO and LVIS datasets demonstrate consistent performance gains, particularly for novel categories, as well as more coherent prototype-feature alignments. These results indicate that DE-ViT and derived approaches can substantially benefit from the transition to DINOv3.

Paper Nr: 221
Title:

SITUATE: Synthetic Object Counting Dataset for VLM Training

Authors:

René Peinl, Vincent Tischler, Patrick Schröder and Christian Groth

Abstract: We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset helps to improve generalization for out-of-distribution images, since fine-tuning Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmocount test data, but not vice versa. We cross-validate this by comparing model performance across other established counting benchmarks and against an equally sized fine-tuning set derived from Pixmocount.

Paper Nr: 233
Title:

Chaos-SSL: An Attention-Based Self-Supervised Learning Framework with Chaotic Transformation for Medical Image Classification

Authors:

Joao Batista Florindo

Abstract: Self-Supervised Learning (SSL) has emerged as a powerful paradigm to mitigate the reliance on large, annotated datasets, a common bottleneck in medical image analysis. However, standard SSL methods, which rely on simple geometric and color augmentations, may fail to capture the fine-grained, complex textural details necessary for classifying subtle pathologies. This paper introduces Chaos-SSL, a novel two-stage framework for medical image classification. In the first stage, we propose a new self-supervised pre-training strategy that leverages 1D chaotic maps (Logistic, Tent, and Sine) as a complex, non-linear augmentation for contrastive learning. We hypothesize that these chaotic transformations create “harder” and more semantically-rich views, forcing a network to learn robust representations of fine-grained medical textures. In the second stage, we introduce an attention-based fusion model that dynamically combines the specialized features from our Chaos-SSL model with the general-purpose features of a larger, ImageNet-pre-trained model. We validate our method on two public datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). Our results demonstrate that the Chaos-SSL model pre-trained with a Tent map for 30 epochs, followed by attention fusion, achieves performance fully competitive with the state-of-the-art, yielding an accuracy of 0.9261 on ISIC 2018 and 0.8726 on APTOS 2019. This significantly outperforms existing SSL methods, including several recent approaches.
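The abstract does not specify how a 1D chaotic map is applied to an image, so the sketch below makes one plausible assumption: iterate the logistic map pointwise on intensities normalized to (0, 1) to produce a non-linear augmented view for contrastive learning.

```python
import numpy as np

def logistic_map_view(img, r=3.9, iters=3):
    """Hypothetical chaotic augmentation: iterate the logistic map
    x <- r * x * (1 - x) on intensities squeezed into (0, 1)."""
    x = np.clip(img.astype(float), 0.0, 1.0) * 0.998 + 0.001  # keep in (0, 1)
    for _ in range(iters):
        x = r * x * (1.0 - x)
    return x

rng = np.random.default_rng(0)
patch = rng.uniform(size=(4, 4))
view = logistic_map_view(patch)
print(view.shape, float(view.min()) >= 0.0, float(view.max()) <= 1.0)
```

For r <= 4 the logistic map stays within [0, 1], so the transformed array remains a valid image while nearby intensities are scattered non-linearly, which is what makes the contrastive views "harder".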

Paper Nr: 240
Title:

Non-Contact Respiration Rate Estimation in Cattle from Overhead Videos

Authors:

Purbaditya Bhattacharya, Erik Endlicher, Goutham Ravinaidu, Gerald Bieber, Judith Louise Pieper, Jan Langbein, Birger Puppe and Uwe Freiherr von Lukas

Abstract: The welfare and productivity of livestock are critical in modern agriculture, requiring automated and noninvasive health and behavior monitoring systems. Such systems can provide information about multiple parameters related to the animal over time. Respiration rate (RR) is one of the vital physiological parameters whose accurate, continuous estimation can provide early detection of diseases and heat stress, as well as an indication of activities such as rumination in cattle. Traditional measurement methods like direct observation and calculation or the use of contact sensors are often labor-intensive or intrusive, highlighting the need for remote solutions. This paper introduces a camera-based, semi-automated method for non-contact respiration rate estimation in cows and calves using a top-down, overhead camera perspective. The proposed pipeline leverages computer vision and signal processing, combining deep learning-based segmentation with contour and area-based analysis to convert subtle flank or abdominal movements into a time-domain signal. This signal is then filtered and analyzed using a peak-picking method for continuous RR estimation. The proposed method is evaluated on video snippets of moderately stationary animals, achieving an approximate maximum estimation error of 5% across four different feature extraction approaches. The results demonstrate the effectiveness of the method and offer encouragement for further investigation.
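The final stage, peak picking on a filtered flank-area signal, can be sketched with a moving-average smoother and a local-maximum test. The filter choice and parameters below are illustrative, not the paper's.

```python
import numpy as np

def respiration_rate(signal, fs, smooth=5):
    """Estimate breaths/min by peak picking on a moving-average-smoothed
    1D area signal sampled at `fs` Hz (illustrative, not the paper's filter)."""
    kernel = np.ones(smooth) / smooth
    s = np.convolve(signal, kernel, mode="same")
    # local maxima: strictly greater than both neighbors
    peaks = np.flatnonzero((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])) + 1
    duration_min = len(signal) / fs / 60.0
    return len(peaks) / duration_min

fs = 10.0                                  # 10 Hz area signal
t = np.arange(0, 60, 1 / fs)               # one minute of samples
breathing = np.sin(2 * np.pi * 0.5 * t)    # 0.5 Hz -> 30 breaths/min
print(respiration_rate(breathing, fs))     # -> 30.0
```

On real flank-area signals the smoothing window and a minimum peak prominence would need tuning to reject motion artifacts, which is where the reported ~5% estimation error arises.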

Paper Nr: 261
Title:

Porosity Classification in High Pressure Die Casting Using Thermal Images and Sensor Data Fusion via Fuzzy Cognitive Maps

Authors:

Tomasz Michno, Roxana Holom, Sebastian Schmalzer, Pauline Meyer-Heye, Giulia Scampone, Elias Riegler, Matthias Hartmann, Urban Repanšek, Nejc Košir, Peter Šifrer and Katarzyna Poczęta

Abstract: Accurate process monitoring and fast defect detection are crucial in modern industrial manufacturing, as many products must meet safety and performance requirements. In the High Pressure Die Casting (HPDC) process, one of the most severe defects is porosity, which can be caused by many factors. Detecting its occurrence usually requires destructive tests or time-consuming methods such as cuts, leakage tests, Computed Tomography, or X-ray inspection. There is therefore a growing demand for methods that can be used inline and without waste production, even if only as a preliminary check that reduces the number of parts requiring more thorough examination. This paper presents a novel Fuzzy Cognitive Map-based fused sensor classifier for porosity prediction in HPDC parts. The main contributions are: fusion of HPDC machine sensor readouts and thermal images (before and after spraying); feature extraction methods tailored to the HPDC dataset; and a feature selection study analyzing their impact on model performance. To our knowledge, this is the first application of Fuzzy Cognitive Maps for porosity classification in die casting using fused thermal and sensor data. This solution supports sustainability, waste reduction, and inline, non-destructive visual quality control in the metallurgic industry.

Paper Nr: 271
Title:

Domain Specific Multi-Branch Network for Human Pose Estimation Inside a Car

Authors:

Romain Guesdon, Carlos Crispim-Junior and Laure Tougne Rodet

Abstract: The development of computer vision has greatly improved video surveillance methods. Among them, human pose estimation methods aim to predict the coordinates of body keypoints on an image. Although recent deep-learning methods achieve good results in generic contexts, their application inside a vehicle’s cockpit brings new difficulties, such as unusual view angles, strong occlusions, or uneven lighting. However, even with the increasing availability of car cabin datasets, no dataset can cover all possible car cabin scenes. This paper investigates the problem of human pose estimation in a car cabin in a domain generalization setting. Domain generalization investigates methods that can learn how to solve a problem using data from one or more source domains. Its objective is to obtain a solution that generalizes to data from an unknown domain, often named the target domain. This paper proposes two contributions in this direction. First, we propose a multi-branch architecture to take advantage of data from generic and car-specific domains, but also real and synthetic data, to improve human pose estimation inside a car. Second, we introduce an experimental setting to evaluate the human pose estimation methods inside a car cabin and in the context of domain generalization problems. Results show the proposed method can capitalize on features learned from multiple source datasets to improve its performance on data from the target domain.

Paper Nr: 298
Title:

Graph Memory: A Structured and Interpretable Framework for Modality-Agnostic Embedding-Based Inference

Authors:

Artur A. Oliveira, Mateus Espadoto, Roberto M. Cesar Jr. and Roberto Hirata Jr.

Abstract: We introduce Graph Memory (GM), a structured non-parametric framework that represents an embedding space as a compact graph of reliability-annotated prototype regions. GM encodes local geometry and regional ambiguity through prototype relations and performs inference by diffusing query evidence over this graph, unifying instance retrieval, prototype-based reasoning, and graph diffusion within a single inductive and interpretable model. The framework is modality-agnostic: in multimodal settings, independent prototype graphs are built per modality and combined through reliability-aware late fusion. Experiments on synthetic benchmarks, breast histopathology (IDC), and the multimodal AURORA dataset show that GM matches or exceeds kNN and Label Spreading in accuracy, while delivering substantially better calibration, smoother decision boundaries, and an order-of-magnitude smaller memory footprint. Overall, GM provides a principled and interpretable approach to non-parametric inference across single- and multi-modal domains.

Paper Nr: 309
Title:

A Review of Advances in Artificial Intelligence-Driven Technologies for Assessing Food Quality Using Hyperspectral and Multispectral Imaging

Authors:

Michael B. Estrada, Boris X. Vintimilla and Luis E. Chuquimarca

Abstract: This review examines the synergy between Hyperspectral and Multispectral Imaging (HSI/MSI) and Artificial Intelligence (AI) for non-destructive food quality assessment. Synthesizing evidence from 35 pivotal studies on fruits and meats, we delineate the specific operational domains of AI architectures. Machine Learning (ML) proves superior for chemical profiling, such as sugar quantification and species authentication. Conversely, Deep Learning (DL) excels in complex pattern recognition for defect and freshness analysis, achieving accuracies up to 0.997. Furthermore, we highlight the emerging role of Generative Adversarial Networks (GANs) in mitigating data scarcity through synthetic augmentation, which improves model performance by 30%. Despite these advancements, a critical barrier persists: a severe lack of open-access data, with only 16% of fruit datasets being public and a complete absence of accessible meat repositories. This work provides a comprehensive roadmap, comparing camera systems and algorithmic strategies to guide the next generation of intelligent, non-invasive quality control systems.

Paper Nr: 343
Title:

SLAMDUNKS: A Vision-Based Approach for Uncovering Semantic Relations in Dataset Collections

Authors:

Petra Bevandić and Barbara Hammer

Abstract: Multi-dataset training is a key strategy for improving the quality of deep models, but its effectiveness is often hindered by unaligned dataset taxonomies, which may introduce noise into training. To address this, we propose SLAMDUNKS, an exclusively vision-based framework for simultaneous semantic relation discovery between classes and multi-dataset training. At its core are two competing heads: a gating head that determines which dataset-specific classes are related, and a classification head that maps samples to the emerging shared taxonomy. We demonstrate that visual information is necessary for semantic alignment. To rigorously evaluate alignment quality, we formalize semantic relation discovery as a standalone binary classification task. We build pseudo dataset collections from existing datasets, as this allows us to control semantic misalignment and define ground-truth semantic relations. Our method demonstrates good precision, perfectly recovering correct semantic relations for same-domain datasets. Across more challenging cross-domain pairs, SLAMDUNKS matches or outperforms the state-of-the-art, validating its superior capabilities.

Area 3 - Low-Level Vision & Computational Imaging

Full Papers
Paper Nr: 48
Title:

Defense that Attacks: How Robust Models Become Better Attackers

Authors:

Mohamed Awad, Mahmoud Mohamed and Walid Gomaa

Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, introducing a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples. The code and models can be found in the repo: https://github.com/mohamed11354/Defense-That-Attacks.
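The transfer experiment the abstract describes can be illustrated at toy scale: craft a one-step FGSM perturbation against a surrogate "source" model and measure the accuracy drop it causes on a separate "target" model. The linear models, dimensions, and epsilon below are invented for illustration only; the paper's experiments use a full zoo of CNNs and ViTs.

```python
import numpy as np

rng = np.random.default_rng(0)

def fgsm(x, y, w, eps):
    """One-step FGSM on a logistic model: move x along the sign of the loss gradient."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))       # sigmoid score
    grad_x = (p - y)[:, None] * w[None, :]   # d(BCE)/dx for logistic regression
    return x + eps * np.sign(grad_x)

def accuracy(x, y, w):
    return float(np.mean(((x @ w) > 0).astype(int) == y))

# two correlated linear "models" standing in for source and target networks
d = 20
w_src = rng.normal(size=d)
w_tgt = w_src + 0.3 * rng.normal(size=d)     # different but related model

# toy data labelled by the target model's own decision rule
x = rng.normal(size=(500, d))
y = ((x @ w_tgt) > 0).astype(int)

x_adv = fgsm(x, y, w_src, eps=0.4)           # attack crafted on the *source* model
clean_acc = accuracy(x, y, w_tgt)
adv_acc = accuracy(x_adv, y, w_tgt)
print(clean_acc, adv_acc)                    # the attack transfers: target accuracy drops
```

The gap between the two printed accuracies is exactly the transferability effect the paper quantifies across its 36-model zoo.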

Paper Nr: 130
Title:

End-to-End Optimization of Polarimetric Measurement and Material Classifier

Authors:

Ryota Maeda, Naoki Arikawa, Yutaka No and Shinsaku Hiura

Abstract: Material classification is a fundamental problem in computer vision and plays a crucial role in scene understanding. Previous studies have explored various material recognition methods based on reflection properties such as color, texture, specularity, and scattering. Among these cues, polarization is particularly valuable because it provides rich material information and enables recognition even at distances where capturing high-resolution texture is impractical. However, measuring polarimetric reflectance properties typically requires multiple modulations of the polarization state of the incident light, making the process time-consuming and often unnecessary for certain recognition tasks. While material classification can be achieved using only a subset of polarimetric measurements, the optimal configuration of measurement angles remains unclear. In this study, we propose an end-to-end optimization framework that jointly learns a material classifier and determines the optimal combinations of rotation angles for polarization elements that control both the incident and reflected light states. Using our Mueller-matrix material dataset, we demonstrate that our method achieves high-accuracy material classification even with a limited number of measurements.

Paper Nr: 168
Title:

Training-Free Plant Leaf Disease Severity Estimation Using Fuzzy C-Means Clustering and Reference Palette Validation

Authors:

Om Chatterjee, Prachi Das, Amiya Kumar Bhowmik, Avighyan Chakraborty, Jacob Tauro and Sanjoy Pratihar

Abstract: This study presents a novel framework for automated plant disease severity estimation in leaf images using Fuzzy C-means (FCM) clustering in the Lab color space with reference-based validation. The method integrates HSV-based preprocessing, hierarchical clustering, and perceptually weighted sub-clusters to capture fine-grained chromatic variations indicative of disease progression. A key advantage of the proposed approach is that it operates without any model training or prior dataset requirements, relying solely on perceptual color differences to quantify severity. Severity is determined by computing distances between sub-cluster centroids and reference shades, followed by normalization to a standard scale. Experimental evaluation on 60 images each from six crops (apple, cotton, grape, potato, mango, and tomato), rated by ten independent observers, demonstrated strong agreement with human perception, achieving an overall mean accuracy of 98.48%, with a mean absolute error (MAE) of 1.52 and root mean square error (RMSE) of 1.93. The findings highlight the effectiveness of the proposed perceptually informed, training-free framework in enhancing the accuracy and interpretability of disease severity estimation in precision agriculture.
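The core idea (cluster pixels fuzzily in Lab space, then score each cluster by its distance to reference shades) can be sketched without the paper's HSV preprocessing or hierarchical sub-clustering. The Lab values, reference shades, and two-cluster setup below are all made up for illustration.

```python
import numpy as np

def fuzzy_cmeans(X, m=2.0, iters=30):
    """Minimal two-cluster fuzzy C-means: returns (centroids, membership matrix U)."""
    # spread the initial centres along the first (lightness) axis for a stable demo
    centers = X[[int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))]]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)     # fuzzy memberships
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
    return centers, U

# toy "Lab" pixels: a healthy green cluster and a lesion cluster (made-up shades)
rng = np.random.default_rng(1)
healthy = rng.normal([60.0, -40.0, 40.0], 2.0, size=(200, 3))
lesion = rng.normal([45.0, 20.0, 35.0], 2.0, size=(80, 3))
pixels = np.vstack([healthy, lesion])

centers, U = fuzzy_cmeans(pixels)

# severity per cluster: distance of its centroid to the healthy reference shade,
# normalised by the healthy-to-diseased reference distance
ref_healthy = np.array([60.0, -40.0, 40.0])
ref_diseased = np.array([45.0, 20.0, 35.0])
scale = np.linalg.norm(ref_diseased - ref_healthy)
severity = np.linalg.norm(centers - ref_healthy, axis=1) / scale
share = U.mean(axis=0)                       # fuzzy fraction of pixels per cluster
leaf_severity = float(severity @ share)      # area-weighted severity in [0, 1]
print(round(leaf_severity, 3))
```

Because nothing is trained, the whole pipeline is driven by the chosen reference shades, which is what makes the approach dataset-free.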

Paper Nr: 201
Title:

Diffusion-Based HDR Reconstruction from Mosaiced Exposure Images

Authors:

Seeha Lee, Dongyoung Choi and Min H. Kim

Abstract: Snapshot-based HDR imaging from Bayer-patterned multi-exposure inputs has gained significant attention with recent advancements in HDR imaging technology. Learning-based approaches have enabled the reconstruction of HDR images from extremely sparse multi-exposure measurements captured on a single Bayer-patterned sensor. However, existing learning-based methods predominantly rely on tone-mapped representations due to the inherent challenges of direct supervision in the HDR radiance domain. This tone-mapping-based approach suffers from critical limitations, including amplified noise and structural distortions in the reconstructed HDR images. The fundamental challenge arises from the high dynamic range of HDR radiance values, which exhibit a sparse and uneven distribution in floating-point space, making gradient-based optimization unstable. To address these issues, we propose a novel diffusion-based HDR reconstruction framework that operates directly in a split HDR radiance domain while preserving the linearity of the original HDR radiance values. By leveraging the generative power of diffusion models, our approach effectively learns the structural and radiometric characteristics of HDR images, leading to superior detail preservation, reduced noise artifacts, and enhanced reconstruction fidelity. Experiments demonstrate that our method outperforms state-of-the-art techniques in both qualitative and quantitative evaluations.

Paper Nr: 217
Title:

Solving Large Square Jigsaw Puzzles with Quasi-Linear Candidate Selection Using DC-L1 Norm

Authors:

Him Kafle and Amit Banerjee

Abstract: Solving jigsaw puzzles presents a dual challenge of computational efficiency and matching accuracy. Existing solutions with quadratic computational complexity often become infeasible for solving large jigsaw puzzles. To address this, the paper introduces an efficient candidate selection strategy that uses kd-trees in multiple color spaces, namely HSV and Lab, to prune the vast search space. On the reduced set of candidates, the proposed algorithm applies the statistically driven Directional MGC (DC) metric to determine the candidate neighbor. The experimental results show that the proposed two-stage compatibility measurement strategy can match the accuracy of quadratic pairwise matching while significantly reducing the execution time. In addition, in the jigsaw assembly phase, the paper prioritizes the selection of candidate neighbors with the highest compatibility when constructing 2×2 blocks, improving the overall accuracy across different datasets.
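The two-stage idea (a kd-tree over coarse color descriptors prunes candidates, then a precise metric scores only the survivors) can be sketched as below. The descriptors, noise level, and L1 stand-in for the paper's DC-L1 metric are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# hypothetical boundary descriptors: each piece's right edge, and the left edge
# of its true neighbour differing only by a little noise
n_pieces, edge_len = 200, 16
right_edges = rng.random((n_pieces, edge_len, 3))
left_edges = right_edges + rng.normal(0.0, 0.01, right_edges.shape)
perm = rng.permutation(n_pieces)             # pieces arrive in shuffled order
left_edges = left_edges[perm]

# stage 1: a kd-tree over coarse colour summaries prunes the O(n^2) pair space
coarse = left_edges.mean(axis=1)             # 3-D descriptor per edge
tree = cKDTree(coarse)
_, cand = tree.query(right_edges.mean(axis=1), k=5)   # 5 candidates per piece

# stage 2: a precise L1 edge-difference score only on surviving candidates
def l1_score(i, j):
    return np.abs(right_edges[i] - left_edges[j]).sum()

best = np.array([cand[i][np.argmin([l1_score(i, j) for j in cand[i]])]
                 for i in range(n_pieces)])
match_rate = float(np.mean(perm[best] == np.arange(n_pieces)))
print(match_rate)       # near-perfect neighbour recovery at a fraction of the cost
```

Only 5 fine-metric evaluations per piece replace the n comparisons of exhaustive pairwise matching, which is where the quasi-linear behavior comes from.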

Paper Nr: 239
Title:

Horizontally-Aware Deep Network for Seismic Multiple Attenuation

Authors:

José Ribamar Durand Rodrigues Júnior, Paulo Ivson Netto Santos, Aristófanes Corrêa Silva, Francisco de Assis Silva Neto, Carlos Rodriguez Suarez and Deane Roehl

Abstract: This work presents a deep learning approach for the attenuation of internal multiples in seismic data from basaltic regions, where complex subsurface heterogeneities (vesicular zones) generate strong reverberations that challenge conventional processing methods. Synthetic datasets were generated from well-based acoustic properties using a Phase Shift (PS) modeling approach, allowing separate simulation of primary and multiple reflections while ensuring physical consistency. The main contribution lies in the introduction of two novel architectural components, the Atrous Spatial Pyramid Pooling horizontal (ASPPh) and the Signed Fusion Head (SFH). The ASPPh block enhances the U-Net's ability to capture horizontal features at multiple scales, improving the discrimination between primaries and multiples. The SFH block, in turn, focuses on refining the seismic reconstruction, enabling more accurate recovery of the signal. Normal Moveout (NMO) correction was applied to emphasize kinematic differences and introduce temporal distortion (stretching), testing the model’s robustness against typical artifacts. Experimental results demonstrate that the proposed method effectively predicts and attenuates internal multiples, yielding enhanced signal recovery and structural fidelity. On the test subset, the model achieved an average SSIM of 0.9995, PSNR of 62.0 ± 3.5 dB, and SNR of 33.9 ± 3.4 dB, indicating high-quality attenuation and accurate reconstruction of primary reflections.

Paper Nr: 283
Title:

Adaptive IG-ODAM: Efficient Attribution Maps for Object Detection via Spatial-Guided Sampling

Authors:

Yuma Nakai, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: We propose Adaptive IG-ODAM, an enhanced version of IG-ODAM that incorporates adaptive sampling. IG-ODAM visualizes the decision-making rationale of object detection models by integrating gradients along a linear interpolation path from a baseline to an input image, following the Integrated Gradients framework. However, uniform sampling along this path can lead to high approximation errors, particularly in regions where contribution scores change rapidly, thereby reducing the fidelity of the visualization. Furthermore, in saturated regions where gradients vary little, gradients are often noisy, resulting in unstable attribution maps. To address these issues, we introduce Spatial-Guided Adaptive Sampling, which dynamically adjusts interpolation intervals based on gradient fluctuations and IoU variations of predicted bounding boxes. This strategy enables more precise sampling near decision boundaries, improving integration accuracy in critical regions. Experiments on the COCO dataset demonstrate that our method achieves higher attribution quality with fewer samples than the original IG-ODAM, thereby enhancing both computational efficiency and visual fidelity.
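The sampling problem the abstract targets can be shown in one dimension: when the integrand of Integrated Gradients changes sharply somewhere along the interpolation path, uniform sampling wastes budget on flat regions. The toy scalar "model" below and the fluctuation-driven interval placement are illustrative assumptions, not the paper's Spatial-Guided Adaptive Sampling (which also uses IoU variations of predicted boxes).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):               # toy "detector score": switches sharply partway along the path
    return sigmoid(40.0 * (x - 0.7))

def grad(x):
    s = f(x)
    return 40.0 * s * (1.0 - s)

def ig_uniform(x, n):
    """Integrated Gradients with n uniform midpoint samples on the 0 -> x path."""
    alphas = (np.arange(n) + 0.5) / n
    return x * np.mean(grad(alphas * x))

def ig_adaptive(x, n, probes=256):
    """Same sample budget, but interval widths shrink where the gradient fluctuates."""
    t = np.linspace(0.0, 1.0, probes)
    fluct = np.abs(np.diff(grad(t * x))) + 1e-6      # cheap probe of gradient change
    cdf = np.concatenate([[0.0], np.cumsum(fluct)]) / fluct.sum()
    edges = np.interp(np.linspace(0.0, 1.0, n + 1), cdf, t)   # adaptive intervals
    mids = 0.5 * (edges[:-1] + edges[1:])
    return x * np.sum(grad(mids * x) * np.diff(edges))

x, n = 1.0, 8
target = f(x) - f(0.0)                   # completeness: attribution = output change
err_uniform = abs(ig_uniform(x, n) - target)
err_adaptive = abs(ig_adaptive(x, n) - target)
print(err_uniform, err_adaptive)         # adaptive placement cuts the error
```

The completeness axiom gives a ground truth to measure against, which is also how approximation error is typically assessed for IG variants.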

Paper Nr: 313
Title:

Efficient Multi-Temporal Building Change Detection with Reduced-Rank Linear Attention

Authors:

Ricardo A. Acuña-Villogas, Germain García-Zanabria, Cristian Lopez and Rensso Mora-Colque

Abstract: Transformer-based architectures have become the dominant paradigm for bi-temporal building change detection (BCD). However, their application to very high–resolution (VHR) imagery remains limited by the quadratic complexity of multi-head self-attention (MHSA). This work introduces RALA-ChangeFormer, an efficient variant of ChangeFormer that replaces the encoder’s MHSA modules with rank-augmented linear attention (RALA). The integration preserves the hierarchical structure of the original model and maintains its ability to perform bi-temporal spatio–temporal reasoning. Experiments on the LEVIR-CD and DSIFN-CD benchmarks show that RALA-ChangeFormer delivers segmentation accuracy comparable to the baseline, with variations within standard training variance. The method reduces attention-related GFLOPs, lowers peak GPU memory consumption, and enables substantially larger batch sizes at multiple resolutions. Ablation studies confirm that the efficiency gains arise primarily from the encoder, where token counts are highest. Overall, the results demonstrate that rank-augmented linear attention provides a practical accuracy–efficiency trade-off for VHR BCD. The proposed approach improves computational scalability without sacrificing predictive performance, making high-resolution or large-scale training more accessible under fixed hardware constraints.
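The complexity reduction the abstract relies on can be seen in the kernelized linear-attention trick: with a positive feature map in place of softmax, attention can be computed by associating the matrix product the other way, never materializing the N×N map. This sketch shows that trick only; RALA's specific rank augmentation is not reproduced here, and the elu+1 feature map is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1024, 32                      # sequence length, head dimension
Q, K, V = rng.random((3, N, d))

def phi(x):                          # a positive kernel feature map (softmax surrogate)
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1

# quadratic order: build the N x N similarity matrix explicitly -- O(N^2 d)
A = phi(Q) @ phi(K).T                # (N, N)
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# linear order: associate the other way, never materializing N x N -- O(N d^2)
KV = phi(K).T @ V                    # (d, d)
z = phi(K).sum(axis=0)               # (d,)
out_linear = (phi(Q) @ KV) / (phi(Q) @ z)[:, None]

print(np.allclose(out_quadratic, out_linear))   # → True: same result, different cost
```

For VHR imagery, N grows with resolution while d stays fixed, so moving the cost from N² to N·d² is what enables the larger batch sizes the paper reports.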

Short Papers
Paper Nr: 59
Title:

Quality of Brain Tumour Detection in Hyperspectral Imaging Based on Ground Truth Representation

Authors:

Martin Rydlo, Max Verbers, Carlos Vega, Raquel Leon, Himar Fabelo, Gustav Liu Burström, Alfonso Lagares, Jesus Morera Molina, Gustavo M. Callico, Francesca Manni and Svitlana Zinger

Abstract: Brain surgery is a complex procedure requiring precise incisions for tumour tissue removal. Hyperspectral imaging is a promising non-invasive medical image acquisition technique explored in brain surgery as an assistive intraoperative image guidance tool. The data obtained offers potential for tumour detection, but is currently limited by sparse labelling based on biopsy locations with known histopathological outcomes. The contribution of this research lies in creating image processing methods to meaningfully expand the amount of labelled ground truth data, specifically tumour tissue. The expansions, performed using image morphology and unsupervised segmentation, are used to train machine learning classifiers. A leave-one-patient-out cross-validation protocol is used to assess binary classification performance on a hyperspectral dataset comprising 36 images from 25 patients. The image morphology and segmentation-based methods yield average macro F1 score increases of 15.7% and 14.8%, respectively, across all classifiers compared to the original ground truth benchmark. These results establish the feasibility of explainable image processing-based ground truth expansion methods for improved brain tumour classification performance and may enable new future hyperspectral imaging research methods, contributing to the development of new surgical guidance technologies.
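The morphological half of the expansion idea is simply growing sparse biopsy labels into the surrounding pixels. A minimal sketch, assuming a single labelled point and a 3×3 structuring element (the paper's actual element sizes and segmentation-based variant are not reproduced):

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 structuring element, via shift-and-OR."""
    m = mask.astype(bool)
    for _ in range(iterations):
        grown = m.copy()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                grown |= np.roll(np.roll(m, dy, axis=0), dx, axis=1)
        m = grown
    return m

# toy ground truth: a single labelled biopsy point inside a 9x9 "image"
gt = np.zeros((9, 9), dtype=bool)
gt[4, 4] = True

expanded = dilate(gt, iterations=2)       # grow the label into surrounding tissue
print(gt.sum(), expanded.sum())           # 1 → 25 labelled pixels
```

Each dilation step assumes tissue immediately adjacent to a confirmed biopsy site shares its label, which is why the expansion must stay conservative and explainable.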

Paper Nr: 88
Title:

GazeUnconstrained: A Multimodal Dataset for Visual Attention and Gaze Estimation in Natural Video Viewing

Authors:

Wenjuan Zhou, Haibin Cai, Baihua Li and Qinggang Meng

Abstract: Accurate gaze estimation in naturalistic environments remains a significant challenge due to head movement, lighting variability, and the dynamic nature of visual attention. Most existing datasets are either task-driven, requiring participants to fixate on predefined targets, or target-free with limited cognitive engagement, which limits their ability to capture naturalistic viewing behavior. We introduce GazeUnconstrained, a multimodal dataset designed to capture gaze behavior during cognitively engaging long-form video viewing. Using a synchronized pipeline, we record 2D gaze points, 3D gaze vectors, facial videos, and screen content with frame-level alignment, enabling cross-modal analysis without post-processing. Unlike existing resources such as Gaze360 (pose diversity without synchronized stimuli) and EVE (stimuli variety with weak semantic engagement), GazeUnconstrained complements them by providing synchronized, naturalistic, and semantically rich visual attention data. We benchmark five state-of-the-art appearance-based models on GazeUnconstrained, revealing that attention-driven gaze shifts and patterns in natural video viewing remain challenging for existing approaches. Although modest in scale (14 subjects), the dataset demonstrates the importance of high-quality multimodal synchronization and introduces new challenges that motivate hybrid and context-aware methods. To facilitate further research, both the dataset and benchmark code are publicly released.

Paper Nr: 104
Title:

Computing a Characteristic Orientation for Rotation-Independent Image Analysis

Authors:

Cristian Valero-Abundio, Emilio Sansano-Sansano, Raúl Montoliu and Marina Martínez García

Abstract: Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. This paper introduces General Intensity Direction (GID), a preprocessing method that improves rotation robustness without modifying the network architecture. The method estimates a global orientation for each image and aligns it to a canonical reference frame, allowing standard models to process inputs more consistently across different rotations. Unlike moment-based approaches that extract invariant descriptors, this method directly transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on the CIFAR-10 dataset confirm that the method remains effective under more complex conditions.
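The preprocessing principle can be sketched with a deliberately simple orientation estimator: take the angle from the image centre to the intensity-weighted centroid, and check that it rotates with the image (so aligning every input by its own estimate yields a canonical frame). This centroid-based estimator is a hypothetical stand-in, not the paper's GID formulation.

```python
import numpy as np

def global_orientation(img):
    """Hypothetical GID-style global orientation: angle (degrees) from the image
    centre to the intensity-weighted centroid."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = img.sum()
    cy, cx = (ys * img).sum() / total, (xs * img).sum() / total
    # image y grows downward, so negate it for a conventional math angle
    return np.degrees(np.arctan2(-(cy - (h - 1) / 2), cx - (w - 1) / 2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
img[:, 40:] += 2.0                        # make the right side brighter

a0 = global_orientation(img)              # ≈ 0°: mass lies to the right
a90 = global_orientation(np.rot90(img))   # rotate 90° counter-clockwise
print(round(a0, 1), round(a90, 1))

# canonical alignment would then rotate every input by minus its own estimate,
# mapping all rotated copies to (roughly) the same reference orientation
```

Because the estimate is equivariant with rotation, subtracting it before inference removes the rotation, which is what lets an unmodified network see consistently oriented inputs.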

Paper Nr: 136
Title:

Supervised Deep Feature-Based Industrial Defect Detection in Optical Lenses with Minimal Data

Authors:

Tarek Zenati

Abstract: Defects in optical lenses can significantly affect the final product’s quality and performance. In most industrial lines, quality control is performed by human operators, and the precision of defect detection is heavily dependent on their skills. However, some defects are subtle and difficult for the human eye to detect, prompting the adoption of machine vision systems for automation. A key challenge in integrating machine learning into these systems is the lack of high-quality labeled data. To address this, we propose an optimized supervised learning model based on customized VGG16 convolutional layers for feature extraction, combined with a machine learning classifier (SVM, Random Forest, Linear Regression, KNN, or gradient boosting) to detect specific optical lens defects. The novel contribution of our work is the successful application of this model with an exceptionally small dataset consisting of only 87 labeled images. Our experiments demonstrate the model’s effectiveness, achieving a precision of 94%, recall of 97%, F1 score of 88%, and accuracy of 86%. These results show that our optimized algorithm can achieve state-of-the-art performance despite the limited dataset, offering an efficient solution for vision-based defect detection systems with minimal labeled data.

Paper Nr: 138
Title:

Fusing Thermal and Event Data for Visible Spectrum Image Reconstruction

Authors:

Simone Melcarne and Jean-Luc Dugelay

Abstract: Reconstructing visible spectrum images from unconventional sensors is a timely and relevant problem in computer vision. In settings where standard cameras fail or are not allowed, thermal and event-based cameras can offer complementary advantages (robustness to darkness, fog, motion, and high dynamic range conditions) while also being privacy-preserving and energy efficient. However, their raw data is hard to interpret, and most computer vision models are designed and pretrained on standard visible inputs, making direct integration of unconventional data challenging. In this work, we ask whether it is possible, given a paired system that simultaneously records thermal and event data, to recover the kind of information people associate with the visible spectrum. We propose a simple dual-encoder, gated-fusion network that synthesizes visible-like images from thermal frames and event streams. The thermal branch captures structure and coarse appearance; the event branch models spatio-temporal changes and adds finer detail. Their outputs are combined and finally decoded into a colored image. We train and test the proposed solution on a paired thermal–visible–event dataset. Results show that this approach can recover plausible visible images, producing better results than single-modality baselines, both quantitatively and qualitatively.
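The gated-fusion step can be sketched in isolation: a learned sigmoid gate blends the two encoders' feature maps per pixel and channel before decoding. The shapes, the 1×1-conv-style gate, and the random stand-in features below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for the two encoder outputs (C, H, W): the thermal branch would carry
# coarse structure, the event branch fine temporal detail
C, H, W = 8, 16, 16
f_thermal = rng.normal(size=(C, H, W))
f_event = rng.normal(size=(C, H, W))

# 1x1-conv-style gate: a per-channel linear map on the concatenated features
Wg = rng.normal(scale=0.1, size=(C, 2 * C))
bg = np.zeros(C)

x = np.concatenate([f_thermal, f_event], axis=0).reshape(2 * C, -1)   # (2C, HW)
gate = 1.0 / (1.0 + np.exp(-(Wg @ x + bg[:, None])))                  # sigmoid gate
gate = gate.reshape(C, H, W)

# convex per-pixel blend of the two modalities, handed to the decoder afterwards
fused = gate * f_thermal + (1.0 - gate) * f_event
print(fused.shape)
```

Because the gate is input-dependent, training can learn to trust the event branch around moving edges and the thermal branch in static regions, rather than using a fixed mixing weight.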

Paper Nr: 166
Title:

Centrality-Driven Prufer-Code Based Graph Encoding to Capture Structural Relationships

Authors:

Atashi Saha and Sanjoy Pratihar

Abstract: Graphs model real-world systems such as social networks or image regions. Because they often contain large numbers of densely connected nodes, direct analysis is computationally expensive. To address this, this research introduces a bias-controlled spanning tree encoding scheme that combines node centrality with a Prüfer sequence representation. The method begins by detecting influential nodes using a hybrid centrality formulation and then uses the node labels to guide the construction of a Minimum Spanning Tree (MST) under a tunable bias factor α. The MST captures the key backbone of connectivity while retaining the influence hierarchy of the network. The resulting tree is then represented as a Prüfer sequence, yielding an order-independent, compact representation for the comparison, storage, and analysis of large graphs, with potential applications in image segmentation, retrieval, image interpretation, and semantic labeling. Experiments on social (Facebook) and image-based (superpixel) networks show that the proposed representation reliably balances structural preservation and influence-driven organization and provides a scalable substitute for relational and visual graph understanding.
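The encoding step itself is standard and compact enough to sketch: a labelled tree on n nodes maps to a unique sequence of n−2 node labels. High-degree (central) nodes repeat in the sequence, which is why the representation preserves the influence hierarchy. The centrality-biased MST construction is the paper's contribution and is not reproduced here.

```python
def prufer_sequence(n, edges):
    """Encode a labelled tree on nodes 0..n-1 as its Prüfer sequence (length n-2):
    repeatedly remove the smallest-label leaf and record its neighbour."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seq = []
    for _ in range(n - 2):
        leaf = min(v for v, nb in adj.items() if len(nb) == 1)
        (parent,) = adj[leaf]       # a leaf has exactly one neighbour
        seq.append(parent)
        adj[parent].remove(leaf)
        del adj[leaf]
    return seq

# star centred on node 3: the hub repeats, exposing its high centrality
print(prufer_sequence(5, [(3, 0), (3, 1), (3, 2), (3, 4)]))   # [3, 3, 3]

# path 0-1-2-3-4: each internal node appears once
print(prufer_sequence(5, [(0, 1), (1, 2), (2, 3), (3, 4)]))   # [1, 2, 3]
```

Since each node appears degree−1 times, two trees can be compared through their sequences without any reference to drawing order, which is the order-independence the abstract mentions.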

Paper Nr: 174
Title:

Similarity Analysis of AI-Generated Images from Spanish and English Prompts via Feature Extraction

Authors:

Paulina Morillo, Christopher Morales, Christian Castro and Diego Vallejo-Huanga

Abstract: Currently, image-generating AIs are used increasingly often for a variety of creative and practical applications. The outputs of these tools depend heavily on natural language texts, called prompts, with the prompt’s language significantly influencing the final result. Similarity can be desirable or undesirable depending on the application. In this work, we focus on cross-lingual consistency, where similarity between semantically equivalent prompts in different languages is desirable as an indicator of multilingual robustness. In this study, we compare images generated by three different artificial intelligence tools: Leonardo AI, Bing Image Creator, and DALL·E 2. Using prompts in both Spanish and English, a total of 600 images were created: half generated from English prompts and the other half from Spanish prompts. Feature extraction techniques such as SIFT and ORB were then applied to calculate distances between the images. Additionally, prompts were grouped into three categories: faces, landscapes, and art. The results show that DALL·E 2 produces slightly smaller distances between images than Leonardo AI and Bing Image Creator, but overall, the measured distances indicate comparable local-structural similarity across languages for the three tools, with average distances of 1.3238 and 1.0576 using SIFT and ORB, respectively.
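The ORB side of the comparison boils down to Hamming distances between binary descriptors: similar images yield descriptor sets whose best matches lie close, unrelated images do not. The synthetic descriptors and 5% bit-flip rate below are invented stand-ins for descriptors extracted from real generated images.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-ins for ORB binary descriptors (256 bits) of a generated image
n_desc, n_bits = 100, 256
desc_en = rng.integers(0, 2, size=(n_desc, n_bits), dtype=np.uint8)

# a "similar" cross-lingual image: flip 5% of bits; an unrelated one: fresh bits
flip = (rng.random(desc_en.shape) < 0.05).astype(np.uint8)
desc_es_similar = desc_en ^ flip
desc_unrelated = rng.integers(0, 2, size=(n_desc, n_bits), dtype=np.uint8)

def mean_match_distance(d1, d2):
    """Mean Hamming distance of each descriptor in d1 to its best match in d2
    (what a brute-force Hamming matcher reports for ORB)."""
    hd = (d1[:, None, :] ^ d2[None, :, :]).sum(axis=2)   # (n1, n2) Hamming table
    return float(hd.min(axis=1).mean())

d_sim = mean_match_distance(desc_en, desc_es_similar)
d_diff = mean_match_distance(desc_en, desc_unrelated)
print(d_sim < d_diff)    # cross-lingual pairs should score closer than unrelated ones
```

Averaging such match distances over many image pairs gives exactly the kind of aggregate score (e.g. the 1.0576 ORB figure) the study compares across tools and languages.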

Paper Nr: 196
Title:

Depth Estimation in Scattering Media due to Synchronization Delay in Projector Camera Systems

Authors:

Ryosuke Komatsu, Masaya Oishi, Takafumi Iwaguchi, Hiroshi Kawasaki and Hiroyuki Kubo

Abstract: This study proposes a method for estimating the depth of objects within scattering media using a synchronized projector-camera system. We observed that when an intentional synchronization delay is introduced between the projector and the camera, the captured shadow image of an object changes depending on the delay time, and that the magnitude of this image displacement varies with the object’s depth inside the medium. Based on this observation, we propose a depth estimation method that analyzes the image variations obtained under different synchronization delay times. The proposed measurement technique involves capturing a sequence of images while systematically varying the delay time applied to the projector–camera synchronization. From this sequence, a cross-sectional slice image containing part of the object is extracted. A Hough transform is then applied to the slice image to measure the displacement of object features. By applying this process across the captured dataset, the depth of objects within scattering media can be estimated. Experimental verification confirmed that the proposed method effectively estimates the depth of objects inside a scattering medium.

Paper Nr: 198
Title:

Event Camera Deraining Using Transformer Networks with Time-Space Attention

Authors:

Pin-Yuan Yang and Shih-Chieh Chang

Abstract: Event cameras are bio-inspired visual sensors offering high dynamic range, high temporal resolution, and low power consumption, making them suitable for high-speed and variable lighting conditions. These advantages have spurred extensive research into their applications in various visual tasks. However, when mounted on vehicles like drones, event cameras face challenges under adverse weather, particularly rain. Raindrops trigger events at high velocity, creating rain streaks that interfere with object edges and degrade downstream task performance. This paper addresses rain removal in event camera data. We propose a novel transformer-based model for event deraining. By adapting the Vision Transformer (ViT) to handle spatiotemporal segmentation of event frames, we enable more flexible modeling of spatiotemporal relationships. We introduce several self-attention designs and empirically validate their performance on paired event raining datasets. Our best-performing design, the "xt-y attention" architecture, applies local attention to tokens in neighboring xt planes followed by neighboring tokens in the y direction. This approach significantly reduces voxel and reconstruction errors on synthetic and real rain datasets. Our results show the proposed transformer-based method outperforms WTSD and other schemes, providing a robust solution for rain removal in event data and enhancing downstream task performance under rainy conditions.

Paper Nr: 224
Title:

Dance Style Classification Using Laban-Inspired and Frequency-Domain Motion Features

Authors:

Ben Hamscher, Arnold Brosch, Nicolas Binninger, Maksymilian Jan Dejna and Kira Maag

Abstract: Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
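The feature pipeline the abstract describes (per-joint velocity, acceleration, and angular movement, plus FFT features for rhythm) can be sketched from a pose track. The joint count, clip length, and which statistics are pooled are illustrative choices, not the paper's exact descriptor set.

```python
import numpy as np

def motion_features(joints, fps=30.0):
    """Laban-inspired descriptors from a (T, J, 2) pose track, plus FFT features.
    Returns one flat feature vector per clip."""
    vel = np.diff(joints, axis=0) * fps                  # joint velocities
    acc = np.diff(vel, axis=0) * fps                     # joint accelerations
    speed = np.linalg.norm(vel, axis=2)                  # (T-1, J)
    # angular movement: frame-to-frame change of each joint's motion direction
    ang = np.arctan2(vel[..., 1], vel[..., 0])
    ang_change = np.abs(np.diff(np.unwrap(ang, axis=0), axis=0))
    # rhythmic content: magnitude spectrum of the mean speed signal
    spectrum = np.abs(np.fft.rfft(speed.mean(axis=1)))
    top_bins = spectrum[1:9]                             # skip DC, keep 8 low bins
    return np.concatenate([
        speed.mean(axis=0), speed.std(axis=0),           # effort-like statistics
        np.linalg.norm(acc, axis=2).mean(axis=0),
        ang_change.mean(axis=0),
        top_bins,
    ])

# a made-up 2-second clip of 5 upper-body joints bobbing at 2 Hz
t = np.arange(60) / 30.0
joints = np.zeros((60, 5, 2))
joints[:, :, 1] = np.sin(2 * np.pi * 2.0 * t)[:, None]   # vertical oscillation
feats = motion_features(joints)
print(feats.shape)
```

A fixed-length vector like this feeds any lightweight classifier directly, which is what keeps the framework free of heavy model architectures.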

Paper Nr: 254
Title:

Self-Supervised Real-World Image Denoising with Noise-Level-Aware Dynamic Receptive Fields

Authors:

Yoichi Furukawa, Takahiro Maruyama and Kazuhiro Hotta

Abstract: Image denoising aims to restore clean images from noisy observations, but the spatial variation and complexity of real-world noise remain major challenges. Blind Neighborhood Network (BNN) is effective in flat regions, yet its fixed receptive field causes performance degradation when noise levels differ across image samples. Locally Aware Network (LAN) captures fine structures but is highly sensitive to global noise-level changes due to its narrow receptive field. In conventional self-supervised pipelines, the outputs of BNN and LAN are combined to construct pseudo-clean targets. However, artifacts originating from the BNN often remain in these pseudo-targets. Because LAN focuses primarily on local detail restoration, it cannot fully correct these residual artifacts, causing them to propagate into the final denoised output. To address these limitations, this study introduces three components: Noise Level Estimator (NLE), Dynamic-Receptive-Field BNN (DRF-BNN), and Detail-Region Re-Noising (DRRN). The NLE predicts per-image noise level using a ResNet18 classifier. The DRF-BNN adjusts its receptive field according to the estimated noise level, improving adaptability across diverse noise conditions. The DRRN selectively reintroduces differential noise to regions where fine structures exist, reducing LAN’s sensitivity to excessive noise while enhancing structural detail. Integrated into a self-supervised framework, these components collectively enable stable adaptation to various noise levels, suppression of artifacts in high-noise regions, and improved fine-structure restoration in real-world images.

Paper Nr: 290
Title:

Video Denoising Still Needs Quality Benchmarks

Authors:

Sofiia Dorogova, Amir Shamsutdinov, Aleksei Khalin and Egor Ershov

Abstract: Modern neural models require large and representative datasets to learn effectively; however, existing public benchmarks remain limited in scale, realism, and the accuracy of their noise modeling. This research addresses the urgent need for high-quality, realistic benchmarks in video denoising, particularly within the RAW domain. We present a structured overview of existing real and synthetic video datasets, analyzing their acquisition methodologies - including stop-and-motion capture, beam-splitter setups, screen re-capture of 4K content, and low-light RAW mobile bursts. Although some of the datasets were originally developed for low-light enhancement, they demonstrate acquisition strategies that are highly relevant to future video denoising research. Through a unified comparative analysis, we identify systematic gaps, notably the absence of large-scale RAW benchmarks that combine realistic motion, temporal consistency, and device diversity. As a contribution, we propose a new direction for dataset development based on an enhanced beam-splitter methodology for synchronized 4K RAW video acquisition. This approach aims to produce physically realistic noisy-clean pairs and to establish an open, high-fidelity foundation for future deep learning research in video denoising. All reviewed datasets are compiled and maintained at: https://github.com/sofiadorogova/video denoising datasets.

Paper Nr: 291
Title:

Edge Bundling with Divergence and Convergence

Authors:

Naoki Hashimoto, Ian Gomasaki and Ryosuke Saga

Abstract: Edge bundling reduces visual clutter in graph visualizations by grouping edges. While Genetic Algorithm (GA)-based approaches offer diverse bundling styles, they often lack the tight bundling quality of traditional physics-based methods. This paper proposes a novel GA-based edge bundling method that incorporates a "convergence" process to enhance bundling quality while maintaining the diversity inherent to evolutionary algorithms. We introduce two convergence strategies: Nearest Neighbor Merging and FDEB-like Merging, both controlled by a single frequency parameter. Experiments demonstrate that the proposed method significantly improves bundling quality metrics such as Ink-Ratio, Spatial-Entropy, and Edge-Crossings compared to the baseline. Furthermore, the frequency parameter effectively controls the degree of bundling, allowing for adjustable visualization styles. Comparative analysis reveals that Nearest Neighbor Merging offers consistent improvements at higher frequencies, while FDEB-like Merging can be advantageous for reducing edge crossings in specific scenarios.
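The Nearest Neighbor Merging idea can be sketched on edges represented as polylines: each interior control point is pulled toward the nearest interior control point of any other edge, drawing parallel edges into a bundle while endpoints stay anchored. The polylines, merge strength, and gap measure below are illustrative assumptions, not the paper's GA operators.

```python
import numpy as np

def nearest_neighbor_merge(edges, strength=0.5):
    """One "convergence" pass: pull every interior control point toward the
    nearest interior control point of any *other* edge."""
    pts = np.concatenate([e[1:-1] for e in edges])           # all interior points
    owner = np.concatenate([[i] * (len(e) - 2) for i, e in enumerate(edges)])
    merged = []
    for i, e in enumerate(edges):
        out = e.copy()
        for k in range(1, len(e) - 1):
            d = np.linalg.norm(pts - e[k], axis=1)
            d[owner == i] = np.inf                           # ignore own points
            out[k] = e[k] + strength * (pts[np.argmin(d)] - e[k])
        merged.append(out)
    return merged

# two near-parallel edges, each subdivided into 5 control points
t = np.linspace(0.0, 1.0, 5)[:, None]
e1 = np.hstack([t * 10.0, np.full((5, 1), 0.0)])
e2 = np.hstack([t * 10.0, np.full((5, 1), 1.0)])

bundled = nearest_neighbor_merge([e1, e2])

def mean_gap(a, b):      # mean distance between the two edges' interior points
    return float(np.linalg.norm(a[1:-1] - b[1:-1], axis=1).mean())

print(mean_gap(e1, e2), mean_gap(*bundled))   # gap shrinks from 1.0 to 0.0
```

In the paper's setting, the frequency parameter controls how often such a pass is applied during evolution, trading bundling tightness against population diversity.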

Paper Nr: 315
Title:

Compensating Light Source Defocus for Per-Pixel Surface Roughness Measurement

Authors:

Ryohei Ohmori, Michitaka Yoshida, Ryo Kawahara and Takahiro Okabe

Abstract: Measuring per-pixel surface roughness is important not only for photorealistic image synthesis but also for visual inspection of various surfaces. Existing methods measure or estimate the roughness values of a target surface from the specular reflection components observed in images taken under point or area light sources. Unfortunately, those methods neglect light source defocus: the light source illuminating the target surface can be out of focus due to the limited depth of field of the camera. Therefore, they generally overestimate the roughness values. Accordingly, based on a novel Fourier analysis of illumination, reflection, and light source defocus, we show the relationship between the apparent surface roughness with the light source defocus and the inherent surface roughness without it. Then, we propose a simple but effective method for compensating the light source defocus in per-pixel surface roughness measurement by using a reference surface. We conducted a number of experiments using both synthetic and real images, and confirmed that our proposed method can accurately estimate per-pixel surface roughness from a small number of images.

Paper Nr: 321
Title:

Unveiling AI-Manipulated Medical Images: Detecting and Localizing Tampered Areas

Authors:

T. Ashlesha, Vignesh, S. Varun, Vats Shubhangi and Narayan Surabhi

Abstract: Recent generative models based on Generative Adversarial Networks (GANs) and diffusion can insert synthetic tumors or erase genuine lesions from Computed Tomography (CT) scans, posing a serious threat to clinical decision making. Existing forensic approaches are largely limited to slice-level real–vs–fake classification and rarely indicate where a scan was altered or how it was manipulated, particularly for subtle removal-type edits. We propose a compact, multi-stage forensic framework for (i) tamper detection, (ii) manipulation-type classification (injection vs. removal), and (iii) pixel-level localization. A removal-oriented localization module fuses PACE-enhanced CT, Error Level Analysis (ELA), noise-variance maps, and low-frequency Fast Fourier Transform (FFT) to expose over-smoothed inpainted regions, while injected lesions are emphasized using Local Gradient Ternary Patterns (LGTrP) and Pyramid Histogram of Oriented Gradients (PHOG). The enriched descriptors feed a dual-stream classifier combining DenseNet-121 spatial features with a lightweight frequency Convolutional Neural Network (CNN), achieving an F1-score of 0.937 for real–vs–fake detection and a macro F1 of 0.938 for injection–vs–removal classification. For localization, U-Net++ with attention and a soft-kNN refinement head segments injected regions (Dice = 0.9201), while a streamlined CT+ELA+Noise+FFT UNet identifies removal regions (Dice = 0.7369). Experiments on 10,844 CT slices show 7–12% gains over spatial-only or texture-only baselines, with full-scan analysis completing in 102 seconds. Overall, the framework delivers an end-to-end, interpretable solution for detecting, characterizing, and localizing GAN- and diffusion-driven manipulations in lung CT imagery, enabling direct radiologist review.

Paper Nr: 338
Title:

Pixel-Driven Image Representation through Rule-Based Cellular Automata Dynamics

Authors:

Lorenzo C. Maia, Priscila T. M. Saito and Pedro H. Bugatti

Abstract: Pixel analysis and classification are cornerstone tasks in computer vision. This paper introduces a novel pixel-driven image description approach based on cellular automata (CA). The proposed method models local discrete relationships between pixels and propagates this information through rule-based CA dynamics to enhance discrimination. Four variants for extracting pixel relationship information are designed, all derived from two-dimensional orthogonal grid rules inspired by Conway’s Game of Life. By combining these variants into compact histogram-based descriptors, the resulting representations are inherently robust to geometric transformations such as translation, rotation, and scaling. The proposed approach is validated on a public image dataset and compared against a vanilla CNN and a ResNet-18 architecture. Multiple evaluation metrics are reported, including accuracy, macro F1, precision, and recall. Experimental results show that the proposed method achieves strong classification performance, with gains of up to 35% in macro F1 compared to the vanilla CNN. A detailed per-class analysis further reveals several scenarios in which the proposed approach matches or outperforms ResNet-18, while requiring substantially lower computational cost. These results demonstrate that our approach provides a competitive, interpretable, and resource-efficient alternative to deep CNN architectures for image representation.

Paper Nr: 175
Title:

Improving Video Object Detection Performance in Rainy Conditions Using Edge Preserving Image Deraining

Authors:

Sibani Panigrahi, Debi Prosad Dogra and Jothi Ramalingam

Abstract: Restoring rain-degraded images is important for improving various computer vision applications. Rain can significantly reduce visibility, negatively impacting these applications. There are two types of visibility problems caused by rain. First, distant rain streaks accumulate, creating a hazy, fog-like veil, as the streaks scatter light, reducing clarity. Second, nearby rain streaks produce highlights and obstruct the background, which decreases visibility. The rain streaks can vary in shape, size, and direction, especially during heavy rain, leading to severe loss of visibility. To overcome this, we propose Edge Preserving Image Deraining (EPID), aiming to remove rain artifacts while retaining the inherent structure, contour and texture of the scene. EPID is based on an auto-encoder that adopts edge-aware decomposition techniques, including a domain transform filter to separate an image into base and detail layers, and edge-preserving loss functions comprising Charbonnier, perceptual, and SSIM losses; the network is trained on the detailed structures of the images. EPID helps reduce rain artifacts while preserving image sharpness and structural integrity. Experiments on real-world videos reveal that the method is highly effective in reducing the effect of rain. This essentially helps the downstream applications such as object detection and tracking in adverse weather conditions.

Paper Nr: 234
Title:

CP HDR: A Feature Point Detection and Description Library for LDR and HDR Images

Authors:

Artur Santos Nascimento, Daniel Oliveira Dantas, Valter Guilherme Silva de Souza and Beatriz Trinchão Andrade

Abstract: Feature point detection and description are fundamental to computer vision; however, their performance severely degrades under extreme lighting conditions. While HDR imaging can overcome these limitations by capturing a wider luminance range, the use of HDR images in feature extraction applications remains a significant technical challenge, leading most researchers to rely on tone mapping operators to convert HDR images into LDR. This study evaluates the use of HDR imagery for robust feature detection and description. Based on this study, we developed CP HDR, a novel, open-source library designed for fair and reproducible comparison of feature point detection and description algorithms for LDR and HDR images. CP HDR incorporates state-of-the-art metrics, established detectors (Harris, SIFT), and their HDR-optimized variants (Harris for HDR and SIFT for HDR), which integrate a coefficient of variation (CV) filter and logarithmic transformation to enhance performance in saturated regions. Our evaluation on specialized datasets demonstrates that direct HDR processing, particularly with our modified algorithms, significantly improves feature distribution uniformity rate (UR), up to 83%, across brightest, intermediate, and darkest image areas compared to LDR baselines. The library, its documentation, and benchmarking tools are publicly available to facilitate future research in HDR computer vision.

Paper Nr: 285
Title:

No Mountain, no Building, no Cue? Synthetic Data Generation of Digital Surface Models and Their Application to Visual Geo-Localization

Authors:

Nicolai Skutsch, Olaf Hellwich and Frank Fuchs-Kittowski

Abstract: Visual geo-localization techniques are crucial for estimating the precise position and orientation in the context of outdoor mobile augmented reality (AR) applications. The information contained in digital surface models (DSMs) suggests that their employment in visual geo-localization is a promising approach for rural, non-mountainous environments. However, the paucity of images and DSMs from rural regions renders existing datasets ill-suited for the development and evaluation of approaches for these regions. This paper proposes a method for generating ground-view query images and aerial-view reference images, which can be utilized to construct a DSM, from the video game Grand Theft Auto V (GTA V). In order to evaluate the applicability of DSMs to visual geo-localization, an approach based on skyline matching has been applied to the generated data. The findings indicate that the implementation of DSMs holds considerable potential to provide useful visual cues. However, the results also underscore the demand for future research on visual geo-localization in rural environments. The code is publicly available at: https://github.com/n-skutsch/GTAGeo.

Paper Nr: 293
Title:

Few-Shot Data Sampling Strategies for Insect Pest Maturity Classification with YOLO Family Models

Authors:

Kelly Abreu, Aclecio Costa and Dibio Borges

Abstract: The scarcity of labeled data is a critical bottleneck for training deep learning models in Precision Agriculture. This paper addresses this challenge by systematically evaluating the interplay between data-centric sampling strategies, transfer learning configurations, and YOLO architectures (v5, v8, v11). We compare two sampling strategies (sharpness-guided, background-aware) against a random baseline and analyze two transfer learning approaches: full fine-tuning and partial backbone freezing. Our results show that while data-centric sampling did not uniformly improve mean mAP@0.5 across all architectures, it significantly enhanced training stability, reducing variance across runs. Furthermore, stability was found to depend on biological maturity and freezing policy: YOLOv11 with background-aware sampling achieved superior consistency for adult insects, whereas early-stage classes exhibited markedly higher instability and sensitivity to algorithmic and freezing changes, reinforcing the need for adaptive strategies tailored to biological maturity.

Paper Nr: 324
Title:

SynShapes: A Synthetic Image-Annotation Dataset for Edge Detection

Authors:

Guillermo A. Castillo, Xavier Soria and Angel D. Sappa

Abstract: In the edge detection area, generating ground-truth edge annotations is a time-consuming, costly and subjective process, often leading to inconsistent labels and limited scalability. To address these limitations, we propose SynShapes, a synthetic dataset composed of automatically generated image–annotation pairs. The dataset is created using a Python-based pipeline that produces consistent ground-truth edge maps from randomly generated geometric shapes, and, in order to reduce the gap between synthetic and real data, a Fourier Domain Adaptation (FDA) strategy is applied. SynShapes is then benchmarked by training the TEED edge detector model on multiple datasets: BIPED, BSDS500, MDBD and SynShapes with FDA. Evaluations across the training datasets and the UDED dataset show that SynShapes with FDA achieves performance comparable to, and in some cases surpassing, that of models trained on real-world datasets, all without the cost and bias of manual annotation. The dataset is publicly available at: https://vision-cidis.github.io/SynShapes/.

Area 4 - Recognition & Detection

Full Papers
Paper Nr: 41
Title:

Toward Automated Bed Safety: Comparative Study of Vision Algorithms for Bed-Rail State Detection

Authors:

Navid Aslankhani Khameneh, Michela Farenzena and Marco Carletti

Abstract: Detecting bed-rail position is critical for patient safety in clinical and home settings. While bed-rails are designed to prevent falls, their effectiveness relies on correct positioning, which can be monitored automatically with camera-based systems. In this work, we present a comparative study of vision-based approaches for detecting bed-rail state using infrared (IR) and depth cameras. We evaluate three approaches: (i) a classical image-space baseline that extracts linear structures from IR frames; (ii) a 3D geometry pipeline that fits planar or linear models to point-cloud regions; and (iii) a lightweight neural network that classifies rail point clouds. Our main contribution is an occlusion-robust geometric detector based on multi-region RANSAC line fitting with orientation constraints and temporal smoothing, designed to detect bed rails under partial occlusion. Through extensive experiments on a labeled dataset under various occlusions, we analyze the trade-offs among accuracy, robustness, and computational efficiency. Our results show that while traditional methods are fast and effective in ideal scenes, 3D geometry-based and learning-based approaches that operate on point clouds provide better generalization under real-world variability. This study serves as a practical guide for selecting methods for vision-based patient safety monitoring.

Paper Nr: 53
Title:

Simplify Your Fusion: Reducing Complexity for Multimodal Sensor Fusion

Authors:

Thorsten Herd, Philipp Heidenreich and Christoph Stiller

Abstract: In this paper, we present a novel approach to improve runtime efficiency and detection performance of a multimodal sensor fusion network for autonomous driving. Building upon the FusionFormer architecture, we introduce a multiscale attention strategy combining sequential upscaling with multilevel LiDAR and camera feature inputs. Furthermore, we investigate the suitability of deformable attention in sparse LiDAR data and propose a lightweight alternative called buffer attention. Extensive experiments on the nuScenes dataset demonstrate that our method significantly reduces the GPU memory footprint and the inference latency of the bird’s-eye view fusion encoder while improving the detection accuracy. These findings support the development of efficient and scalable fusion architectures for real-world applications.

Paper Nr: 74
Title:

Stylized Synthetic Augmentation Further Improves Corruption Robustness

Authors:

Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock and Andrey Morozov

Abstract: This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common Frechet Inception Distance (FID) metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively.

Paper Nr: 86
Title:

G2TM: Single-Module Graph-Guided Token Merging for Efficient Semantic Segmentation

Authors:

Victor Bercy, Martyna Poreba, Michal Szczepanski and Samia Bouchafa

Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their substantial computational requirements remain a significant limitation, especially in dense prediction tasks. Reducing this burden is a central challenge for deploying ViTs at scale. While token merging has emerged as a promising solution, existing methods often rely on iterative or multiple stages, limiting their efficiency. We introduce Graph-Guided Token Merging (G2TM), a lightweight one-shot module designed to eliminate redundant tokens early in the network. G2TM performs a single merging step right after a shallow attention block, enabling all subsequent layers to operate on a compact token set. Our method constructs a similarity graph over tokens and exploits its connected components to identify groups of semantically redundant patches. Tokens within each connected component are then aggregated into a single representative via feature averaging, which not only reduces redundancy but also preserves semantic coherence across the image. Extensive experiments on semantic segmentation benchmarks show that G2TM reduces computational complexity by 30-45% and accelerates inference by up to 50% compared to a Segmenter baseline, achieving these gains with minimal accuracy drop. Code is available at https://github.com/vbercy/g2tm-segmenter.

Paper Nr: 131
Title:

Group Emotion Recognition Explained! A Human-Centric Approach Using Multi-Modal VGER Dataset and Framework

Authors:

Harinadh Sivaramakrishna Patamsetti, Kamakshya Prasad Nayak, Kamalakar Vijay Thakare, Debi Prosad Dogra, Heeseung Choi, Haesol Park and Ig-Jae Kim

Abstract: Group Emotion Recognition (GER) plays a pivotal role in building human-centric AI systems that can understand, respond to, and support collective human experiences in healthcare, collaborative learning, and human–computer interaction. While emotion analysis of individuals from images has advanced significantly, extending this understanding to small groups in dynamic video settings remains challenging due to temporal variability, occlusion, and the diversity of interpersonal expressions. Generally, humans interact with small groups of people in daily life. The existing GER datasets are often designed for large, crowded scenarios with coarse emotional categories, making them less suitable for modeling subtle social and relational dynamics in everyday human interactions. To bridge this gap, we present the Video Group Emotion Recognition (VGER) dataset, designed around small conversational groups of 2–6 individuals. Unlike prior datasets that contain only labeled emotions, VGER incorporates both group emotion categories and graded levels of cohesion. Cohesion reflects the strength of interpersonal bonding. We further propose a multi-modal framework that combines visual and textual cues with cohesion annotations, enabling more contextually grounded and socially aware group emotion analysis. Our method integrates person-level emotion data, group-level visual descriptors, and semantic representations from a pretrained vision language model, fused through an attention mechanism that incorporates cohesion embeddings. Comprehensive experiments highlight the importance of cohesion in improving recognition performance. The work demonstrates the potential to move GER towards socially intelligent, context-aware AI systems capable of human-level interaction. This work lays the foundation for developing future human-centric applications where understanding group affect is essential for trust, collaboration, and well-being.

Paper Nr: 134
Title:

UMergeNet: Exploring Lightweight Mechanisms for High-Performance Semantic Segmentation

Authors:

Alan Klinger Sousa Alves and Bruno Motta de Carvalho

Abstract: In this paper, we propose UMergeNet, a convolutional neural network designed for semantic segmentation. With approximately 1.7 million trainable parameters, it achieves accuracy comparable to traditional architectures such as U-Net and Attention U-Net, while maintaining faster inference time. UMergeNet employs Axial Convolutions to reduce computational complexity and introduces three intermediate modules, termed MergeBlocks, between the encoder and decoder to enhance information flow. Furthermore, its dual-path encoder–decoder design enables processing across multiple receptive field scales, improving the model’s ability to capture fine structural details. Experimental results demonstrate that Axial Convolutions offer an excellent balance between accuracy and efficiency when compared to standard and atrous convolutions, while Depth-Wise variants further improve inference speed with only a minor loss in accuracy. These findings indicate that UMergeNet is well suited for applications requiring high accuracy, comparable to larger networks, while using fewer computational resources. We also show that Axial Convolutions can be more efficient than atrous or standard convolutions, suggesting their potential for integration into other architectures. The code will be made publicly available at: https://github.com/klingerkrieg/UMergeNet.

Paper Nr: 173
Title:

ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

Authors:

Tony Montes and Fernando Lozano

Abstract: Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, as newer models improve at both tasks, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making that better aligns object references with language model outputs. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes state-of-the-art performance in zero-shot VideoQA, showing enhanced results on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.

Paper Nr: 179
Title:

Multi-Objective Quantum-Inspired Firefly Algorithm for Medical Image Segmentation

Authors:

Alokeparna Choudhury, Sourav Samanta, Sanjoy Pratihar and Oishila Bandyopadhyay

Abstract: In medical image analysis, segmentation is a crucial step because it groups various areas, such as organs, tissues, and lesions, within the image. Different computational methods have been employed to enhance the quality and accuracy of segmentation. Among them, evolutionary optimization is a widely used computational approach for incorporating various segmentation methods, such as thresholding and clustering. The majority of works in the literature focus on a single objective, where the quality of the segmentation is measured based on a single function. In contrast, less work has been carried out considering multiple objectives simultaneously to evaluate the segmentation quality, and very little such work has been reported. The firefly algorithm has become one of the efficient optimization algorithms widely applied in multiple domains. In the present research, a multi-objective quantum-inspired firefly algorithm is presented for segmenting two medical imaging modalities, magnetic resonance imaging and computed tomography, and is shown to perform effectively. The correlation and structural similarity index measure are the two objectives for evaluating segmentation quality. The performance of the proposed quantum-inspired version is also evaluated against a classical multi-objective firefly algorithm based on various segmentation quality parameters, which suggests an improvement in both the segmentation quality and the algorithm’s performance.

Paper Nr: 202
Title:

Human-in-the-Loop Refinement of Zero-Shot Object Detection for Domain-Specific Artwork Datasets

Authors:

Alex Alonso Dalbøl, Christofer Meinecke and Stefan Jänicke

Abstract: This work investigates the use of zero-shot object detection for analyzing Holocaust-related artworks, aiming to develop efficient and interpretable AI tools for cultural heritage professionals. Adopting an interdisciplinary approach that bridges computer vision, participatory design, and historical research, the study introduces CHORUS, a human-in-the-loop annotation framework that allows experts to review, correct, and extend AI-generated detections. The performance of two state-of-the-art zero-shot models, OWLv2 and GroundingDINO, is compared, emphasizing the influence of prompt engineering, label curation, and conservative detection behavior when working with sensitive visual material. The CHORUS prototype was developed in the context of the MEMORISE project and evaluated and tested with associated domain experts and heritage professionals. While current models remain limited in handling symbolic and interpretive content, the study demonstrates how human-guided AI can accelerate dataset creation and improve analytical accuracy in low-resource cultural heritage domains. The paper concludes with design recommendations and outlines future work toward a fully integrated annotation and training pipeline.

Paper Nr: 223
Title:

Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

Authors:

Serin Varghese, Kevin Roß, Fabian Hüger and Kira Maag

Abstract: Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.

Paper Nr: 302
Title:

AVNet: Cross-Spectral Attention-Vision Model for Camouflaged Object Detection in Ecological Conservation

Authors:

Henry O. Velesaca, Andrea Mero, Rafael E. Rivadeneira, Guillermo A. Castillo and Angel D. Sappa

Abstract: This work introduces AVNet, a novel attention-vision architecture for Camouflaged Object Detection (COD), optimized for ecological conservation. The proposed approach integrates RGB-Thermal fusion with the Convolutional Block Attention Module (CBAM) within an encoder-decoder framework, enabling accurate detection of low-contrast and highly camouflaged targets. As an additional contribution, this study introduces the Bimodal Iguana Observational Set (BIOS), comprising 148 camouflaged RGB-Thermal registered image pairs, specifically collected to support COD research in wildlife conservation. Experimental results validate the model’s robustness under challenging real-world conditions. The original code and dataset presented in the study are openly available in GitHub at https://cod-espol.github.io/AVNet.

Paper Nr: 312
Title:

A Simple and Lightweight Model for Person Re-Identification

Authors:

Stuart A. Montes, Edward Cayllahua and Rensso Mora-Colque

Abstract: Person re-identification (Re-ID) remains a challenging problem in computer vision due to significant variations in pose, illumination, and occlusion. Although transformer-based approaches have achieved state-of-the-art results, their computational and memory demands hinder their use in real-time or resource-constrained scenarios. This work investigates the effectiveness of a lightweight Vision Transformer (ViT-Small/16) architecture for supervised person re-identification, combining cross-entropy identity classification with a Triplet-Hard loss to enhance identity discrimination and embedding separation. The model is evaluated on the Market-1501 and DukeMTMC-reID benchmarks, achieving Rank-1 accuracies of 93% and 87.3%, respectively, which is competitive with larger and more complex architectures while maintaining a substantially lower computational cost. Compared to typical transformer-based Re-ID models, our approach achieves a roughly fivefold reduction in GFLOPs and a three to fourfold reduction in parameter count. Overall, the results demonstrate that a compact transformer backbone, when properly optimized, can achieve an effective trade-off between accuracy and efficiency, providing a strong and reproducible baseline for lightweight person re-identification.

Paper Nr: 316
Title:

Real-Time 6DoF Pallet Pose Estimation with Monocular Metric Depth

Authors:

Azuma Miura, Hideaki Uchiyama, Masahiro Yamaguchi, Natsuki Kai, Takahiro Shiroshima and Hideo Saito

Abstract: Accurate 6DoF pallet pose estimation is essential for autonomous forklifts in warehouse material handling. RGB-only approaches often suffer from large translation errors due to inherent monocular depth ambiguity. We propose a real-time 6DoF pallet pose estimator that leverages monocular metric depth estimation. Specifically, we (i) predict metric-scale depth with Depth-Anything-V2, (ii) regularize the pose using ground-plane normal constraints, (iii) perform efficient RGB–D fusion based on AsymFormer, and (iv) improve robustness via virtual camera–based view augmentation. On a real-world warehouse dataset, our method achieves 3.88 cm translation error (43% reduction from baseline) and 1.65° rotation error, with translation accuracy being stable across varying load appearances. Within a practical range of up to 5 m, we obtain a 78.1% success rate under the <5 cm / <3° criterion. The full pipeline runs at 36 FPS on an NVIDIA RTX 4090, enabling real-time operation for autonomous forklift deployment.

Paper Nr: 345
Title:

Geometric Skeletal Distance Learning for Self-Supervised Sign Language Recognition

Authors:

Evangelos G. Sartinas, Dimitrios Kosmopoulos, Emmanouil Z. Psarakis, Kostas Blekos, Bikram Kumar De and Vangelis Metsis

Abstract: We propose a geometry-aware framework for sign language recognition that models articulated skeletal motion through pose-level geometric relationships. The framework integrates multiple complementary distance formulations, including an articulated pose distance (APD) that captures hierarchical kinematic dependencies by modeling local bone orientations. These geometric distances guide the training of a self-supervised, distance-preserving embedding network, producing compact motion representations that are subsequently fused with raw skeletal features and processed by a Transformer-based recognition model. Extensive experiments on the WLASL-100 and WLASL-300 benchmarks using MediaPipe skeletal features demonstrate that incorporating geometry-aware embeddings substantially improves recognition performance, achieving up to a +12% Top-1 gain over Transformer baselines trained solely on joint coordinates. Overall, the results highlight the value of geometry-aware representations and articulated pose modeling for robust and generalizable sign language recognition.

Short Papers
Paper Nr: 32
Title:

Domain-Specific Synthetic Data Generation for Person Re-Identification in Public Transport

Authors:

Perry op ’t Landt, Ulrich Krispel and Torsten Ullrich

Abstract: Person re-identification (person Re-ID) refers to the task of locating a person of interest across multiple cameras and has seen substantial progress since the inception of deep learning. However, models trained on large-scale, publicly available datasets often struggle to generalize to unseen target domains. This domain gap is particularly pronounced in public transport environments, where strong variations in viewpoints, poses, lighting conditions, and occlusions are common. Collecting and annotating a large, diverse, and domain-specific dataset is expensive, time-consuming, and raises privacy concerns. To address these issues, we present a synthetic data generation pipeline specifically adapted for a bus environment, aimed at improving generalization for person Re-ID models. We import randomly generated 3D human characters into Unity and simulate a realistic bus surveillance scenario. Utilizing Unity’s Perception package, we automatically capture and annotate an extensive synthetic dataset. We also collect a small real-world test dataset within an actual bus to evaluate our approach. Our experiments demonstrate that training with our domain-specific synthetic dataset already significantly outperforms a baseline model (up to 20% in CMC Rank-1) trained exclusively on publicly available, generic real-world data (Market-1501). By combining the synthetic and real-world data with augmentation for training, we further improve Re-ID performance by 8.4% CMC Rank-1, showcasing the potential of synthetic data to bridge domain gaps and facilitate robust Re-ID in challenging, privacy-sensitive environments.

Paper Nr: 33
Title:

Order-Level Arthropod Detection Using Deep Learning: Addressing Scale Variability through Synthetic Data

Authors:

Chenchang Liu, Svetlana Ionova, Patrick Mäder and Marco Seeland

Abstract: Monitoring wild arthropods like insects is vital for biodiversity research and ecological conservation. Manual analysis of arthropod images from natural environments is labor-intensive and time-consuming. As arthropods vary widely in shape, size, and position, it is challenging both to comprehensively collect images and to accurately recognize arthropod instances in ecological contexts. To address this challenge, we propose a deep learning approach for efficient order-level arthropod detection, focusing on seven common arthropod orders. Given the distribution of relative arthropod sizes in the images of a given dataset, we hypothesize that detection performance may vary across object scales. Therefore, we conduct a systematic evaluation of our model's performance across both arthropod orders and their relative size in an image. We identify several recognition gaps in certain orders and scales, and propose the generation of composite images to augment the training set, focusing on weakly performing orders and scales. Experiments show that our approach achieves high average precision across multiple arthropod orders and object scales, with notable gains at the targeted scales when composite images are incorporated. The best results were achieved by augmenting each image with two corresponding composite images. Overall, all seven arthropod orders show consistent improvements, with the largest gains for Hemiptera (+2.28%) and Hymenoptera (+2.19%). These results highlight the potential of targeted composite image generation as a way to overcome data limitations and strengthen deep learning for fine-grained ecological monitoring.
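
The composite-image idea, pasting object crops onto background images at targeted positions and scales, can be sketched as below; the function name, nested-list image format, and placement policy are illustrative assumptions, not the authors' pipeline:

```python
def paste_object(background, crop, top, left):
    """Paste a small object crop onto a copy of a background image.

    Images are nested lists of pixel values (rows of columns); the crop is
    placed with its upper-left corner at (top, left). Real composite
    generation would also handle blending and multi-channel pixels.
    """
    out = [row[:] for row in background]  # copy so the background is untouched
    for i, row in enumerate(crop):
        for j, px in enumerate(row):
            out[top + i][left + j] = px
    return out

# Usage: place a 2x2 "arthropod" crop on a 4x4 background.
bg = [[0] * 4 for _ in range(4)]
crop = [[9, 9], [9, 9]]
composite = paste_object(bg, crop, top=1, left=2)
```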

Paper Nr: 39
Title:

Enhancing Temporal Stability in Small Object Detection: A Post-Processing Approach for YOLOv8

Authors:

Junyu Xiao and Shohei Yokoyama

Abstract: Detecting and tracking small, fast-moving objects in video remains challenging due to blur, occlusion, and limited resolution. Although YOLOv8 performs well in general detection tasks, it struggles with small-object scenarios such as ball sports. We propose a non-structural post-processing framework that combines detection cleaning, including IoU and motion-consistency constraints and short-term true-positive (TP) protection, with interpolation-based temporal smoothing using linear and spline interpolation. This method enhances recall by recovering missed detections and reduces flickering by enforcing temporal consistency, without modifying the underlying detector. Experiments on a 360-degree badminton video dataset demonstrate improved detection stability and small-object recall with an average improvement of 18.56%, offering a lightweight and practical solution for real-world applications.
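
The interpolation-based recovery of missed detections can be sketched as follows; this is a minimal linear-interpolation stand-in (function name and gap policy are assumptions), whereas the paper also uses spline interpolation:

```python
def interpolate_track(track):
    """Fill None gaps in a per-frame (x, y) track by linear interpolation.

    track: list of (x, y) detections, with None for frames where the
    detector missed the object. Gaps between two known detections are
    filled linearly; leading/trailing gaps are left untouched.
    """
    out = list(track)
    known = [i for i, p in enumerate(out) if p is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)  # fractional position inside the gap
            out[i] = (out[a][0] + t * (out[b][0] - out[a][0]),
                      out[a][1] + t * (out[b][1] - out[a][1]))
    return out

# A missed detection at frame 1 is recovered between frames 0 and 2.
filled = interpolate_track([(0.0, 0.0), None, (2.0, 4.0)])
```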

Paper Nr: 46
Title:

A Parametric 3D Bee Model for Scalable Synthetic Data Generation in Animal Behavior Studies

Authors:

Christian Matos Rivera, Remi Megret and David Flores

Abstract: Deep learning models for computer vision require large amounts of annotated data, which is particularly challenging to acquire in animal behavioral research due to resource constraints and ethical considerations. This paper addresses data scarcity in bee monitoring applications by developing a parametric 3D honeybee model for synthetic dataset generation using game engine technology. Our methodology combines a pre-built model with a novel parametrization system, creating a hybrid approach that enables morphological parametrization, paint code customization, and pollen modeling through a unified data structure compatible with the replicANT platform in Unreal Engine. Evaluation of synthetic datasets for bee detection using YOLOv8 demonstrated that adding synthetic data to small real training sets (50-100 images) improved detection accuracy to 0.97 mAP50, significantly outperforming models trained on equivalent real data alone. For bee re-identification tasks using paint codes, synthetic data enabled color estimation with an average color distance of 0.29 in [0,1] normalized RGB space, showing promising transfer from synthetic to real domains despite remaining domain gaps. Our results demonstrate that high-quality 3D models can serve as effective pre-training foundations, reducing real data requirements while maintaining competitive performance and advancing ecological monitoring through scalable synthetic data generation.

Paper Nr: 58
Title:

INFORM: A Food Monitoring and Tracking System for Sustainable Healthcare Facilities

Authors:

James Rainey, John Wannan, Douglas MacLachlan, Boguslaw Obara and Deepayan Bhowmik

Abstract: Food recognition has many applications, from monitoring diet to tracking nutritional intake. In hospital and care environments, nutritional intake is often tracked manually using pen and paper to record the results. This makes larger-scale monitoring difficult as it requires a lot of time and effort to cover many patients. In this paper, we present an INtelligent FOod Recognition and Monitoring (INFORM) system, the first end-to-end system based on Edge AI for analysing food consumption and plate waste in hospitals. The system combines an array of sensors with an embedded processor to capture images and run a state-of-the-art image recognition pipeline. This accurately identifies individual food items on a plate, estimates their weight, temperature, and nutritional content, and determines whether they are suitable to be eaten. This work moves beyond a research prototype by translating the underlying methods into a fully engineered product, addressing real-world challenges from both hardware constraints and practical obstacles. Deploying the system in operational healthcare settings introduces practical challenges such as inconsistent lighting, environmental variations, and height constraints due to integration with existing food service equipment, as well as hardware constraints around computation, durability, and calibration. These implementation challenges are often overlooked in a research setting but are critical in achieving good performance in practice. By addressing these factors, the INFORM system enables the monitoring of patient food and nutrient intake while also supporting food waste management and food quality monitoring. The system delivers actionable feedback at every stage of the food journey, demonstrating a pathway from research innovation to a pragmatic healthcare technology.

Paper Nr: 65
Title:

Estimating Person Positions Using a Camera and Wireless Devices in a Space with Temporary Shielding

Authors:

Yuji Makimura, Daichi Hayashi, Teppei Kobiki, Shouki Sakemoto and Masashi Nishiyama

Abstract: Camera sensors can generally estimate a person’s position accurately. Here, we consider the case where the camera cannot estimate a person’s position because its field of view is unavoidably blocked by temporary shielding. In recent years, several person position estimation methods have been proposed based on the use of wireless devices that emit radio waves capable of penetrating the shielding. However, to estimate a person’s position accurately using wireless devices, we must also consider how we collect the training samples, which consist of pairs of radio wave strength and person position data, because the radio signals acquired from wireless devices cannot be annotated intuitively by humans in the same way as camera images. Our method collects training samples automatically for a regression model that estimates a person’s position when there is no shielding via a collaboration between the camera and the wireless devices. Then, our method can estimate person positions using the model, even when temporary shielding occurs. The experimental results obtained when temporary shielding occurred show that the person position estimation error was 15.9 cm when one person was walking in a poster space and 18.7 cm when two people were walking in the space simultaneously.
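
A regression from radio-signal strengths to position could look like the following k-nearest-neighbour sketch; the paper does not specify its regression model, so the k-NN choice and all names here are our assumptions. The training pairs stand in for the camera/wireless samples collected automatically:

```python
def knn_position(samples, query, k=3):
    """Estimate a 2D position from a vector of radio-signal strengths.

    samples: list of (rssi_vector, (x, y)) training pairs.
    query: the rssi_vector observed when the camera view is blocked.
    Returns the average position of the k nearest training vectors.
    """
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    nearest = sorted(samples, key=lambda s: dist(s[0], query))[:k]
    xs = [p[0] for _, p in nearest]
    ys = [p[1] for _, p in nearest]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Three hypothetical training samples (two transmitters), one query.
samples = [((-40.0, -70.0), (0.0, 0.0)),
           ((-70.0, -40.0), (4.0, 0.0)),
           ((-55.0, -55.0), (2.0, 0.0))]
est = knn_position(samples, (-41.0, -69.0), k=1)
```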

Paper Nr: 66
Title:

Synthetic Data in the Context of Automotive In-Cabin: A Review

Authors:

Julia Frick, Patrick Laufer, Roman Seidel and Gangolf Hirtz

Abstract: Recent advancements in the automotive industry, such as autonomous driving, and increasingly stringent safety requirements, have elevated the significance of in-cabin monitoring. Training neural network models designed to perform in-cabin classification, segmentation, and depth estimation through sensor data analysis requires a large amount of data. Generating this data in real-world scenarios poses several challenges, including high costs, time consumption, and privacy concerns. Synthetic data offers an alternative that overcomes these issues. The objective of this literature review is to analyze and compare existing studies on the generation of synthetic in-cabin data. Both the methodologies employed and any evaluations conducted on the generated synthetic data are examined. Although numerous contributions have been made in this area, there are still notable gaps. The findings show that most studies focus on poses, with some expanding to encompass general seat occupancy by individuals and objects, tracking of head and hand movements, shape estimation, and detection of hands on the wheel. The approaches for evaluating synthetic data successfully demonstrate beneficial applications; however, their variety and the diverse scenarios presented make a direct comparison difficult.

Paper Nr: 69
Title:

Rendered vs AI-Generated vs Real Data: Training Industrial Object Detection Models with Multi-Source Data

Authors:

Chafic Abou Akar, Hadi Koubeissy, Joe Khalil, Jimmy Tekli, Marc Kamradt and Abdallah Makhoul

Abstract: Deep learning (DL) and computer vision (CV) applications allow industrial robots to better understand their surroundings and act accordingly. For instance, object detection arises in various robotic tasks such as smart navigation, robotic grasping, and visual inspection. Nevertheless, given the need for large training datasets, synthetic data addresses the challenges of acquiring real industrial images. In addition, synthetic data brings new variations that improve the generalization of DL models. In this paper, we investigate the performance of industrial synthetic data with different levels of image realism. We trained several detection models using real images augmented with synthetic data produced by different data sources: simulation and generative models. We emphasize (1) the impact of image realism in data augmentation, and (2) the influence of structured and semantic content variations instead of random scene setups or text-to-image prompts. Furthermore, while most contributions focus on employing a single source of image synthesis, we propose novel best practices for adopting multi-source synthetic data to train detection models, reaching the highest mAP by a margin of 3%.

Paper Nr: 156
Title:

Long-Tailed Species Recognition in the NACTI Wildlife Dataset

Authors:

Zehua Liu and Tilo Burghardt

Abstract: Like most “in the wild” data collections of the natural world, the North America Camera Trap Images (NACTI) dataset shows severe long-tailed class imbalance, noting that the largest ‘Head’ class alone covers >50% of the 3.7M images in the corpus. Building on the PyTorch Wildlife model, we present a systematic study of LongTail Recognition methodologies for species recognition on the NACTI dataset covering experiments on various LTR loss functions plus LTR-sensitive regularisation. Our best configuration achieves 99.40% Top-1 accuracy on our NACTI test data split, substantially improving over a 95.51% baseline using standard cross-entropy with Adam. This also improves on previously reported top performance in MLWIC2 at 96.8%, albeit using partly unpublished (potentially different) partitioning, optimiser, and evaluation protocols. To evaluate domain shifts (e.g. night-time captures, occlusion, motion-blur) towards other datasets we construct a Reduced-Bias Test set from the ENA-Detection dataset, where our experimentally optimised long-tail enhanced model achieves leading 52.55% accuracy (up from 51.20% with WCE loss), demonstrating stronger generalisation capabilities under distribution shift. We document the consistent improvements of LTR-enhancing scheduler choices in this NACTI wildlife domain, particularly when in tandem with state-of-the-art LTR losses. We finally discuss qualitative and quantitative shortcomings that LTR methods cannot sufficiently address, including catastrophic breakdown for ‘Tail’ classes under severe domain shift. For maximum reproducibility we publish all dataset splits, key code, and full network weights at https://github.com/ZehuaLiuY/Species-Classification.

Paper Nr: 162
Title:

Retail Shelf Monitoring Using Deep Hough Transform and Object Detection

Authors:

Luiz Fernando Merli de Oliveira Sementille, Marcos Cleison Silva Santana, Douglas Rodrigues, Danilo Samuel Jodas, Kelton Augusto Pontara da Costa, Fabio Luiz de Oliveira and João Paulo Papa

Abstract: The presence of products on supermarket shelves is an important element for both customer satisfaction and the profitability of retail businesses. However, there are several challenges related to out-of-stock situations, particularly for computer vision approaches. This paper presents a vision-based approach for automatic shelf gap detection to enable rapid product replacement. The method combines object detection and semantic line segmentation to identify products and shelf rows, respectively, whose outputs feed a mask fusion strategy. The method also extends product bounding boxes to the adjacent upper and lower shelves, producing a binary mask that identifies candidate gaps as empty spaces on the shelves. The proposed method addresses several challenges in retail environments, including varying product arrangements and shelf configurations. The reported results reveal the accuracy and robustness of a simple, state-of-the-art method for estimating empty spaces using calculations from the geometry of bounding boxes.
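
The bounding-box geometry idea can be reduced to a one-dimensional sketch: given a shelf row's horizontal extent and the product boxes detected on it, candidate gaps are the uncovered intervals. This is a simplified stand-in for the paper's mask fusion, with names and the minimum-width threshold assumed:

```python
def shelf_gaps(shelf_span, product_boxes, min_width=1):
    """Find empty horizontal intervals on one shelf row.

    shelf_span: (x0, x1) extent of the shelf line.
    product_boxes: list of (x0, x1) product extents from the detector.
    Returns intervals not covered by any product, at least min_width wide.
    """
    x0, x1 = shelf_span
    gaps, cursor = [], x0
    for b0, b1 in sorted(product_boxes):
        if b0 - cursor >= min_width:
            gaps.append((cursor, b0))
        cursor = max(cursor, b1)
    if x1 - cursor >= min_width:
        gaps.append((cursor, x1))
    return gaps

# Two products on a shelf spanning x=0..10 leave two candidate gaps.
gaps = shelf_gaps((0, 10), [(0, 3), (5, 8)])
```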

Paper Nr: 163
Title:

Improving Underwater Fish Detection and Tracking via Temporal Modeling and Deformable Convolutions

Authors:

Josep Sánchez, Jose-Luis Lisani and Antoni Bennasar Garau

Abstract: Underwater fish detection and tracking remain challenging due to the complex visual conditions of aquatic environments, including turbidity, occlusion, and dynamic lighting. This work presents two architectural enhancements to the YOLOX detector aimed at improving temporal consistency and spatial adaptability for underwater video analysis. The first introduces multi-frame input to capture short-term motion cues, while the second integrates deformable convolutions to better handle non-rigid object deformation. Both modifications maintain real-time inference while enhancing detection robustness. Experiments conducted on the FishCLEF 2015 dataset and a custom underwater tracking benchmark demonstrate that multi-frame input significantly improves detection stability and tracking accuracy, whereas deformable convolutions enhance spatial precision under occlusion and motion blur. When coupled with the Deep OC-SORT tracker, the proposed framework achieves superior identity preservation and smoother trajectories. These results highlight the benefits of embedding temporal and motion awareness directly within detection architectures for efficient underwater fish tracking and monitoring applications.

Paper Nr: 189
Title:

Hazard-Aware Duration Prior and F1-Adaptive Loss for Temporal Action Segmentation

Authors:

Hinako Mitsuoka and Kazuhiro Hotta

Abstract: Temporal Action Segmentation (TAS) is a challenging task due to the severe class imbalance and complex temporal dynamics of human activities. Most existing methods optimize frame-wise predictions using the standard Cross-Entropy (CE) loss, often with auxiliary smoothing losses. Although these encourage temporal consistency, they do not fundamentally improve CE itself, which remains sensitive to class imbalance and unable to capture the duration structure of actions. In this paper, we propose a novel duration-aware and performance-adaptive loss function as a direct replacement for CE, specifically designed to overcome these limitations. An adaptive weighting scheme dynamically adjusts class importance based on real-time model performance. By tracking class-wise F1-scores during training, it automatically emphasizes underperforming classes. In parallel, a duration-hazard prior, derived from survival analysis of action lengths, effectively models the persistence probability of each class to regularize temporal dynamics and suppress spurious transitions. Together, these mechanisms yield robust, balanced, and temporally coherent segmentation without requiring any architectural modification. On three benchmark datasets, our method achieved up to +3.2% F1, +6.6% Edit, and +1.7% frame accuracy improvement over the same backbones trained with CE.
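
The adaptive weighting scheme can be illustrated with a short sketch that turns running per-class F1-scores into loss weights; the exact formula below (inverse F1 with a smoothing constant, normalised to mean 1) is our assumption, not the paper's definition:

```python
def f1_adaptive_weights(f1_scores, eps=0.05):
    """Turn running per-class F1-scores into per-class loss weights.

    Classes with low F1 receive proportionally larger weights, so the
    loss automatically emphasizes underperforming classes. eps avoids
    division by zero for classes with F1 = 0.
    """
    raw = [1.0 / (f + eps) for f in f1_scores]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]  # normalised so weights average to 1

# A well-recognised class (F1=0.95) vs. a struggling one (F1=0.45).
weights = f1_adaptive_weights([0.95, 0.45])
```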

Paper Nr: 192
Title:

Kinship Verification with Custom Sampling and Hard Contrastive Loss

Authors:

Warley Barbosa and Tiago Vieira

Abstract: Facial Kinship Verification (FKV) determines biological relatedness from facial features, a task hindered by subtle hereditary cues and dataset limitations. We propose a method combining a custom batch sampler and Hard Contrastive Loss (HCL) on an AdaFace backbone. The sampler enforces family uniqueness and prioritizes diverse and difficult samples, while HCL emphasizes hardest negatives. On the FIW dataset, our approach achieves state-of-the-art accuracy (82.2% T1, 86.5% T2) and improves Rank@5 retrieval by 7.9 points. Ablations confirm the synergy between batch composition and HCL, with best results at τ ≈ 0.3. Code will be available online.
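
The "hardest negatives" emphasis of HCL can be sketched as below: for an anchor embedding, pick the closest non-kin embedding and apply a hinge penalty. Names and the single-pair hinge form are illustrative assumptions; the real loss operates over batches built by the custom sampler:

```python
def hardest_negative(anchor_emb, negatives, margin=0.3):
    """Pick the hardest (closest) non-kin embedding for an anchor.

    Embeddings are plain float tuples; distance is Euclidean. Returns
    the hardest negative and its hinge loss term: the penalty is
    nonzero only when the negative sits inside the margin.
    """
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    hard = min(negatives, key=lambda n: dist(anchor_emb, n))
    loss = max(0.0, margin - dist(anchor_emb, hard))
    return hard, loss

# The negative at distance 0.1 is harder than the one at distance 1.0.
hard, loss = hardest_negative((0.0, 0.0), [(1.0, 0.0), (0.1, 0.0)])
```

Note that the paper's best temperature was τ ≈ 0.3, which motivated the default margin value used here purely for illustration.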

Paper Nr: 203
Title:

Improved MambaBDA Framework for Robust Building Damage Assessment across Disaster Domains

Authors:

Alp Eren Gençoğlu and Hazım Kemal Ekenel

Abstract: Reliable post-disaster building damage assessment (BDA) from satellite imagery is hindered by severe class imbalance, background clutter, and domain shift across disaster types and geographies. In this work, we address these problems and explore ways to improve MambaBDA, the BDA network of the ChangeMamba architecture and one of the most successful BDA models. The approach enhances MambaBDA with three modular components: (i) Focal Loss to mitigate class imbalance in damage classification, (ii) lightweight Attention Gates to suppress irrelevant context, and (iii) a compact Alignment Module to spatially warp pre-event features toward post-event content before decoding. We experiment on multiple satellite imagery datasets, including xBD, Pakistan Flooding, Turkey Earthquake, and Ida Hurricane, and conduct in-domain and cross-dataset tests. The proposed modular enhancements yield consistent improvements over the baseline model, with 0.8% to 5% performance gains in-domain, and up to 27% on unseen disasters. This indicates that the proposed enhancements are especially beneficial for the generalization capability of the system.

Paper Nr: 215
Title:

Visibility-Gated ConvGRU for Robust Multi-View Pedestrian Detection and Tracking in BEV Space

Authors:

Taigo Sakai, Takeshi Nakamura, Hiroshi Shimizu and Kazuhiro Hotta

Abstract: Multi-view pedestrian detection and tracking in bird’s-eye-view (BEV) space enable robust perception under heavy occlusion. However, existing one-shot tracking frameworks such as TrackTacular process each frame independently, lacking temporal reasoning. Moreover, recurrent approaches like ConvGRU or ConvLSTM assume uniform visibility and often overwrite reliable memory with uncertain features from unobserved regions. To address these issues, we propose a Visibility-Gated ConvGRU, which integrates temporal information in BEV space while selectively updating visible regions based on multi-camera visibility maps. This mechanism prevents memory corruption and preserves stable states even in occluded areas. Experiments on WildTrack and MultiViewX datasets demonstrate that our method achieves consistent improvements over state-of-the-art methods, improving both MODA and IDF1 by approximately 2%, confirming the importance of visibility-aware temporal integration for reliable multi-camera tracking.
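
The gating rule the abstract describes, updating memory only where cameras actually observe the scene, can be sketched per BEV cell; the flat-list state and function name are simplifying assumptions (the real module operates on ConvGRU feature maps):

```python
def visibility_gated_update(h_prev, h_new, visibility):
    """Update a BEV memory state only where the scene is observed.

    h_prev, h_new: flat lists of state values (previous memory and the
    candidate update). visibility: per-cell weights in [0, 1] from the
    multi-camera visibility map. Occluded cells (v = 0) keep their
    previous state, preventing memory corruption from unobserved regions.
    """
    return [v * n + (1.0 - v) * p
            for p, n, v in zip(h_prev, h_new, visibility)]

# Cell 0 is occluded (keeps its memory); cell 1 is visible (updates).
out = visibility_gated_update([1.0, 1.0], [5.0, 5.0], [0.0, 1.0])
```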

Paper Nr: 243
Title:

Gabor-Guided Tiny Bird Detection for Drone-Based Agricultural Bird Deterrence

Authors:

Mathijs Lens and Toon Goedemé

Abstract: Birds cause substantial damage to crops in agricultural settings. However, existing deterrence methods, such as lasers, acoustic devices, and scarecrows, are often ineffective over time, as birds tend to habituate to repetitive stimuli. Localizing birds and guiding deterrence systems directly to their positions offers a more robust and sustainable solution. Long-distance detection of birds from a single high-mounted camera is challenging because they occupy very few pixels in the wide field-of-view, and stationary birds are nearly invisible to both humans and automated detectors. However, wing flapping provides a distinctive motion pattern that can be exploited for detection. We propose a lightweight pipeline that uses a 3D Gabor filter as a computationally efficient spatiotemporal attention mechanism to highlight regions with wing-flapping motion, followed by a YOLO detector applied only on these high-response patches. This Gabor-first approach reduces false positives, lowers inference cost, and improves tiny bird detection compared to baseline methods. We describe our 3D Gabor preprocessing, patch selection strategy, and integration with a YOLO backbone, and provide ablation studies demonstrating trade-offs between computational cost and detection performance on the FBD-SV dataset and real-world videos.
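
The principle behind the Gabor-first attention can be shown in one dimension: correlating a pixel's intensity-over-time signal with a temporal Gabor kernel tuned to a flapping frequency gives a strong response for oscillating pixels and a weak one for static ones. This 1D sketch is a stand-in for the paper's 3D spatiotemporal filter; the kernel parameters and names are assumptions:

```python
import math

def gabor_response(signal, freq, sigma=2.0, radius=4):
    """Correlate a temporal intensity signal with a 1D Gabor kernel.

    signal: list of pixel intensities over frames. freq: target motion
    frequency in cycles per frame. The kernel is a Gaussian-windowed
    cosine; large responses indicate oscillation near freq.
    """
    kernel = [math.exp(-(t * t) / (2 * sigma * sigma))
              * math.cos(2 * math.pi * freq * t)
              for t in range(-radius, radius + 1)]
    return [sum(signal[i + t] * kernel[t + radius]
                for t in range(-radius, radius + 1))
            for i in range(radius, len(signal) - radius)]

# A "flapping" pixel alternates each frame (0.5 cycles/frame);
# a static pixel stays constant. Only the former resonates.
resp_flap = gabor_response([(-1.0) ** i for i in range(20)], freq=0.5)
resp_flat = gabor_response([1.0] * 20, freq=0.5)
```

In the full pipeline, patches whose response exceeds a threshold would be the only ones forwarded to the YOLO detector.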

Paper Nr: 268
Title:

Billboard in Focus: Estimating Driver Gaze Duration from a Single Image

Authors:

Carlos Pizarroso, Zuzana Berger Haladová, Zuzana Černeková and Viktor Kocur

Abstract: Roadside billboards represent a central element of outdoor advertising, yet their presence may contribute to driver distraction and accident risk. This study introduces a fully automated pipeline for billboard detection and driver gaze duration estimation, aiming to evaluate billboard relevance without reliance on manual annotations or eye-tracking devices. Our pipeline operates in two stages: (1) a YOLO-based object detection model, trained on Mapillary Vistas and fine-tuned on BillboardLamac images, which achieved 94% mAP50 in the billboard detection task; and (2) a classifier based on the detected bounding box positions and DINOv2 features. The proposed pipeline enables estimation of billboard driver gaze duration from individual frames. We show that our method achieves 68.1% accuracy on BillboardLamac when considering individual frames. These results are further validated using images collected from Google Street View.

Paper Nr: 278
Title:

SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition

Authors:

Alessia Micieli, Giovanni Maria Farinella and Francesco Ragusa

Abstract: In this work we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints and RGB frames can influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We will release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.

Paper Nr: 279
Title:

Adversarial Robustness of Proxy-Based Metric Learning Models

Authors:

Marcin Maciąg and Grzegorz Sarwas

Abstract: This paper presents a systematic evaluation of the adversarial robustness of modern Deep Metric Learning (DML) objectives under multiple threat models. We focus on proxy-based and hierarchical loss functions, including Proxy Anchor, SoftTriple, and Hierarchical loss, and compare them against standard pair-based and classification-based baselines trained with Contrastive and Cross-Entropy losses across three retrieval benchmarks—Stanford Online Products (SOP), CARS-196, and CUB-200-2011—covering both large-scale and fine-grained recognition scenarios. Across all datasets, proxy-based and hierarchical objectives consistently achieve superior clean retrieval performance in terms of Recall@1 and Recall@2, confirming their ability to induce compact and discriminative embedding spaces. Robustness is evaluated under three white-box adversarial attacks: FGSM, Projected Gradient Descent (PGD) with ℓ∞ constraints, and the optimisation-based Carlini & Wagner (C&W) ℓ2 attack. Under ℓ∞-bounded attacks, proxy-based and hierarchical models exhibit modest but consistent robustness advantages over contrastive and cross-entropy baselines, degrading more gracefully under iterative PGD. However, this robustness is strictly attack-dependent: across all datasets and loss functions, the C&W attack proves near-universally effective, collapsing retrieval performance to negligible levels. These findings reveal a clear gap between clean retrieval performance and adversarial resilience in current DML frameworks, demonstrating that improved embedding geometry alone is insufficient to guarantee robustness against strong optimisation-based adversaries.

Paper Nr: 288
Title:

EventAction: Vision Mamba-Based Event-Driven Action Recognition

Authors:

Qingyu Wang, Xingzhen Song, ChungSheng Chang, Feiyu Ge, Eisuke Sato and Masato Tsukada

Abstract: Event cameras, inspired by biological vision, provide asynchronous brightness-change streams with high temporal resolution, low latency, and low power consumption. These advantages make them ideal for challenging scenes where RGB cameras struggle. However, conventional Human Action Recognition (HAR) methods relying on dense frames or optical flow cannot directly handle the sparse, edge-driven nature of event data. To address this limitation, we propose EventAction, a two-stage event-based HAR framework. The first stage, EventMambaPose, introduces a novel Vision Mamba–based event pose estimator enhanced with a MamLSTM module to capture robust spatiotemporal dependencies from sparse event data. The second stage, MambaAction, extends the framework to skeleton-based action recognition by incorporating positional encoding and inverted 3D convolutions for efficient temporal–spatial modeling. Experiments on public datasets CDEHP and DHP19 show that EventMambaPose achieves state-of-the-art results in event-based pose estimation, while MambaAction surpasses existing baselines on CDEHP with higher accuracy. These results confirm the effectiveness of our event-driven two-stage architecture.

Paper Nr: 292
Title:

Generalizable Hyperparameter Optimization for Federated Learning on Non-IID Cancer Images

Authors:

Elisa Gonçalves Ribeiro, Rodrigo Moreira, Larissa Ferreira Rodrigues Moreira and André Ricardo Backes

Abstract: Training deep learning models for cancer histopathology conflicts with privacy constraints in clinical settings. Federated Learning (FL) mitigates this by keeping data local; however, its performance depends on hyperparameter choices under non-independent and identically distributed (non-IID) client datasets. This paper examines whether hyperparameters optimized on one cancer imaging dataset generalize across non-IID federated scenarios. We consider binary histopathology tasks for ovarian and colorectal cancers. We perform centralized Bayesian hyperparameter optimization and transfer dataset-specific optima to the non-IID FL setup. The main contribution of this study is a simple cross-dataset aggregation heuristic that combines configurations by averaging the learning rates and taking the modal optimizers and batch sizes. This combined configuration achieves competitive classification performance.
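
The cross-dataset aggregation heuristic, average the learning rates, take the mode of the optimizers and batch sizes, is simple enough to sketch directly; the dict keys and example values are assumptions for illustration:

```python
from statistics import mode

def combine_configs(configs):
    """Combine per-dataset hyperparameter optima into one configuration.

    configs: list of dicts with 'lr', 'optimizer', and 'batch_size'.
    Learning rates are averaged; the optimizer and batch size are taken
    as the most common (modal) values across datasets.
    """
    return {
        "lr": sum(c["lr"] for c in configs) / len(configs),
        "optimizer": mode(c["optimizer"] for c in configs),
        "batch_size": mode(c["batch_size"] for c in configs),
    }

# Hypothetical optima from three dataset-specific searches.
cfg = combine_configs([
    {"lr": 1e-3, "optimizer": "adam", "batch_size": 32},
    {"lr": 3e-3, "optimizer": "adam", "batch_size": 64},
    {"lr": 2e-3, "optimizer": "sgd",  "batch_size": 32},
])
```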

Paper Nr: 340
Title:

Comparative Evaluation of Vision Transformer Architectures for Video-Based Weather Intensity Recognition

Authors:

Yuma Kokubu and Tomokazu Ishikawa

Abstract: This paper argues that task-aligned architectural design is fundamentally more important than sophisticated pre-training for domain-specific tasks. Using weather intensity recognition as a case study, we reveal unexpected limitations of recent pre-training strategies (VideoMAE, EVA-02) through systematic evaluation on the VARG dataset. MViTv2, employing multi-scale hierarchical processing, achieves 91.00% hamming match and 67.34% exact match, while advanced pre-trained models show limited effectiveness. Through attention visualization and per-weather-type evaluation, we demonstrate why certain pre-training strategies fail: (1) masking-based reconstruction is unsuitable for diffuse weather patterns, and (2) language-aligned representations prioritize semantic understanding over quantitative visual features. These findings challenge the assumption that advanced pre-training universally transfers to domain-specific tasks and provide actionable guidance for architecture selection in safety-critical weather recognition systems.

Paper Nr: 17
Title:

HyperNut: Hyper Spectral Dataset of Nuts for Unsupervised Defect Detection and Segmentation

Authors:

Afshin Dini, Farnaz Delirie and Esa Rahtu

Abstract: Hyperspectral Imaging (HSI), which provides detailed information across many spectral bands, is a suitable candidate for detecting defects in real-world applications, an increasingly active topic in computer vision. We introduce the HyperNut dataset, containing hyperspectral images of almonds and pistachios in the visible and near-infrared (VIS-NIR) range (400nm-1000nm). The dataset contains non-anomalous samples that can be used for training unsupervised approaches and defective samples for testing purposes. To the best of our knowledge, our dataset is the only one in the literature that (a) allows a thorough analysis of nut quality by providing different types of defective samples, (b) provides real-world samples containing multiple objects and accounting for noise and variable environmental conditions during sampling, and (c) allows defect segmentation by providing masks indicating the exact locations of defects in samples. Moreover, we have tested basic anomaly detection methods on the hyperspectral data and the related RGB images and compared the results to show that hyperspectral images are suitable candidates for defect detection problems.

Paper Nr: 22
Title:

Trade-Offs of Contrast Enhancement and Denoising for Low-Resolution Images: An Empirical Study on DIV2K

Authors:

Morales García Emmanuel, Domínguez Isidro Saúl, Rojano Cáceres José Rafael and Cruz López Cecilia

Abstract: In various applications, such as computer vision, medicine, and remote sensing, the quality of the analyzed images is often limited by factors such as the image capture method and the equipment used, which frequently results in low-resolution images. This research analyzed different image processing techniques for low-resolution images with the aim of improving their quality. For this purpose, images from the DIV2K dataset were used. Two groups of filters were considered: the first, which enhances contrast (such as histogram equalization, adaptive equalization, and CLAHE); and the second, which smooths the image (such as bilateral filtering and Gaussian blurring). Finally, metrics such as entropy, contrast, noise level, brightness, sharpness, and edge detection were used to evaluate the quality of the processed images. For enhancing details in low-resolution images, adaptive equalization and the CLAHE algorithm are recommended. For noise reduction and edge detection, Gaussian smoothing methods are more suitable.

Paper Nr: 27
Title:

Robust Cell Segmentation in Urine Cytology Images for Bladder Cancer Diagnosis

Authors:

Mariana L. Teixeira, Hugo S. Oliveira, Raquel L. Monteiro, Daniela Ferreira, Tania Pereira, Raphaël F. Canadas and Hélder P. Oliveira

Abstract: In bladder cancer (BCa) diagnostics, cystoscopy is invasive and costly, while urine cytology offers a non-invasive alternative. Accurate cell segmentation in bright-field microscopy remains a challenge for automation. We benchmark classical and deep learning methods using the CELLo dataset, a private collection of bright-field images from commercial cell lines and clinical samples at CHUSJ, Porto, Portugal, covering diverse urothelial morphologies from healthy individuals and BCa patients. Classical methods (Otsu thresholding, hybrid pipelines, Cellpose) are compared with DeepCell, a CNN model enhanced with self-supervised Vision Transformer (DINO) features. Quantitative and qualitative evaluations assess contour consistency, segmentation fidelity, and morphological plausibility. Hybrid methods effectively handle noise and overlaps, but DeepCell achieves superior accuracy and detection rates. Results highlight the potential of generalisable deep models for cytology, with future work integrating DeepCell into diagnostic pipelines and broader datasets.

Paper Nr: 49
Title:

High Semantic Features for the Continual Learning of Complex Emotions: A Lightweight Solution

Authors:

Thibault Geoffroy, Gauthier Gerspacher and Lionel Prevost

Abstract: Incremental learning is a complex process due to the potential catastrophic forgetting of old tasks when learning new ones. This is mainly caused by transient features that do not transfer from task to task. In this paper, we focus on complex emotion recognition. First, we learn basic emotions and then, incrementally, like humans, complex emotions. We show that Action Units, which describe facial muscle movements, are non-transient, highly semantic features that outperform those extracted by both shallow and deep convolutional neural networks. Thanks to this property, our approach achieves competitive results when incrementally learning complex, compound emotions, with an accuracy of 0.75 on the CFEE dataset, and compares favorably with state-of-the-art results. Moreover, it results in a lightweight model with a small memory footprint.

Paper Nr: 62
Title:

An Automated Deep Learning Pipeline for Panel Segmentation, Defect Detection, and 3D Reconstruction in Impact Test Analysis

Authors:

Uğur Can Karaca and Toygar Akgün

Abstract: This paper presents an automated pipeline for impact test panel analysis that combines deep learning-based segmentation with 3D reconstruction. A two-stage convolutional neural network is used to segment metallic panels and detect defects within them. On the methodological side, we investigate enhanced U-Net-based architectures that incorporate attention mechanisms, deep supervision, and a learnable skip-connection gate. Deep supervision strengthens gradient flow in intermediate layers and improves small-object segmentation, while attention gates refine feature selection. The proposed Learning Gate (LG) mechanism replaces skip-connection concatenation with an adaptive pixel-wise fusion, reducing parameters and improving robustness across object scales. Combined, deep supervision and LG improve the baseline U-Net by +2.39 IoU points and up to +3.6 AP points depending on defect size. On the system side, segmented panels are registered in 3D using prior knowledge of panel orientations, and detected defects are projected onto a virtual representation of the test object, enabling reconstruction of impact and fragmentation patterns. The resulting workflow reduces manual effort, improves reproducibility, and provides a scalable, data-driven approach for structural testing and defect analysis.

Paper Nr: 91
Title:

From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets

Authors:

Sarina Penquitt, Jonathan Klees, Rinor Cakaj, Daniel Kondermann, Matthias Rottmann and Lars Schmarje

Abstract: Object detection has advanced rapidly in recent years, driven by increasingly large and diverse datasets. However, label errors often compromise the quality of these datasets and affect the outcomes of training and benchmark evaluations. Although label error detection methods for object detection datasets now exist, they are typically validated only on synthetic benchmarks or via limited manual inspection. How to correct such errors systematically and at scale remains an open problem. We introduce a semi-automated framework for label error correction called REC✓D (Rechecked). Building on existing label error detection methods, REC✓D reviews their error proposals with lightweight, crowdsourced microtasks. We apply REC✓D to the class pedestrian in the KITTI dataset, for which we crowdsourced high-quality corrected annotations. We detect 18% of missing and inaccurate labels in the original ground truth. We show that current label error detection methods, when combined with our correction framework, can recover hundreds of errors with little human effort compared to annotation from scratch. However, even the best methods still miss up to 66% of the label errors, which motivates further research, now enabled by our released benchmark.

Paper Nr: 117
Title:

When Surgery Meets the Unknown: Uncertainty-Aware Open-Set Recognition for Surgery Phase Classification

Authors:

Stefan Geyer, Vicky Kalogeiton and Alina Roitberg

Abstract: Open-set recognition aims to correctly classify known concepts while simultaneously detecting unknown ones. While extensively studied in image and video classification, it remains underexplored in surgical phase recognition and medical data in general. Yet, this task is critical in surgery due to its high-risk nature and limited availability of annotated surgical training data. We address Open-set Surgical Phase Recognition (OSPR) by adapting recent OSR advances to surgery, extending two cholecystectomy datasets with new evaluation regimes, and benchmarking eight algorithms, including classical and state-of-the-art methods. Since many rely on uncertainty estimation, we systematically assess multiple uncertainty methods, revealing complex phase interactions and varying challenge levels. We further introduce GEAR (Gaussian Residual Evidential Aleatoric Regression), which enhances evidential deep learning with input-dependent aleatoric uncertainty optimization, achieving state-of-the-art performance in identifying unknown surgical phases. Our code and weights will be made publicly available.

Paper Nr: 128
Title:

Microstructural Classification of Non-Oriented Electrical Steels: A Machine Learning and Computer Vision-Based Approach

Authors:

Leonardo Adriano Vasconcelos de Oliveira, José Daniel de Alencar Santos, Francisco Nélio Costa Freitas, Pedro Pedrosa Rebouças Filho, Hamilton Ferreira Gomes de Abreu and Luis Flávio Gaspar Herculano

Abstract: This study proposes a novel approach to assist in classifying the microstructural states of non-oriented electrical steels (NOES) from optical microscopy images, leveraging machine learning, synthetic data generation via SinDiffusion, and classical computer vision techniques. The methodology enables the discrimination between deformed, partially recrystallized, and fully recrystallized microstructures, overcoming limitations of traditional approaches regarding subjectivity and limited sample availability. Among the classical computer vision methods, Local Binary Pattern (LBP) combined with an RBF-kernel SVM achieved the highest accuracy (85.67%). Among the deep learning–based methods, DINOv2 with a linear SVM yielded the best performance (87.25%). The results demonstrate the feasibility of the proposed methodology to support the microstructural analysis of NOES using optical microscopy, particularly in scenarios characterized by scarce data and imbalanced class distributions.

Paper Nr: 142
Title:

ReAL-YOLO: Reinforcement-Driven Active Learning for Training YOLO Object Detectors

Authors:

Gabriel Souto Ferrante and Priscila Tiemi Maeda Saito

Abstract: Training object detection models, such as those in the YOLO family, traditionally requires large labeled datasets, which significantly increases training time. The training speed is strongly influenced by both the dataset size and the parameters of the neural architecture. This work proposes a new training pipeline that integrates deep active learning with reinforcement learning agents for image selection. It represents one of the first studies to combine YOLOv11 with reinforcement agents in the context of active learning, whereas most previous works have focused on detectors such as Faster R-CNN, RetinaNet, and other two-stage architectures. The objective is to optimize detector performance while reducing the amount of required data by prioritizing the most informative samples. Two initial selection strategies were compared for constructing the training set: random sampling and cluster-based sampling. The results demonstrate that the proposed approach can substantially reduce training data requirements while maintaining competitive accuracy – achieving only about 10% lower performance than training with the full dataset, yet with a significant reduction in data labeling effort. Furthermore, even though agents trained using the PPO strategy exhibited some instability, they successfully learned effective and adaptive image selection policies, highlighting the potential of reinforcement-driven active learning to improve data efficiency in object detection tasks.

Paper Nr: 147
Title:

ASRDNet: A New Image-Segmentation Neural Network Model for Detecting Asian Rust on Soybean Leaves

Authors:

Paulo Henrique Bueno Lopes and Leandro A. F. Fernandes

Abstract: Asian Soybean Rust (ASR), caused by the fungus Phakopsora pachyrhizi, is one of the biggest threats to global soybean production, necessitating early and accurate detection to mitigate significant economic losses. While deep learning models for image segmentation have shown promise, state-of-the-art architectures are computationally intensive, with large model sizes and a high number of trainable parameters, limiting their practical application in resource-constrained environments such as on-farm mobile devices. This paper proposes ASRDNet, a novel, lightweight convolutional neural network (CNN) for efficiently segmenting ASR lesions on soybean leaves. The core contribution of ASRDNet is its multi-resolution architecture, which features a multi-path encoder that effectively extracts features from lesions of varying sizes without incurring the computational overhead of conventional models. We validate our approach on a dataset of high-resolution soybean leaf images, annotated by expert phytopathologists. Experimental results show that ASRDNet performs statistically on par with or superior to classic segmentation networks (U-Net, SegNet, and DeepLabv3), while being significantly more efficient. ASRDNet reduces the number of trainable parameters by up to 76% compared to DeepLabv3, 52% compared to SegNet, and 47% compared to U-Net. With data augmentation, ASRDNet achieved top performance metrics (F1-score of 0.9251 and IoU of 0.9283), demonstrating its potential as a robust and practical solution for in-field disease diagnostics.

Paper Nr: 148
Title:

Class-Based Adaptive Training for Facial Expression Recognition Using a Deep Convolutional Network

Authors:

Brian Luís Coimbra Maia, Lucas Silva Santana, Gabriel Rezende da Silva, Mathews Edwirds Gomes Almeida, Paulo Victor de Magalhães Rozatto, Luiz Maurílio da Silva Maciel, Saulo Moraes Villela, Marcelo Bernardes Vieira and Bruno Campos de Carvalho

Abstract: Analyzing the suitability of a dataset for the supervised learning process is a challenging task. Issues related to class imbalance and/or ambiguity make this problem even more challenging. This paper presents Class-Based Adaptive Training (CBAT), a method that improves training under these conditions. We propose to compensate for the difficulty in correctly classifying certain classes in a dataset by using variable weighting throughout training. The adaptive weight distributions are calculated using the confusion matrices generated over the validation set at specific epochs. To evaluate our method, we worked with the FER2013 dataset for facial expression classification, which contains imbalance and ambiguities. We also evaluated the combination of models in an ensemble using the multi-objective Non-dominated Sorting Genetic Algorithm II (NSGA-II). The results show an increase in the accuracy of the challenging classes while keeping competitive overall accuracy. Hence, the results show that our method contributes to the learning process and effectively improves the accuracy of classes containing hard-to-classify samples.

Paper Nr: 151
Title:

Computer Vision-Assisted Literacy: Recognizing Children’s Handwritten Words

Authors:

Simone Bello Kaminski Aires, Maria Eduarda Guedes Pinto Gianisella, Simone Nasser Matos, Marcos Vinicius Santos Passos and Thales Janisch Santos

Abstract: Recognizing children’s handwritten words is challenging due to high intra- and inter-writer variability, stroke inconsistency, and incomplete character formation, difficulties that are amplified for learners with intellectual disabilities, who often face fine-motor constraints. We developed and evaluated convolutional neural networks (CNNs) for Portuguese children’s word recognition, aiming at inclusive educational and assistive uses. An offline dataset of 3,476 images (ages 6-10) spanning 84 word classes is used to compare VGG-16 and ResNet-50 under four regimes: baseline, data augmentation (DA), transfer learning (TL) from IAM, and TL+DA. Test evaluation employs accuracy and Word Error Rate (WER). Baselines underperform (e.g., VGG-16 2.30% accuracy), evidencing data scarcity and domain variability; DA markedly improves generalization (VGG-16 76.97%, ResNet-50 77.16%); TL yields further gains (VGG-16 90.40%, ResNet-50 84.07%); and TL+DA achieves the best results: 96.55% accuracy (WER 3.45%) for VGG-16 and 95.39% (WER 4.61%) for ResNet-50. Class-wise analyses indicate robustness, with residual errors concentrated in visually/morphologically similar words. We additionally report results on the IRONOFF database under identical regimes, where VGG-16 (TL+DA) attains 97.14% accuracy (WER 2.86%), corroborating robust generalization beyond the target domain. Results show that coupling TL with targeted DA enables accurate recognition of high-variability handwriting, supporting literacy assessment, educational games, and assistive technologies for both early-stage learners and users with intellectual disabilities.

Paper Nr: 152
Title:

Palm Trees Meets Deep Learning: A Real-Time Detection of Boufaroua with YOLOv8n and Embedded Optics

Authors:

Amal Ben Amor, Sonia Mosbah and Jihene Hlel

Abstract: Date palms are highly susceptible to several species of date palm mite, which affect date production. The most destructive of these pests is Boufaroua, Oligonychus afrasiaticus (O. afrasiaticus), a microscopic mite that colonizes palm trees and feeds on their fruit sap. To fight such threats, farmers often rely on chemicals that can eradicate all microscopic organisms within the palm tree. However, such protective measures pose an imminent danger to these trees, which are exposed to excessive chemical applications. In response to the absence of artificial intelligence–driven solutions for the detection of O. afrasiaticus, this study proposes a new deep learning–based framework for the automated identification of this mite using microscopic imaging. The proposed system uses a microscopic camera to acquire images from multiple orientations and anatomical regions of the palm tree, enabling robust detection under real-world variability. Real-time inference is achieved through the integration of YOLOv8n, a model trained on a curated dataset of microscopic images of O. afrasiaticus. Experimental results show that YOLOv8n delivers strong detection performance, achieving a precision of 88.2%, a recall of 87.2%, a mean Average Precision (mAP@0.5) of 94.1%, and an mAP@0.5:0.95 of 80.4%.

Paper Nr: 178
Title:

Hepatic Vessel Segmentation: A Narrative Review

Authors:

Nassim Kaddouri, Afifa Dahmane and Hadja Faiza Haned Khellaf

Abstract: Accurate delineation of the hepatic and portal venous trees underpins surgical planning and 3D reconstruction, yet tiny tortuous branches, phase-dependent contrast, motion/metal artefacts, and severe class imbalance make segmentation difficult. This survey reviews four families of methods: (i) classical pipelines, (ii) traditional ML with engineered vascular features/graphical models, (iii) modern 3D CNN baselines (U-Net variants with residual/multi-scale/attention), and (iv) hybrid/advanced designs injecting topology, centerlines, or long-range context (graph/CRF/metric learning, sequence/transformer, diffusion with graph priors). We consolidate key datasets (MSD-08, 3D-IRCADb, LiVS, LIRCAD, VSNet re-annotations) and emphasize evaluation beyond Dice (clDice, Surface Dice, HD95, connectivity). We conclude with limitations and perspectives on HV/PV-specific supervision and on protocol variance, toward clinically grounded progress.

Paper Nr: 237
Title:

ASO PatchCore: Memory-Efficient and Fast Anomaly Detection via Automatic Sampling Optimization

Authors:

Satoshi Kamiya, Kota Yamashita and Kazuhiro Hotta

Abstract: In manufacturing anomaly detection, where abnormal samples are scarce, methods that can be established solely from normal data are crucial. PatchCore, a representative approach, computes anomaly scores based on the nearest-neighbor distance between features of test images and a coreset of representative normal features constructed via greedy sampling. However, both the original paper and the official implementation empirically fix the sampling rate at 10% to ensure sufficient feature coverage. Excessive sampling increases the coreset size, leading to higher memory usage and slower inference, whereas an extremely low rate degrades accuracy. To balance accuracy, memory, and speed, we propose ASO PatchCore, which incorporates our newly proposed Automatic Sampling Optimization (ASO) algorithm. ASO normalizes the representative distance curve obtained during greedy sampling and selects the point on the curve that minimizes the Euclidean distance to the origin. The ASO algorithm provides a simple and reproducible decision rule that preserves feature coverage while suppressing coreset size. Experiments on MVTec AD, VisA, and MVTec AD2 demonstrated that ASO PatchCore achieves memory reductions of 86.5%, 88.6%, and 85.1%, with 3.9×, 6.3×, and 4.1× faster inference, respectively, while maintaining comparable I-AUROC, P-AUROC, and PRO.

Paper Nr: 265
Title:

Modeling Student Engagement in the Wild: Analysis from Classroom Video

Authors:

Uroš Petković, Jonas Frenkel, Rebecca Lazarides and Olaf Hellwich

Abstract: This work analyzes student engagement in authentic secondary-school classrooms using fine-grained behavioral annotations for 268 students across 46 classes and roughly 20 hours of video. Engagement was annotated continuously at 4 Hz, providing dense supervision for short video segments and enabling a detailed examination of moment-to-moment behavioral variation in real instructional settings. We evaluate a multiscale video transformer (MViTv2-S, pretrained on Kinetics-400) and obtain a mean Pearson correlation of r=0.53 under 5-fold subject-independent cross-validation, with model–human agreement of ICC(2,2)=0.67 compared to an inter-rater ICC(2,2)=0.85. Temporal subsampling experiments show that clip durations of about five seconds yield the most reliable predictions, and an architectural comparison demonstrates that MViTv2-S substantially outperforms a standard 3D CNN (R3D-18; mean r=0.39). These findings highlight the difficulty of engagement estimation in in-the-wild classroom video and point toward future work integrating broader contextual cues and richer behavioral information.

Paper Nr: 274
Title:

Classification of Normal versus Leukemic Cells with Swin Transformer and Balanced Data Augmentation

Authors:

Douglas Costa Braga, Albert de Jesus Souza, Samuel Bastos Borges Pinho and Daniel Oliveira Dantas

Abstract: Acute lymphoblastic leukemia (ALL) diagnosis still relies heavily on the manual microscopic examination of stained blood or bone marrow smears, a process that is inherently subjective and time-consuming. To address these limitations, we present an automated classification framework built upon the Swin Transformer Tiny architecture, designed to distinguish leukemic from normal lymphocytes. The hierarchical attention mechanism of the model enables the capture of long-range spatial relationships. To counter dataset imbalance, the training pipeline incorporates geometric data augmentation and a weighted random sampling strategy. Experiments conducted on the C-NMC 2019 dataset employed transfer learning and fine-tuning to achieve optimal performance. The proposed approach achieved an F1-score of 99.44%, outperforming previously reported CNN-based and handcrafted feature methods. Statistical robustness was confirmed via 30-run Monte Carlo validation (p < 0.001). These results highlight the potential of attention-based transformer architectures to deliver accurate and computationally efficient diagnostic tools suitable for clinical deployment in leukemia detection.

Paper Nr: 327
Title:

An Ensemble Approach to Climate Misinformation Detection

Authors:

Lei Lei, Marc Lalonde, Hamed Ghodrati and Azur Handan

Abstract: The proliferation of climate-related misinformation on social media, particularly through multimodal content (text-image pairs), poses a significant threat to public understanding and policy action. While recent advancements in Vision-Language Models (VLMs) offer new detection capabilities, individual models often suffer from limitations such as specific reasoning modes, static knowledge boundaries, or prompt-induced biases. To address these challenges, this paper proposes a robust ensemble framework for verifying climate claims. Our approach integrates four distinct classifiers (three VLM-based reasoning agents and one retrieval-augmented system) to leverage their complementary strengths. By aggregating the predictions of these base learners through a majority voting mechanism, the system mitigates the specific weaknesses of individual detectors. We evaluated the proposed framework on a subset of 500 image-claim pairs from the CliME dataset, employing a rigorous GPT-4o-based labeling pipeline aided by expert-validated descriptions. Crucially, these descriptions were withheld during inference to simulate realistic fact-checking scenarios where such contextual metadata is unavailable. Experimental results demonstrate that while the VLM-based detector achieves the highest performance among individual models, the ensemble method yields superior Accuracy, Recall, and F1-scores. These findings highlight the effectiveness of ensemble strategies in creating more reliable and resilient automated detection systems for complex climate misinformation narratives.

Paper Nr: 335
Title:

Efficient Transformer-Based Spatio-Temporal Action Recognition for Industrial Safety Surveillance

Authors:

Ruslan Zaripov, Anastasiya Shpileva, Maksim Koltakov, Georgii Petrov, Viacheslav Shalamov and Valeria Efimova

Abstract: Action recognition in video is essential for understanding human behavior and ensuring safety in industrial environments. This paper presents ActionFormer, a lightweight multimodal transformer architecture for spatiotemporal action recognition and localization in surveillance video. The model integrates skeletal representations and segmentation masks to capture both human motion and relevant object interactions, improving robustness under challenging industrial conditions such as low resolution, occlusion, and variable lighting. We also present action recognition methods built upon VideoMAE and Hiera backbones for efficient spatiotemporal feature extraction. To support the study, we collected a custom dataset of safety-relevant actions (including smoking, drinking, eating, and phone usage) recorded under real surveillance conditions and augmented with synthetic data to balance classes. Experimental evaluation shows that ActionFormer achieves an average mAP@50 of 0.78, outperforming Hiera (0.56) and YOLOv11 + VideoMAE (0.45), while maintaining an inference speed of approximately 1.8 seconds per clip. These results demonstrate that transformer-based encoders and multimodal fusion yield accurate and efficient models suitable for real-world industrial safety monitoring.

Paper Nr: 336
Title:

A Deterministic Edge-AI System for Early Wildfire Smoke Detection: From Lightweight Neural Models to Operationally Reliable Surveillance

Authors:

Damian Kmiecik and Adrian Dziembowski

Abstract: Wildfire mitigation depends critically on minimizing Time-to-Detect (TTD). While lightweight convolutional neural networks (CNNs) enable visual detection of early-stage wildfire smoke, deploying them in practical, autonomous off-grid monitoring towers remains a system engineering challenge due to strict power and hardware constraints. This paper presents a proof-of-concept (PoC) implementation of a deterministic Edge-AI architecture designed for reliable wildfire surveillance. The proposed approach introduces an application-layer scheduler that guarantees bounded system latency, as well as a hybrid approach connecting neural network output with heuristic spatio-temporal logic that reduces false positives. Preliminary experimental results demonstrate that the proposed system achieves an F1-Score of 92.5% while maintaining stable latency for up to nine concurrent streams, proving that multi-camera wildfire detection is achievable in practice, even on low-power edge hardware.

Paper Nr: 339
Title:

IoMB: Evaluating Object Detectors on Occluded and Imbalanced Seabird Populations

Authors:

Charis Hanna, Kasim Terzić, Mark James and Karen Spencer

Abstract: Computer vision techniques are seeing increasing use in wildlife monitoring surveys but challenges remain when applying standard techniques to complex, real-world data. To gauge the state-of-the-art on this problem, we present a challenging new dataset specifically designed to reflect the challenges of monitoring cliff-nesting seabird species. Unlike the widely used benchmarks of CUB (Caltech-UCSD Birds-200-2011) (Wah et al., 2011) and NAB (North America Birds) (Horn et al., 2015), our dataset captures a wider range of relative bird sizes, a high degree of class imbalance and significant levels of occlusion. We evaluate the performance of state-of-the-art object detectors YOLO (Khanam and Hussain, 2024), RetinaNet (Lin et al., 2020), Faster R-CNN (Ren et al., 2015) and DINO (Zhang et al., 2022) on this dataset and show that these standard detectors struggle with occlusion, limited species representation, and large size variation. Although pre-training on larger datasets provides some improvement, it cannot fully address these challenges. Our study highlights the need for the development of new techniques tailored to these specific issues in wildlife monitoring.