VISAPP 2025 Abstracts


Area 1 - Image and Video Processing and Analysis

Full Papers
Paper Nr: 19
Title:

Self-Supervised Partial Cycle-Consistency for Multi-View Matching

Authors:

Fedor Taggenbrock, Gertjan Burghouts and Ronald Poppe

Abstract: Matching objects across partially overlapping camera views is crucial in multi-camera systems and requires a view-invariant feature extraction network. Training such a network with cycle-consistency circumvents the need for labor-intensive labeling. In this paper, we extend the mathematical formulation of cycle-consistency to handle partial overlap. We then introduce a pseudo-mask which directs the training loss to take partial overlap into account. We additionally present several new cycle variants that complement each other and present a time-divergent scene sampling scheme that improves the data input for this self-supervised setting. Cross-camera matching experiments on the challenging DIVOTrack dataset show the merits of our approach. Compared to the self-supervised state-of-the-art, we achieve a 4.3 percentage point higher F1 score with our combined contributions. Our improvements are robust to reduced overlap in the training data, with substantial gains in challenging scenes where only a few matches must be made between many people. Self-supervised feature networks trained with our method are effective at matching objects in a range of multi-camera settings, providing opportunities for complex tasks like large-scale multi-camera scene understanding.
Download

Paper Nr: 57
Title:

Targeted Test Time Adaptation of Memory Networks for Video Object Segmentation

Authors:

Isidore Dubuisson, Damien Muselet, Christophe Ducottet and Jochen Lang

Abstract: Semi-Automatic Video Object Segmentation (SVOS) aims to segment a few objects in a video based on the annotation of these particular objects in the first frame only. State-of-the-art methods rely on offline training on a large dataset that may lack specific samples and details directly applicable to the current test video. Common solutions are to use test-time adaptation to finetune the offline model with the single annotated frame or to rely on complex semi-supervised strategies. In this paper, we introduce targeted test-time adaptation of memory-based SVOS, providing the benefits of finetuning with a much smaller learning effort. Our method targets specific parts of the model to ensure improved results while maintaining the robustness of the offline training. We find that targeting the bottleneck features and the masks that are saved in memory provides substantial benefits. The evaluation of our method shows a significant improvement for video segmentation on the DAVIS16 and DAVIS17 datasets.
Download

Paper Nr: 98
Title:

Polygonizing Roof Segments from High-Resolution Aerial Images Using YOLOv8-Based Edge Detection

Authors:

Qipeng Mei, Dimitri Bulatov and Dorota Iwaszczuk

Abstract: This study presents a novel approach for roof detail extraction and vectorization using remote sensing images. Unlike previous geometric-primitive-based methods that rely on the detection of corners, our method focuses on edge detection as the primary mechanism for roof reconstruction, while utilizing geometric relationships to define corners and faces. We adapt the YOLOv8 OBB model, originally designed for rotated object detection, to extract roof edges effectively. Our method demonstrates robustness against noise and occlusion, leading to precise vectorized representations of building roofs. Experiments conducted on the SGA and Melville datasets highlight the method’s effectiveness. At the raster level, our model outperforms the state-of-the-art foundation segmentation model (SAM), achieving a mIoU between 0.85 and 1 for most samples and an ovIoU close to 0.97. At the vector level, evaluation using the Hausdorff distance, PolyS metric, and our raster-vector metric demonstrates significant improvements after polygonization, with a close approximation to the reference data. The method successfully handles diverse roof structures and refines edge gaps, even on complex roof structures from new datasets excluded from training. Our findings underscore the potential of this approach to address challenges in automatic roof structure vectorization, supporting various applications such as urban terrain reconstruction.
Download

Paper Nr: 104
Title:

STEP: SuperToken and Early-Pruning for Efficient Semantic Segmentation

Authors:

Mathilde Proust, Martyna Poreba, Michal Szczepanski and Karim Haroun

Abstract: Vision Transformers (ViTs) achieve state-of-the-art accuracy in numerous vision tasks, but their heavy computational and memory requirements pose significant challenges. Minimising token-related computations is critical to alleviating this computational burden. This paper introduces a novel SuperToken and Early-Pruning (STEP) approach that combines patch merging with an early-pruning mechanism to optimize token handling in ViTs for semantic segmentation. The improved patch merging method is developed to effectively address the diverse complexities of images. It features a dynamic and adaptive system, dCTS, which employs a CNN-based policy network to determine the quantity and size of patch groups that share the same supertoken during inference. With a flexible merging strategy, it handles superpatches of varying sizes: 2×2, 4×4, 8×8, and 16×16. Early in the network, high-confidence tokens are pruned and excluded from subsequent processing stages. This hybrid approach reduces both computational and memory requirements without significantly compromising segmentation accuracy. It is shown through experimental results that, on average, 40% of tokens can be predicted from the 16th layer onwards when using ViT-Large as the backbone. Additionally, a reduction of up to 3× in computational complexity is achieved, with a maximum drop in accuracy of 2.5%.
Download

Paper Nr: 106
Title:

Dawn: A Robust Tone Mapping Operator for Multi-Illuminant and Low-Light Scenarios

Authors:

Furkan Kınlı, Barış Özcan and Furkan Kıraç

Abstract: We introduce Dawn, a novel Tone Mapping Operator (TMO) designed to address the limitations of state-of-the-art TMOs such as Flash and Storm, particularly in challenging lighting conditions. While existing methods perform well in stable, well-lit, single-illuminant environments, they struggle with multi-illuminant and low-light scenarios, often leading to artifacts, amplified noise, and color shifts due to the additional step to adjust overall scene brightness. Dawn solves these issues by adaptively inferring the scaling parameter for the Naka-Rushton Equation through a weighted combination of luminance mean and variance. This dynamic approach allows Dawn to handle varying illuminant conditions, reducing artifacts and improving image quality without requiring additional adjustments to scene brightness. Our experiments show that Dawn matches the performance of current state-of-the-art TMOs on HDR datasets and outperforms them in low-light conditions, providing superior visual results. The source code for Dawn will be available at https://github.com/birdortyedi/dawn-tmo/.
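To make the scaling step concrete, the following is an illustrative Python sketch of an adaptive Naka-Rushton tone curve. It is not the authors' implementation; the combination weights and luminance coefficients are placeholder assumptions.

    import numpy as np

    def adaptive_naka_rushton(hdr_rgb, w_mean=1.0, w_var=0.5, eps=1e-8):
        # Relative luminance of the HDR input (Rec. 709 coefficients).
        lum = (0.2126 * hdr_rgb[..., 0] +
               0.7152 * hdr_rgb[..., 1] +
               0.0722 * hdr_rgb[..., 2])
        # Scaling parameter inferred from a weighted combination of the
        # luminance mean and variance (weights are placeholders).
        sigma = w_mean * lum.mean() + w_var * lum.var()
        # Naka-Rushton equation applied to luminance.
        lum_out = lum / (lum + sigma + eps)
        # Rescale color channels by the luminance ratio to keep chromaticity.
        ratio = lum_out / (lum + eps)
        return np.clip(hdr_rgb * ratio[..., None], 0.0, 1.0)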
Download

Paper Nr: 118
Title:

Separation of Insect Trajectories in Dynamic Vision Sensor Data

Authors:

Juliane Arning, Christoph Dalitz and Regina Pohle-Fröhlich

Abstract: We present a method for separating insect flight trajectories in dynamic vision sensor data and describing them mathematically with smooth curves. The method consists of four steps: pre-processing, clustering, post-processing, and curve fitting. As the time and space coordinates use different scales, we have rescaled the dimensions with data-based scale factors. For clustering, we have compared DBSCAN and MST-based clustering, and both suffered from undersegmentation. A suitable post-processing step was introduced to fix this. Curve fitting was done with a non-parametric LOWESS smoother. The runtime of our method is sufficiently fast for it to be applied in real-time insect monitoring. The data used for evaluation only had two spatial dimensions, but the method can be applied to data with three spatial dimensions, too.
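As a rough sketch of the pipeline described above (rescaling, DBSCAN clustering, LOWESS fitting), not the authors' code, and with placeholder scale factors and clustering parameters:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def separate_trajectories(events, time_scale=1e-3, eps=5.0,
                              min_samples=10, frac=0.2):
        # events: (N, 3) array with columns (x, y, t) from the event camera.
        # Rescale time so that space and time use comparable units.
        scaled = events * np.array([1.0, 1.0, time_scale])
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
        curves = {}
        for label in set(labels) - {-1}:  # -1 marks DBSCAN noise events
            traj = events[labels == label]
            t = traj[:, 2]
            # Non-parametric LOWESS smoothing of x(t) and y(t).
            curves[label] = (lowess(traj[:, 0], t, frac=frac),
                             lowess(traj[:, 1], t, frac=frac))
        return labels, curves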
Download

Paper Nr: 130
Title:

Joint Calibration of Cameras and Projectors for Multiview Phase Measuring Profilometry

Authors:

Hyeongjun Cho and Min H. Kim

Abstract: Existing camera-projector calibration for phase-measuring profilometry (PMP) is valid for only a single view. To extend a single-view PMP to a multiview system, an additional calibration, such as Zhang’s method, is necessary. In addition to calibrating phase-to-height relationships for each view, calibrating parameters of multiple cameras, lenses, and projectors by rotating a target is indeed cumbersome and often fails with the local optima of calibration solutions. In this work, to make multiview PMP calibration more convenient and reliable, we propose a joint calibration method by combining these two calibration modalities of phase-measuring profilometry and multiview geometry with high accuracy. To this end, we devise (1) a novel compact, static calibration target with planar surfaces of different orientations with fiducial markers and (2) a joint multiview optimization scheme of the projectors and the cameras, handling nonlinear lens distortion. First, we automatically detect the markers to estimate plane equation parameters of different surface orientations. We then solve homography matrices of multiple planes through target-aware bundle adjustment. Given unwrapped phase measurement, we calibrate intrinsic/extrinsic/lens-distortion parameters of every camera and projector without requiring any manual interaction with the calibration target. Only one static scene is required for calibration. Results validate that our calibration method enables us to combine multiview PMP measurements with high accuracy.
Download

Paper Nr: 131
Title:

Editing Scene Illumination and Material Appearance of Light-Field Images

Authors:

Jaemin Cho, Dongyoung Choi, Dahyun Kang, Gun Bang and Min H. Kim

Abstract: In this paper, we propose a method for editing the scene appearance of light-field images. Our method enables users to manipulate the illumination and material properties of scenes captured in light-field format, offering versatile control over image appearance, including dynamic relighting and material appearance modification, which leverages our specially designed inverse rendering framework for light-field images. By effectively separating light fields into appearance parameters, such as diffuse albedo, normal, specular intensity, and roughness, within a multi-plane image domain, we overcome the traditional challenges of light-field imaging decomposition. These challenges include handling front-parallel views and a limited image count, which have previously hindered neural inverse rendering networks when applying them to light-field image data. Our method also approximates environmental illumination using spherical Gaussians, significantly enhancing the realism of scene reflectance. Furthermore, by differentiating scene illumination into far-bound and near-bound light environments, our method enables highly realistic editing of scene appearance and illumination, especially for local illumination effects. This differentiation allows for efficient, real-time relighting rendering and integrates seamlessly with existing layered light-field rendering frameworks. Our method demonstrates these rendering capabilities on casually captured light-field images.
Download

Paper Nr: 143
Title:

Patch-Based Deep Unsupervised Image Segmentation Using Graph Cuts

Authors:

Isaac Wasserman and Jeová Farias Sales Rocha Neto

Abstract: Unsupervised image segmentation seeks to group semantic patterns in an image without the use of human annotation. Similarly, image clustering searches for groupings of images based on their semantic content. Traditionally, both problems have drawn from sound mathematical concepts to produce concrete applications. With the emergence of deep learning, the scientific community turned its attention to complex neural network-based solvers that achieved impressive results in those domains but rarely leveraged the advances made by classical methods. In this work, we propose a patch-based unsupervised image segmentation strategy that uses the algorithmic strength of classical graph-based methods to enhance unsupervised feature extraction from deep clustering. We show that a simple convolutional neural network, trained to classify image patches and iteratively regularized using graph cuts, can be transformed into a state-of-the-art, fully-convolutional, unsupervised, pixel-level segmenter. Furthermore, we demonstrate that this is the ideal setting for leveraging the patch-level pairwise features generated by vision transformer models. Our results on real image data demonstrate the effectiveness of our proposed methodology.
Download

Paper Nr: 144
Title:

Obstacle Detection and Ship Recognition System for Unmanned Surface Vehicles

Authors:

Sevda Sayan and Hazım Kemal Ekenel

Abstract: This study investigates obstacle detection and ship classification via cameras to ensure safe navigation for Unmanned Surface Vehicles. A two-stage approach was employed to achieve these goals. In the first stage, the focus was on detecting ships, humans, and other obstacles in maritime environments. Models based on the You Only Look Once architecture, specifically YOLOv5 and its variant TPH-YOLOv5, which is specialized for detecting small objects, were optimized using the MODS dataset. This dataset contains labeled images of dynamic obstacles, such as ships and humans, and static obstacles, e.g., buoys. TPH-YOLOv5 performed well in detecting small objects, crucial for collision avoidance in Unmanned Surface Vehicles. In the second stage, the study addressed the ship classification problem, using the MARVEL dataset, which contains over two million images across 26 ship subtypes. A comparative analysis was conducted between Convolutional Neural Networks and Vision Transformer based models. Among these, the Data-efficient Image Transformer achieved the highest classification accuracy of 92.87%, surpassing the previously reported state-of-the-art performance. In order to further analyze the classification results, this study introduced a generic method for generating attention heatmaps in vision transformer based models. Unlike related works, this method is applicable not only to the Vision Transformer but also to its variants. Additionally, pruning techniques were explored to improve the computational efficiency of the Data-efficient Image Transformer model, reducing inference times and moving closer to the speed required for real-time applications, though Convolutional Neural Networks remain faster for such tasks.
Download

Paper Nr: 145
Title:

A Multi-Criteria Approach for Gaze Analysis Similarity in Paintings

Authors:

Tess Masclef, Mihaela Scuturici, Tetiana Yemelianenko and Serge Miguet

Abstract: In the fields of art history and visual semiotics, analysing gazes in paintings is important to understand the artwork, and to find semantic relationships between several paintings. Thanks to digitization and museum initiatives, the volume of datasets on artworks continues to expand, enabling new avenues for exploration and research. Artificial neural networks trained on large datasets are able to extract complex features and visually compare artworks. This comparison could be done by focusing on the objects present in the paintings, and matching paintings with high object co-occurrence. Our research takes this further by studying the way objects are viewed by characters in the scene. This study proposes a new approach that combines methods for gaze-based and visual-based similarity, to encode and use gaze information for finding similar paintings, while maintaining a close visual aspect. Experimental results, which integrate the opinions of domain experts, show that these methods complement each other. Quantitative and qualitative assessments confirm the results from the combination of gaze and visual analysis. Thus, this method improves existing visual similarity queries and opens up new possibilities to retrieve similar paintings according to user-specific criteria.
Download

Paper Nr: 225
Title:

An Assessment of Shadow Generation by GAN with Depth Images on Non-Planar Backgrounds

Authors:

Kaito Toyama and Maki Sugimoto

Abstract: We propose the use of a Generative Adversarial Network (GAN) with depth images to generate shadows for virtual objects in mixed reality environments. This approach improves the accuracy of the shadow generation process by aligning shadows with non-planar geometries. While traditional methods require detailed lighting and geometry data, recent research has emerged that generates shadows by learning from the image itself, even when such conditions are not fully known. However, these studies are limited to projecting shadows only onto the ground: a planar geometry. Our dataset used for training the GAN includes depth images, allowing natural shadow generation in complex environments.
Download

Paper Nr: 233
Title:

Towards Robust Multimodal Land Use Classification: A Convolutional Embedded Transformer

Authors:

Muhammad Zia Ur Rehman, Syed Mohammed Shamsul Islam, Anwaar UlHaq, David Blake and Naeem Janjua

Abstract: Multisource remote sensing data has gained significant attention in land use classification. However, effectively extracting both local and global features from various modalities and fusing them to leverage their complementary information remains a substantial challenge. In this paper, we address this by exploring the use of transformers for simultaneous local and global feature extraction while enabling cross-modality learning to improve the integration of complementary information from hyperspectral imaging (HSI) and LiDAR data modalities. We propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal convolutional transformer, which extracts both local and global features using a multi-scale convolutional embedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local and global features. Additionally, cross-modal learning is incorporated to effectively capture complementary information from HSI and LiDAR modalities. Evaluation on the Trento dataset highlights the effectiveness of the proposed approach, achieving an average accuracy of 99.04% and surpassing comparable methods.
Download

Paper Nr: 252
Title:

Learning Weakly Supervised Semantic Segmentation Through Cross-Supervision and Contrasting of Pixel-Level Pseudo-Labels

Authors:

Lucas David, Helio Pedrini and Zanoni Dias

Abstract: The quality of the pseudo-labels employed in training is paramount for many Weakly Supervised Semantic Segmentation techniques, which are often limited by their associated uncertainty. A common strategy found in the literature is to employ confidence thresholds to filter unreliable pixel labels, improving the overall quality of label information, but discarding a considerable amount of data. In this paper, we investigate the effectiveness of cross-supervision and contrastive learning of pixel-level pseudo-annotations in weakly supervised tasks, where only image-level annotations are available. We propose CSRM: a multi-branch deep convolutional network that leverages reliable pseudo-labels to learn to classify and segment a task in a mutual promotion scheme, while employing both reliable and unreliable pixel-level pseudo-labels to learn representations in a contrastive learning scheme. Our solution achieves 75.0% mIoU on the Pascal VOC 2012 test set and 50.4% mIoU on the MS COCO 2014 validation set. Code available at github.com/lucasdavid/wsss-csrm.
Download

Paper Nr: 302
Title:

Towards JPEG-Compression Invariance for Adversarial Optimization

Authors:

Amon Soares de Souza, Andreas Meißner and Michaela Geierhos

Abstract: Adversarial image processing attacks aim to strike a fine balance between pattern visibility and target model error. This balance ideally results in a sample that maintains high visual fidelity to the original image, but forces the model to output the target of the attack, and is therefore particularly susceptible to transformations by post-processing such as compression. JPEG compression, which is inherently non-differentiable and an integral part of almost every web application, therefore severely limits the set of possible use cases for attacks. Although differentiable JPEG approximations have been proposed, they (1) have not been extended to the stronger and less perceptible optimization-based attacks, and (2) have been insufficiently evaluated. Constrained adversarial optimization allows for a strong combination of success rate and high visual fidelity to the original sample. We present a novel robust attack based on constrained optimization and an adaptive compression search. We show that our attack outperforms current robust methods for gradient projection attacks for the same amount of applied perturbation, suggesting a more effective trade-off between perturbation and attack success rate. The code is available here: https://github.com/amonsoes/frcw.
Download

Paper Nr: 319
Title:

Application-Guided Image Fusion: A Path to Improve Results in High-Level Vision Tasks

Authors:

Gisel Bastidas-Guacho, Patricio Moreno-Vallejo, Boris Vintimilla and Angel D. Sappa

Abstract: This paper proposes an enhanced application-driven image fusion framework to improve final application results. This framework is based on a deep learning architecture that generates fused images to better align with the requirements of applications such as semantic segmentation and object detection. The color-based and edge-weighted correlation loss functions are introduced to ensure consistency in the YCbCr space and emphasize structural integrity in high-gradient regions, respectively. Together, these loss components allow the fused image to retain more features from the source images by producing an application-ready fused image. Experiments conducted on two public datasets demonstrate a significant improvement in mIoU achieved by the proposed approach compared to state-of-the-art methods.
Download

Paper Nr: 331
Title:

Can Bayesian Neural Networks Explicitly Model Input Uncertainty?

Authors:

Matias Valdenegro-Toro and Marco Zullich

Abstract: Inputs to machine learning models can have associated noise or uncertainties, but these are often ignored and not modelled. It is unknown whether Bayesian Neural Networks and their approximations are able to consider uncertainty in their inputs. In this paper, we build a two-input Bayesian Neural Network (taking the mean and standard deviation of the input) and evaluate its capabilities for input uncertainty estimation across different methods such as Ensembles, MC-Dropout, and Flipout. Our results indicate that only some uncertainty estimation methods for approximate Bayesian NNs can model input uncertainty, in particular Ensembles and Flipout.
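For illustration only (architecture and layer sizes are assumptions, not the paper's), a two-input network of this kind with MC-Dropout could look as follows:

    import torch
    import torch.nn as nn

    class TwoInputMCDropoutNet(nn.Module):
        # Receives both the input mean and its standard deviation.
        def __init__(self, in_dim, n_classes, p=0.2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * in_dim, 128), nn.ReLU(), nn.Dropout(p),
                nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p),
                nn.Linear(128, n_classes))

        def forward(self, x_mean, x_std):
            return self.net(torch.cat([x_mean, x_std], dim=-1))

    def predict_with_uncertainty(model, x_mean, x_std, n_samples=30):
        model.train()  # keep dropout active at test time (MC-Dropout)
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x_mean, x_std), dim=-1)
                                 for _ in range(n_samples)])
        return probs.mean(0), probs.std(0)  # predictive mean and spread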
Download

Paper Nr: 332
Title:

Exploring Local Graphs via Random Encoding for Texture Representation Learning

Authors:

Ricardo T. Fares, Luan B. Guerra and Lucas C. Ribas

Abstract: Despite many graph-based approaches being proposed to model textural patterns, they not only rely on a large number of parameters, culminating in a large search space, but also model a single, large graph for the entire image, which often overlooks fine-grained details. This paper proposes a new texture representation that utilizes a parameter-free micro-graph modeling, thereby addressing the aforementioned limitations. Specifically, for each image, we build multiple micro-graphs to model the textural patterns, and use a Randomized Neural Network (RNN) to randomly encode their topological information. Following this, the network’s learned weights are summarized through distinct statistical measures, such as mean and standard deviation, generating summarized feature vectors, which are combined to form our final texture representation. The effectiveness and robustness of our proposed approach for texture recognition was evaluated on four datasets: Outex, USPtex, Brodatz, and MBT, outperforming many literature methods. To assess the practical application of our method, we applied it to the challenging task of Brazilian plant species recognition, which requires microtexture characterization. The results demonstrate that our new approach is highly discriminative, indicating an important contribution to the texture analysis field.
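As an illustrative sketch of the randomized-encoding idea (an ELM-style randomized network; the micro-graph descriptors and prediction targets below are hypothetical placeholders, not the paper's features):

    import numpy as np

    def rnn_encode(graph_features, targets, q=20, seed=0):
        # graph_features: (N, d) topological descriptors of the micro-graphs;
        # targets: (N,) values the network is asked to predict.
        rng = np.random.default_rng(seed)
        W = rng.uniform(-1.0, 1.0, size=(graph_features.shape[1] + 1, q))
        X = np.hstack([np.ones((graph_features.shape[0], 1)), graph_features])
        H = np.tanh(X @ W)                       # fixed random hidden layer
        H = np.hstack([np.ones((H.shape[0], 1)), H])
        beta, *_ = np.linalg.lstsq(H, targets, rcond=None)  # learned weights
        # Summarize the learned output weights with simple statistics.
        return np.array([beta.mean(), beta.std()])

The final texture representation would then concatenate such summaries over several network configurations.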
Download

Paper Nr: 333
Title:

Volumetric Color-Texture Representation for Colorectal Polyp Classification in Histopathology Images

Authors:

Ricardo T. Fares and Lucas C. Ribas

Abstract: With the growth of real-world applications generating numerous images, analyzing color-texture information has become essential, especially when spectral information plays a key role. Recently, many randomized neural network texture-based approaches have been proposed to tackle color textures. However, they are integrative approaches or fail to achieve competitive processing time. To address these limitations, this paper proposes a single-parameter color-texture representation that captures both spatial and spectral patterns by sliding volumetric (3D) color cubes over the image and encoding them with a Randomized Autoencoder (RAE). The key idea of our approach is that simultaneously encoding both color and texture information allows the autoencoder to learn meaningful patterns to perform the decoding operation. Hence, we employ the flattened decoder’s learned weights as the representation. The proposed approach was evaluated on three color-texture benchmark datasets: USPtex, Outex, and MBT. We also assessed our approach on the challenging and important application of classifying colorectal polyps. The results show that the proposed approach surpasses many literature methods, including deep convolutional neural networks. Therefore, these findings indicate that our representation is discriminative, showing its potential for broader applications in histological images and pattern recognition tasks.
Download

Paper Nr: 338
Title:

Recognition-Oriented Low-Light Image Enhancement Based on Global and Pixelwise Optimization

Authors:

Seitaro Ono, Yuka Ogino, Takahiro Toizumi, Atsushi Ito and Masato Tsukada

Abstract: In this paper, we propose a novel low-light image enhancement method aimed at improving the performance of recognition models. Despite recent advances in deep learning, the recognition of images under low-light conditions remains a challenge. Although existing low-light image enhancement methods have been developed to improve image visibility for human vision, they do not specifically focus on enhancing recognition model performance. Our proposed low-light image enhancement method consists of two key modules: the Global Enhance Module, which adjusts the overall brightness and color balance of the input image, and the Pixelwise Adjustment Module, which refines image features at the pixel level. These modules are trained to enhance input images to improve downstream recognition model performance effectively. Notably, the proposed method can be applied as a frontend filter to improve low-light recognition performance without requiring retraining of downstream recognition models. Experimental results demonstrate that our method improves the performance of pretrained recognition models under low-light conditions, confirming its effectiveness.
Download

Paper Nr: 357
Title:

Enhancing Marine Habitats Detection: A Comparative Study of Semi-Supervised Learning Methods

Authors:

Rim Rahali, Thanh Phuong Nguyen and Vincent Nguyen

Abstract: Most of the recent success in applying deep learning techniques to object detection relies on large amounts of carefully annotated and large training data, whereas annotating underwater images is a costly process and providing a large dataset is not always affordable. In this paper, we conduct a comprehensive analysis of multiple semi-supervised learning models used for marine habitat detection, aiming to reduce the reliance on extensive labeled data while maintaining high accuracy in challenging underwater environments. Results on the Deepfish and UTDAC2020 datasets attest to the significant performance gains achieved by semi-supervised learning, in terms of both quantitative and qualitative evaluation. A further study of Underwater Image Enhancement (UIE) methods and contrastive learning is presented in this work to deal with the specificity of underwater images and to provide a more comprehensive analysis of their impact on marine habitat detection.
Download

Short Papers
Paper Nr: 25
Title:

D-LaMa: Depth Inpainting of Perspective-Occluded Environments

Authors:

Simone Müller, Willyam Sentosa, Daniel Kolb, Matthias Müller and Dieter Kranzlmüller

Abstract: Occlusion is a common problem in computer vision where backgrounds or objects are occluded by other objects in the foreground. Occlusion affects object recognition or tracking and influences scene understanding with the associated depth estimation and spatial perception. To solve the associated problems and improve the detection of areas, we propose a pre-trained image distortion model that allows us to incorporate new perspectives within previously rendered point clouds. We investigate approaches in synthetically generated use cases: Masking previously generated virtual images and depth images, removing and painting over a provided mask, and the removal of objects from the scene. Our experimental results allow us to gain valuable insights into fundamental problems of occlusion configurations and confirm the effectiveness of our approaches. Our research findings serve as a guide to applying our model to real-life scenarios and ultimately solve the occlusion problem.
Download

Paper Nr: 27
Title:

Synthetic Thermal Image Generation from Multi-Cue Input Data

Authors:

Patricia L. Suárez and Angel D. Sappa

Abstract: This paper presents a novel approach for generating synthetic thermal images using depth and edge maps derived from a given grayscale image. In this way, the network receives the fused image as input to generate a synthetic thermal representation. By training a generative model with the depth map fused with the corresponding edge representation, the model learns to generate realistic synthetic thermal images. A study on the correlation between different types of inputs shows that depth and edge maps are better correlated than grayscale images or other options generally used in state-of-the-art approaches. Experimental results demonstrate that the method outperforms the state of the art and produces better-quality synthetic thermal images with improved shape and sharpness. Improvements in results are attributed to the combined use of depth and edge maps together with the novel loss function terms proposed in the current work.
Download

Paper Nr: 39
Title:

RandSaliencyAug: Balancing Saliency-Based Data Augmentation for Enhanced Generalization

Authors:

Teerath Kumar, Alessandra Mileo and Malika Bendechache

Abstract: Improving model generalization in computer vision, especially with noisy or incomplete data, remains a significant challenge. One common solution is image augmentation through occlusion techniques like cutout, random erasing, hide-and-seek, and gridmask. These methods encourage models to focus on less critical information, enhancing robustness. However, they often obscure real objects completely, leading to noisy data or loss of important context, which can cause overfitting. To address these issues, we propose a novel augmentation method, RandSaliencyAug (RSA). RSA identifies salient regions in an image and applies one of six new strategies: Row Slice Erasing, Column Slice Erasing, Row-Column Saliency Erasing, Partial Saliency Erasing, Horizontal Half Saliency Erasing, and Vertical Half Saliency Erasing. RSA is available in two versions: Weighted RSA (W-RSA), which selects policies based on performance, and Non-Weighted RSA (N-RSA), which selects randomly. By preserving contextual information while introducing occlusion, RSA improves model generalization. Experiments on Fashion-MNIST, CIFAR10, CIFAR100, and ImageNet show that W-RSA outperforms existing methods.
Download

Paper Nr: 59
Title:

Objective Diagnosis of Depression and Autism Spectrum Disorder Based on fMRI Time Series Statistics

Authors:

Jakub D. Szkudlarek, Sjir J. C. Schielen, Catarina Dinis Fernandes, Danny Ruijters, Albert P. Aldenkamp and Svitlana Zinger

Abstract: A common challenge in diagnosing neuropsychiatric disorders is the lack of objective biomarkers. Current diagnostic approaches rely on the subjective interpretation of observations instead of measurements of brain activity obtained using functional magnetic resonance imaging (fMRI). We propose a method for the objective diagnosis of depression and autism spectrum disorder (ASD), marking the first known experiment that explores the diagnostic performance of only fMRI time series statistics. We researched the importance of time series statistics based on ICA and BOLD for ASD diagnosis. Besides well-known statistics, we introduce features based on the first-order derivative and the frequency-domain representation of the signals. The performance of these features is assessed using multiple machine-learning algorithms. A test accuracy of 69% is achieved on a depression dataset consisting of 72 subjects (51 depressed, 21 controls). On an autism dataset composed of 49 subjects (24 ASD, 25 controls), test accuracies of 67% and 74% are achieved for the ICA- and BOLD-based methods, respectively. The best results on the ASD dataset are related to the lateral sensorimotor network and the right ventral anterior region. These results demonstrate the potential of fMRI time series statistics as objective biomarkers for neuropsychiatric disorders.
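For the flavor of such a feature set (the exact statistics used in the paper may differ; the sampling rate and frequency band below are assumptions), features for one ICA or BOLD time series could be computed roughly as:

    import numpy as np
    from scipy import stats

    def time_series_features(ts, fs=0.5):
        deriv = np.diff(ts)                             # first-order derivative
        spectrum = np.abs(np.fft.rfft(ts - ts.mean())) ** 2
        freqs = np.fft.rfftfreq(len(ts), d=1.0 / fs)
        low_power = spectrum[(freqs > 0.01) & (freqs < 0.1)].sum()
        return np.array([
            ts.mean(), ts.std(), stats.skew(ts), stats.kurtosis(ts),
            deriv.mean(), deriv.std(),
            low_power / (spectrum.sum() + 1e-12),       # relative low-freq power
        ])

Such per-signal feature vectors would then be fed to standard machine-learning classifiers.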
Download

Paper Nr: 61
Title:

StrikeNet: A Deep Neural Network to Predict Pixel-Sized Lightning Location

Authors:

Mélanie Bosc, Adrien Chan-Hon-Tong, Aurélie Bouchard and Dominique Béréziat

Abstract: Forecasting the location of electrical activity at a very short time range remains one of the most challenging predictions to make, primarily attributable to the chaotic nature of thunderstorms. Additionally, the punctual nature of lightning further complicates the establishment of reliable forecasts. This article introduces StrikeNet, a specialized Convolutional Neural Network (CNN) model designed for very short-term forecasts of pixel-sized electrical activity locations, utilizing sequences of temporal images as input and only two data types. Employing soft Non-Maximum Suppression (NMS) techniques, incorporating morphological features within residual blocks, and implementing dropout regularization, StrikeNet is specifically designed for detecting and predicting pixel-sized objects in images. This design seamlessly aligns with the task of forecasting imminent electrical activity, achieving an F1 score of about 0.53 for the positive class (lightning) and outperforming the state of the art. Moreover, it can be applied to similar datasets such as the Aerial Elephant Dataset (AED), where it outperforms traditional CNN models.
Download

Paper Nr: 82
Title:

Deep Learning-Tuned Adaptive Inertia Weight in Particle Swarm Optimization for Medical Image Registration

Authors:

Katharina Krämer, Stefan Müller and Michael Kosterhon

Abstract: A novel parameter training approach for Adaptive Inertia Weight Particle Swarm Optimization (AIW-PSO) using Deep Learning is proposed. In PSO, balancing exploration and exploitation is crucial, with inertia governing parameter space sampling. This work presents a method for training transfer function parameters that adjust the inertia weight based on a particle’s individual search ability (ISA) in each dimension. A neural network is used to train the parameters of this transfer function, which then maps the ISA value to a new inertia weight. During inference, the best possible success ratio and lowest average error are used as network inputs to predict optimal parameters. Interestingly, the parameters across different objective functions are similar and assume values that may appear spatially implausible, yet outperform all other considered value expressions. We evaluate the proposed method, Deep Learning-Tuned Adaptive Inertia Weight (TAIW), against three inertia strategies on three benchmark functions: Constant Inertia Strategy (CIS), Linear Decreasing Inertia (LDI), and Adaptive Inertia Weight (AIW). Additionally, we apply these PSO inertia strategies to medical image registration, utilizing digitally reconstructed radiographs (DRRs). The results show promising improvements in alignment accuracy using TAIW. Finally, we introduce a metric that assesses search effectiveness based on multidimensional search space volumes.
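A minimal sketch of a PSO velocity update with a per-dimension adaptive inertia weight follows; the sigmoid mapping merely stands in for the trained transfer function and is purely a placeholder assumption.

    import numpy as np

    def sigmoid_inertia(isa, w_min=0.4, w_range=0.5):
        # Placeholder transfer function mapping ISA to an inertia weight.
        return w_min + w_range / (1.0 + np.exp(-isa))

    def pso_step(pos, vel, pbest, gbest, isa, c1=1.5, c2=1.5,
                 w_map=sigmoid_inertia):
        r1 = np.random.rand(*pos.shape)
        r2 = np.random.rand(*pos.shape)
        w = w_map(isa)                          # adaptive, per-dimension inertia
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        return pos + vel, vel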
Download

Paper Nr: 107
Title:

EasyPortrait: Face Parsing and Portrait Segmentation Dataset

Authors:

Karina Kvanchiani, Elizaveta Petrova, Karen Efremyan, Alexander Sautin and Alexander Kapitanov

Abstract: Video conferencing apps have recently improved functionality by incorporating computer vision-based features such as real-time background removal and face beautification. The lack of diversity in existing portrait segmentation and face parsing datasets – particularly regarding head poses, ethnicity, scenes, and video conferencing-specific occlusions – motivated us to develop a new dataset, EasyPortrait, designed to address these tasks simultaneously. It contains 40,000 primarily indoor photos simulating video meeting scenarios, featuring 13,705 unique users and fine-grained segmentation masks divided into 9 classes. Since annotation masks from other datasets were unsuitable for our task, we revised the annotation guidelines, enabling EasyPortrait to handle cases like teeth whitening and skin smoothing. This paper also introduces a pipeline for data mining and high-quality mask annotation through crowdsourcing. The ablation study demonstrated the critical role of data quantity and head pose diversity in EasyPortrait. Cross-dataset evaluation experiments confirmed that EasyPortrait offers the best domain generalization ability among portrait segmentation datasets. The proposed dataset and trained models are publicly available.
Download

Paper Nr: 114
Title:

Gait Recognition Using CGAN and EfficientNet Deep Neural Networks

Authors:

Entesar T. Burges, Zakariya A. Oraibi and Ali Wali

Abstract: The objective of gait recognition is to identify a person from a distance by their distinctive gait using a visual camera. However, the accuracy of this recognition can be impacted by factors such as carrying a bag or a change of clothing. The human gait recognition framework presented in this study is based on deep learning and the EfficientNet deep neural network. The proposed framework includes three steps: the first step involves extracting silhouettes, the second involves computing the gait cycle, and the third involves calculating gait energy using conditional generative adversarial networks and the EfficientNet deep neural network. In the first step, silhouette images are extracted using a Gaussian mixture-based background subtraction algorithm. The gait cycle is then segmented by measuring the length and width of the silhouette’s bounding box, after which the gait energy is calculated. Images resulting from the previous stage are used as input to the conditional generative adversarial networks to generate the Gait Energy Image (GEI). EfficientNet is employed as an identification discriminator in this work. The suggested framework was evaluated on a challenging gait dataset called CASIA-B and achieved an accuracy of 97.13%. The framework introduced in this paper outperformed techniques from the literature in accuracy.
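As a minimal sketch of the gait energy computation only (not the full pipeline), the Gait Energy Image is the pixel-wise average of the aligned silhouettes over one gait cycle:

    import numpy as np

    def gait_energy_image(silhouettes):
        # silhouettes: sequence of binary masks (H, W), cropped and aligned
        # to the bounding box over a single gait cycle.
        stack = np.asarray(silhouettes, dtype=np.float32)
        return stack.mean(axis=0)   # pixel-wise average over the cycle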
Download

Paper Nr: 115
Title:

Towards Resource-Efficient Deep Learning for Train Scene Semantic Segmentation

Authors:

Marie-Claire Iatrides, Petra Gomez-Krämer, Olfa Ben Ahmed and Sylvain Marchand

Abstract: In this paper, we present a promising application of scaling techniques for segmentation tasks in a railway environment context to highlight the advantages of task-specific models tailored for on-board train use. Smaller convolutional neural networks (CNNs) focus not on accuracy but on resource efficiency. Our models are scaled using skip connections as well as quantization in order to form lightweight models trained specifically for our context. The proposed models have been evaluated in terms of both segmentation performance and efficiency on state-of-the-art scene segmentation datasets, namely RailSem19 and Cityscapes. We have obtained models with fewer than 3.5M parameters and a minimum of 78.4% segmentation accuracy, showing that lightweight models can effectively segment the railway surroundings.
Download

Paper Nr: 117
Title:

Features for Classifying Insect Trajectories in Event Camera Recordings

Authors:

Regina Pohle-Fröhlich, Colin Gebler, Marc Böge, Tobias Bolten, Leland Gehlen, Michael Glück and Kirsten S. Traynor

Abstract: Studying the factors that affect insect population declines requires a monitoring system that automatically records insect activity and environmental factors over time. For this reason, we use a stereo setup with two event cameras in order to record insect trajectories. In this paper, we focus on classifying these trajectories into insect groups. We present the steps required to generate a labeled data set of trajectory segments. Since the manual generation of a labelled dataset is very time-consuming, we investigate possibilities for label propagation to unlabelled insect trajectories. The autoencoder FoldingNet and PointNet++ as a classification network for point clouds are analyzed to generate features describing trajectory segments. The generated feature vectors are converted to 2D using t-SNE. Our investigations showed that the projection of the feature vectors generated with PointNet++ produces clusters corresponding to the different insect groups. Using PointNet++ with fully-connected layers directly for classification, we achieved an overall accuracy of 90.7% for the classification of insects into five groups. In addition, we have developed and evaluated algorithms for the calculation of the speed and size of insects in the stereo data. These can be used as additional features for further differentiation of insects within groups.
Download

Paper Nr: 119
Title:

Uncertainty-Driven Past-Sample Selection for Replay-Based Continual Learning

Authors:

Anxo-Lois Pereira, Eduardo Aguilar and Petia Radeva

Abstract: In a continual learning environment, methods must cope with catastrophic forgetting, i.e. avoid forgetting previously acquired knowledge when new data arrives. Replay-based methods have proven effective for this problem; in particular, simple strategies such as random selection have provided very competitive results. In this paper, we go a step further and propose a novel approach to image recognition utilizing a replay-based continual learning method with uncertainty-driven past-sample selection. Our method aims to address the challenges of data variability and evolving databases by selectively retaining and revisiting samples based on their uncertainty score. It ensures robust performance and adaptability, improving image classification accuracy over time. Based on uncertainty quantification, three groups of methods were proposed and validated, which we call sample sorting, sample clustering, and sample filtering. We evaluated the proposed methods on three public datasets: CIFAR10, CIFAR100, and FOOD101, obtaining very encouraging results that largely outperform the baseline sample selection method for rehearsal on all datasets.
Download

Paper Nr: 121
Title:

Learning-Based Reconstruction of Under-Sampled MRI Data Using End-to-End Deep Learning in Comparison to CS

Authors:

Adnan Khalid, Husnain Shahid, Hatem A. Rashwan and Domenec Puig

Abstract: Magnetic Resonance Imaging (MRI) reconstruction, particularly restoration and denoising, remains challenging due to its ill-posed nature and high computational demands. In response to this, Compressed Sensing (CS) has recently gained prominence for enabling image reconstruction from limited measurements and consequently reducing computational costs. However, CS often struggles to maintain diagnostic image quality and strictly relies on sparsity and incoherence conditions that are somewhat challenging to meet with experimental data or particularly real-world medical data. To address these limitations, this paper proposes a novel framework that integrates CS with a convolutional neural network (CNN), effectively relaxing the CS constraints and enhancing the diagnostic quality of MRI reconstructions. In essence, this method applies CS to generate a measurement vector in an initial step and then refines the output with a CNN to improve image quality. Extensive evaluations on the MRI knee dataset demonstrate the efficacy of this dual-step approach, achieving significant quality improvements (SSIM = 0.876, PSNR = 27.56 dB). A thorough comparative analysis is also performed, demonstrating superior performance over multiple existing CNN architectures.
Download

Paper Nr: 129
Title:

Onboarding Customers in Car Sharing Systems: Implementation of Know Your Customer Solutions

Authors:

Magzhan Kairanbay

Abstract: Car-sharing systems have become an essential part of modern life, with Know Your Customer (KYC) processes being crucial for onboarding users. This research presents a streamlined KYC solution designed to efficiently onboard customers by extracting key information from identity cards and driving licenses. We employ techniques from Computer Vision and Machine Learning, including object detection and Optical Character Recognition (OCR), to facilitate this process. The paper concludes by exploring additional features, such as gender recognition, age prediction, and liveness detection, which can further enhance the KYC system.
Download

Paper Nr: 138
Title:

ODPG: Outfitting Diffusion with Pose Guided Condition

Authors:

Seohyun Lee, Jintae Park and Sanghyeok Park

Abstract: Virtual Try-On (VTON) technology allows users to visualize how clothes would look on them without physically trying them on, gaining traction with the rise of digitalization and online shopping. Traditional VTON methods, often using Generative Adversarial Networks (GANs) and Diffusion models, face challenges in achieving high realism and handling dynamic poses. This paper introduces Outfitting Diffusion with Pose Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple conditioning inputs during the denoising process. By transforming garment, pose, and appearance images into latent features and integrating these features in a UNet-based denoising model, ODPG achieves non-explicit synthesis of garments on dynamically posed human images. Our experiments on the FashionTryOn and a subset of the DeepFashion dataset demonstrate that ODPG generates realistic VTON images with fine-grained texture details across various poses, utilizing an end-to-end architecture without the need for explicit garment warping processes. Future work will focus on generating VTON outputs in video format and on applying our attention mechanism, as detailed in the Method section, to other domains with limited data.
Download

Paper Nr: 142
Title:

Garbage Classification from Visual Footprints: Using Transfer Learning Strategy

Authors:

Zheyuan Xu and Nasim Hajari

Abstract: This study investigates the application of computer vision models based on deep learning, to improve waste sorting and promote environmental sustainability. The research evaluates the effectiveness of Convolutional Neural Networks (CNNs) and transfer learning techniques by comparing the performance of eleven pre-trained models in classifying household waste from images into eight distinct categories. Through the implementation of fine-tuning, learning rate scheduling, and overfitting prevention strategies, the study optimizes model performance. Remarkably, the ConvNeXtBase and EfficientNetV2L models achieved impressive accuracy rates of 99.00% and 98.64%, respectively, underscoring the potential of modern CNN architectures in waste classification tasks. Furthermore, a comparative analysis with recent studies reveals that the dataset’s size, quality, and category diversity play crucial roles in determining model performance, with larger and more diverse datasets enabling superior generalization. The originality of this research lies in its comprehensive, side-by-side comparison of multiple pre-trained models on a garbage classification application. This offers valuable insights into balancing knowledge retention and adaptation to new tasks. The findings underscore the significant potential of advanced neural network architectures in enhancing waste management and recycling practices.
Download

Paper Nr: 153
Title:

AKDT: Adaptive Kernel Dilation Transformer for Effective Image Denoising

Authors:

Alexandru Brateanu, Raul Balmez, Adrian Avram and Ciprian Orhei

Abstract: Image denoising is a fundamental yet challenging task, especially when dealing with high-resolution images and complex noise patterns. Most existing methods rely on standard Transformer architectures, which often suffer from high computational complexity and limited adaptability to varying noise levels. In this paper, we introduce the Adaptive Kernel Dilation Transformer (AKDT), a novel Transformer-based model that fully harnesses the power of learnable dilation rates within convolutions. AKDT consists of several layers and custom-designed blocks, including our novel Learnable Dilation Rate (LDR) module, which is utilized to construct a Noise Estimator module (NE). At the core of AKDT, the NE is seamlessly integrated within standard Transformer components to form the Noise-Guided Feed-Forward Network (NG-FFN) and Noise-Guided Multi-Headed Self-Attention (NG-MSA). These noise-modulated Transformer components enable the model to achieve unparalleled denoising performance while significantly reducing computational costs. Extensive experiments across multiple image denoising benchmarks demonstrate that AKDT sets a new state-of-the-art, effectively handling both real and synthetic noise. The source code and pre-trained models are publicly available at https://github.com/albrateanu/AKDT.
Download

Paper Nr: 155
Title:

RacketDB: A Comprehensive Dataset for Badminton Racket Detection

Authors:

Muhammad Abdul Haq, Shuhei Tarashima and Norio Tagawa

Abstract: In this paper, we present RacketDB, a specialized dataset designed to address the challenges of detecting badminton rackets in images, a task often hindered by the lack of dedicated datasets. Existing general-purpose datasets fail to capture the unique characteristics of badminton rackets. RacketDB includes 16,608 training images, 3,175 testing images, and 2,899 validation images, all meticulously annotated to enhance object detection performance for sports analytics. To evaluate the effectiveness of RacketDB, we utilized several established object detection models, including YOLOv5, YOLOv8, DETR, and Faster R-CNN. These models were assessed based on metrics like mean average precision (mAP), precision, recall, and F1 score. Our results demonstrate that RacketDB significantly improves detection accuracy compared to general datasets, highlighting its potential as a valuable resource for developing advanced sports analytics tools. This paper provides a detailed description of RacketDB, the evaluation process, and insights into its application in enhancing automated detection in badminton. The dataset is available at https://github.com/muhabdulhaq/racketdb.
Download

Paper Nr: 170
Title:

Integrating Image Quality Assessment Metrics for Enhanced Segmentation Performance in Reconstructed Imaging Datasets

Authors:

Samiha Mirza, Apurva Gala, Pandu Devarakota, Pranav Mantini and Shishir K. Shah

Abstract: Addressing the challenge of ensuring high-quality data selection for segmentation models applied to reconstructed imaging datasets, particularly seismic and MRI data, is crucial for enhancing model performance. These datasets often suffer from quality variations due to the complex nature of their acquisition processes, leading to the model failing to generalize well on these datasets. This paper investigates the impact of incorporating Image Quality Assessment (IQA) metrics into the data selection process to mitigate this challenge. By systematically selecting images with the highest quality based on quantitative metrics, we aim to improve the training process of segmentation models. Our approach focuses on training salt segmentation models for seismic data and tumor segmentation models for MRI data, illustrating the influence of image quality on segmentation accuracy and overall model performance.
Download

Paper Nr: 179
Title:

DeepHorizon: A Two-Stage Visual Transformer Model for Maritime Horizon Detection

Authors:

Reza Mohammadi Moghaddam, Amin Majd and Juha Kalliovaara

Abstract: Horizon line detection is a crucial task for maritime navigation, enabling accurate course adjustments and situational awareness in real-time. This paper presents a novel approach for detecting horizon lines in complex maritime environments, including challenging conditions such as fog, night, and overcast skies. The method utilizes the Deep Hough Transform (DHT) to generate candidate horizon lines, followed by a Probability Density Function (PDF) filtering process to discard unrealistic candidates. The remaining lines are then processed by two parallel classifiers: a Pyramid Vision Transformer (PVT) and a Convolutional Neural Network (CNN), each focusing on different aspects of the image to extract global and local features. The proposed method effectively addresses the challenges posed by diverse weather conditions, achieving notable improvements in accuracy. Experimental results on the Singapore Maritime Dataset (SMD) show that the method achieves an accuracy of 95.1%. This approach enhances the robustness of horizon detection and provides a reliable solution for real-time maritime navigation systems. In addition, the proposed approach demonstrates notable proficiency in determining the central position of the horizon line and angular precision, particularly when analyzing the 95th percentile of the evaluated dataset, outperforming existing state-of-the-art techniques. The accurate estimation of the horizon line’s orientation facilitates perspective correction, which enhances subsequent computer vision tasks such as object detection and tracking in challenging maritime environments.

Paper Nr: 200
Title:

Lunar Technosignatures: A Deep Learning Approach to Detecting Apollo Landing Sites on the Lunar Surface

Authors:

Tom Sander and Christian Wöhler

Abstract: Uncovering anomalies on the lunar surface is crucial for understanding the Moon’s geological and astronomical history. By identifying and studying these anomalies, new theories about the changes that have occurred on the Moon can be developed or refined. This study seeks to enhance anomaly detection on the Moon and replace the time-consuming manual data search process by testing an anomaly detection method using the Apollo landing sites. The landing sites are advantageous as they are both anomalous and can be located, enabling an assessment of the procedure. Our study compares the performance of various state-of-the-art machine learning algorithms in detecting anomalies in the Narrow-Angle Camera data from the Lunar Reconnaissance Orbiter spacecraft. The results demonstrate that our approach outperforms previous publications in accurately predicting landing site artifacts and technosignatures at the Apollo 15 and 17 landing sites. While our method achieves promising results, there is still room for improvement. Future refinements could focus on detecting more subtle anomalies, such as the rover tracks left by the Apollo missions.
Download

Paper Nr: 206
Title:

Weak Segmentation and Unsupervised Evaluation: Application to Froth Flotation Images

Authors:

Egor Prokopov, Daria Usacheva, Mariia Rumiantceva and Valeria Efimova

Abstract: Images featuring clumped texture object types are prevalent across various domains, and accurate analysis of this data is crucial for numerous industrial applications, including ore flotation—a vital process for material enrichment. Although computer vision facilitates the automation of such analyses, obtaining annotated data remains a challenge due to the labor-intensive and time-consuming nature of manual labeling. In this paper, we propose a universal weak segmentation method adaptable to different clumped texture composite images. We validate our approach using froth flotation images as a case study, integrating classical watershed techniques with foundational models for weak labeling. Additionally, we explore unsupervised evaluation metrics that account for highly imbalanced class distributions. Our dataset was tested across several architectures, with Swin-UNETR demonstrating the highest performance, achieving 89% accuracy and surpassing the same model tested on other datasets. This approach highlights the potential for effective segmentation with minimal manual annotations while ensuring generalizability to other domains.
Download

Paper Nr: 207
Title:

Cross-Site Relational Distillation for Enhanced MRI Segmentation

Authors:

Eddardaa Ben Loussaief, Mohammed Ayad, Hatem A. Rashwan and Domenec Puig

Abstract: The joint use of diverse data sources for medical imaging segmentation has emerged as a crucial area of research, aiming to address challenges such as data heterogeneity, domain shift, and data quality discrepancies. Integrating information from multiple data domains has shown promise in improving model generalizability and adaptability. However, this approach often demands substantial computational resources, hindering its practicality. In response, knowledge distillation (KD) has garnered attention as a solution. KD involves training lightweight models to emulate the behavior of more resource-intensive models, thereby mitigating the computational burden while maintaining performance. This paper addresses the pressing need to develop a lightweight and generalizable model for medical imaging segmentation that can effectively handle data integration challenges. Our proposed approach introduces a novel relation-based knowledge framework by seamlessly combining adaptive affinity-based and kernel-based distillation. This methodology empowers the student model to accurately replicate the feature representations of the teacher model, facilitating robust performance even in the face of domain shift and data heterogeneity. To validate our approach, we conducted experiments on publicly available multi-source prostate MRI data. The results demonstrate a significant enhancement in segmentation performance using lightweight networks. Notably, our method achieves this improvement while reducing both inference time and storage usage.
Download

Paper Nr: 208
Title:

Two Simple Unfolded Residual Networks for Single Image Dehazing

Authors:

Bartomeu Garau, Joan Duran and Catalina Sbert

Abstract: Haze is an environmental factor that impairs visibility for outdoor imaging systems, presenting challenges for computer vision tasks. In this paper, we propose two novel approaches that combine the classical dark channel prior with variational formulations to construct an energy functional for single-image dehazing. The proposed functional is minimized using a proximal gradient descent scheme, which is unfolded into two different networks: one built with residual blocks and the other with residual channel attention blocks. Both methods provide straightforward yet effective solutions for dehazing, achieving competitive results with simple and interpretable architectures.
Download

Paper Nr: 212
Title:

Local Foreground Selection Aware Attentive Feature Reconstruction for Few-Shot Fine-Grained Plant Species Classification

Authors:

Aisha Zulfiqar and Ebroul Izquierdo

Abstract: Plant species exhibit subtle distinctions, requiring a reduction in intra-class variation and an increase in inter-class differences to improve accuracy. This paper addresses plant species classification using a limited number of labelled samples and introduces a novel Local Foreground Selection (LFS) attention mechanism. Based on the proposed attention, the Local Foreground Selection Module (LFSM) is a straightforward module designed to generate discriminative support and query feature maps. It operates by integrating two types of attention: local attention, which captures local spatial details to enhance feature discrimination and increase inter-class differentiation, and foreground selection attention, which emphasizes the foreground plant object while mitigating background interference. By focusing on the foreground, the query and support features selectively highlight relevant feature sequences and disregard less significant background sequences, thereby reducing intra-class differences. Experimental results from three plant species datasets demonstrate the effectiveness of the proposed LFS attention and its complementary advantages over previous feature reconstruction methods.
Download

Paper Nr: 218
Title:

REPVSR: Efficient Video Super-Resolution via Structural Re-Parameterization

Authors:

KunLei Hu and Dahai Yu

Abstract: Recent advances in video super-resolution (VSR) have explored the power of deep learning to achieve better reconstruction performance. However, the high computational cost still hinders practical usage that demands real-time performance (24 fps). In this paper, we propose a re-parameterization video super-resolution network (REPVSR) to accelerate reconstruction with an efficient and generic design. Specifically, we propose re-parameterizable building blocks, namely the Super-Resolution Multi-Branch block (SRMB) for the efficient SR part and the FlowNet Multi-Branch block (FNMB) for the optical flow estimation part. The blocks extract features along multiple paths in the training stage, and merge the multiple operations into a single 3×3 convolution in the inference stage. We then propose an extremely efficient VSR network based on SRMB and FNMB, namely REPVSR. Extensive experiments demonstrate the effectiveness and efficiency of REPVSR.
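
Structural re-parameterization of this kind can be illustrated with a small, self-contained sketch. This is not the paper's SRMB/FNMB code; it only shows, under simplified assumptions (two parallel branches, matching channels, stride 1, biases present), how a 3x3 and a 1x1 convolution trained in parallel can be folded into one equivalent 3x3 convolution for inference.

import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Fold parallel 3x3 and 1x1 convolutions (same channels, stride 1, biases) into one 3x3 conv."""
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, kernel_size=3, padding=1)
    w1x1_padded = F.pad(conv1x1.weight, [1, 1, 1, 1])     # pad the 1x1 kernel to 3x3
    fused.weight.data = conv3x3.weight.data + w1x1_padded
    fused.bias.data = conv3x3.bias.data + conv1x1.bias.data
    return fused

# Quick equivalence check: the fused conv reproduces the multi-branch output.
x = torch.randn(1, 8, 16, 16)
c3, c1 = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
print(torch.allclose(c3(x) + c1(x), merge_branches(c3, c1)(x), atol=1e-5))   # True
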
Download

Paper Nr: 219
Title:

From Noise Estimation to Restoration: A Unified Diffusion and Bayesian Risk Approach for Unsupervised Denoising

Authors:

Reeshad Khan, Ukash Nakarmi and John M. Gauch

Abstract: Deep Neural Networks (DNNs) have revolutionized image denoising, challenging traditional methods such as Stein’s Unbiased Risk Estimator (SURE) and its extensions (eSURE and PURE), along with Extended Poisson Unbiased Risk Estimator (ePURE). These traditional approaches often struggle to generalize across different noise types, especially when noise characteristics are unknown or vary widely, and they are not equipped to handle mixed noise scenarios effectively. In response, we present a novel unsupervised learning strategy that leverages an enhanced diffusion model combined with a dynamically trained Deep Convolutional Neural Network (DnCNN). We introduce adaptive Bayesian loss functions—Bayesian-SURE, Bayesian-PURE, and a newly developed Bayesian-Poisson-Gaussian Unbiased Risk Estimator (Bayesian-PGURE)—that adjust to estimated noise levels and types without prior knowledge. This innovative method enables significant improvements in handling mixed noise conditions and ensures robustness across varied imaging scenarios. Our comprehensive evaluations on MRI data corrupted by Gaussian, Poisson, and mixed noise demonstrate that our approach outperforms existing algorithms, achieving superior denoising performance and image fidelity under diverse, unpredictable conditions. Our contributions advance the state-of-the-art in medical imaging denoising, establishing a new benchmark for unsupervised learning frameworks in managing complex noise dynamics.
Download

Paper Nr: 221
Title:

Weight Factorization Based Incremental Learning in Generalized Few Shot Segmentation

Authors:

Anuska Roy and Viswanath Gopalakrishnan

Abstract: Generalized Few-shot Semantic Segmentation (GFSS) targets to segment novel object categories using a few annotated examples after learning the segmentation on a set of base classes. A typical GFSS training involves two stages - base class learning followed by novel class addition and learning. While existing methods have shown promise, they often struggle when novel classes are significant in number. Most current approaches freeze the encoder backbone to retain base class accuracy; however, freezing the encoder backbone can potentially impede the assimilation of novel information from the new classes. To address this challenge, we propose to use an incremental learning strategy in GFSS for learning both encoder backbone and novel class prototypes. Inspired by the recent success of Low Rank Adaptation techniques (LoRA), we introduce incremental learning to the GFSS encoder backbone with a novel weight factorization method. Our newly proposed rank adaptive weight merging strategy is sensitive to the varying degrees of novelty assimilated across various layers of the encoder backbone. In our work, we also introduce the incremental learning strategy to class prototype learning for novel categories. Our extensive experiments on Pascal-5i and COCO-20i databases showcase the effectiveness of incremental learning, especially when the novel classes outnumber base classes. With our proposed Weight Factorization based Incremental Learning (WFIL) method, a new set of state-of-the-art accuracy values is established in Generalized Few-shot Semantic Segmentation.
Download

Paper Nr: 227
Title:

A Computer Vision Approach to Fertilizer Detection and Classification

Authors:

Jens Lippel, Richard Bihlmeier and André Stuhlsatz

Abstract: This paper introduces a computer vision-based pipeline for the classification of different types of fertilizers from collected images. For robust boundary detection of individual grains in a heap, we used YOLO11 for classification and Segment Anything 2 for segmentation in an active learning fashion. The segmenter as well as the classifier are iteratively improved starting with an initial set of handcrafted training samples. Despite the high diversity in grain structures, the relatively simple camera setup and the limited number of handcrafted training samples, a classification accuracy of 99.996% was achieved.
Download

Paper Nr: 229
Title:

Enhancing Small Object Detection in Resource-Constrained ARAS Using Image Cropping and Slicing Techniques

Authors:

Chinmaya Kaundanya, Paulo Cesar, Barry Cronin, Andrew Fleury, Mingming Liu and Suzanne Little

Abstract: Powered two-wheelers, such as motorcycles, e-bikes, and e-scooters, exhibit disproportionately high fatality rates in road traffic incidents worldwide. Advanced Rider Assistance Systems (ARAS) have the potential to enhance rider safety by providing real-time hazard alerts. However, implementing effective ARAS on the resource-constrained hardware typical of micromobility vehicles presents significant challenges, particularly in detecting small or distant objects using monocular cameras and lightweight convolutional neural networks (CNNs). This study evaluates two computationally efficient image preprocessing techniques aimed at improving small and distant object detection in ARAS applications: image center region-of-interest (ROI) cropping and image slicing and re-slicing. Utilizing the YOLOv8-nano object detection model at relatively low input resolutions of 160×160, 320×320, and 640×640 pixels, we conducted experiments on the VisDrone and KITTI datasets, which represent scenarios where small and distant objects are prevalent. Our results indicate that the image center ROI cropping technique improved the detection of small objects, particularly at a 320×320 resolution, achieving enhancements of 6.67× and 1.27× in mean Average Precision (mAP) on the VisDrone and KITTI datasets, respectively. However, excessive cropping negatively impacted the detection of medium and large objects due to the loss of peripheral contextual information and the exclusion of objects outside the cropped region. Image slicing and re-slicing demonstrated impressive improvements in detecting small objects, especially using the grid-based slicing strategy on the VisDrone dataset, with an mAP increase of 2.24× over the baseline. Conversely, on the KITTI dataset, although a performance gain of 1.66× over the baseline was observed for small objects at a 320×320 resolution, image slicing adversely affected the detection of medium and large objects. The fragmentation of objects at image slice borders caused partial visibility, which reduced detection accuracy. These findings contribute to the development of more effective and efficient ARAS technologies, ultimately enhancing the safety of powered two-wheeler riders. Our evaluation code scripts are publicly accessible at: https://github.com/Luna-Scooters/SOD using image preprocessing.
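
The slicing idea can be made concrete with a short sketch. This is a hypothetical illustration, not the authors' code: the frame is cut into overlapping tiles so that small objects occupy a larger share of each detector input; the tile size and overlap below are assumed values.

import numpy as np

def tile_offsets(length, tile, step):
    offsets = list(range(0, max(length - tile, 0) + 1, step))
    if offsets[-1] + tile < length:          # make sure the last tile reaches the image edge
        offsets.append(length - tile)
    return offsets

def slice_image(image, tile=320, overlap=0.2):
    """Yield (crop, x_offset, y_offset) tiles covering the full frame."""
    h, w = image.shape[:2]
    step = max(int(tile * (1.0 - overlap)), 1)
    for y in tile_offsets(h, tile, step):
        for x in tile_offsets(w, tile, step):
            yield image[y:y + tile, x:x + tile], x, y

# Detections from each tile are shifted back by (x_offset, y_offset) and merged
# with non-maximum suppression over the full frame.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(sum(1 for _ in slice_image(frame)))    # 15 tiles of 320x320 with 20% overlap
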
Download

Paper Nr: 242
Title:

Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator

Authors:

Takahiro Mano, Reiji Saito and Kazuhiro Hotta

Abstract: In semantic segmentation, the creation of pixel-level labels for training data incurs significant costs. To address this problem, semi-supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance the performance, has gained attention. A conventional semi-supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo-labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo-labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID-19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi-supervised learning methods.
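
The pasting operation described above can be sketched in a few lines. This is an illustrative toy version, not the authors' implementation: regions belonging to selected classes are cut from a labeled image together with its ground-truth mask and pasted onto an unlabeled image and its pseudo-label.

import numpy as np

def supervised_classmix(labeled_img, labeled_mask, unlabeled_img, pseudo_mask, classes_to_paste):
    """Return a mixed image and mixed label after pasting ground-truth class regions."""
    paste_region = np.isin(labeled_mask, classes_to_paste)                 # H x W boolean mask
    mixed_img = np.where(paste_region[..., None], labeled_img, unlabeled_img)
    mixed_label = np.where(paste_region, labeled_mask, pseudo_mask)
    return mixed_img, mixed_label

# Toy 4x4 example: paste class 1 from the labeled sample onto the unlabeled one.
labeled_img, unlabeled_img = np.ones((4, 4, 3)), np.zeros((4, 4, 3))
labeled_mask = np.array([[0, 0, 1, 1]] * 4)
pseudo_mask = np.zeros((4, 4), dtype=int)
_, mixed_label = supervised_classmix(labeled_img, labeled_mask, unlabeled_img, pseudo_mask, [1])
print(mixed_label)
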
Download

Paper Nr: 248
Title:

MEFA: Multimodal Image Early Fusion with Attention Module for Pedestrian and Vehicle Detection

Authors:

Yoann Dupas, Olivier Hotel, Grégoire Lefebvre and Christophe Cérin

Abstract: Pedestrian and vehicle detection represents a significant challenge in autonomous driving, particularly in adverse weather conditions. Multimodal image fusion addresses this challenge. This paper proposes a new early-fusion attention-based approach from visible, infrared, and LiDAR images, designated as MEFA (Multi-modal image Early Fusion with Attention). In this study, we compare our MEFA proposal with a channel-wise concatenation early-fusion approach. When coupled with YOLOv8 or RT-DETRv1 for pedestrian and vehicle detection, our contribution is promising in adverse weather conditions (i.e. rainy days or foggy nights). Furthermore, our MEFA proposal demonstrated superior mAP accuracy on the DENSE dataset.
Download

Paper Nr: 255
Title:

Enforcing Graph Structures to Enhance Key Information Extraction in Document Analysis

Authors:

Rajashree Majumder, Zhewei Wang, Ye Yue, Mukut Kalita and Jundong Liu

Abstract: Key Information Extraction (KIE) is a critical and often final step in the comprehensive process of document analysis. Various graph-based solutions, including SDMG-R, have been proposed to address the challenges posed by the relationships between document components. In this paper, we propose a spatial structure-guided framework to integrate known structures of the data and tasks, which are represented as ground-truth graphs. This integration is enforced by minimizing a (dis-)similarity loss defined on graph edges. To optimize graph similarity, different loss functions are explored for the edge loss. In addition, we enhance the text feature extraction component in SDMG-R from character-level Bi-LSTM to word-level embeddings using a fine-tuned BERT, thereby integrating deeper language knowledge into the text labeling procedure. Experiments on the FUNSD and WildReceipt datasets demonstrate the effectiveness of our proposed model in extracting key information from document images with unseen templates, significantly outperforming baseline models.
Download

Paper Nr: 269
Title:

Attached Shadow Constrained Shape from Polarization

Authors:

Momoka Yoshida, Ryo Kawahara and Takahiro Okabe

Abstract: This paper tackles a long-standing challenge in computer vision: single-shot, per-pixel surface normal recovery. Although polarization provides a crucial clue to solving this problem, it leaves ambiguity in the normal estimation even when the refractive index is known. Therefore, previous studies require additional clues or assumptions. In this paper, we propose a novel approach to resolve this ambiguity and the unknown refractive index simultaneously. Our key idea is to leverage attached shadows to resolve normal ambiguity while measuring the refractive index based on wavelength characteristics in a single-shot scenario. We achieve this by separating the contributions of three appropriately placed narrow-band light sources in the RGB channel. We further introduce disambiguation uncertainty to address cast shadows and achieve more accurate normal recovery. Our experimental evaluations with synthetic and real images confirm the effectiveness of our method both qualitatively and quantitatively.
Download

Paper Nr: 276
Title:

Evaluating Combinations of Optimizers and Loss Functions for Cloud Removal Using Diffusion Models

Authors:

Leandro Henrique Furtado Pinto Silva, João Fernando Mari, Mauricio C. Escarpinati and André R. Backes

Abstract: Cloud removal is crucial for photogrammetry applications, including urban planning, precision agriculture, and climate monitoring. Recently, generative models, especially those based on latent diffusion, have shown remarkable results in high-quality synthetic image generation, making them suitable for cloud removal tasks. These approaches require optimizing numerous trainable parameters with various optimizers and loss functions. This study evaluates the impact of combining three optimizers (SGD, Adam, and AdamW) with the MAE, MSE, and Huber loss functions. For evaluation, we used the SEN MTC New dataset, which contains pairs of 4-band images with and without clouds, divided into training, validation, and test sets. The results, measured in terms of PSNR and SSIM, show that the diffusion model combining AdamW and the Huber loss function delivers exceptional performance in cloud removal.
Download

Paper Nr: 280
Title:

ShadowScout: Robust Unsupervised Shadow Detection for RGB Imagery

Authors:

Estephan Rustom, Henrique Cabral, Sreeraj Rajendran and Elena Tsiporkova

Abstract: Accurate shadow detection and correction are critical for improving image classification and segmentation but remain challenging due to the lack of well-labeled datasets and the context-specific nature of shadows, which limit the generalizability of supervised models. Existing unsupervised approaches, on the other hand, often require specialized data or are computationally intensive due to high parameterization. In this paper, we introduce ShadowScout, a novel, low-parameterized, unsupervised deep learning method for shadow detection using standard RGB images. ShadowScout is fast, achieves performance comparable to state-of-the-art supervised methods, and surpasses existing unsupervised techniques across various datasets. Additionally, the model can seamlessly incorporate extra data, such as near-infrared channels, to enhance shadow detection accuracy further. ShadowScout is available on the authors’ GitHub repository (https://github.com/EluciDATALab/elucidatalab.starterkits/tree/main/models/shadows).
Download

Paper Nr: 281
Title:

SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints

Authors:

Weiwen Hu, Niccolò Parodi, Marcus Zepp, Ingo Feldmann, Oliver Schreer and Peter Eisert

Abstract: Open-vocabulary segmentation, powered by large visual-language models like CLIP, has expanded 2D segmentation capabilities beyond fixed classes predefined by the dataset, enabling zero-shot understanding across diverse scenes. Extending these capabilities to 3D segmentation introduces challenges, as CLIP’s image-based embeddings often lack the geometric detail necessary for 3D scene segmentation. Recent methods tend to address this by introducing additional segmentation models or replacing CLIP with variations trained on segmentation data, which leads to redundancy or loss of CLIP’s general language capabilities. To overcome this limitation, we introduce SPNeRF, a NeRF-based zero-shot 3D segmentation approach that leverages geometric priors. We integrate geometric primitives derived from the 3D scene into NeRF training to produce primitive-wise CLIP features, avoiding the ambiguity of point-wise features. Additionally, we propose a primitive-based merging mechanism enhanced with affinity scores. Without relying on additional segmentation models, our method further explores CLIP’s capability for 3D segmentation and achieves notable improvements over the original LERF.
Download

Paper Nr: 296
Title:

Dynamic Hierarchical Token Merging for Vision Transformers

Authors:

Karim Haroun, Thibault Allenet, Karim Ben Chehida and Jean Martinet

Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision, excelling in tasks such as image classification, segmentation, and object detection. However, their quadratic complexity O(N^2), where N is the token sequence length, poses challenges when deployed on resource-limited devices. To address this issue, dynamic token merging has emerged as an effective strategy, progressively reducing the token count during inference to achieve computational savings. Some strategies consider all tokens in the sequence as merging candidates, without focusing on spatially close tokens. Other strategies either limit token merging to a local window or constrain it to pairs of adjacent tokens, thus not capturing more complex feature relationships. In this paper, we propose Dynamic Hierarchical Token Merging (DHTM), a novel token merging approach, where we advocate that spatially close tokens share more information than distant tokens and consider all pairs of spatially close candidates instead of imposing fixed windows. Besides, our approach draws on the principles of Hierarchical Agglomerative Clustering (HAC), where we iteratively merge tokens in each layer, fusing a fixed number of selected neighbor token pairs based on their similarity. Our proposed approach is off-the-shelf, i.e., it does not require additional training. We evaluate our approach on the ImageNet-1K dataset for classification, achieving substantial computational savings while minimizing accuracy reduction, surpassing existing token merging methods.
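
A single merging step of this kind can be sketched compactly. The following is a simplified, hypothetical illustration (greedy rather than fully hierarchical, with an assumed grid size, merge count r, and mean-fusion rule): among spatially adjacent token pairs, the r most similar pairs are averaged, shortening the sequence before the next layer.

import torch
import torch.nn.functional as F

def merge_adjacent_tokens(tokens, grid, r):
    """tokens: (N, C) with N == grid * grid. Returns a shorter token sequence."""
    idx = torch.arange(grid * grid).reshape(grid, grid)
    pairs = torch.cat([                                   # horizontal and vertical neighbors
        torch.stack([idx[:, :-1].reshape(-1), idx[:, 1:].reshape(-1)], dim=1),
        torch.stack([idx[:-1, :].reshape(-1), idx[1:, :].reshape(-1)], dim=1),
    ])
    sims = F.cosine_similarity(tokens[pairs[:, 0]], tokens[pairs[:, 1]], dim=-1)
    merged, used = [], set()
    for p in pairs[sims.argsort(descending=True)]:        # greedily fuse the most similar pairs
        a, b = int(p[0]), int(p[1])
        if len(merged) < r and a not in used and b not in used:
            merged.append((tokens[a] + tokens[b]) / 2)
            used.update((a, b))
    keep = [tokens[i] for i in range(tokens.shape[0]) if i not in used]
    return torch.stack(keep + merged)

x = torch.randn(16, 64)                                    # a 4x4 grid of 64-d tokens
print(merge_adjacent_tokens(x, grid=4, r=3).shape)         # torch.Size([13, 64])
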
Download

Paper Nr: 299
Title:

Breast Density Estimation in Mammograms Using Unsupervised Image Segmentation

Authors:

Khaldoon Alhusari and Salam Dhou

Abstract: Breast cancer is very common, and early detection through mammography is paramount. Breast density, a strong risk factor for breast cancer, can be estimated from mammograms. Current density estimation methods can be subjective, labor-intensive, and proprietary. This work proposes a framework for breast density estimation based on the unsupervised segmentation of mammograms. A state-of-the-art unsupervised image segmentation algorithm is adopted for the purpose of breast density segmentation. Mammographic percent density is estimated through a process of arithmetic division. The percentages are then discretized into qualitative assessments of density (“Fatty” and “Dense”) using a thresholding approach. Evaluation reveals robust segmentation at the pixel-level with silhouette scores averaging 0.95 and significant unsupervised labeling quality at the per-image level with silhouette scores averaging 0.61. The proposed framework is highly adaptable, generalizable, and non-subjective, and has the potential to be a beneficial support tool for radiologists.
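
The density-estimation step after segmentation reduces to simple arithmetic, as the following hypothetical sketch shows; the 50% cut-off used to discretize the percentage is an assumed value, not necessarily the threshold chosen in the paper.

import numpy as np

def percent_density(breast_mask, dense_mask, threshold=0.5):
    """Return (percent density, qualitative label) from binary segmentation masks."""
    breast_pixels = np.count_nonzero(breast_mask)
    dense_pixels = np.count_nonzero(dense_mask & breast_mask)
    density = dense_pixels / max(breast_pixels, 1)           # fraction of dense tissue
    return density, ("Dense" if density >= threshold else "Fatty")

breast = np.ones((100, 100), dtype=bool)
dense = np.zeros((100, 100), dtype=bool)
dense[:30] = True
print(percent_density(breast, dense))      # (0.3, 'Fatty')
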
Download

Paper Nr: 309
Title:

Deep Learning for Image Analysis and Diagnosis Aid of Prostate Cancer

Authors:

Maxwell Gomes da Silva, Bruno Augusto Nassif Travençolo and André R. Backes

Abstract: Prostate cancer remains one of the most critical health challenges, ranking among the leading causes of cancer-related deaths in men worldwide. This study seeks to automate the identification and classification of cancerous regions in histological images using deep learning, specifically convolutional neural networks (CNNs). Using the PANDA dataset and Mask R-CNN, our approach achieved an accuracy of 91.3%. This result highlights the potential of our methodology to enhance early detection, improve patient outcomes, and provide valuable support to pathologists in their diagnostic processes.
Download

Paper Nr: 310
Title:

Automatic Drywall Analysis for Progress Tracking and Quality Control in Construction

Authors:

Mariusz Trzeciakiewicz, Aleixo Cambeiro Barreiro, Niklas Gard, Anna Hilsmann and Peter Eisert

Abstract: Digitalization in the construction industry has become essential, enabling centralized, easy access to all relevant information of a building. Automated systems can facilitate the timely and resource-efficient documentation of changes, which is crucial for key processes such as progress tracking and quality control. This paper presents a method for image-based automated drywall analysis enabling construction progress and quality assessment through on-site camera systems. Our proposed solution integrates a deep learning-based instance segmentation model to detect and classify various drywall elements with an analysis module to cluster individual wall segments, estimate camera perspective distortions, and apply the corresponding corrections. This system extracts valuable information from images, enabling more accurate progress tracking and quality assessment on construction sites. Our main contributions include a fully automated pipeline for drywall analysis, improving instance segmentation accuracy through architecture modifications and targeted data augmentation, and a novel algorithm to extract important information from the segmentation results. Our modified model, enhanced with data augmentation, achieves significantly higher accuracy compared to other architectures, offering more detailed and precise information than existing approaches. Combined with the proposed drywall analysis steps, it enables the reliable automation of construction progress and quality assessment.
Download

Paper Nr: 317
Title:

FRCol: Face Recognition Based Speaker Video Colorization

Authors:

Rory Ward and John Breslin

Abstract: Automatic video colorization has recently gained attention for its ability to adapt old movies for today’s modern entertainment industry. However, there is a significant challenge: limiting unnatural color hallucination. Generative artificial intelligence often generates erroneous results, which in colorization manifests as unnatural colorizations. In this work, we propose to ground our automatic video colorization system in relevant exemplars by leveraging a face database, which we retrieve from using facial recognition technology. This retrieved exemplar guides the colorization of the latent-diffusion-based speaker video colorizer. We dub our system FRCol. We focus on speakers as humans have evolved to pay particular attention to certain aspects of colorization, with human faces being one of them. We improve the previous state-of-the-art (SOTA) DeOldify by an average of 13% on the standard metrics of PSNR, SSIM, FID, and FVD on the Grid and Lombard Grid datasets. Our user study also consolidates these results where FRCol was preferred to contemporary colorizers 81% of the time.
Download

Paper Nr: 320
Title:

Environment Setup and Model Benchmark of the MuFoRa Dataset

Authors:

Islam Fadl, Torsten Schön, Valentino Behret, Thomas Brandmeier, Frank Palme and Thomas Helmer

Abstract: Adverse meteorological conditions, particularly fog and rain, present significant challenges to computer vision algorithms and autonomous systems. This work presents MuFoRa, a novel, controllable, and measured multimodal dataset recorded at CARISSMA’s indoor test facility, specifically designed to assess perceptual difficulties in foggy and rainy environments. The dataset bridges a research gap in public benchmarking datasets, where quantifiable weather parameters are lacking. The proposed dataset comprises synchronized data from two sensor modalities: RGB stereo cameras and LiDAR sensors, captured under varying intensities of fog and rain. The dataset incorporates synchronized meteorological annotations, such as visibility through fog and precipitation levels of rain, and the study contributes a detailed explanation of the diverse weather effects observed during data collection in the methods section. The dataset’s utility is demonstrated through a baseline evaluation example, assessing the performance degradation of state-of-the-art YOLO11 and DETR 2D object detection algorithms under controlled and quantifiable adverse weather conditions. The public release of the dataset (https://doi.org/10.5281/zenodo.14175611) facilitates various benchmarking and quantitative assessments of advanced multimodal computer vision and deep learning models under the challenging conditions of fog and rain.
Download

Paper Nr: 330
Title:

One-Shot Polarization-Based Material Classification with Optimal Illumination

Authors:

Miho Kurachi, Ryo Kawahara and Takahiro Okabe

Abstract: Image-based classification of surface materials is important for machine vision applications such as visual inspection. In this paper, we propose a novel method for one-shot per-pixel classification of raw materials on the basis of polarimetric features such as the degree of linear polarization (DoLP) and the angle of linear polarization (AoLP). It is known that the polarimetric feature depends not only on the intrinsic properties of surface materials but also on the directions and wavelengths of light sources. Accordingly, our proposed method jointly optimizes the non-negative light source intensities for feature extraction and the discriminant hyperplane in the feature space via margin maximization, so that the appearances of different materials are discriminative. We conducted a number of experiments using real images captured with a light stage, and show that our method using a single input image performs better than or comparably to existing methods that use a single or multiple input images.
Download

Paper Nr: 349
Title:

Skeleton-Based Bilateral Symmetry: Theoretical Concepts and Detection via Dynamic Programming

Authors:

Nikita Lomov, Oleg Seredin and Olesia Kushnir

Abstract: This study presents a formal definition and an algorithm for the detection of bilateral symmetry in flexible planar objects. We proposed to analyze the skeleton of a 2D figure and detect its symmetry using the automorphism of the original and reflected skeleton graphs enhanced with additional requirements. The axis of symmetry is formed by the skeleton edges invariant under reflection. We developed a dynamic programming algorithm that finds the optimal mapping of the half-edges of the skeleton considering their “duality”. We also implemented a symmetrization of the skeleton and original figure, making the detected axis the figure’s vertical axis of reflective symmetry. We showed that the optimized target value when searching for the skeletal mapping agrees well with the Jaccard similarity index.
Download

Paper Nr: 352
Title:

Simultaneous Optimization of Abnormality Discriminator and Illumination Conditions for Image Inspection of Textile Products

Authors:

Yuma Nishikawa, Fumihiko Sakaue and Jun Sato

Abstract: In this study, we propose a method to simultaneously learn and optimize the illumination conditions and the neural network used for anomaly inspection of textile products. In the inspection of abnormalities in industrial products such as textiles, the imaging environment, including the lighting, must be optimized, but this process is mostly done manually by trial and error. We show that highly accurate inspection of abnormalities can be achieved by using a display whose light source position and brightness can be easily changed, and by presenting a proof pattern suitable for the abnormalities on the display. Furthermore, we show how to simultaneously optimize the neural network and the illumination conditions used for such anomaly inspection, and demonstrate that the proposed method can appropriately detect anomalies in real captured images.
Download

Paper Nr: 370
Title:

Breast Cancer Image Classification Using Deep Learning and Test-Time Augmentation

Authors:

João Fernando Mari, Larissa Ferreira Rodrigues Moreira, Leandro Henrique Furtado Pinto Silva, Mauricio C. Escarpinati and André R. Backes

Abstract: Deep learning-based computer vision methods can improve diagnostic accuracy, efficiency, and productivity. While traditional approaches primarily apply Data Augmentation (DA) during the training phase, Test-Time Augmentation (TTA) offers a complementary strategy to improve the predictive capabilities of trained models without increasing training time. In this study, we propose a simple and effective TTA strategy to enhance the classification of histopathological images of breast cancer. After optimizing hyperparameters, we evaluated the TTA strategy across all magnifications of the BreakHis dataset using three deep learning architectures, trained with and without DA. We compared five sets of transformations and multiple prediction rounds. The proposed strategy significantly improved the mean accuracy across all magnifications, demonstrating its effectiveness in improving model performance.
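
Test-time augmentation of this kind is easy to sketch. The snippet below is a hypothetical example (not necessarily the transformation sets compared in the paper): the trained classifier is applied to a few flipped and rotated views of the same image and the softmax outputs are averaged.

import torch

@torch.no_grad()
def tta_predict(model, image):
    """image: (1, C, H, W). Returns class probabilities averaged over augmented views."""
    views = [
        image,
        torch.flip(image, dims=[3]),             # horizontal flip
        torch.flip(image, dims=[2]),             # vertical flip
        torch.rot90(image, k=1, dims=[2, 3]),    # 90-degree rotation
    ]
    probs = [torch.softmax(model(v), dim=1) for v in views]
    return torch.stack(probs).mean(dim=0)

# Dummy two-class model just to show the call pattern.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 2))
print(tta_predict(model, torch.randn(1, 3, 224, 224)).shape)    # torch.Size([1, 2])
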
Download

Paper Nr: 371
Title:

A Multifractal-Based Masked Auto-Encoder: An Application to Medical Images

Authors:

Joao Batista Florindo and Viviane de Moura

Abstract: Masked autoencoders (MAE) have shown great promise in medical image classification. However, the random masking strategy employed by traditional MAEs may overlook critical areas in medical images, where even subtle changes can indicate disease. To address this limitation, we propose a novel approach that utilizes a multifractal measure (Renyi entropy) to optimize the masking strategy. Our method, termed Multifractal-Optimized Masked Autoencoder (MO-MAE), employs a multifractal analysis to identify regions of high complexity and information content. By focusing the masking process on these areas, MO-MAE ensures that the model learns to reconstruct the most diagnostically relevant features. This approach is particularly beneficial for medical imaging, where fine-grained inspection of tissue structures is crucial for accurate diagnosis. We evaluate MO-MAE on several medical datasets covering various diseases, including MedMNIST and COVID-CT. Our results demonstrate that MO-MAE achieves promising performance, surpassing other baseline and state-of-the-art models. The proposed method also adds minimal computational overhead as the computation of the proposed measure is straightforward. Our findings suggest that the multifractal-optimized masking strategy enhances the model’s ability to capture and reconstruct complex tissue structures, leading to more accurate and efficient medical image representation. The proposed MO-MAE framework offers a promising direction for improving the accuracy and efficiency of deep learning models in medical image analysis, potentially advancing the field of computer-aided diagnosis.
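
The entropy-guided masking idea can be illustrated with a short sketch. This is a hypothetical example, not the authors' implementation: a Renyi entropy of order alpha is computed from each patch's intensity histogram, and the highest-scoring (most complex) patches are the ones selected for masking; patch size, alpha, bin count, and masking ratio are assumed values.

import numpy as np

def renyi_entropy(patch, alpha=2.0, bins=32):
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def select_patches_to_mask(image, patch=16, ratio=0.5):
    """Return the (y, x) origins of the highest-entropy patches to be masked."""
    h, w = image.shape
    scores = {(y, x): renyi_entropy(image[y:y + patch, x:x + patch])
              for y in range(0, h, patch) for x in range(0, w, patch)}
    k = int(len(scores) * ratio)
    return sorted(scores, key=scores.get, reverse=True)[:k]

img = np.random.rand(64, 64)
print(len(select_patches_to_mask(img)))     # 8 of the 16 patches are selected for masking
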
Download

Paper Nr: 380
Title:

Image Compositing Is all You Need for Data Augmentation

Authors:

Ang Jia Ning Shermaine, Michalis Lazarou and Tania Stathaki

Abstract: This paper investigates the impact of various data augmentation techniques on the performance of object detection models. Specifically, we explore classical augmentation methods, image compositing, and advanced generative models such as Stable Diffusion XL and ControlNet. The objective of this work is to enhance model robustness and improve detection accuracy, particularly when working with limited annotated data. Using YOLOv8, we fine-tune the model on a custom dataset consisting of commercial and military aircraft, applying different augmentation strategies. Our experiments show that image compositing offers the highest improvement in detection performance, as measured by precision, recall, and mean Average Precision (mAP@0.50). Other methods, including Stable Diffusion XL and ControlNet, also demonstrate significant gains, highlighting the potential of advanced data augmentation techniques for object detection tasks. The results underline the importance of dataset diversity and augmentation in achieving better generalization and performance in real-world applications. Future work will explore the integration of semi-supervised learning methods and further optimizations to enhance model performance across larger and more complex datasets.
Download

Paper Nr: 389
Title:

Multi-Modal Multi-View Perception Feature Tracking for Handover Human Robot Interaction Applications

Authors:

Chaitanya Bandi and Ulrike Thomas

Abstract: Object handover is a fundamental task in human-robot interaction (HRI) that relies on robust perception features such as hand pose estimation, object pose estimation, and human pose estimation. While human pose estimation has been extensively researched, this work focuses on creating a comprehensive architecture to track and analyze hand and object poses, thereby enabling effective handover state determination. We propose an end-to-end architecture that integrates unified hand-object pose estimation with hand pose tracking, leveraging an early and efficient fusion of RGB and depth modalities. Our method incorporates existing state-of-the-art techniques for human pose estimation and introduces novel advancements for hand-object pose estimation. The architecture is evaluated on three large-scale open-source datasets, demonstrating state-of-the-art performance in unified hand-object pose estimation. Finally, we implement our approach in a human-robot interaction scenario to determine the handover state by extracting and tracking the necessary perception features. This integration highlights the potential of the proposed system for enhancing collaboration in HRI applications.
Download

Paper Nr: 401
Title:

Automated Performance Metrics for Objective Surgical Skill Assessment in Laparoscopic Training

Authors:

Asaf Arad, Julia Leyva I. Torres, Kristian Nyborg Jespersen, Nicolaj Boelt Pedersen, Pablo Rey Valiente, Alaa El-Hussuna and Andreas Møgelmose

Abstract: The assessment of surgical skill is critical in advancing surgical training and enhancing the performance of surgeons. Traditional evaluation methods relying on human observation and checklists are often biased and inefficient, prompting the need for automated and objective systems. This study explores the use of Automated Performance Metrics (APMs) in laparoscopic surgeries, using video-based data and advanced object tracking techniques. A pipeline was developed, combining a fine-tuned YOLO11 model for detection with state-of-the-art multi-object trackers (MOTs) for tracking surgical tools. Metrics such as path length, velocity, acceleration, jerk, and working area were calculated to assess technical performance. BoT-SORT emerged as the most effective tracker, achieving the highest HOTA and MOTA, enabling robust tool tracking. The system successfully extracted APMs to evaluate and compare surgical performance, demonstrating its potential for objective assessment. This work validates state-of-the-art algorithms for surgical video analysis, contributing to improved surgical training and performance evaluation. Future efforts should address limitations like pixel-based measurements and dataset variability to enhance the system’s accuracy and applicability, ultimately advancing patient safety and reducing training costs.
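
Once a tool trajectory has been tracked, the kinematic metrics listed above are straightforward finite differences. The sketch below is a simplified, hypothetical version: positions are assumed to be pixel coordinates at a known frame rate, and the working area is approximated by the bounding box of the track.

import numpy as np

def motion_metrics(positions, fps=30.0):
    """positions: (T, 2) array of tracked (x, y) tool positions over T frames."""
    dt = 1.0 / fps
    steps = np.diff(positions, axis=0)
    velocity = steps / dt
    acceleration = np.diff(velocity, axis=0) / dt
    jerk = np.diff(acceleration, axis=0) / dt
    return {
        "path_length": float(np.linalg.norm(steps, axis=1).sum()),
        "mean_speed": float(np.linalg.norm(velocity, axis=1).mean()),
        "mean_acceleration": float(np.linalg.norm(acceleration, axis=1).mean()),
        "mean_jerk": float(np.linalg.norm(jerk, axis=1).mean()),
        "working_area": float(np.ptp(positions[:, 0]) * np.ptp(positions[:, 1])),
    }

track = np.cumsum(np.random.randn(300, 2), axis=0)     # a synthetic 10-second track at 30 fps
print(motion_metrics(track))
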
Download

Paper Nr: 404
Title:

Layerwise Image Vectorization via Bayesian-Optimized Contour

Authors:

Ghfran Jabour, Sergey Muravyov and Valeria Efimova

Abstract: This work presents a novel method, LIVBOC, for complex image vectorization that addresses key challenges in path initialization, color assignment, and optimization. Unlike existing approaches such as LIVE, our method generates a Bayesian-optimized contour for path initialization, which is then optimized using a customized loss function to align it better with the target shape in the image. Our method enables adaptive selection of points and parameters for efficient and accurate vectorization, reducing unnecessary iterations and computational overhead. LIVBOC achieves superior reconstruction fidelity with fewer paths, owing to the path initialization technique, which initializes paths as contours that approximate target shapes in the image, reducing redundancy in points and paths. The experimental evaluation indicates that LIVBOC outperforms LIVE in all key metrics, including a significant reduction in L2 loss, processing time, and file size. LIVBOC achieves comparable results with just 100 iterations, compared to LIVE’s 500 iterations, while preserving finer details and generating smoother, more coherent paths. These improvements make LIVBOC more suitable for applications that require scalable, compact vector graphics and computational efficiency. By achieving both accuracy and efficiency, LIVBOC offers a robust new alternative for image vectorization tasks. The LIVBOC code is available at https://github.com/CTLab-ITMO/LIVBOC.
Download

Paper Nr: 410
Title:

Handwriting Trajectory Recovery of Latin Characters with Deep Learning: A Novel Approach Exploring the Number of Points per Character and a New Evaluation Method

Authors:

Simone Bello Kaminski Aires, Erikson Freitas de Morais and Yu Han Lin

Abstract: The research on handwriting trajectory recovery (HTR) has gained prominence in offline handwriting recognition by utilizing online recognition resources to simulate writing patterns. Traditional approaches commonly employ graph-based methods that skeletonize characters to trace their paths, while recent studies have focused on deep learning techniques due to their superior generalization capabilities. However, despite promising results, the absence of standardized evaluation metrics limits meaningful comparisons across studies. This work presents a novel approach to recovering handwriting trajectories of Latin characters using deep learning networks, coupled with a standardized evaluation framework. The proposed evaluation model quantitatively and qualitatively assesses the recovery of stroke sequences and character geometry, providing a consistent basis for comparison. Experimental results demonstrate the significant influence of the number of coordinate points per character on deep learning performance, offering valuable insights into optimizing both evaluation and recovery rates. This study provides a practical solution for enhancing HTR accuracy and establishing a standardized evaluation methodology.
Download

Paper Nr: 413
Title:

Exploring Machine Learning and Remote Sensing Techniques for Mapping Pinus Invasion Beyond Crop Areas

Authors:

Andrey Naligatski Dias, Maria Eduarda Guedes Pinto Gianisella, Amanda Dos Santos Gonçalves, Rodrigo Minetto and Mauren Louise Sguario Coelho de Andrade

Abstract: The spread of exotic tree species of the genus Pinus (Pinus spp.) has been increasing over the years in the Ponta Grossa region and other areas of southern Brazil, making its monitoring necessary. This study proposes to monitor this spread using deep neural networks trained on satellite images from the Campos Gerais region. For this task, three deep neural network models focused on pixel-by-pixel classification were employed: U-Net, SegNet, and FCN (Fully Convolutional Network). These models were trained on a dataset containing 34 images with a resolution of 2048x2048 pixels, obtained from Google Earth satellites. All images were downloaded using the QuickMapServices extension available in QGIS, and labeled using the same program. Promising results suggest that the U-Net model outperformed the others, achieving 82.49% accuracy, 69.62% Jaccard index, 41.19% recall, and 78.47% precision. The SegNet model showed good accuracy at 82.84%, but underperformed on the Jaccard index at 45.93%, with 58.34% recall and 68.35% precision. Meanwhile, the FCN model produced the least reliable results of the three, with 79.37% accuracy, 29.17% Jaccard index, 34% recall, and 67.21% precision.
Download

Paper Nr: 419
Title:

Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics

Authors:

Georgii Gotin, Ekaterina Shumitskaya, Anastasia Antsiferova and Dmitriy Vatolin

Abstract: Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across different modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The addition of the CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known for its effective capture of low-level semantics. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, highlighting its superiority in terms of attack success within the same number of iterations and levels of attack strength. We believe that the proposed method will contribute to the deeper analysis of robust VQA metrics.
Download

Paper Nr: 423
Title:

OrthoCNN: Mitigating Adversarial Noise in Convolutional Neural Networks via Orthogonal Projections

Authors:

Aristeidis Bifis and Emmanouil Psarakis

Abstract: Adversarial training is the standard method for improving the robustness of neural networks against adversarial attacks. However, a well-known trade-off exists: while adversarial training increases resilience to perturbations, it often results in a significant reduction in accuracy on clean (unperturbed) data. This compromise leads to models that are more resistant to adversarial attacks but less effective on natural inputs. In this paper, we introduce an extension to adversarial training by applying novel constraints on convolutional layers, that address this trade-off. Specifically, we use orthogonal projections to decompose the learned features into clean signal and adversarial noise, projecting them onto the range and null spaces of the network’s weight matrices. These constraints improve the separation of adversarial noise from useful signals during training, enhancing robustness while preserving the same performance on clean data as adversarial training. Our approach achieves significant improvements in robust accuracy while maintaining comparable clean accuracy, providing a balanced and effective adversarial defense strategy.
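
The range/null-space decomposition at the heart of this idea can be sketched directly. The snippet below is an illustrative construction using an explicit pseudo-inverse, not the paper's training procedure: for a flattened weight matrix W, the projector onto its row space keeps the component the layer responds to, while the complementary projector isolates the null-space component, which is invisible to the layer.

import torch

def row_space_projectors(W):
    """W: (out, in). Returns projectors onto the row space of W and its orthogonal complement."""
    P = torch.linalg.pinv(W) @ W              # (in, in) projector onto the row space
    return P, torch.eye(W.shape[1]) - P

W = torch.randn(8, 32)                          # a wide layer: 32-d input, 8-d output
P_range, P_null = row_space_projectors(W)
x = torch.randn(32)
x_signal, x_noise = P_range @ x, P_null @ x
print(torch.allclose(W @ x, W @ x_signal, atol=1e-4))    # True: the layer only sees x_signal
print(float(torch.norm(W @ x_noise)) < 1e-4)             # True: the null-space part vanishes
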
Download

Paper Nr: 17
Title:

Thermal Image Super-Resolution Using Real-ESRGAN for Human Detection

Authors:

Vinícius H. G. Correa, Peter Funk, Nils Sundelius, Rickard Sohlberg, Mastura Ab Wahid and Alexandre C. B. Ramos

Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly crucial in Search and Rescue (SAR) operations due to their ability to enhance efficiency and reduce costs. Search and Rescue is a vital activity as it directly impacts the preservation of life and safety in critical situations, such as locating and rescuing individuals in perilous or remote environments. However, the effectiveness of these operations heavily depends on the quality of sensor data for accurate target detection. This study investigates the application of the Real Enhanced Super-Resolution Generative Adversarial Networks (Real-ESRGAN) algorithm to enhance the resolution and detail of infrared images captured by UAV sensors. By improving image quality through super-resolution, we then assess the performance of the YOLOv8 target detection algorithm on these enhanced images. Preliminary results indicate that Real-ESRGAN significantly improves the quality of low-resolution infrared data, even when using pre-trained models not specifically tailored to our dataset. This highlights the considerable potential of applying the algorithm in the preprocessing stage of UAV imagery for search and rescue operations.
Download

Paper Nr: 26
Title:

Customized Atrous Spatial Pyramid Pooling with Joint Convolutions for Urban Tree Segmentation

Authors:

Danilo Samuel Jodas, Giuliana Del Nero Velasco, Sergio Brazolin, Reinaldo Araujo de Lima, Leandro Aparecido Passos and João Paulo Papa

Abstract: Urban trees provide several benefits to the cities, including local climatic regulation and better life quality. Assessing the tree conditions is essential to gather important insights related to its biomechanics and the possible risk of falling. The common strategy is ruled by fieldwork campaigns to collect the tree’s physical measures like height, the trunk’s diameter, and canopy metrics for a first-glance assessment and further prediction of the possible risk to the city’s infrastructure. The canopy and trunk of the tree play an important role in the resistance analysis when exposed to severe windstorm events. However, fieldwork analysis is laborious and time-expensive because of the massive number of trees. Therefore, strategies based on computational analysis are highly demanded to promote a rapid assessment of tree conditions. This paper presents a deep learning-based approach for semantic segmentation of the trunk and canopy of trees in images acquired from the street-view perspective. The proposed strategy combines convolutional modules, spatial pyramid pooling, and attention mechanism into a U-Net-based architecture to improve the prediction capacity. Experiments performed over two image datasets showed the proposed model attained competitive results compared to previous works employing large-sized semantic segmentation models.
Download

Paper Nr: 90
Title:

Making Real Estate Walkthrough Videos Interactive

Authors:

Mathijs Lens, Floris De Feyter and Toon Goedemé

Abstract: This paper presents an automated system designed to streamline the creation of interactive real estate video tours. These virtual walkthrough tours allow potential buyers to explore properties by skipping or focusing on rooms of interest, enhancing the decision-making process. However, the current manual method for producing these tours is costly and time-consuming. We propose a system that automates key aspects of the walkthrough video creation process, including the identification of room transitions and room label extraction. Our proposed system utilizes transformer-based video segmentation, addressing challenges such as the lack of clear visual boundaries between open-plan rooms and the difficulty of classifying rooms in unfurnished properties. We demonstrate in an ablation study that the combined usage of ResNet frame embeddings, and a transformer-based temporal postprocessing that uses a separately trained doorway detection network as extra input yields the best results for room segmentation and classification. This method improves the edit score by +35% compared to frame-by-frame classification. All experiments are performed on a large real-life dataset of 839 walkthrough videos.
Download

Paper Nr: 146
Title:

Transferability of Labels Between Multilens Cameras

Authors:

Ignacio de Loyola Páez-Ubieta, Daniel Frau-Alfaro and Santiago T. Puente

Abstract: In this work, a novel approach for the automated transfer of Bounding Box (BB) and mask labels across different channels on multilens cameras is presented. For that purpose, the proposed method combines the well-known phase correlation method with a refinement process. In the initial step, images are aligned by localising the peak of intensity obtained in the spatial domain after performing the cross-correlation process in the frequency domain. The second step consists of obtaining the optimal transformation through an iterative process that maximises the IoU (Intersection over Union) metric. The results show that the proposed method enables the transfer of labels across different lenses on a camera with an accuracy of over 90% in the majority of cases, with a processing time of just 65 ms. Once the transformations have been obtained, artificial RGB images are generated for labelling purposes, with the objective of transferring this information into each of the other lenses. This work will facilitate the use of this type of camera in a wider range of fields, beyond those of satellite or medical imagery, thereby enabling the labelling of even invisible objects in the visible spectrum.
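
The phase-correlation step can be shown in a few lines. This is a minimal, hypothetical sketch of that first alignment stage only (the subsequent IoU-maximizing refinement is not shown): the normalized cross-power spectrum of two images is inverted, and the peak location gives the translation between them.

import numpy as np

def phase_correlation_shift(a, b):
    """Return the integer (dy, dx) translation that aligns image a to image b."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross_power = Fa * np.conj(Fb)
    cross_power /= np.abs(cross_power) + 1e-8          # keep the phase only
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    if dy > a.shape[0] // 2: dy -= a.shape[0]          # wrap large offsets to negative shifts
    if dx > a.shape[1] // 2: dx -= a.shape[1]
    return int(dy), int(dx)

img = np.random.rand(128, 128)
shifted = np.roll(img, shift=(5, -7), axis=(0, 1))
print(phase_correlation_shift(shifted, img))            # (5, -7)
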
Download

Paper Nr: 165
Title:

Combining Supervised Ground Level Learning and Aerial Unsupervised Learning for Efficient Urban Semantic Segmentation

Authors:

Youssef Bouaziz, Eric Royer and Achref Elouni

Abstract: Semantic segmentation of aerial imagery is crucial for applications in urban planning, environmental monitoring, and autonomous navigation. However, it remains challenging due to limited annotated data, occlusions, and varied perspectives. We present a novel framework that combines 2D semantic segmentation with 3D point cloud data using a graph-based label propagation technique. By diffusing semantic information from 2D images to 3D points with pixel-to-point and point-to-point connections, our approach ensures consistency between 2D and 3D segmentations. We validate its effectiveness on urban imagery, accurately segmenting moving objects, structures, roads, and vegetation, and thereby overcoming the limitations of scarce annotated datasets. This hybrid method holds significant potential for large-scale, detailed segmentation of aerial imagery in urban development, environmental assessment, and infrastructure management.
Download

Paper Nr: 168
Title:

Anomalous Water Dataset Captured by Hyperspectral Cameras

Authors:

Youta Noboru and Yuko Ozasa

Abstract: This paper proposes a hyperspectral dataset designed for detecting anomalies in water caused by the mixing of colorless and transparent anomalous liquids. Detecting such anomalous substances, particularly when they are transparent, is crucial for public health and environmental safety, as conventional methods are often inadequate. Hyperspectral imaging captures subtle spectral differences, enabling the identification of materials that are visually indistinguishable. The dataset aims to support the development of unsupervised learning models that can detect anomalous substances in water using only spectral data. We have made this dataset publicly available (https://github.com/033labcodes/visapp25\ Anomalous-Water-Dataset) to facilitate further research in this area.
Download

Paper Nr: 181
Title:

Multi-Object Keypoint Detection and Pose Estimation for Pigs

Authors:

Qinghua Guo, Dawei Pei, Yue Sun, Patrick P. J. H. Langenhuizen, Clémence A. E. M. Orsini, Kristine Hov Martinsen, Øyvind Nordbø, J. Elizabeth Bolhuis, Piter Bijma and Peter H. N. de With

Abstract: Monitoring the daily status of pigs is crucial for enhancing their health and welfare. Pose estimation has emerged as an effective method for tracking pig postures, with keypoint detection and skeleton extraction playing pivotal roles in this process. Despite advancements in human pose estimation, there is limited research focused on pigs. To bridge this gap, this study applies the You Only Look Once model Version 8 (YOLOv8) for keypoint detection and skeleton extraction, evaluated on a manually annotated pig dataset. Additionally, the performance of pose estimation is compared across different data modalities and models, including an image-based model (ResNet-18), a keypoint-based model (Multi-Layer Perceptron, MLP), and a combined image-and-keypoint-based model (YOLOv8-pose). The keypoint detection branch achieves an average Percentage of Detected Joints (PDJ) of 48.96%, an average Percentage of Correct Keypoints (PCK) of 84.85%, and an average Object Keypoint Similarity (OKS) of 89.43%. The best overall accuracy obtained for pose estimation is 99.33% by the YOLOv8-pose model, which indicates the superiority of the joint image-keypoint-based model for pose estimation. The conducted comprehensive experiments and visualization results indicate that the proposed method effectively identifies specific pig body parts in most monitoring frames, facilitating an accurate assessment of pig activity and welfare.
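
The PCK figure quoted above is a simple thresholded distance, as the hypothetical sketch below shows; normalizing by the bounding-box diagonal and the 0.1 tolerance are assumed choices for illustration.

import numpy as np

def pck(pred, gt, bbox_diag, alpha=0.1):
    """pred, gt: (K, 2) keypoint coordinates for one animal instance."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists < alpha * bbox_diag))     # fraction of keypoints within tolerance

gt = np.array([[10.0, 10.0], [50.0, 40.0], [80.0, 90.0]])
pred = gt + np.array([[1.0, 1.0], [3.0, -2.0], [30.0, 0.0]])
print(pck(pred, gt, bbox_diag=100.0))                    # 2 of 3 keypoints within 10 px -> ~0.67
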
Download

Paper Nr: 187
Title:

Comparison Between CNN and GNN Pipelines for Analysing the Brain in Development

Authors:

Antoine Bourlier, Elodie Chaillou, Jean-Yves Ramel and Mohamed Slimane

Abstract: In this study, we present a new pipeline designed for the analysis and comparison of non-conventional animal brain models, such as sheep, without relying on neuroanatomical priors. This innovative approach combines automatic MRI segmentation with graph neural networks (GNNs) to overcome the limitations of traditional methods. Conventional tools often depend on predefined anatomical atlases and are typically limited in their ability to adapt to the unique characteristics of developing brains or non-conventional animal models. By generating regions of interest directly from MR images and constructing a graph representation of the brain, our method eliminates biases associated with predefined templates. Our results show that the GNN-based pipeline achieves higher accuracy on an age prediction task (63.22%) than a classical CNN architecture (59.77%). GNNs offer notable advantages, including improved interpretability and the ability to model complex relational structures within brain data. Overall, our approach provides a promising solution for unbiased, adaptable, and interpretable analysis of brain MRIs, particularly for developing brains and non-conventional animal models.
Download

Paper Nr: 191
Title:

ReactSR: Efficient Real-World Super-Resolution Application in a Single Floppy Disk

Authors:

Gleb S. Brykin and Valeria Efimova

Abstract: Image super-resolution methods are increasingly split into groups that pursue different goals, which makes them difficult to use under real-world conditions. While some methods maximize the accuracy of detail reconstruction and minimize model complexity at the cost of realism, others use heavy architectures to achieve realistic images. In this paper, we propose a new class of image super-resolution methods, called efficient real super-resolution, which occupies the gap between efficient and real super-resolution methods. The main goal of our work is to show that compact super-resolution models can generate realistic images, like the SOTA in real super-resolution, while requiring only a few parameters and modest computing resources. We compare our models with the SOTA qualitatively and quantitatively using the NIQE and LPIPS image naturalness metrics, obtaining clearly positive results. We also offer a self-contained, cross-platform application that generates images comparable to the SOTA in terms of realism in acceptable time and fits entirely on one 3.5-inch floppy disk.
Download

Paper Nr: 209
Title:

GAM-UNet for Semantic Segmentation

Authors:

Rahma Aloui, Pranav Martini, Pandu Devarakota, Apurva Gala and Shishir K. Shah

Abstract: Accurate delineation of critical features, such as salt boundaries in seismic imaging and fine structures in medical images, is essential for effective analysis and decision-making. Traditional convolutional neural networks (CNNs) often face difficulties in handling complex data due to variations in scale, orientation, and noise. These limitations become particularly evident during the transition from proof-of-concept to real-world deployment, where models must perform consistently under diverse conditions. To address these challenges, we propose GAM-UNet, an advanced segmentation architecture that integrates learnable Gabor filters for enhanced edge detection, SCSE blocks for feature refinement, and multi-scale fusion within the U-Net framework. This approach improves feature extraction across varying scales and orientations. Trained using a combined Binary Cross-Entropy and Dice loss function, GAM-UNet demonstrates superior segmentation accuracy and continuity, outperforming existing U-Net variants across diverse datasets.
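Illustrative note: the combined Binary Cross-Entropy and Dice objective mentioned above is a standard formulation for binary segmentation. The PyTorch sketch below is an editorial illustration under assumed settings (equal weighting, smoothing constant of 1.0); it is not the paper's exact loss configuration.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, bce_weight=0.5, eps=1.0):
    """Combined BCE + Dice loss for binary segmentation.

    logits : raw network outputs, shape (B, 1, H, W)
    target : binary ground-truth masks, same shape
    """
    bce = F.binary_cross_entropy_with_logits(logits, target)

    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

    return bce_weight * bce + (1.0 - bce_weight) * dice

# toy usage
logits = torch.randn(2, 1, 64, 64, requires_grad=True)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
loss = bce_dice_loss(logits, target)
loss.backward()
```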
Download

Paper Nr: 237
Title:

Canine Action Recognition: Exploring Keypoint and Non-Keypoint Approaches Enhanced by Synthetic Data

Authors:

Barbora Bezáková and Zuzana Berger Haladová

Abstract: This study focuses on the implementation of deep neural networks capable of recognizing actions from dog photographs. The objective is to implement and compare two approaches. The first approach uses pose estimation, where keypoints and their positions in photographs are analyzed to recognize the performed action. The second approach recognizes actions in photographs without the need for pose estimation. The image dataset was created using generative models and augmented to increase variability. Results show that combining synthetic and real data effectively addresses the limited availability of annotated datasets in the field of dog action recognition. We demonstrate that integrating artificially generated data into the training process can lead to effective results when tested on real-world photographs.
Download

Paper Nr: 245
Title:

SBC-UNet3+: Classification of Nuclei in Histology Imaging Based on a Multi-Branch UNET3+ Segmentation Model

Authors:

Roua Jaafar, Hedi Yazid, Wissem Farhat and Najoua Essoukri Ben Amara

Abstract: Histological images are crucial for cancer diagnosis and treatment, providing valuable information about cellular structures and abnormalities. Deep learning has emerged as a promising tool to automate the analysis of histological images, especially for tasks like cell segmentation and classification, which aim to improve cancer detection efficiency and accuracy. Existing methods show promising results in segmentation and classification but are limited in handling overlapping nuclei and boundary delineation. We propose a cell segmentation and classification approach applied to histological images, as part of a Content-Based Histopathological Image Retrieval (CBHIR) project. By integrating boundary detection and classification-guided modules, our approach overcomes the limitations of existing methods, enhancing segmentation precision and robustness. Our approach leverages deep learning models and the UNET3+ architecture, comparing its performance with state-of-the-art methods on the PanNuke Dataset (Gamper et al., 2020). Our multitask approach outperforms current models in F1-score and recall, demonstrating its potential for accurate and efficient cancer diagnosis.
Download

Paper Nr: 263
Title:

Homography VAE: Automatic Bird’s Eye View Image Reconstruction from Multi-Perspective Views

Authors:

Keisuke Toida, Naoki Kato, Osamu Segawa, Takeshi Nakamura and Kazuhiro Hotta

Abstract: We propose Homography VAE, a novel architecture that combines Variational AutoEncoders with homography transformation for unsupervised standardized-view image reconstruction. By incorporating coordinate transformation into the VAE framework, our model decomposes the latent space into feature and transformation components, enabling the generation of consistent standardized views from multi-viewpoint images without explicit supervision. The effectiveness of our approach is demonstrated through experiments on the MNIST and GRID datasets, where standardized reconstructions show significantly improved consistency across all evaluation metrics. For the MNIST dataset, the cosine similarity among standardized views reaches 0.66, while original and transformed views show 0.29 and 0.37, respectively. The number of PCA components required to explain 95% of the variance decreases from 193.5 to 33.2, indicating more consistent representations. Even more pronounced improvements are observed on the GRID dataset, where standardized views achieve a cosine similarity of 0.92 and require only 7 PCA components compared to 167 for the original images. Furthermore, the first principal component of the standardized views explains 71% of the total variance, suggesting highly consistent geometric patterns. These results validate that Homography VAE successfully learns to generate consistent standardized-view representations from various viewpoints without requiring ground truth homography matrices.
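As background, a planar homography maps pixels between a perspective view and a standardized (e.g. top-down) view. The OpenCV sketch below only illustrates how a 3x3 homography is applied to an image and to points; it is not the VAE described above, and the matrix values and file name are placeholders.

```python
import cv2
import numpy as np

# A 3x3 homography mapping the input view to a canonical frame. In the paper
# this transformation is inferred by the network; here it is hard-coded.
H = np.array([[1.1,  0.05, -20.0],
              [0.02, 1.2,  -35.0],
              [1e-4, 2e-4,   1.0]], dtype=np.float64)

img = cv2.imread("view.png")                       # placeholder input image
warped = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))

# The same matrix maps individual points (homogeneous normalisation included).
pts = np.array([[[100.0, 150.0]], [[320.0, 240.0]]], dtype=np.float32)
pts_std = cv2.perspectiveTransform(pts, H)
print(pts_std.squeeze())
```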
Download

Paper Nr: 265
Title:

Segmentation of Intraoperative Glioblastoma Hyperspectral Images Using Self-Supervised U-Net++

Authors:

Marco Gazzoni, Marco La Salvia, Emanuele Torti, Elisa Marenzi, Raquel Leon, Samuel Ortega, Beatriz Martinez, Himar Fabelo, Gustavo Callicò and Francesco Leporati

Abstract: Brain tumour resection poses many challenges for neurosurgeons, and even though histopathological analysis can help achieve complete tumour elimination, it is often not feasible because of the time and tissue required for margin inspection. This paper presents a novel attention-based self-supervised methodology to improve current research on medical hyperspectral imaging as a tool for computer-aided diagnosis. We designed a novel architecture comprising U-Net++ and an attention mechanism on the spectral domain, trained in a self-supervised framework to exploit contrastive learning and overcome the dataset-size problems arising in medical scenarios. We used fifteen hyperspectral images from the publicly available HELICoiD dataset. Enhanced by extensive data augmentation, transfer learning and self-supervision, we measured accuracy, specificity and recall values above 90% in the automatic end-to-end segmentation of intraoperative glioblastoma hyperspectral images. We evaluated our outcomes against the ground truths produced by the HELICoiD project, obtaining results comparable to the gold-standard procedure.
Download

Paper Nr: 298
Title:

QR Code Detection with Perspective Correction and Decoding in Real-World Conditions Using Deep Learning and Enhanced Image Processing

Authors:

David Joshua Corpuz, Lance Victor Del Rosario, Jonathan Paul Cempron, Paulo Luis Lozano and Joel Ilao

Abstract: QR codes have become a vital tool across various industries, facilitating data storage and accessibility in compact, scannable formats. However, real-world environmental challenges, including lighting variability, perspective distortions, and physical obstructions, often impair traditional QR code readers such as those included in OpenCV and ZBar, which require precise alignment and full code visibility. This study presents an adaptable QR code detection and decoding system, leveraging the YOLO deep learning model combined with advanced image processing techniques, to overcome these limitations. By incorporating edge detection, perspective transformation, and adaptive decoding, the proposed method achieves robust QR code detection and decoding across a range of challenging scenarios, including tilted angles, partial obstructions, and low lighting. Evaluation results demonstrate significant improvements over traditional readers, with enhanced accuracy and reliability in identifying and decoding QR codes under complex conditions. These findings support the system’s application potential in sectors with high demands for dependable QR code decoding, such as logistics and automated inventory tracking. Future work will focus on optimizing processing speed, extending multi-code detection capabilities, and refining the method’s performance across diverse environmental contexts.
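For context, the OpenCV baseline that such systems are compared against can be exercised in a few lines, combined with a manual perspective correction of the kind the paper automates. This is an editorial sketch, not the proposed YOLO-based pipeline; the file name and rectification size are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("qr_photo.jpg")                   # placeholder input image
detector = cv2.QRCodeDetector()

# detectAndDecode returns the payload, the 4 corner points, and the rectified code.
data, corners, rectified = detector.detectAndDecode(img)

if data:
    print("decoded:", data)
elif corners is not None:
    # Code located but not decoded: try a manual perspective correction.
    side = 300
    src = corners.reshape(4, 2).astype(np.float32)
    dst = np.float32([[0, 0], [side, 0], [side, side], [0, side]])
    M = cv2.getPerspectiveTransform(src, dst)
    flat = cv2.warpPerspective(img, M, (side, side))
    print("re-decoded:", detector.detectAndDecode(flat)[0])
else:
    print("no QR code found")
```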
Download

Paper Nr: 377
Title:

A Survey on Feature-Based and Deep Image Stitching

Authors:

Sima Soltanpour and Chris Joslin

Abstract: Image stitching is the process of merging multiple images with overlapping parts to generate a wide-view image. Image stitching has many applications in a variety of fields, such as 360-degree cameras, virtual reality, photography, sports broadcasting, video surveillance, street view, and entertainment. Image stitching methods are divided into feature-based and deep learning algorithms. Feature-based stitching methods rely heavily on accurate localization and distribution of hand-crafted features. One of the main challenges related to these methods is handling parallax. In this survey, we categorize feature-based methods in terms of parallax tolerance, an aspect which has not been covered in existing survey papers. Moreover, considerable research effort has been dedicated to applying deep learning methods to image stitching. We therefore also comprehensively review and compare the different types of deep learning methods for image stitching and categorize them into three groups: deep homography, deep features, and deep end-to-end frameworks.
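A minimal feature-based stitching pipeline of the kind surveyed here (hand-crafted feature matching, RANSAC homography estimation, warp-and-paste) can be sketched with OpenCV. File names and parameters are placeholders; note that the single global homography in step 2 is exactly why parallax is a challenge for this family of methods.

```python
import cv2
import numpy as np

left = cv2.imread("left.jpg")                        # placeholder inputs
right = cv2.imread("right.jpg")

# 1) hand-crafted features and descriptor matching
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(left, None)
kp2, des2 = sift.detectAndCompute(right, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # Lowe ratio test

# 2) robust homography estimation (one global model for the whole scene)
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# 3) warp the left image into the right image's frame and paste
h, w = right.shape[:2]
pano = cv2.warpPerspective(left, H, (w * 2, h))
pano[:h, :w] = right
cv2.imwrite("panorama.jpg", pano)
```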
Download

Paper Nr: 395
Title:

Advanced Vision Techniques in Soccer Match Analysis: From Detection to Classification

Authors:

Jakub Eichner, Jan Nowak, Bartłomiej Grzelak, Tomasz Górecki, Tomasz Piłka and Krzysztof Dyczkowski

Abstract: This paper introduces an integrated pipeline for detecting, classifying, and tracking key objects within soccer match footage. Our research uses datasets from KKS Lech Poznań, SoccerDB, and SoccerNet, considering various stadium environments and technical conditions, such as equipment quality and recording clarity. These factors mirror the real-world scenarios encountered in competitions, training sessions, and observations. We assessed the effectiveness of cutting-edge object detection models, focusing on several R-CNN frameworks and the YOLOv8 methodology. Additionally, for assigning players to their respective teams, we compared the performance of the K-means algorithm with that of the Multi-Modal Vision Transformer CogVLM model. Despite challenges like suboptimal video resolution and fluctuating weather conditions, our proposed solutions have successfully demonstrated high precision in detecting and classifying key elements such as players and the ball within soccer match footage. These findings establish a robust basis for further video analysis in soccer, which could enhance tactical strategies and the automation of match summarization.
Download

Paper Nr: 398
Title:

Rethinking Deblurring Strategies for 3D Reconstruction: Joint Optimization vs. Modular Approaches

Authors:

Akash Malhotra, Nacéra Seghouani, Ahmad Abu Saiid, Alaa Almatuwa and Koumudi Ganepola

Abstract: In this paper, we present a comparison between joint optimization and modular frameworks for addressing deblurring in multiview 3D reconstruction. Casual captures, especially with handheld devices, often contain blurry images that degrade the quality of 3D reconstruction. Joint optimization frameworks tackle this issue by integrating deblurring and 3D reconstruction into a unified learning process, leveraging information from overlapping blurry images. While effective, these methods increase the complexity and training time. Conversely, modular approaches decouple deblurring from 3D reconstruction, enabling the use of stand-alone deblurring algorithms such as Richardson-Lucy, DeepRFT, and Restormer. In this study, we evaluate the trade-offs between these strategies in terms of reconstruction quality, computational complexity, and suitability for varying levels of blur. Our findings reveal that modular approaches are more effective for low to medium blur scenarios, while Deblur-NeRF, a joint optimization framework, excels at handling extreme blur when computational costs are not a constraint.
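The Richardson-Lucy deconvolution mentioned as one of the stand-alone, modular deblurring options is available off the shelf. The sketch below uses scikit-image; the Gaussian point spread function, iteration count, and file names are assumptions for illustration (the parameter is named `num_iter` in recent scikit-image releases).

```python
import numpy as np
from skimage import color, io
from skimage.restoration import richardson_lucy

# Placeholder blurry input frame; convert to float grayscale in [0, 1].
blurred = color.rgb2gray(io.imread("blurry_frame.png")).astype(np.float64)

# A small Gaussian point spread function as a stand-in for the true blur kernel.
x = np.arange(-7, 8)
g = np.exp(-x**2 / (2 * 2.0**2))
psf = np.outer(g, g)
psf /= psf.sum()

# Classic iterative deconvolution; in a modular pipeline the result would be
# passed on to the 3D reconstruction stage.
deblurred = richardson_lucy(blurred, psf, num_iter=30)
io.imsave("deblurred_frame.png", (deblurred * 255).astype(np.uint8))
```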
Download

Paper Nr: 407
Title:

Exploring Histopathological Image Augmentation Through StyleGAN2ADA: A Quantitative Analysis

Authors:

Glenda P. Train, Johanna E. Rogalsky, Sergio O. Ioshii, Paulo M. Azevedo-Marques and Lucas F. Oliveira

Abstract: Due to the rapid development of technology in the last decade, pathology has entered its digital era with the spread of whole-slide images (WSIs). With this improvement, providing reliable automated diagnoses has become highly desirable to reduce the time and effort experts spend on time-consuming and exhaustive tasks. However, given the scarcity of publicly labeled medical data and the imbalance between data classes, various data augmentation techniques are needed to mitigate these problems. This paper presents experiments that investigate the impact of adding synthetic IHC images on the classification of staining intensity levels of cancer cells with estrogen and progesterone biomarkers. We tested SVM, CNN, DenseNet, and ViT models, trained with and without images generated by StyleGAN2ADA and AutoAugment. The experiments covered class balancing and adding synthetic images to the training process, improving the classification F1-Score by up to 14 percentage points. In almost all experiments using StyleGAN2ADA images, the F1-Score was enhanced.
Download

Paper Nr: 408
Title:

Edge AI System for Real-Time and Explainable Forest Fire Detection Using Compressed Deep Learning Models

Authors:

Sidi Ahmed Mahmoudi, Maxime Gloesener, Mohamed Benkedadra and Jean-Sébastien Lerat

Abstract: Forests are vital natural resources but are highly vulnerable to disasters, both natural (e.g., lightning strikes) and human-induced. Early and automated detection of forest fire and smoke is critical for mitigating damage. The main challenge for this kind of application is to provide accurate, explainable, real-time and lightweight solutions that can be easily deployed by and for users such as firefighters. This paper presents an embedded and explainable artificial intelligence “Edge AI” system for real-time forest fire and smoke detection, using compressed Deep Learning (DL) models. Our model compression approach allowed us to provide lightweight models for Edge AI deployment. Experimental evaluation on a preprocessed dataset composed of 1500 images demonstrated a test accuracy of 98% with a lightweight model running in real time on a Jetson Xavier Edge AI resource. The compression methods preserved the same accuracy while accelerating computation (3× to 18× speedup), reducing memory consumption (3.8× to 10.6×), and reducing energy consumption (3.5× to 6.3×).
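Model compression for edge deployment can take several forms (pruning, quantization, distillation). As a hedged illustration unrelated to the authors' exact pipeline, the PyTorch sketch below applies post-training dynamic quantization to a small stand-in classifier; the toy architecture and class labels are assumptions.

```python
import torch
import torch.nn as nn

# A small stand-in classifier; the paper's fire/smoke detector is not reproduced here.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
    nn.Linear(256, 2),            # fire / no-fire
)
model.eval()

# Post-training dynamic quantization: weights stored in int8, activations
# quantized on the fly, reducing memory footprint and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 3, 64, 64)
print(quantized(x))
```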
Download

Paper Nr: 411
Title:

BEVMOSNet: Multimodal Fusion for BEV Moving Object Segmentation

Authors:

Hiep Truong Cong, Ajay Kumar Sigatapu, Arindam Das, Yashwanth Sharma, Venkatesh Satagopan, Ganesh Sistu and Ciarán Eising

Abstract: Accurate motion understanding of the dynamic objects within the scene in bird’s-eye-view (BEV) is critical to ensure a reliable obstacle avoidance system and smooth path planning for autonomous vehicles. However, this task has received relatively limited exploration compared to object detection and segmentation, with only a few recent vision-based approaches presenting preliminary findings that significantly deteriorate in low-light, nighttime, and adverse weather conditions such as rain. Conversely, LiDAR and radar sensors remain almost unaffected in these scenarios, and radar provides key velocity information about the objects. Therefore, we introduce BEVMOSNet, to our knowledge the first end-to-end multimodal fusion leveraging cameras, LiDAR, and radar to precisely predict moving objects in BEV. In addition, we perform a deeper analysis to find the optimal strategy for deformable cross-attention-guided sensor fusion for cross-sensor knowledge sharing in BEV. Evaluating BEVMOSNet on the nuScenes dataset, we show an overall improvement in IoU score of 36.59% compared to the vision-based unimodal baseline BEV-MoSeg (Sigatapu et al., 2023), and 2.35% compared to the multimodal SimpleBEV (Harley et al., 2022), extended for the motion segmentation task, establishing this method as the state of the art in BEV motion segmentation.
Download

Area 2 - Mobile, Egocentric, and Robotic Vision

Full Papers
Paper Nr: 22
Title:

PrIcosa: High-Precision 3D Camera Calibration with Non-Overlapping Field of Views

Authors:

Oguz Kedilioglu, Tasnim Tabassum Nova, Martin Landesberger, Lijiu Wang, Michael Hofmann, Jörg Franke and Sebastian Reitelshöfer

Abstract: Multi-camera systems are being used more and more frequently, from autonomous mobile robots to intelligent visual servoing cells. Determining the relative pose of the cameras very accurately is essential for many applications. However, choosing the most suitable calibration object geometry and utilizing it as effectively as possible remains challenging. Disadvantageous geometries provide only subpar datasets, increasing the need for a larger dataset and decreasing the accuracy of the calibration results. Moreover, an unrefined calibration method can lead to worse accuracies even with a good dataset. Here, we introduce a probabilistic method to increase the accuracy of 3D camera calibration. Furthermore, we analyze the effects of the calibration object geometry on the data properties and the resulting calibration accuracy for cube and icosahedron geometries. The source code for this project is available on GitHub (Nova, 2024).
Download

Paper Nr: 37
Title:

Fine-Grained Self-Localization from Coarse Egocentric Topological Maps

Authors:

Daiki Iwata, Kanji Tanaka, Mitsuki Yoshida, Ryogo Yamamoto, Yuudai Morishita and Tomoe Hiroki

Abstract: Topological maps are increasingly favored in robotics for their cognitive relevance, compact storage, and ease of transferability to human users. While these maps provide scalable solutions for navigation and action planning, they present challenges for tasks requiring fine-grained self-localization, such as object goal navigation. This paper investigates the action planning problem of active self-localization from a novel perspective: can an action planner be trained to achieve fine-grained self-localization using coarse topological maps? Our approach acknowledges the inherent limitations of topological maps; overly coarse maps lack essential information for action planning, while excessively high-resolution maps diminish the need for an action planner. To address these challenges, we propose the use of egocentric topological maps to capture fine scene variations. This representation enhances self-localization accuracy by integrating an output probability map as a place-specific score vector into the action planner as a fixed-length state vector. By leveraging sensor data and action feedback, our system optimizes self-localization performance. For the experiments, the de facto standard particle filter-based sequential self-localization framework was slightly modified to enable the transformation of ranking results from a graph convolutional network (GCN)-based topological map classifier into real-valued vector state inputs by utilizing bag-of-place-words and reciprocal rank embeddings. Experimental validation of our method was conducted in the Habitat workspace, demonstrating the potential for effective action planning using coarse maps.
Download

Paper Nr: 300
Title:

GIFF: Graph Iterative Attention Based Feature Fusion for Collaborative Perception

Authors:

Ahmed N. Ahmed, Siegfried Mercelis and Ali Anwar

Abstract: Multi-agent collaborative perception has gained significant attention due to its ability to overcome the challenges stemming from the limited line-of-sight visibility of individual agents, which raises safety concerns for autonomous navigation. This paper introduces GIFF, a graph-based iterative attention collaborative perception framework designed to improve situational awareness among multi-agent systems, including vehicles and roadside units. GIFF enhances autonomous driving perception by fusing perceptual data shared among neighboring agents, allowing agents to “see” through occlusions, detect distant objects, and increase resilience to sensor noise and failures, at low computational cost. To achieve this, we propose a novel framework that integrates both channel and spatial attention mechanisms, learned iteratively and in parallel. We evaluate our approach on the object detection task using the V2X-Sim and OPV2V datasets by conducting extensive experiments. GIFF demonstrates effectiveness compared to state-of-the-art methods and achieves notable improvements in average precision and the number of model parameters.
Download

Short Papers
Paper Nr: 38
Title:

SSGA: Synthetic Scene Graph Augmentation via Multiple Pipeline Variants

Authors:

Kenta Tsukahara, Ryogo Yamamoto, Kanji Tanaka and Tomoe Hiroki

Abstract: Cross-view image localization, which involves predicting the view of a robot with respect to a single-view landmark image, is important in landmark-sparse and mapless navigation scenarios such as image-goal navigation. Typical scene graph-based methods assume that all objects in a landmark image are visible in the query image and cannot address view inconsistencies between the query and landmark images. We observed that scene graph augmentation (SGA), a technique that has recently emerged to address scene graph-specific data augmentation, is particularly relevant to our problem. However, the existing SGA methods rely on the availability of rich multi-view training images and are not suitable for single-view setups. In this study, we introduce a new SGA method tailored for cross-view scenarios where scene graph generation and scene synthesis are intertwined. We begin with the fundamental pipeline of cross-view self-localization, and without loss of generality, identify several pipeline variants. These pipeline variants are used as supervision cues to improve robustness and discriminability. Evaluation in an image-goal navigation scenario demonstrates that the proposed approach yields significant and consistent improvements in accuracy and robustness.
Download

Paper Nr: 270
Title:

Online Detection of End of Take and Release Actions from Egocentric Videos

Authors:

Alessandro Sebastiano Catinello, Giovanni Maria Farinella and Antonino Furnari

Abstract: In this work, we tackle the problem of detecting “take” and “release” actions from egocentric videos. We address the task following a new Online Detection of Action End (ODAE) formulation in which algorithms have to determine the end of an action in an online fashion. We show that ODAE has advantages over previous formulations that focus on detecting actions at the contact frame or offline, thanks to the reduced uncertainty arising from the complete observation of events before a prediction is made. We adapt different state-of-the-art temporal online action detection models to this task and benchmark them on the EPIC-KITCHENS dataset, highlighting the specific challenges of the ODAE task, such as sparse annotations and high action density. Analysis on THUMOS14 shows that most conclusions also hold in a third-person vision scenario. We also investigate the impact of techniques such as label propagation to address annotation imbalance. Our results show that the problem is far from being solved, although Mamba-based models consistently outperform transformer-based models in all settings.
Download

Paper Nr: 91
Title:

Low Latency Pedestrian Detection Based on Dynamic Vision Sensor and RGB Camera Fusion

Authors:

Bingyu Huang, Gianni Allebosch, Peter Veelaert, Tim Willems, Wilfried Philips and Jan Aelterman

Abstract: Advanced driver assistance systems currently adopt RGB cameras as visual perception sensors, which rely primarily on static features and are limited in capturing dynamic changes due to fixed frame rates and motion blur. A very promising sensor alternative is the dynamic vision sensor (DVS), which, with microsecond temporal resolution, records an asynchronous stream of per-pixel brightness changes, also known as an event stream. However, in autonomous driving scenarios, it is challenging to distinguish between events caused by the vehicle’s motion and events caused by actual moving objects in the environment. To address this, we design a motion segmentation algorithm based on epipolar geometry and apply it to DVS data, effectively removing static background events and focusing solely on dynamic objects. Furthermore, we propose a system that fuses the dynamic information captured by event cameras with the rich appearance details from RGB cameras. Experiments show that our proposed method can effectively improve detection performance while showing great potential in decision latency.
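The epipolar test behind this kind of motion segmentation can be sketched as a point-to-epiline distance check: points that follow the camera's ego-motion satisfy the epipolar constraint, while truly moving objects violate it. In the illustrative NumPy sketch below, the fundamental matrix is assumed to come from an ego-motion estimate, and the pixel threshold is an arbitrary choice; this is not the authors' algorithm.

```python
import numpy as np

def epipolar_residual(F, pts_prev, pts_curr):
    """Distance of current points to the epipolar lines induced by previous points.

    F        : (3, 3) fundamental matrix from the estimated ego-motion
    pts_prev : (N, 2) event/pixel locations at time t-1
    pts_curr : (N, 2) corresponding locations at time t
    """
    ones = np.ones((len(pts_prev), 1))
    x1 = np.hstack([pts_prev, ones])            # homogeneous coordinates
    x2 = np.hstack([pts_curr, ones])
    lines = (F @ x1.T).T                         # epipolar lines l = F x1
    num = np.abs(np.sum(lines * x2, axis=1))     # |x2^T F x1|
    den = np.linalg.norm(lines[:, :2], axis=1)   # line normalisation
    return num / den

def dynamic_mask(F, pts_prev, pts_curr, thresh_px=2.0):
    """Points violating the epipolar constraint are kept as dynamic candidates."""
    return epipolar_residual(F, pts_prev, pts_curr) > thresh_px

# toy usage with a random (non-physical) F, for shape checking only
F = np.random.randn(3, 3)
prev = np.random.rand(500, 2) * 640
curr = prev + np.random.randn(500, 2)
print(dynamic_mask(F, prev, curr).sum(), "candidate dynamic events")
```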
Download

Paper Nr: 164
Title:

Automated Individualization of Object Detectors for the Semantic Environment Perception of Mobile Robots

Authors:

Christian Hofmann, Christopher May, Patrick Ziegler, Iliya Ghotbiravandi, Jörg Franke and Sebastian Reitelshöfer

Abstract: Large Language Models (LLMs) and Vision Language Models (VLMs) enable robots to perform complex tasks. However, many of today’s mobile robots cannot carry the computing hardware required to run these models on board. Furthermore, access via communication systems to external computers running these models is often impractical. Therefore, lightweight object detection models are often utilized to enable mobile robots to semantically perceive their environment. In addition, mobile robots are used in different environments, which also change regularly. Thus, an automated adaptation of object detectors would simplify the deployment of mobile robots. In this paper, we present a method for automated environment-specific individualization and adaptation of lightweight object detectors using LLMs and VLMs, which includes the automated identification of relevant object classes. We comprehensively evaluate our method and show its successful application in principle, while also pointing out shortcomings regarding semantic ambiguities and the application of VLMs for pseudo-labeling datasets with bounding box annotations.
Download

Paper Nr: 372
Title:

Robotic Visual Attention Architecture for ADAS in Critical Embedded Systems for Smart Vehicles

Authors:

Diego Renan Bruno, William D’Abruzzo Martins, Rafael Alceste Berri and Fernando Santos Osório

Abstract: This paper presents the development of a perception architecture for Advanced Driver Assistance Systems (ADAS) capable of integrating (a) external and (b) internal vehicle perception to evaluate obstacles, traffic signs, pedestrians, navigable areas, potholes and deformations in the road, as well as to monitor driver behavior, respectively. For external perception, previous works used advanced sensors such as the Velodyne LIDAR-64 and the Bumblebee 3D camera for object depth analysis; in this work, focusing on reducing hardware, processing and time costs, we apply 2D cameras with depth estimation generated by the Depth-Anything V2 network model. Internal perception is performed using the Kinect v2 and the Jetson Nano in conjunction with an SVM (Support Vector Machine) model, allowing the identification of driver posture characteristics and the detection of signs of drunkenness, drowsiness or disrespect for traffic laws. The motivation for this system lies in the fact that more than 90% of traffic accidents in Brazil are caused by human error, while only 1% are detected by surveillance means. The proposed system offers an innovative solution to reduce these rates, integrating cutting-edge technologies to provide advanced road safety. This perception architecture for ADAS offers a solution for road safety, alerting the driver and allowing corrective actions to prevent accidents. The tests carried out demonstrated an accuracy of more than 92% for external and internal perception, validating the effectiveness of the proposed approach.
Download

Area 3 - Motion, Tracking, and 3D Vision

Full Papers
Paper Nr: 52
Title:

HandMvNet: Real-Time 3D Hand Pose Estimation Using Multi-View Cross-Attention Fusion

Authors:

Muhammad Asad Ali, Nadia Robertini and Didier Stricker

Abstract: In this work, we present HandMvNet, one of the first real-time methods designed to estimate 3D hand motion and shape from multi-view camera images. Unlike previous monocular approaches, which suffer from scale-depth ambiguities, our method ensures consistent and accurate absolute hand poses and shapes. This is achieved through a multi-view attention-fusion mechanism that effectively integrates features from multiple viewpoints. In contrast to previous multi-view methods, our approach eliminates the need for camera parameters as input to learn 3D geometry. HandMvNet also achieves a substantial reduction in inference time while delivering competitive results compared to state-of-the-art methods, making it suitable for real-time applications. Evaluated on publicly available datasets, HandMvNet qualitatively and quantitatively outperforms previous methods under identical settings. Code is available at github.com/pyxploiter/handmvnet.
Download

Paper Nr: 178
Title:

MuSt-NeRF: A Multi-Stage NeRF Pipeline to Enhance Novel View Synthesis

Authors:

Sudarshan Raghavan Iyengar, Subash Sharma and Patrick Vandewalle

Abstract: Neural Radiance Fields (NeRFs) have emerged as a powerful technique for novel view synthesis, but accurately capturing both intricate geometry and complex view-dependent effects, especially in challenging real-world scenes, remains a limitation of existing methods. This work presents MuSt-NeRF, a novel multi-stage pipeline designed to enhance the fidelity and robustness of NeRF-based reconstructions. The approach strategically chains complementary NeRF architectures, organized into two stages: a depth-guided stage that establishes a robust geometric foundation, followed by a refinement stage that enhances details and accurately renders view-dependent effects. Crucially, MuSt-NeRF allows flexible stage ordering, enabling either geometry-first or photometry-first reconstruction based on scene characteristics and desired outcomes. Experiments on diverse datasets, including synthetic scenes and complex indoor environments from the ScanNet dataset, demonstrate that MuSt-NeRF consistently outperforms single-stage NeRF and 3D Gaussian Splatting methods, achieving higher scores on established metrics like PSNR, SSIM, and LPIPS, while producing visually superior reconstructions. MuSt-NeRF’s flexibility and robust performance make it a promising approach for high-fidelity novel view synthesis in complex, real-world scenes. The code is made available at https://github.com/sudarshan-iyengar/MuSt-NeRF.
Download

Paper Nr: 192
Title:

Urban Re-Identification: Fusing Local and Global Features with Residual Masked Maps for Enhanced Vehicle Monitoring in Small Datasets

Authors:

William A. Ramirez, Cesar A. Sierra Franco, Thiago R. da Motta and Alberto Raposo

Abstract: This paper presents an optimized vehicle re-identification (Re-ID) approach focused on small datasets. While most existing literature concentrates on deep learning techniques applied to large datasets, this work addresses the specific challenges of working with smaller datasets, mainly when dealing with incomplete partitioning information. Our approach explores automated regional proposal methods, examining residuality and uniform sampling techniques for connected regions through statistical methods. Additionally, we integrate global and local attributes based on mask extraction to improve the generalization of the learning process. This leads to a more effective balance between small and large datasets, achieving up to an 8.3% improvement in Cumulative Matching Characteristics (CMC) at k=5 compared to attention-based methods for small datasets. We also improve generalization to context changes by up to 13% in CMC for large datasets. The code, model, and DeepStream-based implementations are available at https://github.com/will9426/will9426-automatic-Regionproposal-for-cars-in-Re-id-models.
Download

Paper Nr: 202
Title:

2D Motion Generation Using Joint Spatial Information with 2CM-GPT

Authors:

Ryota Inoue, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Various methods have been proposed for generating human motion from text due to advancements in large language models and diffusion models. However, most research has focused primarily on 3D motion generation. While 3D motion enables realistic representations, the creation and collection of datasets using motion-capture technology is costly, and its application to downstream tasks, such as pose-guided human video generation, is limited. Therefore, we propose 2D Convolutional Motion Generative Pre-trained Transformer (2CM-GPT), a method for generating two-dimensional (2D) motion from text. 2CM-GPT is based on the framework of MotionGPT, a method for 3D motion generation, and uses a motion tokenizer to convert 2D motion into motion tokens while learning the relationship between text and motion using a language model. Unlike MotionGPT, which utilizes 1D convolution for processing 3D motion, 2CM-GPT uses 2D convolution for processing 2D motion. This enables more effective capture of spatial relationships between joints. Evaluation experiments demonstrated that 2CM-GPT is effective in both motion reconstruction and text-guided 2D motion generation. The generated 2D motion is also shown to be effective for pose-guided human video generation.
Download

Paper Nr: 262
Title:

Segmentation-Guided Neural Radiance Fields for Novel Street View Synthesis

Authors:

Yizhou Li, Yusuke Monno, Masatoshi Okutomi, Yuuichi Tanaka, Seiichi Kataoka and Teruaki Kosiba

Abstract: Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.
Download

Paper Nr: 285
Title:

ConMax3D: Frame Selection for 3D Reconstruction Through Concept Maximization

Authors:

Akash Malhotra, Nacéra Seghouani, Gilbert Badaro and Christophe Blaya

Abstract: This paper proposes a novel best frames selection algorithm, ConMax3D, for multiview 3D reconstruction that utilizes image segmentation and clustering to identify and maximize concept diversity. This method aims to improve the accuracy and interpretability of selecting frames for a photorealistic 3D model generation with NeRF or 3D Gaussian Splatting without relying on camera pose information. We evaluate ConMax3D on the LLFF dataset and show that it outperforms current state-of-the-art baselines, with improvements in PSNR of up to 43.65%, while retaining computational efficiency.
Download

Paper Nr: 323
Title:

Improving Adaptive Density Control for 3D Gaussian Splatting

Authors:

Glenn Grubert, Florian Barthel, Anna Hilsmann and Peter Eisert

Abstract: 3D Gaussian Splatting (3DGS) has become one of the most influential works of the past year. Due to its efficient and high-quality novel view synthesis capabilities, it has been widely adopted in many research fields and applications. Nevertheless, 3DGS still faces challenges in properly managing the number of Gaussian primitives used during scene reconstruction. Following the adaptive density control (ADC) mechanism of 3D Gaussian Splatting, new Gaussians are created in under-reconstructed regions, while Gaussians that do not contribute to the rendering quality are pruned. We observe that these criteria for densifying and pruning Gaussians can sometimes lead to worse rendering by introducing artifacts, in particular under-reconstructed background or overfitted foreground regions. To counter both problems, we propose three improvements to the adaptive density control mechanism: a correction of the scene extent calculation that does not rely solely on camera positions, an exponentially ascending gradient threshold to improve training convergence, and a significance-aware pruning strategy to avoid background artifacts. With these adaptations, we show that the rendering quality improves while using the same number of Gaussian primitives. Furthermore, with our improvements, training converges considerably faster, allowing for more than twice as fast training times while yielding better quality than 3DGS. Finally, our contributions are easily compatible with most existing derivative works of 3DGS, making them relevant for future work.
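One of the three changes, the exponentially ascending gradient threshold for densification, can be illustrated with a small helper: Gaussians whose accumulated screen-space positional gradient exceeds a threshold are split or cloned, and the threshold rises over training. The start/end values and schedule form below are assumptions made purely for illustration, not the paper's exact numbers.

```python
import numpy as np

def densify_grad_threshold(step, total_steps, tau_start=2e-4, tau_end=2e-3):
    """Exponentially increase the gradient threshold used by adaptive density
    control, so many Gaussians are added early and fewer near convergence."""
    frac = np.clip(step / total_steps, 0.0, 1.0)
    return tau_start * (tau_end / tau_start) ** frac

def should_densify(view_space_grad_norms, step, total_steps):
    """Boolean mask of Gaussians whose accumulated screen-space positional
    gradient exceeds the current (step-dependent) threshold."""
    tau = densify_grad_threshold(step, total_steps)
    return view_space_grad_norms > tau

# toy usage: the same gradients trigger far fewer densifications late in training
grads = np.abs(np.random.randn(10000)) * 1e-3
print(should_densify(grads, step=1_000, total_steps=15_000).sum(),
      should_densify(grads, step=14_000, total_steps=15_000).sum())
```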
Download

Paper Nr: 325
Title:

Sensor Calibration and Data Analysis of the MuFoRa Dataset

Authors:

Valentino Behret, Regina Kushtanova, Islam Fadl, Simon Weber, Thomas Helmer and Frank Palme

Abstract: Autonomous driving sensors face significant challenges under adverse weather conditions such as fog and rain, which can seriously degrade their performance and reliability. Existing datasets often lack the reproducible and measurable data needed to adequately quantify these effects. To address this gap, a new multimodal dataset (MuFoRa) has been collected under controlled adverse weather conditions at the CARISSMA facility, using a stereo camera and two solid-state LiDAR sensors. This dataset is used to quantitatively assess sensor degradation by measuring the entropy for images and the number of inliers for point clouds on a spherical target. These metrics are used to evaluate the impact on performance under varying conditions of fog (5 to 150 m visibility) and rain (20 to 100 mm/h intensity) at different distances (5 to 50 m). Additionally, two calibration target detection approaches, deep learning-based and Hough-based, are evaluated to achieve accurate sensor alignment. The contributions include the introduction of a new dataset focused on fog and rain, the evaluation of sensor degradation, and an improved calibration approach. This dataset is intended to support the development of more robust sensor fusion and object detection algorithms for autonomous driving.
Download

Paper Nr: 329
Title:

Uncertainty and Feature-Based Weighted Loss for 3D Wheat Part Segmentation

Authors:

R. Reena, John H. Doonan, Kevin Williams, Fiona M. K. Corke, Huaizhong Zhang and Yonghuai Liu

Abstract: Deep learning techniques and point clouds have proved their efficacy in 3D object segmentation tasks. Nevertheless, accurate plant organ segmentation remains a formidable challenge due to the complex structure and variability of plants. Furthermore, the presence of over-represented and under-represented parts, occlusion, and uneven distribution complicates 3D part segmentation. Even though deep learning techniques often exhibit exceptional performance, they also face challenges in applications where accurate trait estimation is required. To handle these issues, we propose a novel uncertainty- and feature-based weighted loss that incorporates uncertainty metrics and features of the plant or crop. We use a Gradient Attention Module (GAM) with a PointNet++ baseline to validate our approach. By dynamically introducing uncertainty and feature scores into the training process, it promotes more balanced learning. Through comprehensive evaluation, we illustrate the advantages of UFL (Uncertainty and Feature based Loss) compared to standard cross-entropy (CE) loss on our own real wheat dataset. The outcomes demonstrate consistent improvements in accuracy (ranging from 0.9% to 4.2%) and Ear mIoU (ranging from 1.8% to 15.3%) over the standard cross-entropy loss function. As a result, our work contributes to the development of more robust and reliable segmentation models. This approach not only pushes forward the boundaries of precision agriculture but also has the potential to influence related areas where accurate segmentation is pivotal.
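The general idea of re-weighting a segmentation loss by per-class uncertainty and feature scores can be sketched generically in PyTorch. The weighting formula, the class names in the toy example, and all values below are editorial assumptions, not the paper's UFL definition.

```python
import torch
import torch.nn.functional as F

def weighted_part_loss(logits, target, uncertainty, feature_score):
    """Cross-entropy with per-class weights derived from uncertainty and
    feature statistics (e.g. up-weighting rare, uncertain parts such as ears).

    logits        : (B, C, N) per-point class scores
    target        : (B, N) integer part labels
    uncertainty   : (C,) per-class uncertainty score in [0, 1]
    feature_score : (C,) per-class feature/rarity score in [0, 1]
    """
    # Illustrative combination: uncertain or rare classes get more weight.
    weights = 1.0 + uncertainty + feature_score
    weights = weights / weights.mean()
    return F.cross_entropy(logits, target, weight=weights)

# toy usage with 4 parts (e.g. ear, stem, leaf, background)
logits = torch.randn(2, 4, 1024, requires_grad=True)
target = torch.randint(0, 4, (2, 1024))
unc = torch.tensor([0.8, 0.2, 0.3, 0.1])
feat = torch.tensor([0.7, 0.4, 0.3, 0.1])
weighted_part_loss(logits, target, unc, feat).backward()
```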
Download

Short Papers
Paper Nr: 70
Title:

D-PLS: Decoupled Semantic Segmentation for 4D-Panoptic-LiDAR-Segmentation

Authors:

Maik Steinhauser, Laurenz Reichardt, Nikolas Ebert and Oliver Wasenmüller

Abstract: This paper introduces a novel approach to 4D Panoptic LiDAR Segmentation that decouples semantic and instance segmentation, leveraging single-scan semantic predictions as prior information for instance segmentation. Our method D-PLS first performs single-scan semantic segmentation and aggregates the results over time, using them to guide instance segmentation. The modular design of D-PLS allows for seamless integration on top of any semantic segmentation architecture, without requiring architectural changes or retraining. We evaluate our approach on the SemanticKITTI dataset, where it demonstrates significant improvements over the baseline in both classification and association tasks, as measured by the LiDAR Segmentation and Tracking Quality (LSTQ) metric. Furthermore, we show that our decoupled architecture not only enhances instance prediction but also surpasses the baseline due to advancements in single-scan semantic segmentation.
Download

Paper Nr: 103
Title:

Real-Time Kinematic Positioning and Optical See-Through Head-Mounted Display for Outdoor Tracking: Hybrid System and Preliminary Assessment

Authors:

Muhannad Ismael and Maël Cornil

Abstract: This paper presents an outdoor tracking system using Real-Time Kinematic (RTK) positioning and Optical See-Through Head Mounted Display(s) (OST-HMD(s)) in urban areas where the accurate tracking of objects is critical and where displaying occluded information is important for safety reasons. The approach presented here replaces 2D screens/tablets and offers distinct advantages, particularly in scenarios demanding hands-free operation. The integration of RTK, which provides centimeter-level accuracy of tracked objects, with OST-HMD represents a promising solution for outdoor applications. This paper provides valuable insights into leveraging the combined potential of RTK and OST-HMD for outdoor tracking tasks from the perspectives of systems integration, performance optimization, and usability. The main contributions of this paper are: 1) a system for seamlessly merging RTK systems with OST-HMD to enable relatively precise and intuitive outdoor tracking, 2) an approach to determine a global location to achieve the position relative to the world, 3) an approach referred to as ‘semi-dynamic’ for system assessment.
Download

Paper Nr: 147
Title:

Noisemaker 3D: Comprehensive Framework for Mesh Noise Generation

Authors:

Vladimir Mashurov, Vasilii Latonov, Anastasia Martynova and Natalia Semenova

Abstract: In this article, we present a comprehensive library for generating node and topological noise in meshes. The library provides a versatile tool for creating corrupted mesh datasets, which are essential for learning-based denoising algorithms. Our main contributions include cluster and patch noise generation techniques for mesh topology corruption. Cluster generation supports two modes: separated and merged clusters. We also compare the node noise generated by the library to real noise from a scanned object dataset. Finally, we create a noisy object dataset using the library and test it with filter-based and machine learning-based denoising methods.
Download

Paper Nr: 160
Title:

Evaluating Homography Error for Accurate Multi-Camera Multi-Object Tracking of Dairy Cows

Authors:

Shunpei Aou, Yota Yamamoto, Kazuaki Nakamura and Yukinobu Taniguchi

Abstract: In dairy farming, accurate and early detection of signs of illness or estrus in dairy cows is essential for improving both health management and production efficiency. It is well known that diseases or estrus in dairy cows are reflected in their activity levels, and monitoring behaviors such as walking distance and time spent feeding, drinking, and lying down serves as a means to detect these signs. Therefore, tracking the movement of dairy cows can provide insights into their health condition. In this paper, we propose a tracking method that addresses homography errors, which have been identified as one of the causes of reduced accuracy in the location-based multi-camera multi-object tracking methods previously used for tracking dairy cows. Additionally, we demonstrate the effectiveness of the proposed method through validation experiments conducted in two different barn environments.
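For background, location-based multi-camera association typically projects per-camera detections onto a common ground plane with a homography and matches them by distance, which is exactly where homography error enters. The OpenCV sketch below illustrates that step only; the homography matrices, detections, and distance tolerance are placeholders, not the paper's method or data.

```python
import cv2
import numpy as np

def to_ground_plane(boxes_xywh, H):
    """Map the bottom-center of each bounding box to ground-plane coordinates."""
    feet = np.array([[x + w / 2.0, y + h] for x, y, w, h in boxes_xywh],
                    dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(feet, H).reshape(-1, 2)

# Placeholder image-to-ground homographies for two cameras.
H_cam1 = np.eye(3, dtype=np.float64)
H_cam2 = np.eye(3, dtype=np.float64)

det_cam1 = [(100, 200, 80, 160)]            # x, y, w, h detections in camera 1
det_cam2 = [(420, 180, 90, 170)]            # detections in camera 2

g1 = to_ground_plane(det_cam1, H_cam1)
g2 = to_ground_plane(det_cam2, H_cam2)

# Cross-camera association by ground-plane distance; the tolerance must absorb
# the homography (re-)projection error discussed in the paper.
dist = np.linalg.norm(g1[:, None, :] - g2[None, :, :], axis=-1)
print("same cow" if dist.min() < 0.5 else "different cows")
```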
Download

Paper Nr: 197
Title:

FiDaSS: A Novel Dataset for Firearm Threat Detection in Real-World Scenes

Authors:

Murilo S. Regio and Isabel H. Manssour

Abstract: For a society to thrive, people must feel safe; otherwise, fear and stress reduce the quality of life. A variety of security measures are used, but as populations grow and firearms become more accessible, societal safety faces new challenges. Existing works on threat detection focus primarily on security cameras but lack common benchmarks, standard datasets, or consistent constraints, making it difficult to assess their real-world performance, especially with low-quality footage. This work introduces a challenging dataset for Firearm Threat Detection, comprising 7450 annotated frames across 291 videos, created under rigorous quality controls. We also developed tools to streamline dataset creation and expansion through semi-automatic annotations. To our knowledge, this is the largest real-world dataset with frame-level annotations in the area. Our dataset is available online alongside the tools developed, including some to facilitate its extension. We evaluated popular detectors and state-of-the-art transformer-based methods on the dataset to validate its difficulty.
Download

Paper Nr: 246
Title:

Learning Neural Velocity Fields from Dynamic 3D Scenes via Edge-Aware Ray Sampling

Authors:

Sota Ito, Yoshikazu Hayashi, Hiroaki Aizawa and Kunihito Kato

Abstract: Neural Velocity Fields enable future frame extrapolation by learning not only the geometry and appearance but also the velocity of dynamic 3D scenes, incorporating physics-based constraints. While the divergence theorem employed in NVFi enforces velocity continuity, it also inadvertently imposes continuity at the boundaries between dynamic objects and background regions. Consequently, the velocities of dynamic objects are reduced by the influence of background regions with zero velocity, which diminishes the quality of extrapolated frames. In our proposed method, we identify object boundaries based on geometric information extracted from NVFi and apply the divergence theorem exclusively to non-boundary regions. This approach allows for more accurate learning of velocities, enhancing the quality of both interpolated and extrapolated frames. Our experiments on the Dynamic Object Dataset demonstrate a 1.6% improvement in PSNR [dB] for interpolated frames and a 0.8% improvement for extrapolated frames.
Download

Paper Nr: 250
Title:

3DSES: An Indoor Lidar Point Cloud Segmentation Dataset with Real and Pseudo-Labels from a 3D Model

Authors:

Maxime Mérizette, Nicolas Audebert, Pierre Kervella and Jérôme Verdun

Abstract: Semantic segmentation of indoor point clouds has found various applications in the creation of digital twins for robotics, navigation and building information modeling (BIM). However, most existing datasets of labeled indoor point clouds have been acquired by photogrammetry. In contrast, Terrestrial Laser Scanning (TLS) can acquire dense sub-centimeter point clouds and has become the standard for surveyors. We present 3DSES (3D Segmentation of ESGT point clouds), a new dataset of indoor dense TLS colorized point clouds covering 427 m² of an engineering school. 3DSES has a unique double annotation format: semantic labels annotated at the point level alongside a full 3D CAD model of the building. We introduce a model-to-cloud algorithm for automated labeling of indoor point clouds using an existing 3D CAD model. 3DSES has 3 variants of various semantic and geometrical complexities. We show that our model-to-cloud alignment can produce pseudo-labels on our point clouds with an accuracy above 95%, allowing us to train deep models with significant time savings compared to manual labeling. First baselines on 3DSES show the difficulties encountered by existing models when segmenting objects relevant to BIM, such as light and safety utilities. We show that segmentation accuracy can be improved by leveraging pseudo-labels and Lidar intensity, an information rarely considered in current datasets. Code and data will be open sourced.
Download

Paper Nr: 272
Title:

MAESTRO: A Full Point Cloud Approach for 3D Anomaly Detection Based on Reconstruction

Authors:

Remi Lhoste, Antoine Vacavant and Damien Delhay

Abstract: 3D anomaly detection is a critical task in industrial manufacturing for maintaining product quality and operational safety. However, many existing methods function more as 2.5D anomaly detection techniques, primarily relying on image data and underexploiting point clouds. These methods often face challenges related to real scenarios and reliance on large pretrained models or memory banks. To address these issues, we propose MAESTRO, a Masked AutoEncoder Self-Supervised Through Reconstruction Only. This novel 3D anomaly detection method is based solely on point cloud reconstruction, without pretrained models or memory banks, making it particularly suitable for industrial applications. Experiments demonstrate that our method can outperform previous state-of-the-art methods on several classes of the MVTec 3D-AD dataset (Bergmann et al., 2022).
Download

Paper Nr: 307
Title:

Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios

Authors:

Iryna Repinetska, Anna Hilsmann and Peter Eisert

Abstract: Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as “floaters” that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree “inside-out” views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.
Download

Paper Nr: 346
Title:

Leveraging Unreal Engine for UAV Object Tracking: The AirTrackSynth Synthetic Dataset

Authors:

Mingyang Zhang, Kristof Van Beeck and Toon Goedemé

Abstract: Nowadays, synthetic datasets are often used to advance the state of the art in many application domains of computer vision. The deep learning approaches used for these tasks require vast amounts of data, and acquiring such large annotated datasets is far from trivial, since it is time-consuming, expensive and prone to errors during the labelling process. Synthetic datasets aim to offer solutions to the aforementioned problems. In this paper, we introduce our AirTrackSynth dataset, developed to train and evaluate deep learning models for UAV object tracking. This dataset, created using the Unreal Engine and AirSim, comprises 300GB of data in 200 well-structured video sequences. AirTrackSynth is notable for its extensive variety of objects and complex environments, setting a new standard in the field. The dataset is characterized by its multi-modal sensor data, accurate ground truth labels and a variety of environmental conditions, including distinct weather patterns, lighting conditions, and challenging viewpoints, thereby offering a rich platform to train robust object tracking models. Through the evaluation of the SiamFC object tracking algorithm on AirTrackSynth, we demonstrate the dataset’s ability to present substantial challenges to existing methodologies and highlight the importance of synthetic data, especially when the availability of real data is limited. This enhancement in algorithmic performance under diverse and complex conditions underscores the critical role of synthetic data in developing advanced tracking technologies.
Download

Paper Nr: 353
Title:

Recovery of Detailed Posture and Shape from Motion Video Images by Deforming SMPL

Authors:

Yumi Ando, Fumihiko Sakaue and Jun Sato

Abstract: In this research, we propose a method for estimating detailed human shape and posture from video images of a person in motion. The SMPL (Skinned Multi-Person Linear Model) model can represent various body shapes with a small number of parameters, but it cannot represent detailed information such as the subject’s clothing or hairstyle. In this research, we separate such detailed deformations into deformations common to all time periods and temporary deformations that appear at different times, and recover each of them to realize detailed human shape recovery from video images of people shot with various postures.
Download

Paper Nr: 362
Title:

Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation

Authors:

Marziyeh Bamdad, Hans-Peter Hutter and Alireza Darvishy

Abstract: Despite advancements in SLAM technologies, robust operation under challenging conditions such as low texture, motion blur, or difficult lighting remains an open problem. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using the TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.
Download

Paper Nr: 421
Title:

A Computer Vision Approach to Counting Farmed Fish in Flowing Water

Authors:

Masanori Nishiguchi, Hitoshi Habe, Koji Abe, Masayuki Otani and Nobukazu Iguchi

Abstract: Aquaculture is an expanding industry that depends on accurate fish counting for effective production management, including growth monitoring and feed optimization. Manual counting is time-consuming and labor-intensive, while commercial counting devices face challenges such as high costs and space constraints. In ecology, tracking animal movement trajectories is essential, but attaching devices to small organisms is impractical, prompting the adoption of video and machine learning techniques. In contrast to traditional biological studies that often rely on offline analysis, real-time fish counting is vital in aquaculture. This study introduces a fish counting method based on a Multiple Object Tracking (MOT) algorithm explicitly tailored for aquaculture. The method prioritizes counting accuracy over precise movement tracking, optimizing existing techniques. The proposed approach provides a viable solution for counting fish in aquaculture and potentially other fields.
Download

Paper Nr: 425
Title:

Shape from Mirrored Polarimetric Light Field

Authors:

Shunsuke Nakagawa, Takahiro Okabe and Ryo Kawahara

Abstract: While mirror reflections provide valuable cues for vision tasks, recovering the shape of mirror-like objects remains challenging because they reflect their surroundings rather than displaying their own textures. A common approach involves placing reference objects and analyzing their reflected correspondences, but this often introduces depth ambiguity and relies on additional assumptions. In this paper, we propose a unified framework that integrates polarization and geometric transformations for shape estimation. We introduce a 9-dimensional polarized ray representation, extending the Plücker coordinate system to incorporate the polarization properties of light as defined by the plane of its electric field oscillation. This enables the seamless evaluation of polarized ray agreement within a homogeneous coordinate system. By analyzing the constraints of polarized rays before and after reflection, we derive a method for per-pixel shape estimation. Our experimental evaluations with synthetic and real images demonstrate the effectiveness of our method qualitatively and quantitatively.
Download

Paper Nr: 72
Title:

Adaptable Distributed Vision System for Robot Manipulation Tasks

Authors:

Marko Pavlic and Darius Burschka

Abstract: Existing robotic manipulation systems use stationary depth cameras to observe the workspace, but they are limited by their fixed field of view (FOV), workspace coverage, and depth accuracy. This also limits the performance of robot manipulation tasks, especially in occluded workspace areas or highly cluttered environments where a single view is insufficient. We propose an adaptable distributed vision system for better scene understanding. The system integrates a global RGB-D camera connected to a powerful computer and a monocular camera mounted on an embedded system at the robot’s end-effector. The monocular camera facilitates the exploration and 3D reconstruction of new workspace areas. This configuration provides enhanced flexibility, featuring a dynamic FOV and an extended depth range achievable through the adjustable base length, controlled by the robot’s movements. The reconstruction process can be distributed between the two processing units as needed, allowing for flexibility in system configuration. This work evaluates various configurations regarding reconstruction accuracy, speed, and latency. The results demonstrate that the proposed system achieves precise 3D reconstruction while providing significant advantages for robotic manipulation tasks.
Download

Paper Nr: 215
Title:

Comparative Analysis of Deep Learning-Based Multi-Object Tracking Approaches Applied to Sports User-Generated Videos

Authors:

Elton Alencar, Larissa Pessoa, Fernanda Costa, Guilherme Souza and Rosiane de Freitas

Abstract: The growth of video-sharing platforms has led to a significant increase in audiovisual content production, especially from mobile devices like smartphones. Sports user-generated videos (UGVs) pose unique challenges for automated analysis due to variations in image quality, diverse camera angles, and fast-moving objects. This paper presents a comparative qualitative analysis of multiple object tracking (MOT) techniques applied to sports UGVs. We evaluated three approaches: DeepSORT, StrongSORT, and TrackFormer, representing detection- and attention-based tracking paradigms. Additionally, we propose integrating StrongSORT with YOLO-World, an open-vocabulary detector, to improve tracking by reducing irrelevant object detections and focusing on key elements such as players and balls. To assess the techniques, we developed UVY, a custom sports UGV database with YouTube as its data source. A qualitative analysis of the results from applying the different tracking methods to UVY-Track videos revealed that the tracking-by-detection techniques, DeepSORT and StrongSORT, performed better at tracking relevant sports objects than TrackFormer, which focuses on pedestrians. The new StrongSORT version with YOLO-World showed promise by detecting fewer irrelevant objects. These findings suggest that integrating open-vocabulary detectors into MOT models can significantly improve sports UGV analysis. This work contributes to developing more effective and scalable solutions for object tracking in sports videos.
Download

Paper Nr: 316
Title:

Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models

Authors:

Yushan Wang, Shuhei Tarashima and Norio Tagawa

Abstract: Fully-transformer frameworks have gradually replaced traditional convolutional neural networks (CNNs) in recent 3D human pose and shape estimation tasks, largely due to their attention mechanism, which can capture long-range and complex relationships between input tokens, surpassing CNNs’ representation capabilities. Recent attention designs have reduced the computational complexity of transformers in core computer vision tasks like classification and segmentation, achieving extraordinarily strong results. However, their potential for more complex, higher-level tasks remains unexplored. For the first time, we propose to integrate the group-mix attention mechanism into the 3D human pose and shape estimation task. We combine token-to-token, token-to-group, and group-to-group correlations, enabling a broader capture of human body part relationships and making the approach promising for challenging scenarios like occlusion and blur. We believe this mix of tokens and groups is well suited to our task, where we need to learn the relevance of various parts of the human body, which are often not individual tokens but larger in scope. We quantitatively and qualitatively validate that our method reduces the parameter count by 97.3% (from 620M to 17M) and the FLOPs count by 96.1% (from 242.1G to 9.5G), with a performance gap of less than 3%.
Download

Paper Nr: 406
Title:

Benchmarking Neural Rendering Approaches for 3D Reconstruction of Underwater Environments

Authors:

Salvatore Mario Carota, Alessandro Privitera, Daniele Di Mauro, Antonino Furnari, Giovanni Maria Farinella and Francesco Ragusa

Abstract: We tackle the problem of 3D reconstruction of underwater scenarios using neural rendering techniques. We propose a benchmark adopting the SeaThru-NeRF dataset, performing a systematic analysis by comparing several established methods based on NeRF and 3D Gaussian Splatting through a series of experiments. The results were evaluated both quantitatively, using various 2D and 3D metrics, and qualitatively, through a user survey assessing the fidelity of the reconstructed images. This provides critical insight into how to select the optimal techniques for 3D reconstruction of underwater scenarios. The results indicate that, in the context of this application, among the algorithms tested, NeRF-based methods performed better in both mesh generation and novel view synthesis than the 3D Gaussian Splatting-based methods.
Download

Paper Nr: 420
Title:

An Event Camera Simulator for Arbitrary Viewpoints Based on Neural Radiance Fields

Authors:

Diego Hernández Rodríguez, Motoharu Sonogashira, Kazuya Kitano, Yuki Fujimura, Takuya Funatomi, Yasuhiro Mukaigawa and Yasutomo Kawanishi

Abstract: Event cameras are novel sensors that offer significant advantages over standard cameras, such as high temporal resolution, high dynamic range, and low latency. Despite recent efforts, however, event cameras remain relatively expensive and difficult to obtain. Simulators for these sensors are crucial for developing new algorithms and mitigating accessibility issues. However, existing simulators based on a real-world video often fail to generalize to novel viewpoints or temporal resolutions, making the generation of realistic event data from a single scene unfeasible. To address these challenges, we propose enhancing event camera simulators with neural radiance fields (NeRFs). NeRFs can synthesize novel views of complex scenes from a low-frame-rate video sequence, providing a powerful tool for simulating event cameras from arbitrary viewpoints. This approach not only simplifies the simulation process but also allows for greater flexibility and realism in generating event camera data, making the technology more accessible to researchers and developers.
Download

Area 4 - Image and Video Understanding

Full Papers
Paper Nr: 41
Title:

Adaptive Out-of-Distribution Detection with Coarse-to-Fine Grained Representation

Authors:

Kohei Fukuda and Hiroaki Aizawa

Abstract: Out-of-distribution (OOD) detection, which aims to identify data sampled from a distribution different from the training data, is crucial for practical machine learning applications. Despite the coarse-to-fine structure of OOD data, which includes features at various granularities of detail, such as object shapes (coarse features) and textures (fine features), most existing methods represent an image as a fixed-length feature vector and perform detection by calculating a single OOD score from this vector. To consider the coarse-to-fine structure of OOD data, we propose a method for detecting OOD data that uses feature vectors that contain information at different granularities obtained by Matryoshka representation learning. Adaptive sub-feature vectors are selected for each OOD dataset. The OOD scores calculated from these vectors are taken as the final OOD scores. Experiments show that the proposed method outperforms existing methods in terms of OOD detection. Moreover, we analyze the relationship between each OOD dataset and the sub-feature vectors selected by our method.
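A simplified sketch of scoring OOD data from nested (Matryoshka-style) feature prefixes is shown below; the nearest-neighbour cosine-distance score and the prefix lengths are assumptions standing in for the paper's adaptive sub-feature selection.

```python
import numpy as np

def knn_ood_scores(train_feats, test_feats, dims):
    """For each prefix length d (Matryoshka-style nested representation),
    score test samples by the cosine distance to their nearest training
    feature. Higher score = more likely OOD."""
    scores = {}
    for d in dims:
        tr = train_feats[:, :d] / np.linalg.norm(train_feats[:, :d], axis=1, keepdims=True)
        te = test_feats[:, :d] / np.linalg.norm(test_feats[:, :d], axis=1, keepdims=True)
        dists = 1.0 - te @ tr.T              # cosine distance to every training sample
        scores[d] = dists.min(axis=1)        # nearest-neighbour distance as OOD score
    return scores

# Illustrative usage with random 512-d features and nested prefixes:
train = np.random.randn(1000, 512).astype(np.float32)
test = np.random.randn(200, 512).astype(np.float32)
per_dim_scores = knn_ood_scores(train, test, dims=[64, 128, 256, 512])
```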
Download

Paper Nr: 45
Title:

Distortion-Aware Adversarial Attacks on Bounding Boxes of Object Detectors

Authors:

Pham Phuc, Son Vuong, Khang Nguyen and Tuan Dang

Abstract: Deep learning-based object detection has become ubiquitous in the last decade due to its high accuracy in many real-world applications. With this growing trend, these models have become targets of adversarial attacks, yet most existing results concern classifiers, which do not match the context of practical object detection. In this work, we propose a novel method to fool object detectors, expose the vulnerability of state-of-the-art detectors, and promote later works to build more robust detectors against adversarial examples. Our method aims to generate adversarial images by perturbing object confidence scores during training, which is crucial for predicting confidence for each class in the testing phase. Herein, we provide a more intuitive technique to embed additive noises based on detected objects’ masks and the training loss with distortion control over the original image by leveraging the gradient of iterative images. To verify the proposed method, we perform adversarial attacks against different object detectors, including the most recent state-of-the-art models like YOLOv8, Faster R-CNN, RetinaNet, and Swin Transformer. We also evaluate our technique on the MS COCO 2017 and PASCAL VOC 2012 datasets and analyze the trade-off between the success attack rate and image distortion. Our experiments show that the achievable success attack rate is up to 100% and up to 98% when performing white-box and black-box attacks, respectively. The source code and relevant documentation for this work are available at the following link https://github.com/anonymous20210106/attack detector.git.
Download

Paper Nr: 46
Title:

Pose-Centric Motion Synthesis Through Adaptive Instance Normalization

Authors:

Oliver Hixon-Fisher, Jarek Francik and Dimitrios Makris

Abstract: In pose-centric motion synthesis, existing methods often depend heavily on architecture-specific mechanisms to comprehend temporal dependencies. This paper addresses this challenge by introducing the use of adaptive instance normalization layers to capture temporal coherence within pose-centric motion synthesis. We demonstrate the effectiveness of our contribution through state-of-the-art performance in terms of Fréchet Inception Distance (FID) and comparable diversity scores. Evaluations conducted on the CMU MoCap and the HumanAct12 datasets showcase our method’s ability to generate plausible and high-quality motion sequences, underscoring its potential for diverse applications in motion synthesis.
Download

Paper Nr: 81
Title:

ConvKAN: Towards Robust, High-Performance and Interpretable Image Classification

Authors:

Achref Ouni, Chafik Samir, Yousef Bouaziz and Anis Fradi

Abstract: This paper introduces ConvKAN, a novel convolutional model for image classification in artificial vision systems. ConvKAN integrates Kolmogorov-Arnold Networks (KANs) with convolutional layers within Convolutional Neural Networks (CNNs). We demonstrate that this combination outperforms standard CNN-MLP architectures and state-of-the-art methods. Our study investigates the impact of this integration on classification performance across benchmarks and assesses the robustness of ConvKAN models compared to established CNN architectures. Varied and extensive experimental results show that ConvKAN achieves substantial gains in accuracy, precision, and recall, surpassing current state-of-the-art methods.
Download

Paper Nr: 86
Title:

Latent Space Characterization of Autoencoder Variants

Authors:

Anika Shrivastava, Renu Rameshan and Samar Agnihotri

Abstract: Understanding the latent spaces learned by deep learning models is crucial in exploring how they represent and generate complex data. Autoencoders (AEs) have played a key role in the area of representation learning, with numerous regularization techniques and training principles developed not only to enhance their ability to learn compact and robust representations, but also to reveal how different architectures influence the structure and smoothness of the lower-dimensional non-linear manifold. We strive to characterize the structure of the latent spaces learned by different autoencoders including convolutional autoencoders (CAEs), denoising autoencoders (DAEs), and variational autoencoders (VAEs) and how they change with the perturbations in the input. By characterizing the matrix manifolds corresponding to the latent spaces, we provide an explanation for the well-known observation that the latent spaces of CAE and DAE form non-smooth manifolds, while that of VAE forms a smooth manifold. We also map the points of the matrix manifold to a Hilbert space using distance preserving transforms and provide an alternate view in terms of the subspaces generated in the Hilbert space as a function of the distortion in the input. The results show that the latent manifolds of CAE and DAE are stratified with each stratum being a smooth product manifold, while the manifold of VAE is a smooth product manifold of two symmetric positive definite matrices and a symmetric positive semi-definite matrix.
Download

Paper Nr: 89
Title:

Beyond Labels: Self-Attention-Driven Semantic Separation Using Principal Component Clustering in Latent Diffusion Models

Authors:

Felix Stillger, Frederik Hasecke, Lukas Hahn and Tobias Meisen

Abstract: High-quality annotated datasets are crucial for training semantic segmentation models, yet their manual creation and annotation are labor-intensive and costly. In this paper, we introduce a novel method for generating class-agnostic semantic segmentation masks by leveraging the self-attention maps of latent diffusion models, such as Stable Diffusion. Our approach is entirely learning-free and explores the potential of self-attention maps to produce semantically meaningful segmentation masks. Central to our method is the reduction of individual self-attention information to condense the essential features required for semantic distinction. We employ multiple instances of unsupervised k-means clustering to generate clusters, with increasing cluster counts leading to more specialized semantic abstraction. We evaluate our approach using state-of-the-art models such as Segment Anything (SAM) and Mask2Former, which are trained on extensive datasets of manually annotated masks. Our results, demonstrated on both synthetic and real-world images, show that our method generates high-resolution masks with adjustable granularity, relying solely on the intrinsic scene understanding of the latent diffusion model - without requiring any training or fine-tuning.
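A minimal, learning-free sketch of the clustering step could look as follows; the attention-descriptor shape, the PCA reduction size, and the cluster count are illustrative assumptions, not the paper's exact reduction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def attention_to_masks(attn, n_components=16, n_clusters=6):
    """attn: (H, W, D) per-pixel self-attention descriptors for one latent image.
    Condense them with PCA, then k-means the pixels; each cluster id becomes a
    class-agnostic segment. Increasing n_clusters gives finer granularity."""
    h, w, d = attn.shape
    feats = attn.reshape(h * w, d)
    feats = PCA(n_components=n_components).fit_transform(feats)   # condense attention info
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    return labels.reshape(h, w)

# Illustrative usage with random attention descriptors:
masks = attention_to_masks(np.random.rand(64, 64, 256).astype(np.float32))
```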
Download

Paper Nr: 95
Title:

Experience Replay and Zero-Shot Clustering for Continual Learning in Diabetic Retinopathy Detection

Authors:

Gusseppe Bravo-Rocca, Peini Liu, Jordi Guitart, Ajay Dholakia, David Ellison and Rodrigo M. Carrillo-Larco

Abstract: We present an approach to mitigate catastrophic forgetting in Continual Learning (CL), focusing on domain incremental scenarios in medical imaging. Our method leverages Large Language Models (LLMs) to generate task-agnostic descriptions from multimodal inputs, enabling zero-shot clustering of tasks without supervision. This clustering underpins an enhanced Experience Replay (ER) strategy, strategically sampling data points to refresh the model’s memory while preserving privacy. By incrementally updating a multi-head classifier using only data embeddings, our approach maintains both efficiency and data confidentiality. Evaluated on a challenging diabetic retinopathy dataset, our method demonstrates significant improvements over traditional CL techniques, including Elastic Weight Consolidation (EWC), Gradient Episodic Memory (GEM), and Learning Without Forgetting (LWF). Extensive experiments across Multi-Layer Perceptron (MLP), Residual, and Attention architectures show consistent performance gains (up to 3.1% in Average Mean Class Accuracy) and reduced forgetting, with only 6% computational overhead. These results highlight our approach’s potential for privacy-preserving, efficient CL in sensitive domains like healthcare, offering a promising direction for developing adaptive AI systems that can learn continuously while respecting data privacy constraints.
Download

Paper Nr: 137
Title:

Detection of Door-Closing Defects by Learning from Physics-Based Simulations

Authors:

Ryoga Takahashi, Yota Yamamoto, Ryosuke Furuta and Yukinobu Taniguchi

Abstract: In this paper, we propose a method that applies physics-based simulations for detecting door-closing defects. Quantitative inspection of industrial products is essential to reduce human errors and variation in inspection results. Door-closing inspections, which now rely on human sensory evaluation, are prime targets for quantification and automation. Developing a visual inspection model based on deep learning requires time-consuming and labor-intensive data collection with dedicated measuring instruments. To eliminate the need for expensive data collection, our proposal uses physics-based simulation data instead of real data to learn the physical relationships. Specifically, we simultaneously learn a binary classification task for normal and defective doors and a task for estimating door-closing energy while sharing parameters, which allows us to learn the relationships between them in a preliminary step. Experiments demonstrate that our method has greater accuracy than existing methods and achieves an accuracy comparable to the method that uses ground-truth data collected with dedicated measuring instruments.
Download

Paper Nr: 156
Title:

Leveraging Vision Language Models for Understanding and Detecting Violence in Videos

Authors:

Jose Alejandro Avellaneda Gonzalez, Tetsu Matsukawa and Einoshin Suzuki

Abstract: Detecting violent behaviors in video content is crucial for public safety and security. Ensuring accurate identification of such behaviors can prevent harm and enhance surveillance. Traditional methods rely on manual feature extraction and classical machine learning algorithms, which lack robustness and adaptability in diverse real-world scenarios. These methods struggle with environmental variability and often fail to generalize across contexts. Due to the nature of violence content, ethical and legal challenges in dataset collection result in a scarcity of data. This limitation impacts modern deep learning approaches, which, despite their effectiveness, often produce models that struggle to generalize well across diverse contexts. To address these challenges, we propose VIVID: Vision-Language Integration for Violence Identification and Detection. VIVID leverages Vision Language Models (VLMs) and a database of violence definitions to mitigate biases in Large Language Models (LLMs) and operates effectively with limited video data. VIVID functions through two steps: key-frame selection based on optical flow to capture high-motion frames, and violence detection using VLMs to translate visual representations into tokens, enabling LLMs to comprehend video content. By incorporating an external database with definitions of violence, VIVID ensures accurate and contextually relevant understanding, addressing inherent biases in LLMs. Experimental results on five datasets—Movies, Surveillance Fight, RWF-2000, Hockey, and XD-Violence—demonstrate that VIVID outperforms LLM-based methods and achieves competitive performance compared with deep learning-based methods, with the added benefit of providing explanations for its detections.
Download

Paper Nr: 167
Title:

Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning

Authors:

Eric Brouwer, Jan Erik van Woerden, Gertjan Burghouts, Matias Valdenegro-Toro and Marco Zullich

Abstract: Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or visual token reliance, the proposed approach leverages a cross-attention mechanism to dynamically refine text prompts for the image at hand. This enables an image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective for datasets with high intra-class variance and low inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, showing significant performance gains over static prompt tuning approaches. To ensure these performance gains translate into trustworthy predictions, we integrate Monte-Carlo Dropout in our approach to improve the reliability of the model predictions and uncertainty estimates. This integration provides valuable insights into the model’s predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. This dynamic approach offers a robust solution, advancing the state-of-the-art for few-shot fine-grained classification.
Download

Paper Nr: 169
Title:

MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification

Authors:

Laura Fieback, Jakob Spiegelberg and Hanno Gottschalk

Abstract: Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks like visual question answering or image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucinations, remain an unsolved problem with regard to the trustworthiness of LVLMs. To address this problem, recent works proposed to incorporate computationally costly Large (Vision) Language Models in order to detect hallucinations on a sentence- or subsentence-level. In this work, we introduce MetaToken, a lightweight binary classifier to detect hallucinations on token-level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs. MetaToken can be applied to any open-source LVLM without any knowledge about ground truth data providing a calibrated detection of hallucinations. We evaluate our method on four state-of-the-art LVLMs outperforming baseline methods by up to 46.50pp in terms of area under precision recall curve values.
Download

Paper Nr: 175
Title:

ReST: High-Precision Soccer Player Tracking via Motion Vector Segmentation

Authors:

Fahad Majeed, Khaled Ahmed Lutf Al Thelaya, Nauman Ullah Gilal, Kamilla Swart-Arries, Marco Agus and Jens Schneider

Abstract: We present a novel real-time framework for the detection, instance segmentation, and tracking of soccer players in video footage. Our method, called ReST, is designed to overcome challenges posed by complex player interactions and occlusions. This is achieved by enhancing video frames by incorporating motion vectors obtained using the Scharr filter and frame differencing. This provides additional shape cues over RGB frames that are not considered in traditional approaches. We use the Generalized Efficient Layer Aggregation Network (GELAN), combining the best qualities of CSPDarknet53 and ELAN as a robust backbone for instance segmentation and tracking. We evaluate our method rigorously on both publicly available and our proprietary (SoccerPro) datasets to validate its performance across diverse soccer video contexts. We train our model concurrently on multiple datasets, thereby improving generalization and reducing dataset bias. Our results demonstrate an impressive 97% accuracy on the DFL Bundesliga Data Shootout, 98% on SoccerNet-Tracking, and 99% on the SoccerPro dataset. These findings underscore the framework’s efficacy and practical relevance for advancing real-time soccer video analysis.
Download

Paper Nr: 186
Title:

Transformer or Mamba for Temporal Action Localization? Insights from a Comprehensive Experimental Comparison Study

Authors:

Zejian Zhang, Cristina Palmero and Sergio Escalera

Abstract: Deep learning models need to encode both local and global temporal dependencies for accurate temporal action localization (TAL). Recent approaches have relied on Transformer blocks, which have quadratic complexity. By contrast, Mamba blocks have been adapted for TAL due to their comparable performance and lower complexity. However, various factors can influence the choice between these models, and a thorough analysis of them can provide valuable insights into the selection process. In this work, we analyze the Transformer block, the Mamba block, and their combinations as temporal feature encoders for TAL, measuring their overall performance, efficiency, and sensitivity across different contexts. Our analysis suggests that Mamba blocks should be preferred due to their performance and efficiency. Hybrid encoders can serve as an alternative choice when sufficient computational resources are available.
Download

Paper Nr: 188
Title:

DeepSpace: Navigating the Frontier of Deepfake Identification Using Attention-Driven Xception and a Task-Specific Subspace

Authors:

Ayush Roy, Sk Mohiuddin, Maxim Minenko, Dmitrii Kaplun and Ram Sarkar

Abstract: The recent advancements in deepfake technology pose significant challenges in detecting manipulated media content and preventing its malicious use in different areas. Using ConvNet feature spaces and fine-tuning them for deepfake classification can lead to unwanted modifications and artifacts in the feature space. To address this, we propose a model that uses Xception as the backbone and a Spatial Attention Module (SAM) to leverage spatial information using shallower features like texture, color, and shape, as well as deeper fine-grained features. We also create a task-specific subspace for projecting spatially enriched features, which boosts the overall model performance. To do this, we utilize Gram-Schmidt orthogonalization on the flattened features of real and fake images to produce the basis vectors for our subspace. We evaluate the proposed method using two widely used and standard deepfake video datasets: FaceForensics++ and Celeb-DF (V2). We conduct experiments following two different setups: intra-dataset (trained and tested on the same dataset) and inter-dataset (trained and tested on separate datasets). The performance of the proposed model is comparable to that of state-of-the-art methods, confirming its robustness and generalization ability. The code is made available at https://github.com/AyushRoy2001/DeepSpace.
Download

Paper Nr: 198
Title:

Self-Supervised Iterative Refinement for Anomaly Detection in Industrial Quality Control

Authors:

Muhammad Aqeel, Shakiba Sharifi, Marco Cristani and Francesco Setti

Abstract: This study introduces the Self-Supervised Iterative Refinement Process (IRP), a robust anomaly detection methodology tailored for high-stakes industrial quality control. The IRP leverages self-supervised learning to improve defect detection accuracy by employing a cyclic data refinement strategy that iteratively removes misleading data points, thereby improving model performance and robustness. We validate the effectiveness of the IRP using two benchmark datasets, Kolektor SDD2 (KSDD2) and MVTec-AD, covering a wide range of industrial products and defect types. Our experimental results demonstrate that the IRP consistently outperforms traditional anomaly detection models, particularly in environments with high noise levels. This study highlights the potential of IRP to significantly enhance anomaly detection processes in industrial settings, effectively managing the challenges of sparse and noisy data.
Download

Paper Nr: 213
Title:

VectorWeaver: Transformers-Based Diffusion Model for Vector Graphics Generation

Authors:

Ivan Jarsky, Maxim Kuzin, Valeria Efimova, Viacheslav Shalamov and Andrey Filchenkov

Abstract: Diffusion models generate realistic results for raster images. However, vector image generation is not as successful because of significant differences in image structure. Unlike raster images, vector ones consist of paths that are described by their coordinates, colors, and stroke widths. The number of paths to be generated is unknown in advance. We tackle the vector image synthesis problem by developing a new diffusion-based model architecture, which we call VectorWeaver, comprising two transformer-based stacked encoders and two transformer-based stacked decoders. For training the model, we collected a vector image dataset from public resources; however, its size was not sufficient, so to enrich and enlarge it we proposed new augmentation operations specific to vector images. To train the model, we designed a specific loss function, which allowed the generation of objects with smooth contours without artifacts. Qualitative experiments demonstrate the superiority and computational efficiency of the proposed model compared to existing vector image generation methods. The vector image generation code is available at https://github.com/CTLab-ITMO/VGLib/tree/main/VectorWeaver.
Download

Paper Nr: 305
Title:

Data-Free Dynamic Compression of CNNs for Tractable Efficiency

Authors:

Lukas Meiner, Jens Mehnert and Alexandru Paul Condurache

Abstract: To reduce the computational cost of convolutional neural networks (CNNs) on resource-constrained devices, structured pruning approaches have shown promise in lowering floating-point operations (FLOPs) without substantial drops in accuracy. However, most methods require fine-tuning or specific training procedures to achieve a reasonable trade-off between retained accuracy and reduction in FLOPs, adding computational overhead and requiring training data to be available. To this end, we propose HASTE (Hashing for Tractable Efficiency), a data-free, plug-and-play convolution module that instantly reduces a network’s test-time inference cost without training or fine-tuning. Our approach utilizes locality-sensitive hashing (LSH) to detect redundancies in the channel dimension of latent feature maps, compressing similar channels to reduce input and filter depth simultaneously, resulting in cheaper convolutions. We demonstrate our approach on the popular vision benchmarks CIFAR-10 and ImageNet, where we achieve a 46.72% reduction in FLOPs with only a 1.25% loss in accuracy by swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
Download

Paper Nr: 337
Title:

Enhancing 3D Human Pose Estimation: A Novel Post-Processing Method

Authors:

Elham Iravani, Frederik Hasecke, Lukas Hahn and Tobias Meisen

Abstract: Human Pose Estimation (HPE) is a critical task in computer vision, involving the prediction of human body joint coordinates from images or videos. Traditional 3D HPE methods often predict joint positions relative to a central body part, such as the hip. Transformer-based models like PoseFormer (Zheng et al., 2021), MHFormer (Li et al., 2022b), and PoseFormerV2 (Zhao et al., 2023) have advanced the field by capturing spatial and temporal relationships to improve prediction accuracy. However, these models primarily output relative joint positions, requiring additional steps for absolute pose estimation. In this work, we present a novel post-processing technique that refines the output of other HPE methods from monocular images. By leveraging projection and spatial constraints, our method enhances the accuracy of relative joint predictions and seamlessly transitions them to absolute poses. Validated on the Human3.6M dataset (Ionescu et al., 2013), our approach demonstrates significant improvements over existing methods, achieving state-of-the-art performance in both relative and absolute 3D human pose estimation. Our method achieves a notable error reduction, with a 33.9% improvement compared to PoseFormer and a 27.2% improvement compared to MHFormer estimations.
Download

Paper Nr: 343
Title:

Temporally Accurate Events Detection Through Ball Possessor Recognition in Soccer

Authors:

Marc Peral, Guillem Capellera, Antonio Rubio, Luis Ferraz, Francesc Moreno-Noguer and Antonio Agudo

Abstract: Recognizing specific actions in soccer games has become an increasingly important research topic. One key area focuses on accurately identifying when passes and receptions occur, as these are frequent actions in games and critical for analysts reviewing match strategies. However, most current methods do not pinpoint when these actions happen precisely enough and often fail to show which player is making the move. Our new method uses video footage to detect passes and receptions and identifies which player is involved in each action by following possession of the ball at each moment. We create video clips, or tubes, for every player on the field, determine who has the ball, and use this information to recognize when these key actions take place. Our results show that our system is better than the latest models at spotting passes and can identify most events with a temporal accuracy down to 0.6 seconds.
Download

Paper Nr: 375
Title:

Improving Image Classification Tasks Using Fused Embeddings and Multimodal Models

Authors:

Artur A. Oliveira, Mateus Espadoto, Roberto Hirata Jr. and Roberto M. Cesar Jr.

Abstract: In this paper, we address the challenge of flexible and scalable image classification by leveraging CLIP embeddings, a pre-trained multimodal model. Our novel strategy uses tailored textual prompts (e.g., “This is digit 9”, “This is even/odd”) to generate and fuse embeddings from both images and prompts, followed by clustering for classification. We present a prompt-guided embedding strategy that dynamically aligns multimodal representations to task-specific or grouped semantics, enhancing the utility of models like CLIP in clustering and constrained classification workflows. Additionally, we evaluate the embedding structures through clustering, classification, and t-SNE visualization, demonstrating the impact of prompts on embedding space separability and alignment. Our findings underscore CLIP’s potential for flexible and scalable image classification, supporting zero-shot scenarios without the need for retraining.
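As a rough illustration of prompt-guided embedding fusion followed by clustering, the sketch below uses the Hugging Face CLIP API; the checkpoint, the example prompts, the fusion-by-averaging rule, and the placeholder images are assumptions rather than the paper's exact pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from sklearn.cluster import KMeans
from PIL import Image

# Load a public CLIP checkpoint (assumed; any CLIP variant would do).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["This is an even digit", "This is an odd digit"]   # example grouped prompts
images = [Image.new("RGB", (224, 224)) for _ in range(8)]     # placeholder images

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)

img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Fuse each image embedding with its most similar prompt embedding (one option).
sims = img_emb @ txt_emb.T
fused = (img_emb + txt_emb[sims.argmax(dim=1)]) / 2

# Cluster the fused embeddings; cluster ids serve as (grouped) class predictions.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(fused.numpy())
```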
Download

Paper Nr: 417
Title:

Paint Blob Detection and Decoding for Identification of Honey Bees

Authors:

Andrea P. Gómez-Jaime, Luke Meyers, Josué A. Rodríguez-Cordero, José L. Agosto-Rivera, Tugrul Giray and Rémi Mégret

Abstract: This paper evaluates a new method for the automated re-identification of honey bees marked with paint codes using fewer annotations than previous methods. Monitoring honey bees and understanding their biology can benefit from studies that measure traits at the individual level, requiring methods for re-identification. Marking with colored paint is one method used by biologists for re-identification in the field because it is noninvasive and readable by humans. This work uses the YOLOv8 object detection approach to detect and classify colored paint markings. A new algorithm to decode the identity based on bi-color left/right paint code is proposed. The proposed approach was evaluated on an extensive dataset with 64 distinct color code identities composed of combinations of 8 different colors, with the test set featuring over 4000 images of 64 unseen individuals. The proposed approach reached 93% top-1 accuracy in the recognition of 1 vs 64 identities, achieving better performance than previous methods while requiring fewer annotated images per identity. The proposed approach also provides insights into the factors affecting re-identification accuracy, such as illumination and paint color combinations, facilitating improved experimental design and data collection strategies for future insect monitoring applications.
Download

Short Papers
Paper Nr: 20
Title:

Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model

Authors:

Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa and Kunihito Kato

Abstract: We propose a general visual inspection model using a Vision-Language Model (VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. Thus, we construct a dataset consisting of diverse images of non-defective and defective products collected from the web, along with unified formatted output text, and fine-tune the VLM. For new products, our method employs In-Context Learning, which allows the model to perform inspections with an example of a non-defective or defective image and the corresponding explanatory texts with visual prompts. This approach eliminates the need to collect a large number of training samples and re-train the model for each product. The experimental results show that our method achieves high performance, with an MCC of 0.804 and an F1-score of 0.950 on MVTec AD in a one-shot manner. Our code is available at https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.
Download

Paper Nr: 28
Title:

CodeSCAN: ScreenCast ANalysis for Video Programming Tutorials

Authors:

Alexander Naumann, Felix Hertlein, Jacqueline Höllig, Lucas Cazzonelli and Steffen Thoma

Abstract: Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving both novices and experienced developers. However, the video format of these tutorials presents a challenge due to the difficulty of searching for and within videos. Addressing the absence of large-scale and diverse datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual Studio Code environment during development, featuring 24 programming languages, 25 fonts, and over 90 distinct themes, in addition to diverse layout changes and realistic user interactions. Moreover, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of Integrated Development Environment (IDE) element detection, color-to-black-and-white conversion, and Optical Character Recognition (OCR). We hope that our contributions facilitate more research in coding screencast analysis, and we make the source code for creating the dataset and the benchmark publicly available at a-nau.github.io/codescan.
Download

Paper Nr: 55
Title:

Spiideo SoccerNet SynLoc: Single Frame World Coordinate Athlete Detection and Localization with Synthetic Data

Authors:

Håkan Ardö, Mikael Nilsson, Anthony Cioppa, Floriane Magera, Silvio Giancola, Haochen Liu, Bernard Ghanem and Marc Van Droogenbroeck

Abstract: Currently, most research and public datasets for video sports analytics are based on detecting players as bounding boxes in broadcast videos. Going from there to precise locations on the pitch is, however, hard. Modern solutions are making dedicated static cameras covering the entire pitch more readily accessible, and they are now used more and more even in lower tiers. To promote research that can take advantage of such cameras and produce more precise pitch locations, we introduce the Spiideo SoccerNet SynLoc dataset. It consists of synthetic athletes rendered on top of images from real-world installations of such cameras. We also introduce a new task of detecting players in the world pitch coordinate system and a new metric based solely on real-world physical properties, where the representation in the image is irrelevant. The dataset and code are publicly available at https://github.com/Spiideo/sskit.
Download

Paper Nr: 69
Title:

Deep Image Clustering with Model-Agnostic Meta-Learning

Authors:

Kim Bjerge, Paul Bodesheim and Henrik Karstoft

Abstract: Deep clustering has proven successful in analyzing complex, high-dimensional real-world data. Typically, features are extracted from a deep neural network and then clustered. However, training the network to extract features that can be clustered efficiently in a semantically meaningful way is particularly challenging when data is sparse. In this paper, we present a semi-supervised method to fine-tune a deep learning network using Model-Agnostic Meta-Learning, commonly employed in Few-Shot Learning. We apply episodic training with a novel multivariate scatter loss, designed to enhance inter-class feature separation while minimizing intra-class variance, thereby improving overall clustering performance. Our approach works with state-of-the-art deep learning models, spanning convolutional neural networks and vision transformers, as well as different clustering algorithms like K-means and Spectral clustering. The effectiveness of our method is tested on several commonly used Few-Shot Learning datasets, where episodic fine-tuning with our multivariate scatter loss and a ConvNeXt backbone outperforms other models, achieving adjusted rand index scores of 89.7% on the EU moths dataset and 86.9% on the Caltech birds dataset, respectively. Hence, our proposed method can be applied across various practical domains, such as clustering images of animal species in biology.
Download

Paper Nr: 97
Title:

Classifier Ensemble for Efficient Uncertainty Calibration of Deep Neural Networks for Image Classification

Authors:

Michael Schulze, Nikolas Ebert, Laurenz Reichardt and Oliver Wasenmüller

Abstract: This paper investigates novel classifier ensemble techniques for uncertainty calibration applied to various deep neural networks for image classification. We evaluate both accuracy and calibration metrics, focusing on Expected Calibration Error (ECE) and Maximum Calibration Error (MCE). Our work compares different methods for building simple yet efficient classifier ensembles, including majority voting and several metamodel-based approaches. Our evaluation reveals that while state-of-the-art deep neural networks for image classification achieve high accuracy on standard datasets, they frequently suffer from significant calibration errors. Basic ensemble techniques like majority voting provide modest improvements, while metamodel-based ensembles consistently reduce ECE and MCE across all architectures. Notably, the largest of our compared metamodels demonstrate the most substantial calibration improvements, with minimal impact on accuracy. Moreover, classifier ensembles with metamodels outperform traditional model ensembles in calibration performance, while requiring significantly fewer parameters. In comparison to traditional post-hoc calibration methods, our approach removes the need for a separate calibration dataset. These findings underscore the potential of our proposed metamodel-based classifier ensembles as an efficient and effective approach to improving model calibration, thereby contributing to more reliable deep learning systems.
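For reference, ECE and MCE as used above are standard binned calibration metrics; the following Python sketch shows how both are typically computed (the bin count and the toy data are illustrative, not the paper's setup).

```python
import numpy as np

def calibration_errors(confidences, predictions, labels, n_bins=15):
    """Binned calibration: ECE is the size-weighted average |accuracy - confidence|
    over confidence bins; MCE is the maximum such gap over the bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.sum() == 0:
            continue
        acc = (predictions[in_bin] == labels[in_bin]).mean()
        conf = confidences[in_bin].mean()
        gap = abs(acc - conf)
        ece += in_bin.mean() * gap           # weight by the fraction of samples in the bin
        mce = max(mce, gap)
    return ece, mce

# Illustrative usage with random class probabilities for 500 samples, 10 classes:
probs = np.random.dirichlet(np.ones(10), size=500)
conf, preds = probs.max(axis=1), probs.argmax(axis=1)
labels = np.random.randint(0, 10, size=500)
print(calibration_errors(conf, preds, labels))
```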
Download

Paper Nr: 99
Title:

Conditioned Generative AI for Synthetic Training of 6D Object Pose Detection

Authors:

Mathijs Lens, Aaron Van Campenhout and Toon Goedemé

Abstract: In this paper, we propose a method to generate synthetic training images for a more complex computer vision task compared to image classification, specifically 6D object pose detection. We demonstrate that conditioned diffusion models can generate unlimited training images for training an object pose detection model for a custom object type. Moreover, we investigate the potential of (automatically) filtering out ill-produced images in the dataset, which increases the quality of the image dataset, and show the importance of finetuning the trained model with a limited amount of real-world images to bridge the remaining sim2real domain gap. We demonstrate our pipeline in the use case of parcel box detection for the automation of delivery vans. All code is publicly available on our GitLab https://gitlab.com/EAVISE/avc/generative-ai-synthetic-training-pose-detection.
Download

Paper Nr: 110
Title:

Deep Local Feature Matching Image Anomaly Detection with Patch Adaptive Average Pooling Technique

Authors:

Afshin Dini and Esa Rahtu

Abstract: We present a new visual defect detection approach based on a deep feature-matching model and a patch adaptive technique. The main idea is to utilize a pre-trained feature-matching model to identify the training sample(s) most similar to each test sample. By applying patch-adaptive average pooling to the extracted features and defining an anomaly map using a pixel-wise Mahalanobis distance between the normal and test features, anomalies can be detected properly. By evaluating our method on the MVTec dataset, we find that it has several advantages over similar techniques: (1) it skips the training phase and the difficulties of fine-tuning model parameters that may vary from one dataset to another, (2) it performs quite well on datasets with only a few training samples, reducing the costs of collecting large training datasets in real-world applications, (3) it can automatically adjust itself without compromising performance under shifts in the data domain, and (4) the model’s performance is better than similar state-of-the-art methods.
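As an illustration of the scoring step described above, the following sketch computes a per-location Mahalanobis anomaly map from nominal feature statistics; the feature shapes and the covariance regularization are assumptions, and the feature-matching and patch-adaptive pooling stages are omitted.

```python
import numpy as np

def mahalanobis_anomaly_map(normal_feats, test_feats, eps=1e-2):
    """normal_feats: (N, H, W, C) features from nominal training images,
    test_feats: (H, W, C) features of one test image. Fit a Gaussian per
    spatial location and score each test pixel by its Mahalanobis distance."""
    n, h, w, c = normal_feats.shape
    amap = np.zeros((h, w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            x = normal_feats[:, i, j, :]                      # (N, C) nominal features here
            mu = x.mean(axis=0)
            cov = np.cov(x, rowvar=False) + eps * np.eye(c)   # regularized covariance
            inv = np.linalg.inv(cov)
            d = test_feats[i, j] - mu
            amap[i, j] = np.sqrt(d @ inv @ d)
    return amap

# Illustrative usage with random features (50 nominal images, 28x28x64 feature maps):
amap = mahalanobis_anomaly_map(np.random.randn(50, 28, 28, 64).astype(np.float32),
                               np.random.randn(28, 28, 64).astype(np.float32))
```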
Download

Paper Nr: 116
Title:

CTypiClust: Confidence-Aware Typical Clustering for Budget-Agnostic Active Learning with Confidence Calibration

Authors:

Takuya Okano, Yohei Minekawa and Miki Hayakawa

Abstract: Active Learning (AL) has been widely studied to reduce annotation costs in deep learning. In AL, the appropriate method varies depending on the number of annotatable data (budget). In low-budget settings, it is appropriate to prioritize sampling typical data, while in high-budget settings, it is better to prioritize sampling data with high uncertainty. This study proposes Confidence-aware Typical Clustering (CTypiClust), an AL method that performs well regardless of the budget. CTypiClust dynamically switches between typical data sampling and low-confidence data sampling based on confidence. Additionally, to mitigate the overconfidence problem in low-budget settings, we propose a new confidence calibration method Cluster-Enhanced Confidence (CEC). By applying CEC to CTypiClust, we suppress the occurrence of overconfidence in low-budget settings. To evaluate the effectiveness of the proposed method, we conducted experiments using multiple benchmark datasets, and confirmed that CTypiClust consistently shows high performance regardless of the budget.
Download

Paper Nr: 139
Title:

Uncertainty Estimation for Super-Resolution Using ESRGAN

Authors:

Maniraj Sai Adapa, Marco Zullich and Matias Valdenegro-Toro

Abstract: Deep Learning-based image super-resolution (SR) has been gaining traction with the aid of Generative Adversarial Networks. Models like SRGAN and ESRGAN are consistently ranked among the best image SR tools. However, they lack principled ways of estimating predictive uncertainty. In the present work, we enhance these models using Monte Carlo Dropout and Deep Ensembles, allowing the computation of predictive uncertainty. When coupled with a prediction, uncertainty estimates can provide more information to the model users, highlighting pixels where the SR output might be uncertain, and hence potentially inaccurate, provided these estimates are reliable. Our findings suggest that these uncertainty estimates are decently calibrated and can hence fulfill this goal, while causing no performance drop with respect to the corresponding models without uncertainty estimation.
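The Monte Carlo Dropout variant described above can be approximated generically as follows; the toy generator and the number of stochastic passes are assumptions, not the authors' ESRGAN setup.

```python
import torch
import torch.nn as nn

def mc_dropout_sr(model, lr_image, n_samples=20):
    """Keep dropout active at test time and run several stochastic forward
    passes; the per-pixel mean is the SR prediction and the per-pixel std
    an uncertainty estimate. Assumes the generator contains Dropout layers."""
    model.eval()
    for m in model.modules():                      # re-enable only the dropout layers
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(lr_image) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

# Illustrative usage with a toy "generator" containing a dropout layer:
toy = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Dropout2d(0.2), nn.Conv2d(16, 3, 3, padding=1))
mean_sr, uncertainty = mc_dropout_sr(toy, torch.rand(1, 3, 32, 32))
```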
Download

Paper Nr: 154
Title:

Inductive Self-Supervised Dimensionality Reduction for Image Retrieval

Authors:

Deryk Willyan Biotto, Guilherme Henrique Jardim, Vinicius Atsushi Sato Kawai, Bionda Rozin, Denis Henrique Pinheiro Salvadeo and Daniel Carlos Guimarães Pedronette

Abstract: The exponential growth of multimedia data creates a pressing need for approaches that can efficiently handle Content-Based Image Retrieval (CBIR) in large and continuously evolving datasets. Dimensionality reduction techniques, such as t-SNE and UMAP, have been widely used to transform high-dimensional features into more discriminative, low-dimensional representations. These transformations improve the effectiveness of retrieval systems by not only preserving but also enhancing the underlying structure of the data. However, their transductive nature requires access to the entire dataset during the reduction process, limiting their use in dynamic environments where data is constantly added. In this paper, we propose ISSDiR, a self-supervised, inductive dimensionality reduction method that generalizes to unseen data, offering a practical solution for continuously expanding datasets. Our approach integrates neural network-based feature extraction with clustering-based pseudo-labels and introduces a hybrid loss function that combines cross-entropy and contrastive loss, weighted by cluster distances. Extensive experiments demonstrate the competitive performance of the proposed method on multiple datasets. This indicates its potential to contribute to the field of image retrieval by introducing a novel inductive approach specifically designed for dimensionality reduction in retrieval tasks.
Download

Paper Nr: 173
Title:

A Method for Detecting Hands Moving Objects from Videos

Authors:

Rikuto Konishi, Toru Abe and Takuo Suganuma

Abstract: In this paper, we propose a novel method for recognizing, from video, the human action of moving an object with the hands. Hand-object interaction plays a central role in human-object interaction, and the action of moving an object with the hand is also important as a reliable clue that a person is touching and affecting the object. To detect such specific actions, detection model training and model-based detection can be made more efficient by using features designed to appropriately integrate different types of information obtained from the video. The proposed method builds on the observation that an object moved by a hand shows movements similar to those of the forearm. Using this knowledge, our method integrates skeleton and motion information of the person obtained from the video to evaluate the difference in movement between the forearm region and the region surrounding the hand, and detects a hand moving an object by determining, from these differences, whether movements similar to those of the forearm occur around the hand.
Download

Paper Nr: 174
Title:

Rescuing Easy Samples in Self-Supervised Pretraining

Authors:

Qin Wang, Kai Krajsek and Hanno Scharr

Abstract: Many recent self-supervised pretraining methods use augmented versions of the same image as samples for their learning schemes. We observe that ’easy’ samples, i.e. samples that are too similar to each other after augmentation, have only limited value as a learning signal. We therefore propose to rescue easy samples and make them harder. To do so, we select the top k easiest samples using cosine similarity, strongly augment them, forward-pass them through the model, calculate the cosine similarity of the output as a loss, and add it to the original loss in a weighted fashion. This method can be applied to all contrastive or other augmented-pair based learning methods, whether they involve negative pairs or not, as it only changes the handling of easy positives. This simple but effective approach introduces greater variability into such self-supervised pretraining processes, significantly increasing the performance on various downstream tasks as observed in our experiments. We pretrain models of different sizes, i.e. ResNet-50, ViT-S, ViT-B, or ViT-L, using ImageNet with SimCLR, MoCo v3, or DINOv2 training schemes. Here, e.g., we consistently find improved results for ImageNet top-1 accuracy with a linear classifier, establishing a new SOTA for this task.
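One plausible reading of the easy-sample rescue step is sketched below in PyTorch; the top-k size, the stand-in encoder, the noise-based "strong augmentation", and the loss weighting are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def easy_sample_rescue_loss(z1, z2, model, strong_aug, x, k=16, weight=0.5):
    """z1, z2: L2-normalized embeddings of two augmented views of batch x.
    Pairs with the highest cosine similarity are treated as 'easy'; their
    source images are re-augmented strongly and pushed through the model,
    and a similarity-based term on the new views is added as an extra loss."""
    sim = (z1 * z2).sum(dim=1)                        # per-pair cosine similarity
    easy_idx = sim.topk(k).indices                    # top-k easiest pairs
    xa, xb = strong_aug(x[easy_idx]), strong_aug(x[easy_idx])
    za = F.normalize(model(xa), dim=1)
    zb = F.normalize(model(xb), dim=1)
    rescue = (1.0 - (za * zb).sum(dim=1)).mean()      # make the hardened pairs agree
    return weight * rescue                            # added to the original SSL loss

# Illustrative usage (the encoder and the strong augmentation are placeholders):
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
strong_aug = lambda t: t + 0.3 * torch.randn_like(t)  # stand-in for heavy augmentation
x = torch.rand(64, 3, 32, 32)
z1 = F.normalize(model(strong_aug(x)), dim=1)
z2 = F.normalize(model(strong_aug(x)), dim=1)
extra_loss = easy_sample_rescue_loss(z1, z2, model, strong_aug, x)
```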
Download

Paper Nr: 180
Title:

Knowledge Amalgamation for Single-Shot Context-Aware Emotion Recognition

Authors:

Tristan Cladière, Olivier Alata, Christophe Ducottet, Hubert Konik and Anne-Claire Legrand

Abstract: Fine-grained emotion recognition using the whole context inside images is a challenging task. Usually, the approaches to solve this problem analyze the scene from different aspects, for example people, places, objects, or interactions, and make a final prediction that takes all this information into account. Despite giving promising results, this requires specialized pre-trained models and multiple pre-processing steps, which inevitably results in long and complex frameworks. To obtain a more practicable solution that works in real-time scenarios with limited resources, we propose a method inspired by the amalgamation process to incorporate specialized knowledge from multiple teachers inside a student composed of a single architecture. Moreover, the student is not only capable of treating all subjects simultaneously by creating emotion maps, but also of detecting the subjects in a bottom-up manner. We also compare our approach with the traditional method of fine-tuning pre-trained models, and show its superiority on two databases used in the context-aware emotion recognition field.
Download

Paper Nr: 183
Title:

Handling Drift in Industrial Defect Detection Through MMD-Based Domain Adaptation

Authors:

Xuban Barberena, Fátima A. Saiz and Iñigo Barandiaran

Abstract: This study enhances industrial quality control by automating defect detection using artificial vision and deep learning techniques. It addresses the challenge of model drift, where variations in input data distribution affect performance. To tackle this, the paper proposes a simpler, practical approach to unsupervised Domain Adaptation (UDA) for object detection, focusing on industrial applicability. A technique based on the Faster R-CNN architecture and a Maximum Mean Discrepancy (MMD) regularization method for feature alignment is proposed. The study aims to detect data drift using state-of-the-art methods and evaluate the proposed UDA technique’s effectiveness in improving surface defect detection. Results show that statistical tests effectively identify variations, enabling timely adaptations. The proposed UDA method achieved mean Average Precision (mAP50) improvements of 3.1% and 6.1% under vibration and noise scenarios, respectively, and a significant 17.8% improvement for conditions with particles, advancing existing methods in the literature.
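For illustration, a minimal PyTorch sketch of a Gaussian-kernel MMD regularizer between source and target features, as one hypothetical way to implement the feature alignment described above (not the authors' code; the names and kernel choice are assumptions):

```python
import torch

def mmd_loss(source_feats, target_feats, sigma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel.

    source_feats: (Ns, D) features from the labeled (source) domain
    target_feats: (Nt, D) features from the drifted (target) domain
    """
    def rbf(a, b):
        dists = torch.cdist(a, b) ** 2
        return torch.exp(-dists / (2 * sigma ** 2))

    k_ss = rbf(source_feats, source_feats).mean()
    k_tt = rbf(target_feats, target_feats).mean()
    k_st = rbf(source_feats, target_feats).mean()
    # MMD^2 = E[k(s,s)] + E[k(t,t)] - 2 E[k(s,t)]; added to the detection loss as a regularizer.
    return k_ss + k_tt - 2 * k_st
```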
Download

Paper Nr: 211
Title:

Membership Inference Attacks for Face Images Against Fine-Tuned Latent Diffusion Models

Authors:

Lauritz Christian Holme, Anton Mosquera Storgaard and Siavash Arjomand Bigdeli

Abstract: The rise of generative image models leads to privacy concerns when it comes to the huge datasets used to train such models. This paper investigates the possibility of inferring whether a set of face images was used for fine-tuning a Latent Diffusion Model (LDM). A Membership Inference Attack (MIA) method is presented for this task. Using generated auxiliary data for the training of the attack model leads to significantly better performance, and so does the use of watermarks. The guidance scale used for inference was found to have a significant influence. If an LDM is fine-tuned for long enough, the text prompt used for inference has no significant influence. The proposed MIA is found to be viable in a realistic black-box setup against LDMs fine-tuned on face images.
Download

Paper Nr: 232
Title:

VLLM Guided Human-Like Guidance Navigation Generation

Authors:

Masaki Nambata, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Takehito Teraguchi, Shota Okubo and Takuya Nanri

Abstract: In the field of Advanced Driver Assistance Systems (ADAS), car navigation systems have become an essential part of modern driving. However, the guidance provided by existing car navigation systems is often difficult for drivers to understand solely through voice instructions. This challenge has led to growing interest in Human-like Guidance (HLG), a task focused on delivering intuitive navigation instructions that mimic the way a passenger would guide a driver. Despite this, previous studies have used rule-based systems to generate HLG datasets, which has resulted in inflexible, low-quality datasets due to limited textual representation. In contrast, high-quality datasets are crucial for improving model performance. In this study, we propose a method to automatically generate high-quality navigation sentences from image data using a Large Language Model with a novel prompting approach. Additionally, we introduce a Mixture of Experts (MoE) framework for data cleaning to filter out unreliable data. The resulting dataset is both expressive and consistent. Furthermore, our proposed MoE evaluation framework makes it possible to perform appropriate evaluation from multiple perspectives, even for complex tasks such as HLG.
Download

Paper Nr: 243
Title:

CLIP-MDGAN: Multi-Discriminator GAN Using CLIP Task Allocation

Authors:

Shonosuke Gonda, Fumihiko Sakaue and Jun Sato

Abstract: In a Generative Adversarial Network (GAN), in which the generator and discriminator learn adversarially, the performance of the generator can be improved by improving the discriminator’s discriminative ability. Thus, in this paper, we propose a method to improve the generator’s generative ability by adversarially training a single generator with multiple discriminators, each with different expertise. Because each discriminator has different expertise, the overall discriminative ability is improved, which in turn improves the generator’s performance. However, it is not easy to give multiple discriminators independent expertise. To address this, we propose CLIP-MDGAN, which leverages CLIP, a large-scale learning model that has recently attracted a lot of attention, to classify a dataset into multiple classes with different visual features. Based on the CLIP-based classification, each discriminator is assigned a specific subset of images to promote the development of independent expertise. Furthermore, we introduce a method to gradually increase the number of discriminators during adversarial training to reduce instability in training multiple discriminators and reduce training costs.
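As a hypothetical illustration of CLIP-based task allocation, the following sketch assigns each training image to the discriminator whose text prompt it matches best; the prompts, model variant, and allocation rule are assumptions, not the authors' procedure:

```python
import torch
import clip  # OpenAI CLIP package

@torch.no_grad()
def allocate_to_discriminators(images, prompts, device="cuda"):
    """Assign each image to the discriminator whose text prompt it matches best.

    images:  preprocessed image tensor (B, 3, 224, 224)
    prompts: one descriptive text prompt per discriminator
    Returns a (B,) tensor of discriminator indices.
    """
    model, _ = clip.load("ViT-B/32", device=device)
    text = clip.tokenize(prompts).to(device)
    image_feats = model.encode_image(images.to(device)).float()
    text_feats = model.encode_text(text).float()
    # Cosine similarity between every image and every prompt; argmax picks the subset.
    sim = torch.nn.functional.normalize(image_feats, dim=-1) @ \
          torch.nn.functional.normalize(text_feats, dim=-1).T
    return sim.argmax(dim=1)
```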
Download

Paper Nr: 289
Title:

Neighbor Embedding Projection and Graph Convolutional Networks for Image Classification

Authors:

Gustavo Rosseto Leticio, Vinicius Atsushi Sato Kawai, Lucas Pascotti Valem and Daniel Carlos Guimarães Pedronette

Abstract: The exponential increase in image data has heightened the need for machine learning applications, particularly in image classification across various fields. However, while data volume has surged, the availability of labeled data remains limited due to the costly and time-intensive nature of labeling. Semi-supervised learning offers a promising solution by utilizing both labeled and unlabeled data; it employs a small amount of labeled data to guide learning on a larger unlabeled set, thus reducing the dependency on extensive labeling efforts. Graph Convolutional Networks (GCNs) introduce an effective method by applying convolutions in graph space, allowing information propagation across connected nodes. This technique captures individual node features and inter-node relationships, facilitating the discovery of intricate patterns in graph-structured data. Despite their potential, GCNs remain underutilized in image data scenarios, where input graphs are often computed using features extracted from pre-trained models without further enhancement. This work proposes a novel GCN-based approach for image classification, incorporating neighbor embedding projection techniques to refine the similarity graph and improve the latent feature space. Similarity learning approaches, commonly employed in image retrieval, are also integrated into our workflow. Experimental evaluations across three datasets, four feature extractors, and three GCN models revealed superior results in most scenarios.
Download

Paper Nr: 293
Title:

Graph Convolutional Networks and Particle Competition and Cooperation for Semi-Supervised Learning

Authors:

Gustavo Rosseto Leticio, Matheus Henrique Jacob dos Santos, Lucas Pascotti Valem, Vinicius Atsushi Sato Kawai, Fabricio Aparecido Breve and Daniel Carlos Guimarães Pedronette

Abstract: Given the substantial challenges associated with obtaining labeled data, including high costs, time consumption, and the frequent need for expert involvement, semi-supervised learning has garnered increased attention. In these scenarios, Graph Convolutional Networks (GCNs) offer an attractive and promising solution, as they can effectively leverage labeled and unlabeled data for classification. Through their ability to capture complex relationships within data, GCNs provide a powerful framework for tasks that rely on limited labeled information. There are also other promising approaches that exploit the graph structure for more effective learning, such as the Particle Competition and Cooperation (PCC), an algorithm that models label propagation through particles that compete and cooperate on a graph constructed from the data, exploiting similarity relationships between instances. In this work, we propose a novel approach that combines PCC, GCN, and dimensionality reduction approaches for improved classification performance. The experimental results showed that our method provided gains in most cases.
Download

Paper Nr: 23
Title:

Action Tube Generation by Person Query Matching for Spatio-Temporal Action Detection

Authors:

Kazuki Omi, Jion Oshima and Toru Tamaki

Abstract: This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames compared to queries for different people. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip. Experimental results on JHMDB, UCF101-24 and AVA datasets demonstrate that our method performs well for large position changes of people while offering superior computational efficiency and lower resource requirements.
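A hypothetical sketch of matching per-frame person queries across consecutive frames by embedding similarity (a greedy assignment is shown for brevity; this is an illustration, not the authors' Query Matching Module):

```python
import torch
import torch.nn.functional as F

def match_queries(queries_t, queries_t1, sim_threshold=0.5):
    """Greedily link person queries of frame t to frame t+1 by cosine similarity.

    queries_t, queries_t1: (N, D) and (M, D) query embeddings
    Returns a list of (i, j) index pairs linking the same person across frames.
    """
    sim = F.normalize(queries_t, dim=1) @ F.normalize(queries_t1, dim=1).T  # (N, M)
    matches, used = [], set()
    for i in range(sim.size(0)):
        j = int(sim[i].argmax())
        if sim[i, j] >= sim_threshold and j not in used:
            matches.append((i, j))
            used.add(j)
    return matches
```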
Download

Paper Nr: 84
Title:

Improving Periocular Recognition Accuracy: Opposite Side Learning Suppression and Vertical Image Inversion

Authors:

Masakazu Fujio, Yosuke Kaga and Kenta Takahashi

Abstract: Periocular recognition has emerged as an effective biometric identification method in recent years, particularly when the face is partially occluded or the iris image is unavailable. This paper proposes a deep learning-based periocular recognition method specifically designed to address the overlooked issue of simultaneously training left and right periocular images from the same person. Our proposed method enhances recognition accuracy by identifying the eye side, applying a vertical flip during training and inference, and stopping backpropagation for the opposite side of the current periocular image. Experimental results on visible and NIR image datasets, using six different off-the-shelf deep CNN models, demonstrate an approximate 1∼2% improvement in recognition accuracies compared to conventional approaches that employ a horizontal flip to align the appearance of the right and left eyes. The proposed approach’s performance was compared with state-of-the-art methods in the literature on unconstrained periocular datasets, including CASIA-Iris-Distance and UBIPr. The experimental results indicated that our approach consistently outperformed the state-of-the-art methods on these datasets. From the perspective of implementation costs, the proposed method is applied during training and does not affect the computational complexity during inference. Moreover, during training, the method only sets the gradient values of the periocular image class of the opposite side to zero, thus having a minimal impact on the computational cost. It can be combined easily with other periocular authentication methods.
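As a rough illustration of the gradient-suppression idea, the sketch below detaches the logit of the opposite-side class so no gradient flows through it during training; the function and variable names are assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def opposite_side_suppressed_ce(logits, target_class, opposite_class):
    """Cross-entropy where the opposite-side class receives no gradient.

    logits:         (B, C) classifier outputs
    target_class:   (B,) class index of the correct (same-side) periocular class
    opposite_class: (B,) class index of the same identity's opposite eye
    """
    # Detach the opposite-side logits so backpropagation for that class is stopped.
    suppressed = logits.clone()
    batch_idx = torch.arange(logits.size(0), device=logits.device)
    suppressed[batch_idx, opposite_class] = logits[batch_idx, opposite_class].detach()

    return F.cross_entropy(suppressed, target_class)
```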
Download

Paper Nr: 92
Title:

Segment-Level Road Obstacle Detection Using Visual Foundation Model Priors and Likelihood Ratios

Authors:

Youssef Shoeb, Nazir Nayal, Azarm Nowzad, Fatma Güney and Hanno Gottschalk

Abstract: Detecting road obstacles is essential for autonomous vehicles to navigate dynamic and complex traffic environments safely. Current road obstacle detection methods typically assign a score to each pixel and apply a threshold to generate final predictions. However, selecting an appropriate threshold is challenging, and the per-pixel classification approach often leads to fragmented predictions with numerous false positives. In this work, we propose a novel method that leverages segment-level features from visual foundation models and likelihood ratios to predict road obstacles directly. By focusing on segments rather than individual pixels, our approach enhances detection accuracy, reduces false positives, and offers increased robustness to scene variability. We benchmark our approach against existing methods on the RoadObstacle and LostAndFound datasets, achieving state-of-the-art performance without needing a predefined threshold.
Download

Paper Nr: 125
Title:

Neural Network Meta Classifier: Improving the Reliability of Anomaly Segmentation

Authors:

Jurica Runtas and Tomislav Petković

Abstract: Deep neural networks (DNNs) are a contemporary solution for semantic segmentation and are usually trained to operate on a predefined closed set of classes. In open-set environments, it is possible to encounter semantically unknown objects or anomalies. Road driving is an example of such an environment in which, from a safety standpoint, it is important to ensure that a DNN indicates it is operating outside of its learned semantic domain. One possible approach to anomaly segmentation is entropy maximization, paired with a logistic regression based post-processing step called meta classification that is used to improve the reliability of the detection of anomalous pixels. We propose to substitute the logistic regression meta classifier with a more expressive lightweight fully connected neural network. We analyze the advantages and drawbacks of the proposed neural network meta classifier and demonstrate its better performance over logistic regression. We also introduce the concept of informative out-of-distribution examples, which we show to improve training results when using entropy maximization in practice. Finally, we discuss the loss of interpretability and show that the behaviors of the logistic regression and the neural network meta classifiers are strongly correlated. The code is publicly available at https://github.com/JuricaRuntas/meta-ood.
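For illustration, a lightweight fully connected meta classifier of the kind described could be defined as follows (hypothetical dimensions and input features, not the authors' code):

```python
import torch.nn as nn

def make_meta_classifier(num_metrics, hidden=32):
    """Small fully connected meta classifier over per-prediction metrics,
    replacing logistic regression for deciding whether a detection is reliable."""
    return nn.Sequential(
        nn.Linear(num_metrics, hidden),
        nn.ReLU(),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, 1),   # logit; apply a sigmoid to obtain a probability
    )
```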
Download

Paper Nr: 132
Title:

New Paths in Document Data Augmentation Using Templates and Language Models

Authors:

Lucas Wojcik, Luiz Coelho, Roger Granada and David Menotti

Abstract: Document Recognition has been tackled with a state of the art (SOTA) mostly composed of multi-modal transformers. Usually, these are trained in an unsupervised pre-training phase followed by a supervised fine-tuning phase where real-world tasks are solved, meaning both model and training procedures are borrowed from NLP research. However, there is a lack of available data with rich annotations for some of these downstream tasks, balanced by the copious amounts of pre-training data available. We can also solve this problem through data augmentation. We present two novel data augmentation methods for documents, each one used in different scopes. The first is based on simple structured graph objects that encode a document’s layout, called templates, used to augment the EPHOIE and NBID datasets. The other one uses a Large Language Model (LLM) to provide alternative versions of the document’s texts, used to augment the FUNSD dataset. These methods create instances by augmenting layout and text together (imageless), and so we use LiLT, a model that deals only with text and layout for validation. We show that our augmentation procedure significantly improves the model’s baseline, opening up many possibilities for future research.
Download

Paper Nr: 151
Title:

Herbicide Efficacy Prediction Based on Object Segmentation of Glasshouse Imagery

Authors:

Majedaldein Almahasneh, Baihua Li, Haibin Cai, Nasir Rajabi, Laura Davies and Qinggang Meng

Abstract: In this work, we explore the possibility of incorporating deep learning (DL) to propose a solution for the herbicidal efficacy prediction problem based on glasshouse (GH) images. Our approach utilises RGB images of treated and control plants to perform the analysis and operates in three stages: 1) plant region detection and 2) leaf segmentation, where growth characteristics of the tested plant are inferred, and 3) herbicide activity estimation, where these metrics are used to estimate the herbicidal activity in a contrastive manner. The model shows a desirable performance across different species and activity levels, with a mean F1-score of 0.950. These results demonstrate the reliability and promising potential of our framework as a solution for herbicide efficacy prediction based on glasshouse images. We also present a semi-automatic plant labelling approach to address the lack of available public datasets for our target task. While existing works focus on plant detection and phenotyping, to the best of our knowledge, our work is the first to tackle the prediction of herbicide activity from GH images using DL.
Download

Paper Nr: 201
Title:

Beyond Data Augmentations: Generalization Abilities of Few-Shot Segmentation Models

Authors:

Muhammad Ahsan, Guy Ben-Yosef and Gemma Roig

Abstract: Few-shot learning in semantic segmentation has gained significant attention recently for its adaptability in applications where only a few or no examples are available as support for training. Here we advocate for a new testing paradigm, which we coin half-shot learning (HSL), that evaluates a model’s ability to generalise to new categories when support objects are partially viewed, significantly cropped, occluded, noised, or aggressively transformed. This new paradigm introduces challenges that will spark advances in the field, allowing us to benchmark existing models and analyze their acquired sense of objectness. Humans are remarkably good at recognizing objects even when partially obstructed. HSL seeks to bridge the gap between human-like perception and machine learning models by forcing them to recognize objects from incomplete, fragmented, or noisy views - just as humans do. We propose a highly augmented image set for HSL that is built by intentionally manipulating PASCAL-5i and COCO-20i to fit this paradigm. Our results reveal the shortcomings of state-of-the-art few-shot learning models and suggest improvements through data augmentation or the incorporation of additional attention-based modules to enhance the generalization capabilities of few-shot semantic segmentation (FSS). To improve the training method, we propose a channel and spatial attention module (Woo et al., 2018), where an FSS model is retrained with the attention module and tested against the highly augmented support information. Our experiments demonstrate that an FSS model trained with the proposed method achieves a significantly higher accuracy (approximately 5%) when exposed to limited or highly cropped support data.
Download

Paper Nr: 216
Title:

Minimizing Number of Distinct Poses for Pose-Invariant Face Recognition

Authors:

Carter Ung, Pranav Mantini and Shishir K. Shah

Abstract: In unconstrained environments, extreme pose variations of the face are a long-standing challenge for person identification systems. The natural occlusion of necessary facial landmarks is a notable contributor to model performance degradation in face recognition. Pose-invariant models are data-hungry and require large variations of pose in training data to achieve comparable accuracy in recognizing faces from extreme viewpoints. However, data collection is expensive and time-consuming, resulting in a scarcity of facial datasets with large pose variations for model training. In this study, we propose a training framework to enhance pose-invariant face recognition by identifying the minimum number of poses for training deep convolutional neural network (CNN) models, enabling higher accuracy with minimum cost for training data. We deploy ArcFace, a state-of-the-art recognition model, as a baseline to evaluate model performance in a probe-gallery matching task across groups of facial poses categorized by pitch and yaw Euler angles. We perform training and evaluation of ArcFace on varying pose bins to determine the rank-1 accuracy and observe how recognition accuracy is affected. Our findings reveal that: (i) a group of poses at -45◦, 0◦, and 45◦ yaw angles achieves uniform rank-1 accuracy across all yaw poses, (ii) recognition performance is better with negative pitch angles than positive pitch angles, and (iii) training with image augmentations like horizontal flips results in similar or better performance, further minimizing yaw poses to a frontal and 3/4 view.
Download

Paper Nr: 244
Title:

Simultaneous Estimation of Driving Intentions for Multiple Vehicles Using Video Transformer

Authors:

Junya Isogawa, Fumihiko Sakaue and Jun Sato

Abstract: In autonomous driving, it is important for the vehicle to appropriately determine the next action to be taken on the road. In complex situations such as on public roads, a better action for the own vehicle can be determined by considering the driving intentions of the other vehicles around it. Thus, in this paper, we propose a method to determine the next action of the own vehicle by simultaneously estimating the next driving intentions of all vehicles, including the other vehicles around the own vehicle. The time series of vehicle motions on the road can be represented as sequential images centered on the vehicle. In this paper, we analyze the sequential images of vehicle trajectories using a Video Transformer and simultaneously predict the driving intentions of all vehicles on the road. In general, driving intentions change over time. Thus, in this research, we first propose a method to predict the next intention, and then extend it to predict the transition of driving intentions over the next few seconds. We also apply our method to predict driving trajectories, and show that the prediction of the driving trajectory can be improved by using the driving intentions estimated by the proposed method.
Download

Paper Nr: 254
Title:

Human Pose Estimation from an Extremely Low-Resolution Image Sequence by Pose Transition Embedding Network

Authors:

Yasutomo Kawanishi, Hitoshi Nishimura and Hiroshi Murase

Abstract: This paper addresses the problem of human pose estimation from an extremely low-resolution (ex-low) image sequence. In an ex-low image (e.g., 16 × 16 pixels), it is challenging, even for human beings, to estimate the human pose smoothly and accurately from a single frame because of the low resolution and noise. This paper proposes a human pose estimation method, named Pose Transition Embedding Network, that considers the temporal continuity of human pose transitions by using a pose-embedded manifold. The method first builds a pose transition manifold from the ground truth of human pose sequences to learn feasible pose transitions using an encoder-decoder model named Pose Transition Encoder-Decoder. Then, an image encoder, named Ex-Low Image Encoder Transformer, encodes an ex-low image sequence into an embedded vector using a transformer-based network. Finally, the estimated human pose is reconstructed using a pose decoder named Pose Transition Decoder. The performance of the method is confirmed by evaluating it on an ex-low human pose dataset generated from a publicly available action recognition dataset.
Download

Paper Nr: 259
Title:

Multi-Scale Foreground-Background Confidence for Out-of-Distribution Segmentation

Authors:

Samuel Marschall and Kira Maag

Abstract: Deep neural networks have shown outstanding performance in computer vision tasks such as semantic segmentation and have defined the state-of-the-art. However, these segmentation models are trained on a closed and predefined set of semantic classes, which leads to significant prediction failures in open-world scenarios on unknown objects. As this behavior prevents the use of such models in safety-critical applications such as automated driving, the detection and segmentation of these objects from outside their predefined semantic space (out-of-distribution (OOD) objects) is of the utmost importance. In this work, we present a multi-scale OOD segmentation method that exploits the confidence information of a foreground-background segmentation model. While semantic segmentation models are trained on specific classes, this restriction does not apply to foreground-background methods, making them suitable for OOD segmentation. We consider the per-pixel confidence score of the model prediction, which is close to 1 for a pixel in a foreground object. By aggregating these confidence values over patches of different sizes, objects of various sizes can be identified in a single image. Our experiments show improved performance of our method in OOD segmentation compared to comparable baselines on the SegmentMeIfYouCan benchmark.
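A minimal sketch of aggregating per-pixel foreground confidence over patches of several sizes via average pooling, as one hypothetical reading of the multi-scale aggregation described above (not the authors' code):

```python
import torch
import torch.nn.functional as F

def multiscale_ood_score(confidence, patch_sizes=(8, 16, 32)):
    """Aggregate per-pixel foreground confidence over multiple patch sizes.

    confidence: (1, 1, H, W) per-pixel foreground confidence in [0, 1]
    Returns a per-pixel score combining all scales.
    """
    scores = []
    for p in patch_sizes:
        # Average the confidence within each p x p patch, then upsample back
        # so every pixel receives the aggregated score of its patch.
        pooled = F.avg_pool2d(confidence, kernel_size=p, stride=p, ceil_mode=True)
        scores.append(F.interpolate(pooled, size=confidence.shape[-2:], mode="nearest"))
    return torch.stack(scores, dim=0).mean(dim=0)
```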
Download

Paper Nr: 268
Title:

Accuracy Improvement of Neuron Concept Discovery Using CLIP with Grad-CAM-Based Attention Regions

Authors:

Takahiro Sannomiya and Kazuhiro Hotta

Abstract: WWW is a method that computes the similarity between image and text features using CLIP and assigns a concept to each neuron of the target model whose behavior is to be determined. However, because this method calculates similarity using a center crop of the images, it may include features that are not related to the original class of the image and may not correctly reflect the similarity between the image and text. Additionally, WWW uses cosine similarity to calculate the similarity between images and text. Cosine similarity can sometimes result in a broad similarity distribution, which may not accurately capture the similarity between vectors. To address these issues, we propose a method that leverages Grad-CAM to crop the model’s attention region, filtering out features unrelated to the original characteristics of the image. By using t-vMF similarity to measure the similarity between the image and text, we achieved a more accurate discovery of neuron concepts.
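Assuming the usual definition of t-vMF similarity, a minimal sketch of scoring CLIP image and text features with it could look as follows (hypothetical parameter values, not the authors' code):

```python
import torch
import torch.nn.functional as F

def t_vmf_similarity(image_feats, text_feats, kappa=16.0):
    """t-vMF similarity between CLIP image and text features.

    A sharper alternative to cosine similarity: scores stay close to -1
    unless the vectors are very well aligned (larger kappa = sharper).
    image_feats: (N, D), text_feats: (M, D); returns an (N, M) similarity matrix.
    """
    cos = F.normalize(image_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T
    return (1.0 + cos) / (1.0 + kappa * (1.0 - cos)) - 1.0
```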
Download

Paper Nr: 275
Title:

Expanding Domain Coverage in Injection Molding Quality Inspection with Physically-Based Synthetic Data

Authors:

Dominik Schraml and Gunther Notni

Abstract: Synthetic data has emerged as a vital tool in computer vision research, yet procedural generation using 3D computer graphics remains underexplored compared to generative adversarial networks (GANs). Our method offers greater control over generated images, making it particularly valuable for domains like industrial quality inspection, where real data is often sparse. We present a method for generating physically based rendered images of an injection-molded cup, simulating two common defects - short shot and color streak. The approach automates defect generation with variable size and severity, along with pixel-perfect segmentation masks, significantly reducing labeling effort. Synthetic data was combined with a small set of real images to train semantic segmentation models and explore domain expansion, such as inspecting parts in novel colors not represented in real-world datasets. Experiments demonstrate that the method enhances defect detection and is especially effective for domain expansion tasks, such as inspecting parts in new colors. However, challenges persist in segmenting smaller defects, underscoring the need for balanced synthetic datasets and probably also for customized loss functions.
Download

Paper Nr: 294
Title:

Exploration and Validation of Specialized Loss Functions for Generative Visual-Thermal Image Domain Transfer

Authors:

Simon Fischer, Benedikt Kottler, Eva Strauß and Dimitri Bulatov

Abstract: This paper presents an enhanced approach to visual-to-thermal image translation using an improved InfraGAN model, incorporating additional loss functions to increase realism and fidelity in generated thermal images. Building on the existing InfraGAN architecture, we introduce perceptual, style, and discrete Fourier transform (DFT) losses, aiming to capture intricate image details and enhance texture and frequency consistency. Our model is trained and evaluated on the FLIR ADAS dataset, which provides paired visual and thermal images across diverse contexts, including urban traffic scenes. To optimize the interplay of loss functions, we employ hyperparameter tuning with the Optuna library, achieving an optimal balance among the components of the loss function. Experimental results show that these modifications lead to significant improvements in the quality of generated thermal images, underscoring the potential of advanced loss functions for domain transfer tasks. This work contributes a refined framework for generating high-quality thermal imagery, with implications for fields such as surveillance, autonomous driving, and facial recognition in challenging environmental conditions.
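For illustration, a minimal sketch of a DFT loss that penalizes differences between generated and reference thermal images in the frequency domain (one hypothetical formulation; the authors' exact definition may differ):

```python
import torch
import torch.nn.functional as F

def dft_loss(generated, target):
    """L1 distance between log-magnitude spectra of generated and target images.

    generated, target: (B, C, H, W) image tensors
    """
    gen_fft = torch.fft.rfft2(generated, norm="ortho")
    tgt_fft = torch.fft.rfft2(target, norm="ortho")
    # Compare log-magnitudes so low- and high-frequency errors both contribute.
    gen_mag = torch.log1p(gen_fft.abs())
    tgt_mag = torch.log1p(tgt_fft.abs())
    return F.l1_loss(gen_mag, tgt_mag)
```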
Download

Paper Nr: 314
Title:

Semi-Supervised Anomaly Detection in Skin Lesion Images

Authors:

Alina Burgert, Babette Dellen, Uwe Jaekel and Dietrich Paulus

Abstract: Semi-supervised anomaly detection is the task of learning the pattern of normal samples and identifying deviations from this pattern as anomalies. This approach is especially helpful in the medical domain, since healthy samples are usually easy to collect and time-intensive annotation of training data is not necessary. In dermatology, the utilization of this approach has not been fully explored yet, since most work is limited to cancer detection, with the normal samples being nevi. This study, instead, investigates the use of semi-supervised anomaly detection methods for skin disease detection and localization. Due to the absence of a benchmark dataset, a custom dataset was created. Based on this dataset, two different models, SimpleNet and an autoencoder, were trained on healthy skin images only. Our experiments show that both models are able to distinguish between normal and abnormal samples of the test dataset, with SimpleNet achieving an AUROC score of 97% and the autoencoder a score of 93%, demonstrating the potential of anomaly detection for dermatological applications. A visual analysis of the corresponding anomaly maps revealed that both models have their own strengths and weaknesses when localizing the abnormal regions.
Download

Paper Nr: 335
Title:

Automatic Detection of the Driver Distractions Based on the Analysis of Face Videos

Authors:

Artur Urzędowski and Kazimierz Choroś

Abstract: The objective of this paper is to propose a new driver fatigue detection method using the Percentage of Mouth Openness (POM) and the Percentage of Eye Closure (PERCLOS), and to show its robustness across various real-world conditions. The openness of the eyes and mouth was determined by calculating Aspect Ratios (AR) and checking whether the AR exceeded a given threshold. Six videos simulating different driving scenarios were recorded to test detection performance under diverse lighting, with and without corrective glasses, and with additional complexities such as blinking lights. Furthermore, the method avoids misclassification during driver activities such as conversing and eating. The method effectively detects fatigue in all test scenarios in which the fatigue state occurred.
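A minimal sketch of the aspect-ratio and PERCLOS-style computations described above, assuming the common 6-point eye landmark scheme (hypothetical threshold values, not the authors' implementation):

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: (6, 2) array of eye landmarks in the common 6-point ordering."""
    v1 = np.linalg.norm(eye[1] - eye[5])   # vertical distances
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance
    return (v1 + v2) / (2.0 * h)

def perclos(ear_values, closed_threshold=0.2):
    """Percentage of frames in which the eye aspect ratio falls below a closure threshold."""
    ear_values = np.asarray(ear_values)
    return float((ear_values < closed_threshold).mean())
```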
Download

Area 5 - Applications and Services

Full Papers
Paper Nr: 158
Title:

Defense Against Model Inversion Attacks Using a Dummy Recognition Model Trained with Synthetic Samples

Authors:

Yuta Kotsuji and Kazuaki Nakamura

Abstract: Recently, biometric recognition models such as face identification models have been developing rapidly. At the same time, the risk of cyber-attacks on such models is increasing, one example of which is the model inversion attack (MIA). MIA is an attack that reconstructs or reveals the training samples of a victim recognition model by analyzing the relationship between its inputs and outputs. When an MIA is conducted on a biometric model, its training samples, such as face, iris, and fingerprint images, could be leaked. Since they are privacy-sensitive personal information, their leakage causes a serious privacy issue. Hence, it is desirable to develop a defense method against MIA. Although several defense methods have been proposed in the past decade, they tend to decrease the recognition accuracy of the victim model. To solve this problem, in this paper, we propose to use a dummy model trained with synthetic images and combine it in parallel with the victim model, where the combined model is released to users instead of the victim model. The key point of our proposed method is to force the dummy model to output a high confidence score only for a limited range of synthetic images. This allows us to maintain the recognition accuracy of the combined model. We experimentally confirmed that the proposed method can reduce the success rate of MIA to less than 30% while maintaining a recognition accuracy of more than 95%.
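The abstract does not specify the combination rule, so the sketch below is only one hypothetical way to combine the victim and dummy models in parallel (element-wise maximum over their confidence scores), not the authors' actual defense:

```python
import torch

class CombinedModel(torch.nn.Module):
    """Release-time wrapper: the dummy model's scores are merged with the victim's output."""

    def __init__(self, victim, dummy):
        super().__init__()
        self.victim = victim
        self.dummy = dummy

    def forward(self, x):
        victim_scores = self.victim(x)
        dummy_scores = self.dummy(x)
        # Element-wise maximum: for inputs resembling the dummy's synthetic training
        # images, the dummy's high confidence dominates and obscures the victim's response.
        return torch.maximum(victim_scores, dummy_scores)
```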
Download

Short Papers
Paper Nr: 67
Title:

Optimum-Path Forest Ensembles to Estimate the Internal Decay in Urban Trees

Authors:

Giovani Candido, Luis Henrique Morelli, Danilo Samuel Jodas, Giuliana Del Nero Velasco, Reinaldo Araújo de Lima, Kelton Augusto Pontara da Costa and João Paulo Papa

Abstract: Research on urban tree management has recently grown to include various studies using machine learning to address the tree’s risk of falling. One significant challenge is to assess the extent of internal decay, a crucial factor contributing to tree breakage. This paper uses machine and ensemble learning algorithms to determine internal trunk decay levels. Notably, it introduces a novel variation of the Optimum-Path Forest (OPF) ensemble pruning method, OPFsemble, which incorporates a “count class” strategy and performs weighted majority voting for ensemble predictions. To optimize the models’ hyperparameters, we employ a slime mold-inspired metaheuristic, and the optimized models are then applied to the classification task. The optimized hyperparameters are used to randomly select distinct configurations for each model across ensemble techniques such as voting, stacking, and OPFsemble. Our OPFsemble variant is compared to the original one, which serves as a baseline. Moreover, the estimated levels of internal decay are used to predict the tree’s risk of falling and evaluate the proposed approach’s reliability. Experimental results demonstrate the effectiveness of the proposed method in determining internal trunk decay. Furthermore, the findings reveal the potential of the proposed ensemble pruning in reducing ensemble models while attaining competitive performance.
Download

Paper Nr: 159
Title:

Coloring 3D Avatars with Single-Image

Authors:

Pin-Yuan Yang, Yu-Shan Deng, Chieh-Shan Lin, An-Chun Luo and Shih-Chieh Chang

Abstract: 3D avatars are important for various virtual reality (VR) and augmented reality (AR) applications. High-fidelity 3D avatars from real people enhance the realism and interactivity of virtual experience. Creating these avatars accurately and efficiently is a challenging problem. A lifelike 3D human model requires precise color representation. An accurate representation of the color is essential to capture the details of human skin, hair, and clothing to match the real people. Traditional methods, such as 3D scanning and multi-image modeling, are costly and complex, limiting their accessibility to an average user. To address this issue, we introduce a novel approach that requires just a single frontal image to generate 3D avatars. Our method tackles critical challenges in the field of single-image 3D avatar generation: color prediction. To achieve better prediction results, we propose a hybrid coloring technique that combines model-based and projection-based methods. This approach enhances 3D avatars’ fidelity and ensures realistic appearances from all viewpoints. Our advancements have achieved better results in quantitative evaluation and rendering results compared to the previous state-of-the-art method. The entire avatar-generating process is also seven times faster than the NeRF-based method. Our research provides an easily accessible but robust method for reconstructing interactive 3D avatars.
Download

Paper Nr: 162
Title:

Internal State Estimation Based on Facial Images with Individual Feature Separation and Mixup Augmentation

Authors:

Ayaka Asaeda and Noriko Takemura

Abstract: In recent years, opportunities for e-learning and remote work have increased due to the impact of the COVID-19 pandemic. However, issues such as drowsiness and decreased concentration among learners have become apparent, increasing the need to estimate the internal state of learners. Since facial expressions reflect internal states well, they are often utilized in research on state estimation. However, individual differences in facial structure and expression styles can influence the accuracy of these estimations. This study aims to estimate ambiguous internal states such as drowsiness and concentration while accounting for individual differences, based on the Deviation Learning Network (DLN). Such internal states exhibit very subtle and ambiguous changes in facial expressions, making them more difficult to estimate compared to basic emotions. Therefore, this study proposes a model that uses mixup, a form of data augmentation, to account for subtle differences in expressions between classes. In the evaluation experiments, facial images of learners during e-learning are used to estimate their arousal levels in three categories: Asleep, Drowsy, and Awake.
Download

Paper Nr: 176
Title:

Disease Estimation Using Gait Videos by Separating Individual Features Based on Disentangled Representation Learning

Authors:

Shiori Furukawa and Noriko Takemura

Abstract: With the aging of society, the number of patients with gait disturbance is increasing. Lumbar spinal canal stenosis (LCS) and cervical spondylotic myelopathy (CSM) are representative diseases that cause gait disturbance. However, diagnosing these diseases takes a long time because of the wide variety of medical departments and lack of screening tests. In this study, we propose a method to recognize LCS and CSM using patients’ walking videos. However, the gait images of patients contain not only disease features but also individual features, such as body shape and hairstyle. Such individual features may reduce the accuracy of disease estimation. Therefore, we aim to achieve highly accurate disease estimation by separating and removing individual features from disease features using a deep learning model based on a disentangled representation learning approach. In evaluation experiments, we confirmed the usefulness of the proposed method by verifying the accuracy of different model structures and different diagnostic tasks to be estimated.
Download

Paper Nr: 189
Title:

Towards a Dataset for Paleographic Details in Historical Torah Scrolls

Authors:

Laura Frank, Germaine Götzelmann and Danah Tonne

Abstract: Historical textual witnesses used in religious practice have been a research interest for a long time but still remain mysterious. In particular, medieval Torah scrolls show irregularities in the scripture, whose intentions have not yet been revealed. In this paper, we assess the analysis of letter decorations from the perspective of computer vision and investigate the possibilities of extending qualitative research in Jewish Studies by quantitative analysis methods of computer science. For this purpose, we introduce a methodological approach to obtain a reproducible and extensible dataset of Hebrew letters and present a set of labels usable for various machine learning tasks. The evaluation of the dataset in terms of decoration recognition shows promising prediction accuracy rates of up to 90% with standard transfer learning methods and architectures.
Download

Paper Nr: 359
Title:

Differential Diagnosis of Brain Diseases Using Ensemble Learning and Explainable AI

Authors:

Nighat Bibi, Kathleen M. Curran and Jane Courtney

Abstract: The differential diagnosis of brain diseases by magnetic resonance imaging (MRI) is a crucial step in the diagnostic process, and deep learning (DL) has the potential to significantly improve the accuracy and efficiency of these diagnoses. This study focuses on creating an ensemble learning (EL) model that utilizes the ResNet50, DenseNet121, and EfficientNetB1 architectures to concurrently and accurately classify various brain conditions from MRI images. The proposed ensemble learning model identifies a range of brain disorders that encompass different types of brain tumours, as well as multiple sclerosis. The proposed model was trained on two open-source datasets, consisting of MRI images of glioma, meningioma, pituitary tumours, non-tumour cases and multiple sclerosis. Central to this research is the integration of gradient-weighted class activation mapping (Grad-CAM) for model interpretability, aligning with the growing emphasis on explainable AI (XAI) in medical imaging. The application of Grad-CAM improves the transparency of the decision-making process of the model, which is vital for clinical acceptance and trust in AI-assisted diagnostic tools. The EL model achieved an impressive 99.84% accuracy in classifying these various brain conditions, demonstrating its potential as a versatile and effective tool for differential diagnosis in neuroimaging. The model’s ability to distinguish between multiple brain diseases underscores its significant potential in the field of medical imaging. Additionally, Grad-CAM visualizations provide deeper insights into the neural network’s reasoning, contributing to a more transparent and interpretable AI-driven diagnostic process in neuroimaging.
Download

Paper Nr: 365
Title:

Leveraging Affordable Solutions for Stereo Video Capture in Virtual Reality Applications

Authors:

Leina Yoshida, Gustavo Domingues, Fabiana Peres, Claudio R. Mauricio and João M. Teixeira

Abstract: Stereo video is essential for creating immersive virtual reality experiences by providing depth perception and enhancing realism. However, capturing high-quality stereo video often requires expensive professional equipment, such as specialized stereo lenses and high-resolution cameras, which poses significant financial barriers for independent content creators and small studios. This paper explores affordable alternatives for stereo video capture, specifically utilizing the built-in cameras of the Meta Quest 3 MR (mixed reality) headset. We compare the capabilities of the Meta Quest 3 with high-end equipment like the Canon EOS R7 and dual fisheye lenses, whose cost is significantly higher (approximately seven times more). Our analysis includes a comparison of display and camera resolutions of popular VR head-mounted displays, revealing that the current VR headsets’ display resolutions do not fully utilize the high capture resolutions offered by professional cameras. We provide detailed instructions for setting up the Meta Quest 3 for stereo video capture and present examples of videos captured in both indoor and outdoor environments. The findings suggest that affordable devices like the Meta Quest 3 are capable of producing stereo video content suitable for the present virtual reality technology landscape. The cost savings and operational efficiencies make it a practical option for content creators. We conclude that, given the display limitations of current VR HMDs, investing in high-end capture devices may not yield significant benefits. As VR technology advances and HMD display resolutions improve, the advantages of professional capture equipment may become more pronounced.
Download

Paper Nr: 290
Title:

Efficient CNN-Based System for Automated Beetle Elytra Coordinates Prediction

Authors:

Hojin Yoo, Dhanyapriya Somasundaram and Hyunju Oh

Abstract: Beetles represent nearly a quarter of all known animal species and play crucial roles in ecosystems. A key morphological feature, the elytra, provides essential protection and adaptability but measuring their size manually is labor-intensive and prone to errors, especially with large datasets containing multiple specimens per image. To address this, we introduce a deep learning-based framework that automates the detection and measurement of beetle elytra using Convolutional Neural Networks (CNN). Our system integrates advanced object detection techniques to accurately localize individual beetles and predict elytra coordinates, enabling precise measurement of elytra length and width. Additionally, we recreated an existing beetle dataset tailored for elytra coordinate prediction. Through comprehensive experiments and ablation studies, we optimized our framework to achieve a measurement accuracy with an error margin of only 0.1 cm. This automated approach significantly reduces manual effort and facilitates large-scale beetle trait analysis, thereby advancing biodiversity research and ecological assessments. Code is available at https://github.com/yoohj0416/predictbeetle.
Download

Paper Nr: 291
Title:

Effectiveness of Cross-Model Learning Through View-Model Ensemble on Detection of Spatiotemporal EEG Patterns

Authors:

Ömer Muhammet Soysal, Iphy Emeka Kelvin and Muhammed Esad Oztemel

Abstract: Understanding the neural dynamics of human intelligence has been one of the top research topics for decades. Advances in computational technologies have raised the level at which complex problems can be solved by means of computational neuroscience approaches. The patterns extracted from neural responses can be utilized as a biometric for authentication. In this study, we aim to explore a cross-model transfer learning approach for the extraction of distinct features from electroencephalography (EEG) neural signals. The discriminative features are generated by deep convolutional neural network and autoencoder machine learning models. In addition, a 3D spatiotemporal View-matrix is proposed to search for distinct patterns over multiple EEG channels, time, and window segments. We propose a View-model approach to obtain intermediate predictions. At the final stage, these intermediate scores are ensembled through a majority-voting scheme to reach the final decision. The initial results show that the proposed cross-model learning approach can outperform regular classification-based approaches.
Download

Paper Nr: 295
Title:

A Multimodal Approach to Research Paper Summarization

Authors:

Pranav Bookanakere, Syeda Saniya, Syed Munzer Nouman, S. Pramath and Jayashree Rangareddy

Abstract: As the amount of academic research in the medical field has been growing exponentially, being able to understand and extract important information from these research papers has become all the more challenging. Researchers, students, and professionals often find it hard to navigate through medical-based research papers that contain complex images and textual information. Most summarization tools that already exist have limited effectiveness and cannot handle the multimodal nature of complex research papers. This paper addresses the need for an all-round approach to effectively generate summaries, taking key information from both the text as well as the complex images present in research papers. Our approach can generate section-wise summaries of the text and also generate context-based image descriptions with high levels of accuracy. By putting together advanced Natural Language Processing (NLP) and multimodal (T5, Llava) techniques, this system is able to generate comprehensive and concise summaries of complex research papers. This work demonstrates the potential of multimodal AI models to improve research comprehension and provide deeper understanding of complex subjects in the medical field.
Download

Paper Nr: 376
Title:

Sleep-Stage Efficient Classification Using a Lightweight Self-Supervised Model

Authors:

Eldiane Borges dos Santos Durães and João Batista Florindo

Abstract: Background and Objective: Accurate classification of sleep stages is crucial for diagnosing sleep disorders, and automating this process can significantly enhance clinical assessments. This study aims to explore the use of a self-supervised model (more specifically, an adapted version of mulEEG) combined with a Linear SVM classifier to improve sleep stage classification. Methods: The mulEEG model, which learns electroencephalogram signal representations in a self-supervised manner, was simplified here by replacing the ResNet-50 with 1D convolutions used as the time series encoder with a ResNet-18 backbone. Two other adaptations were conducted: the first evaluated different configurations of the model and data volume for training, while the second tested the effectiveness of time series features, spectrogram features, and their concatenation as inputs to a Linear SVM classifier. Results: The results showed that reducing the volume of data offered a better cost-benefit ratio than simplifying the model. Using the concatenated features with ResNet-18 also outperformed the linear evaluations of the original mulEEG model, achieving higher classification performance. Conclusions: Simplifying the mulEEG model to extract features and pairing it with a robust classifier leads to more efficient and accurate sleep stage classification. This approach holds promise for improving clinical sleep assessments and can be extended to other biological signal classification tasks.
Download

Paper Nr: 379
Title:

DeepCellCount: Cell Counting Using Two-Step Deep Learning

Authors:

Sara Tesfamariam, Isah A. Lawal, Arda Durmaz and Jacob G. Scott

Abstract: This paper addresses the problem of segmenting and counting cells in fluorescent microscopy images. Accurate identification and counting of cells is crucial for automated cell annotation processes in biomedical laboratories. To address this, we trained two convolutional neural networks using publicly available high-throughput microscopy cell image sets. One network is trained for cell segmentation and the other for cell counting. Both models are then used in a two-step image analysis process to identify and count the cells in a given image. We evaluated the performance of this method on previously unseen cell images, and our experimental results show that the proposed method achieved an average Mean Absolute Percentage Error (MAPE) as low as 6.82 on the test images with sparsely populated cells. This performance is comparable to that obtained with a more complex CellProfiler software on the same dataset.
Download

Paper Nr: 424
Title:

Towards Safe Self-Stimulatory Behaviors in Autistic Children: HarmAlert4AutisticChildren (HA4AC)

Authors:

Aleenah Khan and Hassan Foroosh

Abstract: Self-stimulatory behaviors, or stimming, are quite common in autism and can begin as early as infancy. Autistic infants may show early signs of stimming through repetitive movements such as hand flapping, rocking, or head banging. These stereotypical behaviors support self-regulation and are generally not harmful unless they pose a safety risk (e.g., head banging) or significantly interfere with daily activities. In such cases, the parent or caregiver must immediately intervene to ensure the safety of the child. To foster a safe environment for autistic children, we introduce the novel problem of identifying potentially harmful self-stimulatory behaviors to alert the parent or caregiver. To pave the way for research, we consolidated a video-based dataset, “HarmAlert4AutisticChildren”, which categorizes autism-related stimming behaviors into two categories: helpful and harmful. We utilize existing publicly available video datasets that focus on the different problem of self-stimulatory behavior classification in autism. The curation process is based on a systematic review of the literature of clinical research studies that analyze the impacts of various self-stimulatory behaviors in autistic children. In addition to introducing a new research problem and a new dataset, we also provide baseline results using the Contrastive Language-Image Pretraining (CLIP) model. The dataset and code are available on GitHub: https://github.com/AleenahK/HarmAlert4AutisticChildren-HA4AC.
Download