Abstracts Track 2026


Area 1 - 3D Vision, Motion, Robotics, Application & Systems

Nr: 347
Title:

Future Image Generation from Vehicle Video with Multi-Modal Auxiliary Features

Authors:

Yoshimitsu Ichimori, Fumihiko Sakaue and Jun Sato

Abstract: In recent years, technologies for transmitting remote video have been widely adopted as part of social infrastructure, including remote operation using drones and unmanned exploration vehicles, as well as remote meetings. In transmission systems that require real-time performance, not only video quality but also latency is a critical issue. For example, in remote vehicle operation, delays in video transmission can lead to slower situational awareness, potentially resulting in serious accidents. To address these issues, technologies that compensate for video latency, namely, techniques for predicting future frames, are required. This study investigates a future video generation method for vehicle-mounted camera video. Considering the recent improvements in image resolution, generating high-resolution images in real time for such image prediction tasks is impractical from a computational standpoint. Therefore, this study employs a Variational Auto-Encoder (VAE) to encode images into a low-dimensional latent space and estimates features representing future frames within this space. The latent features are then decoded back into image space to generate future images. In addition, we employ semantic segmentation, depth estimation, and optical flow estimation on the input images to provide auxiliary information for future image generation. This auxiliary information is integrated with RGB images using a Vision Transformer (ViT). To achieve this, the proposed method utilizes an Auxiliary Information Extraction ViT and a Future Video Generation ViT. In the Auxiliary Information Extraction ViT, depth, segmentation, and optical flow data are tokenized, and the concatenated tokens are analyzed using a self-attention structure. The resulting tokens are then split according to their respective modalities and processed by separate MLPs. The outputs are concatenated again, and the same operation is repeated. 
This approach enables independent analysis of each modality while leveraging inter-modal relationships to obtain auxiliary tokens. Next, the Future Video Generation ViT takes RGB tokens and auxiliary tokens as input and applies a cross-attention structure, where RGB tokens serve as keys and values, while auxiliary tokens act as queries, to obtain latent features representing future frames. Finally, the decoder reconstructs the latent features into images to generate future frames. To evaluate its effectiveness, we applied the proposed method to videos recorded by a vehicle-mounted camera for future video generation. Experiments were conducted on dynamic image sets extracted at three-frame intervals from the public datasets Cityscapes (16.6 fps) and WayveScenes (10 fps), resulting in frame rates of 5.6 fps and 3.3 fps, respectively. Among these, 1,400 scenes were used for training and 250 scenes for evaluation. Four consecutive frames were used as input to predict the subsequent frame. The mean absolute error between predicted and ground-truth images was 9.3 for the proposed method, compared to 11.3 for the baseline method without auxiliary information. These results confirm that the proposed method, leveraging VAE and multi-modal information, achieves higher prediction accuracy. However, the generated images exhibited blurred edges, suggesting difficulty in restoring high-frequency components. Therefore, future work will focus on developing methods for predicting higher-resolution images.
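The cross-attention step described above, where auxiliary tokens act as queries against RGB keys and values, can be sketched as follows. This is a minimal single-head NumPy illustration; the token counts, dimensions, and weight matrices are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(aux_tokens, rgb_tokens, W_q, W_k, W_v):
    """Single-head cross-attention: auxiliary tokens query the RGB tokens.

    aux_tokens: (N_aux, d) tokens from the auxiliary modalities
    rgb_tokens: (N_rgb, d) tokens from the encoded RGB frames
    Returns fused tokens of shape (N_aux, d_v).
    """
    Q = aux_tokens @ W_q          # queries come from the auxiliary stream
    K = rgb_tokens @ W_k          # keys come from the RGB stream
    V = rgb_tokens @ W_v          # values come from the RGB stream
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product attention
    return softmax(scores, axis=-1) @ V

# Toy dimensions (illustrative only; the abstract does not specify sizes).
rng = np.random.default_rng(0)
d, d_v, n_aux, n_rgb = 32, 32, 12, 20
out = cross_attention(rng.normal(size=(n_aux, d)), rng.normal(size=(n_rgb, d)),
                      rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                      rng.normal(size=(d, d_v)))
print(out.shape)  # (12, 32)
```

The design point worth noting is the asymmetry: each fused token corresponds to an auxiliary (depth/segmentation/flow) token, enriched by attending over the RGB content.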

Area 2 - Foundations & Representation Learning

Nr: 346
Title:

Capturing Spatial Information in Hyperspectral Imaging Using Pretrained Embedding Models

Authors:

Antoine Deryck, Juan Antonio Fernández Pierna and Toon Goedemé

Abstract: Pre-trained computer vision models have transformed image analysis by enabling transfer learning from large-scale datasets to downstream tasks, even when limited annotated data is available. However, these models are fundamentally designed for RGB images and therefore cannot be directly applied to hyperspectral imaging (HSI), where each pixel contains hundreds of spectral bands that encode both visual and chemical information. As a result, most HSI workflows unfold 3D data cubes into 2D matrices, enabling the use of traditional machine learning methods but discarding spatial structure, a critical loss for applications where texture, pattern, or spatial spread are essential, such as plant disease detection or food authentication. Recent attempts to develop deep learning architectures tailored to hyperspectral data have shown potential but remain limited by high computational cost, architecture design complexity, and the need for large domain-specific datasets. Moreover, their transferability remains limited compared to that of established RGB vision models. In response to these limitations, this work proposes an alternative strategy: instead of designing new architectures for hyperspectral data, we adapt hyperspectral data to existing architectures. Spectral signatures are compressed into three-channel representations using dimensionality reduction techniques, including PCA and UMAP, preserving spectral information while converting HSI into pseudo-RGB images compatible with conventional computer vision pipelines. We then evaluate whether state-of-the-art pretrained visual embedding models (OpenAI CLIP, Google SigLIP, and Meta DINOv3) can extract meaningful spatial information and spectral structure from these representations without fine-tuning.
Experiments are performed on two case studies: (1) semolina mixture homogeneity assessment, where unfolding-based methods are unable to capture blending patterns, and (2) cocoa bean genotype prediction, where existing HSI pipelines show limited discriminative performance and where the spatial distribution of chemicals within the beans may be informative. Preliminary results on semolina mixtures demonstrate that embeddings successfully capture spatial organization, enabling discrimination based on blending time and mixture uniformity. Cocoa bean analysis is ongoing. These findings demonstrate that pretrained RGB vision models can be repurposed for hyperspectral imaging. The proposed pipeline can be applied to different existing models without requiring a dedicated architecture for HSI, offering a practical way to exploit both spectral and spatial information while reducing development time and complexity. This approach facilitates the use of deep learning for hyperspectral imaging and supports faster deployment in real-world applications.
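The PCA-based compression to pseudo-RGB described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the cube is synthetic, PCA is implemented directly via SVD, and the authors' exact preprocessing (e.g. band selection, scaling) is not specified in the abstract.

```python
import numpy as np

def cube_to_pseudo_rgb(cube):
    """Project a hyperspectral cube (H, W, B) onto its first three principal
    components and rescale each channel to [0, 1], yielding a pseudo-RGB
    image that a pretrained RGB backbone can consume."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(np.float64)
    X -= X.mean(axis=0)                       # center each spectral band
    # PCA via SVD: rows of Vt are the principal directions in band space.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ Vt[:3].T                     # (H*W, 3) component scores
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    scores = (scores - lo) / np.where(hi > lo, hi - lo, 1.0)
    return scores.reshape(H, W, 3)

# Synthetic 8x8 cube with 120 spectral bands (sizes are illustrative only).
cube = np.random.default_rng(1).random((8, 8, 120))
rgb = cube_to_pseudo_rgb(cube)
print(rgb.shape)  # (8, 8, 3)
```

The resulting three-channel array can then be fed to any off-the-shelf RGB embedding model without fine-tuning, which is the core of the proposed pipeline; UMAP would slot in as an alternative projection at the same step.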

Nr: 23
Title:

Reliable 3D: Enhancing Robustness of Point Cloud Models to Attacks and Corruptions

Authors:

Rosina Fazal Kharal, Saif Al-Din Ali and Abu Bakr Nafees

Abstract: Recent years have witnessed the rapid expansion of real-time, mission-critical systems that collect 3D point cloud data and rely heavily on object-detection models. The precision and classification accuracy of these models are crucial in applications such as robotics, autonomous vehicles, and aviation. However, point cloud data is highly susceptible to various forms of corruption that can severely compromise model performance. These corruptions may arise from weather distortions, sensor input errors, or malicious adversarial attacks targeting the input stream. Robustness against such attacks in 3D has not received the same level of attention as in 2D. Our work shows that distortions in 3D point cloud data cause a 30% drop in model accuracy. We propose a three-step corrective strategy that restores accuracy to over 85%, and to over 90% in adversarial cases, surpassing both the baseline and state-of-the-art defenses. Importantly, our method maintains robustness without degrading accuracy in the absence of corruptions or attacks. The defense strategy and corrective measures proposed in this work are a significant step toward making object-detection models reliable and resilient in real-time settings, bringing much-needed robustness to 3D classification models.

Nr: 236
Title:

AI Agents for UAV Video Streaming Using H.264/SVC Over a P2P System

Authors:

Youssef Lahbabi, Abdellah Ait Oufkir, Lalla Touhfa Belgnaoui and Tarik Lafou

Abstract: This paper proposes a novel intelligent framework for enhancing UAV video streaming using H.264/SVC over a peer-to-peer (P2P) distributed communication system. Traditional UAV streaming architectures rely on centralized servers, creating bottlenecks, single points of failure, and instability when channel conditions vary rapidly. Existing P2P approaches improve scalability but lack intelligent adaptation, especially for multi-layer SVC flows where base and enhancement layers require differentiated handling. To address these limitations, we introduce an autonomous multi-agent system capable of dynamically coordinating video-layer distribution, peer selection, and link adaptation. The core contribution of this work is an AI-driven decision engine in which reinforcement-learning agents manage SVC segmentation, prioritize base-layer reliability, and optimally assign enhancement layers across cooperative peers according to link quality, latency, and node stability. The agents integrate cross-layer information—including UAV mobility patterns, P2P neighbor behavior, and congestion metrics—to ensure consistent quality-of-experience (QoE) under fluctuating network conditions. Simulated results demonstrate that the proposed system outperforms conventional P2P and centralized strategies by reducing delay by up to 30%, improving base-layer delivery rate, and enhancing overall video continuity. This study highlights the strong potential of intelligent P2P-enabled UAV networks for real-time, scalable, and resilient video streaming applications.
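The differentiated handling of base and enhancement layers described above can be illustrated with a deliberately simplified stand-in. The paper's decision engine is reinforcement-learning based; the sketch below replaces the learned policy with a fixed weighted-utility heuristic purely to show the assignment logic, and all field names, weights, and peer values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    link_quality: float   # 0..1, higher is better
    latency_ms: float
    stability: float      # 0..1, e.g. fraction of observed uptime

def peer_utility(p, w_quality=0.5, w_latency=0.3, w_stability=0.2):
    # Hand-picked weights; the paper's RL agents would learn this trade-off.
    return (w_quality * p.link_quality
            - w_latency * p.latency_ms / 100.0
            + w_stability * p.stability)

def assign_layers(peers, n_enhancement_layers):
    """Give the SVC base layer to the single best peer (reliability first),
    then spread enhancement layers over the remaining peers in decreasing
    order of utility, round-robin if there are more layers than peers."""
    ranked = sorted(peers, key=peer_utility, reverse=True)
    base_peer = ranked[0]
    rest = ranked[1:] or ranked           # degenerate case: a single peer
    enh = {i: rest[i % len(rest)].name for i in range(n_enhancement_layers)}
    return base_peer.name, enh

peers = [Peer("p1", 0.9, 20, 0.95), Peer("p2", 0.6, 80, 0.7),
         Peer("p3", 0.8, 35, 0.9)]
base, enhancement = assign_layers(peers, 2)
print(base, enhancement)  # p1 {0: 'p3', 1: 'p2'}
```

The point of the asymmetry is that a lost base layer makes the whole frame undecodable, while a lost enhancement layer only degrades quality, so the most reliable peer is reserved for the base layer.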

Area 3 - Recognition & Detection

Nr: 12
Title:

Deep Learning-Based Models for Recognition of Skydiving Formations

Authors:

Algimantas Skuodis and Olga Kurasova

Abstract: In this study, we present results from a series of works that explore and evaluate the feasibility of using deep learning models for approximate live judging in 4-way formation skydiving. Formation skydiving is one of the many disciplines in the sport of skydiving. In the freefall skydiving disciplines, skydivers have to complete a set of objectives within a designated time. In 4-way formation skydiving competitions, judging is done after the competition round by five judges using the video footage of the skydive. Judging is a tedious and time-consuming process, and it takes time to present a total score for all teams after each round. Sometimes this is addressed by approximate live judging, which raises a question: is it possible to conduct live judging using deep learning models? For this, we have created a novel dataset from twelve skydiving competitions and the first six skydiving formations. The dataset was annotated with the formation name and region of interest in the frame. Each captured frame had to be reviewed manually to correct potential errors introduced when automatically extracting frames from the video. Because this is a time-consuming process, evaluation was limited to the first six formations. First, we selected and evaluated the effectiveness of several deep learning architectures, including ResNet-50, EfficientViT, FastViT, and several ConvMixer configurations, in classifying these formations. The results of evaluating the selected deep learning models on the skydiving dataset suggest that pre-trained models can be used to classify skydiving formations and extract sufficient features to achieve high classification accuracy. Based on the results, we conclude that achieving a high classification score on the skydiving dataset with six classes is possible. A ConvMixer variant achieved the best overall F1 score of 0.9865, outperforming the other ConvMixer configurations as well.
Next, we proposed and evaluated a two-step skydiving formation recognition approach. We have evaluated three variations of the two-step approach: keypoint-based, distance-based, and wrist distance-based. Each variation uses a YOLO11n-Pose human pose detector, pre-trained on images of skydiving formations. For the classification part of the two-step approach, we have evaluated the use of classical machine learning methods versus a fully connected neural network (FCNN). Training and classification were assessed on our improved skydiving dataset, split into training and validation subsets, and subsequently on a test dataset of 60 randomly selected images. Our findings reveal that, regardless of the method (classical or FCNN) used in the second step, the highest classification result was achieved using the two-step wrist distance-based approach, where distances are calculated from each person’s wrist to all other keypoints of all other persons. In all cases, the FCNN in the second stage of the two-step approach performs better on the validation and test datasets. The two-step wrist distance-based approach, with the FCNN in the second stage, achieved the highest overall weighted-average F1 scores of 0.92 on the validation dataset and 0.95 on the test dataset. We conclude that the wrist distance-based variation can further improve accuracy by focusing on the keypoints most relevant to formation skydiving. These results demonstrate the potential of the two-step distance-based and wrist distance-based methods.
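The wrist distance-based feature described above can be sketched as follows. This is a minimal NumPy illustration; the COCO-17 keypoint layout and the wrist indices (9, 10) are assumptions, since the abstract does not name the keypoint convention used by the detector.

```python
import numpy as np

WRISTS = (9, 10)  # left/right wrist indices, assuming the COCO-17 layout

def wrist_distance_features(people):
    """people: array (P, 17, 2) of (x, y) keypoints for P skydivers.
    Returns a flat vector of Euclidean distances from each person's two
    wrists to every keypoint of every *other* person."""
    people = np.asarray(people, dtype=float)
    P = people.shape[0]
    feats = []
    for i in range(P):
        for w in WRISTS:
            wrist = people[i, w]                               # (2,) point
            for j in range(P):
                if j == i:
                    continue                                   # skip self
                d = np.linalg.norm(people[j] - wrist, axis=1)  # (17,) dists
                feats.append(d)
    return np.concatenate(feats)

# Toy poses for a 4-way formation: 4 people, 17 keypoints each.
poses = np.random.default_rng(2).random((4, 17, 2))
vec = wrist_distance_features(poses)
print(vec.shape)  # 4 people * 2 wrists * 3 others * 17 keypoints = (408,)
```

A fixed-length vector like this (for a fixed team size) is what the second-stage classifier, whether a classical method or the FCNN, would consume.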