VISAPP 2022 Abstracts


Area 1 - Image and Video Formation, Preprocessing and Analysis

Full Papers
Paper Nr: 8
Title:

Flexible Table Recognition and Semantic Interpretation System

Authors:

Marcin Namysl, Alexander M. Esser, Sven Behnke and Joachim Köhler

Abstract: Table extraction is an important but still unsolved problem. In this paper, we introduce a flexible and modular table extraction system. We develop two rule-based algorithms that perform the complete table recognition process, including table detection and segmentation, and support the most frequent table formats. Moreover, to incorporate the extraction of semantic information, we develop a graph-based table interpretation method. We conduct extensive experiments on the challenging table recognition benchmarks ICDAR 2013 and ICDAR 2019, achieving results competitive with state-of-the-art approaches. Our complete information extraction system exhibited a high F1 score of 0.7380. To support future research on information extraction from documents, we make the resources (ground-truth annotations, evaluation scripts, algorithm parameters) from our table interpretation experiment publicly available.

Paper Nr: 9
Title:

Non-local Matching of Superpixel-based Deep Features for Color Transfer

Authors:

Hernan Carrillo, Michaël Clément and Aurélie Bugeau

Abstract: In this article, we propose a new method for matching high-resolution feature maps from CNNs using attention mechanisms. To avoid the quadratic scaling problem of all-to-all attention, this method relies on a superpixel-based pooling dimensionality reduction strategy. From this pooling, we efficiently compute non-local similarities between pairs of images. To illustrate the interest of these new methodological blocks, we apply them to the problem of color transfer between a target image and a reference image. While previous methods for this application can suffer from poor spatial and color coherence, our approach tackles these problems by leveraging a robust non-local matching between high-resolution low-level features. Finally, we highlight the interest of this approach by showing promising results in comparison with state-of-the-art methods.

Paper Nr: 30
Title:

Object Detector Differences When using Synthetic and Real Training Data

Authors:

Martin G. Ljungqvist, Otto Nordander, Arvid Mildner, Tony Liu and Pierre Nugues

Abstract: To train well-behaved generalizing neural networks, sufficiently large and diverse datasets are needed. Collecting data while adhering to privacy legislation becomes increasingly difficult and annotating these large datasets is both a resource-heavy and time-consuming task. An approach to overcome these difficulties is to use synthetic data since it is inherently scalable and can be automatically annotated. However, how training on synthetic data affects the layers of a neural network is still unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images from city environments. We perform a similarity analysis using Centered Kernel Alignment (CKA) to explore the effects of training on synthetic data on a layer-wise basis. The analysis captures the architecture of the detector while showing both different and similar patterns between different models. With this similarity analysis, we aim to give insights into how training on synthetic data affects each layer and to provide a better understanding of the inner workings of complex neural networks. The results show that the largest similarity between a detector trained on real data and a detector trained on synthetic data was in the early layers, and the largest difference was in the head part.
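
For reference, linear Centered Kernel Alignment between the activations of two layers (or of the same layer in two differently trained models) can be computed as in the short sketch below; this is an illustrative implementation of the standard formula, not code from the paper.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two activation matrices.

    X: (n_samples, d1) activations from one model/layer.
    Y: (n_samples, d2) activations from another model/layer on the same inputs.
    Returns a value in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))
```

In a layer-wise study such as this one, X and Y would hold the flattened activations of corresponding detector layers (for example, from the real-trained and the synthetic-trained YOLOv3) evaluated on the same probe images.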

Paper Nr: 40
Title:

Application of GAN for Reducing Data Imbalance under Limited Dataset

Authors:

Gaurav Adke

Abstract: The paper discusses architectural and training improvements of the generative adversarial network (GAN) model for stable training. An advanced GAN architecture combining these improvements is proposed and applied to the augmentation of a tire joint nonconformity dataset used for classification applications. The dataset used is highly unbalanced, with a higher number of conformity images. This unbalanced and limited dataset of nonconformity identification poses challenges in developing accurate nonconformity classification models. Therefore, research is carried out in the presented work to augment the nonconformity dataset while increasing the balance between different nonconformity classes. The quality of generated images is improved by incorporating recent developments in GANs. The present study shows that the proposed advanced GAN model helps improve the performance of the classification model through augmentation under a limited, unbalanced dataset. Generated results of the advanced GAN are evaluated using the Fréchet Inception Distance (FID) score, which shows a large improvement over the StyleGAN architecture. Further experiments for dataset augmentation using generated images show a 12% improvement in classification model accuracy over the original dataset. The potency of augmentation using GAN-generated images is experimentally demonstrated using principal component analysis plots.
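
For context, the Fréchet Inception Distance mentioned above compares the Gaussian statistics of Inception features extracted from real and generated images; the snippet below implements the standard definition and is not taken from the paper.

```python
import numpy as np
from scipy import linalg

def fid(mu_real, cov_real, mu_gen, cov_gen):
    """Frechet Inception Distance between two feature distributions.

    mu_*: (d,) mean of Inception features, cov_*: (d, d) covariance of features,
    computed separately over the real and the generated image sets."""
    diff = mu_real - mu_gen
    covmean, _ = linalg.sqrtm(cov_real @ cov_gen, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts caused by numerical noise
    return float(diff @ diff + np.trace(cov_real + cov_gen - 2.0 * covmean))
```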

Paper Nr: 46
Title:

Video-based Detection and Tracking with Improved Re-Identification Association for Pigs and Laying Hens in Farms

Authors:

Qinghua Guo, Yue Sun, Lan Min, Arjen van Putten, Egbert F. Knol, Bram Visser, T. B. Rodenburg, J. E. Bolhuis, Piter Bijma and Peter H. N. de With

Abstract: It is important to detect negative behavior of animals for breeding in order to improve their health and welfare. In this work, AI is employed to assist individual animal detection and tracking, which enables the future analysis of behavior for individual animals. The study involves animal groups of pigs and laying hens. First, two state-of-the-art deep learning-based Multi-Object Tracking (MOT) methods are investigated, namely Joint Detection and Embedding (JDE) and FairMOT. Both models detect and track individual animals automatically and continuously. Second, a weighted association algorithm is proposed, which is feasible for both MOT methods to optimize the object re-identification (re-ID), thereby improving the tracking performance. The proposed methods are evaluated on manually annotated datasets. The best tracking performance on pigs is obtained by FairMOT with the weighted association, resulting in an IDF1 of 90.3%, MOTA of 90.8%, MOTP of 83.7%, number of identity switches of 14, and an execution rate of 20.48 fps. For the laying hens, FairMOT with the weighted association also achieves the best tracking performance, with an IDF1 of 88.8%, MOTA of 86.8%, MOTP of 72.8%, number of identity switches of 2, and an execution rate of 21.01 fps. These results show promising accuracy and robustness for individual animal tracking.

Paper Nr: 53
Title:

MA-ResNet50: A General Encoder Network for Video Segmentation

Authors:

Xiaotian Liu, Lei Yang, Xiaoyu Zhang and Xiaohui Duan

Abstract: To improve the performance of segmentation networks on video streams, most researchers now use either optical-flow-based methods or non-optical-flow CNN-based methods. The former suffers from heavy computational cost and high latency while the latter suffers from poor applicability and versatility. In this paper, we design a Partial Channel Memory Attention module (PCMA) to store and fuse time series features from video sequences. Then, we propose a Memory Attention ResNet50 network (MA-ResNet50) by combining the PCMA module with ResNet50, making it the first video-based feature extraction encoder applicable to most of the currently proposed segmentation networks. For experiments, we combine our MA-ResNet50 with four acknowledged per-frame segmentation networks: DeeplabV3P, PSPNet, SFNet, and DNLNet. The results show that our MA-ResNet50 generally outperforms the original ResNet50 in these four networks on VSPW and CamVid. Our method also achieves state-of-the-art accuracy on CamVid. The code is available at https://github.com/xiaotianliu01/MA-Resnet50.

Paper Nr: 57
Title:

Identification of Planarian Individuals by Spot Patterns in Texture

Authors:

Nikita Lomov, Kharlampiy Tiras and Leonid Mestetskiy

Abstract: Planarian flatworms are known for their abilities to regenerate and are a popular biological model. Identification of individual planarians is useful for automating biological research and improving the accuracy of measurements in experiments. The article proposes a method for identifying planaria by their texture profile, characterized by the set, shape, and position of light spots (areas without pigment) on the worm’s body. To enable the comparison of planaria of different sizes and in different poses, a planarian texture normalization method is suggested. It is based on the selection of a main branch in the skeleton of a segmented image and allows one to switch to a unified coordinate system. Also, a method for creating a generalized textural profile of a planarian, based on averaging sets of spots over multiple images, is proposed. Experiments were carried out to identify planaria across different types of observations: during one day, during several days, and during several days of regeneration after decapitation. The experiments show that light spots are a temporally stable phenotypic trait.

Paper Nr: 61
Title:

DAEs for Linear Inverse Problems: Improved Recovery with Provable Guarantees

Authors:

Jasjeet Dhaliwal and Kyle Hambrook

Abstract: Generative priors have been shown to provide improved results over sparsity priors in linear inverse problems. However, current state of the art methods suffer from one or more of the following drawbacks: (a) speed of recovery is slow; (b) reconstruction quality is deficient; (c) reconstruction quality is contingent on a computationally expensive process of tuning hyperparameters. In this work, we address these issues by utilizing Denoising Auto Encoders (DAEs) as priors and a projected gradient descent algorithm for recovering the original signal. We provide rigorous theoretical guarantees for our method and experimentally demonstrate its superiority over existing state of the art methods in compressive sensing, inpainting, and super-resolution. We find that our algorithm speeds up recovery by two orders of magnitude (over 100x), improves quality of reconstruction by an order of magnitude (over 10x), and does not require tuning hyperparameters.
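
As a rough illustration of the recovery scheme described above, the sketch below runs projected gradient descent on the data-fidelity term of a linear inverse problem y = Ax and uses a trained denoising autoencoder as the projection step; the step size, iteration count, and initialization are placeholders, not the values analyzed in the paper.

```python
import torch

def dae_pgd_recover(y, A, dae, n_iters=200, step=1.0):
    """Sketch: recover x from measurements y = A @ x with a DAE prior.

    y: (m,) measurements, A: (m, n) measurement matrix,
    dae: callable mapping an (n,) signal to a denoised (n,) signal."""
    x = torch.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.t() @ (A @ x - y)      # gradient of 0.5 * ||A x - y||^2
        x = x - step * grad             # gradient step toward data consistency
        x = dae(x).detach()             # "project" onto the learned signal manifold
    return x
```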

Paper Nr: 66
Title:

MinMax-CAM: Improving Focus of CAM-based Visualization Techniques in Multi-label Problems

Authors:

Lucas David, Helio Pedrini and Zanoni Dias

Abstract: The Class Activation Map (CAM) technique (and derivations thereof) has been broadly used in the literature to inspect the decision process of Convolutional Neural Networks (CNNs) in classification problems. However, most studies have focused on maximizing the coherence between the visualization map and the position, shape and sizes of a single object of interest, and little is known about the performance of visualization techniques in scenarios where multiple objects of different labels coexist. In this work, we conduct a series of tests that aim to evaluate the efficacy of CAM techniques over distinct multi-label sets. We find that techniques that were developed with single-label classification in mind (such as Grad-CAM, Grad-CAM++ and Score-CAM) will often produce diffuse visualization maps in multi-label scenarios, overstepping the boundaries of their explaining objects onto different labels. We propose a generalization of the CAM technique, based on multi-label activation maximization/minimization, to create more accurate activation maps. Finally, we present a regularization strategy that encourages sparse positive weights in the classifying layer, producing cleaner activation maps and better multi-label classification scores.
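
For readers unfamiliar with the baseline being generalized, a vanilla class activation map is simply a class-weighted sum of the final convolutional feature maps; the sketch below shows that baseline computation and is not an implementation of the proposed MinMax-CAM.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Vanilla CAM for one class.

    feature_maps: (K, H, W) activations of the last conv layer.
    class_weights: (K,) weights of the global-average-pooling classifier
    for the class of interest."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0.0)                               # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                                # normalize to [0, 1]
    return cam
```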

Paper Nr: 90
Title:

Specularity, Shadow, and Occlusion Removal from Image Sequences using Deep Residual Sets

Authors:

Monika Kwiatkowski and Olaf Hellwich

Abstract: When taking images of planar objects, the images are often subject to unwanted artifacts such as specularities, shadows, and occlusions. While there are some methods that specialize in the removal of each type of artifact individually, we offer a generalized solution. We implement an end-to-end deep learning approach that removes artifacts from a series of images using a fully convolutional residual architecture and Deep Sets. Our architecture can be used as a general approach for many image restoration tasks and is robust to varying sequence lengths and varying image resolutions. Furthermore, it enforces permutation invariance on the input sequence. The architecture is optimized to process high resolution images. We also provide a simple online algorithm that allows the processing of arbitrarily long image sequences without increasing the memory consumption. We created a synthetic dataset as an initial proof-of-concept. Additionally, we created a smaller dataset of real image sequences. In order to overcome the data scarcity of our real dataset, we use the synthetic data for pre-training our model. Our evaluations show that our model outperforms many state-of-the-art methods used in related problems such as background subtraction and intrinsic image decomposition.

Paper Nr: 103
Title:

Perceptual Loss based Approach for Analogue Film Restoration

Authors:

Daniela Ivanova, Jan P. Siebert and John Williamson

Abstract: Analogue film restoration, both for still photographs and motion picture emulsions, is a slow and laborious manual process. Artifacts such as dust and scratches are random in shape, size, and location; additionally, the overall degree of damage varies between different frames. We address this less popular case of image restoration by training a U-Net model with a modified perceptual loss function. Along with the novel perceptual loss function used for training, we propose a more rigorous quantitative model evaluation approach which measures the overall degree of improvement in perceptual quality over our test set.

Paper Nr: 122
Title:

UAV-ReID: A Benchmark on Unmanned Aerial Vehicle Re-identification in Video Imagery

Authors:

Daniel Organisciak, Matthew Poyser, Aishah Alsehaim, Shanfeng Hu, Brian S. Isaac-Medina, Toby P. Breckon and Hubert H. Shum

Abstract: As unmanned aerial vehicles (UAV) become more accessible with a growing range of applications, the risk of UAV disruption increases. Recent development in deep learning allows vision-based counter-UAV systems to detect and track UAVs with a single camera. However, the limited field of view of a single camera necessitates multi-camera configurations to match UAVs across viewpoints – a problem known as re-identification (Re-ID). While there has been extensive research on person and vehicle Re-ID to match objects across time and viewpoints, to the best of our knowledge, UAV Re-ID remains unresearched but challenging due to great differences in scale and pose. We propose the first UAV re-identification data set, UAV-reID, to facilitate the development of machine learning solutions in multi-camera environments. UAV-reID has two sub-challenges: Temporally-Near and Big-to-Small to evaluate Re-ID performance across viewpoints and scale respectively. We conduct a benchmark study by extensively evaluating different Re-ID deep learning based approaches and their variants, spanning both convolutional and transformer architectures. Under the optimal configuration, such approaches are sufficiently powerful to learn a well-performing representation for UAV (81.9% mAP for Temporally-Near, 46.5% for the more difficult Big-to-Small challenge), while vision transformers are the most robust to extreme variance of scale.

Paper Nr: 124
Title:

Syncrack: Improving Pavement and Concrete Crack Detection through Synthetic Data Generation

Authors:

Rodrigo Rill-García, Eva Dokladalova and Petr Dokládal

Abstract: In crack detection, pixel-accurate predictions are necessary to measure the width – an important indicator of the severity of a crack. However, manual annotation of images to train supervised models is a hard and time-consuming task. Because of this, manual annotations tend to be inaccurate, particularly at the pixel-accurate level. The learning bias introduced by this inaccuracy hinders pixel-accurate crack detection. In this paper we propose a novel tool aimed at synthetic image generation with accurate crack labels – Syncrack. This parametrizable tool also provides a method to introduce controlled noise to annotations, emulating human inaccuracy. Using this, we first conduct a robustness study of the impact of training with inaccurate labels. This study quantifies the detrimental effect of inaccurate annotations on the final prediction scores. Afterwards, we propose to use Syncrack to avoid this detrimental effect in a real-life context. For this, we show the advantages of using Syncrack-generated images with accurate annotations for crack detection on real road images. Since supervised scores are biased by the inaccuracy of annotations, we propose a set of unsupervised metrics to evaluate the segmentation quality in terms of crack width.

Paper Nr: 162
Title:

LiDAR Dataset Distillation within Bayesian Active Learning Framework: Understanding the Effect of Data Augmentation

Authors:

Anh P. Duong, Alexandre Almin, Léo Lemarié and B. R. Kiran

Abstract: Autonomous driving (AD) datasets have progressively grown in size in the past few years to enable better deep representation learning. Active learning (AL) has re-gained attention recently to address the reduction of annotation costs and dataset size. AL has remained relatively unexplored for AD datasets, especially on point cloud data from LiDARs. This paper performs a principled evaluation of AL-based dataset distillation on a quarter (1/4th) of the large Semantic-KITTI dataset. Further on, the gains in model performance due to data augmentation (DA) are demonstrated across different subsets of the AL loop. We also demonstrate how DA improves the selection of informative samples to annotate. We observe that data augmentation achieves full-dataset accuracy using only 60% of samples from the selected dataset configuration. This provides faster training times and subsequent savings in annotation costs.

Paper Nr: 166
Title:

Evaluation of Deep Learning based 3D-Point-Cloud Processing Techniques for Semantic Segmentation of Neuromorphic Vision Sensor Event-streams

Authors:

Tobias Bolten, Felix Lentzen, Regina Pohle-Fröhlich and Klaus D. Tönnies

Abstract: Dynamic Vision Sensors are neuromorphic inspired cameras with pixels that operate independently and asynchronously from each other, triggered by illumination changes within the scene. The output of these sensors is a stream with a sparse spatial but high temporal representation of triggered events occurring at a variable rate. Many prior approaches convert the stream into other representations, such as classic 2D frames, to adapt known computer vision techniques. However, the sensor output is natively and directly interpretable as a 3D space-time event cloud without this lossy conversion. Therefore, we propose processing this data utilizing 3D point cloud approaches. We provide an evaluation of different deep neural network structures for semantic segmentation of these 3D space-time point clouds, based on PointNet++ (Qi et al., 2017b) and three published successor variants. This evaluation on a publicly available dataset includes experiments in terms of different data preprocessing, the optimization of network meta-parameters and a comparison to the results obtained by a 2D frame-conversion based CNN-baseline. In summary, the 3D-based processing achieves better results in terms of quality, network size and required runtime.

Paper Nr: 179
Title:

Beyond Global Average Pooling: Alternative Feature Aggregations for Weakly Supervised Localization

Authors:

Matthias Körschens, Paul Bodesheim and Joachim Denzler

Abstract: Weakly supervised object localization (WSOL) enables the detection and segmentation of objects in applications where localization annotations are hard or too expensive to obtain. Nowadays, most relevant WSOL approaches are based on class activation mapping (CAM), where a classification network utilizing global average pooling is trained for object classification. The classification layer that follows the pooling layer is then repurposed to generate segmentations using the unpooled features. The resulting localizations are usually imprecise and primarily focused around the most discriminative areas of the object, making a correct indication of the object location difficult. We argue that this problem is inherent in training with global average pooling due to its averaging operation. Therefore, we investigate two alternative pooling strategies: global max pooling and global log-sum-exp pooling. Furthermore, to increase the crispness and resolution of localization maps, we also investigate the application of Feature Pyramid Networks, which are commonplace in object detection. We confirm the usefulness of both alternative pooling methods as well as the Feature Pyramid Network on the CUB-200-2011 and OpenImages datasets.
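
To make the pooling alternatives concrete, the snippet below contrasts global average pooling with the two investigated aggregations, global max pooling and global log-sum-exp pooling, over a batch of feature maps; the temperature value r is a placeholder, not the setting used in the paper.

```python
import torch

def global_avg_pool(x):            # x: (N, C, H, W)
    return x.mean(dim=(2, 3))

def global_max_pool(x):
    return x.amax(dim=(2, 3))

def global_lse_pool(x, r=10.0):
    """Log-sum-exp pooling: smoothly interpolates between average pooling
    (r -> 0) and max pooling (r -> inf)."""
    n = x.shape[2] * x.shape[3]
    lse = torch.logsumexp(r * x.flatten(2), dim=2)        # (N, C)
    return (lse - torch.log(torch.tensor(float(n)))) / r
```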

Paper Nr: 216
Title:

Single-view 3D Body and Cloth Reconstruction under Complex Poses

Authors:

Nicolas Ugrinovic, Albert Pumarola, Alberto Sanfeliu and Francesc Moreno-Noguer

Abstract: Recent advances in 3D human shape reconstruction from single images have shown impressive results, leveraging deep networks that model the so-called implicit function to learn the occupancy status of arbitrarily dense 3D points in space. However, while current algorithms based on this paradigm, like PiFuHD (Saito et al., 2020), are able to estimate accurate geometry of the human shape and clothes, they require high-resolution input images and are not able to capture complex body poses. Most training and evaluation is performed on 1k-resolution images of humans standing in front of the camera under neutral body poses. In this paper, we leverage publicly available data to extend existing implicit function-based models to deal with images of humans that can have arbitrary poses and self-occluded limbs. We argue that the representation power of the implicit function is not sufficient to simultaneously model details of the geometry and of the body pose. We, therefore, propose a coarse-to-fine approach in which we first learn an implicit function that maps the input image to a 3D body shape with a low level of detail, but which correctly fits the underlying human pose, despite its complexity. We then learn a displacement map, conditioned on the smoothed surface and on the input image, which encodes the high-frequency details of the clothes and body. In the experimental section, we show that this coarse-to-fine strategy represents a very good trade-off between shape detail and pose correctness, comparing favorably to the most recent state-of-the-art approaches. Our code will be made publicly available.

Paper Nr: 254
Title:

Deep Depth Completion of Low-cost Sensor Indoor RGB-D using Euclidean Distance-based Weighted Loss and Edge-aware Refinement

Authors:

Augusto R. Castro, Valdir Grassi Jr. and Moacir A. Ponti

Abstract: Low-cost depth-sensing devices can provide real-time depth maps to many applications, such as robotics and augmented reality. However, due to physical limitations in the acquisition process, the depth map obtained can present missing areas corresponding to irregular, transparent, or reflective surfaces. Therefore, when there is more computing power available than just the embedded processor in low-cost depth sensors, models developed to complete depth maps can boost the system's performance. To exploit the generalization capability of deep learning models, we propose a method composed of a U-Net followed by a refinement module to complete depth maps provided by Microsoft Kinect. We applied the Euclidean distance transform in the loss function to increase the influence of missing pixels when adjusting our network filters and to reduce blur in predictions. We outperform state-of-the-art methods for depth map completion on a benchmark dataset. Our novel loss function combining the distance transform, gradient and structural similarity measure presents promising results in guiding the model to reduce unnecessary blurring of the final depth maps predicted by a convolutional network.
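
As a rough sketch of the distance-transform weighting idea, the code below derives per-pixel weights from the Euclidean distance transform of the missing-depth mask and applies them to an L1 term; the exact weighting and the combination with gradient and structural-similarity terms used in the paper may differ.

```python
import torch
from scipy.ndimage import distance_transform_edt

def edt_weighted_l1(pred, target, raw_depth):
    """Weighted L1 loss emphasizing pixels far from any valid raw-depth measurement.

    pred, target: (H, W) predicted and ground-truth depth tensors.
    raw_depth: (H, W) sensor depth with 0 at missing pixels."""
    missing = (raw_depth == 0).cpu().numpy()
    dist = distance_transform_edt(missing)          # distance to the nearest valid pixel
    weights = 1.0 + dist / (dist.max() + 1e-8)      # larger weight inside large holes
    weights = torch.from_numpy(weights).to(pred)    # match dtype and device of pred
    return (weights * (pred - target).abs()).mean()
```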

Paper Nr: 260
Title:

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Authors:

Kishaan Jeeveswaran, Senthilkumar Kathiresan, Arnav Varma, Omar Magdy, Bahram Zonooz and Elahe Arani

Abstract: Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance in challenging tasks such as object detection and semantic segmentation. However, the image processing mechanism of VTs is different from that of conventional CNNs. This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as a feature extractor in object detection and semantic segmentation. Our extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks in both tasks, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that VTs in dense prediction tasks produce more reliable and less texture biased predictions.

Paper Nr: 263
Title:

Unsupervised Image Decomposition with Phase-Correlation Networks

Authors:

Angel Villar-Corrales and Sven Behnke

Abstract: The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason and act in their surroundings. Recently, different methods have been proposed to learn object-centric representations from data in an unsupervised manner. These methods often rely on latent representations learned by deep neural networks, hence requiring high computational costs and large amounts of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components, which are represented as transformed versions of a set of learned object prototypes. The core building block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation of the images in order to estimate the transformation between an object prototype and its transformed version in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsupervised object discovery and segmentation on simple benchmark datasets and on more challenging data, while using a small number of learnable parameters and being fully interpretable. Code and models to reproduce our experiments can be found in https://github.com/AIS-Bonn/Unsupervised-Decomposition-PCDNet.

Paper Nr: 271
Title:

Road Scene Analysis: A Study of Polarimetric and Color-based Features under Various Adverse Weather Conditions

Authors:

Rachel Blin, Samia Ainouz, Stéphane Canu and Fabrice Meriaudeau

Abstract: Autonomous vehicles and ADAS systems require a reliable road scene analysis to guarantee road users’ safety. While most autonomous systems provide accurate road object detection in good weather conditions, there are still improvements to be made when visibility is altered. Polarimetric features combined with color-based ones have shown great performance in enhancing road scene analysis under fog. The question remains whether these results generalize to other adverse weather situations. To this end, this work experimentally compares the behaviour of the polarimetric intensities, the polarimetric Stokes parameters and the RGB images, as well as their combination, in different fog densities and under tropical rain. The different detection tasks show a significant improvement when using a relevant fusion scheme and feature combination in all the studied adverse weather situations. The obtained results are encouraging regarding the use of polarimetric features to enhance road scene analysis under a wide range of adverse weather conditions.

Paper Nr: 283
Title:

Event Data Downscaling for Embedded Computer Vision

Authors:

Amélie Gruel, Jean Martinet, Teresa Serrano-Gotarredona and Bernabé Linares-Barranco

Abstract: Event cameras (or silicon retinas) represent a new kind of sensor that measure pixel-wise changes in brightness and output asynchronous events accordingly. This novel technology allows for a sparse and energy-efficient recording and storage of visual information. While this type of data is sparse by definition, the event flow can be very high, up to 25M events per second, which requires significant processing resources to handle and therefore impedes embedded applications. Neuromorphic computer vision and event sensor based applications are receiving an increasing interest from the computer vision community (classification, detection, tracking, segmentation, etc.), especially for robotics or autonomous driving scenarios. Downscaling event data is an important feature in a system, especially if embedded, so as to be able to adjust the complexity of the data to the available resources such as processing capability and power consumption. To the best of our knowledge, this work is the first attempt to formalize event data downscaling. In order to study the impact of spatial resolution downscaling, we compare several features of the resulting data, such as the total number of events, event density, information entropy, computation time and optical consistency, as assessment criteria. Our code is available online at https://github.com/amygruel/EvVisu.

Short Papers
Paper Nr: 26
Title:

Analysis of the Future Potential of Autoencoders in Industrial Defect Detection

Authors:

Sarah Schneider, Doris Antensteiner, Daniel Soukup and Matthias Scheutz

Abstract: We investigated the anomaly detection behaviour of three convolutional autoencoder types - a “standard” convolutional autoencoder (CAE), a variational convolutional autoencoder (VAE) and an adversarial convolutional autoencoder (AAE) - by applying them to different visual anomaly detection scenarios. First, we utilized our three autoencoder types to detect anomalous regions in two synthetically generated datasets. To investigate the convolutional autoencoders’ defect detection performances “in the industrial wild”, we applied the models on quality inspection images of non-defective and defective material regions. We compared the performances of all three autoencoder types based on their ability to detect anomalies and captured the training complexity by measuring the time needed for training them. Although the CAE is the simplest model, the trained model performed nearly as well as the more sophisticated autoencoder types, which depend on more complex training processes. For data that lacks regularity or shows purely stochastic patterns, all our autoencoders failed to compute meaningful results.

Paper Nr: 29
Title:

Segmentation Improves 3D Object Classification in Graph Convolutional Networks

Authors:

Clara Holzhüter, Florian Teich and Florentin Wörgötter

Abstract: 3D object classification is involved in many computer vision pipelines such as autonomous driving or robotics. However, the irregular format of 3D data makes it challenging to develop suitable deep learning architectures. This paper proposes CompointNet, a graph convolutional network architecture, which performs 3D object classification by means of part decomposition. Our model consumes a 3D point cloud in the form of a part graph which is constructed from segmented 3D shapes. The model learns a global descriptor by hierarchically aggregating neighbourhood information using simple graph convolutions. To capture both local and global information, a global classification method processing each point separately is combined with our part graph based approach into a hybrid version of CompointNet. We compare our approach to several state-of-the-art methods and demonstrate competitive performance. Particularly, in terms of per-class accuracy, our hybrid approach outperforms the compared methods. The proposed hybrid variants achieve a high classification accuracy, while being much more efficient than those benchmark models with a comparable performance. The conducted experiments show that part-based approaches leveraging structural information about a 3D object can indeed improve the classification performance of 3D deep learning models.

Paper Nr: 49
Title:

Deep Video Frame Rate Up-conversion Network using Feature-based Progressive Residue Refinement

Authors:

Jinglei Shi, Xiaoran Jiang and Christine Guillemot

Abstract: In this paper, we propose a deep learning-based network for video frame rate up-conversion (or video frame interpolation). The proposed optical flow-based pipeline employs deep features extracted to learn residue maps for progressively refining the synthesized intermediate frame. We also propose a procedure for fine-tuning the optical flow estimation module using frame interpolation datasets, which does not require ground truth optical flows. This procedure is effective to obtain interpolation task-oriented optical flows and can be applied to other methods utilizing a deep optical flow estimation module. Experimental results demonstrate that our proposed network performs favorably against state-of-the-art methods both in terms of qualitative and quantitative measures.

Paper Nr: 60
Title:

Continuous Perception for Classifying Shapes and Weights of Garments for Robotic Vision Applications

Authors:

Li Duan and Gerardo Aragon-Camarasa

Abstract: We present an approach to continuous perception for robotic laundry tasks. Our assumption is that the visual prediction of a garment’s shapes and weights is possible via a neural network that learns the dynamic changes of garments from video sequences. Continuous perception is leveraged during training by inputting consecutive frames, from which the network learns how a garment deforms. To evaluate our hypothesis, we captured a dataset of 40K RGB and depth video sequences while a garment is being manipulated. We also conducted ablation studies to understand whether the neural network learns the physical properties of garments. Our findings suggest that a modified AlexNet-LSTM architecture has the best classification performance for the garment’s shapes and discretised weights. To further provide evidence for continuous perception, we evaluated our network on unseen video sequences and computed the ’Moving Average’ over a sequence of predictions. We found that our network has a classification accuracy of 48% and 60% for shapes and weights of garments, respectively.
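
As an illustration of the 'Moving Average' evaluation described above, the sketch below smooths per-frame class probabilities over a sliding window before taking the arg-max; the window length is a placeholder, not the value used in the paper.

```python
import numpy as np

def moving_average_predictions(frame_probs, window=5):
    """frame_probs: (T, n_classes) per-frame class probabilities from the network.
    Returns the running class decision after averaging over the last `window` frames."""
    frame_probs = np.asarray(frame_probs)
    smoothed = np.stack([frame_probs[max(0, t - window + 1):t + 1].mean(axis=0)
                         for t in range(len(frame_probs))])
    return smoothed.argmax(axis=1)
```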

Paper Nr: 73
Title:

Aerial to Street View Image Translation using Cascaded Conditional GANs

Authors:

Kshitij Singh, Alexia Briassouli and Mirela Popa

Abstract: Cross view image translation is a challenging case of viewpoint translation which involves generating the street view image when the aerial view image is given and vice versa. As there is no overlap in the two views, a single stage generation network fails to capture the complex scene structure of objects in these two views. Our work aims to tackle the task of generating street level view images from aerial view images on the benchmarking CVUSA dataset by a cascade pipeline consisting of three smaller stages: street view image generation, semantic segmentation map generation, and image refinement, trained together in a constrained manner in a Conditional GAN (CGAN) framework. Our contributions are twofold: (1) The first stage of our pipeline examines the use of the alternate architectures ResNet and ResUNet++ in a framework similar to the current State-of-the-Art (SoA), leading to useful insights and comparable or improved results in some cases. (2) In the third stage, ResUNet++ is used for the first time for image refinement. U-Net performs the best for street view image generation and semantic map generation as a result of the skip connections between encoders and decoders, while ResUNet++ performs the best for image refinement because of the presence of the attention module in the decoders. Qualitative and quantitative comparisons with existing methods show that our model outperforms all others on the KL Divergence metric and ranks amongst the best for other metrics.

Paper Nr: 87
Title:

Presenting a Novel Pipeline for Performance Comparison of V-PCC and G-PCC Point Cloud Compression Methods on Datasets with Varying Properties

Authors:

Albert Christensen, Daniel Lehotský, Mathias Poulsen and Thomas Moeslund

Abstract: The increasing availability of 3D sensors enables an ever increasing amount of applications to utilize 3D captured content in the form of point clouds. Several promising methods for compressing point clouds have been proposed, but a unified method for evaluating their performance on a wide array of point cloud datasets with different properties is lacking. We propose a pipeline for evaluating the performance of point cloud compression methods on both static and dynamic point clouds. The proposed evaluation pipeline is used to evaluate the performance of MPEG’s G-PCC octree RAHT and MPEG’s V-PCC compression codecs.

Paper Nr: 91
Title:

SPD Siamese Neural Network for Skeleton-based Hand Gesture Recognition

Authors:

Mohamed S. Akremi, Rim Slama and Hedi Tabia

Abstract: This article proposes a new learning method for hand gesture recognition from 3D hand skeleton sequences. We introduce a new deep learning method based on a Siamese network of Symmetric Positive Definite (SPD) matrices. We also propose to use the Contrastive Loss to improve the discriminative power of the network. Experimental results are conducted on the challenging Dynamic Hand Gesture (DHG) dataset. We compared our method to other published approaches on this dataset and obtained the highest performance, with up to 95.60% classification accuracy on 14 gestures and 94.05% on 28 gestures.
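
For reference, the contrastive loss used to train Siamese networks has the standard form sketched below, pulling embeddings of same-class gesture pairs together and pushing different-class pairs at least a margin apart; this is the textbook formulation, not necessarily the exact variant used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """emb_a, emb_b: (N, D) embeddings from the two branches of the Siamese network.
    same_class: (N,) float tensor, 1.0 for pairs of the same gesture class, 0.0 otherwise."""
    d = F.pairwise_distance(emb_a, emb_b)
    pos = same_class * d.pow(2)                            # pull positives together
    neg = (1.0 - same_class) * F.relu(margin - d).pow(2)   # push negatives beyond the margin
    return 0.5 * (pos + neg).mean()
```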

Paper Nr: 99
Title:

Automatic Transcription System for Nutritional Information Charts of Spanish Food Products

Authors:

José M. Fuentes, Roberto Paredes, Elena Fulladosa, María M. Giró and Anna Claret

Abstract: The labeling of food products contains key nutritional information, but it is often inaccessible or unclear to users. To alleviate this problem, the application of modern automatic transcription techniques to this field is studied in this paper. This presents a challenge, due to the structural difference of these charts with respect to the usual type of documents for which OCR systems are developed, and also because of the wide visual variability present in this type of labels. For these reasons, a series of algorithms and deep learning models have been developed and applied as pre-processing for the images and post-processing for the transcription obtained, in order to optimize and complement this automatic transcription. With this whole pipeline, we are able to extract the nutritional information from the pictures in an efficient, complete, accurate and structured way.

Paper Nr: 102
Title:

MdVRNet: Deep Video Restoration under Multiple Distortions

Authors:

Claudio Rota and Marco Buzzelli

Abstract: Video restoration techniques aim to remove artifacts, such as noise, blur, and compression, introduced at various levels within and outside the camera imaging pipeline during video acquisition. Although excellent results can be achieved by considering one artifact at a time, in real applications a given video sequence can be affected by multiple artifacts, whose appearance is mutually influenced. In this paper, we present Multi-distorted Video Restoration Network (MdVRNet), a deep neural network specifically designed to handle multiple distortions simultaneously. Our model includes an original Distortion Parameter Estimation sub-Network (DPEN) to automatically infer the intensity of various types of distortions affecting the input sequence, novel Multi-scale Restoration Blocks (MRB) to extract complementary features at different scales using two parallel streams, and implements a two-stage restoration process to focus on different levels of detail. We document the accuracy of the DPEN module in estimating the intensity of multiple distortions, and present an ablation study that quantifies the impact of the DPEN and MRB modules. Finally, we show the advantages of the proposed MdVRNet in a direct comparison with another existing state-of-the-art approach for video restoration. The code is available at https://github.com/claudiom4sir/MdVRNet.

Paper Nr: 106
Title:

3GAN: A Three-GAN-based Approach for Image Inpainting Applied to the Reconstruction of Occluded Parts of Building Walls

Authors:

Benedikt Kottler, Ludwig List, Dimitri Bulatov and Martin Weinmann

Abstract: Realistic representation of building walls from images is an important aspect of scene understanding and has many applications. Often, images of buildings are the only input for texturing 3D models, and these images may be occluded by vegetation. One task of image inpainting is to remove these clutter objects. Since the disturbing objects can also be of a larger scale, modern deep learning techniques should be applied to replace them as realistically and context-aware as possible. To support an inpainting network, it is useful to include a-priori information. An example of a network that considers edge images is the two-stage GAN model denoted as EdgeConnect. This idea is taken up in this work and further developed to a three-stage GAN (3GAN) model for façade images by additionally incorporating semantic label images. By inpainting the label images, not only a clear geometric structure but also class information, like the position and shape of windows and their typical color distribution, is provided to the model. This model is compared qualitatively and quantitatively with the conventional version of EdgeConnect and another well-known deep-learning-based inpainting approach based on partial convolutions. The latter approach was outperformed by both GAN-based methods, both qualitatively and quantitatively. While the quantitative evaluation showed that the conventional EdgeConnect method performs marginally better, the proposed method yields a slightly better representation of specific façade elements.

Paper Nr: 108
Title:

Estimating Perceived Comfort in Virtual Humans based on Spatial and Spectral Entropy

Authors:

Greice D. Molin, Victor A. Araujo and Soraia R. Musse

Abstract: Nowadays, we are increasingly exposed to applications with conversational agents or virtual humans. In the psychology literature, the perception of human faces is a well-studied research area. In past years, many works have investigated human perception concerning virtual humans. The sense of discomfort perceived in certain virtual characters, discussed in the Uncanny Valley (UV) theory, can be a key factor in our perceptual and cognitive discrimination. Understanding how this process happens is essential to avoid it in the process of modeling virtual humans. This paper investigates the relationship between image features and the comfort that human beings can feel about the animated characters created using Computer Graphics (CG). We introduce the CCS (Computed Comfort Score) metric to estimate the probable comfort/discomfort value that a particular virtual human face can generate in the subjects. We used local spatial and spectral entropy to extract features and show their relevance to the subjects’ evaluation. A model using Support Vector Regression (SVR) is proposed to compute the CCS. The results indicated an accuracy of approximately 80% for the tested images when compared with the perceptual data.

Paper Nr: 110
Title:

Monte-Carlo Convolutions on Foveated Images

Authors:

George Killick, Gerardo Aragon-Camarasa and J. P. Siebert

Abstract: Foveated vision captures a visual scene at space-variant resolution. This makes the application of parameterized convolutions to foveated images difficult, as they do not have a dense-grid representation in Cartesian space. Log-polar space is frequently used to create a dense grid representation of foveated images, however this image representation may not be appropriate for all applications. In this paper we rephrase the convolution operation as the Monte-Carlo estimation of the filter response of the foveated image and a continuous filter kernel, an idea that has seen frequent use for deep learning on point clouds. We subsume our convolution operation into a simple CNN architecture that processes foveated images in Cartesian space. We evaluate our system in the context of image classification and show that our approach significantly outperforms an equivalent CNN processing a foveated image in log-polar space.
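
As a minimal illustration of the idea, the sketch below estimates one filter response at an output location by averaging a continuous kernel evaluated at irregular retinal sample positions; the kernel parameterization and normalization are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def mc_conv_response(sample_xy, sample_values, center, kernel_fn, radius):
    """Monte-Carlo estimate of a filter response on irregularly sampled pixels.

    sample_xy: (N, 2) sample positions, sample_values: (N, C) sample intensities,
    center: (2,) output location, radius: kernel support radius,
    kernel_fn: continuous kernel mapping a normalized offset (2,) to (C,) weights."""
    offsets = (sample_xy - center) / radius
    inside = np.linalg.norm(offsets, axis=1) <= 1.0       # samples under the kernel support
    if not inside.any():
        return np.zeros(sample_values.shape[1])
    weights = np.stack([kernel_fn(o) for o in offsets[inside]])  # (M, C)
    return (weights * sample_values[inside]).sum(axis=0) / inside.sum()
```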

Paper Nr: 113
Title:

Towards Full-to-Empty Room Generation with Structure-aware Feature Encoding and Soft Semantic Region-adaptive Normalization

Authors:

Vasileios Gkitsas, Nikolaos Zioulis, Vladimiros Sterzentsenko, Alexandros Doumanoglou and Dimitrios Zarpalas

Abstract: The task of transforming a furnished room image into a background-only image is extremely challenging since it requires making large changes regarding the scene context while still preserving the overall layout and style. In order to acquire a photo-realistic and structurally consistent background, existing deep learning methods either employ image inpainting approaches or incorporate the learning of the scene layout as an individual task and leverage it later in a not fully differentiable semantic region-adaptive normalization module. To tackle these drawbacks, we treat scene layout generation as a feature linear transformation problem and propose a simple yet effective adjusted fully differentiable soft semantic region-adaptive normalization (softSEAN) block. We showcase the applicability in diminished reality and depth estimation tasks, where our approach, besides the advantages of mitigating training complexity and non-differentiability issues, surpasses the compared methods both quantitatively and qualitatively. Our softSEAN block can be used as a drop-in module for existing discriminative and generative models.

Paper Nr: 123
Title:

Mask R-CNN Applied to Quasi-particle Segmentation from the Hybrid Pelletized Sinter (HPS) Process

Authors:

Natália C. Meira, Mateus C. Silva, Andrea C. Bianchi, Cláudio B. Vieira, Alinne Souza, Efrem Ribeiro, Roberto O. Junior and Ricardo R. Oliveira

Abstract: Particle size is an important quality parameter for raw materials in steel industry processes. In this work, we propose to implement the Mask R-CNN algorithm to segment quasi-particles by size classes. We created a dataset with real images of an industrial environment, labeled the quasi-particles by size classes, and performed four training sessions by adjusting the model’s hyperparameters. The results indicated that the model produces segmentations with well-defined edges and assigns the classes correctly. We obtained a mAP between 0.2333 and 0.2585. Additionally, hit and detection rates increase for larger particle size classes.

Paper Nr: 134
Title:

Image Prefiltering in DeepFake Detection

Authors:

Szymon Motłoch, Mateusz Szczygielski and Grzegorz Sarwas

Abstract: Artificial intelligence, as it becomes a common technology, creates many new possibilities and dangers. An example is open-source applications that enable swapping faces in images or videos with faces delivered from other sources. This type of modification is named DeepFake. Since the human eye cannot detect DeepFake, it is crucial to possess a mechanism that would detect such changes. This paper analyses a solution based on Spatial Rich Models (SRM) for image prefiltering, connected to the VGG16 convolutional neural network, to improve DeepFake detection with neural networks. For DeepFake detection, a fractional order spatial rich model (FoSRM) is proposed and compared with the classical SRM filter and integer order derivative operators. In the experiment, we used two different methods of approximating the fractional order derivative: the first based on a mask and the second using the Fast Fourier Transform (FFT). We also compare the achieved results with the original ones and with the VGG16 network extended by an additional layer that selects the parameters of the prefiltering mask automatically. As a result of this work, we question the legitimacy of using additional image enrichment by prefiltering when using a convolutional neural network. The additional network layer gave us the best results among the performed experiments.

Paper Nr: 141
Title:

Learn by Guessing: Multi-step Pseudo-label Refinement for Person Re-Identification

Authors:

Tiago G. Pereira and Teofilo E. de Campos

Abstract: Unsupervised Domain Adaptation (UDA) methods for person Re-Identification (Re-ID) rely on target domain samples to model the marginal distribution of the data. To deal with the lack of target domain labels, UDA methods leverage information from labeled source samples and unlabeled target samples. A promising approach relies on the use of unsupervised learning as part of the pipeline, such as clustering methods. The quality of the clusters clearly plays a major role in the methods’ performance, but this point has been overlooked. In this work, we propose a multi-step pseudo-label refinement method to select the best possible clusters and keep improving them so that these clusters become closer to the class divisions without knowledge of the class labels. Our refinement method includes a cluster selection strategy and a camera-based normalization method which reduces the within-domain variations caused by the use of multiple cameras in person Re-ID. This allows our method to reach state-of-the-art UDA results on DukeMTMC→Market1501 (source→target). We surpass state-of-the-art results for UDA Re-ID by 3.4% on Market1501→DukeMTMC, which is a more challenging adaptation setup because the target domain (DukeMTMC) has eight distinct cameras. Furthermore, the camera-based normalization method causes a significant reduction in the number of iterations required for training convergence.
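
As a simple sketch of what a camera-based normalization can look like, the code below standardizes re-ID features per camera so that clusters reflect identity rather than camera-specific statistics; the actual normalization used in the paper may differ.

```python
import numpy as np

def camera_normalize(features, camera_ids):
    """features: (N, D) re-ID feature vectors, camera_ids: (N,) integer camera labels.
    Returns features standardized with the mean and std of their own camera."""
    out = features.astype(np.float64).copy()
    for cam in np.unique(camera_ids):
        idx = camera_ids == cam
        mu = out[idx].mean(axis=0)
        sigma = out[idx].std(axis=0) + 1e-8
        out[idx] = (out[idx] - mu) / sigma
    return out
```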

Paper Nr: 164
Title:

Can Super Resolution Improve Human Pose Estimation in Low Resolution Scenarios?

Authors:

Peter Hardy, Srinandan Dasmahapatra and Hansung Kim

Abstract: The results obtained from state-of-the-art human pose estimation (HPE) models degrade rapidly when evaluating people at a low resolution, but can super resolution (SR) be used to help mitigate this effect? By using various SR approaches we enhanced two low resolution datasets and evaluated the change in performance of both an object and keypoint detector as well as end-to-end HPE results. We make the following observations. First, we find that for people who were originally depicted at a low resolution (segmentation area in pixels), their keypoint detection performance would improve once SR was applied. Second, the keypoint detection performance gained is dependent on that person's pixel count in the original image prior to any application of SR; keypoint detection performance was improved when SR was applied to people with a small initial segmentation area, but degrades as this becomes larger. To address this we introduced a novel Mask-RCNN approach, utilising a segmentation area threshold to decide when to use SR during the keypoint detection step. This approach achieved the best results on our low resolution datasets for each HPE performance metric.
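
The thresholded pipeline can be summarized by the small sketch below, where a person crop is upscaled only if its segmentation area falls below a threshold before keypoint detection; the threshold value and the model handles are placeholders, not the paper's components.

```python
def keypoints_with_adaptive_sr(person_crop, mask_area, sr_model, keypoint_model,
                               area_threshold=32 * 32):
    """Apply super resolution only to low-resolution people, then detect keypoints.

    mask_area: segmentation area of the person in pixels.
    sr_model, keypoint_model: callables standing in for the SR and keypoint networks."""
    if mask_area < area_threshold:
        person_crop = sr_model(person_crop)   # enhance only small (low-resolution) people
    return keypoint_model(person_crop)
```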

Paper Nr: 169
Title:

Wavelet Transform for the Analysis of Convolutional Neural Networks in Texture Recognition

Authors:

Joao B. Florindo

Abstract: Convolutional neural networks have become omnipresent in applications of image recognition during the last years. However, when it comes to texture analysis, classical techniques developed before the popularity of deep learning have demonstrated potential to boost the performance of these networks, especially when they are employed as feature extractors. Given this context, here we propose a novel method to analyze feature maps of a convolutional network by wavelet transform. In the first step, we compute the detail coefficients from the activation response on the penultimate layer. In the second one, a one-dimensional version of local binary patterns is computed over the details to provide a local description of the frequency distribution. The frequency analysis accomplished by wavelets has been reported to be related to the learning process of the network. Wavelet details capture finer features of the image without increasing the number of training epochs, which is not possible in feature extractor mode. This process also attenuates the over-fitting effect while preserving the computational efficiency of feature extraction. Wavelet details are also directly related to fractal dimension, an important feature of textures that has also recently been found to be related to generalization capabilities. The proposed methodology was evaluated on the classification of benchmark databases as well as on a real-world problem (identification of plant species), outperforming the accuracy of the original architecture and of several other state-of-the-art approaches.
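
A minimal sketch of the two-step descriptor is given below: wavelet detail coefficients are computed from a flattened activation vector and then summarized by a simple one-dimensional LBP-style code comparing each coefficient with its two neighbours; the specific LBP variant and histogram size are assumptions for illustration.

```python
import numpy as np
import pywt

def wavelet_lbp_descriptor(activation, wavelet="db1"):
    """activation: 1D (or flattened) activation response of the penultimate layer.
    Returns a small normalized histogram of local binary codes over wavelet details."""
    _, details = pywt.dwt(np.ravel(activation), wavelet)   # detail (high-frequency) coefficients
    left, centre, right = details[:-2], details[1:-1], details[2:]
    codes = (left >= centre).astype(int) + 2 * (right >= centre).astype(int)
    hist, _ = np.histogram(codes, bins=4, range=(0, 4), density=True)
    return hist
```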

Paper Nr: 175
Title:

U-Net-based DFU Tissue Segmentation and Registration on Uncontrolled Dermoscopic Images

Authors:

Yanexis Toledo, Leandro F. Fernandes, Silena Herold-Garcia and Alexis P. Quesada

Abstract: Diabetic Foot Ulcers (DFUs) are aggressive wounds with high morbimortality due to their slow healing capacity and rapid tissue degeneration, which cause complications such as infection, gangrene, and amputation. The automatic analysis of the evolution of tissues associated with DFU allows the quick identification and treatment of possible complications. In this paper, our contribution is twofold. First, we present a new DFU dataset composed of 222 images labeled by specialists. The images followed the healing process of patients undergoing an experimental treatment and were captured under uncontrolled viewpoint and illumination conditions. To the best of our knowledge, this is the first DFU dataset whose images include the identification of the background and six different classes of tissues. The second contribution is a U-Net-based segmentation and registration procedure that uses features computed by hidden layers of the network and epipolar constraints to identify pixelwise correspondences between images of the same patient at different healing stages.

Paper Nr: 181
Title:

Blind Projection-based 3D Point Cloud Quality Assessment Method using a Convolutional Neural Network

Authors:

Salima Bourbia, Ayoub Karine, Aladine Chetouani and Mohammed El Hassouni

Abstract: In recent years, 3D point clouds have experienced rapid growth in various fields of computer vision, increasing the demand for efficient approaches to automatically assess the quality of 3D point clouds. In this paper, we propose a blind point cloud quality assessment method based on deep learning that takes an input point cloud object and predicts its quality score. The proposed approach starts with projecting each 3D point cloud into rendering views (2D images). It then feeds these views to a deep convolutional neural network (CNN) to obtain the perceptual quality scores. In order to accurately predict the quality score, we use transfer learning to exploit the high potential of VGG-16, which is a classification model trained on the ImageNet database. We evaluate the performance of our model on two benchmark databases: ICIP2020 and SJTU. Based on the analysis of the results, our model shows a strong correlation between the predicted and the subjective quality scores, showing promising results and outperforming the state-of-the-art point cloud quality assessment models.

Paper Nr: 185
Title:

DeTracker: A Joint Detection and Tracking Framework

Authors:

Juan G. Zuniga, Ujjwal and François Bremond

Abstract: We propose a unified network for simultaneous detection and tracking. Instead of basing the tracking framework on object detections, we focus our work directly on tracklet detection while still obtaining object detections. We take advantage of the spatio-temporal information and features from 3D CNNs and output a series of bounding boxes and their corresponding identifiers with the use of Graph Convolutional Neural Networks. In contrast to traditional tracking-by-detection methods, the major advantages of our formulation are the creation of more reliable tracklets, the enforcement of temporal consistency, and the absence of a data association mechanism for a given set of frames. We introduce DeTracker, a truly joint detection and tracking network. We enforce intra-batch temporal consistency of features with a triplet loss over our tracklets, guiding the features of tracklets with different identities to be clustered separately in the feature space. Our approach is demonstrated on two different datasets, including natural images and synthetic images, and we obtain 58.7% on MOT and 56.79% on a subset of the JTA dataset.
Download

Paper Nr: 189
Title:

Automatic Estimation of Anthropometric Human Body Measurements

Authors:

Dana Škorvánková, Adam Riečický and Martin Madaras

Abstract: Research tasks related to human body analysis have been drawing a lot of attention in the computer vision area over the last few decades, considering their potential benefits in our day-to-day life. Anthropometry is the field that defines the physical measures of human body size, form, and functional capacity. In particular, the accurate estimation of anthropometric body measurements from visual human body data is a challenging problem whose solution would benefit many application areas, including ergonomics and garment manufacturing. This paper formulates research in the field of deep learning and neural networks to tackle the challenge of body measurement estimation from various types of visual input data (such as 2D images or 3D point clouds). We also address the lack of real human data annotated with ground-truth body measurements required for training and evaluation by generating a synthetic dataset of various human body shapes and performing a skeleton-driven annotation.
Download

Paper Nr: 190
Title:

Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scan

Authors:

Lukáš Gajdošech, Viktor Kocur, Martin Stuchlík, Lukáš Hudec and Martin Madaras

Abstract: An automated robotic system needs to be as robust as possible and fail-safe in general while having relatively high precision and repeatability. Although deep learning-based methods are becoming the research standard for 3D scan and image processing tasks, the industry standard for processing this data is still analytical. Our paper argues that analytical methods are less robust and harder to test, update, and maintain. This paper focuses on the specific task of 6D pose estimation of a bin in 3D scans. We therefore present a high-quality dataset composed of synthetic data and real scans captured by a structured-light scanner with precise annotations. Additionally, we propose two different methods for 6D bin pose estimation: an analytical method representing the industrial standard and a baseline data-driven method. Both approaches are cross-evaluated, and our experiments show that augmenting the training on real scans with synthetic data improves our proposed data-driven neural model. This position paper is preliminary, as the proposed methods are trained and evaluated on a relatively small initial dataset, which we plan to extend in the future.
Download

Paper Nr: 194
Title:

ADAS Classifier for Driver Monitoring and Driving Qualification using Both Internal and External Vehicle Data

Authors:

Rafael A. Berri, Diego R. Bruno, Eduardo Borges, Giancarlo Lucca and Fernando S. Osorio

Abstract: In this paper, we present an innovative safety system for monitoring the driver and the quality with which a vehicle is being controlled by a human driver. The main objective of this work is to detect human failures in the driving task and to improve the prediction of such failures. We use 3D information about the driver’s posture as well as the vehicle’s behavior on the road. Our proposal is able to act when inappropriate human behaviors are detected, applying a set of automatic routines to minimize their consequences. It is also possible to produce safety alarms/warnings in order to re-educate the driver to maintain good posture practices and to avoid dangerous driving, using only a few seconds (2.5 s) of captured data. This can help to improve traffic and driver education and contributes to the reduction of accidents. When a highly dangerous behavior/situation is detected, using 140 seconds of recorded data, an autonomous parking system is activated, parking the vehicle in a safe position. We present new Machine Learning-based classifiers for ADAS (Advanced Driver Assistance Systems). Our classifiers are based on Artificial Neural Networks (ANNs), and the values used to adjust the input features, the neuron activation functions, and the network topology/training parameters were optimized and selected using a Genetic Algorithm. The proposed system achieved an accuracy of 79.65% across different alarm levels (short and long term) for the joint detection of risk in situations of cellphone usage, drunkenness, or regular driving. Only 1.8% of normal situations receive wrong predictions (false positive alarms) on frames of the Naturalistic Driver Behavior Dataset, contributing to the driver’s comfort when using the system. In the near future we aim to improve these results even further.
Download

Paper Nr: 203
Title:

Can We Use Neural Regularization to Solve Depth Super-resolution?

Authors:

Milena Gazdieva, Oleg Voynov, Alexey Artemov, Youyi Zheng, Luiz Velho and Evgeny Burnaev

Abstract: Depth maps captured with commodity sensors often require super-resolution to be used in applications. In this work we study a super-resolution approach based on a variational problem statement with Tikhonov regularization where the regularizer is parametrized with a deep neural network. This approach was previously applied successfully in photoacoustic tomography. We experimentally show that its application to depth map super-resolution is difficult, and provide suggestions about the reasons for that.
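For context, the following is a minimal PyTorch sketch of the general problem statement discussed here: a data term on the downsampled estimate plus a Tikhonov-style regularizer parametrized by a neural network. The downsampling operator, the regularizer network, and all weights are hypothetical placeholders and not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def super_resolve(depth_lr, regularizer_net, scale=4, lam=0.1, steps=200, lr=1e-2):
    """Variational depth super-resolution sketch: minimize a data term plus a
    learned Tikhonov-style regularizer over the high-resolution depth map.
    depth_lr: (N, 1, h, w) low-resolution depth tensor."""
    h, w = depth_lr.shape[-2:]
    depth_hr = F.interpolate(depth_lr, scale_factor=scale, mode="bilinear",
                             align_corners=False).clone().requires_grad_(True)
    opt = torch.optim.Adam([depth_hr], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Data term: the downsampled estimate should match the observation.
        down = F.interpolate(depth_hr, size=(h, w), mode="bilinear",
                             align_corners=False)
        data_term = F.mse_loss(down, depth_lr)
        # Tikhonov-style regularizer: squared norm of a network response.
        reg_term = regularizer_net(depth_hr).pow(2).mean()
        (data_term + lam * reg_term).backward()
        opt.step()
    return depth_hr.detach()
```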
Download

Paper Nr: 208
Title:

Robust Underwater Visual Graph SLAM using a Siamese Neural Network and Robust Image Matching

Authors:

Antoni Burguera

Abstract: This paper proposes a fast method to robustly perform Visual Graph SLAM in underwater environments. Since Graph SLAM is not resilient to wrong loop detections, the key of our proposal is the Visual Loop Detector, which operates in two steps. First, a lightweight Siamese Neural Network performs a fast check to discard non-loop-closing image pairs. Second, a RANSAC-based algorithm exhaustively analyzes the remaining image pairs and filters out those that do not close a loop. The accepted image pairs are then introduced as new graph constraints that are used during the graph optimization. By executing RANSAC only on a previously filtered set of images, the gain in speed is considerable. The experimental results, which evaluate each component separately as well as the whole Visual Graph SLAM system, show the validity of our proposal in terms of the quality of the detected loops, the error of the resulting trajectory, and the execution time.
Download

Paper Nr: 210
Title:

Recovering High Intensity Images from Sequential Low Light Images

Authors:

Masahiro Hayashi, Fumihiko Sakaue, Jun Sato, Yoshiteru Koreeda, Masakatsu Higashikubo and Hidenori Yamamoto

Abstract: In this paper, we propose a method for recovering high intensity images from degraded low intensity images taken in low light. In particular, we show that by using a sequence of low light images, the high intensity image can be generated more accurately. To use the sequence of images, we have to deal with moving objects in the scene. We combine multiple networks to generate accurate high intensity images in the presence of moving objects. We also introduce a newly defined loss, called the character recognition loss, to obtain more accurate high intensity images.
Download

Paper Nr: 212
Title:

AutoCNN-MSCD: An Autodesigned CNN Framework for Detecting Multi-skin Cancer Diseases over Dermoscopic Images

Authors:

Robert Brodin, Palawat Busaranuvong and Chun-Kit Ngan

Abstract: We enhance and customize the automatically evolving genetic-based CNN (AE-CNN) framework to develop an auto-designed CNN (AutoCNN) pipeline that dynamically generates an optimal CNN model to assist physicians in detecting multi-skin cancer diseases (MSCD) in dermoscopic images. Specifically, the contributions of this work are three-fold: (1) integrating a pre-processing module into the existing AE-CNN framework to sanitize and diversify dermoscopic images; (2) enhancing the evaluation algorithm of the framework to improve the model selection process by using k-fold cross-validation; and (3) conducting an experimental study showing that the CNN model constructed by AutoCNN outperforms the model produced by AE-CNN in detecting and classifying MSCD.
Download

Paper Nr: 214
Title:

From Explanations to Segmentation: Using Explainable AI for Image Segmentation

Authors:

Clemens Seibold, Johannes Künzel, Anna Hilsmann and Peter Eisert

Abstract: The new era of image segmentation leveraging the power of Deep Neural Networks (DNNs) comes with a price tag: to train a neural network for pixel-wise segmentation, a large number of training samples has to be manually labeled with pixel precision. In this work, we address this with an indirect solution. We build upon the advances of the Explainable AI (XAI) community and extract a pixel-wise binary segmentation from the output of Layer-wise Relevance Propagation (LRP), which explains the decision of a classification network. We show that we achieve results similar to an established U-Net segmentation architecture, while the generation of the training data is significantly simplified. The proposed method can be trained in a weakly supervised fashion, as the training samples only need to be labeled at the image level, while still producing a segmentation mask as output. This makes it especially applicable to a wide range of real applications where tedious pixel-level labelling is often not possible.
Download

Paper Nr: 215
Title:

Underwater Image Enhancement by the Retinex Inspired Contrast Enhancer STRESS

Authors:

Michela Lecca

Abstract: Underwater images are often affected by undesired effects, such as noise, color casts, and poor detail visibility, hampering the understanding of the image content. This work proposes to improve the quality of such images by means of STRESS, a Retinex-inspired contrast enhancer originally designed to process general, real-world pictures. STRESS, which is based on a local spatial color processing inspired by the human vision mechanism, is here tested and compared with other approaches on the public underwater image dataset UIEB. The experiments show that in general STRESS remarkably increases the quality of the input image while preserving its local structure. The images enhanced by STRESS are released for free to enable visual inspection, further analysis, and comparisons.
Download

Paper Nr: 220
Title:

Multi-Image Super-Resolution for Thermal Images

Authors:

Rafael E. Rivadeneira, Angel D. Sappa and Boris X. Vintimilla

Abstract: This paper proposes a novel CNN architecture for the multi-thermal-image super-resolution problem. In the proposed scheme, the multiple images are synthetically generated by downsampling and slightly shifting the given image; noise is also added to each of these synthesized images. The proposed architecture uses two attention-block paths to extract high-frequency details, taking advantage of the large amount of information extracted from multiple images of the same scene. Experimental results are provided, showing that the proposed scheme outperforms state-of-the-art approaches.
Download

Paper Nr: 222
Title:

Deep Learning based Object Detection and Tracking for Maritime Situational Awareness

Authors:

Rihab Lahouli, Geert De Cubber, Benoît Pairet, Charles Hamesse, Timothée Fréville and Rob Haelterman

Abstract: Improving real-time situational awareness using deep learning-based video processing is of great interest in maritime and inland waterway environments. For instance, automating the visual analysis for the classification and interpretation of the objects surrounding a vessel remains a critical challenge on the way towards more autonomous navigation systems. The complexity increases dramatically when we address waterway environments, with denser traffic than the open sea and with navigation marks that need to be detected and correctly understood to make correct decisions. In this paper, we therefore propose a new training dataset tailored to navigation and mooring in waterway environments. The dataset contains 827 representative images gathered in various Belgian waterways. The images were captured on board a navigating barge and from a camera mounted on a drone. The dataset covers a range of realistic traffic and weather conditions. In the current study, we investigate the training of the YOLOv5 model for the detection of seven different classes corresponding to vessels, obstacles, and different navigation marks. The detector is combined with a pretrained Deep Sort tracker. The trained YOLOv5 model reaches an overall mean average precision of 0.891 at an intersection-over-union threshold of 0.5.
Download

Paper Nr: 228
Title:

Improving the Efficiency of Autoencoders for Visual Defect Detection with Orientation Normalization

Authors:

Richárd Rádli and László Czúni

Abstract: Autoencoders (AEs) can have an important role in visual inspection since they are capable of unsupervised learning of normal visual appearance and of detecting visual defects as anomalies. Reducing the variability of the incoming structures can result in a more efficient representation in latent space and better reconstruction quality for defect-free inputs. In our paper, we investigate the use of spatial transformer networks (STNs) to improve the efficiency of AEs in reconstruction and defect detection. We found that the simultaneous training of the convolutional layers of the AE and the weights of the STN does not result in satisfactory reconstructions by the decoder. Instead, the STN can be trained to normalize the orientation of the input images. We evaluate the performance of the proposed mechanism on three classes of input patterns using the reconstruction error and standard anomaly detection metrics.
Download

Paper Nr: 229
Title:

Generating High Resolution Depth Image from Low Resolution LiDAR Data using RGB Image

Authors:

Kento Yamakawa, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a GAN that generates a high-resolution depth image from a low-resolution depth image obtained from a low-resolution LiDAR. Our method uses a high-resolution RGB image as a guide image and efficiently generates the high-resolution depth image from the low-resolution depth image using the GAN. The results of the qualitative and quantitative evaluation show the effectiveness of the proposed method.
Download

Paper Nr: 240
Title:

Variational Temporal Optical Flow for Multi-exposure Video

Authors:

Onofre Martorell and Antoni Buades

Abstract: High Dynamic Range (HDR) reconstruction for multi-exposure video sequences is a very challenging task. One of its main steps is the registration of the input frames. We propose a novel variational model for optical flow estimation in multi-exposure video sequences. We introduce data terms for consecutive and non-consecutive frames, the latter comparing frames with the same exposure. We also compute forward and backward flow terms for the current frame, naturally introducing temporal regularization. We present a particular formulation for sequences with two exposures, which can be extended to a larger number of exposures. We compare the proposed method with state-of-the-art variational models.
Download

Paper Nr: 245
Title:

Enhanced 3D Point Cloud Object Detection with Iterative Sampling and Clustering Algorithms

Authors:

Shane Ward and Hossein Malekmohamadi

Abstract: Existing state-of-the-art object detection networks for 3D point clouds provide bounding box results directly from 3D data, without reliance on 2D detection methods. While state-of-the-art accuracy and mAP (mean average precision) results are achieved by the GroupFree3D, MLCVNet and VoteNet methods on the SUN RGB-D and ScanNet V2 datasets, challenges remain in translating these methods across multiple datasets for a variety of applications. These challenges arise due to the irregularity, sparsity and noise present in point clouds, which hinder object detection networks from extracting accurate features and bounding box results. In this paper, we extend existing state-of-the-art 3D point cloud object detection methods to include the filtering of outlier data via iterative sampling and to accentuate feature learning via clustering algorithms. Specifically, the use of RANSAC allows for the removal of outlier points from the dataset scenes, and the integration of the DBSCAN, K-means, BIRCH and OPTICS clustering algorithms allows the detection networks to optimise the extraction of object features. We demonstrate a mean average precision improvement for some classes of the SUN RGB-D validation dataset through the use of iterative sampling against current state-of-the-art methods, while demonstrating a consistent object accuracy above 99.1%. The results of this paper demonstrate that combining iterative sampling with current state-of-the-art 3D point cloud object detection methods can improve accuracy and performance while reducing the computational cost.
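A simplified pre-processing sketch in the spirit described above, assuming a raw N×3 point array: RANSAC is used here to fit and remove a dominant plane (one common realization of outlier filtering; the paper's exact filtering may differ), and the remaining points are grouped with scikit-learn's DBSCAN. All parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor
from sklearn.cluster import DBSCAN

def filter_and_cluster(points, residual_tol=0.05, eps=0.3, min_samples=20):
    """points: (N, 3) array of XYZ coordinates.

    Fits the dominant plane with RANSAC, discards the points that lie on it
    (e.g. the floor), and clusters the remaining points with DBSCAN.
    Returns the kept points and one cluster label per kept point.
    """
    xy, z = points[:, :2], points[:, 2]
    ransac = RANSACRegressor(residual_threshold=residual_tol).fit(xy, z)
    plane_inlier = ransac.inlier_mask_          # points on the dominant plane
    objects = points[~plane_inlier]             # keep everything off the plane
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(objects)
    return objects, labels                       # label -1 marks DBSCAN noise
```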
Download

Paper Nr: 251
Title:

NEMA: 6-DoF Pose Estimation Dataset for Deep Learning

Authors:

Philippe Pérez de San Roman, Pascal Desbarats, Jean-Philippe Domenger and Axel Buendia

Abstract: Maintenance is inevitable, time-consuming, expensive, and risky for production and maintenance operators. Porting maintenance support applications to mixed reality (MR) headsets would ease operations. To function, such an application needs to anchor 3D graphics onto real objects, i.e., locate and track real-world objects in three dimensions. This task is known in the computer vision community as Six Degree of Freedom (6-DoF) Pose Estimation and is best solved using Convolutional Neural Networks (CNNs). Training them requires numerous examples, but acquiring real labeled images for 6-DoF pose estimation is a challenge on its own. In this article, we first propose a thorough review of existing non-synthetic datasets for 6-DoF pose estimation. This allows us to identify several reasons why synthetic training data has been favored over real training data, even though nothing can replace real images. We then show that it is possible to overcome the limitations faced by previous datasets by presenting a new methodology for labeled image acquisition. Finally, we present a new dataset named NEMA that allows deep learning methods to be trained without the need for synthetic data.
Download

Paper Nr: 252
Title:

Detection and Identification of Threat Potential of Ships using Satellite Images and AIS Data

Authors:

Akash Kumar, Aayush Sugandhi and Yamuna Prasad

Abstract: This paper addresses the issue of vessel tracking using Automatic Identification System (AIS) and imagery data. In general, we depend on AIS data for the accurate tracking of vessels, but there is often a gap between two consecutive AIS instances of a vessel. This is called the blind period or the inactivity period. During this period, we cannot be sure about the location of the ship. The duration of the inactivity period is quite variable due to various factors such as weather, satellite connectivity, and manual switch-off. This makes tracking and the identification of any threat difficult. In this paper, we propose a two-fold approach for tracking and identifying potential threats using deep learning models and AIS data. In the first fold, ships are identified from satellite imagery, while in the second fold, the corresponding AIS data is analysed to discover any potential threat or suspicious activity.
Download

Paper Nr: 256
Title:

Image Quality Assessment using Deep Features for Object Detection

Authors:

Poonam Beniwal, Pranav Mantini and Shishir K. Shah

Abstract: Applications such as video surveillance and self-driving cars produce large amounts of video data. Computer vision algorithms such as object detection have found a natural place in these scenarios. The reliability of these algorithms is usually benchmarked using curated datasets. However, one of the core challenges of working with computer vision data is variability. Compression is one such parameter that introduces artifacts (variability) in the data and can negatively affect performance. In this paper, we study the effect of compression on CNN-based object detectors and propose a new full-reference image quality metric based on Discrete Cosine Transform (DCT) to quantify the quality of an image for CNN-based object detectors. We compare this metric with commonly used image quality metrics, and the results show that the proposed metric correlates better with object detection performance. Furthermore, we train a regression model to estimate the quality of images for object detection.
Download

Paper Nr: 258
Title:

Spectral Absorption from Two-view Hyperspectral Images

Authors:

Kenta Kageyama, Ryo Kawahara and Takahiro Okabe

Abstract: When light passes through a liquid, its energy is attenuated due to absorption. The attenuation depends both on the spectral absorption coefficient of the liquid and on the optical path length of the light, and is described by the Lambert-Beer law. The spectral absorption coefficients of liquids are often unknown in real-world applications and have to be measured or estimated in advance, because they depend not only on the liquid media themselves but also on dissolved materials. In this paper, we propose a method for estimating the spectral absorption coefficient of a liquid only from two-view hyperspectral images of an under-liquid scene taken from outside the liquid in a passive and non-contact manner. Specifically, we show that the estimation reduces to Non-negative Matrix Factorization (NMF), because both the objective variables and the explanatory variables are nonnegative, and we then study the ambiguity in the matrix factorization. We conducted a number of experiments using real hyperspectral images and confirmed that our method works well and is useful for reconstructing the shape of an under-liquid scene.
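To make the factorization concrete, the following sketch simulates the Lambert-Beer relation I = I0·exp(−a(λ)·d), so that the matrix of −log attenuations factorizes into a nonnegative outer product of the absorption spectrum and the path lengths, which scikit-learn's NMF can recover up to a scale ambiguity. The sizes and values below are synthetic and purely illustrative, not the paper's data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Lambert-Beer: I = I0 * exp(-a(lambda) * d), so -log(I / I0) = a(lambda) * d.
# Stacking many pixels and wavelengths gives a nonnegative matrix
# M[lambda, pixel] = a(lambda) * d(pixel), i.e. a rank-1 nonnegative factorization.
rng = np.random.default_rng(0)
a_true = rng.uniform(0.1, 1.0, size=31)      # absorption coefficients (31 bands)
d_true = rng.uniform(0.5, 3.0, size=200)     # optical path lengths (200 pixels)
M = np.outer(a_true, d_true)                 # simulated -log attenuation matrix

model = NMF(n_components=1, init="nndsvda", max_iter=1000)
a_est = model.fit_transform(M)[:, 0]         # estimated spectrum (up to scale)
d_est = model.components_[0]                 # estimated path lengths (up to scale)

# The factorization is only defined up to a positive scale factor, which is the
# kind of ambiguity mentioned above: (a * s) and (d / s) give the same M.
print(np.abs(np.outer(a_est, d_est) - M).max())
```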
Download

Paper Nr: 282
Title:

LiDAR-camera Calibration in a Uniaxial 1-DoF Sensor System

Authors:

Tamás Tófalvi, Bandó Kovács, Levente Hajder and Tekla Tóth

Abstract: This paper introduces a novel camera-LiDAR calibration method using a simple planar chessboard pattern as the calibration object. We propose a special mounting for the sensors in which only one rotation angle needs to be estimated for the calibration. We prove that the calibration can be solved optimally in the least-squares sense even if the problem is overdetermined, i.e., when many chessboard patterns are visible to the sensors. The accuracy and precision of our unique solution are validated on both simulated and real-world data.
Download

Paper Nr: 6
Title:

A Real-time 3D Surround View Pipeline for Embedded Devices

Authors:

Onur Eker, Burak Ercan, Berkant Bayraktar and Murat Bal

Abstract: In recent years, 3D surround view systems have started to gain more attention as advanced driver assistance systems (ADAS) become more capable and intelligent. A 3D surround view system provides a 360-degree view of the environment surrounding the vehicle, enabling the driver or operator to virtually observe the surroundings in a convenient way. In this paper, we propose an end-to-end algorithm pipeline for 3D surround view systems and show that it works in real time on embedded devices. The proposed pipeline uses four cameras mounted around a vehicle for image acquisition. First, the images are rectified and mapped to a spherical surface to generate a 2D panorama view. In the mapping step, a low-cost color correction is applied to provide a uniform appearance across the panorama. Lastly, the generated panorama image is projected onto a bowl-shaped mesh model to provide a 360-degree view of the surrounding environment. The experimental results show that the proposed method works in real time on desktop computers as well as on embedded devices (such as the NVIDIA Xavier) and generates a less distorted, visually appealing 360-degree surround view of the vehicle.
Download

Paper Nr: 23
Title:

What Matters for Out-of-Distribution Detectors using Pre-trained CNN?

Authors:

Dong-Hee Kim, Jaeyoon Lee and Ki-Seok Chung

Abstract: In many real-world applications, a trained neural network classifier may receive inputs that do not belong to any of the classes of the dataset used for training. Such inputs are called out-of-distribution (OOD) inputs. Obviously, OOD samples may cause the classifier to perform unreliably and inaccurately. Therefore, it is important to be able to distinguish OOD inputs from in-distribution (ID) data. To improve the detection capability, quite a few methods using pre-trained convolutional neural networks (CNNs) with OOD samples have been proposed. Even though these methods show good performance in various applications, the OOD detection capability may vary depending on the implementation details and on how a set of detection methods is applied. Thus, it is very important to choose both a good set of solutions and the methodology for applying them in order to maximize effectiveness. In this paper, we carry out an extensive set of experiments to discuss various factors that may affect OOD detection performance. Four different OOD detectors are tested with various implementation settings to find configurations that achieve practically solid results.
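For background, here is a minimal sketch of one of the simplest detectors in this family, the maximum-softmax-probability (MSP) score computed from a pre-trained CNN; the detectors evaluated in the paper may differ, and the backbone and threshold below are placeholders.

```python
import torch
import torchvision.models as models

# Maximum-softmax-probability (MSP) OOD scoring with a pre-trained CNN.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

@torch.no_grad()
def msp_score(batch):
    """Higher score -> more likely in-distribution; batch: (N, 3, 224, 224)."""
    probs = torch.softmax(model(batch), dim=1)
    return probs.max(dim=1).values

def is_ood(batch, threshold=0.5):
    """Flag inputs whose MSP falls below a validation-chosen threshold."""
    return msp_score(batch) < threshold

# Usage with a random tensor standing in for preprocessed images.
print(is_ood(torch.rand(4, 3, 224, 224)))
```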
Download

Paper Nr: 24
Title:

Hybrid Method for Rapid Development of Efficient and Robust Models for In-row Crop Segmentation

Authors:

Paweł Majewski and Jacek Reiner

Abstract: Crop segmentation is a crucial part of computer vision methods for precision agriculture. Two types of crop segmentation approaches can be observed: those based on pixel-intensity thresholding of vegetation indices and classification-based approaches that include context (e.g., deep convolutional neural networks). Threshold-based methods work well when images do not contain disruptions (weeds, overlapping plants, different illumination). Although deep learning methods can cope with the mentioned problems, their development requires a large number of labelled samples. In this study, we propose a hybrid method for the rapid development of efficient and robust models for in-row crop segmentation, combining the advantages of both approaches. Our method consists of two-step labelling with the generation of synthetic crop images, followed by the training of a Mask R-CNN model. The proposed method has been tested comprehensively on samples characterised by different types of disruptions. Already the first labelling step, based mainly on cluster labelling, significantly increased the average F1-score in the crop detection task compared to binary thresholding of vegetation indices. The second labelling stage allowed this result to be increased further. As part of this research, an algorithm for row detection and row-based filtering was also proposed, which reduced the number of FP errors made during inference.
Download

Paper Nr: 31
Title:

Evaluation of a Local Descriptor for HDR Images

Authors:

Artur S. Nascimento, Welerson J. Melo, Beatriz T. Andrade and Daniel O. Dantas

Abstract: Feature point (FP) detection and description are processes that detect and extract characteristics from images. Several computer vision applications rely on the use of FPs. Most FP descriptors are designed to take low dynamic range (LDR) images as input. However, high dynamic range (HDR) images can show details in bright and shadowed areas that LDR images cannot. For that reason, interest in HDR imagery as input to the detection and description processes has been increasing. Previous studies have explored FP detectors on HDR images; however, none have presented FP descriptors designed for HDR images. This study compares the FP matching performance of description vectors generated from LDR and HDR images. The FPs were detected and described using a version of the SIFT algorithm adapted to support HDR images. The FP matching performance of the algorithm was evaluated with the mAP metric. In all cases, using HDR images increased the mAP values compared to LDR images.
Download

Paper Nr: 32
Title:

AAEGAN Loss Optimizations Supporting Data Augmentation on Cerebral Organoid Bright-field Images

Authors:

Clara Brémond Martin, Camille Simon Chane, Cédric Clouchoux and Aymeric Histace

Abstract: Cerebral Organoids (COs) are brain-like structures that are paving the way to promising alternatives to in vivo models for brain structure analysis. Available microscopic image databases of CO cultures contain only a few tens of images and are not widespread due to their recency. However, developing and comparing reliable analysis methods, be they semi-automatic or learning-based, requires larger datasets with a trusted ground truth. We extend a small database of bright-field CO images using an Adversarial Autoencoder (AAEGAN) after comparing various Generative Adversarial Network (GAN) architectures. We test several loss variations, assessed by metric calculations, to overcome the generation of blurry images and to increase the similarity between original and generated images. To observe how the optimization could enrich the variability of the input dataset, we perform a dimensionality reduction with t-distributed Stochastic Neighbor Embedding (t-SNE). To highlight the potential benefit of one of these optimizations, we implement a U-Net segmentation task with the newly generated images and compare it to classical data augmentation strategies. The perceptual Wasserstein loss proves to be an efficient baseline for future investigations of bright-field CO database augmentation in terms of quality and similarity. The segmentation performs best when the training step includes images from this generative process. According to the t-SNE representation, we have generated high-quality images that enrich the input dataset regardless of the loss optimization. We are convinced that each loss optimization could contribute different information to the generative process that is yet to be discovered.
Download

Paper Nr: 36
Title:

ERQA: Edge-restoration Quality Assessment for Video Super-Resolution

Authors:

Anastasia Kirillova, Eugene Lyapustin, Anastasia Antsiferova and Dmitry Vatolin

Abstract: Despite the growing popularity of video super-resolution (VSR), there is still no good way to assess the quality of the restored details in upscaled frames. Some VSR methods may produce the wrong digit or an entirely different face. Whether a method’s results are trustworthy depends on how well it restores truthful details. Image super-resolution can use natural distributions to produce a high-resolution image that is only somewhat similar to the real one. VSR, in contrast, can exploit additional information from neighboring frames to restore details from the original scene. The ERQA metric, which we propose in this paper, aims to estimate a model’s ability to restore real details using VSR. On the assumption that edges are significant for detail and character recognition, we chose edge fidelity as the foundation for this metric. Experimental validation of our work is based on the MSU Video Super-Resolution Benchmark, which includes the most difficult patterns for detail restoration and verifies the fidelity of details from the original frame. Code for the proposed metric is publicly available at https://github.com/msu-video-group/ERQA.
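As a toy illustration of the general idea of an edge-fidelity score (not the actual ERQA implementation, which is available at the repository above), one can compare Canny edge maps of the restored and reference frames with an F1-style overlap; the thresholds below are illustrative.

```python
import cv2
import numpy as np

def edge_fidelity(restored, reference, low=100, high=200):
    """Toy edge-based fidelity score between two grayscale uint8 frames.

    Returns an F1-style overlap of Canny edge maps in [0, 1]; higher means
    the edges of the restored frame better match the reference frame."""
    e_rest = cv2.Canny(restored, low, high) > 0
    e_ref = cv2.Canny(reference, low, high) > 0
    tp = np.logical_and(e_rest, e_ref).sum()
    precision = tp / max(e_rest.sum(), 1)
    recall = tp / max(e_ref.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Usage with random frames standing in for real restored / ground-truth images.
a = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
print(edge_fidelity(a, a))  # identical frames give a score of 1.0
```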
Download

Paper Nr: 43
Title:

Deep Learning-based Anomaly Detection on X-Ray Images of Fuel Cell Electrodes

Authors:

Simon B. Jensen, Thomas B. Moeslund and Søren J. Andreasen

Abstract: Anomaly detection in X-ray images has been an active and lasting research area over the last decades, especially in the domain of medical X-ray images. For this work, we created a real-world labeled anomaly dataset consisting of 16-bit X-ray image data of fuel cell electrodes coated with a platinum catalyst solution and perform anomaly detection on the dataset using a deep learning approach. The dataset contains a diverse set of anomalies, with 11 identified common anomaly types such as scratches, bubbles, and smudges. We experiment with 16-bit to 8-bit image conversion methods in order to utilize pre-trained Convolutional Neural Networks as feature extractors (transfer learning) and find that we achieve the best performance by maximizing the contrast globally across the dataset during the 16-bit to 8-bit conversion through histogram equalization. We group the fuel cell electrodes with anomalies into a single class called abnormal and the normal fuel cell electrodes into a class called normal, thereby casting the anomaly detection problem as a binary classification problem. We achieve a balanced accuracy of 85.18%. The anomaly detection is used by the company Serenergy for optimizing the time spent on the quality control of the fuel cell electrodes.
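A minimal sketch of a dataset-wide 16-bit to 8-bit conversion via global histogram equalization, assuming uint16 inputs; the exact conversion procedure used in the paper may differ in detail.

```python
import numpy as np

def global_equalization_lut(images_16bit, n_out=256):
    """Build a dataset-wide 16-bit -> 8-bit lookup table via histogram equalization.

    images_16bit: iterable of uint16 arrays. The LUT is computed from the
    cumulative histogram pooled over the whole dataset, so the same mapping
    is applied to every image (contrast maximized globally, not per image)."""
    hist = np.zeros(65536, dtype=np.float64)
    for img in images_16bit:
        hist += np.bincount(img.ravel(), minlength=65536)
    cdf = np.cumsum(hist)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1)
    return np.round(cdf * (n_out - 1)).astype(np.uint8)

def to_8bit(image_16bit, lut):
    return lut[image_16bit]

# Usage with random data standing in for real 16-bit X-ray images.
imgs = [np.random.randint(0, 65536, (64, 64), dtype=np.uint16) for _ in range(4)]
lut = global_equalization_lut(imgs)
print(to_8bit(imgs[0], lut).dtype)  # uint8
```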
Download

Paper Nr: 50
Title:

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images

Authors:

Anna Konrad, Ciarán Eising, Ganesh Sistu, John McDonald, Rudi Villing and Senthil Yogamani

Abstract: Keypoint detection and description is a commonly used building block in computer vision systems, particularly for robotics and autonomous driving. However, the majority of techniques to date have focused on standard cameras, with little consideration given to fisheye cameras, which are commonly used in urban driving and automated parking. In this paper, we propose a novel training and evaluation pipeline for fisheye images. We make use of SuperPoint as our baseline, a self-supervised keypoint detector and descriptor that has achieved state-of-the-art results on homography estimation. We introduce a fisheye adaptation pipeline to enable training on undistorted fisheye images. We evaluate the performance on the HPatches benchmark and, by introducing a fisheye-based evaluation method for detection repeatability and descriptor matching correctness, on the Oxford RobotCar dataset.
Download

Paper Nr: 65
Title:

Colour Augmentation for Improved Semi-supervised Semantic Segmentation

Authors:

Geoff French and Michal Mackiewicz

Abstract: Consistency regularization describes a class of approaches that have yielded state-of-the-art results for semi-supervised classification. While semi-supervised semantic segmentation has proved to be more challenging, recent work has explored the challenges involved in using consistency regularization for segmentation problems and has presented solutions. In their self-supervised work, Chen et al. found that colour augmentation prevents a classification network from using image colour statistics as a short-cut for self-supervised learning via instance discrimination. Drawing inspiration from this, we find that a similar problem impedes semi-supervised semantic segmentation and offer colour augmentation as a solution, improving semi-supervised semantic segmentation performance on challenging photographic imagery. Implementation at: https://github.com/Britefury/cutmix-semisup-seg
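A common way to apply this kind of colour augmentation in PyTorch pipelines is sketched below; the repository linked above contains the authors' actual implementation, and the jitter strengths here are illustrative.

```python
import torch
from torchvision import transforms

# Colour augmentation commonly paired with consistency regularization:
# jitter brightness/contrast/saturation/hue and occasionally drop colour.
colour_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),
    transforms.RandomGrayscale(p=0.2),
])

# Usage: apply geometric transforms to image and mask jointly, but colour
# augmentation to the image only, so the segmentation labels stay valid.
image = torch.rand(3, 256, 256)          # stand-in for a real photo tensor
augmented = colour_aug(image)
print(augmented.shape)
```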
Download

Paper Nr: 68
Title:

Physics based Motion Estimation to Improve Video Compression

Authors:

James McCullough, Naseer Al-Jawad and Tuan Nguyen

Abstract: Optical flow is a fundamental component of video compression, as it can be used to effectively compress sequential frames. However, optical flow is currently only a transformation of one frame into another. This paper considers the possibility of representing optical flow based on physics principles, which has not, to our knowledge, been researched before. Video often consists of real-world events captured by a camera, meaning that objects within videos follow Newtonian physics, so the video can be compressed by converting the motion of objects into physics-based motion paths. The proposed algorithm converts an object’s location over a series of frames into a sequence of physics motion paths. The space cost of saving these motion paths could be considerably smaller than that of traditional optical flow, improving video compression in exchange for increased encoding/decoding times. Based on our experimental implementation, motion paths can be used to compress the motion of objects on basic trajectories. By comparing the file sizes of original and processed image sequences, effective compression of basic object movements can be identified.
Download

Paper Nr: 76
Title:

Visual Analysis of Deep Learning Methods for Industrial Vacuum Metalized Film Product

Authors:

Thiago R. Bastos, Luiz Stragevitch and Cleber Zanchettin

Abstract: Extracting information to support decisions in a complex environment such as an industrial one is not an easy task. Information technologies and cyber-physical systems have provided technical possibilities to extract, store, and process large amounts of data. In parallel, recent advances in artificial intelligence permit the prediction and evaluation of features and information. Industry 4.0 can benefit from these approaches, allowing process visualization, feature prediction, and model interpretation. We evaluate the use of Machine Learning (ML) to support the monitoring and quality prediction of an industrial vacuum metalization process. We propose a semantic segmentation approach to fault identification using images composed of optical density (OD) values from the vacuum metalized film process. In addition, a deep neural network model is applied to product classification using the segmented OD profile. The semantic segmentation allowed the analysis of film regions and the association of coating quality with their class and format. The proposed classifier achieved an accuracy of 86.67%. The use of visualization and ML approaches permits systematic real-time process monitoring, which reduces time and material waste. Consequently, it is a promising approach for monitoring and maintenance support in Industry 4.0.
Download

Paper Nr: 94
Title:

CLOSED: A Dashboard for 3D Point Cloud Segmentation Analysis using Deep Learning

Authors:

Thanasis Zoumpekas, Guillem Molina, Anna Puig and Maria Salamó

Abstract: With the growing interest in 3D point cloud data, which is a set of data points in space used to describe a 3D object, and the inherent need to analyze it using deep neural networks, the visualization of data processes is critical for extracting meaningful insights. There is a gap in the literature for a full-suite visualization tool to analyse 3D deep learning segmentation models on point cloud data. This paper proposes such a tool to cover this gap, entitled point CLOud SEgmentation Dashboard (CLOSED). Specifically, we concentrate our efforts on 3D point cloud part segmentation, where the entire shape and the parts of a 3D object are significant. Our approach manages to (i) exhibit the learning evolution of neural networks, (ii) compare and evaluate different neural networks, (iii) highlight key-points of the segmentation process. We illustrate our proposal by analysing five neural networks utilizing the ShapeNet-part dataset.
Download

Paper Nr: 131
Title:

Mitigating the Zero Biased Steering Angles in Self-driving Simulator Datasets

Authors:

Muhammad A. Khan, Khawaja G. Alamdar, Aiman Junaid and Muhammad Farhan

Abstract: Autonomous or self-driving systems require rigorous training before making it to the roads. Deep learning is at the forefront of the training, testing, and validation of such systems. Self-driving simulators play a vital role in this process, not only due to the data-intensiveness of deep learning algorithms but also due to the several parameters involved in the system. The data generated from self-driving car simulators have an inherent problem of a large zero bias due to the discrete nature of computation arising from computer input devices. In this paper, we analyze this problem and propose a filtering step that makes the steering angles in the dataset smoother and removes random fluctuations, which helps our model learn better. After such processing, test runs on the simulator showed promising results using a significantly small dataset and a relatively shallow network.
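One simple way to smooth recorded steering angles is a Savitzky-Golay filter, sketched below; the paper's exact filtering may differ, and the window length and polynomial order are illustrative.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_steering(angles, window=15, polyorder=3):
    """Smooth a recorded steering-angle sequence to suppress the step-like,
    zero-biased values produced by discrete keyboard/controller input."""
    angles = np.asarray(angles, dtype=float)
    return savgol_filter(angles, window_length=window, polyorder=polyorder)

# Usage: a toy sequence with abrupt jumps back to zero, as keyboard input produces.
raw = np.array([0, 0, 0.3, 0.3, 0, 0, -0.2, 0, 0, 0.1, 0, 0, 0, 0.4, 0.4, 0])
print(np.round(smooth_steering(raw, window=7, polyorder=2), 3))
```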
Download

Paper Nr: 193
Title:

Smartphone based Finger-Photo Verification using Siamese Network

Authors:

Jag M. Singh, Ahmad S. Madhun, Ahmed M. Kedir and Raghavendra Ramachandra

Abstract: With the advent of deep learning, finger-photo verification, a.k.a. finger-selfies, is an emerging research area in biometrics. In this paper, we propose a Siamese Neural Network (SNN) architecture for finger-photo verification. Our approach consists of a Mask R-CNN network used for finger-photo segmentation from an input video frame and the proposed Siamese Neural Network for finger-photo verification. Extensive experiments are carried out on a public dataset consisting of 400,000 images extracted from 2,000 videos in five different sessions. The dataset has 200 unique fingers, where each finger is captured in 5 sessions with 2 sample videos each containing 200 frames. We define protocols for testing in the same session and in different sessions, with and without the same subjects, replicating real-world scenarios. Our proposed method achieves an EER in the range of 8.9% to 34.7%. Our proposed method does not use commercial off-the-shelf (COTS) systems and relies only on a deep neural network.
Download

Paper Nr: 195
Title:

DLDFD: Recurrence Free 2D Convolution Approach for Deep Fake Detection

Authors:

Jag M. Singh and Raghavendra Ramachandra

Abstract: Deep Fake images, which are digitally generated either through computer graphics or deep learning techniques, pose an increasing risk to existing face recognition systems. This paper presents a Deep-Learning-based Deep Fake Detection (DLDFD) architecture consisting of augmented convolutional layers followed by a ResNet-50 architecture. We train DLDFD end-to-end with low-resolution images from the FaceForensics++ dataset. The number of images used during the different phases includes approximately 1.68 million during training, 315k during validation, and 340k during testing. We train DLDFD in three different scenarios: combined image manipulation, where we achieve an accuracy of 96.07% compared to 85.14% for the state of the art (SOTA); single image manipulation techniques, where we get 100% accuracy for neural textures; and cross-image manipulation techniques, where we achieve an accuracy of 94.28% on the unseen face swap category, much higher than SOTA. Our approach requires only 2D convolutions without recurrence, in contrast to SOTA.
Download

Paper Nr: 197
Title:

An Initial Study in Wood Tomographic Image Classification using the SVM and CNN Techniques

Authors:

Antonio A. Pereira Junior and Marco A. Garcia de Carvalho

Abstract: The internal analysis of wood logs is an essential task in the field of forest assessment. To assist in the identification of anomalies within wood logs, methods from the Non-Destructive Testing area, such as acoustic methods, can be used. Ultrasound tomography is an acoustic method that allows the internal condition of wood logs to be evaluated through the analysis of wave propagation, without causing damage to the specimen. The images generated by ultrasound tomography can be improved by spatial interpolation, i.e., by estimating the wave propagation values not measured in the initial examination. In this paper, we present an initial study of classification techniques for identifying tomographic images with anomalies. In our approach, we consider three different classifiers: k-Nearest-Neighbor (k-NN), Support Vector Machine (SVM), and Convolutional Neural Network (CNN). Experiments were conducted comparing them by means of metrics obtained from the confusion matrix. We built a dataset with 5000 images using a data augmentation process. The quantitative metrics demonstrate the effectiveness of the CNN compared with the k-NN and SVM classifiers.
Download

Paper Nr: 255
Title:

Classification and Analysis of Liverwort Sperm by Integration-Net

Authors:

Haruki Fujii, Naoki Minamino, Takashi Ueda, Yohei Kondo and Kazuhiro Hotta

Abstract: In this paper, we propose a method to classify videos of wild-type and mutant sperm of liverwort using deep learning and to discover the differences between them. In traditional video classification, 3D convolution is often used. However, when a 3D CNN is used, the information of multiple frames is mixed, and it therefore becomes difficult to detect important frames and locations in a video. To solve this problem, we propose a network that retains video frame information using depthwise convolution and skip connections, and we use gradient-based visualization to analyze the differences between wild-type and mutant sperm. In experiments, we compare the proposed method with a conventional 3D CNN and show the effectiveness of the proposed method.
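A generic PyTorch block combining depthwise convolution with a skip connection, of the kind mentioned above, is sketched below; the channel counts and layout are illustrative and not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DepthwiseSkipBlock(nn.Module):
    """Per-frame depthwise convolution plus a residual (skip) connection,
    so each frame's spatial information is processed without mixing channels
    and the input is preserved through the skip path."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                 # x: (batch*frames, channels, H, W)
        out = self.act(self.depthwise(x))
        out = self.pointwise(out)
        return self.act(out + x)          # skip connection

# Usage: treat a clip of 8 frames as a batch of independent images.
clip = torch.rand(8, 16, 64, 64)
print(DepthwiseSkipBlock(16)(clip).shape)  # torch.Size([8, 16, 64, 64])
```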
Download

Paper Nr: 268
Title:

Melanoma Recognition

Authors:

Michal Haindl and Pavel Žid

Abstract: Early and reliable melanoma detection is one of today's significant challenges for dermatologists and a prerequisite for successful cancer treatment. This paper introduces multispectral, rotationally invariant textural features of the Markovian type applied to the effective classification of cancerous skin lesions. The presented texture features are inferred from a descriptive multispectral circular wide-sense Markov model. Unlike alternative texture-based recognition methods, which mainly use discriminative textural descriptions, our textural representation is fully descriptive, multispectral, and rotationally invariant. The presented method achieves high accuracy for skin lesion categorization. We tested our classifier on the open-source dermoscopic ISIC database, containing 23,901 images of benign or malignant lesions, where the classifier outperformed several deep neural network alternatives while using less training data.
Download

Area 2 - Mobile and Egocentric Vision for Humans and Robots

Full Papers
Paper Nr: 69
Title:

3D Map Generation with Shape and Appearance Information

Authors:

Taro Yamada and Shuichi Enokida

Abstract: It is clear from the numerous reports of recent years that interest in the research and development of autonomous mobile robots is growing, and that a key requirement for the successful development of such self-directed machines is an effective estimation of the navigable domain. Furthermore, in view of the differing characteristics of their physical performance capabilities relative to specific applications, specific estimations must be made for each robot. The effective assessment of a domain that permits successful robot navigation of a densely occupied indoor space requires the generation of a fine-grained three-dimensional (3D) map to facilitate safe movements. This, in turn, requires the provision of appearance information as well as the ascertainment of the space's shape. To address these issues, we herein propose a practical Semantic Simultaneous Localization and Mapping (Semantic SLAM) method capable of yielding labeled 3D maps. This method generates maps by applying class labels from image semantic segmentation to the 3D point groups obtained with Real-Time Appearance-Based Mapping (RTAB-Map).
Download

Paper Nr: 150
Title:

Hardware-oriented Algorithm for Human Detection using GMM-MRCoHOG Features

Authors:

Ryogo Takemoto, Yuya Nagamine, Kazuki Yoshihiro, Masatoshi Shibata, Hideo Yamada, Yuichiro Tanaka, Shuichi Enokida and Hakaru Tamukoh

Abstract: In this research, we focus on Gaussian mixture model-multiresolution co-occurrence histograms of oriented gradients (GMM-MRCoHOG) features using luminance gradients in images and propose a hardware-oriented algorithm of GMM-MRCoHOG to implement it on a field programmable gate array (FPGA). The proposed method simplifies the calculation of luminance gradients, which is a high-cost operation in the conventional algorithm, by using lookup tables to reduce the circuit size. We also designed a human-detection digital architecture of the proposed algorithm for FPGA implementation using high-level synthesis. The verification results showed that the processing speed of the proposed architecture was approximately 123 times faster than that of the FPGA implementation of VGG-16.
Download

Paper Nr: 204
Title:

Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics

Authors:

Arnav Varma, Hemang Chawla, Bahram Zonooz and Elahe Arani

Abstract: The advent of autonomous driving and advanced driver assistance systems necessitates continuous developments in computer vision for 3D scene understanding. Self-supervised monocular depth estimation, a method for pixel-wise distance estimation of objects from a single camera without the use of ground truth labels, is an important task in 3D scene understanding. However, existing methods for this task are limited to convolutional neural network (CNN) architectures. In contrast with CNNs that use localized linear operations and lose feature resolution across the layers, vision transformers process at constant resolution with a global receptive field at every stage. While recent works have compared transformers against their CNN counterparts for tasks such as image classification, no study exists that investigates the impact of using transformers for self-supervised monocular depth estimation. Here, we first demonstrate how to adapt vision transformers for self-supervised monocular depth estimation. Thereafter, we compare the transformer and CNN-based architectures for their performance on KITTI depth prediction benchmarks, as well as their robustness to natural corruptions and adversarial attacks, including when the camera intrinsics are unknown. Our study demonstrates how transformer-based architecture, though lower in run-time efficiency, achieves comparable performance while being more robust and generalizable.
Download

Paper Nr: 224
Title:

3D Hand and Object Pose Estimation for Real-time Human-robot Interaction

Authors:

Chaitanya Bandi, Hannes Kisner and Ulrike Thomas

Abstract: Estimating 3D hand pose and object pose in real time is essential for human-robot interaction scenarios such as the handover of objects. Particularly in handover scenarios, many challenges need to be faced, such as mutual hand-object occlusions and the inference speed needed to enhance the reactiveness of robots. In this paper, we present an approach to estimate 3D hand pose and object pose in real time using a low-cost consumer RGB-D camera for human-robot interaction scenarios. We propose a cascade-of-networks strategy to regress 2D and 3D pose features. The first network detects the objects and hands in images. The second network is an end-to-end model with independent weights that regresses 2D keypoints of hand joints and object corners, followed by a 3D wrist-centric hand and object pose regression using a novel residual graph regression network and, finally, a perspective-n-point approach to solve the 6D pose of detected objects in the hand. To train and evaluate our model, we also propose a small-scale 3D hand pose dataset with a new semi-automated annotation approach using a robot arm and demonstrate the generalizability of our model on state-of-the-art benchmarks.
Download

Paper Nr: 261
Title:

SparseDet: Towards End-to-End 3D Object Detection

Authors:

Jianhong Han, Zhaoyi Wan, Zhe Liu, Jie Feng and Bingfeng Zhou

Abstract: In this paper, we propose SparseDet for end-to-end 3D object detection from point clouds. Existing works on 3D object detection rely on dense object candidates over all locations in a 3D or 2D grid, following the mainstream methods for object detection in 2D images. However, this dense paradigm requires expertise in data to fill the gap between labels and detections. As a new detection paradigm, SparseDet maintains a fixed set of learnable proposals to represent latent candidates and directly performs classification and localization for 3D objects through stacked transformers. It demonstrates that effective 3D object detection can be achieved without post-processing such as redundancy removal and non-maximum suppression. With a properly designed network, SparseDet achieves highly competitive detection accuracy while running at a more efficient speed of 34.5 FPS. We believe this end-to-end paradigm of SparseDet will inspire new thinking on the sparsity of 3D object detection.
Download

Short Papers
Paper Nr: 21
Title:

NeuralQAAD: An Efficient Differentiable Framework for Compressing High Resolution Consistent Point Clouds Datasets

Authors:

Nicolas Wagner and Ulrich Schwanecke

Abstract: In this paper, we propose NeuralQAAD, a differentiable point cloud compression framework that is fast, robust to sampling, and applicable to consistent shapes with high detail resolution. Previous work that is able to handle complex and non-smooth topologies hardly scales to more than just a few thousand points. We tackle the task with a novel neural network architecture characterized by weight sharing and autodecoding. Our architecture uses parameters far more efficiently than previous work, allowing it to be deeper and more scalable. We also show that the currently only tractable training criterion for point cloud compression, the Chamfer distance, performs poorly for high resolutions. To overcome this issue, we pair our architecture with a new training procedure based on a quadratic assignment problem. This procedure acts as a surrogate loss and allows us to implicitly minimize the more expressive Earth Mover's Distance (EMD) even for point clouds with far more than 10^6 points. As directly evaluating the EMD on high-resolution point clouds is intractable, we propose a new divide-and-conquer approach based on k-d trees, which we call EM-kD. The EM-kD is shown to be a scalable and fast but still reliable upper bound for the EMD. NeuralQAAD demonstrates on three datasets (COMA, D-FAUST and Skulls) that it significantly outperforms the current state-of-the-art both visually and qualitatively in terms of EM-kD.
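A sketch of the general divide-and-conquer idea behind such a k-d-style EMD upper bound, assuming two equal-size clouds; the authors' exact EM-kD construction may differ. Each median split produces a feasible (possibly suboptimal) matching, so the summed leaf costs upper-bound the true EMD.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd_kd_upper_bound(a, b, leaf_size=256, depth=0):
    """Divide-and-conquer upper bound on the (equal-size) Earth Mover's Distance.

    a, b: (N, 3) arrays with the same number of points. Both clouds are split
    at the median along a cycling axis, k-d-tree style, until the parts are
    small enough for an exact linear assignment; the summed per-leaf matching
    cost upper-bounds the global EMD, since it is one feasible matching."""
    n = len(a)
    if n <= leaf_size:
        cost = cdist(a, b)
        row, col = linear_sum_assignment(cost)
        return cost[row, col].sum()
    axis = depth % a.shape[1]
    a = a[np.argsort(a[:, axis])]
    b = b[np.argsort(b[:, axis])]
    mid = n // 2
    return (emd_kd_upper_bound(a[:mid], b[:mid], leaf_size, depth + 1) +
            emd_kd_upper_bound(a[mid:], b[mid:], leaf_size, depth + 1))

# Usage with two random clouds of equal size.
x, y = np.random.rand(1024, 3), np.random.rand(1024, 3)
print(emd_kd_upper_bound(x, y))
```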
Download

Paper Nr: 159
Title:

Multi-stage RGB-based Transfer Learning Pipeline for Hand Activity Recognition

Authors:

Yasser Boutaleb, Catherine Soladie, Nam-Duong Duong, Jérôme Royan and Renaud Seguier

Abstract: First-person hand activity recognition is a challenging task, especially when not enough data are available. In this paper, we tackle this challenge by proposing a new low-cost multi-stage learning pipeline for first-person RGB-based hand activity recognition on a limited amount of data. For a given RGB image activity sequence, in the first stage, the regions of interest are extracted using a pre-trained neural network (NN). Then, in the second stage, high-level spatial features are extracted using a pre-trained deep NN. In the third stage, the temporal dependencies are learned. Finally, in the last stage, a hand activity sequence classifier is learned using a post-fusion strategy applied to the previously learned temporal dependencies. The experiments, evaluated on two real-world datasets, show that our pipeline achieves the state of the art. Moreover, they show that the proposed pipeline achieves good results on limited data.
Download

Paper Nr: 4
Title:

Category-level Part-based 3D Object Non-rigid Registration

Authors:

Diego Rodriguez, Florian Huber and Sven Behnke

Abstract: In this paper, we propose a novel approach for registering objects in a non-rigid manner based on the decomposed parts of an object category. By performing part-based registration, the deformed points better match the local geometric structures of the observed instance. Moreover, the knowledge acquired for an object part can be transferred to different object categories that share the same decomposed part. This is possible because the registration is based on a learned latent space that encodes the typical geometrical variations of each part independently. We evaluate our approach extensively on different object categories and demonstrate its robustness against outliers, noise, and misalignments of the object pose.
Download

Paper Nr: 14
Title:

Efficient Semantic Mapping in Dynamic Environments

Authors:

Christian Hofmann, Mathias Fichtner, Markus Lieret and Jörg Franke

Abstract: Unmanned Aerial Vehicles (UAVs) are required to fulfill more and more complex tasks in indoor environments like inspection, stock-taking or transportation of goods. For these tasks, they need to perceive objects and obstacles in the environment, navigate safely in it and sometimes even interact with it. Semantic maps are a step towards generating a comprehensive environmental overview for robots. Nevertheless, UAVs have several constraints concerning size, weight and power consumption and thus computational resources. In this paper, an efficient object-oriented semantic mapping approach suitable for UAVs and similarly constrained robots in dynamic environments is proposed. The approach can be completely executed on a computer suited as the onboard computer of a UAV. A map comprising semantic information and dynamic objects is generated and updated at an update rate of more than 10 Hz.
Download

Paper Nr: 62
Title:

Tracking 3D Deformable Objects in Real Time

Authors:

Tiago Silva, Luís Magalhães, Manuel Ferreira, Salik R. Khanal and Jorge Silva

Abstract: 3D object tracking is a topic that has been widely studied for several years. Although there are already several robust solutions for tracking rigid objects, when it comes to deformable objects the problem increases in complexity. In recent years, there has been an increase in the use of Machine/Deep Learning techniques to solve problems in computer vision, including 3D object tracking. On the other hand, several low-cost devices (like the Kinect) have appeared that allow obtaining RGB-D images, which, in addition to colour information, contain depth information. In this paper, we propose a 3D tracking approach for deformable objects that uses Machine/Deep Learning techniques and takes RGB-D images as input. Furthermore, our approach implements a tracking algorithm, increasing the object segmentation performance towards real time. Our tests were performed on a dataset acquired by ourselves and obtained satisfactory results for the segmentation of the deformable object.
Download

Paper Nr: 111
Title:

Human Detection and Gesture Recognition for the Navigation of Unmanned Aircraft

Authors:

Markus Lieret, Maximilian Hübner, Christian Hofmann and Jörg Franke

Abstract: Unmanned aircraft (UA) have become increasingly popular for different industrial indoor applications in recent years. Typical applications include automated stocktaking in high bay warehouses, the automated transport of materials or inspection tasks. Due to limited space in indoor environments and the ongoing production, the UA oftentimes need to operate at closer distances to humans than in outdoor applications. To reduce the risk of danger to persons present in the working area of the UA, it is necessary to enable the UA to perceive and locate persons and to react appropriately to their behaviour. Within this paper, we present an approach to influence the flight mission of autonomous UA using different gestures. Thereby, the UA detects persons within its flight path using an on-board camera and pauses its current flight mission. Subsequently, the body posture of the detected persons is determined so that the persons can provide further flight instructions to the UA via defined gestures. The proposed approach is evaluated by means of simulation and real-world flight tests and shows a gesture recognition accuracy between 82 and 100 percent, depending on the distance between the persons and the UA.
Download

Paper Nr: 274
Title:

3D Object Recognition using Time of Flight Camera with Embedded GPU on Mobile Robots

Authors:

Benjamin Kelényi, Szilárd Molnár and Levente Tamás

Abstract: The main goal of this work is to analyze the most suitable methods for segmenting and classifying 3D point clouds using embedded GPUs for mobile robots. We review the current main approaches, including point-based, voxel-based and point-voxel-based methods. We evaluated the selected algorithms on different publicly available datasets. Simultaneously, we created a novel architecture based on the point-voxel CNN architecture that combines depth imaging with IR. This architecture was designed particularly for pulse-based Time-of-Flight (ToF) cameras, with embedded devices as the primary target. We tested the proposed algorithm on custom indoor/outdoor and public datasets, using different camera vendors.
Download

Area 3 - Image and Video Understanding

Full Papers
Paper Nr: 7
Title:

Self-supervised Learning from Semantically Imprecise Data

Authors:

Clemens-Alexander Brust, Björn Barz and Joachim Denzler

Abstract: Learning from imprecise labels such as “animal” or “bird”, but making precise predictions like “snow bunting” at inference time is an important capability for any classifier when expertly labeled training data is scarce. Contributions by volunteers or results of web crawling lack precision in this manner, but are still valuable. And crucially, these weakly labeled examples are available in larger quantities for lower cost than high-quality bespoke training data. CHILLAX, a recently proposed method to tackle this task, leverages a hierarchical classifier to learn from imprecise labels. However, it has two major limitations. First, it does not learn from examples labeled as the root of the hierarchy, e.g., “object”. Second, an extrapolation of annotations to precise labels is only performed at test time, where confident extrapolations could already be used as training data. In this work, we extend CHILLAX with a self-supervised scheme using constrained semantic extrapolation to generate pseudo-labels. This addresses the second concern, which in turn solves the first problem, enabling an even weaker supervision requirement than CHILLAX. We evaluate our approach empirically, showing that our method allows for a consistent accuracy improvement of 0.84 to 1.19 percentage points over CHILLAX and is suitable as a drop-in replacement without any negative consequences such as longer training times.
Download

Paper Nr: 13
Title:

Animal Fiber Identification under the Open Set Condition

Authors:

Oliver Rippel, Sergen Gülçelik, Khosrow Rahimi, Juliana Kurniadi, Andreas Herrmann and Dorit Merhof

Abstract: Animal fiber identification is an essential aspect of fabric production, since specialty fibers such as cashmere are often targeted by adulteration attempts. Furthermore, previously proposed automated solutions cannot be applied in practice (i.e. under the open set condition), as they are trained only on a small subset of all existing fiber types and simultaneously lack the ability to reject, at test time, fiber types unseen during training. In our work, we overcome this limitation by applying out-of-distribution (OOD)-detection techniques to the natural fiber identification task. Specifically, we propose to jointly model the probability density function of in-distribution data across feature levels of the trained classification network by means of Gaussian mixture models. Moreover, we extend the open set F-measure to the so-called area under the open set precision-recall curve (AUPRos), a threshold-independent measure of joint in-distribution classification & OOD-detection performance for OOD-detection methods with continuous OOD scores. Exhaustive comparison to the state of the art reveals that our proposed approach performs best overall, achieving the highest area under the class-averaged, open set precision-recall curve (AUPRos,avg). We thus show that the application of automated fiber identification solutions under the open set condition is feasible via OOD detection.
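
A minimal sketch of the density-modelling idea described above, assuming per-level feature vectors have already been extracted from the trained classifier; the use of scikit-learn's GaussianMixture and the summation of per-level log-likelihoods are assumptions made here for illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_level_gmms(features_per_level, n_components=5):
        """features_per_level: list of (N, D_l) arrays, one per feature level
        of the trained classifier, computed on in-distribution training data."""
        return [GaussianMixture(n_components=n_components, covariance_type="full")
                .fit(f) for f in features_per_level]

    def ood_score(gmms, sample_features_per_level):
        """Negative joint log-likelihood across levels: higher means more OOD.
        Summing per-level scores is an assumption, not the paper's exact rule."""
        return -sum(g.score_samples(f[None, :])[0]
                    for g, f in zip(gmms, sample_features_per_level))
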
Download

Paper Nr: 15
Title:

Who Did It? Identifying Foul Subjects and Objects in Broadcast Soccer Videos

Authors:

Chunbo Song and Christopher Rasmussen

Abstract: We present a deep learning approach to sports video understanding as part of the development of an automated refereeing system for broadcast soccer games. The task of identifying which players are involved in a foul at a given moment is one of spatiotemporal action recognition in a cluttered visual environment. We describe how to employ multi-object tracking to generate a base set of candidate image sequences which are post-processed to mitigate common mistracking scenarios and then classified according to several two-person interaction types. For this work we created a large soccer foul dataset with a significant video component for training relevant networks. Our system can differentiate foul participants from bystanders with high accuracy and localize them over a wide range of game situations. We also report reasonable accuracy for distinguishing the player who committed the foul, or subject, from the object of the infraction, despite very low-resolution images.
Download

Paper Nr: 17
Title:

CGT: Consistency Guided Training in Semi-Supervised Learning

Authors:

Nesreen Hasan, Farzin Ghorban, Jörg Velten and Anton Kummert

Abstract: We propose a framework, CGT, for semi-supervised learning (SSL) that involves a unification of multiple image-based augmentation techniques. More specifically, we utilize Mixup and CutMix in addition to introducing one-sided stochastically augmented versions of those operators. Moreover, we introduce a generalization of the Mixup operator that regularizes a larger region of the input space. The objective of CGT is expressed as a linear combination of multiple constituents, each corresponding to the contribution of a different augmentation technique. CGT achieves state-of-the-art performance on the SVHN, CIFAR-10, and CIFAR-100 benchmark datasets and demonstrates that it is beneficial to heavily augment unlabeled training data.
Download

Paper Nr: 18
Title:

Unidentified Floating Object Detection in Maritime Environment

Authors:

Darshan Venkatrayappa, Agnès Desolneux, Jean-Michel Hubert and Josselin Manceau

Abstract: In this article, we present a new unsupervised approach to detect unidentified floating objects in the maritime environment. The proposed approach is capable of detecting floating objects online without any prior knowledge of their visual appearance, shape or location. Given an image from a video stream, we extract the self-similar and dissimilar components of the image using a visual dictionary. The dissimilar component consists of noise and structures (objects). The structures (objects) are then extracted using an a contrario model. We demonstrate the capabilities of our algorithm by testing it on videos exhibiting varying maritime scenarios.
Download

Paper Nr: 22
Title:

Milking CowMask for Semi-supervised Image Classification

Authors:

Geoff French, Avital Oliver and Tim Salimans

Abstract: Consistency regularization is a technique for semi-supervised learning that underlies a number of strong results for classification with few labeled data. It works by encouraging a learned model to be robust to perturbations on unlabeled data. Here, we present a novel mask-based augmentation method called CowMask. Using it to provide perturbations for semi-supervised consistency regularization, we achieve a competitive result on ImageNet with 10% labeled data, with a top-5 error of 8.76% and top-1 error of 26.06%. Moreover, we do so with a method that is much simpler than many alternatives. We further investigate the behavior of CowMask for semi-supervised learning by running many smaller scale experiments on the SVHN, CIFAR-10 and CIFAR-100 data sets, where we achieve results competitive with the state of the art, indicating that CowMask is widely applicable. We open source our code at https://github.com/google-research/google-research/tree/master/milking_cowmask.
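
For illustration, a minimal sketch of a cow-pattern mask generator in the spirit described above (smoothed noise thresholded to a target proportion); the exact parameterisation used in the paper may differ.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def cow_mask(shape, sigma=16.0, p=0.5, rng=np.random.default_rng()):
        """Cow-pattern binary mask: smooth Gaussian noise with a Gaussian filter,
        then threshold so that roughly a fraction p of the pixels are 1."""
        noise = gaussian_filter(rng.standard_normal(shape), sigma)
        thresh = np.quantile(noise, 1.0 - p)   # keep the top-p fraction
        return (noise > thresh).astype(np.float32)

    # Assumed usage: mix two unlabeled images for consistency regularization
    # m = cow_mask((H, W))
    # mixed = m[..., None] * img_a + (1 - m[..., None]) * img_b
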
Download

Paper Nr: 33
Title:

Anomaly Detection for Industrial Inspection using Convolutional Autoencoder and Deep Feature-based One-class Classification

Authors:

Jamal Saeedi and Alessandro Giusti

Abstract: Part-to-part and image-to-image variability pose a great challenge to automatic anomaly detection systems; an additional challenge is applying deep learning methods on high-resolution images. Motivated by these challenges together with the promising results of transfer learning for anomaly detection, this paper presents a new approach combining the autoencoder-based method with one-class deep feature classification. Specifically, after training an autoencoder using only normal images, we compute error images or anomaly maps between input and reconstructed images from the autoencoder. Then, we embed these anomaly maps using a pre-trained convolutional neural network feature extractor. Having the embeddings from the anomaly maps of training samples, we train a one-class classifier, k-nearest neighbor, to compute an anomaly score for an unseen sample. Finally, a simple threshold-based criterion is used to determine if the unseen sample is anomalous or not. We compare the proposed algorithm with state-of-the-art methods on multiple challenging datasets: one representing zipper cursors, acquired specifically for this work; and eight belonging to the recently introduced MVTec dataset collection, representing various industrial anomaly detection tasks. We find that the proposed approach outperforms alternatives in all cases, achieving average precision scores of 94.77% for the zipper cursor dataset and 96.35% on average for the MVTec datasets.
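
A condensed sketch of the pipeline described above, with a hypothetical `autoencoder` and a standard ImageNet backbone standing in for the components used in the paper (assumes a recent torchvision).

    import torch, torchvision
    from sklearn.neighbors import NearestNeighbors

    # Assumed components: `autoencoder` trained on normal images only, and a
    # frozen ImageNet backbone used to embed the anomaly maps.
    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    @torch.no_grad()
    def embed_anomaly_map(autoencoder, image):          # image: (1, 3, H, W)
        recon = autoencoder(image)
        amap = (image - recon).abs()                    # per-pixel error map
        return backbone(amap).squeeze(0).numpy()        # deep embedding

    # Train time: embeddings of normal samples -> k-NN model
    # knn = NearestNeighbors(n_neighbors=5).fit(normal_embeddings)
    # Test time: anomaly score = mean distance to the k nearest normal embeddings
    # score = knn.kneighbors(embed_anomaly_map(autoencoder, x)[None])[0].mean()
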
Download

Paper Nr: 41
Title:

3D Detection of Vehicles from 2D Images in Traffic Surveillance

Authors:

M. H. Zwemer, D. Scholte, R. J. Wijnhoven and P. N. de With

Abstract: Traffic surveillance systems use monocular cameras and automatic visual algorithms to locate and observe traffic movement. Object detection results in 2D object boxes around vehicles, which relate to inaccurate real-world locations. In this paper, we employ the existing KM3D CNN-based 3D detection model, which directly estimates 3D boxes around vehicles in the camera image. However, the KM3D model has previously only been applied in autonomous driving use cases with different camera viewpoints. Moreover, 3D annotation datasets are not available for traffic surveillance, requiring the construction of a new dataset for training the 3D detector. We propose and validate four different annotation configurations that generate 3D box annotations using only camera calibration, scene information (static vanishing points) and existing 2D annotations. Our novel Simple box method does not require segmentation of vehicles and provides a simpler 3D box construction, which assumes a fixed predefined vehicle width. The Simple box pipeline provides the best 3D object detection results, resulting in 51.9% AP3D using KM3D trained on this data. The 3D object detector can estimate an accurate 3D box up to a distance of 125 meters from the camera, with a median error of the box middle point of only 0.5-1.0 meter.
Download

Paper Nr: 79
Title:

Iterative 3D Deformable Registration from Single-view RGB Images using Differentiable Rendering

Authors:

Arul S. Periyasamy, Max Schwarz and Sven Behnke

Abstract: For autonomous robotic systems, comprehensive 3D scene parsing is a prerequisite. Machine learning techniques used for 3D scene parsing that incorporate knowledge about the process of 2D image generation from 3D scenes have a big potential. This has sparked an interest in differentiable renderers that provide approximate gradients of the rendered image with respect to scene and object parameters. An efficient differentiable renderer facilitates approaching many 3D scene parsing problems using a render-and-compare framework, where the object and scene parameters are optimized by minimizing the difference between rendered and observed images. In this work, we introduce StilllebenDR, a light-weight scalable differentiable renderer built as an extension to the Stillleben library and use it for 3D deformable registration from single-view RGB images. Our end-to-end differentiable pipeline achieves results comparable to state-of-the-art methods without any training and outperforms the competing methods significantly in the presence of pose initialization errors.
Download

Paper Nr: 89
Title:

SieveNet: Estimating the Particle Size Distribution of Kernel Fragments in Whole Plant Corn Silage

Authors:

Christoffer B. Rasmussen, Kristian Kirk and Thomas B. Moeslund

Abstract: In this paper we present a method for efficiently measuring the particle size distribution of whole plant corn silage with a sieving-based network. Our network, SieveNet, learns to predict the size class of predefined sieves for kernel fragments through a novel sieve-based anchor matching algorithm during training. SieveNet improves inference timings by 40% compared to previous approaches that are based on two-stage recognition networks. Additionally, an estimated Corn Silage Processing score computed from the network predictions shows strong correlations of up to 0.93 (r^2) against physically sieved samples, improving correlation results by a number of percentage points compared to previous approaches.
Download

Paper Nr: 98
Title:

GAN-based Face Mask Removal using Facial Landmarks and Pixel Errors in Masked Region

Authors:

Hitoshi Yoshihashi, Naoto Ienaga and Maki Sugimoto

Abstract: In 2020 and beyond, the opportunities to communicate with others while wearing a face mask have increased. A mask hides the mouth and facial muscles, making it difficult to convey facial expressions to others. In this study, we propose to use generative adversarial networks (GAN) to complete the facial region hidden by the mask. We defined custom loss functions that focus on the errors of the feature point coordinates of the face and the pixels in the masked region. As a result, we were able to generate images with higher quality than existing methods.
Download

Paper Nr: 101
Title:

Fine-grained Action Recognition using Attribute Vectors

Authors:

Sravani Yenduri, Nazil Perveen, Vishnu Chalavadi and C. K. Mohan

Abstract: Modelling the subtle interactions between humans and objects is crucial in fine-grained action recognition. However, the existing methodologies that employ deep networks for modelling the interactions are highly supervised, computationally expensive, and need a vast amount of annotated data for training. In this paper, a framework for an efficient representation of fine-grained actions is proposed. First, spatio-temporal features, namely, the histogram of optical flow (HOF) and the motion boundary histogram (MBH), are extracted for each input video, as these features are more robust to irregular motions and capture the motion information in videos efficiently. Then a large Gaussian mixture model (GMM) is trained using maximum a posteriori (MAP) adaptation to capture the attributes of fine-grained actions. The adapted means of all mixtures are concatenated to form an attribute vector for each fine-grained action video. This attribute vector is of large dimension and contains redundant attributes that may not contribute to the particular fine-grained action. So, factor analysis is used to decompose the high-dimensional attribute vector to a low-dimensional one in order to retain only the attributes which are responsible for that fine-grained action. The efficacy of the proposed approach is demonstrated on three fine-grained action datasets, namely, JIGSAWS, KSCGR, and MPII Cooking 2.
Download

Paper Nr: 112
Title:

Hear Me out: Fusional Approaches for Audio Augmented Temporal Action Localization

Authors:

Anurag Bagchi, Jazib Mahmood, Dolton Fernandes and Ravi K. Sarvadevabhatla

Abstract: State-of-the-art architectures for untrimmed video Temporal Action Localization (TAL) have only considered RGB and Flow modalities, leaving the information-rich audio modality unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider audio and video modalities for supervised TAL. We experimentally show that our schemes consistently improve performance for state-of-the-art video-only TAL approaches. Specifically, they help achieve a new state-of-the-art performance on large-scale benchmark datasets - ActivityNet-1.3 (54.34 mAP@0.5) and THUMOS14 (57.18 mAP@0.5). Our experiments include ablations involving multiple fusion schemes, modality combinations, and TAL architectures. Our code, models, and associated data are available at https://github.com/skelemoa/tal-hmo.
Download

Paper Nr: 127
Title:

SemSegDepth: A Combined Model for Semantic Segmentation and Depth Completion

Authors:

Juan P. Lagos and Esa Rahtu

Abstract: Holistic scene understanding is pivotal for the performance of autonomous machines. In this paper, we propose a new end-to-end model for performing semantic segmentation and depth completion jointly. The vast majority of recent approaches have developed semantic segmentation and depth completion as independent tasks. Our approach relies on RGB and sparse depth as inputs to our model and produces a dense depth map and the corresponding semantic segmentation image. It consists of a feature extractor, a depth completion branch, a semantic segmentation branch and a joint branch which further processes semantic and depth information altogether. The experiments conducted on the Virtual KITTI 2 dataset demonstrate, and provide further evidence, that combining both tasks, semantic segmentation and depth completion, in a multi-task network can effectively improve the performance of each task. Code is available at https://github.com/juanb09111/semantic_depth.
Download

Paper Nr: 147
Title:

On the Cross-dataset Generalization in License Plate Recognition

Authors:

Rayson Laroca, Everton V. Cardoso, Diego R. Lucio, Valter Estevam and David Menotti

Abstract: Automatic License Plate Recognition (ALPR) systems have shown remarkable performance on license plates (LPs) from multiple regions due to advances in deep learning and the increasing availability of datasets. The evaluation of deep ALPR systems is usually done within each dataset; therefore, it is questionable if such results are a reliable indicator of generalization ability. In this paper, we propose a traditional-split versus leave-one-dataset-out experimental setup to empirically assess the cross-dataset generalization of 12 Optical Character Recognition (OCR) models applied to LP recognition on 9 publicly available datasets with a great variety in several aspects (e.g., acquisition settings, image resolution, and LP layouts). We also introduce a public dataset for end-to-end ALPR that is the first to contain images of vehicles with Mercosur LPs and the one with the highest number of motorcycle images. The experimental results shed light on the limitations of the traditional-split protocol for evaluating approaches in the ALPR context, as there are significant drops in performance for most datasets when training and testing the models in a leave-one-dataset-out fashion.
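
The leave-one-dataset-out protocol contrasted above can be summarised in a few lines of Python, assuming generic `train_fn` and `eval_fn` callables supplied by the user; this is only a sketch of the evaluation loop, not the authors' code.

    def leave_one_dataset_out(datasets, train_fn, eval_fn):
        """datasets: dict name -> (train_split, test_split).
        For each dataset, train on all *other* datasets and test on the held-out one."""
        results = {}
        for held_out in datasets:
            train_data = [datasets[d][0] for d in datasets if d != held_out]
            model = train_fn(train_data)               # assumed user-supplied
            results[held_out] = eval_fn(model, datasets[held_out][1])
        return results
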
Download

Paper Nr: 154
Title:

Leveraging Local Domains for Image-to-Image Translation

Authors:

Anthony Dell’Eva, Fabio Pizzati, Massimo Bertozzi and Raoul de Charette

Abstract: Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. For example, when translating from highway scenes to offroad, i2i networks easily focus on global color features but ignore obvious traits for humans like the absence of lane markings. In this paper, we leverage human knowledge about spatial domain characteristics, which we refer to as 'local domains', and demonstrate its benefit for image-to-image translation. Relying on simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to the target. We experiment on three tasks ranging from unstructured environments to adverse weather. Our comprehensive evaluation setting shows we are able to generate realistic translations with minimal priors, training only on a few images. Furthermore, we show that when trained on our translated images, all tested proxy tasks are significantly improved, without ever seeing the target domain during training.
Download

Paper Nr: 158
Title:

CAM-SegNet: A Context-Aware Dense Material Segmentation Network for Sparsely Labelled Datasets

Authors:

Yuwen Heng, Yihong Wu, Srinandan Dasmahapatra and Hansung Kim

Abstract: Contextual information reduces the uncertainty in the dense material segmentation task and improves segmentation quality. Typical contextual information includes object labels, place labels, or feature maps extracted by a neural network. Existing methods typically adopt a pre-trained network to generate contextual feature maps without fine-tuning, since dedicated material datasets do not contain contextual labels. As a consequence, these contextual features may not improve the material segmentation performance. In consideration of this problem, this paper proposes a hybrid network architecture, the CAM-SegNet, to jointly learn from contextual and material features during training without extra contextual labels. The utility of our CAM-SegNet is demonstrated by guiding the network to learn boundary-related contextual features with the help of a self-training approach. Experiments show that CAM-SegNet can recognise materials that have similar appearances, achieving an improvement of 3-20% in accuracy and 6-28% in Mean IoU.
Download

Paper Nr: 167
Title:

The MVTec 3D-AD Dataset for Unsupervised 3D Anomaly Detection and Localization

Authors:

Paul Bergmann, Xin Jin, David Sattlegger and Carsten Steger

Abstract: We introduce the first comprehensive 3D dataset for the task of unsupervised anomaly detection and localization. It is inspired by real-world visual inspection scenarios in which a model has to detect various types of defects on manufactured products, even if it is trained only on anomaly-free data. There are defects that manifest themselves as anomalies in the geometric structure of an object. These cause significant deviations in a 3D representation of the data. We employed a high-resolution industrial 3D sensor to acquire depth scans of 10 different object categories. For all object categories, we present a training and validation set, each of which solely consists of scans of anomaly-free samples. The corresponding test sets contain samples showing various defects such as scratches, dents, holes, contaminations, or deformations. Precise ground-truth annotations are provided for every anomalous test sample. An initial benchmark of 3D anomaly detection methods on our dataset indicates a considerable room for improvement.
Download

Paper Nr: 180
Title:

Improving the Sample-complexity of Deep Classification Networks with Invariant Integration

Authors:

Matthias Rath and Alexandru P. Condurache

Abstract: Leveraging prior knowledge on intraclass variance due to transformations is a powerful method to improve the sample complexity of deep neural networks. This makes them applicable to practically important use-cases where training data is scarce. Rather than being learned, this knowledge can be embedded by enforcing invariance to those transformations. Invariance can be imposed using group-equivariant convolutions followed by a pooling operation. For rotation-invariance, previous work investigated replacing the spatial pooling operation with invariant integration which explicitly constructs invariant representations. Invariant integration uses monomials which are selected using an iterative approach requiring expensive pre-training. We propose a novel monomial selection algorithm based on pruning methods to allow an application to more complex problems. Additionally, we replace monomials with different functions such as weighted sums, multi-layer perceptrons and self-attention, thereby streamlining the training of invariant-integration-based architectures. We demonstrate the improved sample complexity on the Rotated-MNIST, SVHN and CIFAR-10 datasets where rotation-invariant-integration-based Wide-ResNet architectures using monomials and weighted sums outperform the respective baselines in the limited sample regime. We achieve state-of-the-art results using full data on Rotated-MNIST and SVHN where rotation is a main source of intraclass variation. On STL-10 we outperform a standard and a rotation-equivariant convolutional neural network using pooling.
Download

Paper Nr: 199
Title:

Segregational Soft Dynamic Time Warping and Its Application to Action Prediction

Authors:

Victoria Manousaki and Antonis Argyros

Abstract: Aligning the execution of complete actions captured in segmented videos has been a problem explored by Dynamic Time Warping (DTW) and Soft Dynamic Time Warping (S-DTW) algorithms. The limitation of these algorithms is that they cannot align unsegmented actions, i.e., actions that appear between other actions. This limitation is mitigated by the use of two existing DTW variants, namely the Open-End DTW (OE-DTW) and the Open-Begin-End DTW (OBE-DTW). OE-DTW is designed for aligning actions of known begin point but unknown end point, while OBE-DTW handles continuous, completely unsegmented actions with unknown begin and end points. In this paper, we combine the merits of S-DTW with those of OE-DTW and OBE-DTW. In that direction, we propose two new DTW variants, the Open-End Soft DTW (OE-S-DTW) and the Open-Begin-End Soft DTW (OBE-S-DTW). The superiority of the proposed algorithms lies in the combination of the soft-minimum operator and the relaxation of the boundary constraints of S-DTW, with the segregational capabilities of OE-DTW and OBE-DTW, resulting in better and differentiable action alignment in the case of continuous, unsegmented videos. We evaluate the proposed algorithms on the task of action prediction on standard datasets such as MHAD, MHAD101-v/-s, MSR Daily Activities and CAD-120. Our experimental results show the superiority of the proposed algorithms to existing video alignment methods.
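
For reference, a plain (hard-minimum) Open-Begin-End DTW can be written as below; the soft variants proposed in the paper replace the `min` with a differentiable soft-minimum, which is not shown here.

    import numpy as np

    def obe_dtw(query, reference, dist=lambda x, y: np.linalg.norm(x - y)):
        """Open-Begin-End DTW: the query must be matched completely, but it may
        start and end anywhere inside the (longer, unsegmented) reference."""
        n, m = len(query), len(reference)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, :] = 0.0                                  # open begin: free start column
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c = dist(query[i - 1], reference[j - 1])
                D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, 1:].min()                          # open end: best end position
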
Download

Paper Nr: 227
Title:

Vehicle Pair Activity Classification using QTC and Long Short Term Memory Neural Network

Authors:

Rahulan Radhakrishnan and Alaa Alzoubi

Abstract: The automated recognition of vehicle interaction is crucial for self-driving, collision avoidance and security surveillance applications. In this paper, we present a novel Long Short-Term Memory (LSTM) neural network based method for vehicle trajectory classification. We use the Qualitative Trajectory Calculus (QTC) to represent the relative motion between a pair of vehicles. The spatio-temporal features of the interacting vehicles are captured as a sequence of QTC states and then encoded using a one-hot vector representation. Then, we develop an LSTM network to classify QTC trajectories that represent vehicle pairwise activities. Most high-performing LSTM models are manually designed and require expertise in hyperparameter configuration. We adapt a Bayesian Optimisation method to find an optimal LSTM architecture for classifying QTC trajectories of vehicle interaction. We evaluated our method on three different datasets comprising 7257 trajectories of 9 unique vehicle activities in different traffic scenarios. We demonstrate that our proposed method outperforms the state-of-the-art techniques. Further, we evaluated our approach on a combined dataset of the three datasets and achieved an error rate of no more than 1.79%. Though our work mainly focuses on vehicle trajectories, the proposed method is generic and can be used for pairwise analysis of other interacting objects.
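
A minimal sketch of the QTC-sequence classifier described above, in PyTorch; the number of QTC states, the hidden size and the number of classes are placeholder assumptions (the paper tunes the architecture with Bayesian optimisation).

    import torch
    import torch.nn as nn

    # A QTC state is a tuple of qualitative symbols in {-, 0, +}; with four
    # relations this gives 3**4 = 81 possible states (assumed QTC variant).
    NUM_STATES = 81

    def one_hot(states):                      # states: list of ints in [0, 81)
        return torch.eye(NUM_STATES)[torch.tensor(states)]   # (T, 81)

    class QTCLSTMClassifier(nn.Module):
        def __init__(self, hidden=64, num_classes=9):
            super().__init__()
            self.lstm = nn.LSTM(NUM_STATES, hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, x):                 # x: (B, T, 81) one-hot QTC sequence
            _, (h, _) = self.lstm(x)
            return self.head(h[-1])           # logits over vehicle pair activities
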
Download

Paper Nr: 237
Title:

ETL: Efficient Transfer Learning for Face Tasks

Authors:

Thrupthi A. John, Isha Dua, Vineeth N. Balasubramanian and C. V. Jawahar

Abstract: Transfer learning is a popular method for obtaining deep trained models for data-scarce face tasks such as head pose and emotion. However, current transfer learning methods are inefficient and time-consuming as they do not fully account for the relationships between related tasks. Moreover, the transferred model is large and computationally expensive. As an alternative, we propose ETL: a technique that efficiently transfers a pre-trained model to a new task by retaining only cross-task aware filters, resulting in a sparse transferred model. We demonstrate the effectiveness of ETL by transferring VGGFace, a popular face recognition model to four diverse face tasks. Our experiments show that we attain a size reduction up to 97% and an inference time reduction up to 94% while retaining 99.5% of the baseline transfer learning accuracy.
Download

Paper Nr: 239
Title:

Reinforced Damage Minimization in Critical Events for Self-driving Vehicles

Authors:

Francesco Merola, Fabrizio Falchi, Claudio Gennaro and Marco Di Benedetto

Abstract: Self-driving systems have recently received massive attention in both academic and industrial contexts, leading to major improvements in standard navigation scenarios typically identified as well-maintained urban routes. Critical events like road accidents or unexpected obstacles, however, require the execution of specific emergency actions that deviate from the ordinary driving behavior and are therefore harder to incorporate into the system. In this context, we propose a system that is specifically built to take control of the vehicle and perform an emergency maneuver in case of a dangerous scenario. The presented architecture is based on a deep reinforcement learning algorithm, trained in a simulated environment and using raw sensory data as input. We evaluate the system’s performance on several typical pre-accident scenarios and show promising results, with the vehicle being able to consistently perform an avoidance maneuver to nullify or minimize the incoming damage.
Download

Paper Nr: 248
Title:

Evolutional Normal Maps: 3D Face Representations for 2D-3D Face Recognition, Face Modelling and Data Augmentation

Authors:

Michael Danner, Thomas Weber, Patrik Huber, Muhammad Awais, Matthias Raetsch and Josef Kittler

Abstract: We address the problem of 3D face recognition based on either 3D sensor data or a 3D face reconstructed from a 2D face image. We focus on 3D shape representation in terms of a mesh of surface normal vectors. The first contribution of this work is an evaluation of eight different 3D face representations and their multiple combinations. An important contribution of the study is the proposed implementation, which allows these representations to be computed directly from 3D meshes, instead of point clouds. This enhances their computational efficiency. Motivated by the results of the comparative evaluation, we propose a 3D face shape descriptor, named Evolutional Normal Maps, that assimilates and optimises a subset of six of these approaches. The proposed shape descriptor can be modified and tuned to suit different tasks. It is used as input to a deep convolutional network for 3D face recognition. An extensive experimental evaluation using the Bosphorus 3D Face, CASIA 3D Face and JNU-3D Face datasets shows that, compared to state-of-the-art methods, the proposed approach is better in terms of both computational cost and recognition accuracy.
Download

Paper Nr: 249
Title:

Bispectral Pedestrian Detection Augmented with Saliency Maps using Transformer

Authors:

Mohamed A. Marnissi, Ikram Hattab, Hajer Fradi, Anis Sahbani and Najoua E. Ben Amara

Abstract: In this paper, we focus on the problem of automatic pedestrian detection for surveillance applications. Particularly, the main goal is to perform real-time detection from both visible and thermal cameras to exploit their complementary aspects. To handle that, a fusion network that uses features from both inputs and performs augmentation by means of a visual saliency transformation is proposed. This fusion process is incorporated into YOLO-v3 as the base architecture. The resulting detection model is trained in a paired setting in order to improve the results compared to the detection from each single input. To prove the effectiveness of the proposed fusion framework, several experiments are conducted on the KAIST multi-spectral dataset. The obtained results show superior performance compared to single inputs and to other fusion schemes. The proposed approach also has the advantage of a very low computational cost, which is quite important for real-time applications. To prove that, additional tests on a security robot are presented as well.
Download

Short Papers
Paper Nr: 28
Title:

DeepPupil Net: Deep Residual Network for Precise Pupil Center Localization

Authors:

Nikolaos Poulopoulos and Emmanouil Z. Psarakis

Abstract: Precise eye center localization constitutes a very promising but challenging task in many human interaction applications due to many limitations related to the presence of photometric distortions and occlusions as well as pose and shape variations. In this paper, a Fully Convolutional Network (FCN), namely DeepPupil Net, is proposed to localize eye centers precisely by performing image-to-heatmap regression between the eye regions and the corresponding heatmaps. Moreover, a new loss function is introduced in order to incorporate the predicted eye center positions into the training process and penalize inaccurate localizations. The proposed method achieves real-time performance in a general-purpose computer environment and outperforms state-of-the-art eye center localization techniques in terms of accuracy.
Download

Paper Nr: 38
Title:

High Resolution Mask R-CNN-based Damage Detection on Titanium Nitride Coated Milling Tools for Condition Monitoring by using a New Illumination Technique

Authors:

Mühenad Bilal, Sunil Kancharana, Christian Mayer, Daniel Pfaller, Leonid Koval, Markus Bregulla, Rafal Cupek and Adam Ziębiński

Abstract: The implementation of intelligent software in the manufacturing industry is a technology of growing importance and has highlighted the need for improvement in automatization, production, inspection, and quality assurance. An automated inspection system based on deep learning methods can help to enhance inspection and provide a consistent overview of the production line. Camera-based imaging systems are among the most widely used tools, replacing manual industrial quality control tasks. Moreover, an automated damage detection system for milling tools can be employed in quality control during the coating process and to simplify measuring tool life. Deep Convolutional Neural Networks (DCNNs) are state-of-the-art methods used to extract visual features and classify objects. Hence, there is great interest in applying DCNNs to damage detection and classification. However, training a DCNN model on Titanium-Nitride coated (TiN) milling tools is extremely challenging. Due to the coating, optical properties such as reflection and light scattering on the milling tool surface make image capturing for computer vision tasks quite challenging. In addition to the reflection and scattering, the helical-shaped surface of the cutting tools creates shadows, preventing the neural network from efficient training and damage detection. Here, in the context of applying an automated deep learning-based method to detect damage on coated milling tools for quality control, we shed light on a novel illumination technique that allows capturing high-quality images, making efficient damage detection for condition monitoring and quality control reliable. The method is outlined along with results obtained in training ResNet 50 and ResNet 101 models, reaching an overall accuracy of 83% on a dataset containing bounding-box-annotated damages. For instance and semantic segmentation, the state-of-the-art framework Mask R-CNN is employed.
Download

Paper Nr: 44
Title:

Neural Network Pruning based on Filter Importance Values Approximated with Monte Carlo Gradient Estimation

Authors:

Csanád Sándor, Szabolcs Pável and Lehel Csató

Abstract: Neural network pruning is an effective way to reduce memory and time requirements in most deep neural network architectures. Recently developed pruning techniques can remove individual neurons or entire filters from convolutional neural networks, making these “slim” architectures more robust and more resource-efficient. In this paper, we present a simple yet effective method that assigns probabilities to the network units – to filters in convolutional layers and to neurons in fully connected layers – and prunes them based on these values. The probabilities are learned by maximizing the expected value of a score function – calculated from the accuracy – that ranks the network when different units are turned off. Gradients of the probabilities are estimated using Monte Carlo gradient estimation. We conduct experiments on the CIFAR-10 dataset with a small VGG-like architecture as well as on the lightweight version of the ResNet architecture. The results show that our pruning method achieves results comparable with different state-of-the-art algorithms in terms of parameter and floating point operation reduction. In the case of the ResNet-110 architecture, our pruning method removes 72.53% of the floating point operations and 68.89% of the parameters, which marginally surpasses the results of existing pruning methods.
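
A minimal sketch of a score-function (REINFORCE-style) Monte Carlo gradient estimate for Bernoulli keep-probabilities over units, in the spirit of the method described above; `score_fn` and the running-mean baseline are assumptions for illustration, not the authors' exact estimator.

    import torch

    def estimate_gradient(logits, score_fn, n_samples=16):
        """Estimate d E[score] / d logits for Bernoulli keep-probabilities over
        units. score_fn(mask) is assumed to evaluate the pruned network
        (e.g. an accuracy-derived score) with the given units switched off."""
        probs = torch.sigmoid(logits)
        grad = torch.zeros_like(logits)
        baseline = 0.0
        for _ in range(n_samples):
            mask = torch.bernoulli(probs)            # sample which units stay active
            score = score_fn(mask)                   # scalar, no gradient needed
            # d log P(mask) / d logits for a Bernoulli is (mask - probs)
            grad += (score - baseline) * (mask - probs)
            baseline = 0.9 * baseline + 0.1 * score  # running mean for variance reduction
        return grad / n_samples

    # Units whose learned keep-probability ends up low are pruned after convergence.
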
Download

Paper Nr: 52
Title:

The MIS Check-Dam Dataset for Object Detection and Instance Segmentation Tasks

Authors:

Chintan Tundia, Rajiv Kumar, Om Damani and G. Sivakumar

Abstract: Deep learning has led to many recent advances in object detection and instance segmentation, among other computer vision tasks. These advancements have led to wide application of deep learning based methods and related methodologies in object detection tasks for satellite imagery. In this paper, we introduce MIS Check-Dam, a new dataset of check-dams from satellite imagery for building an automated system for the detection and mapping of check-dams, focusing on the importance of irrigation structures used for agriculture. We review some of the most recent object detection and instance segmentation methods and assess their performance on our new dataset. We evaluate several single stage, two-stage and attention based methods under various network configurations and backbone architectures. The dataset and the pre-trained models are available at https://www.cse.iitb.ac.in/gramdrishti/.
Download

Paper Nr: 59
Title:

Oil Spill Detection and Visualization from UAV Images using Convolutional Neural Networks

Authors:

Valério N. Rodrigues Junior, Roberto M. Cavalcante, João R. Almeida, Tiago M. Fé, Ana M. Malhado, Thales Vieira and Krerley Oliveira

Abstract: Marine oil spills may have devastating consequences for the environment, the economy, and society. The 2019 oil spill crisis along the northeast Brazilian coast required immediate actions to control and mitigate the impacts of the pollution. In this paper, we propose an approach based on Deep Learning to efficiently inspect beaches and assist response teams using UAV imagery through an inexpensive visual system. Images collected by UAVs through an aerial survey are split and evaluated by a Convolutional Neural Network. The results are then integrated into heatmaps, which are exploited to perform geospatial visual analysis. Experiments were carried out to validate and evaluate the classifiers, achieving an accuracy of up to 93.6% and an F1 score of 78.6% for the top trained models. We also describe a case study to demonstrate that our approach can be used in real-world situations.
Download

Paper Nr: 75
Title:

Generating Proposals from Corners in RPN to Detect Bees in Dense Scenes

Authors:

Yassine Kriouile, Corinne Ancourt, Katarzyna Wegrzyn-Wolska and Lamine Bougueroua

Abstract: Detecting bees is an important task to help beekeepers in their work, such as counting bees and monitoring their health status. Deep learning techniques can be used to perform this automatic detection. For instance, Faster RCNN is a neural network for object detection that is suitable for this kind of task. But its accuracy degrades on images of bee frames due to the high density of objects. In this paper, we propose to extend the RPN sub-network of Faster RCNN to improve detection recall. In addition to detecting bees from their centers, four branches are added to detect bees from their corners. We constructed a dataset of images and annotated it. We compared this approach to the standard Faster RCNN and found that it improves the detection accuracy. Code is available at https://github.com/yassine-kr/RPNCorner.
Download

Paper Nr: 84
Title:

A General Two-branch Decoder Architecture for Improving Encoder-decoder Image Segmentation Models

Authors:

Sijie Hu, Fabien Bonardi, Samia Bouchafa and Désiré Sidibé

Abstract: Recently, many methods with complex structures were proposed to address image parsing tasks such as image segmentation. These well-designed structures can hardly be used flexibly and require a heavy footprint. This paper focuses on a popular semantic segmentation framework known as encoder-decoder, and points out the phenomenon that existing decoders do not fully integrate the information extracted by the encoder. To alleviate this issue, we propose a more general two-branch paradigm, composed of a main branch and an auxiliary branch, without increasing the number of parameters, and a boundary-enhanced loss computation strategy that makes the two-branch decoders learn complementary information adaptively instead of explicitly indicating the specific learning element. In addition, one branch learns pixels that are difficult to resolve in the other branch, creating a competition between them, which promotes more efficient learning. We evaluate our approach on two challenging image segmentation datasets and show its superior performance with different baseline models. We also perform an ablation study to tease apart the effects of different settings. Finally, we show that our two-branch paradigm can achieve satisfactory results when the auxiliary branch is removed at the inference stage, so that it can be applied to low-resource systems.
Download

Paper Nr: 95
Title:

Automated Damage Inspection of Power Transmission Towers from UAV Images

Authors:

Aleixo Cambeiro Barreiro, Clemens Seibold, Anna Hilsmann and Peter Eisert

Abstract: Infrastructure inspection is a very costly task, requiring technicians to access remote or hard-to-reach places. This is the case for power transmission towers, which are sparsely located and require trained workers to climb them to search for damages. Recently, the use of drones or helicopters for remote recording is increasing in the industry, sparing the technicians this perilous task. This, however, leaves the problem of analyzing large amounts of images, which has great potential for automation. This is a challenging task for several reasons. First, the lack of freely available training data and the difficulty of collecting it complicate this problem. Additionally, the boundaries of what constitutes a damage are fuzzy, introducing a degree of subjectivity in the labelling of the data. The unbalanced class distribution in the images also plays a role in increasing the difficulty of the task. This paper tackles the problem of structural damage detection in transmission towers, addressing these issues. Our main contributions are the development of a system for damage detection on remotely acquired drone images, applying techniques to overcome the issues of data scarcity and ambiguity, as well as an evaluation of the viability of such an approach to solve this particular problem.
Download

Paper Nr: 96
Title:

Monocular Estimation of Translation, Pose and 3D Shape on Detected Objects using a Convolutional Autoencoder

Authors:

Ivar Persson, Martin Ahrnbom and Mikael Nilsson

Abstract: This paper presents a 6DoF-positioning method and shape estimation method for cars from monocular images. We pre-learn principal components, using Principal Component Analysis (PCA), from the shape of cars and use a learnt encoder-decoder structure in order to position the cars and create binary masks of each car instance. The proposed method is tailored towards usefulness for autonomous driving and traffic safety surveillance. The work introduces a novel encoder-decoder framework for this purpose, thus expanding and extending state-of-the-art models for the task. Quantitative and qualitative analyses are performed on the Apolloscape dataset, showing promising results, in particular regarding rotations and segmentation masks.
Download

Paper Nr: 100
Title:

PodNet: Ensemble-based Classification of Podocytopathy on Kidney Glomerular Images

Authors:

George O. Barros, David C. Wanderley, Luciano O. Rebouças, Washington D. Santos, Angelo A. Duarte and Flavio B. Vidal

Abstract: Podocyte lesions in renal glomeruli are identified by pathologists using visual analyses of kidney tissue sections (histological images). By applying automatic visual diagnosis systems, one may reduce the subjectivity of analyses, accelerate the diagnosis process, and improve medical decision accuracy. In this direction, we present here a new dataset of renal glomeruli histological images for podocytopathy classification and a deep neural network model. The dataset consists of 835 digital images (374 with podocytopathy and 430 without podocytopathy), annotated by a group of pathologists. Our proposed method (called here PodNet) is a classification method based on deep neural networks (a pre-trained VGG19) used as feature extractors for images in different color spaces. We compared PodNet with six other state-of-the-art models on two dataset versions (RGB and gray level) and in two different training contexts: pre-trained models (transfer learning from ImageNet) and from scratch, both with hyperparameter tuning. The proposed method achieved classification results of 90.9% F1-score, 88.9% precision, and 93.2% recall on the final validation sets.
Download

Paper Nr: 107
Title:

Multitask Metamodel for Keypoint Visibility Prediction in Human Pose Estimation

Authors:

Romain Guesdon, Carlos Crispim-Junior and Laure Tougne

Abstract: The task of human pose estimation (HPE) aims to predict the coordinates of body keypoints in images. Even if high performance is achieved on HPE nowadays, some difficulties remain to be fully overcome. For instance, a strong occlusion can deceive the methods and make them predict false-positive keypoints with high confidence. This can be problematic in applications that require reliable detection, such as posture analysis in car-safety applications. Despite this difficulty, current HPE solutions are designed to always predict coordinates for each keypoint. To address this problem, we propose a new metamodel that predicts both keypoint coordinates and their visibility. Visibility is an attribute that indicates whether a keypoint is visible, non-visible, or not labeled. Our model is composed of three modules: the feature extraction, the coordinate estimation, and the visibility prediction modules. In this paper, we study the performance of the visibility predictions and the impact of this task on the coordinate estimation. Baseline results are provided on the COCO dataset. Moreover, to measure the performance of this method in a more occluded context, we also use the driver dataset DriPE. Finally, we implement the proposed metamodel on several base models to demonstrate the generality of our metamodel.
Download

Paper Nr: 117
Title:

Evaluation of Long-term Deep Visual Place Recognition

Authors:

Farid Alijani, Jukka Peltomäki, Jussi Puura, Heikki Huttunen, Joni-Kristian Kämäräinen and Esa Rahtu

Abstract: In this paper, we provide a comprehensive study evaluating two state-of-the-art deep metric learning methods for visual place recognition. Visual place recognition is an essential component in visual localization and vision-based navigation, where it provides an initial coarse location. It is used in a variety of autonomous navigation technologies, including autonomous vehicles, drones and computer vision systems. We study recent visual place recognition and image retrieval methods and utilize them to conduct extensive and comprehensive experiments on two diverse and large long-term indoor and outdoor robot navigation datasets, i.e., COLD and Oxford Radar RobotCar, along with ablation studies on the crucial parameters of the deep architectures. Our comprehensive results indicate that the methods can achieve 5 m outdoor and 50 cm indoor place recognition accuracy with a high recall rate of 80%.
Download

Paper Nr: 119
Title:

Human Activity Recognition: A Spatio-temporal Image Encoding of 3D Skeleton Data for Online Action Detection

Authors:

Nassim Mokhtari, Alexis Nédélec and Pierre De Loor

Abstract: Human activity recognition (HAR) based on skeleton data, which can be extracted from videos (Kinect, for example) or provided by a depth camera, is a time series classification problem where handling both spatial and temporal dependencies is a crucial task in order to achieve good recognition. In online human activity recognition, identifying the beginning and end of an action is an important element, which might be difficult in a continuous data flow. In this work, we present a 3D skeleton data encoding method to generate an image that preserves the spatial and temporal dependencies existing between the skeletal joints. To allow online action detection, we combine this encoding system with a sliding window on the continuous data stream. In this way, no start or stop timestamp is needed and the recognition can be done at any moment. A deep learning CNN algorithm is used to achieve online action detection.
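
One plausible way to realise the encoding and the sliding window described above is sketched below in NumPy; the exact joint/frame-to-pixel layout used by the authors may differ.

    import numpy as np

    def encode_window(skeleton_window):
        """skeleton_window: (T, J, 3) array of T frames of J 3D joints.
        Maps joints to image rows, frames to columns and x/y/z to channels,
        so spatial (joint) and temporal (frame) neighbourhoods stay adjacent."""
        img = np.transpose(skeleton_window, (1, 0, 2)).astype(np.float32)  # (J, T, 3)
        mn, mx = img.min(), img.max()
        return (img - mn) / (mx - mn + 1e-8)          # normalise to [0, 1]

    def sliding_windows(stream, window=32, stride=8):
        """Yield overlapping windows from a continuous (N, J, 3) skeleton stream,
        so recognition can run at any moment without explicit start/stop timestamps."""
        for start in range(0, len(stream) - window + 1, stride):
            yield encode_window(stream[start:start + window])
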
Download

Paper Nr: 126
Title:

Deep Set Conditioned Latent Representations for Action Recognition

Authors:

Akash Singh, Tom de Schepper, Kevin Mets, Peter Hellinckx, José Oramas and Steven Latré

Abstract: In recent years, multi-label, multi-class video action recognition has gained significant popularity. While reasoning over temporally connected atomic actions is mundane for intelligent species, standard artificial neural networks (ANN) still struggle to classify them. In the real world, atomic actions often temporally connect to form more complex composite actions. The challenge lies in recognising composite actions of varying durations while other distinct composite or atomic actions occur in the background. Drawing upon the success of relational networks, we propose methods that learn to reason over the semantic concepts of objects and actions. We empirically show how ANNs benefit from pretraining, relational inductive biases and unordered set-based latent representations. In this paper, we propose deep set conditioned I3D (SCI3D), a two-stream relational network that employs a latent representation of state and a visual representation for reasoning over events and actions. It learns to reason about temporally connected actions in order to identify all of them in the video. The proposed method achieves an improvement of around 1.49% mAP in atomic action recognition and 17.57% mAP in composite action recognition, over an I3D-NL baseline, on the CATER dataset.
Download

Paper Nr: 128
Title:

Streamlining Action Recognition in Autonomous Shared Vehicles with an Audiovisual Cascade Strategy

Authors:

João Ribeiro Pinto, Pedro Carvalho, Carolina Pinto, Afonso Sousa, Leonardo Capozzi and Jaime S. Cardoso

Abstract: With the advent of self-driving cars, and big companies such as Waymo or Bosch pushing forward into fully driverless transportation services, the in-vehicle behaviour of passengers must be monitored to ensure safety and comfort. The use of audio-visual information is attractive by its spatio-temporal richness as well as non-invasive nature, but faces the likely constraints posed by available hardware and energy consumption. Hence new strategies are required to improve the usage of these scarce resources. We propose the processing of audio and visual data in a cascade pipeline for in-vehicle action recognition. The data is processed by modality-specific sub-modules, with subsequent ones being used when a confident classification is not reached. Experiments show an interesting accuracy-acceleration trade-off when compared with a parallel pipeline with late fusion, presenting potential for industrial applications on embedded devices.
Download

Paper Nr: 133
Title:

Subclass-based Undersampling for Class-imbalanced Image Classification

Authors:

Daniel Lehmann and Marc Ebner

Abstract: Image classification problems are often class-imbalanced in practice. Such a class imbalance can negatively affect the classification performance of CNN models. A State-of-the-Art (SOTA) approach to address this issue is to randomly undersample the majority class. However, random undersampling can result in an information loss because the randomly selected samples may not come from all distinct groups of samples of the class (subclasses). In this paper, we examine an alternative undersampling approach. Our method undersamples a class by selecting samples from all subclasses of the class. To identify the subclasses, we investigated if clustering of the high-level features of CNN models is a suitable approach. We conducted experiments on 2 real-world datasets. Their results show that our approach can outperform a) models trained on the imbalanced dataset and b) models trained using several SOTA methods addressing the class imbalance.
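A possible reading of the subclass-based undersampling step, assuming the subclasses are found by running k-means on high-level CNN features of the majority class and then sampling evenly from each cluster; the cluster count and the sampling policy are assumptions, not the authors' exact procedure.

import numpy as np
from sklearn.cluster import KMeans

def subclass_undersample(features, n_keep, n_subclasses=10, seed=0):
    """Undersample the majority class by selecting samples from all of its
    subclasses instead of purely at random.

    features : (N, D) array of high-level CNN features of the majority class
    n_keep   : number of majority samples to retain
    Returns the indices of the retained samples."""
    labels = KMeans(n_clusters=n_subclasses, random_state=seed, n_init=10).fit_predict(features)
    rng = np.random.default_rng(seed)
    per_cluster = n_keep // n_subclasses
    keep = []
    for c in range(n_subclasses):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, len(idx))
        keep.extend(rng.choice(idx, size=take, replace=False))
    return np.array(keep)

# toy usage: 1000 majority-class feature vectors reduced to roughly 200
if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(1000, 64))
    kept = subclass_undersample(feats, n_keep=200)
    print(len(kept), "majority samples kept")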
Download

Paper Nr: 135
Title:

Multimodal Personality Recognition using Cross-attention Transformer and Behaviour Encoding

Authors:

Tanay Agrawal, Dhruv Agarwal, Michal Balazia, Neelabh Sinha and François Bremond

Abstract: Personality computing and affective computing have gained recent interest in many research areas. The datasets for the task generally have multiple modalities like video, audio, language and bio-signals. In this paper, we propose a flexible model for the task which exploits all available data. The task involves complex relations and to avoid using a large model for video processing specifically, we propose the use of behaviour encoding which boosts performance with minimal change to the model. Cross-attention using transformers has become popular in recent times and is utilised for fusion of different modalities. Since long term relations may exist, breaking the input into chunks is not desirable, thus the proposed model processes the entire input together. Our experiments show the importance of each of the above contributions.
Download

Paper Nr: 136
Title:

Improving Semantic Image Segmentation via Label Fusion in Semantically Textured Meshes

Authors:

Florian Fervers, Timo Breuer, Gregor Stachowiak, Sebastian Bullinger, Christoph Bodensteiner and Michael Arens

Abstract: Models for semantic segmentation require a large amount of hand-labeled training data which is costly and time-consuming to produce. For this purpose, we present a label fusion framework that is capable of improving semantic pixel labels of video sequences in an unsupervised manner. We make use of a 3D mesh representation of the environment and fuse the predictions of different frames into a consistent representation using semantic mesh textures. Rendering the semantic mesh using the original intrinsic and extrinsic camera parameters yields a set of improved semantic segmentation images. Due to our optimized CUDA implementation, we are able to exploit the entire c-dimensional probability distribution of annotations over c classes in an uncertainty-aware manner. We evaluate our method on the Scannet dataset where we improve annotations produced by the state-of-the-art segmentation network ESANet from 52.05% to 58.25% pixel accuracy. We publish the source code of our framework online to foster future research in this area (https://github.com/fferflo/semantic-meshes). To the best of our knowledge, this is the first publicly available label fusion framework for semantic image segmentation based on meshes with semantic textures.
Download

Paper Nr: 140
Title:

Honeybee Re-identification in Video: New Datasets and Impact of Self-supervision

Authors:

Jeffrey Chan, Hector Carrión, Rémi Mégret, José A. Rivera and Tugrul Giray

Abstract: This paper presents an experimental study of long-term re-identification of honeybees from the appearance of their abdomen in videos. The first contribution is composed of two image datasets of single honeybees extracted from 12 days of video and annotated with information about their identity on long-term and short-term scales. The long-term dataset contains 8,962 images associated with 181 known identities and is used to evaluate the long-term re-identification of individuals. The short-term dataset contains 109,654 images associated with 4,949 short-term tracks that provide multiple views of an individual suitable for self-supervised training. A deep convolutional network was trained to map an image of the honeybee’s abdomen to a 128-dimensional feature vector using several approaches. Re-identification was evaluated in test setups that capture different levels of difficulty: from the same hour to a different day. The results show that training with the short-term self-supervised information performed better than training with the supervised long-term dataset, with the best performance achieved by using both. Ablation studies show the impact of the quantity of data used in training as well as the impact of augmentation, which will guide the design of future systems for individual identification.
Download

Paper Nr: 145
Title:

Describing Image Focused in Cognitive and Visual Details for Visually Impaired People: An Approach to Generating Inclusive Paragraphs

Authors:

Daniel L. Fernandes, Marcos H. F. Ribeiro, Fabio R. Cerqueira and Michel M. Silva

Abstract: Several services for people with visual disabilities have emerged recently due to achievements in the Assistive Technologies and Artificial Intelligence areas. Despite the growth in assistive systems availability, there is a lack of services that support specific tasks, such as understanding the image context presented in online content, e.g., webinars. Image captioning techniques and their variants are limited as Assistive Technologies as they do not match the needs of visually impaired people when generating specific descriptions. We propose an approach for generating the context of webinar images by combining a dense captioning technique with a set of filters, to fit the captions to our domain, and a language model for the abstractive summary task. The results demonstrate that, by combining image analysis methods and neural language models, we can produce descriptions with higher interpretability that are focused on the relevant information for this group of people.
Download

Paper Nr: 161
Title:

Augmented Radar Points Connectivity based on Image Processing Techniques for Object Detection and Classification

Authors:

Mohamed Sabry, Ahmed Hussein, Amr Elmougy and Slim Abdennadher

Abstract: Perception and scene understanding are complex modules that require data from multiple types of sensors to construct a weather-resilient system that can operate in almost all conditions. This is mainly due to the drawbacks of each sensor on its own. The only sensor that is able to work in a variety of conditions is the radar. However, the sparseness of radar pointclouds from open source datasets makes it under-perform in object classification tasks. This is in contrast to the LiDAR, which, after constraints and filtration, produces an average of 22,000 points per frame within a grid map image representation of 120 x 120 meters in the real world. Therefore, in this paper, a preprocessing module is proposed to enable the radar to partially reconnect objects in the scene from a sparse pointcloud. This adapts the radar to object classification tasks rather than the conventional uses in automotive applications, such as Adaptive Cruise Control or object tracking. The proposed module is used as a preprocessing step in a Deep Learning pipeline for a classification task. The evaluation was carried out on the nuScenes dataset, as it contains both radar and LiDAR data, which enables the comparison between the performance of both modules. After applying the preprocessing module, this work managed to bring the radar-based classification significantly close to the performance of the LiDAR.
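One way such a preprocessing module could be sketched, assuming the radar points are rasterised into a 120 x 120 m grid-map image and that simple morphological dilation and closing are used to partially reconnect returns from the same object; the grid resolution and kernel sizes are illustrative choices, not the paper's.

import numpy as np
import cv2

def points_to_grid(points_xy, grid_size=600, extent=60.0):
    """Rasterise radar points (metres, ego-centred) into a grid-map image,
    e.g. 600 px covering +/- 60 m (0.2 m per cell)."""
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    scale = grid_size / (2 * extent)
    cols = ((points_xy[:, 0] + extent) * scale).astype(int)
    rows = ((points_xy[:, 1] + extent) * scale).astype(int)
    valid = (cols >= 0) & (cols < grid_size) & (rows >= 0) & (rows < grid_size)
    grid[rows[valid], cols[valid]] = 255
    return grid

def reconnect_objects(grid, kernel_size=5, iterations=2):
    """Partially reconnect sparse radar returns belonging to the same object
    by dilating and then morphologically closing the binary grid map."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    dilated = cv2.dilate(grid, kernel, iterations=iterations)
    return cv2.morphologyEx(dilated, cv2.MORPH_CLOSE, kernel)

if __name__ == "__main__":
    pts = np.random.default_rng(0).uniform(-60, 60, size=(300, 2))
    sparse = points_to_grid(pts)
    dense = reconnect_objects(sparse)
    print("non-zero cells before/after:", (sparse > 0).sum(), (dense > 0).sum())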
Download

Paper Nr: 176
Title:

TVNet: Temporal Voting Network for Action Localization

Authors:

Hanyuan Wang, Dima Damen, Majid Mirmehdi and Toby Perrett

Abstract: We propose a Temporal Voting Network (TVNet) for action localization in untrimmed videos. This incorporates a novel Voting Evidence Module to locate temporal boundaries more accurately, where temporal contextual evidence is accumulated to predict frame-level probabilities of start and end action boundaries. Our action-independent evidence module is incorporated within a pipeline to calculate confidence scores and action classes. We achieve an average mAP of 34.6% on ActivityNet-1.3, particularly outperforming previous methods at the highest IoU of 0.95. TVNet also achieves mAP of 56.0% when combined with PGCN and 59.1% with MUSES at 0.5 IoU on THUMOS14 and outperforms prior work at all thresholds. Our code is available at https://github.com/hanielwang/TVNet.
Download

Paper Nr: 178
Title:

HRI-Gestures: Gesture Recognition for Human-Robot Interaction

Authors:

Avgi Kollakidou, Frederik Haarslev, Cagatay Odabasi, Leon Bodenhagen and Norbert Krüger

Abstract: Most of people’s communication happens through body language and gestures. Gesture recognition in human-robot interaction is an unsolved problem which limits the possible communication between humans and robots in today’s applications. Gesture recognition can be considered the same problem as action recognition, which is largely solved by deep learning; however, current publicly available datasets do not contain many classes relevant to human-robot interaction. In order to address this problem, a human-robot interaction gesture dataset is required. In this paper, we introduce HRI-Gestures, which includes 13600 instances of RGB and depth image sequences, and joint position files. A state-of-the-art action recognition network is trained on relevant subsets of the dataset and achieves upwards of 96.9% accuracy. However, as the network is designed for the large-scale NTU RGB+D dataset, subpar performance is achieved on the full HRI-Gestures dataset. Further enhancement of gesture recognition is possible with tailored algorithms or an extension of the dataset.
Download

Paper Nr: 182
Title:

StructureNet: Deep Context Attention Learning for Structural Component Recognition

Authors:

Akash Kaothalkar, Bappaditya Mandal and Niladri. B. Puhan

Abstract: Structural component recognition using images is a very challenging task due to the appearance of large components with long continuations existing jointly with very small components, which are often missed by existing methodologies. In this work, the various categories of bridge components are exploited by encoding contextual-level information across both the spatial and channel dimensions. Tensor decomposition is used to design a context attention framework that acquires crucial information across various dimensions by fusing the class contexts and a 3-D attention map. Experimental results on a benchmark bridge component classification dataset show that our proposed architecture attains superior results compared to the current state-of-the-art methodologies.
Download

Paper Nr: 207
Title:

Pose Guided Feature Learning for 3D Object Tracking on RGB Videos

Authors:

Mateusz Majcher and Bogdan Kwolek

Abstract: In this work we propose a new approach to 3D object pose tracking in sequences of RGB images acquired by a calibrated camera. A single hourglass neural network that has been trained to detect fiducial keypoints on a set of objects delivers heatmaps representing 2D locations of the keypoints. Given a calibrated camera model and a sparse object model consisting of 3D locations of the keypoints, the keypoints in hypothesized object poses are projected onto the 2D plane and then matched with the heatmaps. A quaternion particle filter with a probabilistic observation model that uses such a matching is employed to maintain the 3D object pose distribution. A single Siamese neural network is trained for a set of objects on keypoints from the current and previous frame in order to generate a particle in the predicted 3D object pose. The filter draws particles to predict the current pose using its a priori knowledge about the object velocity and includes the 3D object pose predicted by the neural network in the a priori distribution. Thus, the hypothesized 3D object poses are generated using both a priori knowledge about the object velocity in 3D and keypoint-based geometric reasoning as well as relative transformations in the image plane. In an extended algorithm we combine the set of propagated particles with an optimized particle, whose pose is determined by Levenberg-Marquardt.
Download

Paper Nr: 223
Title:

Seeing the Differences in Artistry among Art Fields by using Multi-task Learning

Authors:

Ryo Sato, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method for analyzing the relevance of artistry among multiple art fields by using deep neural networks. Artistry is thought to exist in various man-made objects, such as paintings, sculptures, architecture, and gardens. However, it is not clear whether the artistry, or the human aesthetic sensitivities, in these different art fields are the same or different. To investigate this question, we analyze the relevance of artistry among multiple art fields using deep neural networks. In particular, we show that by using multi-task learning, the relevance of multiple art fields can be analyzed efficiently.
Download

Paper Nr: 238
Title:

Transfer Learning via Test-time Neural Networks Aggregation

Authors:

Bruno Casella, Alessio B. Chisari, Sebastiano Battiato and Mario V. Giuffrida

Abstract: It has been demonstrated that deep neural networks outperform traditional machine learning. However, deep networks lack generalisability; that is, they will not perform as well on a new (testing) set drawn from a different distribution due to domain shift. In order to tackle this known issue, several transfer learning approaches have been proposed, where the knowledge of a trained model is transferred into another to improve performance with different data. However, most of these approaches require additional training steps, or they suffer from catastrophic forgetting that occurs when a trained model overwrites previously learnt knowledge. We address both problems with a novel transfer learning approach that uses network aggregation. We train dataset-specific networks together with an aggregation network in a unified framework. The loss function includes two main components: a task-specific loss (such as cross-entropy) and an aggregation loss. The proposed aggregation loss allows our model to learn how trained deep network parameters can be aggregated with an aggregation operator. We demonstrate that the proposed approach learns model aggregation at test time without any further training step, reducing the burden of transfer learning to a simple arithmetical operation. The proposed approach achieves comparable performance w.r.t. the baseline. Besides, if the aggregation operator has an inverse, we show that our model also inherently allows for selective forgetting, i.e., the aggregated model can forget one of the datasets it was trained on, retaining information on the others.
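To make the idea of aggregation by a simple arithmetical operation concrete, the sketch below averages the parameters of dataset-specific networks with fixed weights; in the paper the aggregation is learned through the aggregation loss, so this is only an assumed, simplified operator.

import copy
import torch

def aggregate_state_dicts(state_dicts, weights=None):
    """Aggregate the parameters of several dataset-specific networks into a
    single model with a weighted average (one possible aggregation operator)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    agg = copy.deepcopy(state_dicts[0])
    for key in agg:
        agg[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return agg

# toy usage: merge two small networks trained on different datasets at test time
if __name__ == "__main__":
    net_a = torch.nn.Linear(8, 3)
    net_b = torch.nn.Linear(8, 3)
    merged = torch.nn.Linear(8, 3)
    merged.load_state_dict(aggregate_state_dicts([net_a.state_dict(), net_b.state_dict()]))
    print(merged(torch.randn(1, 8)).shape)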
Download

Paper Nr: 246
Title:

Using Contrastive Learning and Pseudolabels to Learn Representations for Retail Product Image Classification

Authors:

Muktabh M. Srivastava

Abstract: Retail product image classification problems are often few-shot classification problems, given that retail product classes cannot have the type of variation across images that a cat, dog or tree could have. Previous works have shown different methods to finetune Convolutional Neural Networks to achieve better classification accuracy on such datasets. In this work, we address the following problem statement: can we pretrain a Convolutional Neural Network backbone which yields good enough representations for retail product images, so that training a simple logistic regression on these representations gives us good classifiers? We use contrastive learning and pseudolabel-based noisy student training to learn representations that achieve accuracy of the order obtained by finetuning the entire ConvNet backbone for retail product image classification.
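The evaluation protocol described above, a frozen pretrained backbone plus a simple logistic regression, can be sketched as follows; the synthetic embeddings stand in for features produced by the contrastively pretrained network, which is not reproduced here.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in embeddings: in practice these would come from the frozen,
# contrastively pretrained backbone applied to product images.
rng = np.random.default_rng(0)
n_classes, dim = 50, 256
centres = rng.normal(size=(n_classes, dim))
labels = rng.integers(0, n_classes, size=2000)
features = centres[labels] + 0.3 * rng.normal(size=(2000, dim))

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.25, random_state=0)

# The backbone stays frozen; only this linear classifier is trained per dataset.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("top-1 accuracy on held-out products:", clf.score(X_te, y_te))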
Download

Paper Nr: 273
Title:

Don’t Miss the Fine Print! An Enhanced Framework to Extract Text from Low Resolution Images

Authors:

Pranay Dugar, Aditya Vikram, Anirban Chatterjee, Kunal Banerjee and Vijay Agneeswaran

Abstract: Scene Text Recognition (STR) enables processing and understanding of text in the wild. However, roadblocks like natural degradation, blur, and uneven lighting in the captured images result in poor accuracy during detection and recognition. Previous approaches have introduced Super-Resolution (SR) as a processing step between detection and recognition; however, post enhancement, there is a significant drop in the quality of the reconstructed text in the image. This drop is especially significant in the healthcare domain because any loss in accuracy can be detrimental. In this paper we quantitatively show the drop in quality of the text in an image caused by existing SR techniques across multiple optimization-based and GAN-based models. We propose a new loss function for training and an improved deep neural network architecture to address these shortcomings and recover text with sharp boundaries in the SR images. We also show that the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) scores are not effective metrics for identifying the quality of the text in an SR image. Extensive experiments show that our model achieves better accuracy and visual improvements against state-of-the-art methods in terms of text recognition accuracy. In the near future, we plan to add our SR module to our company’s already deployed solution for text extraction from product images.
Download

Paper Nr: 277
Title:

Attention-based Gender Recognition on Masked Faces

Authors:

Vincenzo Carletti, Antonio Greco, Alessia Saggese and Mario Vento

Abstract: Gender recognition from face images can be profitably used in several vertical markets, such as targeted advertising and cognitive robotics. However, in the last years, due to the COVID-19 pandemic, the unreliability of such systems when dealing with faces covered by a mask has emerged. In this paper, we propose a novel architecture based on attention layers and trained with a domain specific data augmentation technique for reliable gender recognition of masked faces. The proposed method has been experimentally evaluated on a huge dataset, namely VGGFace2-M, a masked version of the well known VGGFace2 dataset, and the achieved results confirm an improvement of around 4% with respect to traditional gender recognition algorithms, while preserving the performance on unmasked faces.
Download

Paper Nr: 286
Title:

Class-conditional Importance Weighting for Deep Learning with Noisy Labels

Authors:

Bhalaji Nagarajan, Ricardo Marques, Marcos Mejia and Petia Radeva

Abstract: Large-scale accurate labels are very important for training Deep Neural Networks and ensuring high performance. However, creating a clean dataset is expensive since it usually relies on human interaction. For this reason, the labelling process is often made cheaper at the cost of obtaining noisy labels. Learning with Noisy Labels is an active and, at the same time, very challenging area of research. Recent advances in self-supervised learning and robust loss functions have helped to advance noisy label research. In this paper, we propose a loss correction method that relies on dynamic weights computed based on the model training. We extend the existing Contrast to Divide algorithm coupled with DivideMix using a new class-conditional weighting scheme. We validate the method using standard noise experiments and achieve encouraging results.
Download

Paper Nr: 3
Title:

Weakly-supervised Localization of Multiple Objects in Images using Cosine Loss

Authors:

Björn Barz and Joachim Denzler

Abstract: Can we learn to localize objects in images from just image-level class labels? Previous research has shown that this ability can be added to convolutional neural networks (CNNs) trained for image classification post hoc without additional cost or effort using so-called class activation maps (CAMs). However, while CAMs can localize a particular known class in the image quite accurately, they cannot detect and localize instances of multiple different classes in a single image. This limitation is a consequence of the missing comparability of prediction scores between classes, which results from training with the cross-entropy loss after a softmax activation. We find that CNNs trained with the cosine loss instead of cross-entropy do not exhibit this limitation and propose a variation of CAMs termed Dense Class Maps (DCMs) that fuse predictions for multiple classes into a coarse semantic segmentation of the scene. Even though the network has only been trained for single-label classification at the image level, DCMs allow for detecting the presence of multiple objects in an image and locating them. Our approach outperforms CAMs on the MS COCO object detection dataset by a relative increase of 27% in mean average precision.
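A minimal sketch of training with the cosine loss instead of softmax cross-entropy, computing one minus the cosine similarity between the L2-normalised network output and the one-hot class target; the dimensions and the toy batch are illustrative only, not the paper's setup.

import torch
import torch.nn.functional as F

def cosine_loss(outputs, targets, n_classes):
    """1 - cosine similarity between the L2-normalised network output and the
    one-hot class target, used in place of softmax + cross-entropy."""
    one_hot = F.one_hot(targets, n_classes).float()
    outputs = F.normalize(outputs, dim=1)
    return (1.0 - (outputs * one_hot).sum(dim=1)).mean()

# toy usage
if __name__ == "__main__":
    logits = torch.randn(4, 10, requires_grad=True)   # raw network outputs
    labels = torch.tensor([0, 3, 7, 1])
    loss = cosine_loss(logits, labels, n_classes=10)
    loss.backward()
    print(float(loss))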
Download

Paper Nr: 77
Title:

Feature Extraction using Downsampling for Person Re-identification with Low-resolution Images

Authors:

Masashi Nishiyama, Takuya Endo and Yoshio Iwai

Abstract: We investigate whether a downsampling process of high-resolution pedestrian images can improve person re-identification accuracy. Generally, deep-learning and machine-learning techniques are used to extract features that are unaffected by image resolution. However, it requires a large number of pairs of high- and low-resolution images acquired from the same person. Here, we consider a situation in which these resolution pairs cannot be collected. We extract features from low-resolution pedestrian images using only a simple downsampling process that requires no training resolution pairs. We collected image resolution datasets by changing the focal length of the camera lens and the distance from the person to the camera. We confirmed that the person re-identification accuracy of the downsampling process was superior to that of the upsampling. We also confirmed that the low-frequency components corresponding to the output of the downsampling process contain many discriminative features.
Download

Paper Nr: 78
Title:

Applying Center Loss to Multidimensional Feature Space in Deep Neural Networks for Open-set Recognition

Authors:

Daiju Kanaoka, Yuichiro Tanaka and Hakaru Tamukoh

Abstract: With the advent of deep learning, significant improvements in image recognition performance have been achieved. In image recognition, it is generally assumed that all the test data are composed of known classes. This approach is termed closed-set recognition. In closed-set recognition, when an untrained, unknown class is input, it is recognized as one of the trained classes. The method whereby an unknown image is recognized as unknown when it is input is termed open-set recognition. Although several open-set recognition methods have been proposed, none of these previous methods excels in all three evaluation items: learning cost, recognition performance, and scalability from closed-set recognition models. To address this, we propose an open-set recognition method using the distance between features in the multidimensional feature space of neural networks. By applying center loss to the feature space, we aim to maintain the classification accuracy of closed-set recognition and improve the unknown detection performance. In our experiments, we achieved state-of-the-art performance on the MNIST, SVHN, and CIFAR-10 datasets. In addition, the proposed approach shows excellent performance in terms of the three evaluation items.
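A rough sketch of the two ingredients mentioned above: a center loss that pulls features towards learnable class centers, and a distance-based open-set decision that rejects samples far from every center. The threshold and feature dimensionality are assumptions, not the paper's settings.

import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Penalises the distance between each feature vector and the learnable
    centre of its class, pulling same-class features together."""
    def __init__(self, n_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))

    def forward(self, features, labels):
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

def reject_unknown(features, centers, threshold):
    """Open-set decision: a sample whose nearest class centre is farther than
    `threshold` is labelled unknown (-1), otherwise it gets that class."""
    dists = torch.cdist(features, centers)            # (batch, n_classes)
    min_dist, pred = dists.min(dim=1)
    pred[min_dist > threshold] = -1
    return pred

if __name__ == "__main__":
    feats = torch.randn(8, 64)
    crit = CenterLoss(n_classes=10, feat_dim=64)
    print(float(crit(feats, torch.randint(0, 10, (8,)))))
    print(reject_unknown(feats, crit.centers.detach(), threshold=10.0))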
Download

Paper Nr: 80
Title:

Automated Human Movement Segmentation by Means of Human Pose Estimation in RGB-D Videos for Climbing Motion Analysis

Authors:

Raul Beltrán B., Julia Richter and Ulrich Heinkel

Abstract: The individual movement characterization of the human body parts is a fundamental task for the study of different activities executed by a person. Changes in position, speed and frequency of the different limbs reveal the kind of activity and allow us to estimate whether an action is well performed or not. Part of this characterization consists of establishing when the action begins and ends, which is a difficult process when attempted by purely optical means, since the subject’s pose in the image must first be extracted before the movement variables can be identified. Human motion analysis has been approached in multiple studies through methods ranging from stochastic models to artificial intelligence prediction, and more recently the latest research has been extended to sport climbing, employing centre-of-mass analysis. In this paper, we present a method to identify the beginning and end of the movements of human body parts, through the analysis of kinematic variables obtained from RGB-D videos, with the aim of motion analysis in climbing. Application tests with OpenPose, PoseNet and Vision are presented to determine the optimal framework for human pose estimation in this sports scenario, and finally, the proposed method is validated to segment the movements of a climber on the climbing wall.
Download

Paper Nr: 97
Title:

Tennis Strokes Recognition from Generated Stick Figure Video Overlays

Authors:

Boris Bačić and Ishara Bandara

Abstract: In this paper, we contribute to the existing body of knowledge of video indexing technology by presenting a novel approach for recognition of tennis strokes from consumer-grade video cameras. To classify four categories with three strokes of interest (forehand, backhand, serve, no-stroke), we extract features as a time series from stick figure overlays generated using the OpenPose library. To process the spatiotemporal feature space, we experimented with three variations of LSTM-based classifier models. On a selection of publicly available videos, the trained models achieved an average accuracy of between 97% and 100%. To demonstrate the transferability of our approach, future work will include other individual and team sports, while maintaining focus on feature extraction techniques with minimal reliance on domain expertise.
Download

Paper Nr: 104
Title:

Detecting Patches on Road Pavement Images Acquired with 3D Laser Sensors using Object Detection and Deep Learning

Authors:

Syed I. Hassan, Dympna O’sullivan, Susan Mckeever, David Power, Ray Mcgowan and Kieran Feighan

Abstract: Regular pavement inspections are key to good road maintenance and detecting road defects. Advanced pavement inspection systems such as LCMS (Laser Crack Measurement System) can automatically detect the presence of simple defects (e.g. ruts) using 3D lasers. However, such systems still require manual involvement to complete the detection of more complex pavement defects (e.g. patches). This paper proposes an automatic patch detection system using object detection techniques. To our knowledge, this is the first time state-of-the-art object detection models (Faster RCNN and SSD MobileNet-V2) have been used to detect patches inside images acquired by 3D profiling sensors. Results show that the object detection model can successfully detect patches inside such images and suggest that our proposed approach could be integrated into existing pavement inspection systems. The contributions of this paper are (1) an automatic pavement patch detection model for images acquired by 3D profiling sensors and (2) a comparative analysis of Faster RCNN and SSD MobileNet-V2 models for automatic patch detection.
Download

Paper Nr: 105
Title:

Weakly Supervised Segmentation of Histopathology Images: An Insight in Feature Maps Ability for Learning Models Interpretation

Authors:

Yanbo Feng, Adel Hafiane and Hélène Laurent

Abstract: A feature map is obtained from a middle layer of a convolutional neural network (CNN); it carries the regional information captured by the network itself about the target of the input image. This property is widely used in weakly supervised learning to achieve target localization and segmentation. However, the traditional way of processing feature maps is often tied to the weights of the output layer. In this paper, the weak correlation between feature maps and these weights is discussed. We argue that it is not accurate to directly transplant the weights of the output layer to feature maps: the global mean value of a feature map loses its spatial information, so weighting scalars cannot accurately constrain the three-dimensional feature maps. We highlight that the feature map in a specific channel is invariant to the target’s location and can stably activate the more complete region directly related to the target; that is, the feature map’s ability is strongly correlated with the channel.
Download

Paper Nr: 129
Title:

Video-based Behavior Understanding of Children for Objective Diagnosis of Autism

Authors:

Abid Ali, Farhood Negin, Susanne Thümmler and Francois Bremond

Abstract: One of the major diagnostic criteria for Autism Spectrum Disorder (ASD) is the recognition of stereotyped behaviors. However, it primarily relies on parental interviews and clinical observations, which result in a prolonged diagnosis cycle preventing ASD children from timely treatment. To help clinicians speed up the diagnosis process, we propose a computer-vision-based solution. First, we collected and annotated a novel dataset for action recognition tasks in videos of children with ASD in an uncontrolled environment. Second, we propose a multi-modality fusion network based on 3D CNNs. In the first stage of our method, we pre-process the RGB videos to get the ROI (child) using Yolov5 and DeepSORT algorithms. For optical flow extraction, we use the RAFT algorithm. In the second stage, we perform extensive experiments on different deep learning frameworks to propose a baseline. In the last stage, a multi-modality-based late fusion network is proposed to classify and evaluate performance of ASD children. The results revealed that the multi-modality fusion network achieves the best accuracy as compared to other methods. The baseline results also demonstrate the potential of an action-recognition-based system to assist clinicians in a reliable, accurate, and timely diagnosis of ASD disorder.
Download

Paper Nr: 130
Title:

Buildings Extraction from Historical Topographic Maps via a Deep Convolution Neural Network

Authors:

Christos Xydas, Anastasios L. Kesidis, Kleomenis Kalogeropoulos and Andreas Tsatsaris

Abstract: The cartographic representation is static by definition. Therefore, reading a map of the past can provide information that corresponds to the accuracy, technology, and scientific knowledge of the time of its creation. Digital technology enables the current researcher to "copy" a historical map and "transcribe" it to today. In this way, a cartographic reduction from the past to the present is possible, with parallel visualization of new information (historical geodata), which the researcher has at his disposal, in addition to the background. In this work a deep learning approach is presented for the extraction of buildings from historical topographic maps. A deep convolution neural network based on the U-Net architecture is trained on a large number of image patches in a deep image-to-image regression mode in order to effectively isolate the buildings from the topographic map while ignoring other surrounding or overlapping information like texts or other irrelevant geospatial features. Several experimental scenarios on a historical census topographic map investigate the applicability of the method under various patch sizes as well as patch sampling methods. The results so far show that the proposed method delivers promising outcomes in terms of building detection accuracy.
Download

Paper Nr: 174
Title:

Using Student Action Recognition to Enhance the Efficiency of Tele-education

Authors:

Eleni Dimitriadou and Andreas Lanitis

Abstract: Due to the COVID-19 pandemic, many schools worldwide are using tele-education for class delivery. However, this causes a problem related to students’ active class participation. We propose to address the problem with a system that recognizes student’s actions and informs the teacher accordingly, while preserving the privacy of students. In the proposed action recognition system, seven typical actions performed by students attending online courses, are recognized using Convolutional Neural Network (CNN) architectures. The actions considered were defined by considering the relevant literature and educator’s views, and ensure that they provide information about the physical presence, active participation, and distraction of students, that constitute important pedagogical aspects of class delivery. The action recognition process is performed locally on the device of each student, thus it is imperative to use classification methods that require minimal computational load and memory requirements. Initial experimental results indicate that the proposed action recognition system provides promising classification results, when dealing with new instances of previously enrolled students or when dealing with previously unseen students.
Download

Paper Nr: 209
Title:

Fusion of Different Features by Cross Cooperative Learning for Semantic Segmentation

Authors:

Ryota Ikedo and Kazuhiro Hotta

Abstract: Deep neural networks have achieved high accuracy in the field of image recognition, and this technology is expected to be used in medical applications, autonomous driving and other fields. Therefore, various deep learning methods have been studied for many years. Recently, many studies have used a backbone network as an encoder for feature extraction. Of course, the extracted features change when we change the backbone network. This paper focuses on the differences between features extracted from two backbone networks. This makes it possible to obtain information that cannot be obtained from a single backbone network, giving us richer information to solve a task. In addition, we use cross cooperative learning to fuse the features of different backbone networks effectively. In experiments on two image segmentation datasets, our proposed method achieved better segmentation accuracy than a conventional method using a single backbone network and an ensemble of networks.
Download

Paper Nr: 218
Title:

Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental Study

Authors:

Filippos Gouidis, Theodore Patkos, Antonis Argyros and Dimitris Plexousakis

Abstract: The detection of object states in images (State Detection - SD) is a problem of both theoretical and practical importance and it is tightly interwoven with other important computer vision problems, such as action recognition and affordance detection. It is also highly relevant to any entity that needs to reason and act in dynamic domains, such as robotic systems and intelligent agents. Despite its importance, up to now, the research on this problem has been limited. In this paper, we attempt a systematic study of the SD problem. First, we introduce the Object State Detection Dataset (OSDD), a new publicly available dataset consisting of more than 19,000 annotations for 18 object categories and 9 state classes. Second, using a standard deep learning framework used for Object Detection (OD), we conduct a number of appropriately designed experiments, towards an in-depth study of the behavior of the SD problem. This study enables the setup of a baseline on the performance of SD, as well as its relative performance in comparison to OD, in a variety of scenarios. Overall, the experimental outcomes confirm that SD is harder than OD and that tailored SD methods need to be developed for addressing effectively this significant problem.
Download

Paper Nr: 219
Title:

Skeleton-based Online Sign Language Recognition using Monotonic Attention

Authors:

Natsuki Takayama, Gibran Benitez-Garcia and Hiroki Takahashi

Abstract: Sequence-to-sequence models have been successfully applied to improve continuous sign language word recognition in recent years. Although various methods for continuous sign language word recognition have been proposed, these methods assume offline recognition and lack further investigation in online and streaming situations. In this study, skeleton-based continuous sign language word recognition for online situations was investigated. A combination of spatial-temporal graph convolutional networks and recurrent neural networks with soft attention was employed as the base model. Further, three types of monotonic attention techniques were applied to extend the base model for online recognition. The monotonic attention included hard monotonic attention, monotonic chunkwise attention, and monotonic infinite lookback attention. The performance of the proposed models was evaluated in offline and online recognition settings. A conventional Japanese sign language video dataset, including 275 types of isolated word videos and 113 types of sentence videos, was utilized to evaluate the proposed models. The results showed the effectiveness of monotonic attention for online continuous sign language word recognition.
Download

Paper Nr: 225
Title:

Diversifying Image Synthesis using Data Classification

Authors:

Yuta Suzuki, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a method for generating highly diverse images in GAN-based image generation. In recent years, GANs that generate diverse images, such as MSGAN and BicycleGAN, have been proposed. Using these methods, it is possible to generate a variety of images to some extent, but the generated images are still less diverse than the training images. That is, it remains a difficult problem to generate a variety of images, even when a wide variety of training images is used for training. Thus, in this paper, we propose a new GAN structure that enables us to generate more diverse images than the existing methods. Our method estimates the distribution of training images in advance and learns to imitate the diversity of the training images. The effectiveness of the proposed method is shown by comparative experiments with existing methods.
Download

Paper Nr: 226
Title:

Combining Text and Image Knowledge with GANs for Zero-Shot Action Recognition in Videos

Authors:

Kaiqiang Huang, Luis Miralles-Pechuán and Susan Mckeever

Abstract: The recognition of actions in videos is an active research area in machine learning, relevant to multiple domains such as health monitoring, security and social media analysis. Zero-Shot Action Recognition (ZSAR) is a challenging problem in which models are trained to identify action classes that have not been seen during the training process. According to the literature, the most promising ZSAR approaches make use of Generative Adversarial Networks (GANs). GANs can synthesise visual embeddings for unseen classes conditioned on either textual information or images related to the class labels. In this paper, we propose a Dual-GAN approach based on the VAEGAN model to prove that the fusion of visual and textual-based knowledge sources is an effective way to improve ZSAR performance. We conduct empirical ZSAR experiments of our approach on the UCF101 dataset. We apply the following embedding fusion methods for combining text-driven and image-driven information: averaging, summation, maximum, and minimum. Our best result from the Dual-GAN model is achieved with the maximum embedding fusion approach, which yields an average accuracy of 46.37%, an improvement of at least 5.37% over the leading approaches.
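The four embedding fusion methods listed above are simple element-wise operations; a sketch of how they could be applied to a text-conditioned and an image-conditioned embedding is given below, with the embedding size chosen arbitrarily.

import numpy as np

def fuse_embeddings(text_emb, image_emb, mode="maximum"):
    """Element-wise fusion of a text-conditioned and an image-conditioned
    visual embedding for an unseen action class."""
    ops = {
        "averaging": lambda a, b: (a + b) / 2.0,
        "summation": lambda a, b: a + b,
        "maximum": np.maximum,
        "minimum": np.minimum,
    }
    return ops[mode](text_emb, image_emb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t, v = rng.normal(size=512), rng.normal(size=512)
    for mode in ("averaging", "summation", "maximum", "minimum"):
        print(mode, fuse_embeddings(t, v, mode)[:3])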
Download

Paper Nr: 234
Title:

Exploitation of Noisy Automatic Data Annotation and Its Application to Hand Posture Classification

Authors:

Georgios Lydakis, Iason Oikonomidis, Dimitrios Kosmopoulos and Antonis A. Argyros

Abstract: The success of deep learning in recent years relies on the availability of large amounts of accurately annotated training data. In this work, we investigate a technique for utilizing automatically annotated data in classification problems. Using a small number of manually annotated samples, and a large set of data that feature automatically created, noisy labels, our approach trains a Convolutional Neural Network (CNN) in an iterative manner. The automatic annotations are combined with the predictions of the network in order to gradually expand the training set. In order to evaluate the performance of the proposed approach, we apply it to the problem of hand posture recognition from RGB images. We compare the results of training a CNN classifier with and without the use of our technique. Our method yields a significant increase in average classification accuracy, and also decreases the deviation in class accuracies, thus indicating the validity and the usefulness of the proposed approach.
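The iterative expansion of the training set can be sketched as follows, assuming a sample is absorbed when the current model's confident prediction agrees with its automatic (noisy) label; a linear classifier stands in for the CNN and the acceptance rule is an assumption, not the authors' exact criterion.

import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_expansion(X_clean, y_clean, X_noisy, y_noisy, rounds=3, conf=0.9):
    """Start from a small manually annotated set and gradually absorb
    automatically annotated samples whose noisy label agrees with a confident
    model prediction."""
    X_train, y_train = X_clean.copy(), y_clean.copy()
    remaining = np.arange(len(X_noisy))
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        probs = clf.predict_proba(X_noisy[remaining])
        preds = clf.classes_[probs.argmax(axis=1)]
        accept = (preds == y_noisy[remaining]) & (probs.max(axis=1) >= conf)
        X_train = np.vstack([X_train, X_noisy[remaining][accept]])
        y_train = np.concatenate([y_train, y_noisy[remaining][accept]])
        remaining = remaining[~accept]
        if len(remaining) == 0:
            break
    return clf, len(y_train)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xc, yc = rng.normal(size=(40, 16)), rng.integers(0, 3, 40)
    Xn, yn = rng.normal(size=(400, 16)), rng.integers(0, 3, 400)
    model, n_used = iterative_expansion(Xc, yc, Xn, yn)
    print("training set grew to", n_used, "samples")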
Download

Paper Nr: 241
Title:

Evaluation of RGB and LiDAR Combination for Robust Place Recognition

Authors:

Farid Alijani, Jukka Peltomäki, Jussi Puura, Heikki Huttunen, Joni-Kristian Kämäräinen and Esa Rahtu

Abstract: Place recognition is one of the main challenges in the localization, mapping and navigation tasks of self-driving vehicles under various perceptual conditions, including appearance and viewpoint variations. In this paper, we provide a comprehensive study on the utility of a fine-tuned Deep Convolutional Neural Network (DCNN) with three pooling layers (MAC, SpoC and GeM) to learn a global image representation for place recognition in an end-to-end manner using three different sensor data modalities: (1) only RGB images; (2) only intensity or only depth 3D LiDAR point clouds projected into 2D images; and (3) early fusion of RGB images and LiDAR point clouds (both intensity and depth) to form a unified global descriptor that leverages robust features of both modalities. The experimental results on the diverse and large long-term Oxford Radar RobotCar dataset show that 5 m outdoor place recognition accuracy with a high recall rate of 90% is achieved using early fusion of RGB and LiDAR sensor data modalities when the fine-tuned network with the GeM pooling layer is utilized.
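Of the three pooling layers studied, GeM (generalised mean pooling) is the least standard; a common formulation, sketched below in PyTorch, raises the feature map to a learnable power p before average pooling, which interpolates between average and max pooling. This is a generic sketch, not the paper's exact implementation.

import torch
import torch.nn.functional as F

class GeM(torch.nn.Module):
    """Generalised-mean (GeM) pooling: a learnable exponent p interpolates
    between average pooling (p = 1) and max pooling (p -> infinity)."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = torch.nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):                              # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)                            # (B, C) global descriptor

if __name__ == "__main__":
    fmap = torch.randn(2, 512, 7, 7).abs()             # e.g. last conv feature map
    print(GeM()(fmap).shape)                           # torch.Size([2, 512])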
Download

Area 4 - Applications and Services

Full Papers
Paper Nr: 86
Title:

Cervical Spine Range of Motion Measurement Utilizing Image Analysis

Authors:

Kana Matsuo, Koji Fujita, Takafumi Koyama, Shingo Morishita and Yuta Sugiura

Abstract: Diseases of the cervical spine often cause more serious impediments to daily activities than diseases of other parts of the body, and thus require prompt and accurate diagnosis. One of the indicators used for diagnosing cervical spine diseases is measurements of the range of motion (RoM) angle. However, the main measurement method is manual, which creates a burden on physicians. In this work, we investigate the possibility of measuring the RoM angle of the cervical spine from cervical X-ray images by using Mask R-CNN and image processing. The results of measuring the RoM angle with the proposed cervical spine motion angle measurement system showed that the mean error from the true value was 3.5 degrees and the standard deviation was 2.8 degrees. Moreover, the standard deviation of the specialist measurements used for comparison was 2.9 degrees, while that of the proposed system was just 0 degrees, indicating that there was no variation in the measurements of the proposed system.
Download

Paper Nr: 137
Title:

Semi-supervised Surface Anomaly Detection of Composite Wind Turbine Blades from Drone Imagery

Authors:

Jack W. Barker, Neelanjan Bhowmik and Toby P. Breckon

Abstract: Within commercial wind energy generation, the monitoring and predictive maintenance of wind turbine blades in-situ is a crucial task, for which remote monitoring via aerial survey from an Unmanned Aerial Vehicle (UAV) is commonplace. Turbine blades are susceptible to both operational and weather-based damage over time, reducing the energy efficiency output of turbines. In this study, we address automating the otherwise time-consuming task of both blade detection and extraction, together with fault detection within UAV-captured turbine blade inspection imagery. We propose BladeNet, an application-based, robust dual architecture to perform both unsupervised turbine blade detection and extraction, followed by super-pixel generation using the Simple Linear Iterative Clustering (SLIC) method to produce regional clusters. These clusters are then processed by a suite of semi-supervised detection methods. Our dual architecture detects surface faults of glass fibre composite material blades with high aptitude while requiring minimal prior manual image annotation. BladeNet produces an Average Precision (AP) of 0.995 across our Ørsted blade inspection dataset for offshore wind turbines and 0.223 across the Danish Technical University (DTU) NordTank turbine blade inspection dataset. BladeNet also obtains an AUC of 0.639 for surface anomaly detection across the Ørsted blade inspection dataset.
Download

Paper Nr: 191
Title:

Camera Pose Estimation using Human Head Pose Estimation

Authors:

Robert Fischer, Michael Hödlmoser and Margrit Gelautz

Abstract: This paper presents a novel framework for camera pose estimation using the human head as a calibration object. The proposed approach enables extrinsic calibration based on 2D input images (RGB and/or NIR), without any need for additional calibration objects or depth information. The method can be used for single cameras or multi-camera networks. For estimating the human head pose, we rely on a deep learning based 2D human facial landmark detector and fit a 3D head model to estimate the 3D human head pose. The paper demonstrates the feasibility of this novel approach and shows its performance on both synthetic and real multi-camera data. We compare our calibration procedure to a traditional checkerboard calibration technique and calculate calibration errors between camera pairs. Additionally, we examine the robustness to varying input parameters, such as simulated people with different skin tone and gender, head models, and variations in camera positions. We expect our method to be useful in various application domains including automotive in-cabin monitoring, where the flexibility and ease of handling the calibration procedure are often more important than very high accuracy.
Download

Paper Nr: 267
Title:

Counting or Localizing? Evaluating Cell Counting and Detection in Microscopy Images

Authors:

Luca Ciampi, Fabio Carrara, Giuseppe Amato and Claudio Gennaro

Abstract: Image-based automatic cell counting is an essential yet challenging task, crucial for diagnosing many diseases. Current solutions rely on Convolutional Neural Networks and provide astonishing results. However, their performance is often measured only considering counting errors, which can lead to masked mistaken estimations; a low counting error can be obtained with a high but equal number of false positives and false negatives. Consequently, it is hard to determine which solution truly performs best. In this work, we investigate three general counting approaches that have been successfully adopted in the literature for counting several different categories of objects. Through an experimental evaluation over three public collections of microscopy images containing marked cells, we assess not only their counting performance compared to several state-of-the-art methods but also their ability to correctly localize the counted cells. We show that commonly adopted counting metrics do not always agree with the localization performance of the tested models, and thus we suggest integrating the proposed evaluation protocol when developing novel cell counting solutions.
Download

Short Papers
Paper Nr: 48
Title:

Detecting Corruption in Real Video Game Graphics using Deep Convolutional Neural Networks

Authors:

Matthieu Chan Chee, Vinay Pandit and Max Kiehn

Abstract: Early detection of video game display corruption is essential to maintain the highest quality standards and to reduce the time to market of new GPUs targeted for the gaming industry. This paper presents a Deep Learning approach to automate gameplay corruption detection, which otherwise requires labor-intensive manual inspection. Unlike prior efforts which are reliant on synthetically generated corrupted images, we collected real-world examples of corrupted images from over 50 game titles. We trained an EfficientNet to classify input game frames as corrupted or golden using a two-stage training strategy and extensive hyperparameter search. Our method was able to accurately detect a variety of geometric, texture, and color corruptions with a precision of 0.989 and recall of 0.888.
Download

Paper Nr: 70
Title:

Altering Facial Expression based on Textual Emotion

Authors:

Mohammad Imrul Jubair, Md. Masud Rana, Md. Amir Hamza, Mohsena Ashraf, Fahim Ahsan Khan and Ahnaf Tahseen Prince

Abstract: Faces and their expressions are one of the most potent subjects for digital images. Detecting emotions from images is a long-standing task in the field of computer vision; however, performing its reverse, synthesizing facial expressions from images, is quite new. Such operations of regenerating images with different facial expressions, or altering an existing expression in an image, require a Generative Adversarial Network (GAN). In this paper, we aim to change the facial expression in an image using a GAN, where the input image with an initial expression (i.e., happy) is altered to a different expression (i.e., disgusted) for the same person. We used StarGAN techniques on a modified version of the MUG dataset to accomplish this objective. Moreover, we extended our work further by remodeling facial expressions in an image according to the emotion indicated in a given text. To this end, we applied a Long Short-Term Memory (LSTM) method to extract emotion from the text and forwarded it to our expression-altering module. As a demonstration of our working pipeline, we also created an application prototype of a blog that regenerates the profile picture with different expressions based on the user’s textual emotion.
Download

Paper Nr: 115
Title:

Deep Features Extraction for Endoscopic Image Matching

Authors:

Houda Chaabouni-Chouayakh, Manel Farhat and Achraf Ben-Hamadou

Abstract: Image feature matching is a key step in creating endoscopic mosaics of the bladder inner walls, which help urologists in lesion detection and patient follow-up. Endoscopic images, on the other hand, are particularly difficult to match because they are weakly textured and have limited surface area per frame. Deep learning techniques have recently gained popularity in a variety of computer vision tasks. The ability of convolutional neural networks (CNNs) to learn rich and optimal features contributes to the success of these methods. In this paper, we present a novel deep learning based approach for endoscopic image matching. Instead of standard handcrafted image descriptors, we designed a CNN to extract a feature vector from local interest points. We propose an efficient approach to train our CNN without manually annotated data, using an adaptive triplet loss which has the advantage of improving the inter-class separability as well as the intra-class compactness. The training dataset is automatically constructed; each sample is a triplet of patches: an anchor, one positive sample (a perspective transformation of the anchor) and one negative sample. The experimental results show, at the end of the training step, a more discriminative representation space where the anchor becomes closer to the positive sample and farther from the negative one in the embedding space. Comparison with the well-known standard hand-crafted descriptor SIFT in terms of recall and precision showed the effectiveness of the proposed approach, which reaches the top recall value for a precision value of 0.97.
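A plain triplet margin loss over anchor, positive and negative patch embeddings is sketched below to illustrate the training signal; the paper's adaptive variant, which modulates this objective, is not reproduced here.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on L2-normalised patch embeddings: pull the
    anchor towards its warped (positive) patch and push it away from the
    negative patch."""
    a, p, n = (F.normalize(t, dim=1) for t in (anchor, positive, negative))
    d_ap = (a - p).pow(2).sum(dim=1)
    d_an = (a - n).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()

if __name__ == "__main__":
    emb = lambda: torch.randn(16, 128, requires_grad=True)   # stand-in embeddings
    loss = triplet_loss(emb(), emb(), emb())
    loss.backward()
    print(float(loss))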
Download

Paper Nr: 118
Title:

Ensemble Clustering for Histopathological Images Segmentation using Convolutional Autoencoders

Authors:

Ilias Rmouque, Maxime Devanne, Jonathan Weber, Germain Forestier and Cédric Wemmert

Abstract: Unsupervised deep learning using autoencoders has shown excellent results in image analysis and computer vision. However, only few studies have been presented in the field of digital pathology, where proper labelling of the objects of interest is a particularly costly and difficult task. Thus, having a first fully unsupervised segmentation could greatly help in the analysis process of such images. In this paper, many architectures of convolutional autoencoders are compared to study the influence of three main hyperparameters: (1) the number of convolutional layers, (2) the number of convolutions in each layer and (3) the size of the latent space. Different clustering algorithms are also compared, and we propose a new way to obtain more precise results by applying ensemble clustering techniques, which consist in combining multiple clustering results.
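A possible sketch of the ensemble clustering step, assuming several k-means partitions of the autoencoder latent vectors are combined through a co-association matrix that is then cut with agglomerative clustering; the number of runs and the final clustering choice are assumptions, not the paper's exact method.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def ensemble_clustering(latents, n_clusters=4, n_runs=10):
    """Combine several k-means partitions of autoencoder latent vectors via a
    co-association matrix, then cut it with agglomerative clustering."""
    n = len(latents)
    coassoc = np.zeros((n, n))
    for seed in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(latents)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_runs
    # requires scikit-learn >= 1.2 for `metric=`; older versions use `affinity=`
    final = AgglomerativeClustering(n_clusters=n_clusters, metric="precomputed",
                                    linkage="average").fit_predict(1.0 - coassoc)
    return final

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # stand-in for latent vectors produced by the convolutional autoencoder
    latents = np.vstack([rng.normal(loc=i * 3, size=(50, 8)) for i in range(4)])
    print(np.bincount(ensemble_clustering(latents)))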
Download

Paper Nr: 149
Title:

Automated Video Edition for Synchronized Mobile Recordings of Concerts

Authors:

Albert Jiménez, Lluís Gómez and Joan Llobera

Abstract: We propose a computer vision model that paves the road towards a system that automatically creates a video of a live concert by combining multiple recordings from the audience. The automatic editing system divides the editing problem into three parts: synchronizing recordings with media streaming technology, selecting the scene cut position, and selecting the next shot among the different contributions using an attention-based shot prediction model. We train the shot prediction model using camera transitions in professionally edited videos of concerts, and evaluate it with both an accuracy metric and a human judgement study. Results show that our system selects the same video source as the ground truth in 38.8% of the cases when challenged with a random number of possible sources ranging between 5 and 10. For the remaining samples, subjective preference between the selected image and the ground truth is at chance level for non-experts. Image editing experts do show better-than-chance performance when asked to predict the next selected shot.
Download

Paper Nr: 170
Title:

Identification of over One Thousand Individual Wild Humpback Whales using Fluke Photos

Authors:

Takashi Yoshikawa, Masami Hida, Chonho Lee, Haruna Okabe, Nozomi Kobayashi, Sachie Ozawa, Hideo Saito, Masaki Kan, Susumu Date and Shinji Shimojo

Abstract: Identifying individual humpback whales by photographs of their tails is valuable for understanding the ecology of wild whales. We have about 10,000 photos of 1,850 identified whales taken in the sea area around Okinawa over a 30-year period. The identification process at this scale is difficult not only for the human eye but also for machine vision, as the number of photographs per individual whale is very low. About 30% of the whales have only a single photograph, and 80% have fewer than five. In addition, the shapes of the tails and the black and white patterns on them are vague, and these change readily with the whale’s slightest movement and changing photo-shooting conditions. We propose a practical method for identifying a humpback whale by accurate segmentation of the fluke region using a combination of a deep neural network and GrabCut. Useful features for identifying each individual whale are then extracted by both histograms of image features and a wavelet transform of the trailing edge. The test results for 323 photos show that the correct individuals are ranked within the top 30 for 89% of the photos, and ranked at the top for 76% of the photos.
Download

Paper Nr: 202
Title:

17K-Graffiti: Spatial and Crime Data Assessments in São Paulo City

Authors:

Bahram Lavi, Eric K. Tokuda, Felipe Moreno-Vera, Luis G. Nonato, Claudio T. Silva and Jorge Poco

Abstract: Graffiti is an inseparable element of most large cities. It is of critical value to recognize whether graffiti is a product of artistry or a sign of distortion. This study develops a larger graffiti dataset containing a variety of graffiti types and annotated bounding boxes. We use this data to obtain a robust graffiti detection model. Compared with existing methods on the task, the proposed model achieves superior results. As a case study, the created model is evaluated on a vast number of street view images to localize graffiti incidence in the city of São Paulo, Brazil. We also validated our model using the case study data, and, again, the method achieved outstanding performance. The robustness of the technique enabled further analysis of the geographical distribution of graffiti. Considering graffiti as a spatial element of the city, we investigated its relation with crime occurrences. Relatively high correlation values were obtained between graffiti and crimes against pedestrians. Finally, this work raises many questions, such as understanding how these relationships change across the city according to the types of graffiti.
Download

Paper Nr: 213
Title:

Classification of Histopathological Images of Penile Cancer using DenseNet and Transfer Learning

Authors:

Marcos M. Lauande, Amanda M. Teles, Leandro Lima da Silva, Caio F. Matos, Geraldo Braz Júnior, Anselmo Cardoso de Paiva, João D. Sousa de Almeida, Rui C. Oliveira, Haissa O. Brito, Ana G. Nascimento, Ana F. Pestana, Ana D. Santos and Fernanda F. Lopes

Abstract: Penile cancer is a rare tumor that accounts for 2% of cancer cases in men in Brazil. Histopathological analyses are commonly used in its diagnosis, making it possible to assess the degree of the disease, its evolution, and its nature. Over the past decade, works in the field of deep learning have been developed to help pathologists make decisions quickly and reliably, opening up possibilities for new contributions to improve such a complex and time-consuming activity for these professionals. In this work, we present a method that uses a DenseNet to diagnose penile cancer in histopathological images, together with the construction of a dataset (via the Legal Amazon Penis Cancer Project) used to validate this method. In the experiments performed, an F1-score of up to 97.39% and a sensitivity of up to 98.33% were achieved on this binary classification problem (normal or squamous cell carcinoma).
Download

Paper Nr: 230
Title:

Semantic Risk-aware Costmaps for Robots in Industrial Applications using Deep Learning on Abstracted Safety Classes from Synthetic Data

Authors:

Thomas Weber, Michael Danner, Bo Zhang, Matthias Rätsch and Andreas Zell

Abstract: For collision and obstacle avoidance as well as trajectory planning, robots usually generate and use a simple 2D costmap without any semantic information about the detected obstacles. A robot's path planning will therefore simply adhere to an arbitrarily large safety margin around obstacles. A more optimal approach is to adjust this safety margin according to the class of an obstacle. For class prediction, an image-processing convolutional neural network can be trained. One of the problems in the development and training of any neural network is the creation of a training dataset. The first part of this work describes methods and free open-source software that allow fast generation of annotated datasets. Our pipeline can be applied to various objects and environment settings and makes it extremely easy for anyone to synthesise training data from 3D source data. We create a fully synthetic industrial environment dataset with 10k physically based rendered images and annotations. Our dataset and sources are publicly available at https://github.com/LJMP/synthetic-industrial-dataset. Subsequently, we train a convolutional neural network with our dataset for costmap safety-class prediction. We analyse different class combinations and show that learning the safety classes end-to-end directly with a small dataset, instead of using a class lookup table, improves the quantity and precision of the predictions.
Download

Paper Nr: 232
Title:

Graph-based Shot Type Classification in Large Historical Film Archives

Authors:

Daniel Helm, Florian Kleber and Martin Kampel

Abstract: To analyze films and documentaries (indexing, content understanding), a shot type classification is needed. State-of-the-art approaches use traditional CNN-based methods, which need large datasets for training (CineScale with 792,000 frames or MovieShots with 46K shots). To overcome this problem, a Graph-based Shot Type Classifier (GSTC) is proposed, which is able to classify shots into the following types: Extreme-Long-Shot (ELS), Long-Shot (LS), Medium-Shot (MS), Close-Up (CU), Intertitle (I), and Not Available/Not Clear (NA). The methodology is evaluated on standard datasets as well as on a newly published dataset, HistShotDS-Ext, including 25,000 frames. The proposed Graph-based Shot Type Classifier reaches a classification accuracy of 86%.
Download

Paper Nr: 236
Title:

Detecting Anomalies Reliably in Long-term Surveillance Systems

Authors:

Jinsong Liu, Ivan Nikolov, Mark P. Philipsen and Thomas B. Moeslund

Abstract: In surveillance systems, detecting anomalous events such as emergencies or potentially dangerous incidents by manual labor is an expensive task. To improve this, automatic anomaly detection based on the reconstruction error of an autoencoder (AE) has been studied extensively. However, these detection methods are usually evaluated on benchmark datasets of relatively short duration, a few minutes or hours. This differs from long-term applications, where time-induced environmental changes impose an additional influence on the reconstruction error. To reduce this effect, we propose a weighted reconstruction error for anomaly detection in long-term conditions, which separates the foreground from the background and gives them different weights when computing the error, so that extra attention is paid to human-related regions. Compared with the conventional reconstruction error, where each pixel contributes equally, the proposed method more than doubles the anomaly detection rate with three kinds of AEs (a variational AE, a memory-guided AE, and a classical AE) running on long-term (three-month) thermal datasets, proving the effectiveness of the method.
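
A minimal sketch of a foreground-weighted reconstruction error of the kind described above is given below; the weight values and the source of the foreground mask are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weighted_reconstruction_error(frame, reconstruction, fg_mask,
                                  w_fg=0.9, w_bg=0.1):
    """Foreground-weighted reconstruction error for anomaly scoring.

    frame, reconstruction : HxW float arrays (input and AE output).
    fg_mask               : HxW binary mask, 1 on human-related regions.
    w_fg, w_bg            : illustrative weights emphasising the foreground.
    """
    sq_err = (frame - reconstruction) ** 2
    weights = np.where(fg_mask > 0, w_fg, w_bg)
    return float((weights * sq_err).sum() / weights.sum())
```
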
Download

Paper Nr: 272
Title:

Study of LiDAR Segmentation and Model's Uncertainty using Transformer for Different Pre-trainings

Authors:

Mohammed Hassoubah, Ibrahim Sobh and Mohamed Elhelw

Abstract: For semantic segmentation of 2D or 3D inputs, the Transformer architecture suffers from limited localization ability because it lacks low-level details, and it must be pre-trained to perform well; how best to pre-train the Transformer is still an open area of research. In this work, a Transformer is integrated into the U-Net architecture as in (Chen et al., 2021). The resulting architecture is trained to perform semantic segmentation of 2D spherical images generated by projecting the 3D LiDAR point cloud. This integration captures local dependencies through a CNN backbone that processes the input, followed by Transformer processing that captures long-range dependencies. To determine the best pre-training settings, multiple ablations of the network architecture, the self-training loss function, and the self-training procedure were executed. The integrated architecture with self-training improves the mIoU by 1.75% over the U-Net architecture alone, even when the latter is also self-trained. Corrupting the input and self-training the network to reconstruct the original input improves the mIoU by up to 2.9% over using a reconstruction plus contrastive training objective. Self-training the model improves the mIoU by 0.48% over initialising with an ImageNet pre-trained model, even when the pre-trained model is also self-trained. Random initialisation of the batch normalisation layers improves the mIoU by 2.66% over using self-trained parameters. Self-supervised training of the segmentation network also reduces the model's epistemic uncertainty. Finally, the integrated architecture with self-training outperforms SalsaNext (Cortinhal et al., 2020), to our knowledge the best projection-based semantic segmentation network, by 5.53% mIoU on the SemanticKITTI (Behley et al., 2019) validation set with a 2D input dimension of 1024×64.
Download

Paper Nr: 54
Title:

Color-Light Multi Cascade Network for Single Image Depth Prediction on One Perspective Artifact Images

Authors:

Aufaclav K. Frisky, Simon Brenner, Sebastian Zambanini and Robert Sablatnig

Abstract: Different material colors and extreme lighting changes pose a problem for single-image depth prediction on archeological artifacts. These conditions can lead to mispredictions on the surface of the foreground depth reconstruction. We propose a new method, the Color-Light Multi-Cascade Network, to overcome the limitations of single-image depth prediction under these influences. For this new approach, two feature extractors based on Multi-Cascade Networks (MCNet) are trained to deal with light and color problems individually. By concatenating both feature sets, we create an architecture capable of reducing both color and light problems. Three datasets are used to evaluate the method with respect to color and lighting variations. Our experiments show that the individual Color-MCNet improves performance in the presence of color variations but fails to handle extreme light changes, whereas the Light-MCNet shows consistent results under changing lighting conditions but lacks detail. When joining the feature maps of Color-MCNet and Light-MCNet, we obtain a detailed surface both in the presence of different material colors in relief images and under different lighting conditions. These results show that our networks outperform the state of the art on datasets with a limited number of samples. Finally, we also evaluate our joined network on the NYU Depth V2 dataset to compare it with other state-of-the-art methods and obtain comparable performance.
Download

Paper Nr: 155
Title:

Distributed Deep Learning for Multi-Label Chest Radiography Classification

Authors:

Maram Mahmoud A. Monshi, Josiah Poon and Vera Chung

Abstract: Chest radiography supports the clinical diagnosis and treatment of a series of thoracic diseases, such as cardiomegaly, pneumonia, and lung lesion. With the revolution of deep learning and the availability of large chest radiography datasets, binary chest radiography classifiers have been widely proposed in the literature. However, these automatic classifiers neglect label co-occurrence and inter-dependency in chest radiography and fail to make full use of accelerators, resulting in inefficient and computationally expensive models. This paper first studies the effect of chest radiography image format, variations of the Dense Convolutional Network (DenseNet-121) architecture, and parallel training on the chest radiography multi-label classification task. Then, we propose Xclassifier, an efficient multi-label classifier that trains an enhanced DenseNet-121 with a blur-pooling framework to classify chest radiographs according to fourteen predefined labels. Xclassifier achieves efficient memory utilization and GPU computation, reaching 84.10% AUC on the MIMIC-CXR dataset and 83.89% AUC on the CheXpert dataset. The code used to generate the experimental results reported in this paper can be found at https://github.com/MaramMonshi/Xclassifier.
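
For illustration, a minimal fourteen-label setup on a DenseNet-121 backbone might look as follows; the paper's blur-pooling modification, data pipeline, and training schedule are not reproduced here, and the torchvision call assumes a recent torchvision version.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch: DenseNet-121 backbone with a 14-label head and a per-label
# sigmoid loss, the standard multi-label formulation for chest radiographs.
backbone = models.densenet121(weights=None)
backbone.classifier = nn.Linear(backbone.classifier.in_features, 14)

criterion = nn.BCEWithLogitsLoss()              # one sigmoid output per label

images = torch.randn(4, 3, 224, 224)            # dummy batch
targets = torch.randint(0, 2, (4, 14)).float()  # dummy multi-hot labels
logits = backbone(images)
loss = criterion(logits, targets)
loss.backward()
```
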
Download

Area 5 - Motion, Tracking and Stereo Vision

Full Papers
Paper Nr: 114
Title:

MOD SLAM: Mixed Method for a More Robust SLAM without Loop Closing

Authors:

Thomas Belos, Pascal Monasse and Eva Dokladalova

Abstract: In recent years, the state of the art in monocular SLAM has seen remarkable advances in reducing errors and improving robustness, while this quality of results can be obtained in real time on small CPUs. However, most algorithms have a high failure rate out of the box, and systematic errors such as drift remain significant even for the best algorithms. Drift can be handled by a global measure such as loop closure, but this penalizes online data processing. We propose a mixed SLAM, based on ORB-SLAM2 and DSO: MOD SLAM. It is a fusion of photometric and feature-based methods, without being a simple copy of either. We propose a decision system that predicts, at each frame, which optimization will produce the minimum drift, so that only one is selected, saving computational time and resources. We also propose a new map implementation that can actively work with DSO and ORB points at the same time. Our experimental results show that this method increases overall robustness and reduces drift without compromising computational resources. Contrary to the best state-of-the-art algorithms, MOD SLAM can handle 100% of KITTI, TUM, and random phone videos without any configuration change.
Download

Paper Nr: 121
Title:

Event-based Extraction of Navigation Features from Unsupervised Learning of Optic Flow Patterns

Authors:

Paul Fricker, Tushar Chauhan, Christophe Hurter and Benoit R. Cottereau

Abstract: We developed a Spiking Neural Network composed of two layers that processes event-based data captured by a dynamic vision sensor during navigation conditions. The training of the network was performed using a biologically plausible and unsupervised learning rule, Spike-Timing-Dependent Plasticity. With such an approach, neurons in the network naturally become selective to different components of optic flow, and a simple classifier is able to predict self-motion properties from the neural population output spiking activity. Our network has a simple architecture and a restricted number of neurons. Therefore, it is easy to implement on a neuromorphic chip and could be used for embedded applications necessitating low energy consumption.
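
For context, a pair-based STDP weight update of the kind commonly used in such networks is sketched below; the time constants, learning rates, and weight bounds are illustrative and not necessarily those of the paper.

```python
import numpy as np

def stdp_update(w, dt, a_plus=0.01, a_minus=0.012,
                tau_plus=20.0, tau_minus=20.0, w_min=0.0, w_max=1.0):
    """Pair-based STDP: dt = t_post - t_pre (milliseconds).

    Potentiate when the presynaptic spike precedes the postsynaptic one,
    depress otherwise, and clip the weight to [w_min, w_max].
    """
    if dt >= 0:   # pre before post -> potentiation
        dw = a_plus * np.exp(-dt / tau_plus)
    else:         # post before pre -> depression
        dw = -a_minus * np.exp(dt / tau_minus)
    return float(np.clip(w + dw, w_min, w_max))
```
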
Download

Paper Nr: 151
Title:

Scan2Part: Fine-grained and Hierarchical Part-level Understanding of Real-World 3D Scans

Authors:

Alexandr Notchenko, Vladislav Ishimtsev, Alexey Artemov, Vadim Selyutin, Emil Bogomolov and Evgeny Burnaev

Abstract: We propose Scan2Part, a method to segment individual parts of objects in real-world, noisy indoor RGB-D scans. To this end, we vary the part hierarchies of objects in indoor scenes and explore their effect on scene understanding models. Specifically, we use a sparse U-Net-based architecture that captures the fine-scale detail of the underlying 3D scan geometry by leveraging a multi-scale feature hierarchy. To train our method, we introduce the Scan2Part dataset, the first large-scale collection providing detailed semantic labels at the part level in a real-world setting. In total, we provide 242,081 correspondences between 53,618 PartNet parts of 2,477 ShapeNet objects and 1,506 ScanNet scenes, at two spatial resolutions of 2 cm³ and 5 cm³. As output, we are able to predict fine-grained per-object part labels, even when the geometry is coarse or partially missing. Overall, we believe that both our method and the newly introduced dataset are a stepping stone towards structural understanding of real-world 3D environments.
Download

Paper Nr: 183
Title:

Classification Performance of RanSaC Algorithms with Automatic Threshold Estimation

Authors:

Clément Riu, Vincent Nozick, Pascal Monasse and Joachim Dehais

Abstract: The RANdom SAmpling Consensus method (RanSaC) is a staple of computer vision systems and offers a simple way of fitting parameterized models to data corrupted by outliers. It builds many models from small sets of randomly selected data points and then scores them to keep the best. The original scoring function is the number of inliers, i.e., points that fit the model up to some tolerance. The threshold that separates inliers from outliers is data- and model-dependent, and estimating the quality of a RanSaC method is difficult because ground-truth data is scarce or not entirely reliable. To remedy this, we propose a data generation method to create, at will, ground truths that are both realistic and perfectly reliable. We then compare the RanSaC methods that simultaneously fit a model and estimate an appropriate threshold, measuring their classification performance on these semi-synthetic feature correspondence pairs for homography, fundamental, and essential matrices. Whereas the reviewed methods perform on par with the RanSaC baseline in standard cases, they do better in difficult cases, maintaining over 80% precision and recall. The performance increase comes at the cost of running time and analytical complexity, and unexpected failures for some algorithms.
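
For reference, the classical RanSaC loop with a fixed inlier threshold (the baseline scoring described above) can be sketched as follows; the reviewed methods replace the fixed `threshold` with an automatically estimated one. Function names and the generic `fit_model` / `residuals` callbacks are illustrative.

```python
import numpy as np

def ransac(data, fit_model, residuals, sample_size,
           threshold, n_iters=1000, rng=None):
    """Classical RanSaC with a fixed inlier threshold (baseline scoring).

    fit_model(sample)      -> model parameters from a minimal sample.
    residuals(model, data) -> per-point residuals for the whole dataset.
    """
    rng = rng or np.random.default_rng()
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    for _ in range(n_iters):
        sample = data[rng.choice(len(data), sample_size, replace=False)]
        model = fit_model(sample)
        inliers = residuals(model, data) < threshold
        if inliers.sum() > best_inliers.sum():   # score = number of inliers
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```
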
Download

Paper Nr: 192
Title:

BRDF-based Irradiance Image Estimation to Remove Radiometric Differences for Stereo Matching

Authors:

Kebin Peng, John Quarles and Kevin Desai

Abstract: Existing stereo matching methods assume that corresponding pixels between the left and right views have similar intensity. However, in real situations, image intensity tends to be dissimilar because of radiometric differences caused by changes in the reflected light. In this paper, we propose a novel approach for removing these radiometric differences so that stereo matching can be performed effectively. The approach estimates irradiance images based on the Bidirectional Reflectance Distribution Function (BRDF), which describes the ratio of radiance to irradiance for a given image. We demonstrate that to compute an irradiance image we only need to estimate the light source direction and the object's roughness. We approximate the dot product of the unknown light direction parameters as following a Gaussian distribution and use this to estimate the light source direction. The object's roughness is estimated by calculating the pixel intensity variance using a local window strategy. By applying the above steps independently to the original stereo images, we obtain illumination-invariant irradiance images that can be used as input to stereo matching methods. Experiments conducted on well-known stereo estimation datasets demonstrate that our proposed approach significantly reduces the error rate of stereo matching methods.
Download

Paper Nr: 201
Title:

Combining Local and Global Pose Estimation for Precise Tracking of Similar Objects

Authors:

Niklas Gard, Anna Hilsmann and Peter Eisert

Abstract: In this paper, we present a multi-object 6D detection and tracking pipeline for potentially similar and non-textured objects. The combination of a convolutional neural network for object classification and rough pose estimation with a local pose refinement and an automatic mismatch detection enables direct application in real-time AR scenarios. A new network architecture, trained solely with synthetic images, allows simultaneous pose estimation of multiple objects with reduced GPU memory consumption and enhanced performance. In addition, the pose estimates are further improved by a local edge-based refinement step that explicitly exploits known object geometry information. For continuous movements, the sole use of local refinement reduces pose mismatches due to geometric ambiguities or occlusions. We showcase the entire tracking pipeline and demonstrate the benefits of the combined approach. Experiments on a challenging set of non-textured similar objects demonstrate the enhanced quality compared to the baseline method. Finally, we illustrate how the system can be used in a real AR assistance application within the field of construction.
Download

Paper Nr: 264
Title:

Pushing the Efficiency of StereoNet: Exploiting Spatial Sparsity

Authors:

Georgios Zampokas, Christos-Savvas Bouganis and Dimitrios Tzovaras

Abstract: Current CNN-based stereo matching methods have demonstrated superior performance compared to traditional stereo matching methods. However, mapping these algorithms onto embedded devices with limited compute resources while achieving high performance is challenging due to the high computational complexity of CNN-based methods. The recently proposed StereoNet network achieves disparity estimation with reduced complexity without greatly deteriorating performance. To push this performance-to-complexity trade-off further, we propose an optimization applied to StereoNet that adapts the computations to the input data, steering them to the regions of the input that benefit from the CNN-based stereo matching algorithm, while the rest of the input is processed by a traditional, less computationally demanding method. Key to the proposed methodology is a lightweight CNN that predicts the importance of refining a region of the input for the quality of the final disparity map, allowing the system to trade off computational complexity for disparity error on demand and enabling the application of these methods to embedded systems with real-time requirements.
Download

Short Papers
Paper Nr: 12
Title:

3-D Tennis Ball Trajectory Estimation from Multi-view Videos and Its Parameterization by Parabola Fitting

Authors:

Kenta Sohara and Yasuyuki Sugaya

Abstract: We propose a new method for estimating the 3-D trajectory of a tennis ball from multiple video sequences and for parameterizing the estimated trajectory. We extract candidate ball positions with a frame-difference technique and reconstruct their 3-D positions by two-view reconstruction for every image pair. By analyzing the distribution of the reconstructed 3-D points, we find a cluster among them and take its center as the 3-D ball position. Moreover, we fit a plane to the estimated 3-D trajectory and express it as 2-D point data, which we parameterize by fitting two parabolas. Simulation and real experiments demonstrate the efficiency of our method.
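
A minimal sketch of the parabola-fitting step, assuming the trajectory has already been projected onto the fitted plane, is shown below; the paper fits two parabolas to the 2-D data (presumably splitting the trajectory into segments), whereas this illustrative sketch fits a single one per segment.

```python
import numpy as np

def fit_parabola(points_2d):
    """Least-squares parabola y = a*x^2 + b*x + c for 2-D trajectory points.

    points_2d : Nx2 array of (x, y) positions on the fitted trajectory plane.
    Returns the coefficients (a, b, c).
    """
    x, y = points_2d[:, 0], points_2d[:, 1]
    a, b, c = np.polyfit(x, y, deg=2)
    return a, b, c
```
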
Download

Paper Nr: 27
Title:

Seg2Pose: Pose Estimations from Instance Segmentation Masks in One or Multiple Views for Traffic Applications

Authors:

Martin Ahrnbom, Ivar Persson and Mikael Nilsson

Abstract: We present a system, denoted Seg2Pose, which converts pixel-coordinate tracks, represented by instance segmentation masks across multiple video frames, into world-coordinate pose tracks for road users seen by static surveillance cameras. The road users are bound to a ground surface represented by a number of 3D points, which does not necessarily have to be perfectly flat. The system works with one or more views by using a late fusion scheme. An approximate position, denoted the normal position, is computed from the camera calibration, per-class default heights, and the ground surface model. The position is then refined by a novel convolutional neural network, denoted Seg2PoseNet, which takes instance segmentations and crop positions as its input. We evaluate the system quantitatively both on synthetic data from the CARLA simulator and on a real recording from a trinocular camera. The system outperforms the baseline of only using the normal positions, which is roughly equivalent to a typical 2D-to-3D conversion system, on both datasets.
Download

Paper Nr: 47
Title:

Motion-constrained Road User Tracking for Real-time Traffic Analysis

Authors:

Nyan B. Bo, Peter Veelaert and Wilfried Philips

Abstract: The reliability of numerous smart traffic applications is highly dependent on the accuracy of the underlying road user tracker. Demands on scalability and privacy preservation push vision-based smart traffic applications to sense and process images on edge devices and transmit only concise information to decision/fusion nodes. One of the requirements for deploying a vision algorithm on edge devices is its ability to process captured images in real time. To meet these needs, we propose a real-time road user tracker which outperforms state-of-the-art trackers. Our approach uses double thresholding on detector responses to suppress the initialization of false-positive trajectories while ensuring that the detector responses required for updating trajectories are not wrongly discarded. Furthermore, our proposed Bayes filter reduces fragmentation and merging of trajectories, which strongly affect the performance of subsequent smart traffic applications. The performance of our tracker is evaluated on real-life traffic data in a turning movement counting (TMC) application, where it achieves a high precision of 96% and recall of 95%, while the state-of-the-art tracker used for comparison achieves 92% precision and 87% recall.
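
As an illustration of the double-thresholding idea, the following sketch separates detector responses into those allowed to start new trajectories and those kept only for updating existing ones; the threshold values and the detection dictionary format are assumptions, not the paper's parameters.

```python
def split_detections(detections, t_init=0.8, t_update=0.4):
    """Double thresholding on detector confidences (illustrative values).

    Detections scoring at least t_init may start new trajectories;
    detections between t_update and t_init are kept only for updating
    already-confirmed trajectories. This suppresses false-positive track
    births without discarding the weaker responses needed to keep existing
    tracks alive.
    """
    init_dets = [d for d in detections if d["score"] >= t_init]
    update_dets = [d for d in detections if t_update <= d["score"] < t_init]
    return init_dets, update_dets
```
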
Download

Paper Nr: 63
Title:

3D Object Reconstruction using Stationary RGB Camera

Authors:

José D. S. Júnior, Gustavo R. Lima, Adam M. Pinto, João M. Lima, Veronica Teichrieb, Jonysberg P. Quintino, Fabio Q. B. da Silva, Andre M. Santos and Helder Pinho

Abstract: 3D object mapping is an important field of computer vision, applied in games, tracking, and virtual and augmented reality applications. Several techniques implement 3D reconstruction from images obtained by moving cameras. However, there are situations where it is not possible or convenient to move the acquisition device around the target object, such as when using laptop cameras. Moreover, some techniques do not achieve a good 3D reconstruction when capturing with a stationary camera, due to movement differences between the target object and its background. This work proposes two 3D object mapping pipelines based on COLMAP that work from stationary camera images to solve this type of problem. For that, we modify two background segmentation techniques and motion recognition algorithms to detect the foreground without manual intervention or prior knowledge of the target object. Both proposed pipelines were tested with a dataset obtained by a laptop's simple low-resolution stationary RGB camera. The results were evaluated with respect to background segmentation and 3D reconstruction of the target object. The proposed techniques achieve 3D reconstruction results superior to COLMAP, especially in environments with cluttered backgrounds.
Download

Paper Nr: 93
Title:

Viewpoint-independent Single-view 3D Object Reconstruction using Reinforcement Learning

Authors:

Seiya Ito, Byeongjun Ju, Naoshi Kaneko and Kazuhiko Sumi

Abstract: This paper addresses the problem of reconstructing 3D object shapes from single-view images using reinforcement learning. Reinforcement learning allows us to interpret the reconstruction process of a 3D object by visualizing sequentially selected actions. However, the conventional method used a single fixed viewpoint and was not validated with an arbitrary viewpoint. To handle images from arbitrary viewpoints, we propose a reinforcement learning framework that introduces an encoder to extract viewpoint-independent image features. We train an encoder-decoder network to disentangle shape and viewpoint features from the image. The parameters of the encoder part of the network are fixed, and the encoder is incorporated into the reinforcement learning framework as an image feature extractor. Since the encoder learns to extract viewpoint-independent features from images of arbitrary viewpoints, only images of a single viewpoint are needed for reinforcement learning. The experimental results show that the proposed method can learn faster and achieves better accuracy than the conventional method.
Download

Paper Nr: 168
Title:

LiMoSeg: Real-time Bird’s Eye View based LiDAR Motion Segmentation

Authors:

Sambit Mohapatra, Mona Hodaei, Senthil Yogamani, Stefan Milz, Heinrich Gotzig, Martin Simon, Hazem Rashed and Patrick Maeder

Abstract: Moving object detection and segmentation is an essential task in the autonomous driving pipeline. Detecting and isolating the static and moving components of a vehicle's surroundings is particularly crucial in path planning and localization tasks. This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data. We use two successive scans of LiDAR data in a 2D Bird's Eye View (BEV) representation to perform pixel-wise classification as static or moving. Furthermore, we propose a novel data augmentation technique to reduce the significant class imbalance between static and moving objects. We achieve this by artificially synthesizing moving objects by cutting and pasting static vehicles. We demonstrate a low latency of 8 ms on a commonly used automotive embedded platform, namely the Nvidia Jetson Xavier. To the best of our knowledge, this is the first work directly performing motion segmentation in LiDAR BEV space. We provide quantitative results on the challenging SemanticKITTI dataset, and qualitative results are provided at https://youtu.be/2aJ-cL8b0LI.
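
A minimal sketch of such cut-and-paste augmentation in BEV space is given below, assuming dense BEV grids with per-cell labels; the grid layout, shift values, and label encoding are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def paste_moving_vehicle(bev_prev, bev_curr, labels_curr, vehicle_mask,
                         base_shift=(20, 0), motion=(3, 0)):
    """Cut-and-paste augmentation across two successive BEV scans.

    Cells of a static vehicle (boolean HxW `vehicle_mask`) are copied to a
    new location in both scans, with an extra offset `motion` in the current
    scan, so the pasted copy appears to move between the two BEV images.
    The pasted cells of the current scan are labelled as moving (class 1).
    """
    h, w = vehicle_mask.shape
    ys, xs = np.nonzero(vehicle_mask)

    def paste(grid, dy, dx, labels=None):
        ys2, xs2 = ys + dy, xs + dx
        ok = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
        grid[ys2[ok], xs2[ok]] = bev_prev[ys[ok], xs[ok]]
        if labels is not None:
            labels[ys2[ok], xs2[ok]] = 1   # mark pasted cells as moving

    paste(bev_prev, *base_shift)                      # copy in previous scan
    paste(bev_curr, base_shift[0] + motion[0],
          base_shift[1] + motion[1], labels_curr)     # shifted copy in current scan
    return bev_prev, bev_curr, labels_curr
```
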
Download

Paper Nr: 196
Title:

Evaluating the Impact of Head Motion on Monocular Visual Odometry with Synthetic Data

Authors:

Charles Hamesse, Hiep Luong and Rob Haelterman

Abstract: Monocular visual odometry is a core component of visual Simultaneous Localization and Mapping (SLAM). Nowadays, headsets with a forward-pointing camera abound for a wide range of use cases such as extreme sports, firefighting or military interventions. Many of these headsets do not feature additional sensors such as a stereo camera or an IMU, thus evaluating the accuracy and robustness of monocular odometry remains critical. In this paper, we develop a novel framework for procedural synthetic dataset generation and a dedicated motion model for headset-mounted cameras. With our method, we study the performance of the leading classes of monocular visual odometry algorithms, namely feature-based, direct and deep learning-based methods. Our experiments lead to the following conclusions: i) the performance deterioration on headset-mounted camera images is mostly caused by head rotations and not by translations caused by human walking style, ii) feature-based methods are more robust to fast head rotations compared to direct and deep learning-based methods, and iii) it is crucial to develop uncertainty metrics for deep learning-based odometry algorithms.
Download

Paper Nr: 205
Title:

An Occlusion Aware Five-view Stereo System and Its Application in Video Post-production

Authors:

Changan Zhu, Taryn Laurendeau and Chris Joslin

Abstract: Extracting live-action elements from videos is a time-consuming process in the post-production pipeline. The disparity map, however, shows the ordering of elements in a scene by indicating the distance between each element and the camera, and could therefore become an effective tool for separating videos into ordered layers while preserving the 3D structure of the elements. In this research, we explore the possibility of simplifying live-action video element extraction using disparity sequences. We developed a five-view disparity estimation and enhancement system with a two-axis setup that helps reduce occlusions in stereo vision. The system is independent of temporal reconstruction and is hence compatible with both dynamic and stationary camera paths. Our results show that the disparities from our system perform visually and quantitatively better than the traditional binocular stereo method, and its element extraction results are comparable with existing mature matting techniques in most cases. Ideally, the system design could be applied in cinematography by replacing the center camera with a cinematographic camera, and the output can be used for video object extraction, visual effects composition, 2D-to-3D video conversion, or producing training data for neural-network-based depth estimation research.
Download

Paper Nr: 221
Title:

Human Pose Estimation through a Novel Multi-view Scheme

Authors:

Jorge L. Charco, Angel D. Sappa and Boris X. Vintimilla

Abstract: This paper presents a multi-view scheme to tackle the challenging problem of self-occlusion in human pose estimation. The proposed approach first obtains the human body joints from a set of images captured from different views at the same time. It then enhances the obtained joints using a multi-view scheme: the joints from a given view are used to enhance poorly estimated joints from another view, especially to tackle self-occlusion cases. A network architecture initially proposed for the monocular case is adapted for use in the proposed multi-view scheme. Experimental results and comparisons with state-of-the-art approaches on the Human3.6M dataset are presented, showing improvements in the accuracy of body joint estimation.
Download

Paper Nr: 231
Title:

Visual Tilt Correction for Vehicle-mounted Cameras

Authors:

Firas Kastantin and Levente Hajder

Abstract: Assuming planar motion for vehicle-mounted cameras is very beneficial, as the visual estimation problems are simplified. Planar algorithms assume that the image plane is parallel to the gravity direction. This paper proposes two methods to correct camera images when this parallelism constraint does not hold. The first exploits the fact that the ground is visible in the camera images, so its normal gives the gravity vector; the second method uses vanishing points to estimate the relative correction homography. The accuracy of the novel methods is tested on both synthetic and real data. It is demonstrated on real images from a vehicle-mounted camera that the methods work well in practice.
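
As a sketch of what such a correction step can look like, assuming the vertical direction has been estimated (e.g. from the ground-plane normal), the image can be warped with the pure-rotation homography H = K R K⁻¹. The coordinate conventions, the target "up" axis, and the helper names below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import cv2

def rotation_between(a, b):
    """Smallest rotation matrix taking unit vector a onto unit vector b.

    (The degenerate antiparallel case is not handled in this sketch.)
    """
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v = np.cross(a, b)
    if np.linalg.norm(v) < 1e-12:
        return np.eye(3)
    angle = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    rvec = v / np.linalg.norm(v) * angle
    return cv2.Rodrigues(rvec)[0]

def correct_tilt(image, K, gravity_cam, up_cam=(0.0, -1.0, 0.0)):
    """Virtual-rotation tilt correction: warp the image with H = K R K^-1.

    gravity_cam : estimated vertical direction in camera coordinates,
                  e.g. derived from the ground-plane normal.
    up_cam      : where that direction should point so the image plane
                  becomes parallel to gravity (assumed camera convention).
    """
    R = rotation_between(np.asarray(gravity_cam, float),
                         np.asarray(up_cam, float))
    H = K @ R @ np.linalg.inv(K)
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))
```
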
Download

Paper Nr: 244
Title:

Semi-supervised Anomaly Detection for Weakly-annotated Videos

Authors:

Khaled El-Tahan and Marwan Torki

Abstract: One of the significant challenges in surveillance anomaly detection research is the scarcity of surveillance datasets satisfying specific ethical and logistical requirements during the collection process. Weakly supervised models aim to address these challenges by only weakly annotating surveillance videos and devising sophisticated learning techniques to optimize these models, such as Multiple Instance Learning (MIL), which maximizes the boundary between the most anomalous video clip and the least normal (false alarm) video clip using a ranking loss. However, maximizing the boundary does not necessarily assign each clip its correct class. We propose a semi-supervision technique that creates pseudo labels for each correct class. We also investigate different video recognition models for better feature representation. We evaluate our work on the UCF-Crime (weakly supervised) dataset and show that it outperforms almost all other approaches while only using the same simple baseline (a multilayer perceptron neural network). Moreover, we incorporate different evaluation metrics to show that our solution not only increased the AUC but also drastically increased the top-1 accuracy.
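
For context, the MIL ranking loss that the proposed pseudo-labelling builds on can be sketched as follows, with clip-level anomaly scores and one bag per video; the margin value and tensor shapes are illustrative assumptions.

```python
import torch

def mil_ranking_loss(scores_anomalous, scores_normal, margin=1.0):
    """MIL ranking loss over clip-level anomaly scores (one bag per video).

    scores_anomalous : (B, T) scores for clips of weakly labelled anomalous videos.
    scores_normal    : (B, T) scores for clips of normal videos.
    The max-scoring clip of an anomalous video should exceed the max-scoring
    clip of a normal video by at least `margin` (hinge formulation).
    """
    max_anom = scores_anomalous.max(dim=1).values
    max_norm = scores_normal.max(dim=1).values
    return torch.clamp(margin - max_anom + max_norm, min=0).mean()
```
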
Download

Paper Nr: 88
Title:

Error Evaluation of Semantic VSLAM Algorithms for Smart Farming

Authors:

Adam Kalisz, Mingjun Sun, Jonas Gedschold, Tim E. Wegner, Giovanni D. Galdo and Jörn Thielecke

Abstract: In recent years, crop monitoring and plant phenotyping have become increasingly important tools for improving farming efficiency and crop quality. In the field of smart farming, the combination of high-precision cameras and Visual Simultaneous Localization And Mapping (SLAM) algorithms can automate the entire process from planting to picking. In this work, we systematically analyze trajectory accuracy errors on a watermelon field created in a virtual environment for smart farming applications, and discuss the quality of the 3D mapping from an optical point of view. Using an ad-hoc synthetic data set, we discuss and compare the influencing factors with respect to the performance and drawbacks of current state-of-the-art system architectures. We summarize the contributions of our work as follows: (1) We extend ORB-SLAM2 with semantic input, which we name SI-VSLAM in the following. (2) We evaluate the proposed system using real and synthetic data sets with modelled sensor non-idealities. (3) We provide an extensive analysis of the error behaviour on a virtual watermelon field, which can be both static and dynamic, as an example of a real use case of the system.
Download

Paper Nr: 139
Title:

Generalizable Online 3D Pedestrian Tracking with Multiple Cameras

Authors:

Victor Lyra, Isabella de Andrade, João P. Lima, Rafael Roberto, Lucas Figueiredo, João M. Teixeira, Diego Thomas, Hideaki Uchiyama and Veronica Teichrieb

Abstract: 3D pedestrian tracking using multiple cameras is still a challenging task with many applications, such as surveillance, behavioral analysis, and statistical analysis. Many of the existing tracking solutions involve training the algorithms on the target environment, which requires extensive time and effort. We propose an online 3D pedestrian tracking method for multi-camera environments based on a generalizable detection solution that does not require training with data of the target scene. We establish temporal relationships between people detected in different frames by using a combination of a graph matching algorithm and a Kalman filter. Our proposed method obtained a MOTA of 77.1% and a MOTP of 96.4% on the test split of the public WILDTRACK dataset. These results correspond to improvements of approximately 3.4% and 22.2%, respectively, over the best existing online technique. Our experiments also demonstrate the advantages of using appearance information to improve tracking performance.
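
A minimal sketch of the frame-to-frame association step, realising the graph matching with the Hungarian algorithm over Kalman-predicted track positions, is shown below; the gating distance and the use of `scipy.optimize.linear_sum_assignment` are assumptions about one possible implementation, not the authors' exact method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_positions, detections, max_dist=1.0):
    """Assign current 3D detections to predicted track positions.

    track_positions : (N, 3) Kalman-predicted positions of existing tracks.
    detections      : (M, 3) detected 3D pedestrian positions in this frame.
    Returns a list of (track_idx, detection_idx) pairs within max_dist metres.
    """
    # Pairwise Euclidean distances form the assignment cost matrix.
    cost = np.linalg.norm(track_positions[:, None, :] -
                          detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```
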
Download