VISAPP 2019 Abstracts


Area 1 - Image Formation and Preprocessing

Full Papers
Paper Nr: 20
Title:

Avoiding Glare by Controlling the Transmittance of Incident Light

Authors:

Takeharu Kondo, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we introduce a new method for enhancing the visibility of human vision. In particular, we propose a method for avoiding glare caused by strong incident light, such as sunlight and the headlights of oncoming vehicles, in driving situations. Our method controls the transmittance of incident light pixel by pixel according to the power of the incident light. For computing the transmittance of light efficiently from camera images, we propose a learning-based method utilizing a generative adversarial network (GAN). By using our method, the visibility of drivers can be improved drastically, and objects in dark places become visible even under strong backlight, such as sunlight and the headlights of oncoming vehicles.
Download

Paper Nr: 43
Title:

Visibility Estimation in Point Clouds with Variable Density

Authors:

P. Biasutti, A. Bugeau, J-F. Aujol and M. Brédif

Abstract: Estimating visibility in point clouds has many applications such as visualization, surface reconstruction and scene analysis through fusion of LiDAR point clouds and images. However, most current works rely on methods that require strong assumptions on the point cloud density, which are not valid for LiDAR point clouds acquired from mobile mapping systems, leading to low-quality visibility estimates. This work presents a novel approach for estimating the visibility of a point cloud from a viewpoint. The method is fully automatic and makes no assumption on the point cloud density. The visibility of each point is estimated by considering its screen-space neighborhood from the given viewpoint. Our results show that our approach better estimates visibility on real-world data acquired using LiDAR scanners. We evaluate our approach by comparing its results to a new manually annotated dataset, which we make available online.
Download

Paper Nr: 201
Title:

Revisiting Gray Pixel for Statistical Illumination Estimation

Authors:

Yanlin Qian, Said Pertuz, Jarno Nikkanen, Joni-Kristian Kämäräinen and Jiri Matas

Abstract: We present a statistical color constancy method that relies on novel gray pixel detection and mean shift clustering. The method, called Mean Shifted Grey Pixel – MSGP, is based on the observation that true-gray pixels are aligned along one single direction. Our solution is compact, easy to compute and requires no training. Experiments on two real-world benchmarks show that the proposed approach outperforms state-of-the-art methods in the camera-agnostic scenario. In the setting where the camera is known, MSGP outperforms all statistical methods.
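To make the gray-pixel-plus-clustering idea concrete, here is a minimal illustrative sketch, not the authors' exact MSGP formulation: the grayness score, mean shift bandwidth and selection fraction below are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift

def estimate_illuminant(img, gray_fraction=0.01):
    """Toy gray-pixel illumination estimation: pick pixels whose channels
    are most nearly proportional (candidate gray pixels), cluster their
    chromaticities with mean shift, and read the illuminant off the
    dominant cluster. img: HxWx3 float RGB."""
    rgb = img.reshape(-1, 3) + 1e-6
    # Grayness score: spread of the log channels; a small spread suggests
    # an achromatic surface whose color is the illuminant itself.
    grayness = np.log(rgb).std(axis=1)
    n = max(1, int(gray_fraction * rgb.shape[0]))
    candidates = rgb[np.argsort(grayness)[:n]]
    # Cluster candidate chromaticities; the largest cluster approximates
    # the single direction along which true-gray pixels align.
    chroma = candidates / candidates.sum(axis=1, keepdims=True)
    ms = MeanShift(bandwidth=0.02).fit(chroma)
    labels, counts = np.unique(ms.labels_, return_counts=True)
    dominant = ms.cluster_centers_[labels[np.argmax(counts)]]
    return dominant / np.linalg.norm(dominant)
```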
Download

Paper Nr: 264
Title:

Deep Neural Network for Fuzzy Automatic Melanoma Diagnosis

Authors:

Wiem Abbes and Dorra Sellami

Abstract: Melanoma is the most serious type of skin cancer. In this paper, we consider diagnosing melanoma based on skin lesion images obtained by common optical cameras. Given the lower quality of such images, we must cope with the imprecision of the image data. This paper proposes a CAD system for decision making about skin lesion severity. We first define a fuzzy modeling of the Bag-of-Words (BoW) of the lesion. Features are extracted from the skin lesion image according to four criteria inspired by the ABCD rule (Asymmetry, Border, Color, and Differential structures). Based on Fuzzy C-Means (FCM), membership degrees are determined for each BoW. Then, a deep neural network classifier is used for decision making. Based on a public database of 206 lesion images, experimental results demonstrate that the fuzzification of feature modeling yields good results in terms of sensitivity (90.1%) and accuracy (87.5%). A comparative study illustrates that our approach offers the best accuracy and sensitivity.
Download

Short Papers
Paper Nr: 19
Title:

Showing Different Images Simultaneously by using Chromatic Temporal Response in Human Vision

Authors:

Hiroki Yamada, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose a novel method for showing different images to multiple observers simultaneously by using the differences in their chromatic and temporal retinal responses. The chromatic and temporal response characteristics of the human retina vary between individuals, and thus each observer perceives a slightly different image, even if the same image is presented to all of them. In this paper, we formalize the chromatic and temporal relationship between the input stimulus and the response in human vision, and propose a method for showing arbitrary different images to individual observers simultaneously by using the differences in chromatic temporal response characteristics. We also show a method for obtaining the chromatic and temporal response of human vision. Experimental results from a special camera which reproduces the impulse response of the human retina show that our proposed method can present different, arbitrary images to multiple observers.
Download

Paper Nr: 40
Title:

Coarse-to-Fine Clothing Image Generation with Progressively Constructed Conditional GAN

Authors:

Youngki Kwon, Soomin Kim, Donggeun Yoo and Sung-Eui Yoon

Abstract: Clothing image generation is the task of generating clothing product images from input fashion images of dressed people. Results of existing GAN-based methods often contain visual artifacts and suffer from a global consistency issue. To solve this, we split the difficult single-stage image generation process into multiple relatively easy stages. We thus propose a coarse-to-fine strategy for the image-conditional image generation model, with a multi-stage network training method called rough-to-detail training. We incrementally add a decoder block for each stage that progressively configures an intermediate target image, to make the generator network suitable for rough-to-detail training. With this coarse-to-fine process, our model can generate images ranging from small images with rough structure to large images with fine details. To validate our model, we perform various quantitative comparisons and a human perception study on the LookBook dataset. Compared to other conditional GAN methods, our model creates visually pleasing 256 × 256 clothing images, while keeping the global structure and preserving the details of target images.
Download

Paper Nr: 45
Title:

Relative Pose Improvement of Sphere based RGB-D Calibration

Authors:

David T. Boas, Sergii Poltaretskyi, Jean-Yves Ramel, Jean Chaoui, Julien Berhouet and Mohamed Slimane

Abstract: RGB-Depth calibration refers to the estimation of both RGB and Depth camera parameters, as well as their relative pose. This step is critical to align streams correctly. However, in the literature there is still no general method for accurate RGB-D calibration. Recently, promising methods proposed to use spheres to perform the calibration, the centers of these objects being well distinguishable by both cameras. This paper proposes a new minimization function which constrains sphere center positions by requiring knowledge of the sphere radius and a previously calibrated RGB camera. We show the limits of previous approaches and their correction with the proposed method. Results demonstrate an improvement in relative pose estimation compared to the original method on the selected datasets.
Download

Paper Nr: 66
Title:

STaDA: Style Transfer as Data Augmentation

Authors:

Xu Zheng, Tejo Chalasani, Koustav Ghosal, Sebastian Lutz and Aljosa Smolic

Abstract: The success of training deep Convolutional Neural Networks (CNNs) heavily depends on a significant amount of labelled data. Recent research has found that neural style transfer algorithms can apply the artistic style of one image to another image without changing the latter’s high-level semantic content, which makes it feasible to employ neural style transfer as a data augmentation method to add more variation to the training dataset. The contribution of this paper is a thorough evaluation of the effectiveness of neural style transfer as a data augmentation method for image classification tasks. We explore state-of-the-art neural style transfer algorithms and apply them as a data augmentation method on the Caltech 101 and Caltech 256 datasets, where we observe around a 2% improvement in image classification accuracy with VGG16 (from 83% to 85%), compared with traditional data augmentation strategies. We also combine this new method with conventional data augmentation approaches to further improve the performance of image classification. This work shows the potential of neural style transfer in the computer vision field, such as helping to reduce the difficulty of collecting sufficient labelled data and improving the performance of generic image-based deep learning algorithms.
Download

Paper Nr: 73
Title:

DCT based Multi Exposure Image Fusion

Authors:

O. Martorell, C. Sbert and A. Buades

Abstract: We propose a novel algorithm for multi-exposure fusion (MEF). The algorithm decomposes image patches with the DCT transform and combines coefficients from patches with different exposures. The luminance and chrominance of the different images are fused separately. Details of the fused image are finally enhanced as a post-processing step. Experiments with several data sets show that the proposed algorithm performs better than the state of the art.
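The patch-wise DCT fusion step can be sketched as follows for the luminance channel. The combination rule (maximum-magnitude AC coefficients, averaged DC) and the non-overlapping 8 × 8 patches are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np
from scipy.fft import dctn, idctn

def fuse_luminance(lums, p=8):
    """Fuse co-registered luminance channels of a multi-exposure stack by
    combining per-patch DCT coefficients. lums: list of 2D float arrays."""
    h, w = lums[0].shape
    h, w = h - h % p, w - w % p          # crop to a multiple of the patch size
    fused = np.zeros((h, w))
    for y in range(0, h, p):
        for x in range(0, w, p):
            coeffs = np.stack([dctn(L[y:y+p, x:x+p], norm='ortho') for L in lums])
            # AC: keep the coefficient with the largest magnitude across
            # exposures (strongest detail); DC: average to balance brightness.
            idx = np.abs(coeffs).argmax(axis=0)
            best = np.take_along_axis(coeffs, idx[None], axis=0)[0]
            best[0, 0] = coeffs[:, 0, 0].mean()
            fused[y:y+p, x:x+p] = idctn(best, norm='ortho')
    return fused
```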
Download

Paper Nr: 84
Title:

Learning to Remove Rain in Traffic Surveillance by using Synthetic Data

Authors:

Chris H. Bahnsen, David Vázquez, Antonio M. López and Thomas B. Moeslund

Abstract: Rainfall is a problem in automated traffic surveillance. Rain streaks occlude the road users and degrade the overall visibility, which in turn decreases object detection performance. One way of alleviating this is by artificially removing the rain from the images. This requires knowledge of corresponding rainy and rain-free images. Such images are often produced by overlaying synthetic rain on top of rain-free images. However, this method fails to incorporate the fact that rain falls in the entire three-dimensional volume of the scene. To overcome this, we introduce training data from the SYNTHIA virtual world that models rain streaks in the entirety of a scene. We train a conditional Generative Adversarial Network for rain removal and apply it on traffic surveillance images from SYNTHIA and the AAU RainSnow datasets. To measure the applicability of the rain-removed images in a traffic surveillance context, we run the YOLOv2 object detection algorithm on the original and rain-removed frames. The results on SYNTHIA show an 8% increase in detection accuracy compared to the original rain image. Interestingly, we find that high PSNR or SSIM scores do not imply good object detection performance.
Download

Paper Nr: 252
Title:

Improvement of Range Resolution of FDMAS Beamforming in Ultrasound Imaging

Authors:

Ryoya Kozai, Jing Zhu, Kan Okubo and Norio Tagawa

Abstract: Ultrasound imaging is applied in various fields because it is noninvasive and capable of real-time imaging. However, in diagnostic applications, ultrasound imaging is inferior in resolution to other modalities, so research on improving resolution has been actively conducted. Recently, research on beamforming methods has advanced with the aim of improving lateral resolution. In order to form a narrower beam, adaptive beamformers such as the MV (minimum variance) beamformer, which adaptively changes the beamforming weights, have been proposed, but these methods increase the computational complexity. Therefore, in recent years, the FDMAS (Filtered-Delay Multiply And Sum) beamformer, which can achieve high resolution and high contrast without complicated calculation, has attracted attention. On the other hand, we previously proposed a method called the SCM (Super resolution FM-Chirp correlation Method) that improves range resolution based on frequency sweep. In addition, we proposed a new DAS (Delay And Sum) beamformer which improves range resolution by multiplying the echo signal by the result of the SCM before DAS processing. This method is constructed in the usual RF (Radio Frequency) band. In this study, we first reformulate the FDMAS as baseband processing in order to improve the SNR, and then apply the SCM result to the FDMAS in order to improve both range and lateral resolution.
Download

Paper Nr: 9
Title:

Fast In-the-Wild Hair Segmentation and Color Classification

Authors:

Tudor A. Ileni, Diana L. Borza and Adrian S. Darabant

Abstract: In this paper we address the problem of hair segmentation and hair color classification in facial images using a machine learning approach based on both convolutional neural networks and classical neural networks. Hair, with its color shades, shape and length, represents an important feature of the human face and is used in domains like biometrics, visagisme (the art of aesthetically matching fashion and medical accessories to the face region), hair styling, fashion, etc. We propose a deep learning method for accurate and fast hair segmentation followed by a histogram-feature-based classification of the obtained hair region into five color classes. We developed a hair and face annotation tool to enrich the training data. The proposed solutions are trained on publicly available databases and on our own annotated databases. The proposed method attained a hair segmentation accuracy of 91.61% and a hair color classification accuracy of 89.6%.
Download

Paper Nr: 26
Title:

Multi-layer Extreme Learning Machine-based Autoencoder for Hyperspectral Image Classification

Authors:

Muhammad Ahmad, Adil M. Khan, Manuel Mazzara and Salvatore Distefano

Abstract: Hyperspectral imaging (HSI) has attracted considerable interest from the scientific community and has been applied to an increasing number of real-life applications to automatically extract meaningful information from the corresponding high-dimensional datasets. However, traditional autoencoders (AE) and restricted Boltzmann machines are computationally expensive and do not perform well due to the Hughes phenomenon, which is observed in HSI since the ratio of labeled training pixels to the number of bands is usually quite small. To overcome such problems, this paper exploits a multi-layer extreme learning machine-based autoencoder (MLELM-AE) for HSI classification. MLELM-AE learns feature representations by adopting singular value decomposition, which serves as the basic building block of the architecture. The MLELM-AE method not only maintains the fast speed of the traditional ELM but also greatly improves the performance of HSI classification. The experimental results demonstrate the effectiveness of MLELM-AE on several well-known HSI datasets.
Download

Paper Nr: 60
Title:

Bilateral Random Projection based High-speed Face and Expression Recognition Method

Authors:

Jieun Lee, Miran Heo and Yoonsik Choe

Abstract: The face and expression recognition problem can be converted into the superposition of a low-rank matrix and a sparse error matrix, which has the merits of robustness to occlusion and disguise. The low-rank matrix manifests the neutral facial image and the sparse matrix captures the emotional expression with respect to the whole image. To separate these matrices, the problem is formulated as minimizing the nuclear norm and the L1 norm, which can be solved using a closed-form proximal operator called Singular Value Thresholding (SVT). However, this conventional approach has high computational complexity since it requires computing the singular value decomposition of a large matrix at each iteration. In this paper, to reduce this computational burden, a fast approximation method for SVT is proposed, utilizing a suitable low-rank matrix approximation involving random projection. Being associated with sampling, the low-rank matrix is modeled as bilateral factorized matrices, which are updated in a greedy manner. Experiments are conducted on different publicly available datasets for face and expression recognition. The proposed algorithm improves recognition accuracy while further speeding up the low-rank matrix approximation, compared to conventional SVT-based approximation methods. The best recognition accuracy score of 98.1% on the JAFFE database is acquired with our method, about 55 times faster than the SVD-based method.
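For reference, the exact SVT proximal step that the paper accelerates is shown below, together with a generic randomized low-rank approximation that illustrates the random-projection idea; the paper's bilateral greedy updates differ in detail.

```python
import numpy as np

def singular_value_thresholding(X, tau):
    """Closed-form proximal operator of the nuclear norm: shrink the
    singular values of X by tau. Each application requires a full SVD,
    which is the per-iteration bottleneck the paper targets."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

def randomized_low_rank(X, r, n_iter=2, seed=0):
    """Generic randomized rank-r approximation (a stand-in for the
    bilateral random projection used to avoid the full SVD)."""
    rng = np.random.default_rng(seed)
    Y = X @ rng.standard_normal((X.shape[1], r))
    for _ in range(n_iter):              # power iterations sharpen the basis
        Y = X @ (X.T @ Y)
    Q, _ = np.linalg.qr(Y)
    return Q @ (Q.T @ X)
```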
Download

Paper Nr: 95
Title:

Local LUT Upsampling for Acceleration of Edge-preserving Filtering

Authors:

Hiroshi Tajima, Teppei Tsubokawa, Yoshihiro Maeda and Norishige Fukushima

Abstract: Edge-preserving filters are used in various image processing applications. As the number of pixels of digital cameras has been increasing, the computational cost becomes higher, since the order of these filters depends on the image size. There are several acceleration approaches for edge-preserving filtering; however, most approaches reduce the dependency of the processing time on the filtering kernel size. In this paper, we propose a method to accelerate edge-preserving filters for high-resolution images. The method subsamples an input image and then performs the edge-preserving filtering on the subsampled image. Our method then upsamples the subsampled image under the guidance of the high-resolution input image. For this upsampling, we generate per-pixel LUTs for high-precision upsampling. Experimental results show that the proposed method achieves higher performance than conventional approaches.
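The following heavily simplified sketch conveys the subsample, filter, LUT-upsample pipeline. It uses one LUT per tile instead of the paper's per-pixel LUTs, and a bilateral filter as a stand-in edge-preserving filter, so it should be read as an illustration only.

```python
import numpy as np
import cv2

def lut_upsample_sketch(src, scale=4, tile=32, bins=256):
    """Filter a subsampled image, then transfer the filter's intensity
    mapping to full resolution via per-tile LUTs. src: uint8 grayscale."""
    h, w = src.shape
    low = cv2.resize(src, (w // scale, h // scale), interpolation=cv2.INTER_AREA)
    low_filtered = cv2.bilateralFilter(low, 9, 40, 7)  # stand-in edge-preserving filter
    out = np.empty_like(src)
    t = tile // scale                                  # matching tile size at low resolution
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            src_t = low[y // scale:y // scale + t, x // scale:x // scale + t].ravel()
            dst_t = low_filtered[y // scale:y // scale + t, x // scale:x // scale + t].ravel()
            # LUT: filtered output as a function of input intensity,
            # with unobserved intensities filled by interpolation.
            lut = np.interp(np.arange(bins), np.sort(src_t),
                            dst_t[np.argsort(src_t)].astype(float))
            out[y:y+tile, x:x+tile] = lut[src[y:y+tile, x:x+tile]].astype(src.dtype)
    return out
```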
Download

Paper Nr: 128
Title:

Facial Image Generation by Generative Adversarial Networks using Weighted Conditions

Authors:

Hiroki Adachi, Hiroshi Fukui, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Conditional GANs (CGANs) are deep generative models that can generate images meeting given conditions. However, if a network has a deep architecture, the conditions do not provide enough information, so unnatural images are generated. In this paper, we propose a facial image generation method that introduces weighted conditions into CGANs. Weighted condition vectors are input to each layer of the generator, and the discriminator is extended to multiple tasks so as to recognize the input conditions. This approach reflects the conditions step by step at every layer of the generator, fulfills the input conditions, and generates high-quality images. We demonstrate the effectiveness of our method in both subjective and objective evaluation experiments.
Download

Paper Nr: 130
Title:

More Accurate Pose Initialization with Redundant Measurements

Authors:

Ksenia Klionovska, Heike Benninghoff and Felix Huber

Abstract: This paper concerns the problem of initial pose estimation of a non-cooperative target for space applications. We propose to use a Photonic Mixer Device (PMD) sensor at close range for visual navigation in order to estimate the position and attitude of the space object. The advantage of the ranging PMD sensor is that it provides two different sources of data: depth and amplitude information of the imaged scene. In this work we make use of both and propose a follow-up initial pose improvement technique based on the amplitude images from the PMD sensor: we first calculate the pose of the target with the depth image and then correct the pose to obtain a more accurate result. The algorithm is tested on a set of images in the range of 8 to 4.9 meters. The obtained results show a clear improvement of the initial pose after correction with the proposed technique.
Download

Paper Nr: 134
Title:

Action Anticipation from Multimodal Data

Authors:

Tiziana Rotondo, Giovanni M. Farinella, Valeria Tomaselli and Sebastiano Battiato

Abstract: The idea of multi-sensor data fusion is to combine the data coming from different sensors to provide more accurate and complementary information to solve a specific task. Our goal is to build a shared representation of data coming from different domains, such as images, audio signals, heart rate, acceleration, etc., in order to anticipate the daily activities of a user wearing multimodal sensors. To this aim, we consider the Stanford-ECM Dataset, which contains synchronized data acquired with different sensors: video, acceleration and heart rate signals. The dataset is adapted to our action prediction task by identifying the transitions from the generic “Unknown” class to a specific “Activity”. We discuss and compare a Siamese Network with a Multi Layer Perceptron and a 1D CNN, where the input is an unknown observation and the output is the next activity to be observed. The feature representations obtained with the considered deep architectures are classified with SVM or KNN classifiers. Experimental results point out that prediction from multimodal data is a feasible task, suggesting that multimodality improves both classification and prediction. Nevertheless, the task of reliably predicting next actions is still open and requires more investigation, as well as the availability of multimodal datasets specifically built for prediction purposes.
Download

Paper Nr: 143
Title:

Calibration of Two 3D Sensors with Perpendicular Scanning Directions by using a Piece of Paper

Authors:

Ju-Hwan Lee and Soon-Yong Park

Abstract: It is difficult to find the 3D transformation between two 3D sensors when their scanning directions form a very large angle, for example 90 degrees or more. In this paper, we propose a very simple and efficient calibration method to obtain the 3D transformation between two 3D sensors using a piece of white paper folded into a quadrangular pyramid shape. The paper calibration object is placed on a transparent acrylic board so that its shape can be captured by two 3D sensors whose scanning directions differ by about 90 degrees. The performance of the proposed calibration method is verified through 3D model reconstruction experiments. The calibration error between the two sensors is less than 0.5 mm.
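Once the pyramid target yields corresponding 3D points in both sensor frames, the rigid sensor-to-sensor transformation can be recovered with the standard Kabsch/Umeyama least-squares fit, sketched below; the paper's exact estimation procedure may differ.

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rotation R and translation t with R @ P[i] + t ≈ Q[i]
    (Kabsch). P, Q: Nx3 arrays of corresponding points, e.g. points on the
    folded-paper target seen by the two sensors."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```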
Download

Paper Nr: 159
Title:

Evaluation of Transfer Learning Techniques for Classification and Localization of Marine Animals

Authors:

Parmeet Singh and Mae Seto

Abstract: The objective is to evaluate methods for simultaneous classification and localization towards a better size estimate of marine animals in still images. Marine animals in such images vary in orientation and size, and it is challenging to create a bounding box that predicts the shape of the object. We compare axis-aligned and rotatable bounding box techniques for size estimation.
Download

Paper Nr: 162
Title:

GPU Accelerated Sparse Representation of Light Fields

Authors:

Gabriel Baravdish, Ehsan Miandji and Jonas Unger

Abstract: We present a method for GPU-accelerated compression of light fields using a dictionary learning framework. The large amount of data produced by capturing light fields is challenging to compress, and we seek to accelerate the encoding routine with GPGPU computations. We compress the data by projecting each data point onto a set of trained multi-dimensional dictionaries and seeking the sparsest representation with the least error. This is done by a parallelization of the tensor-matrix product computed on the GPU. An optimized greedy algorithm suited to GPU computation is also presented. The encoding of the data is done segmentally in parallel for faster computation while maintaining quality. The results show an order-of-magnitude faster encoding time compared to previous results in the same research field. We conclude that further speed improvements are possible, bringing the method not far from interactive compression speed.
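The greedy sparse-coding step being parallelized can be illustrated by its sequential counterpart, a plain matching pursuit over a column-normalized dictionary; the paper's multi-dimensional dictionaries and GPU batching are not reproduced here.

```python
import numpy as np

def matching_pursuit(x, D, sparsity):
    """Greedy sparse coding of signal x over dictionary D (columns assumed
    unit-norm): returns coefficients a with at most `sparsity` non-zeros
    such that D @ a approximates x with small residual error."""
    a = np.zeros(D.shape[1])
    residual = x.astype(float).copy()
    for _ in range(sparsity):
        k = np.argmax(np.abs(D.T @ residual))   # best-matching atom
        c = D[:, k] @ residual                   # its optimal coefficient
        a[k] += c
        residual -= c * D[:, k]
    return a
```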
Download

Paper Nr: 164
Title:

Color Beaver: Bounding Illumination Estimations for Higher Accuracy

Authors:

Karlo Koščević, Nikola Banić and Sven Lončarić

Abstract: The image processing pipeline of most contemporary digital cameras performs illumination estimation in order to remove the influence of illumination on image scene colors. In this paper an experiment is described that examines some of the basic properties of illumination estimation methods for several Canon camera models. Based on the obtained observations, an extension to any illumination estimation method is proposed that under certain conditions alters the results of the underlying method. It is shown that with statistics-based methods as underlying methods the proposed extension can outperform the camera's own illumination estimation in terms of accuracy. This effectively demonstrates that statistics-based methods can still be successfully used for illumination estimation in digital cameras. The experimental results are presented and discussed. The source code is available at https://ipg.fer.hr/ipg/resources/color_constancy.
Download

Paper Nr: 166
Title:

Blue Shift Assumption: Improving Illumination Estimation Accuracy for Single Image from Unknown Source

Authors:

Nikola Banić and Sven Lončarić

Abstract: Color constancy methods for removing the influence of illumination on object colors are divided into statistics-based and learning-based ones. The latter have low illumination estimation error, but only on images taken with the same sensor and in similar conditions as the ones used during training. For an image taken with an unknown sensor, a statistics-based method will often give higher accuracy than an untrained or specifically trained learning-based method because of its simpler assumptions not bounded to any specific sensor. The accuracy of a statistics-based method also depends on its parameter values, but for an image from an unknown source these values can be tuned only blindly. In this paper the blue shift assumption is proposed, which acts as a heuristic for choosing the optimal parameter values in such cases. It is based on real-world illumination statistics coupled with the results of a subjective user study and its application outperforms blind tuning in terms of accuracy. The source code is available at http://www.fer.unizg.hr/ipg/resources/color_constancy/.
Download

Area 2 - Image and Video Analysis

Full Papers
Paper Nr: 13
Title:

Superpixel-wise Assessment of Building Damage from Aerial Images

Authors:

Lukas Lucks, Dimitri Bulatov, Ulrich Thönnessen and Melanie Böge

Abstract: Surveying buildings that are damaged by natural disasters, in particular, assessment of roof damage, is challenging, and it is costly to hire loss adjusters to complete the task. Thus, to make this process more feasible, we developed an automated approach for assessing roof damage from post-loss close-range aerial images and roof outlines. The original roof area is first delineated by aligning freely available building outlines. In the next step, each roof area is decomposed into superpixels that meet conditional segmentation criteria. Then, 52 spectral and textural features are extracted to classify each superpixel as damaged or undamaged using a Random Forest algorithm. In this way, the degree of roof damage can be evaluated and the damage grade can be computed automatically. The proposed approach was evaluated in trials with two datasets that differed significantly in terms of the architecture and degree of damage. With both datasets, an assessment accuracy of about 90% was attained on the superpixel level for roughly 800 buildings.
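A minimal sketch of the superpixel classification stage, assuming SLIC superpixels and simple per-superpixel color statistics in place of the paper's 52 spectral and textural features:

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.ensemble import RandomForestClassifier

def superpixel_features(img, segments):
    """Per-superpixel color statistics (mean/std per channel), a small
    stand-in for the paper's 52 spectral and textural features."""
    feats = []
    for label in np.unique(segments):
        pixels = img[segments == label]          # all pixels of one superpixel
        feats.append(np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)]))
    return np.array(feats)

# Illustrative usage on an annotated roof image (y: 0=undamaged, 1=damaged):
# segments = slic(roof_img, n_segments=400, compactness=10)
# X = superpixel_features(roof_img, segments)
# clf = RandomForestClassifier(n_estimators=200).fit(X, y)
# damage_grade = clf.predict(X_new).mean()   # fraction of damaged superpixels
```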
Download

Paper Nr: 14
Title:

TempSeg-GAN: Segmenting Objects in Videos Adversarially using Temporal Information

Authors:

Saptakatha Adak and Sukhendu Das

Abstract: This paper studies the problem of Video Object Segmentation, which aims at segmenting objects of interest throughout entire videos when provided with an initial ground truth annotation. Although a variety of works in this field utilize Convolutional Neural Networks (CNNs), adversarial training techniques have not been used despite their effectiveness as a holistic approach. Our proposed architecture consists of a Generative Adversarial framework for foreground object segmentation in videos, coupled with Intersection-over-Union and temporal-information-based loss functions for training the network. The main contribution of the paper lies in the formulation of two novel loss functions: (i) Inter-frame Temporal Symmetric Difference Loss (ITSDL) and (ii) Intra-frame Temporal Loss (IFTL), which not only enhance the segmentation quality of the predicted mask but also maintain the temporal consistency between subsequent generated frames. Our end-to-end trainable network exhibits impressive performance gains compared to the state-of-the-art model when evaluated on three popular real-world Video Object Segmentation datasets, viz. DAVIS 2016, SegTrack-v2 and YouTube-Objects.
Download

Paper Nr: 31
Title:

One Shot Learning for Generic Instance Segmentation in RGBD Videos

Authors:

Xiao Lin, Josep R. Casas and Montse Pardàs

Abstract: Hand-crafted features employed in classical generic instance segmentation methods have limited discriminative power to distinguish different objects in the scene, while Convolutional Neural Network (CNN) based semantic segmentation is restricted to predefined semantics and is not aware of object instances. In this paper, we combine the advantages of the two methodologies and apply the combined approach to solve a generic instance segmentation problem in RGBD video sequences. In practice, a classical generic instance segmentation method is employed to initially detect object instances and build temporal correspondences, whereas instance models are trained on the few detected instance samples via CNNs to generate robust features for instance segmentation. We exploit the idea of one shot learning to deal with the small training sample size problem when training CNNs. Experimental results illustrate the promising performance of the proposed approach.
Download

Paper Nr: 87
Title:

Efficient Bark Recognition in the Wild

Authors:

Rémi Ratajczak, Sarah Bertrand, Carlos Crispim-Junior and Laure Tougne

Abstract: In this study, we propose to address the difficult task of bark recognition in the wild using computationally efficient and compact feature vectors. We introduce two novel generic methods to significantly reduce the dimensions of existing texture and color histograms with little loss in accuracy. Specifically, we propose a straightforward yet efficient way to compute Late Statistics from texture histograms and an approach to iteratively quantize the color space based on domain priors. We further combine the reduced histograms in a late fusion manner to benefit from both texture and color cues. Results outperform state-of-the-art methods by a large margin on four public datasets respectively composed of 6 bark classes (BarkTex, NewBarkTex), 11 bark classes (AFF) and 12 bark classes (Trunk12). In addition to these experiments, we propose a baseline study on Bark-101, a new challenging dataset including manually segmented images of 101 bark classes that we release publicly.
Download

Paper Nr: 101
Title:

Semantic Image Inpainting through Improved Wasserstein Generative Adversarial Networks

Authors:

Patricia Vitoria, Joan Sintes and Coloma Ballester

Abstract: Image inpainting is the task of filling in missing regions of a damaged or incomplete image. In this work we tackle this problem not only by using the available visual data but also by incorporating image semantics through the use of generative models. Our contribution is twofold: First, we learn a data latent space by training an improved version of the Wasserstein generative adversarial network, for which we incorporate a new generator and discriminator architecture. Second, the learned semantic information is combined with a new optimization loss for inpainting whose minimization infers the missing content conditioned on the available data. It takes into account powerful contextual and perceptual content inherent in the image itself. The benefits include the ability to recover large regions by accumulating semantic information even when it is not fully present in the damaged image. Experiments show that the presented method obtains qualitative and quantitative top-tier results in different experimental situations and also achieves accurate photo-realism comparable to state-of-the-art works.
Download

Paper Nr: 118
Title:

Exploring the Limitations of the Convolutional Neural Networks on Binary Tests Selection for Local Features

Authors:

Bernardo G. Biesseck, Edson R. Araujo Junior and Erickson R. Nascimento

Abstract: Convolutional Neural Networks (CNNs) have been successfully used to recognize and extract visual patterns in different tasks such as object detection, object classification, scene recognition, and image retrieval. CNNs have also contributed to local feature extraction by learning local representations. A representative approach is LIFT, which generates keypoint descriptors more discriminative than handcrafted algorithms like SIFT, BRIEF, and SURF. In this paper, we investigate the binary tests selection problem, and we present an in-depth study of the limits of searching for solutions with CNNs when the gradient is computed from the local neighborhood of the selected pixels. We performed several experiments with a Siamese Network trained with corresponding and non-corresponding patch pairs. Our results show the presence of local minima and also a problem that we call Incorrect Gradient Components. We sought to understand the binary tests selection problem, as well as some limitations of Convolutional Neural Networks, to avoid searching for solutions in unviable directions.
Download

Paper Nr: 137
Title:

A Novel Multispectral Lab-depth based Edge Detector for Color Images with Occluded Objects

Authors:

Safa Mefteh, Mohamed-Bécha Kaâniche, Riadh Ksantini and Adel Bouhoula

Abstract: This paper presents a new method for edge detection based on both Lab color and depth images. The principal challenge of multispectral edge detection consists of integrating different information into one meaningful result, without requiring empirical parameters. Our method combines the Lab color channels and depth information in a well-posed way using the Jacobian matrix. Unlike classical multi-spectral edge detection methods using depth information, our method does not use empirical parameters. Thus, it is quite straightforward and efficient. Experiments have been carried out on Middlebury stereo dataset (Scharstein and Szeliski, 2003; Scharstein and Pal, 2007; Hirschmuller and Scharstein, 2007) and several selected challenging images (Rosenman, 2016; lightfieldgroup, 2016). Experimental results show that the proposed method outperforms recent relevant state-of-the-art methods.
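The Jacobian-based fusion of channels corresponds to Di Zenzo's multi-channel structure tensor; a sketch is given below, with the Lab and depth channels simply stacked. Whether the paper applies additional per-channel normalization is not shown here.

```python
import numpy as np

def multichannel_edge_strength(channels):
    """Edge strength of a multi-channel image (e.g. L, a, b and depth) from
    the Jacobian: at each pixel, the largest eigenvalue of J^T J (Di Zenzo's
    structure tensor) measures the strongest combined variation across all
    channels, with no per-channel weighting parameters."""
    gxx = gyy = gxy = 0.0
    for c in channels:
        gy, gx = np.gradient(c.astype(float))   # per-channel gradients
        gxx, gyy, gxy = gxx + gx * gx, gyy + gy * gy, gxy + gx * gy
    # Largest eigenvalue of the 2x2 tensor [[gxx, gxy], [gxy, gyy]].
    trace, diff = gxx + gyy, gxx - gyy
    lam = 0.5 * (trace + np.sqrt(diff * diff + 4.0 * gxy * gxy))
    return np.sqrt(lam)
```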
Download

Paper Nr: 156
Title:

A Combination of Histogram of Oriented Gradients and Color Features to Cooperate with Louvain Method based Image Segmentation

Authors:

Thanh-Khoa Nguyen, Mickael Coustaty and Jean-Loup Guillaume

Abstract: This paper presents an image segmentation strategy using histograms of oriented gradients (HOG), color features and the Louvain method, a graph-based community detection algorithm. This strategy relies on community-detection-based image segmentation, which often leads to over-segmented results. To address this problem, we propose an algorithm that agglomerates homogeneous regions using texture and color feature properties. The proposed algorithm is tested on the publicly available Berkeley Segmentation Dataset (BSDS300 and BSDS500) and the Microsoft Research Cambridge Object Recognition Image Database (MSRC). The experimental results show that our method produces reasonable segmentations and outperforms most other known methods in terms of accuracy and comparative metric scores.
Download

Short Papers
Paper Nr: 10
Title:

Colorization of Grayscale Image Sequences using Texture Descriptors

Authors:

Andre P. Ramos and Franklin C. Flores

Abstract: Colorization is the process of adding colors to a monochromatic image or video. Usually, the process involves segmenting the image into regions of interest and then applying colors to each one; for videos, this process is repeated for each frame, which makes it a tedious and time-consuming job. We propose a new assisted method for video colorization: the user only has to colorize one frame and then the colors are propagated to the following frames. The user can intervene at any time to correct eventual errors in the color assignment. The method consists of extracting intensity and texture descriptors from the frames and then performing feature matching to determine the best color for each segment. To reduce computation time and provide better spatial coherence, we narrow the search area and weight each feature to emphasize the texture descriptors. To produce a more natural result, we use an optimization algorithm for the color propagation. Experimental results on several image sequences demonstrate that, compared to other existing methods, the proposed method performs better colorization with less time and user intervention.
Download

Paper Nr: 25
Title:

Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition

Authors:

Dichao Liu, Yu Wang and Jien Kato

Abstract: We aim to propose more effective attentional regions that can help develop better fine-grained action recognition algorithms. On the basis of the spatial transformer networks’ capability of implementing spatial manipulation inside the networks, we propose an extension model, the Supervised Spatial Transformer Networks (SSTNs). This model first supervises the spatial transformers to capture the same regions as hard-coded attentional regions of certain scale levels. Then the supervision can be turned off, and the model adjusts the region learning in terms of location and scale. The adjustment is conditioned on the classification loss, so it is optimized for better recognition results. With this model, we are able to capture attentional regions of different levels within the networks. To evaluate SSTNs, we construct a six-stream SSTN model that exploits spatial and temporal information corresponding to three levels (general, middle and detail). The results show that the deep-learned attentional regions captured by SSTNs outperform hard-coded attentional regions. Also, the features learned by different streams of SSTNs are complementary to each other, and better results are obtained by fusing them.
Download

Paper Nr: 52
Title:

Gaussian Curvature Criterion based Random Sample Matching for Improved 3D Registration

Authors:

Faisal Azhar, Stephen Pollard and Guy Adams

Abstract: We propose a novel Gaussian Curvature (GC) based criterion to discard false point correspondences within the RANdom SAmple Matching (RANSAM) framework to improve the 3D registration. The RANSAM method is used to find a point pair correspondence match between two surfaces and GC is used to verify whether this point pair is a correct (similar curvatures) or false (dissimilar curvatures) match. The point pairs which pass the curvature test are used to compute a transformation which aligns the two overlapping surfaces. The results on shape alignment benchmarks show improved accuracy of the GRANSAM versus RANSAM and six other registration methods while maintaining efficiency.
Download

Paper Nr: 76
Title:

A Comparative Study on Voxel Classification Methods for Atlas based Segmentation of Brain Structures from 3D MRI Images

Authors:

Gaetan Galisot, Thierry Brouard, Jean-Yves Ramel and Elodie Chaillou

Abstract: Automatic or interactive segmentation tools for 3D medical images have been developed to help clinicians. Atlas-based methods are among the most common techniques to localize anatomical structures. They have been shown to be efficient with various types of medical images and various types of organs. However, a registration step is needed to perform an atlas-based segmentation, which can be very time consuming. Local atlases coupled with spatial relationships have been proposed to solve this issue. Local atlases are defined on a sub-part of the image, enabling a fast registration step. The positioning of these local atlases on the whole image can be done automatically with learned spatial relationships or interactively by a user when the automatic positioning does not perform well. In this article, different classification methods that can be included in local atlas segmentation methods are compared. Human brain and sheep brain MRI images have been used as databases for the experiments. Depending on the choice of the method, segmentation quality and computation time differ considerably. Graph-cut and CNN segmentation methods have been shown to be more suitable for interactive segmentation because of their low computation time. Multi-atlas based methods like local weighted majority voting are more suitable for automatic processing.
Download

Paper Nr: 99
Title:

Grapheme Approach to Recognizing Letters based on Medial Representation

Authors:

Anna Lipkina and Leonid Mestetskiy

Abstract: In this paper we propose a new mathematical model of character graphemes, a concept which is not yet strictly formalized, together with a method for constructing graphemes based on the continuous medial representation of letters in digital images. We also propose a method for recognizing printed text images based on this grapheme model, which is used both for feature generation and for classifier construction. Experimental results are presented that confirm the efficiency of the grapheme approach and the high quality of text recognition across different fonts and different qualities of the text image.
Download

Paper Nr: 150
Title:

Reading Circular Analogue Gauges using Digital Image Processing

Authors:

Jakob S. Lauridsen, Julius G. Grassmé, Malte Pedersen, David G. Jensen, Søren H. Andersen and Thomas B. Moeslund

Abstract: This paper presents an image processing based pipeline for automated recognition and translation of pointer movement in analogue circular gauges. The proposed method processes an input video frame-wise in a module based manner. Noise is minimized in each image using a bilateral filter before a Gaussian mean adaptive threshold is applied to segment objects. Subsequently, the objects are described by a set of proposed features and classified using probability distributions estimated using Expectation Maximization. The pointer is classified by the Mahalanobis distance and the angle of the pointer is determined using PCA. The output is a low pass filtered digital time series based on the temporal estimations of the pointer angle. Seven test videos have been processed by the algorithm showing promising results. Both source code and video data are publicly available.
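A condensed sketch of the frame-processing chain (denoise, adaptive threshold, pointer angle by PCA) is given below. It replaces the paper's EM/Mahalanobis blob classification with a largest-component shortcut, so it is an illustration rather than the published pipeline.

```python
import numpy as np
import cv2

def pointer_angle(frame_gray):
    """Estimate the gauge pointer angle in degrees from a uint8 grayscale
    frame: bilateral denoising, Gaussian-mean adaptive thresholding, then
    the principal axis (PCA) of the largest foreground blob."""
    smooth = cv2.bilateralFilter(frame_gray, 9, 50, 7)
    binary = cv2.adaptiveThreshold(smooth, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 5)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    blob = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])   # skip background label 0
    ys, xs = np.nonzero(labels == blob)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    # The eigenvector of the covariance with the largest eigenvalue gives
    # the elongated blob's direction, i.e. the pointer direction.
    _, vecs = np.linalg.eigh(np.cov(pts.T))
    major = vecs[:, -1]                                  # eigenvalues ascend
    return np.degrees(np.arctan2(major[1], major[0]))
```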
Download

Paper Nr: 158
Title:

Model-based Region of Interest Segmentation for Remote Photoplethysmography

Authors:

Peixi Li, Yannick Benezeth, Keisuke Nakamura, Randy Gomez and Fan Yang

Abstract: Remote photoplethysmography (rPPG) is a non-contact technique for measuring vital physiological signs, such as heart rate (HR) and respiratory rate (RR). HR is a medical index which is widely used in health monitoring and emotion detection applications. Therefore, HR measurement with rPPG methods offers a convenient and non-invasive method for these applications. The selection of a Region Of Interest (ROI) is a critical first step of many rPPG techniques to obtain reliable pulse signals. The ROI should contain as many skin pixels as possible with a minimum of non-skin pixels. Moreover, it has been shown that the rPPG signal is not distributed homogeneously on skin. Some skin regions contain more rPPG signal than others, mainly for physiological reasons. In this paper, we propose to explicitly favor areas where the information is more predominant using a spatially weighted average of skin pixels based on a trained model. The proposed method has been compared to several state-of-the-art ROI segmentation methods using a public database, namely the UBFC-RPPG dataset (Bobbia et al., 2017). We have shown that this modification in how the spatial averaging of the ROI pixels is calculated can significantly increase the final performance of the heart rate estimate.
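The weighted spatial averaging itself is simple; here is a sketch assuming a precomputed binary skin mask and a learned weight map on the same pixel grid (both assumptions, since the abstract does not detail the model).

```python
import numpy as np

def weighted_pulse_sample(frame, skin_mask, weight_map):
    """One rPPG sample: spatially weighted average of skin pixels, where
    weight_map favors regions known to carry stronger pulsatile signal.
    frame: HxWx3 float; skin_mask, weight_map: HxW."""
    w = weight_map * skin_mask                 # zero out non-skin pixels
    w_sum = w.sum()
    if w_sum == 0:
        return np.zeros(frame.shape[-1])
    # Weighted mean per color channel; the time series of these means is
    # the raw signal from which the heart rate is estimated.
    return (frame * w[..., None]).sum(axis=(0, 1)) / w_sum
```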
Download

Paper Nr: 215
Title:

Semantic Segmentation of Satellite Images using a Modified CNN with Hard-Swish Activation Function

Authors:

R. Avenash and P. Viswanath

Abstract: Remote sensing is a key strategy used to obtain information related to the Earth’s resources and its usage patterns. Semantic segmentation of a remotely sensed image in the spectral, spatial and temporal domain is an important preprocessing step where different classes of objects like crops, water bodies, roads and buildings are localized by a boundary. The paper proposes to use a Convolutional Neural Network (CNN) called U-HardNet with a novel activation function called Hard-Swish for segmenting remotely sensed images. Along with the CNN, for precise localization, the paper proposes to use IHS-transformed images with binary cross entropy loss minimization. Experiments are done with publicly available images provided by DSTL (Defence Science and Technology Laboratory) for object recognition, and a comparison is drawn with some recent relevant techniques.
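The paper's exact Hard-Swish definition is not given in the abstract; a common piecewise-linear "hard" approximation of swish is the following, stated here as an assumption rather than the authors' formula.

```python
import numpy as np

def hard_swish(x):
    """A common piecewise-linear approximation of swish (x * sigmoid(x)):
    x * clip(x + 3, 0, 6) / 6. It is cheap to evaluate while keeping the
    smooth-gated shape that swish-like activations provide."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0
```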
Download

Paper Nr: 246
Title:

Compact Color Texture Representation by Feature Selection in Multiple Color Spaces

Authors:

M. Alimoussa, N. Vandenbroucke, A. Porebski, R. H. Thami, S. El Fkihi and D. Hamad

Abstract: This paper presents a compact color texture representation based on the selection of features extracted from different configurations of descriptors computed in multiple color spaces. The proposed representation aims to simultaneously take into account several spatial and color properties of different textures. For this purpose, texture images are coded in five different color spaces. Then, texture descriptors with different neighborhood and quantization parameter settings are calculated from these images in order to extract a high-dimensional feature vector describing the textures. A compact representation is finally obtained by means of a feature selection scheme. Our approach is applied with two well-known color texture descriptors for the classification of three benchmark image databases.
Download

Paper Nr: 7
Title:

Features Extraction based on an Origami Representation of 3D Landmarks

Authors:

Juan F. Montenegro, Mahdi D. Oghaz, Athanasios Gkelias, Georgios Tzimiropoulos and Vasileios Argyriou

Abstract: Feature extraction has been widely investigated during the last decades in the computer vision community due to the large range of possible applications. Significant work has been done to improve the performance of emotion detection methods: classification algorithms have been refined, novel preprocessing techniques have been applied and novel representations of images and videos have been introduced. In this paper, we propose a preprocessing method and a novel facial landmark representation aiming to improve facial emotion detection accuracy. We apply our methodology on the extended Cohn-Kanade (CK+) dataset and other datasets for affect classification based on Action Units (AU). The performance evaluation demonstrates an improvement in facial emotion classification (accuracy and F1 score) that indicates the superiority of the proposed methodology.
Download

Paper Nr: 54
Title:

Roof Segmentation based on Deep Neural Networks

Authors:

Regina Pohle-Fröhlich, Aaron Bohm, Peer Ueberholz, Maximilian Korb and Steffen Goebbels

Abstract: This paper investigates deep neural networks (DNNs) for the segmentation of roof regions in the context of 3D building reconstruction. Point clouds as well as derived depth and density images are used as input data. For 3D building model generation we follow a data-driven approach, because it allows the reconstruction of roofs with more complex geometries than model-driven methods. For this purpose, we need a preprocessing step that segments roof regions of buildings according to the orientation of their slopes. In this paper, we test three different DNNs and compare the results with standard methods using thresholds either on gradients of 2D height maps or on point normals. For the application of a U-Net, we also transform point clouds into structured 2D height maps. PointNet and PointNet++ directly accept unstructured point clouds as input data. Compared to classical gradient- and normal-based threshold methods, our experiments with U-Net and PointNet++ lead to better segmentation of roof structures.
Download

Paper Nr: 70
Title:

Non-rigid Shape Registration using Curvature Information

Authors:

Albane Borocco and Beatriz Marcotegui

Abstract: This paper addresses a registration problem for an industrial control application: the need to register a model onto an image of a flexible object. We propose a non-rigid shape registration approach that deals with a great disparity between the number of points in the model and in the manufactured object. We developed a method based on a classical minimization process combining a distance term and a regularization term. We observed that, even if the control points fall on the object boundary, the registration fails on high-curvature points. In this paper we add a curvature-based term in order to improve the registration at object extremities. We validate our approach on a real industrial application. The addition of this curvature term halves the error of the inner boundary locations on the previously problematic cases of our database.
Download

Paper Nr: 106
Title:

Land Use Land Cover Classification from Satellite Imagery using mUnet: A Modified Unet Architecture

Authors:

Lakshya Garg, Parul Shukla, Sandeep K. Singh, Vaishangi Bajpai and Utkarsh Yadav

Abstract: Land-use land-cover (LULC) classification is used to automate the process of providing labels describing the physical land type, representing how a land area is being used. Many sectors such as telecom, utilities and hydrology need land use and land cover information from remote sensing images. This information provides insight into the geographical distribution of a region, providing low-level features such as the amount of vegetation, building area and geometry, as well as higher-level concepts such as land use classes. This information is particularly useful for resource-starved, rapidly developing cities for urban planning and resource management. LULC also captures historical changes in land-use patterns over a period of time. In this paper, we analyze patterns of land use in urban and rural neighborhoods using high-resolution satellite imagery and a state-of-the-art deep convolutional neural network. The proposed LULC network, termed mUnet, is based on an encoder-decoder convolutional architecture for pixel-level semantic segmentation. We test our approach on 3-band FCC satellite imagery covering a 225 km2 area of Karachi. Experimental results show the superiority of our proposed network architecture vis-à-vis other state-of-the-art networks.
Download

Paper Nr: 123
Title:

Attribute Operators for Color Images: Color Harmonization based on Maximal Harmonic Segmentation

Authors:

Sérgio S. Filho and Franklin C. Flores

Abstract: Attribute openings and thinnings are morphological connected operators that remove structures from images according to a given criterion. These operators were successfully extended from binary to grayscale images, but such extension to color images is not straightforward. Color attribute operators have been proposed by a combination of color gradients and thresholding decomposition. In this approach, not only structural criteria may be applied, but also criteria based on color features and statistics. This work proposes, in a segmentation framework, a criterion based on color histogram divergence from a harmonic model. This criterion permits a segmentation in maximal harmonic regions. An application indicated that the harmonic segmentation permitted a hue correction that would not cause false colors to appear in regions already harmonic.
Download

Paper Nr: 174
Title:

Bio-inspired Event-based Motion Analysis with Spiking Neural Networks

Authors:

Veís Oudjail and Jean Martinet

Abstract: This paper presents an original approach to analyzing the motion of a moving pattern with a Spiking Neural Network, using visual data encoded in the Address-Event Representation. Our objective is to identify a minimal network structure able to recognize the motion direction of a simple binary pattern. For this purpose, we generated synthetic data of 3 different patterns moving in 4 directions, and we designed several variants of a one-layer fully-connected feed-forward spiking neural network with a varying number of neurons in the output layer. The networks are trained in an unsupervised manner by presenting the synthetic temporal data to the network for a few seconds. The experimental results show that such networks quickly converge to a state where input classes can be successfully distinguished for 2 of the considered patterns; no network configuration converged for the third pattern. In the convergence cases, the network showed remarkable stability across several output layer sizes. We also show that the sequential order of presentation of the classes impacts the ability of the network to learn the input.
Download

Paper Nr: 187
Title:

cudaIFT: 180x Faster Image Foresting Transform for Waterpixel Estimation using CUDA

Authors:

Henrique M. Gonçalves, Gustavo J. Q. de Vasconcelos, Paola R. Rangel, Murilo Carvalho, Nathaly L. Archilha and Thiago V. Spina

Abstract: We propose a GPU-based version of the Image Foresting Transform by Seed Competition (IFT-SC) operator and instantiate it to produce compact watershed-based superpixels (Waterpixels). Superpixels are usually applied as a pre-processing step to reduce the amount of processed data to perform object segmentation. However, recent advances in image acquisition techniques can easily produce 3D images with billions of voxels in roughly 1 second, making the time necessary to compute Waterpixels using the CPU-version of the IFT-SC quickly escalate. We aim to address this fundamental issue, since the efficiency of the entire object segmentation methodology may be hindered by the initial process of estimating superpixels. We demonstrate that our CUDA-based version of the sequential IFT-SC operator can speed up computation by a factor of up to 180x for 2D images, with consistent optimum-path forests without requiring additional CPU post-processing.
Download

Paper Nr: 206
Title:

Fingerprint Image Segmentation based on Oriented Pattern Analysis

Authors:

Raimundo S. Vasconcelos and Helio Pedrini

Abstract: Segmentation is a crucial task in automatic fingerprint identification systems. This paper describes a novel segmentation approach which takes into account the directional information inherent in fingerprint ridges. The method uses a directional operator to feed a k-means unsupervised clustering algorithm that labels the image into non-overlapping regions. Morphological operations are performed to fill holes and properly separate foreground from background. Experiments conducted on Fingerprint Verification Competition (FVC) datasets demonstrate that the proposed method, denoted Oriented Pattern-based Segmentation (OPS), achieves competitive results when compared to other well-known fingerprint segmentation approaches.
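A rough sketch of the directional-clustering idea, assuming block-wise gradient orientation coherence as the directional operator and two k-means clusters; the actual OPS operator and feature set are more elaborate.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def segment_fingerprint(img, block=16):
    """Foreground/background mask for a grayscale fingerprint image whose
    dimensions are assumed to be multiples of `block`: per-block orientation
    coherence and gradient energy feed k-means, then morphological cleanup."""
    gy, gx = np.gradient(img.astype(float))
    rows, cols = img.shape[0] // block, img.shape[1] // block
    feats = np.zeros((rows * cols, 2))
    for i in range(rows):
        for j in range(cols):
            bx = gx[i*block:(i+1)*block, j*block:(j+1)*block]
            by = gy[i*block:(i+1)*block, j*block:(j+1)*block]
            gxx, gyy, gxy = (bx*bx).sum(), (by*by).sum(), (bx*by).sum()
            coherence = np.sqrt((gxx - gyy)**2 + 4*gxy**2) / (gxx + gyy + 1e-9)
            feats[i * cols + j] = (coherence, gxx + gyy)  # orientation strength, energy
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats).reshape(rows, cols)
    # Foreground = the cluster with higher mean coherence (oriented ridges).
    fg = labels == np.argmax([feats[labels.ravel() == k, 0].mean() for k in (0, 1)])
    fg = ndimage.binary_fill_holes(ndimage.binary_closing(fg))
    return np.kron(fg, np.ones((block, block), dtype=bool))  # back to pixel grid
```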
Download

Paper Nr: 236
Title:

UAV-based Inspection of Airplane Exterior Screws with Computer Vision

Authors:

Julien Miranda, Stanislas Larnier, Ariane Herbulot and Michel Devy

Abstract: We propose a new approach to detect and inspect aircraft exterior screws. An Unmanned Aerial Vehicle (UAV), locating itself in the aircraft frame thanks to lidar technology, is able to acquire precise images with useful metadata. We use a method based on a convolutional neural network (CNN) to characterize zones of interest (ZOI) and to extract screws from images; methods are proposed to create a prior model for matching. Classic matching approaches are used to match the screws from this model with the detected ones, to increase screw recognition accuracy and to detect missing screws, giving the system a new ability. Computer vision algorithms are then applied to evaluate the state of each visible screw, detecting missing and loose ones.
Download

Paper Nr: 245
Title:

Extraction of Musical Motifs from Handwritten Music Score Images

Authors:

Benammar Riyadh, Véronique Eglin and Christine Largeron

Abstract: A musical motif is a sequence of musical notes that can determine the identity of a composer or a music style. Musical motif extraction is of great interest to musicologists for critical studies of music scores. It can be solved with a string mining algorithm when the music data is represented as a sequence. When music data is initially produced in XML or MIDI format, or can be converted into those standards, it can be automatically represented as a sequence of notes. So, in this work, starting from digitized images of music scores, our objective is twofold. First, we design a system able to generate musical sequences from handwritten music scores. To address this issue, a segmentation-free R-CNN (Region-proposal Convolutional Neural Network) model trained on musical data has been used to detect and recognize musical primitives, which are then transcribed into XML sequences. The sequences are then processed by a motif extraction algorithm called CSMA (Constrained String Mining Algorithm). The consistency and performance of the framework are discussed with respect to the efficiency of the R-CNN-based recognition system, by estimating how misclassified primitives affect the set of detected motifs. Experiments on the complete pipeline show that more than 70% of the motifs can be found with less than 20% average R-CNN detection/classification error.
Download

Paper Nr: 248
Title:

Detection of Control Points for UAV-Multispectral Sensed Data Registration through the Combining of Feature Descriptors

Authors:

Jocival D. Dias Junior, André R. Backes and Maurício C. Escarpinati

Abstract: The popularization of Unmanned Aerial Vehicles (UAVs) and the development of new sensors have enabled the acquisition and use of multispectral and hyperspectral images in precision agriculture. However, the image registration process is a complex task due to the lack of shared image characteristics among the various spectral bands and the distortions introduced by the UAV during the acquisition process. Therefore, the objective of this work is to evaluate different techniques for obtaining control points in multispectral images of soybean plantations acquired by UAVs, and to investigate whether combining features obtained by different techniques produces better results than using them individually. Three feature detection algorithms (KAZE, MEF and BRISK) and their combinations were evaluated. Results show that the KAZE technique achieves the best results.
Download

Area 3 - Image and Video Understanding

Full Papers
Paper Nr: 38
Title:

Path Predictions using Object Attributes and Semantic Environment

Authors:

Hiroaki Minoura, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Path prediction methods with deep learning architectures take into account the interaction of pedestrians and the features of the physical environment in the surrounding area. These methods, however, process all prediction targets as a single category, which makes it difficult to predict a path suitable for each category. In real scenes, it is necessary to consider not only pedestrians but also automobiles and bicycles, and taking the types of multiple targets into account should make it possible to predict a path appropriate to each type. Therefore, aiming at path prediction in accordance with individual categories, we propose a path prediction method that represents the target type as an attribute and simultaneously considers the physical environment information. The proposed method feeds a long short-term memory with feature vectors that represent i) the past object trajectory, ii) the attribute, and iii) the semantics of the surrounding area. This makes it possible to predict a path that is proper for each target. Experimental results show that our approach can predict a path with higher precision. We also analyze how accuracy changes when introducing the attribute of the prediction target and the physical environment information.
Download
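
A minimal sketch of the kind of attribute-conditioned LSTM described above (assuming PyTorch; all layer sizes, names and tensor shapes are illustrative, not taken from the paper):

    import torch
    import torch.nn as nn

    class AttributeConditionedPredictor(nn.Module):
        # LSTM path predictor conditioned on a target-type attribute and scene semantics.
        def __init__(self, attr_dim=4, sem_dim=16, hidden=64, horizon=12):
            super().__init__()
            # per-step input: (x, y) position + attribute one-hot + semantic vector
            self.lstm = nn.LSTM(2 + attr_dim + sem_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, horizon * 2)  # future (x, y) offsets
            self.horizon = horizon

        def forward(self, traj, attr, sem):
            # traj: (B, T, 2), attr: (B, attr_dim), sem: (B, sem_dim)
            T = traj.size(1)
            cond = torch.cat([attr, sem], dim=1).unsqueeze(1).expand(-1, T, -1)
            out, _ = self.lstm(torch.cat([traj, cond], dim=2))
            return self.head(out[:, -1]).view(-1, self.horizon, 2)

    model = AttributeConditionedPredictor()
    pred = model(torch.randn(8, 20, 2), torch.randn(8, 4), torch.randn(8, 16))
    print(pred.shape)  # torch.Size([8, 12, 2])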

Paper Nr: 41
Title:

Vehicle Activity Recognition using Mapped QTC Trajectories

Authors:

Alaa AlZoubi and David Nam

Abstract: The automated analysis of interacting objects or vehicles has many uses, including autonomous driving and security surveillance. In this paper we present a novel method for vehicle activity recognition using a Deep Convolutional Neural Network (DCNN). We use the Qualitative Trajectory Calculus (QTC) to represent the relative motion between a pair of vehicles, and encode their interactions as a trajectory of QTC states. We then use one-hot vectors to map the trajectory into a 2D matrix, which preserves the essential position information of each QTC state in the sequence. Specifically, we project QTC sequences into a two-dimensional image texture; our method then adapts layers trained on the ImageNet dataset and transfers this knowledge to the activity recognition task. We have evaluated our method using two different datasets and shown that it outperforms state-of-the-art methods, achieving an error rate of no more than 1.16%. Our motivation originates from an interest in the automated analysis of vehicle movement for collision avoidance, and we present a dataset of vehicle-obstacle interactions collected from simulator-based experiments.
Download
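
The QTC-sequence-to-matrix mapping can be illustrated in a few lines of NumPy (a sketch; the state count and ids are assumptions, e.g. 81 states as in the full QTC variant, not figures quoted from the paper):

    import numpy as np

    def qtc_to_matrix(seq, num_states=81, max_len=100):
        # One-hot encode a QTC state sequence into a (num_states x max_len) matrix.
        # Each column is one time step; the row index preserves which QTC state
        # occurred, so the matrix can be fed to a DCNN as an image texture.
        m = np.zeros((num_states, max_len), dtype=np.float32)
        for t, state in enumerate(seq[:max_len]):
            m[state, t] = 1.0
        return m

    img = qtc_to_matrix([3, 3, 17, 17, 42])
    print(img.shape, img.sum())  # (81, 100) 5.0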

Paper Nr: 67
Title:

Detection of Imaged Objects with Estimated Scales

Authors:

Xuesong Li, Ngaiming Kwok, Jose E. Guivant, Karan Narula, Ruowei Li and Hongkun Wu

Abstract: Dealing with multiple object sizes in an image has always been a challenge in object detection. Predefined multi-size anchors are usually adopted to address this issue, but they can only accommodate a limited number of object scales and aspect ratios. To cover a wider range of size variations, we propose a detection method that utilizes depth information to estimate the size of anchors. To be more specific, a general 3D shape is selected for each class of objects, which yields 2D bounding boxes of different sizes in the image according to the corresponding object depths. Given these 2D bounding boxes, a neural network classifies them into categories and performs regression to obtain more accurate 2D bounding boxes. The KITTI benchmark dataset is used to validate the proposed approach. Compared with a detection method using pre-defined anchors, the proposed method achieves a significant improvement in detection accuracy.
Download
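
The underlying geometry, projecting a class-specific 3D extent to a 2D anchor size via the pinhole model, can be sketched as follows (the focal length and the car dimensions are made-up illustrative values, not taken from the paper):

    def anchor_from_depth(depth_m, class_size_m, focal_px=721.5):
        # Pinhole projection: a physical extent W at depth Z spans f * W / Z pixels.
        width_m, height_m = class_size_m
        return focal_px * width_m / depth_m, focal_px * height_m / depth_m

    # Hypothetical average car extent (width, height) in metres, seen at 20 m
    w_px, h_px = anchor_from_depth(20.0, (1.8, 1.5))
    print(round(w_px, 1), round(h_px, 1))  # 64.9 54.1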

Paper Nr: 91
Title:

Generative Adversarial Networks as an Advanced Data Augmentation Technique for MRI Data

Authors:

Filippos Konidaris, Thanos Tagaris, Maria Sdraka and Andreas Stafylopatis

Abstract: This paper presents a new methodology for data augmentation through the use of Generative Adversarial Networks. Traditional augmentation strategies are severely limited, especially in tasks where the images follow strict standards, as is the case in medical datasets. Experiments conducted on the ADNI dataset show that augmentation through GANs outperforms traditional methods by a large margin, based both on the validation accuracy and on the models’ generalization capability on a holdout test set. Although traditional data augmentation did not seem to aid the classification process in any way, adding GAN-based augmentation yielded an increase of 11.68% in accuracy. Furthermore, combining traditional with GAN-based augmentation schemes reaches even higher accuracies.
Download
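
At training time, mixing real and GAN-generated samples amounts to a dataset union; a minimal PyTorch sketch (the random tensors stand in for real and synthesized MRI slices and are purely illustrative):

    import torch
    from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

    # Stand-ins for real and GAN-synthesized labeled images
    real = TensorDataset(torch.randn(500, 1, 96, 96), torch.randint(0, 2, (500,)))
    synthetic = TensorDataset(torch.randn(500, 1, 96, 96), torch.randint(0, 2, (500,)))

    # The classifier then trains on the union of both sources
    loader = DataLoader(ConcatDataset([real, synthetic]), batch_size=32, shuffle=True)
    images, labels = next(iter(loader))
    print(images.shape)  # torch.Size([32, 1, 96, 96])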

Paper Nr: 100
Title:

Next Viewpoint Recommendation by Pose Ambiguity Minimization for Accurate Object Pose Estimation

Authors:

Nik Z. Hashim, Yasutomo Kawanishi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Ayako Amma and Norimasa Kobori

Abstract: 3D object pose estimation using a depth sensor is an important task for robotic activities. To reduce the pose ambiguity of an estimated object pose, several methods for multiple-viewpoint pose estimation have been proposed. However, these methods need to select their viewpoints carefully to obtain good results. If the pose of the target object is ambiguous from the current observation, it is unclear where the sensor should be moved for the next viewpoint. In this paper, we propose a next-best-viewpoint recommendation method that minimizes the pose ambiguity of the object by making use of the current pose estimation result as a latent variable. We evaluated the viewpoints recommended by the proposed method and confirmed that it yields better pose estimation results than several comparative methods on a synthetic dataset.
Download

Paper Nr: 131
Title:

Practical License Plate Recognition in Unconstrained Surveillance Systems with Adversarial Super-Resolution

Authors:

Younkwan Lee, Jiwon Jun, Yoojin Hong and Moongu Jeon

Abstract: Although most current license plate (LP) recognition applications have advanced significantly, they are still limited to ideal environments where training data are carefully annotated with constrained scenes. In this paper, we propose a novel license plate recognition method to handle unconstrained real-world traffic scenes. To overcome these difficulties, we use adversarial super-resolution (SR) and one-stage character segmentation and recognition. Combined with a deep convolutional network based on VGG-net, our method provides a simple but effective training procedure. Moreover, we introduce GIST-LP, a challenging LP dataset whose image samples were collected from unconstrained surveillance scenes. Experimental results on the AOLP and GIST-LP datasets illustrate that our method, without any scene-specific adaptation, outperforms current LP recognition approaches in accuracy and provides super-resolved results that are visually clearer than the original data.
Download

Paper Nr: 141
Title:

FinSeg: Finger Parts Semantic Segmentation using Multi-scale Feature Maps Aggregation of FCN

Authors:

Adel Saleh, Hatem A. Rashwan, Mohamed Abdel-Nasser, Vivek K. Singh, Saddam Abdulwahab, Md. K. Sarker, Miguel A. Garcia and Domenec Puig

Abstract: Image semantic segmentation is at the center of interest for computer vision researchers. Indeed, a huge number of applications require efficient segmentation performance, such as activity recognition, navigation, and human body parsing. One important application is gesture recognition, i.e., the ability to understand human hand gestures by detecting and counting finger parts in a video stream or in still images. Accurate finger part segmentation thus yields more accurate gesture recognition. Consequently, this paper makes two contributions. First, we propose a data-driven deep learning pooling policy based on multi-scale feature map extraction at different scales (called FinSeg). A novel aggregation layer is introduced in this model, in which the feature maps generated at each scale are weighted using a fully connected layer. Second, given the lack of realistic labeled finger part datasets, we propose a labeled dataset for finger part segmentation (FingerParts dataset). To the best of our knowledge, the proposed dataset is the first attempt to build a realistic dataset for finger part semantic segmentation. The experimental results show that the proposed model yields an improvement of 5% compared to the standard FCN network.
Download

Paper Nr: 160
Title:

Camera Tampering Detection using Generative Reference Model and Deep Learned Features

Authors:

Pranav Mantini and Shishir K. Shah

Abstract: An unauthorized alteration in the viewpoint of a surveillance camera is called tampering. Tamper detection involves comparing images from the surveillance camera against a reference model. The reference model represents the features (e.g. background, edges, and interest points) of the image under normal operating conditions. The approach is to identify a tamper by analyzing the distance between the features of the surveillance camera image and those of the reference model. If the distance is not within a certain threshold, the image is labeled as tampered. Most methods have used images from the immediate past of the surveillance camera to construct the reference model. We propose to employ a generative model that learns the distribution of images from the surveillance camera under normal operating conditions, by training a generative adversarial network (GAN). The GAN can sample images from the learned probability density function, which are used as references. We train a Siamese network that transforms the images into a feature space so as to maximize the distance between generated images and tampered images, while minimizing the distance between generated and normal images. Based on the distance between the generated and the surveillance camera image, the latter is classified as either normal or tampered. The model is trained and tested on a synthetic dataset created by inducing artificial tampering with image processing techniques. We compare the performance of the proposed model against two existing methods. Results show that the proposed model is highly capable of detecting and classifying tampering, and outperforms the existing methods with respect to accuracy and false positive rate.
Download

Paper Nr: 185
Title:

Understanding How Video Quality Affects Object Detection Algorithms

Authors:

Miloud Aqqa, Pranav Mantini and Shishir K. Shah

Abstract: Video quality is an important practical challenge that is often overlooked in the design of automated video surveillance systems. Commonly, intelligent vision systems are trained and tested on high-quality image datasets, yet in practical video surveillance applications the video frames cannot be assumed to be of high quality due to video encoding, transmission and decoding. Recently, deep neural networks have obtained state-of-the-art performance on many machine vision tasks. In this paper we provide an evaluation of four state-of-the-art deep neural network models for object detection under various levels of video compression. We show that existing detectors are susceptible to quality distortions stemming from compression artifacts during video acquisition. These results enable future work on developing object detectors that are more robust to video quality.
Download

Paper Nr: 195
Title:

Plant Growth Prediction using Convolutional LSTM

Authors:

Shunsuke Sakurai, Hideaki Uchiyama, Atshushi Shimada and Rin-Ichiro Taniguchi

Abstract: This paper presents a method for predicting plant growth in future images from past images, as a new phenotyping technology. This is achieved by modeling the representation of plant growth with a neural network. In order to learn the long-term dependencies in plant growth from the images, we propose to employ a Convolutional LSTM based framework. In particular, we apply an encoder-decoder model, inspired by a framework for future frame prediction, to model the representation of plant growth effectively. In addition, we propose two additional loss terms that constrain the shape changes of leaves between consecutive images. In the evaluation, we demonstrate the effectiveness of the proposed loss functions through comparisons using labeled plant growth images.
Download

Paper Nr: 208
Title:

Spatio-temporal Video Autoencoder for Human Action Recognition

Authors:

Anderson E. Santos and Helio Pedrini

Abstract: The demand for automatic systems for action recognition has increased significantly due to the development of surveillance cameras with high sampling rates, low cost, small size and high resolution. These systems can effectively support human operators to detect events of interest in video sequences, reducing failures and improving recognition results. In this work, we develop and analyze a method to learn two-dimensional (2D) representations from videos through an autoencoder framework. A multi-stream network is used to incorporate spatial and temporal information for action recognition purposes. Experiments conducted on the challenging UCF101 and HMDB51 data sets indicate that our representation is capable of achieving competitive accuracy rates compared to the literature approaches.
Download

Paper Nr: 212
Title:

Ontology and HMAX Features-based Image Classification using Merged Classifiers

Authors:

Jalila Filali, Hajer B. Zghal and Jean Martinet

Abstract: The Bag-of-Visual-Words (BoVW) model, which relies on building a visual vocabulary, has been widely used in the area of image classification. Recently, attention has shifted to the use of advanced architectures characterized by multilevel processing. The HMAX model (Hierarchical Max-pooling model) has attracted a great deal of attention in image classification. Recent works in image classification consider the integration of ontologies and semantic structures useful for improving classification. In this paper, we propose an image classification approach based on ontology and HMAX features using merged classifiers. Our contribution resides in exploiting ontological relationships between image categories when training visual-feature classifiers, and in merging the outputs of hypernym-hyponym classifiers to achieve better discrimination between classes. Our purpose is to improve image classification by using ontologies. Several strategies have been experimented with, and the obtained results show that our proposal improves image classification: results based on our ontology outperform those obtained by baseline methods without an ontology. Moreover, the deep network Inception-v3 was evaluated and compared with our method; the classification results obtained by our method outperform Inception-v3 for some image classes.
Download

Paper Nr: 234
Title:

Bayesian Optimization of 3D Feature Parameters for 6D Pose Estimation

Authors:

Frederik Hagelskjær, Norbert Krüger and Anders G. Buch

Abstract: 6D pose estimation using local features has shown state-of-the-art performance for object recognition and pose estimation from 3D data in a number of benchmarks. However, this method requires extensive knowledge and elaborate parameter tuning to obtain optimal performance. In this paper, we propose an optimization method able to determine feature parameters automatically, providing improved point matches to a robust pose estimation algorithm. Using labeled data, our method measures the performance of the current parameter setting with a scoring function based on both true and false positive detections. Combined with a Bayesian optimization strategy, we achieve automatic tuning using few labeled examples. Experiments were performed on two recent RGB-D benchmark datasets. The results show significant improvements from tuning an existing algorithm, with state-of-the-art performance.
Download
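
A tuning loop in this spirit can be reproduced with an off-the-shelf Bayesian optimizer; the sketch below uses scikit-optimize, with placeholder parameter names and a toy objective standing in for the authors' detection-based scoring function:

    from skopt import gp_minimize
    from skopt.space import Real, Integer

    def detection_score(params):
        # Stand-in for the real objective: run feature extraction + pose
        # estimation with these parameters on labeled data and return a loss
        # based on true/false positive detections (lower is better).
        # A toy quadratic is used here so the snippet runs on its own.
        feature_radius, num_neighbors = params
        return (feature_radius - 0.02) ** 2 + abs(num_neighbors - 20) / 1000.0

    space = [Real(0.005, 0.05, name="feature_radius"),
             Integer(5, 50, name="num_neighbors")]

    result = gp_minimize(detection_score, space, n_calls=30, random_state=0)
    print("best parameters:", result.x)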

Paper Nr: 260
Title:

A Modified Self-training Method for Adapting Domains in the Task of Food Classification

Authors:

Elnaz J. Heravi, Hamed H. Aghdam and Domenec Puig

Abstract: Food trackers are tools that recognize foods from their images. At the core of these tools there is usually a neural network that performs the classification. Neural networks are highly expressive models that need a large dataset to generalize well. Since it is hard to collect a training set that captures most realistic situations in the real world, there is usually a shift between the training set and the actual test set, which can reduce the performance of the network. In this paper, we propose a method based on self-training to perform unsupervised domain adaptation in the task of food classification. Our method takes into account the uncertainty of predictions, instead of probability scores, to assign pseudo-labels. Our experiments on the Food-101 and the UPMC-101 datasets show that the proposed method produces more accurate results than the Tri-training method, which had previously surpassed other domain adaptation methods.
Download
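
A generic uncertainty-gated pseudo-labeling step, with Monte-Carlo dropout variance as one possible uncertainty proxy (not necessarily the authors' exact criterion), might look like this in PyTorch, assuming a classifier that returns logits of shape (B, C):

    import torch

    @torch.no_grad()
    def pseudo_label(model, images, n_passes=10, max_std=0.05):
        # Assign pseudo-labels only where prediction uncertainty is low.
        # Uncertainty = std of class probabilities over stochastic forward
        # passes (MC dropout), rather than the raw probability score.
        model.train()  # keep dropout active for stochastic passes
        probs = torch.stack([model(images).softmax(dim=1) for _ in range(n_passes)])
        mean, std = probs.mean(dim=0), probs.std(dim=0)
        labels = mean.argmax(dim=1)
        keep = std.gather(1, labels.unsqueeze(1)).squeeze(1) < max_std
        return images[keep], labels[keep]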

Short Papers
Paper Nr: 2
Title:

Topolet: From Atomic Hand Posture Structures to a Comprehensive Gesture Set

Authors:

Amin Dadgar and Guido Brunnett

Abstract: We propose a type of time-series model for a hierarchical hand posture database which can be viewed as a Markovian temporal structure. The model employs the topology of the point cloud existing in each layer of the database, and exploits a novel type of atomic structure we refer to as the Topolet. Moreover, our temporal structure utilizes a modified version of another atomic gesture structure, known as the Poselet. The modification considers Poselets from a vector-based generative perspective (instead of a pixel-based discriminative one). The results suggest a considerable improvement in accuracy and time complexity. Furthermore, in contrast to other approaches, our Topolet is capable of handling random gestures, and thus introduces a comprehensive set of gestures (suitable for context-free application domains) within the shape-based approach. We show that the Topolet can be extended to different resolutions of the gesture set, giving the system the potential to adapt to different application requirements.
Download

Paper Nr: 4
Title:

Optical Flow Augmented Semantic Segmentation Networks for Automated Driving

Authors:

Hazem Rashed, Senthil Yogamani, Ahmad El-Sallab, Pavel Křížek and Mohamed El-Helw

Abstract: Motion is a dominant cue in automated driving systems. Optical flow is typically computed to detect moving objects and to estimate depth using triangulation. In this paper, our motivation is to leverage the existing dense optical flow to improve the performance of semantic segmentation. To provide a systematic study, we construct four different architectures which use RGB only, flow only, RGBF concatenated and two-stream RGB + flow. We evaluate these networks on two automotive datasets namely Virtual KITTI and Cityscapes using the state-of-the-art flow estimator FlowNet v2. We also make use of the ground truth optical flow in Virtual KITTI to serve as an ideal estimator and a standard Farneback optical flow algorithm to study the effect of noise. Using the flow ground truth in Virtual KITTI, two-stream architecture achieves the best results with an improvement of 4% IoU. As expected, there is a large improvement for moving objects like trucks, vans and cars with 38%, 28% and 6% increase in IoU. FlowNet produces an improvement of 2.4% in average IoU with larger improvement in the moving objects corresponding to 26%, 11% and 5% in trucks, vans and cars. In Cityscapes, flow augmentation provided an improvement for moving objects like motorcycle and train with an increase of 17% and 7% in IoU.
Download

Paper Nr: 5
Title:

Multi-stream CNN based Video Semantic Segmentation for Automated Driving

Authors:

Ganesh Sistu, Sumanth Chennupati and Senthil Yogamani

Abstract: The majority of semantic segmentation algorithms operate on a single frame, even in the case of videos. In this work, the goal is to exploit temporal information within the algorithm model to leverage motion cues and temporal consistency. We propose two simple high-level architectures based on Recurrent FCN (RFCN) and Multi-Stream FCN (MSFCN) networks. In the case of RFCN, a recurrent network, namely an LSTM, is inserted between the encoder and decoder. MSFCN combines the encoders of different frames into a fused encoder via 1x1 channel-wise convolution. We use a ResNet50 network as the baseline encoder and construct three networks, namely MSFCN of order 2 & 3 and RFCN of order 2. MSFCN-3 produces the best results, with accuracy improvements of 9% and 15% for the Highway and New York-like city scenarios in the SYNTHIA-CVPR’16 dataset using the mean IoU metric. MSFCN-3 also produced improvements of 11% and 6% on the SegTrack V2 and DAVIS datasets over the baseline FCN network. We also designed efficient versions of MSFCN-2 and RFCN-2 using weight sharing between the two encoders. The efficient MSFCN-2 provided improvements of 11% and 5% for KITTI and SYNTHIA with a negligible increase in computational complexity compared to the baseline version.
Download
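
The 1x1 channel-wise fusion of per-frame encoders can be sketched in a few lines (PyTorch; channel counts are illustrative, e.g. the 2048-channel output of a ResNet50 encoder):

    import torch
    import torch.nn as nn

    class EncoderFusion(nn.Module):
        # Fuse per-frame encoder feature maps with a 1x1 convolution (MSFCN-style).
        def __init__(self, channels=2048, num_frames=3):
            super().__init__()
            self.fuse = nn.Conv2d(channels * num_frames, channels, kernel_size=1)

        def forward(self, frame_feats):
            # frame_feats: list of (B, C, H, W) encoder outputs, one per frame
            return self.fuse(torch.cat(frame_feats, dim=1))

    fusion = EncoderFusion()
    feats = [torch.randn(1, 2048, 12, 20) for _ in range(3)]
    print(fusion(feats).shape)  # torch.Size([1, 2048, 12, 20])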

Paper Nr: 12
Title:

An Extensible Deep Architecture for Action Recognition Problem

Authors:

Isaac Sanou, Donatello Conte and Hubert Cardot

Abstract: Human action recognition has been extensively addressed with deep learning. However, the problem is still open, and many deep learning architectures show limits, such as extracting redundant spatio-temporal information, using hand-crafted features, and instability of the proposed networks across different datasets. In this paper, we present a general deep learning method for human action recognition. The model fits any type of dataset, and we apply it to CAD-120, which is a complex dataset. Our model thus clearly improves on two aspects: the handling of redundant information, and the generality and multi-functionality of the deep architecture. Our model uses only raw data for human action recognition, and the approach achieves state-of-the-art action classification performance.
Download

Paper Nr: 22
Title:

Unsupervised Facial Biometric Data Filtering for Age and Gender Estimation

Authors:

Krešimir Bešenić, Jörgen Ahlberg and Igor S. Pandžić

Abstract: The availability of large training datasets was essential for the recent advancement and success of deep learning methods. Due to the difficulties related to biometric data collection, datasets with age and gender annotations are scarce and usually limited in terms of size and sample diversity. Web-scraping approaches for automatic data collection can produce large amounts of weakly labeled, noisy data. The unsupervised facial biometric data filtering method presented in this paper greatly reduces label noise levels in web-scraped facial biometric data. Experiments on two large state-of-the-art web-scraped facial datasets demonstrate the effectiveness of the proposed method with respect to training and validation scores, training convergence, and the generalization capabilities of the trained age and gender estimators.
Download

Paper Nr: 44
Title:

A Survey on Databases for Facial Micro-Expression Analysis

Authors:

Jingting Li, Catherine Soladie and Renaud Seguier

Abstract: A micro-expression (ME) is a brief, local, spontaneous facial expression and an important non-verbal clue for revealing genuine emotion. The study of automatic ME detection and recognition has been emerging in the last decade. However, the research is restricted by the number of ME databases. In this paper, we propose a survey of the 15 existing ME databases. First, the databases are analyzed along 13 characteristics grouped into four categories (population, hardware, experimental protocol, and annotation). These characteristics provide a reference not only for choosing a database for a specific ME analysis purpose, but also for future database construction. Concerning ME analysis based on these databases, we then present the emotion classification schemes and the most frequently used metrics for ME recognition, and introduce the databases most frequently used for ME detection. Finally, we discuss future directions for micro-expression databases.
Download

Paper Nr: 47
Title:

Boosting 3D Shape Classification with Global Verification and Redundancy-free Codebooks

Authors:

Viktor Seib, Nick Theisen and Dietrich Paulus

Abstract: We present a competitive approach for 3D data classification that is related to Implicit Shape Models and Naive-Bayes Nearest Neighbor algorithms. Based on this approach, we investigate methods to reduce the amount of data stored in the extracted codebook, with the goal of eliminating redundant and ambiguous feature descriptors. The codebook is significantly reduced in size and is combined with a novel global verification approach. We evaluate our algorithms on typical 3D data benchmarks and achieve competitive results despite the reduced codebook. The presented algorithm runs efficiently on a mobile computer, making it suitable for mobile robotics applications. The source code of the developed methods is made publicly available to contribute to point cloud processing, the Point Cloud Library (PCL) and 3D classification software in general.
Download

Paper Nr: 59
Title:

Semi-supervised Object Detection with Unlabeled Data

Authors:

Nhu-Van Nguyen, Christophe Rigaud and Jean-Christophe Burie

Abstract: Besides fully supervised object detection, many approaches have explored other training settings, such as weakly-supervised learning, which uses only weak (image-level) labels, or mixed-supervised learning, which uses a few strong (instance-level) labels and many weak ones. In our work, we investigate semi-supervised learning with few instance-level labeled images and many unlabeled images. Treating the training on unlabeled images as a latent variable model, we propose an Expectation-Maximization method for semi-supervised object detection with unlabeled images. We estimate the latent labels and optimize the model for both the classification and the localization parts of object detection. Implementing our method on the one-stage object detection model YOLO, we show empirically on the Pascal VOC dataset that, like weakly labeled images, unlabeled images can also boost the performance of the detector.
Download

Paper Nr: 85
Title:

Performance Evaluation of Real-time and Scale-invariant LoG Operators for Text Detection

Authors:

Dinh C. Nguyen, Mathieu Delalandre, Donatello Conte and The A. Pham

Abstract: This paper presents a review of the state of the art and a performance evaluation of real-time text detection methods, with particular focus on the family of Laplacian of Gaussian (LoG) operators with scale invariance. The computational complexity of the operators is discussed, and an adaptation to text detection is obtained through the scale-space representation. In addition, a groundtruthing process and a characterization protocol are proposed; the performance evaluation is driven by repeatability and processing time. The evaluation highlights a near-exact approximation by real-time operators at one to two orders of magnitude lower execution time. The real-time operators are adapted to recent camera devices to process high resolution images. Perspectives are provided for operator robustness, optimization and characterization of the detection strategy.
Download
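
For reference, the scale-normalized Laplacian of Gaussian underlying this operator family is, in standard scale-space notation (not a formula quoted from the paper):

    \nabla^2_{\mathrm{norm}} L(x, y; \sigma) = \sigma^2 \left( \frac{\partial^2 L}{\partial x^2} + \frac{\partial^2 L}{\partial y^2} \right), \qquad L(\cdot\,;\sigma) = g_\sigma * I,

where g_\sigma is a Gaussian kernel of standard deviation \sigma and I the input image; the \sigma^2 factor makes responses comparable across scales, so structures are detected at local extrema over both space and scale.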

Paper Nr: 88
Title:

Effective 2D/3D Registration using Curvilinear Saliency Features and Multi-Class SVM

Authors:

Saddam Abdulwahab, Hatem A. Rashwan, Julian Cristiano, Sylvie Chambon and Domenec Puig

Abstract: Registering a single intensity image to a 3D geometric model represented by a set of depth images is still a challenge, since depth images represent only the shape of the objects, while the intensity image depends on viewpoint, texture and lighting conditions. Thus, it is essential to first bring the 2D and 3D representations to common features and then match them to find the correct view. In this paper, we use the concept of curvilinear saliency, related to curvature estimation, to extract the shape information of both modalities. However, matching the features extracted from an intensity image to thousands of depth images rendered from a 3D model is an exhausting process. Consequently, we propose to cluster the depth images into groups based on a Clustering Rule-based Algorithm (CRA). In order to reduce the matching space between the intensity and depth images, a 2D/3D registration framework based on a multi-class Support Vector Machine (SVM) is then used: the SVM predicts the class (i.e., a set of depth images) closest to the input image. Finally, the closest view is refined and verified using RANSAC. The effectiveness of the proposed registration approach has been evaluated on the public PASCAL3D+ dataset. The obtained results show that the proposed algorithm provides high precision, with an average of 88%.
Download

Paper Nr: 92
Title:

Indexed Operations for Non-rectangular Lattices Applied to Convolutional Neural Networks

Authors:

Mikael Jacquemont, Luca Antiga, Thomas Vuillaume, Giorgia Silvestri, Alexandre Benoit, Patrick Lambert and Gilles Maurin

Abstract: The present paper introduces convolution and pooling operators for indexed images. These operators can be used on images that do not provide Cartesian grids of pixels, as long as a list of neighbor indices can be provided for each pixel. They are foreseen to be useful for convolutional neural networks (CNN) applied to special sensors, especially in science, without requiring image pre-processing. The present work explains the method and its implementation in the Pytorch framework, and shows an application of the indexed kernels to the classification of images with hexagonal lattices using CNNs. The obtained results show that the method gives the same performance as standard convolution kernels. Indexed convolution thus makes deep neural network frameworks more general and capable of addressing unconventional image lattices. The current implementation, as well as code to reproduce the experiments described in this paper, are made available as open-source resources on the repository www.github.com/IndexedConv.
Download
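
The core idea, convolution driven by an explicit per-pixel neighbor table rather than a Cartesian grid, can be illustrated with a simplified NumPy sketch (the authors' actual implementation is a Pytorch extension; this toy 1D lattice is purely illustrative):

    import numpy as np

    def indexed_conv(pixels, neighbors, weights, bias=0.0):
        # pixels:    (N,) pixel values in arbitrary lattice order
        # neighbors: (N, K) indices of each pixel's K neighbors (-1 = missing)
        # weights:   (K,) one kernel weight per neighbor slot
        padded = np.append(pixels, 0.0)   # index -1 maps to a zero pad
        patches = padded[neighbors]       # (N, K) gathered neighborhoods
        return patches @ weights + bias

    # Toy 1D "lattice": each pixel sees (left, self, right)
    pix = np.array([1.0, 2.0, 3.0, 4.0])
    nbr = np.array([[-1, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, -1]])
    print(indexed_conv(pix, nbr, np.array([0.25, 0.5, 0.25])))  # [1. 2. 3. 2.75]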

Paper Nr: 94
Title:

Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders

Authors:

Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger and Carsten Steger

Abstract: Convolutional autoencoders have emerged as popular methods for unsupervised defect segmentation on image data. Most commonly, this task is performed by thresholding a per-pixel reconstruction error based on an ℓp-distance. This procedure, however, leads to large residuals whenever the reconstruction includes slight localization inaccuracies around edges. It also fails to reveal defective regions that have been visually altered when intensity values stay roughly consistent. We show that these problems prevent these approaches from being applied to complex real-world scenarios and that they cannot be easily avoided by employing more elaborate architectures such as variational or feature matching autoencoders. We propose to use a perceptual loss function based on structural similarity that examines inter-dependencies between local image regions, taking into account luminance, contrast, and structural information, instead of simply comparing single pixel values. It achieves significant performance gains on a challenging real-world dataset of nanofibrous materials and a novel dataset of two woven fabrics over state-of-the-art approaches for unsupervised defect segmentation that use per-pixel reconstruction error metrics.
Download
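
A compact sketch of an SSIM-based reconstruction loss (PyTorch, single-channel images in [0, 1]; the 11x11 window and the constants follow common SSIM defaults and are not necessarily the paper's exact settings):

    import torch
    import torch.nn.functional as F

    def ssim_loss(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
        # Local means/variances over sliding windows (via average pooling),
        # so the loss reflects luminance, contrast and structure of local
        # regions rather than per-pixel differences.
        mu_x = F.avg_pool2d(x, window, stride=1)
        mu_y = F.avg_pool2d(y, window, stride=1)
        var_x = F.avg_pool2d(x * x, window, stride=1) - mu_x ** 2
        var_y = F.avg_pool2d(y * y, window, stride=1) - mu_y ** 2
        cov = F.avg_pool2d(x * y, window, stride=1) - mu_x * mu_y
        ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
        return 1.0 - ssim.mean()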

Paper Nr: 102
Title:

Novelty Detection for Person Re-identification in an Open World

Authors:

George Galanakis, Xenophon Zabulis and Antonis A. Argyros

Abstract: A fundamental assumption in most contemporary person re-identification research is that all query persons to be re-identified belong to a closed gallery of known persons, i.e., they have been observed and a representation of their appearance is available. For several real-world applications, this closed-world assumption does not hold, as image queries may contain people that the re-identification system has never observed before. In this work, we remove this constraining assumption by introducing a novelty detection mechanism that decides whether a person in a query image exists in the gallery. The re-identification of persons existing in the gallery is easily achieved based on the person representation employed by the novelty detection mechanism. The proposed method operates on a hybrid person descriptor that consists of both supervised (learnt) and unsupervised (hand-crafted) components. A series of experiments on public, state-of-the-art datasets and in comparison with state-of-the-art methods shows that the proposed approach is very accurate in identifying persons that have not been observed before, and that this has a positive impact on re-identification accuracy.
Download

Paper Nr: 108
Title:

ID-Softmax: A Softmax-like Loss for ID Face Recognition

Authors:

Yan Kong, Fuzhang Wu, Feiyue Huang and Yanjun Wu

Abstract: Face recognition between photos from identification documents (ID, citizen card or passport card) and daily photos, named FRBID (Zhang et al., 2017), is widely used in real-world scenarios. However, the traditional Softmax loss of deep CNNs usually lacks discriminative power for FRBID. To address this problem, in this paper we first revisit recent progress on face recognition losses, and give a theoretical and experimental analysis of why Softmax-like losses work poorly on ID-daily face recognition. We then propose a novel approach named ID-Softmax, which uses ID face features as class 'agents' to guide deep CNNs to learn highly discriminative features between ID photos and daily photos. To promote ID-daily face recognition, we collected a large dataset, ID74K, which includes 74,187 identities with corresponding ID photos and daily photos. To test our approach, we evaluate the feature distribution and face verification performance on ID74K. In experiments, we achieve the best performance compared with other state-of-the-art methods, which verifies the effectiveness of the proposed ID-Softmax loss.
Download

Paper Nr: 138
Title:

Unconstrained Face Verification and Open-World Person Re-identification via Densely-connected Convolution Neural Network

Authors:

Donghwuy Ko, Jongmin Yu, Ahmad M. Sheri and Moongu Jeon

Abstract: Although various methods based on hand-crafted features and deep learning have been developed for a variety of applications in the past few years, distinguishing untrained identities in the testing phase remains a challenging task. To overcome these difficulties, we propose a novel representation learning approach for unconstrained face verification and open-world person re-identification tasks. Our approach aims to reinforce the discriminative power of learned features by assigning a weight to each training sample. We demonstrate the efficiency of the proposed method by testing on publicly available datasets. The experimental results for both face verification and person re-identification show that its performance is comparable to state-of-the-art methods based on hand-crafted features and general convolutional neural networks.
Download

Paper Nr: 204
Title:

Learning Task-specific Activation Functions using Genetic Programming

Authors:

Mina Basirat and Peter M. Roth

Abstract: Deep Neural Networks have been shown to be beneficial for a variety of tasks, in particular allowing for end-to-end learning and reducing the requirement for manual design decisions. However, many parameters still have to be chosen manually in advance, raising the need to optimize them. One important, but often ignored, parameter is the selection of a proper activation function. In this paper, we tackle this problem by learning task-specific activation functions with ideas from genetic programming. We propose to construct piece-wise activation functions (for the negative and the positive part) and introduce new genetic operators to combine functions in a more efficient way. The experimental results for multi-class classification demonstrate that specific activation functions are learned for different tasks, also outperforming widely used generic baselines.
Download

Paper Nr: 214
Title:

Exploring Deep Spiking Neural Networks for Automated Driving Applications

Authors:

Sambit Mohapatra, Heinrich Gotzig, Senthil Yogamani, Stefan Milz and Raoul Zöllner

Abstract: Neural networks have become the standard model for various computer vision tasks in automated driving, including semantic segmentation, moving object detection, depth estimation and visual odometry. The main flavors of neural networks in common use are convolutional (CNN) and recurrent (RNN). In spite of rapid progress in embedded processors, power consumption and cost are still bottlenecks. Spiking Neural Networks (SNNs) are gradually progressing towards low-power, event-driven hardware architectures with a potential for high efficiency. In this paper, we explore the role of deep spiking neural networks (SNN) for automated driving applications. We provide an overview of progress on SNNs and argue how they can be a good fit for automated driving applications.
Download

Paper Nr: 218
Title:

Visitors Localization in Natural Sites Exploiting EgoVision and GPS

Authors:

Filippo M. Milotta, Antonino Furnari, Sebastiano Battiato, Maria De Salvo, Giovanni Signorello and Giovanni M. Farinella

Abstract: Localization in outdoor contexts such as parks and natural reserves can be used to augment the visitors’ experience and to provide the site manager with valid analytics to improve how the site is used. In this paper, we address the problem of visitor localization in natural sites by exploiting both egocentric vision and GPS data. To this aim, we gathered a dataset of first-person videos in the Botanical Garden of the University of Catania. Along with the videos, we also acquired GPS coordinates. The data have been acquired by 12 different users, each walking all around the garden for an average of 30 minutes (i.e., a total of about 6 hours of recording). Using the collected dataset, we show that localizing visitors based solely on GPS data is not sufficient to understand their location in a natural site. We hence investigate how to exploit visual data for localization by casting the problem as one of classifying images among the different contexts of the natural site. Our investigation highlights that visual information can be leveraged to achieve better localization, and that egocentric vision and GPS can be exploited jointly to improve accuracy.
Download

Paper Nr: 220
Title:

Exploring Applications of Deep Reinforcement Learning for Real-world Autonomous Driving Systems

Authors:

Victor Talpaert, Ibrahim Sobh, B. R. Kiran, Patrick Mannion, Senthil Yogamani, Ahmad El-Sallab and Patrick Perez

Abstract: Deep Reinforcement Learning (DRL) has become increasingly powerful in recent years, with notable achievements such as DeepMind’s AlphaGo. It has been successfully deployed in commercial vehicles like Mobileye’s path planning system. However, the vast majority of work on DRL focuses on toy examples in controlled synthetic car simulator environments such as TORCS and CARLA. In general, DRL is still in its infancy in terms of usability in real-world applications. Our goal in this paper is to encourage real-world deployment of DRL in various autonomous driving (AD) applications. We first provide an overview of the tasks in autonomous driving systems, reinforcement learning algorithms and applications of DRL to AD systems. We then discuss the challenges which must be addressed to enable further progress towards real-world deployment.
Download

Paper Nr: 222
Title:

A View-invariant Framework for Fast Skeleton-based Action Recognition using a Single RGB Camera

Authors:

Enjie Ghorbel, Konstantinos Papadopoulos, Renato Baptista, Himadri Pathak, Girum Demisse, Djamila Aouada and Björn Ottersten

Abstract: View-invariant action recognition using a single RGB camera represents a very challenging topic due to the lack of 3D information in RGB images. Recent advances in deep learning have made it possible to extract a 3D skeleton from a single RGB image. Taking advantage of this impressive progress, we propose a simple framework for fast and view-invariant action recognition using a single RGB camera. The proposed pipeline can be seen as the association of two key steps. The first step is the estimation of a 3D skeleton from a single RGB image using a CNN-based pose estimator such as VNect. The second aims at computing view-invariant skeleton-based features from the estimated 3D skeletons. Experiments are conducted on two well-known benchmarks, namely the IXMAS and Northwestern-UCLA datasets. The obtained results prove the validity of our concept, which suggests a new way to address the challenge of RGB-based view-invariant action recognition.
Download

Paper Nr: 226
Title:

Limitations of Metric Loss for the Estimation of Joint Translation and Rotation

Authors:

Philippe S. Roman, Pascal Desbarats, Jean-Philippe Domenger and Axel Buendia

Abstract: Localizing objects is a key challenge for robotics, augmented reality and mixed reality applications. Images taken in the real world feature many objects with challenging factors such as occlusions, motion blur and changing lights. In manufacturing industry scenes, a large majority of objects are poorly textured or highly reflective, and they often present symmetries, which makes the localization task even more complicated. PoseNet is a deep neural network based on GoogleNet that predicts camera poses in indoor rooms and outdoor streets. We evaluate this method for the problem of industrial object pose estimation by training the network on the T-LESS dataset. Our experiments demonstrate that PoseNet is able to predict translation and rotation separately with high accuracy. However, they also show that it is not able to learn translation and rotation jointly: one of the two modalities is either not learned by the network, or forgotten during training while the other is being learned. This justifies the need, in future work, for other formulations of the loss as well as other architectures in order to solve the general pose estimation problem.
Download
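
For context, the joint objective whose limitations are analyzed here is, in PoseNet's original formulation (Kendall et al., 2015), a weighted sum of the translation and rotation errors:

    \mathcal{L}(I) = \lVert \hat{\mathbf{x}} - \mathbf{x} \rVert_2 + \beta \, \Bigl\lVert \hat{\mathbf{q}} - \frac{\mathbf{q}}{\lVert \mathbf{q} \rVert} \Bigr\rVert_2

where \mathbf{x} is the camera translation, \mathbf{q} the orientation quaternion, and \beta a hand-tuned factor balancing the two terms; it is precisely this coupling that the experiments show to be fragile.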

Paper Nr: 228
Title:

Real Time Eye Gaze Tracking System using CNN-based Facial Features for Human Attention Measurement

Authors:

Oliver Lorenz and Ulrike Thomas

Abstract: Understanding human attention in various interactive scenarios is an important task for human-robot collaboration. Human communication with robots includes intuitive nonverbal behaviour such as body postures and gestures. Multiple communication channels can be used to obtain an understandable interaction between humans and robots. Usually, humans communicate in the direction of their eye gaze and head orientation. In this paper, a new tracking system based on two cascaded CNNs is presented for eye gaze and head orientation tracking, enabling robots to measure the willingness of humans to interact via eye contact and gaze orientation. Based on the two consecutively cascaded CNNs, facial features are recognized, first in the face and then in the eye regions. These features are detected by a geometrical method and deliver the orientation of the head, from which the eye gaze direction is determined. Our method distinguishes between frontal and side faces, and with a dedicated approach for each condition, the eye gaze is detected even in extreme situations. The applied CNNs have been trained on many different datasets and annotations, improving the reliability and accuracy of the tracking system introduced here, which outperforms previous detection algorithms. Our system operates on commonly used RGB-D images and is implemented on a GPU to achieve real-time performance. The evaluation shows that our approach operates accurately in challenging dynamic environments.
Download

Paper Nr: 233
Title:

Semi-automatic Training Data Generation for Semantic Segmentation using 6DoF Pose Estimation

Authors:

Shuichi Akizuki and Manabu Hashimoto

Abstract: In this research, we propose a low-cost method to generate large volumes of real images as training data for semantic segmentation. The method first estimates the six-degree-of-freedom (6DoF) pose of objects in images obtained with an RGB-D sensor, and then maps labels that have been pre-assigned to 3D models onto the images. It also captures additional input images while the camera is moving, and is able to map labels to these other input images based on the relative motion of the viewpoint. This makes it possible to obtain large volumes of ground truth data for real images. The proposed method has been used to create a new publicly available dataset for affordance segmentation, called the NEDO Part-Affordance Dataset v1, which has been used to benchmark some typical semantic segmentation algorithms.
Download

Paper Nr: 6
Title:

Active Learning for Deep Object Detection

Authors:

Clemens-Alexander Brust, Christoph Käding and Joachim Denzler

Abstract: The great success that deep models have achieved in the past is mainly owed to large amounts of labeled training data. However, the acquisition of labeled data for new tasks aside from existing benchmarks is both challenging and costly. Active learning can make the process of labeling new data more efficient by selecting unlabeled samples which, when labeled, are expected to improve the model the most. In this paper, we combine a novel method of active learning for object detection with an incremental learning scheme (Käding et al., 2016b) to enable continuous exploration of new unlabeled datasets. We propose a set of uncertainty-based active learning metrics suitable for most object detectors. Furthermore, we present an approach to leverage class imbalances during sample selection. All methods are evaluated systematically in a continuous exploration context on the PASCAL VOC 2012 dataset (Everingham et al., 2010).
Download
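
One common uncertainty score for detections, the complement of the margin between the two most probable classes aggregated over an image (a representative example, not necessarily one of the exact metrics proposed in the paper):

    import numpy as np

    def image_uncertainty(det_class_probs):
        # Score an unlabeled image by its least-confident detection.
        # 1 - (p1 - p2) is the complement of the margin between the two most
        # probable classes; higher means a more informative image to label.
        if not det_class_probs:
            return 0.0
        margins = []
        for p in det_class_probs:
            top2 = np.sort(p)[-2:]
            margins.append(1.0 - (top2[1] - top2[0]))
        return max(margins)

    print(image_uncertainty([np.array([0.1, 0.5, 0.4]),
                             np.array([0.9, 0.05, 0.05])]))  # 0.9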

Paper Nr: 21
Title:

Convolutional Neural Network for Detection and Classification with Event-based Data

Authors:

Joubert Damien, Konik Hubert and Chausse Frederic

Abstract: Mainly inspired by biological perception systems, event-based sensors provide data with many advantages, such as timing precision, data compression and low energy consumption. In this work, we analyze how these data can be used to detect and classify cars in the case of front-camera automotive applications. The basic idea is to merge state-of-the-art deep learning algorithms with event-based data integrated into artificial frames. When this preprocessing method is used for viewing purposes, it suggests that the shape of the targets can be extracted, but only when the relative speed between the camera and the targets is high enough. Event-based sensors seem to provide a more robust description of the target’s trajectory than conventional frames, the object being described only by its moving edges, and independently of lighting conditions. We also highlight how features trained on conventional grey-level images can be transferred to event-based data to efficiently detect cars in pseudo-images.
Download

Paper Nr: 23
Title:

Indoor Scenes Understanding for Visual Prosthesis with Fully Convolutional Networks

Authors:

Melani Sanchez-Garcia, Ruben Martinez-Cantin and Jose J. Guerrero

Abstract: One of the biggest problems for blind people is recognizing environments. Prosthetic vision is a promising new technology that provides visual perception to people with some kinds of blindness by transforming an image into a phosphene pattern to be sent to the implant. However, current prosthetic implants have a limited ability to generate images with the detail required for understanding an environment. Computer vision plays a key role in providing prosthetic vision that alleviates key restrictions of blindness. In this work, we propose a new approach to build a schematic representation of indoor environments for phosphene images. We combine computer vision and deep learning techniques to extract structural features in a scene and recognize different indoor environments, tailored to prosthetic vision. Our method extracts structurally informative edges, which can underpin many computer vision tasks such as recognition and scene understanding and are key for conveying the scene structure. We also apply an object detection algorithm, using an accurate machine learning model capable of localizing and identifying multiple objects in a single image. Finally, we represent the extracted information using a phosphene pattern. The effectiveness of this approach is tested with real data from indoor environments with eleven volunteers.
Download

Paper Nr: 32
Title:

Improving Video Object Detection by Seq-Bbox Matching

Authors:

Hatem Belhassen, Heng Zhang, Virginie Fresse and El-Bay Bourennane

Abstract: Video object detection has drawn more and more attention in recent years. Compared with object detection in images, object detection in video is more useful in many practical applications, e.g. self-driving cars, smart video surveillance, etc. Building a fast, reliable and low-cost video-based object detection system is highly desirable for these applications. In this work, we propose a novel, simple and highly effective box-level post-processing method to improve the accuracy of video object detection. The proposed method applies in both online and offline settings. Our experiments on the ImageNet object detection from video (VID) dataset show that our method brings important accuracy gains, especially for the more challenging fast-moving objects, with quite light computational overhead in both settings. Applied to YOLOv3, our system achieves so far the best speed/accuracy trade-off for offline video object detection and competitive detection improvements for online object detection.
Download

Paper Nr: 39
Title:

Saw-Mark Defect Detection in Heterogeneous Solar Wafer Images using GAN-based Training Samples Generation and CNN Classification

Authors:

Du-Ming Tsai, Morris K. Fan, Yi-Quan Huang and Wei-Yao Chiu

Abstract: This paper presents a machine vision-based scheme to automatically detect saw-mark defects on solar wafer surfaces. A saw-mark defect is a severe flaw introduced when cutting a silicon ingot into wafers. A multicrystalline solar wafer surface presents crystal grains of random shapes, sizes and orientations, resulting in a heterogeneous texture that makes the automatic visual inspection task extremely difficult. Deep learning is an ideal choice to tackle the problem, but it requires a huge amount of positive (defect-free) and negative (defective) samples for training, and negative samples are generally insufficient in a manufacturing process. We thus apply a GAN-based model to generate defective samples for training, and then use the true defect-free samples and the synthesized defective samples to train a CNN model. This addresses the imbalanced data problem arising in manufacturing inspection. Preliminary experiments show promising results of the proposed method for detecting various saw-mark defects, including black line, white line, and impurity, in multicrystalline solar wafers.
Download

Paper Nr: 46
Title:

Subjective Annotations for Vision-based Attention Level Estimation

Authors:

Andrea Coifman, Péter Rohoska, Miklas S. Kristoffersen, Sven E. Shepstone and Zheng-Hua Tan

Abstract: Attention level estimation systems have high potential in many use cases, such as human-robot interaction, driver modeling and smart home systems, since being able to measure a person’s attention level opens the possibility of natural interaction between humans and computers. The topic of estimating a human’s visual focus of attention has been actively addressed recently in the field of HCI. However, most of these previous works do not consider attention as a subjective, cognitive attentive state. New research within the field also faces a lack of annotated datasets regarding attention level in a given context. The novelty of our work is two-fold: first, we introduce a new annotation framework that tackles the subjective nature of attention level and use it to annotate more than 100,000 images with three attention levels; second, we introduce a novel method to estimate attention levels, relying purely on geometric features extracted from RGB and depth images, and evaluate it with a deep learning fusion framework. The system achieves an overall accuracy of 80.02%. Our framework and attention level annotations are made publicly available.
Download

Paper Nr: 49
Title:

Simultaneous Estimation of Facial Landmark and Attributes with Separation Multi-task Networks

Authors:

Ryo Matsui, Takayoshi Yamashita and Hironobu Fujiyoshi

Abstract: Multi-task learning is a machine learning approach in which multiple tasks are solved simultaneously. This approach can improve learning efficiency and prediction accuracy over task-specific models, and it has been used successfully across various applications such as natural language processing and computer vision. Multi-task learning consists of shared layers and task-specific layers: the shared layers extract common low-level features for all tasks, while the task-specific layers diverge from the shared layers and extract specific high-level features for each task. Hence, conventional multi-task learning architectures cannot extract low-level task-specific features. In this work, we propose Separation Multi-task Networks, a novel multi-task learning architecture that extracts shared features and task-specific features at various layers. Our proposed method extracts low- to high-level task-specific features by feeding task-specific layers in parallel to each shared layer. Moreover, we employ channel-wise convolution when concatenating the feature maps of shared layers and task-specific layers. This convolution allows concatenation even if the layers have different numbers of feature map channels. In experiments on the CelebA dataset, our proposed method outperformed conventional methods at facial landmark detection and facial attribute estimation.
Download
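The core architectural idea, task-specific branches running in parallel with the shared layers and channel alignment at each concatenation, can be sketched briefly. The following PyTorch snippet is a minimal illustration under our own assumptions (a 1x1 convolution for channel alignment, arbitrary layer widths); it is not the authors' exact Separation Multi-task Network.

```python
# Minimal sketch of a shared/task-specific block pair: the task branch
# runs in parallel to the shared branch and consumes the shared features
# after a 1x1 channel-aligning convolution. All sizes are illustrative.
import torch
import torch.nn as nn

class SeparationBlock(nn.Module):
    def __init__(self, shared_in, shared_out, task_in, task_out):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(shared_in, shared_out, 3, padding=1), nn.ReLU())
        # 1x1 convolution so shared features match the task branch width
        self.align = nn.Conv2d(shared_out, task_out, kernel_size=1)
        self.task = nn.Sequential(
            nn.Conv2d(task_in + task_out, task_out, 3, padding=1), nn.ReLU())

    def forward(self, x_shared, x_task):
        s = self.shared(x_shared)
        t = self.task(torch.cat([x_task, self.align(s)], dim=1))
        return s, t

# Stacking such blocks gives each task low- to high-level private features:
# block = SeparationBlock(shared_in=3, shared_out=32, task_in=3, task_out=16)
# s, t = block(image, image)
```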

Paper Nr: 50
Title:

Weighted Random Forest using Gaze Distributions Measured from Observers for Gender Classification

Authors:

Sayaka Yamaguchi, Masashi Nishiyama and Yoshio Iwai

Abstract: We propose a method to improve gender classification from pedestrian images using a random forest weighted by a gaze distribution. When training samples contain a bias in the background surrounding pedestrians, a random forest classifier may incorrectly include background attributes as discriminative features, thereby degrading the performance of gender classification on test samples. To solve this problem, we use a gaze distribution map measured from observers completing a gender classification task on pedestrian images. Our method uses the gaze distribution to assign weights when generating a random forest. Each decision tree of the random forest then extracts discriminative features from the regions corresponding to the predominant gaze locations. We investigated the effectiveness of our gaze-weighted random forest by comparing the following alternatives: assigning weights for feature selection, assigning weights for feature values, and assigning weights for information gains. We compare the gender classification results of our method with those of existing random forest methods. Experimental results show that our random forest using information gains weighted according to the gaze distribution significantly improved the accuracy of gender classification on a publicly available dataset.
Download
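The weighting mechanism can be illustrated with a small sketch of one of the compared alternatives, weighting feature selection by the gaze distribution. The flattened-pixel feature space, sklearn decision trees and sampling scheme below are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: bias each tree's candidate features towards pixels
# that observers looked at, then train an ordinary bagged ensemble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gaze_weighted_forest(X, y, gaze_map, n_trees=50, n_feats=100, seed=0):
    """X: (n_samples, n_pixels) flattened images; gaze_map: (n_pixels,)
    non-negative gaze weights accumulated over observers."""
    rng = np.random.default_rng(seed)
    p = gaze_map / gaze_map.sum()          # feature sampling distribution
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(X.shape[1], size=n_feats, replace=False, p=p)
        boot = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        tree = DecisionTreeClassifier().fit(X[boot][:, feats], y[boot])
        forest.append((tree, feats))
    return forest

def predict(forest, X):
    votes = np.mean([t.predict(X[:, f]) for t, f in forest], axis=0)
    return (votes > 0.5).astype(int)       # majority vote over trees
```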

Paper Nr: 57
Title:

Top-Down Human Pose Estimation with Depth Images and Domain Adaptation

Authors:

Nelson Rodrigues, Helena Torres, Bruno Oliveira, João Borges, Sandro Queirós, José Mendes, Jaime Fonseca, Victor Coelho and José H. Brito

Abstract: In this paper, a method for human pose estimation using ToF (Time-of-Flight) cameras is proposed. For this, a YOLO-based object detection approach was used to develop a top-down method. In the first stage, a network was developed to detect people in the image. In the second stage, a network was developed to estimate the joints of each person, using the image resulting from the first stage. We show that a deep learning network trained from scratch on ToF images yields better results than taking a deep neural network pretrained on RGB data and retraining it with ToF data. We also show that a top-down detector, with a person detector and a joint detector, works better than detecting the body joints over the entire image.
Download

Paper Nr: 62
Title:

DeepBall: Deep Neural-Network Ball Detector

Authors:

Jacek Komorowski, Grzegorz Kurzejamski and Grzegorz Sarwas

Abstract: The paper describes a deep network based object detector specialized for ball detection in long-shot videos. Due to its fully convolutional design, the method operates on images of any size and produces a ball confidence map encoding the position of the detected ball. The network uses the hypercolumn concept, where feature maps from different hierarchy levels of the deep convolutional network are combined and jointly fed to the convolutional classification layer. This boosts the detection accuracy, as a larger visual context around the object of interest is taken into account. The method achieves state-of-the-art results when tested on the publicly available ISSIA-CNR Soccer Dataset.
Download
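The hypercolumn idea is compact enough to sketch: feature maps from different depths are upsampled to a common resolution, concatenated, and fed to a 1x1 convolutional classifier. The layer sizes in the following PyTorch snippet are illustrative assumptions, not the DeepBall architecture.

```python
# Minimal fully convolutional detector with a hypercolumn head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypercolumnBallDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                    nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1),
                                    nn.ReLU(), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                                    nn.ReLU(), nn.MaxPool2d(2))
        # 1x1 convolution classifies the concatenated hypercolumn
        self.classifier = nn.Conv2d(16 + 32 + 64, 1, kernel_size=1)

    def forward(self, x):
        f1 = self.block1(x)                  # 1/2 resolution
        f2 = self.block2(f1)                 # 1/4 resolution
        f3 = self.block3(f2)                 # 1/8 resolution
        size = f1.shape[-2:]                 # common target resolution
        hyper = torch.cat(
            [f1,
             F.interpolate(f2, size=size, mode='bilinear', align_corners=False),
             F.interpolate(f3, size=size, mode='bilinear', align_corners=False)],
            dim=1)
        return torch.sigmoid(self.classifier(hyper))  # ball confidence map

# Fully convolutional, so any input size works:
# conf = HypercolumnBallDetector()(torch.rand(1, 3, 270, 480))
```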

Paper Nr: 72
Title:

Image Based Localization with Simulated Egocentric Navigations

Authors:

Santi A. Orlando, Antonino Furnari, Sebastiano Battiato and Giovanni M. Farinella

Abstract: Current methods for Image-Based Localization (IBL) require the collection and labelling of large datasets of images or videos for training purposes. Such a procedure generally requires significant effort, involving the use of dedicated sensors or the employment of structure-from-motion techniques. To overcome the difficulties of acquiring a dataset suitable for training models to study IBL, we propose a tool to generate simulated egocentric data starting from 3D models of real indoor environments. The generated data are automatically associated with the 3D camera pose information to be exploited during training, hence avoiding the need for “manual” labelling. To assess the effectiveness of the proposed tool, we generate and release a huge dataset of egocentric images using a 3D model of a real environment belonging to the S3DIS dataset. We also perform a benchmark study for 3 Degrees of Freedom (3DoF) indoor localization by considering an IBL pipeline based on image retrieval and a triplet network. Results highlight that the generated dataset is useful for studying the IBL problem and makes it possible to perform experiments with different settings in a simple way.
Download

Paper Nr: 77
Title:

Pedestrian Intensive Scanning for Active-scan LIDAR

Authors:

Taiki Yamamoto, Fumito Shinmura, Daisuke Deguchi, Yasutomo Kawanishi, Ichiro Ide and Hiroshi Murase

Abstract: In recent years, LIDAR has been playing an important role as a sensor for understanding the environment of a vehicle’s surroundings. Active-scan LIDAR, which can control the laser irradiation direction arbitrarily and rapidly, is being actively developed. In comparison with conventional uniform-scan LIDAR (e.g. Velodyne HDL-64e), Active-scan LIDAR enables us to densely scan even distant pedestrians. In addition, if appropriately controlled, this sensor has the potential to reduce unnecessary laser irradiations towards non-target objects. Although there are some preliminary studies on pedestrian scanning strategies for Active-scan LIDARs, to the best of our knowledge, an efficient method has not been realized yet. Therefore, this paper proposes a novel pedestrian scanning method based on orientation-aware pedestrian likelihood estimation, using orientation-wise pedestrian shape models with the local distribution of measured points. To evaluate the effectiveness of the proposed method, we conducted experiments by simulating Active-scan LIDAR using point clouds from the KITTI dataset. Experimental results showed that the proposed method outperforms the conventional methods.
Download

Paper Nr: 78
Title:

Domain Adaptation for Pedestrian DCNN Detector toward a Specific Scene and an Embedded Platform

Authors:

Nada Hammami, Ala Mhalla and Alexis Landrault

Abstract: Nowadays, the analysis and understanding of traffic scenes has become a topic of great interest in several computer vision applications. Despite the presence of robust detection methods for multiple categories of objects, the performance of detectors decreases when they are applied to a specific scene, due to a number of constraints such as the different categories of objects, the recording time of the scene (rush hour, ordinary time), the type of traffic (simple, dense) and the type of transport infrastructure. To deal with this problem, the main idea of the proposed work is to develop a domain adaptation technique to automatically adapt detectors based on deep convolutional neural networks towards a specific scene, and to calibrate the network parameters in order to deploy it on an embedded platform. Results are presented for the proposed adapted detector in terms of global performance (mAP) and execution time on an NVIDIA Jetson TX2 board.
Download

Paper Nr: 80
Title:

Outdoor Scenes Pixel-wise Semantic Segmentation using Polarimetry and Fully Convolutional Network

Authors:

Marc Blanchon, Olivier Morel, Yifei Zhang, Ralph Seulin, Nathan Crombez and Désiré Sidibé

Abstract: In this paper, we propose a novel method for pixel-wise scene segmentation using polarimetry. To address the difficulty of detecting highly reflective areas such as water and windows, we use the angle and degree of polarization of these areas, obtained by processing images from a polarimetric camera. A deep learning framework, based on an encoder-decoder architecture, is used for the segmentation of regions of interest. Different augmentation methods have been developed to obtain a sufficient amount of data while preserving the physical properties of the polarimetric images. Moreover, we introduce a new dataset comprising both RGB and polarimetric images with manual ground truth annotations for seven different classes. Experimental results on this dataset show that deep learning can benefit from polarimetry and obtain better segmentation results compared to the RGB modality. In particular, we obtain improvements of 38.35% and 22.92% in accuracy for segmenting windows and cars, respectively.
Download

Paper Nr: 81
Title:

Exploration of Deep Learning-based Multimodal Fusion for Semantic Road Scene Segmentation

Authors:

Yifei Zhang, Olivier Morel, Marc Blanchon, Ralph Seulin, Mojdeh Rastgoo and Désiré Sidibé

Abstract: Deep neural networks have been frequently used for semantic scene understanding in recent years. Effective and robust segmentation of outdoor scenes is a prerequisite for the safe navigation of autonomous vehicles. In this paper, our aim is to find the best exploitation of different imaging modalities for road scene segmentation, as opposed to using a single RGB modality. We explore deep learning-based early and late fusion patterns for semantic segmentation, and propose a new multi-level feature fusion network. Given a pair of aligned multimodal images, the network achieves faster convergence and incorporates more contextual information. In particular, we introduce a first-of-its-kind dataset, which contains aligned raw RGB images and polarimetric images, together with manually labeled ground truth. The use of polarization cameras is a sensory augmentation that can significantly enhance image understanding capabilities, in particular for the detection of highly reflective areas such as glass and water. Experimental results suggest that our proposed multimodal fusion network outperforms unimodal networks and two typical fusion architectures.
Download

Paper Nr: 96
Title:

Egocentric Point of Interest Recognition in Cultural Sites

Authors:

Francesco Ragusa, Antonino Furnari, Sebastiano Battiato, Giovanni Signorello and Giovanni M. Farinella

Abstract: We consider the problem of the detection and recognition of points of interest in cultural sites. We observe that a “point of interest” in a cultural site may be either an object or an environment, and highlight that the use of an object detector is beneficial for recognizing points of interest which occupy a small part of the frame. To study the role of objects in the recognition of points of interest, we augment the labelling of the UNICT-VEDI dataset to include bounding box annotations for 57 points of interest. We hence compare two approaches to the recognition of points of interest. The first method is based on processing the whole frame during recognition. The second method employs a YOLO object detector and a selection procedure to determine the currently observed point of interest. Our experiments suggest that further improvements in point of interest recognition can be achieved by fusing the two methodologies. Indeed, the results show the complementarity of the two approaches on the UNICT-VEDI dataset.
Download

Paper Nr: 98
Title:

Design of Real-time Semantic Segmentation Decoder for Automated Driving

Authors:

Arindam Das, Saranya Kandan, Senthil Yogamani and Pavel Křížek

Abstract: Semantic segmentation remains a computationally intensive algorithm for embedded deployment even with the rapid growth of computation power. Thus, efficient network design is a critical aspect, especially for applications like automated driving which require real-time performance. Recently, there has been a lot of research on designing efficient encoders that are mostly task agnostic. Unlike in image classification and bounding-box object detection, the decoders are computationally expensive as well for the semantic segmentation task. In this work, we focus on the efficient design of the segmentation decoder and assume that an efficient encoder is already designed to provide shared features for a multi-task learning system. We design a novel efficient non-bottleneck layer and a family of decoders which fit into a small run-time budget, using VGG10 as an efficient encoder. We demonstrate on our dataset that experimentation with various design choices led to an improvement of 10% over the baseline performance.
Download

Paper Nr: 110
Title:

Human Action Recognition using Multi-Kernel Learning for Temporal Residual Network

Authors:

Saima Nazir, Yu Qian, Muhammad H. Yousaf, Sergio A. Velastin, Ebroul Izquierdo and Eduard Vazquez

Abstract: Deep learning has led to a series of breakthroughs in the human action recognition field. Given the powerful representational ability of residual networks (ResNet), performance in many computer vision tasks, including human action recognition, has improved. Motivated by the success of ResNet, we use the residual network and its variations to obtain feature representations. Bearing in mind the importance of appearance and motion information for action representation, our network utilizes both for feature extraction. Appearance and motion features are further fused for action classification using a multi-kernel support vector machine (SVM). We also investigate the fusion of dense trajectories with the proposed network to boost the network performance. We evaluate our proposed methods on a benchmark dataset (HMDB-51), and the results show that multi-kernel learning performs better than fusing the classification scores from the deep network’s SoftMax layer. Our proposed method also performs well compared to recent state-of-the-art methods.
Download
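The fusion step can be sketched as a weighted combination of an appearance kernel and a motion kernel fed to an SVM with a precomputed kernel. The RBF kernels and the fixed weight beta below are illustrative assumptions; multi-kernel learning would instead optimize such weights.

```python
# Minimal sketch of kernel-level fusion of appearance and motion features.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def fused_kernel(A1, A2, M1, M2, beta=0.5):
    """Weighted sum of an appearance kernel and a motion kernel."""
    return beta * rbf_kernel(A1, A2) + (1 - beta) * rbf_kernel(M1, M2)

# app_train/mot_train: appearance and motion features of training clips
def train(app_train, mot_train, y_train):
    K = fused_kernel(app_train, app_train, mot_train, mot_train)
    return SVC(kernel='precomputed').fit(K, y_train)

def predict(clf, app_test, mot_test, app_train, mot_train):
    K = fused_kernel(app_test, app_train, mot_test, mot_train)
    return clf.predict(K)
```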

Paper Nr: 116
Title:

Micro-expression Recognition Under Low-resolution Cases

Authors:

Guifeng Li, Jingang Shi, Jinye Peng and Guoying Zhao

Abstract: Micro-expression is an essential non-verbal behavior that can faithfully express a human’s hidden emotions. It has a wide range of applications in national security and computer-aided diagnosis, which encourages us to conduct research on automatic micro-expression recognition. However, images captured from surveillance video easily suffer from low quality, which causes difficulties in real applications. Due to the low quality of captured images, existing algorithms are not able to perform as well as expected. To address this problem, we conduct a comprehensive study of the micro-expression recognition problem under low-resolution cases with a face hallucination method. The experimental results show that the proposed framework obtains promising results on micro-expression recognition under low-resolution cases.
Download

Paper Nr: 125
Title:

Hard Negative Mining from in-Vehicle Camera Images based on Multiple Observations of Background Patterns

Authors:

Masashi Hontani, Haruya Kyutoku, David Wong, Daisuke Deguchi, Yasutomo Kawanishi, Ichiro Ide and Hiroshi Murase

Abstract: In recent years, the demand for highly accurate pedestrian detectors has increased due to the development of advanced driving support systems. For the training of an accurate pedestrian detector, it is important to collect a large number of training samples. To this end, this paper proposes a “hard negative” mining method to automatically extract background images which tend to be erroneously detected as pedestrians. Negative samples are selected based on the assumption that frequent patterns observed multiple times in the same location are most likely parts of the background scene. In an evaluation using in-vehicle camera images captured along the same route, we confirmed that the proposed method can automatically and accurately collect false positive samples. We also confirmed that a highly accurate detector can be constructed using the additional negative samples.
Download

Paper Nr: 144
Title:

Plant Diseases Recognition from Digital Images using Multichannel Convolutional Neural Networks

Authors:

Andre S. Abade, Ana G. S. de Almeida and Flavio B. Vidal

Abstract: Plant diseases are considered one of the main factors influencing food production, and, to minimize production losses, it is essential that crop diseases be detected and recognized quickly. Recent studies use deep learning techniques to diagnose plant diseases in an attempt to solve the main problem: a fast, low-cost and efficient methodology for diagnosing plant diseases. In this work, we propose the use of classical convolutional neural network (CNN) models trained from scratch and a Multichannel CNN (M-CNN) approach to train and evaluate on the PlantVillage dataset, containing several plant diseases and more than 54,000 images (divided into 38 disease classes across 14 plant species). In both proposed approaches, our results achieved better accuracies than the state of the art, with faster convergence and without the use of transfer learning techniques. Our multichannel approach also demonstrates that the three versions of the dataset (colored, grayscale and segmented) can contribute to improving accuracy, adding relevant information to the proposed artificial neural network.
Download

Paper Nr: 151
Title:

A Comparison of Embedded Deep Learning Methods for Person Detection

Authors:

Chloe E. Kim, Mahdi D. Oghaz, Jiri Fajtl, Vasileios Argyriou and Paolo Remagnino

Abstract: Recent advancements in parallel computing, GPU technology and deep learning provide a new platform for complex image processing tasks such as person detection to flourish. Person detection is a fundamental preliminary operation for several high-level computer vision tasks. One industry that can significantly benefit from person detection is retail. In recent years, various studies have attempted to find an optimal solution for person detection using neural networks and deep learning. This study conducts a comparison among state-of-the-art deep learning based object detectors, with a focus on person detection performance in indoor environments. The performance of various implementations of YOLO, SSD, RCNN, R-FCN and SqueezeDet has been assessed using our in-house proprietary dataset, which consists of over 10,000 indoor images captured from shopping malls and retail stores. Experimental results indicate that Tiny YOLO-416 and SSD (VGG-300) are the fastest, and Faster-RCNN (Inception ResNet-v2) and R-FCN (ResNet-101) are the most accurate detectors investigated in this study. Further analysis shows that YOLO v3-416 delivers relatively accurate results in a reasonable amount of time, which makes it an ideal model for person detection on embedded platforms.
Download

Paper Nr: 154
Title:

Context-aware Training Image Synthesis for Traffic Sign Recognition

Authors:

Akira Sekizawa and Katsuto Nakajima

Abstract: In this paper, we propose a method for training traffic sign detectors without using actual images of the traffic signs. The method uses synthetically generated training images of road scenes to train a deep-learning based end-to-end traffic sign detector (which performs both detection and classification). Conventional methods for generating training data mostly focus on producing small images of the traffic sign alone and cannot be used to generate images for training end-to-end traffic sign detectors, which use images of the overall scenes as training data. In this paper, we propose a method for synthetically generating road scenes to use as training data for end-to-end traffic sign detectors. We also show that considering the context information of the surroundings of the traffic signs when generating scenes is effective for improving the precision.
Download

Paper Nr: 155
Title:

Improved Person Detection on Omnidirectional Images with Non-maxima Suppression

Authors:

Roman Seidel, André Apitzsch and Gangolf Hirtz

Abstract: We propose a person detector for omnidirectional images, an accurate method to generate minimal enclosing rectangles of persons. The basic idea is to adapt the qualitative detection performance of a convolutional neural network based method, namely YOLOv2, to fish-eye images. The design of our approach picks up the idea of a state-of-the-art object detector and uses highly overlapping image areas with their regions of interest. This overlap reduces the number of false negatives. Based on the raw bounding boxes of the detector, we fine-tuned overlapping bounding boxes with three approaches: non-maximum suppression, soft non-maximum suppression, and soft non-maximum suppression with Gaussian smoothing. The evaluation was done on the PIROPO database and our own annotated Flat dataset, supplemented with bounding boxes on omnidirectional images. We achieve an average precision of 64.4% with YOLOv2 for the class person on PIROPO and 77.6% on Flat. For this purpose, we fine-tuned the soft non-maximum suppression with Gaussian smoothing.
Download
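Of the three post-processing variants, soft non-maximum suppression with Gaussian smoothing is easy to sketch: instead of discarding boxes that overlap the current best detection, their scores are decayed by a Gaussian of the overlap (Bodla et al., 2017). The box format and parameter values below are illustrative assumptions.

```python
# Minimal numpy sketch of Gaussian soft-NMS.
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Decay, rather than discard, boxes overlapping the current maximum."""
    boxes = boxes.astype(float).copy()
    scores = scores.astype(float).copy()
    kept_boxes, kept_scores = [], []
    while len(boxes):
        i = int(np.argmax(scores))
        kept_boxes.append(boxes[i]); kept_scores.append(scores[i])
        mask = np.arange(len(boxes)) != i
        boxes, scores = boxes[mask], scores[mask]
        if len(boxes):
            scores = scores * np.exp(-iou(kept_boxes[-1], boxes) ** 2 / sigma)
            valid = scores > score_thresh      # drop only near-zero scores
            boxes, scores = boxes[valid], scores[valid]
    return np.array(kept_boxes), np.array(kept_scores)
```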

Paper Nr: 171
Title:

Recognising Actions for Instructional Training using Pose Information: A Comparative Evaluation

Authors:

Seán Bruton and Gerard Lacey

Abstract: Humans perform many complex tasks involving the manipulation of multiple objects. Recognition of the constituent actions of these tasks can be used to drive instructional training systems. The identities and poses of the objects used during such tasks are salient for the purposes of recognition. In this work, 3D object detection and registration techniques are used to identify and track objects involved in the everyday task of preparing a cup of tea. The pose information serves as input to an action classification system that uses Long Short-Term Memory (LSTM) recurrent neural networks as part of a deep architecture. An advantage of this approach is that it can represent the complex dynamics of object and human poses at hierarchical levels without the need to design specific spatio-temporal features. By using such compact features, we demonstrate the feasibility of using the hyperparameter optimisation technique of Tree-Parzen Estimators to identify optimal hyperparameters as well as network architectures. A recognition accuracy of 83% shows that this approach is viable for similar scenarios of pervasive computing applications where prior scene knowledge exists.
Download

Paper Nr: 178
Title:

Causal Inference in Nonverbal Dyadic Communication with Relevant Interval Selection and Granger Causality

Authors:

Lea Müller, Maha Shadaydeh, Martin Thümmel, Thomas Kessler, Dana Schneider and Joachim Denzler

Abstract: Human nonverbal emotional communication in dyadic dialogs is a process of mutual influence and adaptation. Identifying the direction of influence, or the cause-effect relation between participants, is a challenging task due to two main obstacles. First, distinct emotions might not be clearly visible. Second, the participants’ cause-effect relation is transient and varies over time. In this paper, we address these difficulties by using facial expressions that can be present even when strong distinct facial emotions are not visible. We also propose to apply a relevant interval selection approach prior to causal inference to identify those transient intervals where the adaptation process occurs. To identify the direction of influence, we apply the concept of Granger causality to the time series of facial expressions on the set of relevant intervals. We tested our approach on synthetic data and then applied it to newly obtained experimental data. Here, we were able to show that a more sensitive facial expression detection algorithm combined with a relevant interval detection approach is most promising for revealing the cause-effect patterns of dyadic communication in various instructed interaction conditions.
Download
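To make the causal inference step concrete, the following is a minimal sketch of a Granger causality test between two facial expression intensity time series on one selected interval, using statsmodels; the synthetic series and the lag order are our own illustrative assumptions.

```python
# Minimal Granger causality sketch: does person A's expression help
# predict person B's expression beyond B's own past?
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
expr_a = rng.standard_normal(200)                 # expression of person A
expr_b = np.roll(expr_a, 2) + 0.3 * rng.standard_normal(200)  # B follows A

# Column order for grangercausalitytests is (effect, cause).
data = np.column_stack([expr_b, expr_a])
results = grangercausalitytests(data, maxlag=4)
# Each lag reports an F-test; a small p-value suggests that past values
# of expr_a improve the prediction of expr_b, i.e. A influences B on
# this interval.
```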

Paper Nr: 180
Title:

Detection and Certification of Faint Streaks in Astronomical Images

Authors:

Vojtěch Cvrček and Radim Šára

Abstract: Fast-moving celestial objects, like near-Earth objects (NEOs), orbiting space debris, or meteors, appear as streaks superimposed over the star background in images taken by an optical telescope at long exposures. As the apparent magnitude of the object increases (the object becomes fainter), its detection becomes progressively harder. We discuss a statistical procedure that makes a binary decision on the presence/absence of a streak in the image, which is called streak certification. The certification is based purely on a single input image and a public star catalog, using a minimalistic statistical model. A certification accuracy greater than 90% is achieved for streaks of arbitrary orientation that are longer than 500 pixels and have a signal-to-background log-ratio better than −10 dB, on the same dataset as an earlier similar method, whose performance is thus exceeded, especially for close-to-horizontal streaks. We also show that the certification decision indicates detection failure well.
Download

Paper Nr: 189
Title:

Discriminant Patch Representation for RGB-D Face Recognition using Convolutional Neural Networks

Authors:

Nesrine Grati, Achraf Ben-Hamadou and Mohamed Hammami

Abstract: This paper focuses on designing data-driven models to learn a discriminant representation space for face recognition using RGB-D data. Unlike hand-crafted representations, learned models can extract and organize the discriminant information from the data, and can automatically adapt to build new computer vision applications faster. We propose an effective way to train Convolutional Neural Networks to learn face patch discriminant features. The proposed solution was tested and validated on state-of-the-art RGB-D datasets and showed competitive and promising results relative to standard hand-crafted feature extractors.
Download

Paper Nr: 197
Title:

Lane Detection and Scene Interpretation by Particle Filter in Airport Areas

Authors:

Claire Meymandi-Nejad, Salwa El Kaddaoui, Michel Devy and Ariane Herbulot

Abstract: Lane detection has been widely studied in the literature. However, it has mostly been applied to the automotive field, either for Advanced Driver-Assistance Systems (ADAS) or autonomous driving. Few works concern aeronautics, i.e. pilot assistance for taxiway navigation in airports. Aircraft manufacturers are now interested in new functionalities proposed to pilots in future cockpits, or even in the autonomous navigation of aircraft in airports. In this paper, we propose a scene interpretation module using the detection of lines and beacons in images acquired from the camera mounted in the vertical fin. Lane detection is based on particle filtering and polygonal approximation, performed on a top view computed from a transformation of the original image. For now, this algorithm is tested on simulated images created with a product of the OKTAL-SE company.
Download

Paper Nr: 203
Title:

Street-view Change Detection via Siamese Encoder-decoder Structured Convolutional Neural Networks

Authors:

Xinwei Zhao, Haichang Li, Rui Wang, Changwen Zheng and Song Shi

Abstract: In this paper, we propose a siamese encoder-decoder structured network for street scene change detection. Encoder-decoder structures have been successfully applied to semantic segmentation. Our work is inspired by the similarity between change detection and semantic segmentation, and by the success of siamese networks in comparing image patches. Our method is able to precisely detect changes in street scenes in the presence of irrelevant visual differences caused by different shooting conditions and weather. Moreover, the encoder and decoder parts are decoupled. Various combinations of different encoders and decoders are evaluated in this paper. Experiments on two street scene datasets, TSUNAMI and GSV, demonstrate that our method outperforms previous ones by a large margin.
Download

Paper Nr: 209
Title:

Virtual Flattening of a Clothing Surface by Integrating Geodesic Distances from Different Three-dimensional Views

Authors:

Yasuyo Kita and Nobuyuki Kita

Abstract: We propose a method of virtually flattening a largely deformed surface using three-dimensional images taken from different directions. In a previous paper (Kita and Kita, 2016), we proposed a method of virtually flattening a surface from a 3D depth image based on the calculation of geodesic lines, which are the shortest paths between two points on an arbitrary curved surface. Although that work showed the promise of the proposed approach, only gently curved surfaces could be flattened, owing to the limitation of observing from only one direction. To apply the method to a wider range of surfaces, including sharply curved surfaces, we extended it to integrate three-dimensional depth images taken from different directions. This was done by combining the equations obtained from each observation, through the surface points commonly observed in the different observations, and solving all the equations simultaneously. Experiments using actual clothing items demonstrated the effect of the integration.
Download

Paper Nr: 223
Title:

Bi-Directional Attention Flow for Video Alignment

Authors:

Reham Abobeah, Marwan Torki, Amin Shoukry and Jiro Katto

Abstract: In this paper, a novel technique is introduced to address the video alignment task, which is one of the hot topics in computer vision. Specifically, we aim at finding the best possible correspondences between two overlapping videos without the restrictions imposed by previous techniques. The novelty of this work is that the video alignment problem is solved by drawing an analogy between it and the machine comprehension (MC) task in natural language processing (NLP). Simply put, MC seeks to give the best answer to a question about a given paragraph. In our work, one of the two videos is considered as a query, while the other is considered as a context. First, a pre-trained CNN is used to obtain high-level features from the frames of both the query and context videos. Then, the bidirectional attention flow mechanism, which has achieved considerable success in MC, is used to compute the query-context interactions in order to find the best mapping between the two input videos. The proposed model has been trained using 10k video pairs collected from YouTube. The initial experimental results show that it is a promising solution for the video alignment task when compared to state-of-the-art techniques.
Download
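The query-context interaction can be sketched at its simplest as a frame-to-frame similarity matrix between CNN features of the two videos, softmax-normalized in both directions. The full bi-directional attention flow layer is more involved; this numpy snippet only illustrates the basic mechanism.

```python
# Minimal sketch of bidirectional frame attention between two videos.
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_attention(query_feats, context_feats):
    """query_feats: (m, d), context_feats: (n, d) CNN frame features."""
    S = query_feats @ context_feats.T        # (m, n) similarity matrix
    q2c = softmax(S, axis=1)  # per query frame: attention over context
    c2q = softmax(S, axis=0)  # per context frame: attention over query
    best_match = S.argmax(axis=1)  # naive alignment: best context frame
    return q2c, c2q, best_match
```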

Paper Nr: 235
Title:

Unsupervised Learning of Scene Categories on the Lunar Surface

Authors:

Thorsten Wilhelm, Rene Grzeszick, Gernot A. Fink and Christian Wöhler

Abstract: Learning scene categories is a challenging task due to the high diversity of images. State-of-the-art methods are typically trained in a fully supervised manner, requiring manual labeling effort. In some cases, however, these manual labels are not available. In this work, an example of completely unlabeled scene images, where labels are hardly obtainable, is presented: orbital images of the lunar surface. A novel method that exploits feature representations derived from a CNN trained on a different data source is presented. These features are adapted to the lunar surface in an unsupervised manner, allowing for learning scene categories and detecting regions of interest. The experiments show that meaningful representatives and scene categories can be derived in a fully unsupervised fashion.
Download

Paper Nr: 237
Title:

Robust Fitting of Geometric Primitives on LiDAR Data

Authors:

Tekla Tóth and Levente Hajder

Abstract: This paper deals with robust surface fitting on spatial points measured by a LiDAR device. The point clouds contain hundreds of thousands of data points; therefore, the time demand of the algorithms is crucial for fast operation. We present two novel algorithms based on the RANSAC method: one for plane detection and one for the detection of other objects. The execution time of the novel algorithms is significantly lower, as only one random sampling is required because a deterministic technique selects the other data points. The accuracy of the novel methods is validated on synthesized data as well as on real indoor and outdoor measurements.
Download
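For reference, a plain RANSAC plane detector on a point cloud looks as follows; the paper's contribution is precisely to replace most of this repeated random sampling with a deterministic selection, so this sketch shows the baseline, with thresholds and iteration counts as illustrative assumptions.

```python
# Minimal RANSAC plane detection on an (N, 3) point cloud.
import numpy as np

def ransac_plane(points, n_iters=200, thresh=0.05, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)       # plane normal from 3 points
        norm = np.linalg.norm(n)
        if norm < 1e-12:                     # degenerate (collinear) sample
            continue
        n = n / norm
        dist = np.abs((points - p0) @ n)     # point-to-plane distances
        inliers = dist < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers                      # mask of the dominant plane
```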

Paper Nr: 238
Title:

How Low Can You Go? Privacy-preserving People Detection with an Omni-directional Camera

Authors:

Timothy Callemein, Kristof Van Beeck and Toon Goedemé

Abstract: In this work, we use a ceiling-mounted omni-directional camera to detect people in a room. This can be used as a sensor to measure the occupancy of meeting rooms and count the number of flex-desk working spaces available. If these devices can be integrated in an embedded low-power sensor, they would form an ideal extension of automated room reservation systems in office environments. The main challenge we target here is ensuring the privacy of the people filmed. The approach we propose is to go to extremely low image resolutions, such that it is impossible to recognise people or read potentially confidential documents. Therefore, we retrained a single-shot low-resolution person detection network with automatically generated ground truth. In this paper, we prove the functionality of this approach and explore how low we can go in resolution, to determine the optimal trade-off between recognition accuracy and privacy preservation. Because of the low resolution, the result is a lightweight network that can potentially be deployed on embedded hardware. Such an embedded implementation enables the development of a decentralised smart camera which only outputs the required meta-data (i.e. the number of persons in the meeting room).
Download

Paper Nr: 243
Title:

Comparison of Sparse Image Descriptors for Eyes Detection in Thermal Images

Authors:

Mateusz Knapik and Bogusław Cyganek

Abstract: Eye detection and localization are basic steps in many computer systems aimed at human fatigue monitoring. In this paper we evaluate the performance of two sparse image descriptors for eye detection in the long-range IR spectrum. In the training phase, sparse descriptors of the training images are computed and used to create a feature vocabulary. Final detections are made using a bag-of-words approach and an additional heuristic for geometric constraints. Several thermal video sequences were recorded to allow for a quantitative analysis of this approach. Experimental results show that our method achieves high accuracy in real conditions.
Download

Paper Nr: 257
Title:

AuxNet: Auxiliary Tasks Enhanced Semantic Segmentation for Automated Driving

Authors:

Sumanth Chennupati, Ganesh Sistu, Senthil Yogamani and Samir Rawashdeh

Abstract: Decision making in automated driving is highly specific to the environment, and thus semantic segmentation plays a key role in recognizing the objects in the environment around the car. Pixel-level classification, once considered a challenging task, is now becoming mature enough to be productized in a car. However, semantic annotation is time consuming and quite expensive. Synthetic datasets with domain adaptation techniques have been used to alleviate the lack of large annotated datasets. In this work, we explore an alternate approach of leveraging the annotations of other tasks to improve semantic segmentation. Recently, multi-task learning has become a popular paradigm in automated driving, as joint learning of multiple tasks has been shown to improve the overall performance of each task. Motivated by this, we use auxiliary tasks like depth estimation to improve the performance of the semantic segmentation task. We propose adaptive task loss weighting techniques to address scale issues in multi-task loss functions, which become more crucial with auxiliary tasks. We experimented on automotive datasets including SYNTHIA and KITTI and obtained 3% and 5% improvements in accuracy, respectively.
Download
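One common way to realize adaptive task loss weighting is with learnable per-task uncertainty weights in the style of Kendall et al. (2018); the paper's own weighting techniques may differ, so the PyTorch snippet below is only an illustrative sketch of balancing a segmentation loss against an auxiliary depth loss.

```python
# Minimal sketch of uncertainty-based adaptive multi-task loss weighting.
import torch
import torch.nn as nn

class AdaptiveMultiTaskLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # log-variances, one per task, learned alongside the network
        self.log_var_seg = nn.Parameter(torch.zeros(()))
        self.log_var_depth = nn.Parameter(torch.zeros(()))

    def forward(self, seg_loss, depth_loss):
        w_seg = torch.exp(-self.log_var_seg)
        w_depth = torch.exp(-self.log_var_depth)
        # regularizer terms keep the log-variances from growing unbounded
        return (w_seg * seg_loss + self.log_var_seg
                + w_depth * depth_loss + self.log_var_depth)

# total = AdaptiveMultiTaskLoss()(ce_loss, l1_depth_loss); total.backward()
```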

Paper Nr: 261
Title:

Challenges in Designing Datasets and Validation for Autonomous Driving

Authors:

Michal Uřičář, David Hurych, Pavel Křížek and Senthil Yogamani

Abstract: Autonomous driving has been getting a lot of attention in the last decade and will be the hot topic at least until the first successful certification of a car with Level 5 autonomy (International, 2017). There are many public datasets in the academic community. However, they are far away from what a robust industrial production system needs. There is a large gap between the academic and industrial settings, and moving from a research prototype built on public datasets to a deployable solution is a challenging task. In this paper, we focus on bad practices that often happen in autonomous driving from an industrial deployment perspective. Data design deserves at least the same amount of attention as model design. Very little attention is paid to these issues in the scientific community, and we hope this paper encourages better formalization of dataset design. More specifically, we focus on dataset design and validation schemes for autonomous driving, where we would like to highlight the common problems and wrong assumptions, steps towards avoiding them, as well as some open problems.
Download

Paper Nr: 262
Title:

Optimal Score Fusion via a Shallow Neural Network to Improve the Performance of Classical Open Source Face Detectors

Authors:

Moumen T. El-Melegy, Hesham M. Haridi, Samia A. Ali and Mostafa A. Abdelrahman

Abstract: Face detection is an essential stage in most applications interested in the visual understanding of human faces. Recently, face detection has witnessed a huge improvement in performance as a result of reliance on convolutional neural networks. On the other hand, the classical face detectors in many renowned open source libraries for computer vision, like OpenCV and Dlib, may suffer in performance, yet they are still used in many industrial applications. In this paper, we aim to boost the performance of these classical detectors and suggest a fusion method to combine the face detectors of the OpenCV and Dlib libraries. The OpenCV face detector, using the frontal and profile models, as well as the Dlib HOG-based face detector are run in parallel on the image of interest, followed by a skin detector that is used to detect skin regions on the detected faces. To find an optimal aggregation of these detectors, we employ a shallow neural network. Our approach is implemented and tested on the popular FDDB and WIDER face datasets, and it shows an improvement in performance compared to the classical open source face detectors.
Download
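The fusion step can be sketched as a shallow network over a per-candidate score vector. The feature composition below (OpenCV frontal and profile scores, Dlib HOG score, skin ratio) is an assumption based on the abstract, not the authors' exact design.

```python
# Minimal sketch of score fusion with a shallow neural network.
import numpy as np
from sklearn.neural_network import MLPClassifier

# X: (n_candidates, 4) rows = [cv_frontal, cv_profile, dlib_hog, skin_ratio]
# y: 1 if the candidate box is a true face, else 0
def train_fusion(X, y):
    # one small hidden layer is enough for a 4-dimensional score vector
    return MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000).fit(X, y)

# face_prob = train_fusion(X, y).predict_proba(X_new)[:, 1]
```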

Area 4 - Applications and Services

Full Papers
Paper Nr: 149
Title:

Recovering and Visualizing Deformation in 3D Aegean Sealings

Authors:

Bartosz Bogacz, Nikolas Papadimitriou, Diamantis Panagiotopoulos and Hubert Mara

Abstract: Archaeological research into Aegean sealings and sigils reveals valuable insights into Aegean socio-political organization and administration. An important question arising is the determination of the authorship and origin of seals. The similarity of sealings is a key factor, as it can indicate different seals with the same depiction or the same seal imprinted by different persons. Analyses of authorship and workmanship require the comparison of shared patterns and the detection of differences between these artifacts. These are typically performed qualitatively, by manually discovering and observing shared visual traits. In our work, we quantify and highlight visual differences by exposing and directly matching shared features. Further, we visualize and measure the deformation of shape necessary to match sigils. The sealings used in our dataset are 3D structured-light scans of plasticine and latex molds of originals. We compute four different feature descriptors on the projected surfaces and their curvature. These features are then matched with a rigid RANSAC estimation before a non-rigid thin-plate spline (TPS) matching is performed to fine-tune the deformation. We evaluate our approach by synthesizing artificial deformations on real-world data and measuring the distance to the reconstructed deformation.
Download

Paper Nr: 161
Title:

NoisyArt: A Dataset for Webly-supervised Artwork Recognition

Authors:

R. Del Chiaro, A Bagdanov and A. Del Bimbo

Abstract: This paper describes the NoisyArt dataset, a dataset designed to support research on webly-supervised recognition of artworks. The dataset consists of more than 90,000 images in more than 3,000 webly-supervised classes, and a subset of 200 classes with verified test images. Candidate artworks are identified using publicly available metadata repositories, and images are automatically acquired using Google Image and Flickr search. Document embeddings are also provided for short descriptions of all artworks. NoisyArt is designed to support research on webly-supervised artwork instance recognition, zero-shot learning, and other approaches to visual recognition of cultural heritage objects. Baseline experimental results are given using pretrained Convolutional Neural Network (CNN) features and a shallow classifier architecture. Experiments are also performed using a variety of techniques for identifying and mitigating label noise in webly-supervised training data.
Download

Paper Nr: 249
Title:

Active Object Search with a Mobile Device for People with Visual Impairments

Authors:

Jacobus C. Lock, Grzegorz Cielniak and Nicola Bellotto

Abstract: Modern smartphones can provide a multitude of services to assist people with visual impairments, and their cameras in particular can be useful for assisting with tasks, such as reading signs or searching for objects in unknown environments. Previous research has looked at ways to solve these problems by processing the camera’s video feed, but very little work has been done in actively guiding the user towards specific points of interest, maximising the effectiveness of the underlying visual algorithms. In this paper, we propose a control algorithm based on a Markov Decision Process that uses a smartphone’s camera to generate real-time instructions to guide a user towards a target object. The solution is part of a more general active vision application for people with visual impairments. An initial implementation of the system on a smartphone was experimentally evaluated with participants with healthy eyesight to determine the performance of the control algorithm. The results show the effectiveness of our solution and its potential application to help people with visual impairments find objects in unknown environments.
Download

Short Papers
Paper Nr: 121
Title:

An Indoor Sign Dataset (ISD): An Overview and Baseline Evaluation

Authors:

João R. Almeida, Franklin C. Flores, Max N. Roecker, Marco K. Braga and Yandre G. Costa

Abstract: Visually impaired people need help from others when they need to find specific destinations and cannot guide themselves in indoor environments using signs. Computer vision systems can help them with this kind of task. In this paper, we present to the research community an Indoor Sign Dataset (ISD), a novel dataset composed of 1,200 samples of indoor sign images labeled into one of the following classes: accessibility, emergency exit, men’s toilets, women’s toilets, wifi and no smoking. The ISD dataset consists of images with different environmental conditions, perspectives, and appearances, which makes the recognition task quite challenging. A data augmentation technique was applied, generating 69,120 images. We also present baseline results obtained using handcrafted features, like LBP, Color Histogram, HOG, and DAISY, applied with SVM, k-NN, and MLP classifiers. We further evaluate non-handcrafted features learned using convolutional neural networks (CNNs). The best result was obtained using a CNN model, with an accuracy of 90.33%. This dataset and these techniques can be applied to design a wearable device able to help visually impaired people.
Download

Paper Nr: 133
Title:

Humans Vs. Algorithms: Assessment of Security Risks Posed by Facial Morphing to Identity Verification at Border Control

Authors:

Andrey Makrushin, Tom Neubert and Jana Dittmann

Abstract: Facial morphing, if applied to a biometric portrait intended for an identity document application, compromises further identity verification by means of the issued document. An electronic machine-readable travel document is a prime target of a face morphing attack, because a successful attack allows a wanted criminal to cross a border illicitly. The open question is whether human examiners and algorithms can be fooled only by professionally created manual morphs, or even by automatically generated morphs with evident visual artifacts. In this paper, we introduce a border control simulation to examine the ability of humans to recognize morphed passport photographs, as well as to detect mismatches between morphed passport photographs and the "live" faces of travelers. The error rates of humans are compared with those of algorithms to emphasize the necessity for computer-aided support of border guards.
Download

Paper Nr: 139
Title:

Depth from Small Motion using Rank-1 Initialization

Authors:

Peter O. Fasogbon

Abstract: Depth from Small Motion (DfSM) (Ha et al., 2016) is particularly interesting for commercial handheld devices because it offers the possibility of obtaining depth information with minimal user effort and cooperation. Due to speed and memory issues on these devices, the self-calibration optimization of the method using Bundle Adjustment (BA) needs as few as 10-15 images. Therefore, the optimization tends to take many iterations to converge, or may not converge at all in some cases. This work proposes a robust initialization for the bundle adjustment using the rank-1 factorization method (Tomasi and Kanade, 1992), (Aguiar and Moura, 1999a). We create a constraint matrix that is rank-1 in a noiseless situation, then use SVD to compute the inverse depth values and the camera motion. We only need about a quarter of the bundle adjustment iterations to converge. We also propose a gridded feature extraction technique so that only important and small features are tracked all over the image frames. This also ensures a speedup in the full execution time on the mobile device. For the experiments, we have documented the execution time with the proposed rank-1 initialization on two mobile device platforms using optimized accelerations with CPU-GPU co-processing. The combination of rank-1 initialization and BA generates more robust depth maps and is significantly faster than using BA alone.
Download
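The initialization step rests on a standard linear-algebra fact: the best rank-1 approximation of a matrix comes from its leading singular triplet (Eckart-Young). The sketch below shows that factorization; the mapping of the two factors to inverse depths and camera motion is paper-specific and omitted here.

```python
# Minimal rank-1 factorization of a constraint matrix via SVD.
import numpy as np

def rank1_factorize(W):
    """W: (n_frames, n_points) constraint matrix, rank-1 if noiseless.
    Returns u, v with W ~= np.outer(u, v), the best rank-1 approximation."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    u = U[:, 0] * np.sqrt(S[0])    # per-frame factor (e.g. camera motion)
    v = Vt[0] * np.sqrt(S[0])      # per-point factor (e.g. inverse depth)
    return u, v
```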

Paper Nr: 176
Title:

A 3D Lung Nodule Candidate Detection by Grouping DCNN 2D Candidates

Authors:

Fernando R. Pereira, David Menotti and Lucas Ferrari de Oliveira

Abstract: Lung cancer has attracted the attention of scientific communities as one of the main causes of morbidity and mortality worldwide. Computed Tomography (CT) scans are highly indicated to detect patterns such as lung nodules, whose correct detection and accurate classification are paramount for clinical decision-making. In this paper, we propose a two-step method for lung nodule candidate detection using a Deep Convolutional Neural Network (DCNN), more specifically the Single Shot MultiBox Detector, for candidate detection in 2D images/slices, followed by a fusion technique to group the inter-slice adjacent detected candidates. The DCNN system was trained and validated with data from the Lung Image Database Consortium and Image Database Resource Initiative; we also used LUng Nodule Analysis 2016 challenge data and metrics to evaluate the system. We obtained a sensitivity of 96.7% with an average of 77.4 False Positives (FPs) per scan (an entire set of CT images/slices for a patient). This sensitivity ranks second in the state of the art (the first is 97.1%), but with an FPs-per-scan rate almost three times smaller than that of the first-ranked method (219.1).
Download

Paper Nr: 181
Title:

Unsupervised Method based on Probabilistic Neural Network for the Segmentation of Corpus Callosum in MRI Scans

Authors:

Amal Jlassi, Khaoula ElBedoui, Walid Barhoumi and Chokri Maktouf

Abstract: In this paper, we introduce an unsupervised method for the segmentation of the Corpus Callosum (CC) from Magnetic Resonance Imaging (MRI) scans. In order to extract the CC from sagittal scans in brain MRI, we adopted the Probabilistic Neural Network (PNN) as a clustering technique. Then, we used k-means to obtain the target classes. After that, we introduced a cluster validity measure based on the maximum entropy principle (Vmep), which aims to dynamically define the optimal number of classes. The latter criterion was applied to the hidden layer output of the PNN while varying the number of classes. Finally, we isolated the CC using a spatial-based process. We validated the performance of the proposed method on two challenging datasets using objective metrics (accuracy, sensitivity, Dice coefficient, specificity and Jaccard similarity), and the obtained results proved the superiority of this method over relevant methods from the state of the art.
Download
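The "vary the number of classes and score each clustering" loop can be sketched as follows. The entropy score used here (Shannon entropy of the cluster-size distribution) is only a stand-in for the paper's Vmep criterion, and k-means is applied directly to the data rather than to the PNN hidden-layer output.

```python
# Minimal sketch of selecting the number of clusters with an
# entropy-based validity score.
import numpy as np
from sklearn.cluster import KMeans

def cluster_entropy(labels):
    """Shannon entropy of the cluster size distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def best_k(X, k_range=range(2, 10)):
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        scores[k] = cluster_entropy(labels)
    # pick the k that maximizes the validity score (illustrative choice)
    return max(scores, key=scores.get), scores
```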

Paper Nr: 193
Title:

Predicting the Early Stages of the Alzheimer’s Disease via Combined Brain Multi-projections and Small Datasets

Authors:

Kauê N. Duarte, Pedro V. V. de Paiva, Paulo S. Martins and Marco G. Carvalho

Abstract: Alzheimer’s is a neurodegenerative disease that usually affects the elderly. It compromises a patient’s memory, cognition, and perception of the environment. Detecting Alzheimer’s Disease in its initial stage, known as Mild Cognitive Impairment, attracts special efforts from experts due to the possibility of using drugs to delay the progression of the disease. This paper aims to provide a method for the detection of this impairment condition via the classification of brain images using Transfer Learning (deep features) and Support Vector Machines. The small number of images used in this work justifies the application of Transfer Learning, which employs weights from the initial layers of VGG19, trained for ImageNet classification, as a deep feature extractor, and then applies Support Vector Machines. Majority Voting, False-Positive Priori, and Super Learner were applied to combine the previous classifiers’ predictions. The final step was a detection stage that assigns a label to the previous voting outcomes, determining the presence or absence of an Alzheimer’s pre-condition. The OASIS-1 database was used, with a total of 196 images (axial, coronal, and sagittal). Our method showed promising performance in terms of accuracy, recall and specificity.
Download
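The transfer-learning pipeline, frozen ImageNet-pretrained VGG19 layers as a feature extractor followed by an SVM, can be sketched with torchvision and scikit-learn. Which layer to cut at and the SVM settings are illustrative assumptions.

```python
# Minimal sketch: frozen VGG19 deep features feeding an SVM.
# (Uses the weights API of recent torchvision releases.)
import torch
from torchvision import models
from sklearn.svm import SVC

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
features = vgg.features.eval()            # convolutional layers only
for p in features.parameters():
    p.requires_grad_(False)               # frozen: pure feature extractor

def deep_features(batch):                 # batch: (N, 3, 224, 224) tensor
    with torch.no_grad():
        f = features(batch)
        return torch.flatten(f, start_dim=1).numpy()

# X_train: preprocessed brain slices, y_train: pre-condition labels
# clf = SVC(kernel='rbf').fit(deep_features(X_train), y_train)
```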

Paper Nr: 227
Title:

Generation of 3D Building Models from City Area Maps

Authors:

Roman Martel, Chaoqun Dong, Kan Chen, Henry Johan and Marius Erdt

Abstract: In this paper, we propose a pipeline that converts buildings described in city area maps to 3D models in the CityGML LOD1 standard. The input documents are scanned city area maps provided by a city authority. The city area maps were recorded and stored over a long time period. This imposes several challenges on the pipeline, such as different typewriter font styles, the handwriting of different persons, varying layouts, low contrast, damage and scanning artifacts. The novel and distinguishing aspect of our approach is its ability to deal with these challenges. In the pipeline, we first identify and analyse text boxes within the city area maps to extract information such as the height and location of the described buildings. Secondly, we extract the building shapes based on these locations from an online city map API. Lastly, using the extracted building shapes and heights, we generate 3D models of the buildings.
Download

Paper Nr: 229
Title:

Authentication of Medicine Blister Foils: Characterization of the Rotogravure Printing Process

Authors:

Iuliia Tkachenko, Alain Trémeau and Thierry Fournel

Abstract: Nowadays, the number of medicine packaging counterfeits is increasing very quickly. The rotogravure printing technique is used worldwide for the production of medicine blister foils. However, existing anti-counterfeiting solutions do not take this printing process into account. Additionally, it is not easy to apply conventional solutions to blister foils instead of uncoated/coated paper. In this paper, we study some features of rotogravure printing and identify future paths to fight the increasing number of counterfeit medicine products. We present the results of a preliminary study of this process, extended to foils, and discuss some promising solutions for blister foil authentication.
Download

Paper Nr: 29
Title:

A Smartphone Tool for Evaluating Cardiopulmonary Resuscitation (CPR) Delivery

Authors:

Gavin Corkery and Kenneth Dawson-Howe

Abstract: This paper presents a prototype smartphone application to aid with the delivery of cardiopulmonary resuscitation (CPR). The person giving CPR is viewed from the side and both compressions and breaths are identified primarily using optical flow. This allows the system to provide near real time feedback on the chest compression rate (CCR) and on the timing of breaths (which affects the Chest Compression Fraction (CCF)). The system is evaluated on over 25 minutes of video of 6 different participants delivering CPR to a test dummy. A quantitative evaluation is presented which shows that the system recognised 99% of compressions and all of the breaths (although two false positive breaths were classified). It computed the CCF to within 1%.
Download
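The optical-flow idea can be sketched as tracking the mean vertical flow in a region over the rescuer's hands and counting zero crossings of that signal to estimate the compression rate. The ROI handling and the Farneback parameters below are illustrative assumptions, not the app's implementation.

```python
# Minimal sketch: compression-rate estimation from dense optical flow.
import cv2
import numpy as np

def vertical_motion_signal(frames, roi):
    """frames: list of grayscale images, roi: (x, y, w, h) over the hands."""
    x, y, w, h = roi
    signal = []
    for prev, curr in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        signal.append(flow[y:y + h, x:x + w, 1].mean())  # mean vertical flow
    return np.array(signal)

def compressions_per_minute(signal, fps):
    # each compression cycle crosses zero twice (down stroke, up stroke)
    zero_crossings = np.sum(np.diff(np.signbit(signal).astype(int)) != 0)
    cycles = zero_crossings / 2.0
    return 60.0 * fps * cycles / len(signal)
```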

Paper Nr: 35
Title:

Motion Evaluation of Therapy Exercises by Means of Skeleton Normalisation, Incremental Dynamic Time Warping and Machine Learning: A Comparison of a Rule-Based and a Machine-Learning-Based Approach

Authors:

Julia Richter, Christian Wiede, Ulrich Heinkel and Gangolf Hirtz

Abstract: The assessment of motions by means of technical assistance systems is attracting widespread interest in fields such as competitive sports, fitness and rehabilitation. Current research has succeeded in generating feedback that concerns quantity or the grade of similarity with regard to correct reference motions. In view of post-operative rehabilitation exercises, this type of feedback is regarded as insufficient. That is why recent research aims at providing qualitative feedback by communicating motion errors. While existing systems investigated the use of manually defined rules to detect motion errors, we suggest employing machine learning techniques in combination with dynamic time warping and training classifiers with sample exercise executions represented by 3-D skeleton joint trajectories. This study describes both a rule-based and a machine-learning-based approach and compares them with regard to their accuracy. Secondly, this study investigates the effect of using normalised hierarchical coordinates on the classification accuracy when data from different persons is used for the machine-learning-based approach. The results reveal that the performance of the machine-learning-based method compares well with the rule-based concept. Another outcome to emerge from this study is that normalised hierarchical coordinates make it possible to use data from different persons.
Download
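Dynamic time warping, the alignment step shared by both approaches, can be sketched in a few lines; representing each frame as a plain vector of normalised joint coordinates is our own simplifying assumption.

```python
# Minimal dynamic time warping between two skeleton trajectories.
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a: (m, d), seq_b: (n, d) sequences of per-frame joint vectors."""
    m, n = len(seq_a), len(seq_b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[m, n]

# A small DTW distance to the reference execution indicates a similar
# motion; per-exercise thresholds or classifiers can build on it.
```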

Paper Nr: 153
Title:

Constrained Multi Camera Calibration for Lane Merge Observation

Authors:

Kai Cordes and Hellward Broszio

Abstract: For the trajectory planning in autonomous driving, the accurate localization of the vehicles is required. Accurate localizations of the ego-vehicle will be provided by the next generation of connected cars using 5G. Until all cars participate in the network, un-connected cars have to be considered as well. These cars are localized via static cameras positioned next to the road. To achieve high accuracy in the vehicle localization, the highly accurate calibration of the cameras is required. Accurately measured landmarks as well as a priori knowledge about the camera configuration are used to develop the proposed constrained multi camera calibration technique. The reprojection error for all cameras is minimized using a differential evolution (DE) optimization strategy. Evaluations on data recorded on a test track show that the proposed calibration technique provides adequate calibration accuracy while the accuracies of reference implementations are insufficient.
Download

Paper Nr: 199
Title:

Smartphone Teleoperation for Self-balancing Telepresence Robots

Authors:

Antti E. Ainasoja, Said Pertuz and Joni-Kristian Kämäräinen

Abstract: Self-balancing mobile platforms have recently been adopted in many applications thanks to their light-weight and slim build. However, inherent instability in their behaviour makes both manual and autonomous operation more challenging as compared to traditional self-standing platforms. In this work, we experimentally evaluate three teleoperation user interface approaches to remotely control a self-balancing telepresence platform: 1) touchscreen button user interface, 2) tilt user interface and 3) hybrid touchscreen-tilt user interface. We provide evaluation in quantitative terms based on user trajectories and recorded control data, and qualitative findings from user surveys. Both quantitative and qualitative results support our finding that the hybrid user interface (a speed slider with tilt turn) is a suitable approach for smartphone-based teleoperation of self-balancing telepresence robots. We also introduce a client-server based multi-user telepresence architecture using open source tools.
Download

Paper Nr: 244
Title:

Vision Substitution with Object Detection and Vibrotactile Stimulus

Authors:

Ricardo Ribani and Mauricio Marengoni

Abstract: The present work proposes a system that implements sensory substitution of vision through a wearable item with vibration motors positioned on the back of the user. In addition to the developed hardware, the proposal consists in the construction of a system that uses deep learning techniques to detect and classify objects in controlled environments. The hardware comprises a simple HD camera, a pair of Arduinos, 9 cylindrical DC motors and a Raspberry Pi (responsible for the image processing and for translating the signals to the Arduinos). In a first trial of image classification and localization, we evaluated the ResNet-50 model pre-trained on the ImageNet database. We then implemented a Single Shot Detector with a MobileNetV2 to perform real-time detection on the Raspberry Pi, sending the detected object class and location as defined patterns to the motors.
Download

Paper Nr: 251
Title:

A Hybrid Method for Remote Eye Tracking using RGB-IR Camera

Authors:

Kenta Yamagishi and Kentaro Takemura

Abstract: Methods for eye tracking using images can be divided broadly into two categories: methods using a near-infrared image and methods using a visible image. These images have been used independently in conventional eye-tracking methods; however, each category of methods has different advantageous features. Therefore, we propose using these images simultaneously to compensate for the weak points of each technique, employing an RGB-IR camera that can capture visible and near-infrared images simultaneously. Pupil detection can yield better results than iris detection because the eyelid often occludes the iris. On the other hand, the iris area can be used for model fitting because the iris size is constant. The model fitting can be automated at initialization; thus, the relationship between the 3D eyeball model and the eye camera is solved. Additionally, the positions of the eye and the gaze vectors are estimated continuously using these images for tracking. We conducted several experiments to evaluate the proposed method and confirmed its feasibility.
Download

Paper Nr: 258
Title:

A Robust Blind Video Watermarking Scheme based on Discrete Wavelet Transform and Singular Value Decomposition

Authors:

Amal Hammami, Amal Ben Hamida and Chokri Ben Amar

Abstract: Technological advances have massively facilitated information fraud and misappropriation through the ease with which multimedia content can be regenerated and modified. Consequently, the security of digital media is considered among the biggest issues in multimedia services. Watermarking, which consists in hiding a signature known as a watermark in a host signal, is one of the potential solutions used for media security and authentication. In this paper, we propose a robust video watermarking scheme using the Discrete Wavelet Transform and Singular Value Decomposition. We embed the watermark into the mid-frequency sub-bands based on an additive method. The extraction process follows a blind detection algorithm. Several attacks are applied and different performance metrics are computed to assess the robustness and imperceptibility of the proposed watermarking scheme. The results reveal that the proposed scheme is robust against different attacks and achieves a good level of imperceptibility.
Download
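
A minimal sketch of DWT+SVD additive embedding of the general kind the abstract describes, assuming a grayscale frame, a 1-D watermark vector, and the LH (mid-frequency) sub-band with an illustrative strength alpha; the paper's exact embedding rule and sub-band selection may differ.

```python
import numpy as np
import pywt

def embed(frame, watermark, alpha=0.05):
    """Embed a 1-D watermark vector into the LH (mid-frequency)
    sub-band of a grayscale frame by additively perturbing its
    singular values; alpha controls embedding strength."""
    LL, (LH, HL, HH) = pywt.dwt2(frame.astype(float), 'haar')
    U, S, Vt = np.linalg.svd(LH, full_matrices=False)
    S_marked = S + alpha * watermark[:len(S)]   # additive rule
    LH_marked = (U * S_marked) @ Vt
    return pywt.idwt2((LL, (LH_marked, HL, HH)), 'haar')
```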

Area 5 - Motion, Tracking and Stereo Vision

Full Papers
Paper Nr: 42
Title:

Subpixel Catadioptric Modeling of High Resolution Corneal Reflections

Authors:

Chengyuan Lin and Voicu Popescu

Abstract: We present a calibration procedure that achieves a sub-pixel accurate model of the catadioptric imaging system defined by two corneal spheres and a camera. First, the eyes’ limbus circles are used to estimate the positions of the corneal spheres. Then, corresponding features in the corneal reflections are detected and used to optimize the corneal spheres’ positions with a RANSAC framework customized to the corneal catadioptric model. The framework relies on a bundle adjustment optimization that minimizes the corneal reflection reprojection error of corresponding features. In our experiments, for images with a total resolution of 5,472 × 3,648, and a limbus resolution of 600 × 600, our calibration procedure achieves an average reprojection error smaller than one pixel, over hundreds of correspondences. We demonstrate the calibration of the catadioptric system in the context of sparse, feature-based, and dense, pixel-based reconstruction of several 3D scenes from corneal reflections.
Download

Paper Nr: 86
Title:

Spatio-temporal Upsampling for Free Viewpoint Video Point Clouds

Authors:

Matthew Moynihan, Rafael Pagés and Aljosa Smolic

Abstract: This paper presents an approach to upsampling point cloud sequences captured through a wide baseline camera setup in a spatio-temporally consistent manner. The system uses edge-aware scene flow to understand the movement of 3D points across a free-viewpoint video scene to impose temporal consistency. In addition to geometric upsampling, a Hausdorff distance quality metric is used to filter noise and further improve the density of each point cloud. Results show that the system produces temporally consistent point clouds, not only reducing errors and noise but also recovering details that were lost in frame-by-frame dense point cloud reconstruction. The system has been successfully tested on sequences captured with both static and handheld cameras.
Download
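
A sketch of the filtering idea: score each upsampled point by its one-sided, Hausdorff-style nearest-neighbour distance to a reference cloud and discard outliers. The threshold and the choice of reference cloud are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_by_distance(points, reference, threshold=0.02):
    """Keep only points whose nearest-neighbour distance to the
    reference cloud (one side of the Hausdorff construction)
    stays below an illustrative threshold."""
    distances, _ = cKDTree(reference).query(points)
    return points[distances < threshold]
```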

Paper Nr: 103
Title:

Weighted Linear Combination of Distances within Two Manifolds for 3D Human Action Recognition

Authors:

Amani Elaoud, Walid Barhoumi, Hassen Drira and Ezzeddine Zagrouba

Abstract: Human action recognition based on RGB-D sequences is an important research direction in the field of computer vision. In this work, we incorporate the skeleton on the Grassmann manifold in order to model a human action as a trajectory. Given a couple of matched points on the Grassmann manifold, we introduce the special orthogonal group SO(3) to exploit the rotation ignored by the Grassmann manifold. Our objective is to define the best weighted linear combination of distances in the Grassmann and SO(3) manifolds according to the nature of the action, while modeling human actions by temporal trajectories. The effectiveness of combining the two non-Euclidean spaces was validated on three standard, challenging 3D human action recognition datasets (G3D-Gaming, UTD-MHAD multimodal action and Florence3D-Action), and the preliminary results confirm the accuracy of the proposed method compared with relevant state-of-the-art methods.
Download
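
A sketch of a weighted combination of the two manifold distances, using principal angles on the Grassmannian and the geodesic distance on SO(3); the weight w is the quantity the paper tunes per action class, here left as a free parameter.

```python
import numpy as np
from scipy.linalg import subspace_angles, logm

def grassmann_dist(A, B):
    """Distance from the principal angles between the column
    spaces of A and B."""
    return np.linalg.norm(subspace_angles(A, B))

def so3_dist(R1, R2):
    """Geodesic distance between two rotation matrices
    (Frobenius norm of the matrix log, scaled to the angle)."""
    return np.linalg.norm(np.real(logm(R1.T @ R2)), 'fro') / np.sqrt(2)

def combined_dist(A, B, R1, R2, w=0.5):
    return w * grassmann_dist(A, B) + (1.0 - w) * so3_dist(R1, R2)
```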

Paper Nr: 115
Title:

Quantitative Affine Feature Detector Comparison based on Real-World Images Taken by a Quadcopter

Authors:

Zoltán Pusztai and Levente Hajder

Abstract: Feature detectors are frequently used in computer vision. Recently, detectors which can extract the affine transformation between the features have become popular. With affine transformations, it is possible to estimate the properties of the camera motion and the 3D scene from significantly fewer feature correspondences. This paper quantitatively compares the affine feature detectors on real-world images captured by a quadcopter. The ground truth (GT) data are calculated from the constrained motion of the cameras. Accurate and very realistic testing data are generated for both the feature locations and the corresponding affine transformations. Based on the generated GT data, many popular affine feature detectors are quantitatively compared.
Download

Paper Nr: 168
Title:

Real-time 3D Pose Estimation from Single Depth Images

Authors:

Thomas Schnürer, Stefan Fuchs, Markus Eisenbach and Horst-Michael Groß

Abstract: To allow for safe human-robot interaction in industrial scenarios such as manufacturing plants, it is essential to always be aware of the location and pose of humans in the shared workspace. We introduce a real-time 3D pose estimation system using single depth images that is designed to run on limited hardware, such as a mobile robot. For this, we optimized a CNN-based 2D pose estimation architecture to achieve high frame rates while requiring fewer resources. Building upon this architecture, we extended the system to 3D estimation, directly predicting Cartesian body joint coordinates. We evaluated our system on a newly created dataset by applying it to a specific industrial workbench scenario. The results show that our system's performance is competitive with the state of the art at more than five times the speed for single-person pose estimation.
Download

Paper Nr: 202
Title:

An MRF Optimisation Framework for Full 3D Helmholtz Stereopsis

Authors:

Gianmarco Addari and Jean-Yves Guillemaut

Abstract: Accurate 3D modelling of real world objects is essential in many applications such as digital film production and cultural heritage preservation. However, current modelling techniques rely on assumptions to constrain the problem, effectively limiting the categories of scenes that can be reconstructed. A common assumption is that the scene’s surface reflectance is Lambertian or known a priori. These constraints rarely hold true in practice and result in inaccurate reconstructions. Helmholtz Stereopsis (HS) addresses this limitation by introducing a reflectance agnostic modelling constraint, but prior work in this area has been predominantly limited to 2.5D reconstruction, providing only a partial model of the scene. In contrast, this paper introduces the first Markov Random Field (MRF) optimisation framework for full 3D HS. First, an initial reconstruction is obtained by performing 2.5D MRF optimisation with visibility constraints from multiple viewpoints and fusing the different outputs. Then, a refined 3D model is obtained through volumetric MRF optimisation using a tailored Iterative Conditional Modes (ICM) algorithm. The proposed approach is evaluated with both synthetic and real data. Results show that the proposed full 3D optimisation significantly increases both geometric and normal accuracy, being able to achieve sub-millimetre precision. Furthermore, the approach is shown to be robust to occlusions and noise.
Download
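
For readers unfamiliar with Iterated Conditional Modes, a generic sketch on a 2-D grid MRF with unary costs and a Potts smoothness term; the paper's tailored volumetric variant and visibility handling are not reproduced here.

```python
import numpy as np

def icm(unary, n_iters=5, lam=1.0):
    """Iterated Conditional Modes on a grid MRF.
    unary: H x W x L array of per-site label costs; lam weights a
    Potts penalty over 4-connected neighbours."""
    labels = unary.argmin(axis=2)
    H, W, L = unary.shape
    for _ in range(n_iters):
        for y in range(H):
            for x in range(W):
                best, best_cost = labels[y, x], np.inf
                for l in range(L):
                    cost = unary[y, x, l]
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W:
                            cost += lam * (labels[ny, nx] != l)
                    if cost < best_cost:
                        best, best_cost = l, cost
                labels[y, x] = best
    return labels
```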

Paper Nr: 224
Title:

FDMO: Feature Assisted Direct Monocular Odometry

Authors:

Georges Younes, Daniel Asmar and John Zelek

Abstract: Visual Odometry (VO) can be categorized as being either direct (e.g. DSO) or feature-based (e.g. ORB-SLAM). When the system is calibrated photometrically, and images are captured at high rates, direct methods have been shown to outperform feature-based ones in terms of accuracy and processing time; they are also more robust to failure in feature-deprived environments. On the downside, direct methods rely on heuristic motion models to seed an estimate of camera motion between frames; in the event that these models are violated (e.g., erratic motion), direct methods easily fail. This paper proposes FDMO (Feature assisted Direct Monocular Odometry), a system designed to complement the advantages of both direct and feature-based techniques to achieve sub-pixel accuracy, robustness in feature-deprived environments, and resilience to erratic and large inter-frame motions, all while maintaining a low computational cost at frame rate. Efficiencies are also introduced to decrease the computational complexity of the feature-based mapping part. FDMO shows an average of 10% reduction in alignment drift, and 12% reduction in rotation drift when compared to the best of both ORB-SLAM and DSO, while achieving significant drift (alignment, rotation & scale) reductions (51%, 61%, 7% respectively) going over the same sequences for a second loop. FDMO is further evaluated on the EuroC dataset and was found to inherit the resilience of feature-based methods to erratic motions, while maintaining the accuracy of direct methods.
Download

Short Papers
Paper Nr: 34
Title:

Structural Change Detection by Direct 3D Model Comparison

Authors:

Marco Fanfani and Carlo Colombo

Abstract: Tracking the structural evolution of a site has important fields of application, ranging from documenting the excavation progress during an archaeological campaign, to hydro-geological monitoring. In this paper, we propose a simple yet effective method that exploits vision-based reconstructed 3D models of a time-changing environment to automatically detect any geometric changes in it. Changes are localized by direct comparison of time-separated 3D point clouds according to a majority voting scheme based on three criteria that compare density, shape and distribution of 3D points. As a by-product, a 4D (space + time) map of the scene can also be generated and visualized. Experimental results obtained with two distinct scenarios (object removal and object displacement) provide both a qualitative and quantitative insight into method accuracy.
Download
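
A sketch of the majority-voting decision, with the three criteria (density, shape, distribution of 3D points) left as placeholder callables; the actual tests are those defined in the paper.

```python
def is_changed(point, cloud_t0, cloud_t1, criteria):
    """Majority vote over three placeholder criteria comparing
    density, shape and distribution of 3D points around `point`
    in the two time-separated clouds."""
    votes = sum(1 for criterion in criteria
                if criterion(point, cloud_t0, cloud_t1))
    return votes >= 2   # at least two of three criteria agree
```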

Paper Nr: 65
Title:

Incorporating Plane-Sweep in Convolutional Neural Network Stereo Imaging for Road Surface Reconstruction

Authors:

Hauke Brunken and Clemens Gühmann

Abstract: Convolutional neural networks, which estimate depth from stereo image pairs in a single step, have recently become the state of the art. The search space for matching pixels is hard-coded in these networks and, in the literature, is chosen to be the disparity space, corresponding to a search along the cameras' viewing direction. In the proposed method, the search space is altered by a plane-sweep approach, reducing the search steps necessary for depth map estimation of flat surfaces. The described method is shown to provide high-quality depth maps of road surfaces in the targeted application of pavement distress detection, where the stereo cameras are mounted behind the windshield of a moving vehicle. It provides a cheap replacement for laser scanning for this purpose.
Download

Paper Nr: 79
Title:

Fast View Synthesis with Deep Stereo Vision

Authors:

Tewodros Habtegebrial, Kiran Varanasi, Christian Bailer and Didier Stricker

Abstract: Novel view synthesis is an important problem in computer vision and graphics. Over the years a large number of solutions have been put forward to solve the problem. However, the large-baseline novel view synthesis problem is far from being "solved". Recent works have attempted to use Convolutional Neural Networks (CNNs) to solve view synthesis tasks. Due to the difficulty of learning scene geometry and interpreting camera motion, CNNs are often unable to generate realistic novel views. In this paper, we present a novel view synthesis approach based on stereo-vision and CNNs that decomposes the problem into two sub-tasks: view dependent geometry estimation and texture inpainting. Both tasks are structured prediction problems that could be effectively learned with CNNs. Experiments on the KITTI Odometry dataset show that our approach is more accurate and significantly faster than the current state-of-the-art.
Download

Paper Nr: 83
Title:

Critical Parameter Consensus for Efficient Distributed Bundle Adjustment

Authors:

Zhuohao Liu, Changyu Diao, Wei Xing and Dongming Lu

Abstract: We present a critical parameter consensus framework to improve the efficiency of Distributed Bundle Adjustment (DBA). Existing DBA methods are based solely on either camera consensus or point consensus, often resulting in excessive local computation time or large data transmission overhead. To address this issue, we jointly partition points and cameras, and perform the consensus on both overlapping cameras and points. Our joint block partitioning method first initializes a non-overlapping block partition, maximizing local problem constraints and ensuring a uniform partition. Then overlapping cameras and points are added in a greedy manner to maximize the partition score that quantifies the efficiency of DBA for local blocks. Experimental results on public datasets show that we can achieve better computational efficiency without loss of accuracy.
Download

Paper Nr: 140
Title:

A Coarse and Relevant 3D Representation for Fast and Lightweight RGB-D Mapping

Authors:

Bruce Canovas, Michele Rombaut, Amaury Negre, Serge Olympieff and Denis Pellerin

Abstract: In this paper we present a novel, lightweight and simple 3D representation for real-time dense 3D mapping of static environments with an RGB-D camera. Our approach builds and updates a low-resolution 3D model of an observed scene as an unordered set of new primitives called supersurfels, which can be seen as elliptical planar patches, generated from superpixel-segmented live RGB-D measurements. While most existing solutions focus on the accuracy of the reconstructed 3D model, the implemented method is well adapted to run on robots with reduced/limited computing capacity and memory size, which do not need a highly detailed map of their environment but can settle for an approximate one.
Download

Paper Nr: 183
Title:

3D Cylinder Pose Estimation by Maximization of Binary Masks Similarity: A Simulation Study for Multispectral Endoscopy Image Registration

Authors:

O. Zenteno, S. Treuillet and Y. Lucas

Abstract: In this paper we address the problem of simultaneous pose estimation for multi-modal registration of images captured using a fiberscope (multispectral) inserted through the instrument channel of a commercial endoscope (RGB). We developed a virtual frame using homography-derived extrinsic parameters obtained from a chessboard pattern to estimate the initial pose of both cameras and to simulate two types of fiberscope movements (i.e., insertion and precession). The fiberscope pose is calculated by maximizing similarity measures between the 2D projection of the simulated fiberscope and the fiberscope tip segmentation from the endoscopic images. We used the virtual frame to generate sets of synthetic fiberscope data at two different poses and compared them after the maximization of similarity. The performance was assessed by measuring the reprojection error of the control points for each pair of images and the absolute pose error in a sequential movement-mimicking scenario. The mean reprojection error was 0.38 ± 0.5 pixels and the absolute error in the tracking scenario was 0.05 ± 0.07 mm.
Download
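
A sketch of the core criterion, assuming IoU as the similarity measure between the rendered and segmented binary masks and an exhaustive search over candidate poses; the paper's similarity measures and search strategy may differ.

```python
import numpy as np

def mask_iou(rendered, segmented):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(rendered, segmented).sum()
    if union == 0:
        return 0.0
    return np.logical_and(rendered, segmented).sum() / union

def best_pose(candidate_poses, render, segmented):
    """Exhaustive search: render(pose) must return the binary mask
    of the simulated fiberscope for that pose."""
    return max(candidate_poses, key=lambda p: mask_iou(render(p), segmented))
```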

Paper Nr: 192
Title:

Subpixel Unsynchronized Unstructured Light

Authors:

Chaima El Asmi and Sébastien Roy

Abstract: This paper proposes adding subpixel accuracy to the unsynchronized unstructured light method while achieving high-speed dense reconstruction without any camera-projector synchronization. This enables face scanning, which is notoriously difficult due to involuntary movements of the subject and the limited applicability of 3D scanner approaches such as laser scanners, whether for reasons of speed or eye safety. The unsynchronized unstructured light method achieves this with low-cost hardware and at a high capture and projection frame rate (up to 60 fps). The proposed approach proceeds by complementing a discrete binary coded match with a continuous interpolated code that is matched to subpixel precision. This subpixel matching can even correct erroneous camera-projector correspondences. The obtained results show that highly accurate unfiltered 3D models can be reconstructed even in difficult capture conditions such as indirect illumination, scene discontinuities, or low hardware quality.
Download

Paper Nr: 200
Title:

Fine-grained 3D Face Reconstruction from a Single Image using Illumination Priors

Authors:

Weibin Qiu, Yao Yu, Yu Zhou and Sidan Du

Abstract: 3D face reconstruction has a wide range of applications, but it is still a challenging problem, especially when dealing with a single image. Inspired by recent works in face illumination estimation and face animation from video, we propose a novel three-step method for 3D face reconstruction with geometric details from a single image. First, a coarse 3D face is generated in morphable model space by landmark alignment. Then, using face illumination priors and surface normals generated from the coarse 3D face, we estimate both the illumination condition and the facial texture, enabling the final step, which refines geometric details through a shape-from-shading method. Experiments show that our method outperforms state-of-the-art methods in terms of accuracy and geometric preservation.
Download

Paper Nr: 213
Title:

Performance Characterization of Absolute Scale Computation for 3D Structure from Motion Reconstruction

Authors:

Ivan Nikolov and Claus Madsen

Abstract: Structure from Motion (SfM) 3D reconstruction of objects and environments has become a go-to process when detailed digitization and visualization are needed in the energy and production industries, medicine, culture and media. A successful reconstruction must properly capture the 3D information and must recover the correct absolute scale. SfM is inherently ambiguous with respect to the scale of the scanned objects, so additional data is needed. In this paper we propose a lightweight solution for computing the absolute scale of 3D reconstructions using only a real-time kinematic (RTK) GPS, in contrast to other custom solutions, which require multi-sensor fusion. Additionally, our solution estimates the noise sensitivity of the calculated scale, depending on the precision of the positioning sensor. We first test our solution with synthetic data to determine how the results depend on changes to the capturing setup. We then test our pipeline on real-world data against the built-in solutions of two state-of-the-art reconstruction software packages. We show that our solution gives better results than both state-of-the-art solutions.
Download
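
A sketch of one simple way to recover absolute scale from matched SfM camera centres and RTK-GPS positions: the ratio of RMS spreads about the centroids, i.e. the scale component of a similarity alignment between the two point sets. This is an illustrative estimator, not necessarily the paper's.

```python
import numpy as np

def absolute_scale(sfm_centres, gps_positions):
    """Least-squares scale between two matched (N, 3) point sets,
    computed as the ratio of RMS spreads about their centroids."""
    a = sfm_centres - sfm_centres.mean(axis=0)
    b = gps_positions - gps_positions.mean(axis=0)
    return np.sqrt((b ** 2).sum() / (a ** 2).sum())
```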

Paper Nr: 256
Title:

Contact-less Vital Parameter Determination: An e-Health Solution for Elderly Care

Authors:

Christian Wiede, Julia Richter and Gangolf Hirtz

Abstract: Vital parameters are key indicators of the basic functions of the human body. Without these basic body functions, such as the heartbeat, life is impossible. Therefore, vital parameters are indicators of a person's general medical condition. In recent years, the topic of vital parameter monitoring has been increasingly studied in the field of e-health. Especially the contact-less determination of vital parameters, such as heart rate, respiration rate, oxygen saturation and blood pressure, with consumer cameras offers a variety of advantages. In this work, we present methods to determine the mentioned vital parameters in a contact-less, optical way. Furthermore, we evaluated these methods for use in home environments with respect to elderly care. As a result, the remote determination of heart and respiration rate yields reliable measurements, making the proposed methods ready for application in home environments.
Download

Paper Nr: 17
Title:

Assessing Sequential Monoscopic Images for Height Estimation of Fixed-Wing Drones

Authors:

Nicklas H. Christensen, Frederik Falk, Oliver G. Hjermitslev and Rikke Gade

Abstract: We design a feature-based model to estimate and predict the free height of a fixed-wing drone flying at altitudes up to 100 meters above terrain using stereo vision principles and a one-dimensional Kalman filter. We design this with a single RGB camera to assess the viability of sequential images for height estimation and to identify which issues and pitfalls are likely to affect such a system. The model is tested both on simulation data over flat and varying terrain and on data from a real test flight. Simulation RMSE ranges from 10.7% to 21.0% of maximum flying height. Real estimates vary significantly more, resulting in an RMSE of 27.55% of median flying height for one test flight. The best MAE was roughly 17%, indicating the error to expect from the system. We conclude that feature-based detection appears to be too heavily influenced by noise introduced by the drone and other uncontrollable parameters to be used for reliable height estimation.
Download
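
A minimal sketch of the one-dimensional Kalman filter the abstract mentions, with a constant-height process model; the process and measurement variances q and r are illustrative tuning parameters.

```python
class Kalman1D:
    """Scalar Kalman filter smoothing per-frame height estimates."""

    def __init__(self, x0=0.0, p0=1.0, q=0.01, r=1.0):
        self.x, self.p, self.q, self.r = x0, p0, q, r

    def update(self, z):
        self.p += self.q                  # predict (constant-height model)
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct with measurement z
        self.p *= (1.0 - k)
        return self.x
```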

Paper Nr: 36
Title:

Efficient GPU Implementation of Lucas-Kanade through OpenACC

Authors:

Olfa Haggui, Claude Tadonki, Fatma Sayadi and Bouraoui Ouni

Abstract: Optical flow estimation is an essential component of motion detection and object tracking procedures. It is an image processing algorithm typically composed of a series of convolution masks (approximating the derivatives) followed by 2 × 2 linear systems for the optical flow vectors. Since each stage of the algorithm is a stencil computation, the overhead from memory accesses is expected to be significant and to yield a genuine scalability bottleneck, especially given the complexity of GPU memory configuration. In this paper, we investigate a GPU deployment of an optimized CPU implementation via OpenACC, a directive-based parallel programming model and framework that eases the process of porting codes to a wide variety of heterogeneous HPC hardware platforms and architectures. We explore each of the major technical features and strive for the best performance impact. Experimental results on a Quadro P5000 are provided together with the corresponding technical discussions, taking the performance of the multicore version on an Intel Broadwell-EP as the baseline.
Download
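
Independent of the OpenACC port itself, here is a plain-NumPy sketch of the per-pixel 2 × 2 linear system the abstract refers to, solved over a local window; the window size and derivative filters are illustrative choices.

```python
import numpy as np

def lucas_kanade_at(I1, I2, y, x, win=7):
    """Flow vector at interior pixel (y, x) from the classic
    Lucas-Kanade system A v = b accumulated over a win x win
    window, with plain numpy gradients as derivative filters."""
    h = win // 2
    Ix = np.gradient(I1, axis=1)[y - h:y + h + 1, x - h:x + h + 1].ravel()
    Iy = np.gradient(I1, axis=0)[y - h:y + h + 1, x - h:x + h + 1].ravel()
    It = (I2 - I1)[y - h:y + h + 1, x - h:x + h + 1].ravel()
    A = np.array([[Ix @ Ix, Ix @ Iy],
                  [Ix @ Iy, Iy @ Iy]])
    b = -np.array([Ix @ It, Iy @ It])
    return np.linalg.solve(A, b)   # may be singular in textureless regions
```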

Paper Nr: 53
Title:

Unsupervised Fine-tuning of Optical Flow for Better Motion Boundary Estimation

Authors:

Taha Alhersh and Heiner Stuckenschmidt

Abstract: Recently, convolutional neural network (CNN) based approaches have proven to be successful in optical flow estimation in the supervised as well as the unsupervised training paradigm. Supervised training requires large amounts of training data with task-specific motion statistics; usually, synthetic datasets are used for this purpose. Fully unsupervised approaches are usually harder to train and show weaker performance, although they have access to the true data statistics during training. In this paper we exploit a well-performing pre-trained model and fine-tune it in an unsupervised way using classical optical flow training objectives to learn the dataset-specific statistics. Thus, per-dataset training time can be reduced from days to less than one minute. Specifically, motion boundaries estimated by gradients in the optical flow field can be greatly improved using the proposed unsupervised fine-tuning.
Download

Paper Nr: 104
Title:

Object Detection, Classification and Localization by Infrastructural Stereo Cameras

Authors:

Christian Hofmann, Florian Particke, Markus Hiller and Jörn Thielecke

Abstract: In the future, autonomously driving vehicles will have to navigate in challenging environments. In some situations, their perception capabilities are not able to generate a reliable overview of the environment owing to occlusions. In this contribution, an infrastructural stereo camera system for environment perception is proposed. Similar existing systems only detect moving objects, using background subtraction algorithms and monocular cameras. In contrast, the proposed approach fuses three different algorithms for object detection and classification and uses stereo vision for object localization. The algorithmic concept is composed of a background subtraction algorithm based on Gaussian mixture models, the convolutional neural network "You Only Look Once", as well as a novel algorithm for detecting salient objects in depth maps. The combination of these complementary object detection principles allows the reliable detection of dynamic as well as static objects. An algorithm for fusing the results of the three object detection methods based on bounding boxes is introduced; the proposed fusion algorithm improves the detection results and provides an information fusion. We evaluate the proposed concept on real-world data. Object detection, classification and localization in the real-world scenario are investigated and discussed.
Download
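
A sketch of one plausible box-level ingredient of such a fusion: an IoU test deciding whether detections from different detectors refer to the same object. The paper's actual fusion algorithm is more involved; this only illustrates the overlap criterion.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); detections from
    different detectors exceeding a threshold would be merged."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```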

Paper Nr: 120
Title:

B-SLAM-SIM: A Novel Approach to Evaluate the Fusion of Visual SLAM and GPS by Example of Direct Sparse Odometry and Blender

Authors:

Adam Kalisz, Florian Particke, Dominik Penk, Markus Hiller and Jörn Thielecke

Abstract: In order to account for sensor deficiencies, usually a multi-sensor approach is used in which various sensors complement each other. However, synchronization of highly accurate Global Positioning System (GPS) and video measurements requires specialized hardware that is not straightforward to set up. This paper proposes a full simulation environment for data generation and evaluation of Visual Simultaneous Localization and Mapping (Visual SLAM) and GPS based on free and open software. Specifically, image data is created by rendering a virtual environment to which camera effects such as motion blur and rolling shutter can be added. Consequently, a ground-truth camera trajectory is available and can be distorted via additive Gaussian noise to understand all parameters involved in the use of fusion algorithms such as the Kalman filter. The proposed evaluation framework will be published as open source online at https://master.kalisz.co for free use by the research community.
Download
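
A sketch of the perturbation step described above, assuming additive isotropic Gaussian noise on the ground-truth trajectory to emulate GPS measurements; sigma is an illustrative noise level.

```python
import numpy as np

def simulate_gps(ground_truth, sigma=0.5, seed=0):
    """Additive isotropic Gaussian noise on an (N, 3) ground-truth
    trajectory, emulating noisy GPS fixes for a fusion study."""
    rng = np.random.default_rng(seed)
    return ground_truth + rng.normal(0.0, sigma, ground_truth.shape)
```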

Paper Nr: 146
Title:

Real-time Hand Pose Tracking and Classification for Natural Human-Robot Control

Authors:

Bruno Lima, Givanildo L. N. Júnior, Lucas Amaral, Thales Vieira, Bruno Ferreira and Tiago Vieira

Abstract: We present a human-robot natural interaction approach based on teleoperation through body gestures. More specifically, we propose an interface where the user can use their hand to intuitively control the position and status (open/closed) of a robotic arm gripper. In this work, we employ a 6-DOF (six degrees-of-freedom) industrial manipulator that mimics user movements in real time, positioning the end effector as if the individual were looking into a mirror, entailing a natural and intuitive interface. The controlling hand of the user is tracked using body skeletons acquired from a Microsoft Kinect sensor, while a convolutional neural network recognizes whether the hand is open or closed using depth data. The network was trained on hand images collected from several individuals in different orientations, resulting in a robust classifier that performs well regardless of user location or orientation. There is no need for wearable devices such as gloves or wristbands. We present results of experiments that reveal the high performance of the proposed approach in recognizing both the user's hand position and its status (open/closed), and experiments that demonstrate the robustness and applicability of the proposed approach to industrial tasks.
Download

Paper Nr: 152
Title:

Adaptive SLAM with Synthetic Stereo Dataset Generation for Real-time Dense 3D Reconstruction

Authors:

Antoine Billy, Sébastien Pouteau, Pascal Desbarats, Serge Chaumette and Jean-Philippe Domenger

Abstract: In robotic mapping and navigation, of prime importance today with the trend towards autonomous cars, simultaneous localization and mapping (SLAM) algorithms often use stereo vision to extract 3D information about the surrounding world. Whereas the number of creative methods for stereo-based SLAM is continuously increasing, the variety of datasets is relatively poor and the size of their contents relatively small. This size issue is increasingly problematic: with the recent explosion of deep-learning-based approaches, several methods require a large amount of data. These multiple techniques enhance the precision of both localization and mapping estimation to a point where the accuracy of the sensors used to obtain the ground truth might be questioned. Finally, because today most of these technologies are embedded in on-board systems, power consumption and real-time constraints turn out to be key requirements. Our contribution is twofold: we propose an adaptive SLAM method that reduces the number of processed frames with minimal impact on error, and we make available a synthetic, flexible stereo dataset with absolute ground truth, which allows new benchmarks to be run for visual odometry challenges. This dataset is available online at http://alastor.labri.fr/.
Download

Paper Nr: 157
Title:

Circular Fringe Projection Method for 3D Profiling of High Dynamic Range Objects

Authors:

Jagadeesh K. Mandapalli, Sai S. Gorthi, Ramakrishna S. Gorthi and Subrahmanyam Gorthi

Abstract: Fringe projection profilometry is a widely used active optical method for 3D profiling of real-world objects. Linear fringes with sinusoidal intensity variations along the lateral direction are the most commonly used structured pattern in fringe projection profilometry. The structured pattern, when projected onto the object of interest, gets deformed in terms of phase modulation by the object's height profile. The deformed fringes are demodulated using methods like Fourier transform profilometry to obtain the wrapped phase information, and the unwrapped phase provides the 3D profile of the object. One of the key challenges with conventional linear fringe Fourier transform profilometry (LFFTP) is that the dynamic range of the object height that can be measured is very limited. In this paper we propose a novel circular fringe Fourier transform profilometry (CFFTP) method that uses circular fringes with sinusoidal intensity variations along the radial direction as the structured pattern. A new Fourier transform-based algorithm for circular fringes is also proposed for obtaining the height information from the deformed fringes. We demonstrate that, compared to conventional LFFTP, the proposed CFFTP enables 3D profiling even at low carrier frequencies and at relatively much higher dynamic ranges. The reasons for the increased dynamic range with circular fringes stem from the non-uniform sampling and narrow-band spectrum properties of CFFTP. Simulation results demonstrating the superiority of CFFTP over LFFTP are also presented.
Download
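
A sketch of the structured pattern CFFTP projects: circular fringes with sinusoidal intensity along the radial direction. The carrier frequency f0 is an illustrative value, not the paper's setting.

```python
import numpy as np

def circular_fringes(h, w, f0=0.02):
    """Circular fringe pattern: sinusoidal intensity along the
    radial direction, carrier frequency f0 in cycles per pixel."""
    y, x = np.mgrid[0:h, 0:w]
    r = np.hypot(x - w / 2.0, y - h / 2.0)
    return 0.5 + 0.5 * np.cos(2.0 * np.pi * f0 * r)
```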

Paper Nr: 230
Title:

Rain Nowcasting from Multiscale Radar Images

Authors:

Aniss Zebiri, Dominique Béréziat, Etienne Huot and Isabelle Herlin

Abstract: Rainfall forecasting is a major issue for anticipating severe meteorological events and for agriculture management. Weather radar imaging has been identified in the literature as the best way to measure rainfall over a large domain with fine spatial and temporal resolution. This paper describes two methods that improve rain nowcasting from radar images at two different scales. These methods are further compared to an operational chain relying on only one type of radar observation. The comparison is conducted using regional and local criteria. For both, significant improvements over the original method are quantified.
Download

Paper Nr: 250
Title:

Natural Stereo Camera Array using Waterdrops for Single Shot 3D Reconstruction

Authors:

Akira Furukawa, Fumihiko Sakaue and Jun Sato

Abstract: In this paper, we propose stereo 3D reconstruction from a single image containing multiple water drops. Water drops on a surface, e.g. a camera lens, refract light rays, and the refracted rays roughly converge to a point. This indicates that water drops can be regarded as approximate small lenses. Therefore, sub-images refracted by water drops can be regarded as images taken from different viewpoints. That is, a virtual stereo camera system can be constructed from a single image by using these raindrop characteristics. In this paper, we propose an efficient description of this virtual stereo camera system using water drops. Furthermore, we propose methods for the estimation of the camera parameters and for the reconstruction of the scene. We finally present several experimental results and discuss the validity of the proposed camera model.
Download