Robust Detection, Classification and Positioning of Traffic Signs from Street-Level Panoramic Images for Inventory Purposes


Robust Detection, Classification and Positioning of Traffic Signs from Street-Level Panoramic Images for Inventory Purposes

Lykele Hazelhoff and Ivo Creusen
CycloMedia Technology B.V., Achterweg 38, 4181 AE Waardenburg, The Netherlands
lhazelhoff@cyclomedia.com, icreusen@cyclomedia.com

Peter H.N. de With
Eindhoven University of Technology, Den Dolech 2, 5600 MB Eindhoven, The Netherlands
p.h.n.de.with@tue.nl

Abstract

Accurate inventories of traffic signs are required for road maintenance and for increasing road safety. These inventories can be performed efficiently based on street-level panoramic images. However, this is a challenging problem, as the images are captured under a wide range of weather conditions, occlusions and sign deformations occur, and many sign look-alike objects exist. Our approach is based on detecting the signs present in panoramic images, both to derive a classification code and to combine multiple detections into an accurate position of the signs. It starts with detecting the signs present in each panoramic image. Then, all detections are classified to obtain the specific sign type, where also false detections are identified. Afterwards, detections from multiple images are combined to calculate the sign positions. The performance of this approach is extensively evaluated on a large geographical region, where over 85% of the 3,341 signs are automatically localized, with only 3.2% false detections. As nearly all missed signs are detected in at least a single image, only very limited manual interaction is required to safeguard the performance for highly accurate inventories.

1. Introduction

Nowadays, several companies record street-level panoramic images, which provide a recent and accurate overview of the road infrastructure. Within The Netherlands, these images are captured by private companies (e.g. CycloMedia Technology and Google), where each public road is recaptured annually. The resulting image databases enable efficient inventories of street furniture to support maintenance and cost control. Computer vision techniques facilitate the automatic creation of such inventories and thereby reduce human interaction compared to manual inventories, where all objects are searched and annotated by hand. These inventories are of interest to governmental organizations tasked with road maintenance. Traffic signs are of particular interest, as their presence directly influences road safety. They require accurate and up-to-date inventories, as the sign visibility may be degraded due to e.g. vandalism, vegetation coverage, aging and accidents. This paper describes a framework for road-sign inventories based on computer vision techniques, aiming at retrieving the sign code and position of all traffic signs in a region.

Although traffic signs are designed to attract visual attention, automatic detection and classification of road signs is a complicated problem for several reasons. The first is related to capturing from a driving vehicle: the signs are captured from a wide range of distances, large viewpoint deviations exist, and signs may be occluded by e.g. other road users. Furthermore, capturing outdoors implies varying weather conditions, including fog. The second complication comes from the sign features: signs vary in size, there are many similar traffic signs, which are sometimes custom versions of official signs, and some signs are designed to contain custom text or symbols.
Moreover, the visibility of traffic signs may be lowered due to the aforementioned reasons, while especially these signs are of importance for sign maintenance. Thirdly, many sign look-alike objects exist, including directives for customer parking or restrictions for dog access, which are not traffic signs. Examples of these complicating factors are displayed in Fig. 1.

1.1. Related work

In the literature, detection and recognition of traffic signs has been studied for many years. For example, [15] describes cascade detection of speed signs within a large-scale project, achieving a detection rate of 98.7%. However, that paper only addresses detection of a single sign type. This is also the case for [1], where the image is prefiltered with a color version of the Viola-Jones algorithm, followed by analysis of Histogram of Oriented Gradient features with a neural network.

Figure 1. Examples of factors complicating detection and classification of traffic signs. (a)-(e): occlusions; (f)-(i): lowered sign visibility; (j)-(m): sign-like objects; (n)-(o): official sign with a custom version.

The authors test their system for triangular warning signs, and report that 260 of the 265 present signs are correctly identified, of which 241 signs are correctly classified. Examples of systems focusing on multiple sign types include [12], where images captured from a car-mounted camera are exploited for sign detection. After threshold-based color segmentation and shape analysis, the sign type is recognized based upon a grayscale version of the located blobs. The authors report that all 104 signs are detected at least twice. Color segmentation is also exploited for extraction of sign regions in [13]. Afterwards, the shape of the sign in each blob is extracted and classified with Support Vector Machines, exploiting the known sign shape. It is reported that 98 of the 102 signs are detected at least once. These proposals detect and classify traffic signs based on single images. A tracking system is proposed in [10], reducing the false alarms by tracking the signs over the frames. The authors report that all 6 signs are detected correctly. Another approach is described by Timofte et al. [16], where a van with 8 cameras is employed for capturing. Their method employs both single-image and multi-view analysis. A fraction of 95.6% of the 269 signs are positioned correctly, and 97.7% of the detected signs are also successfully recognized.

1.2. Our approach

This paper presents an approach for performing large-scale inventories of traffic signs based on computer vision techniques, where signs are detected, classified and positioned. This is a very challenging problem, and we have experienced that the performance of state-of-the-art algorithms for detection and classification is insufficient for a fully automated inventory in real-world conditions and at a large scale, similar to what Frome et al. [6] discuss for face detection. Therefore, we aim at a system supporting both automatic and semi-automatic inventories. This paper describes the automatic version, where we concentrate on seven different sign classes, covering 92 different sign types. The involved classes are displayed in Fig. 2. Instead of constructing a custom capturing device, as is done in some proposals in the literature, we exploit the already existing street-level panoramic images. These images are captured on all public roads with a calibrated recording system at a capturing interval of 5 meters. The capturing cars are typically employed efficiently, resulting in images captured under a very wide range of weather conditions, including even fog, which makes the problem even more challenging. Sign appearances vary greatly across the images due to the large variation in weather conditions and due to differences between camera systems. Therefore, instead of focusing on color, which is common in the literature, we investigate color gradients, since they are more robust against these variations. Next to this, we aim at a generic, learning-based system, as this allows adaptation to other sign appearances, e.g. in other countries. Our system consists of three stages: sign detection, sign classification and sign positioning. At first, the signs are grouped into classes, e.g.
red triangular, and their generic properties are exploited for detection. Due to the genericity of this stage, customized versions of standard signs can also be found, which is beneficial in the semi-automatic approach. For detection, a custom variant of the popular Histogram of Oriented Gradients [5] is applied, which operates on color gradient information to exploit the characteristic sign colors. Then, the minor differences between the signs are analyzed, and all detections are classified using a variant of the popular Bag of Words (BoW) technique [4]. The standard BoW approach is modified both to filter out falsely detected signs and to deal with the large similarities between the signs within a class. Afterwards, in the sign positioning stage, the sign positions are calculated by combining the detections across multiple images. The performance of this inventory system is evaluated by a large-scale experiment, where an inventory is applied to a large geographical region containing over 3,340 traffic signs. We should note that this validation size is rather uncommon in the related literature, as is the fact that we also take signs not directly located along the road into account. Besides this, the performance of the individual detection and classification stages is also assessed. The remainder of the paper is organized as follows. Section 2 contains the system overview. Section 3 describes the sign detection stage, Section 4 describes the classification stage, followed by the positioning procedure in Section 5. The performance evaluation can be found in Section 6, followed by the conclusions in Section 7.

Figure 2. System overview of our inventory system, showing the three stages (sign detection, sign classification and sign positioning) and the seven sign classes: red triangular, give-way, red circular, redblue circular, no-entry, blue circular and yellow diamond signs, with an overlap analysis and correction step between the red circular and redblue circular classes.

2. System overview

The system overview of our inventory process is depicted in Fig. 2. The system consists of three primary modules, which are briefly described below.

1. Sign detection: At first, each panoramic image is analyzed and present signs are detected by multiple, independent detectors, each focusing on a specific class of signs. These detectors are kept very generic to allow detection of distorted signs and sign-like objects. As some detectors focus on quite similar sign classes, their output may overlap. These overlapping samples are analyzed and a detection fusion step is applied to assign the correct class label.

2. Sign classification: During the sign detection stage, all detected signs are assigned a sign class label, e.g. red triangular. Next, each detection is assigned a sign code, such as warning sign for a dangerous crossing. As some detections have an insufficient resolution for classification, these small samples are not classified, but assigned a class-specific wildcard. Furthermore, at this stage, the false findings given by the detectors are identified by inclusion of an additional class in the classification procedure, representing the false detections of the respective sign class. These detections are also assigned the wildcard code. Classes containing only a single sign type are not subject to classification.

3. Sign positioning: In the sign positioning stage, detections from multiple images are combined to calculate the position of the traffic signs based on the geometric properties of our source data. For this, each image is combined with all nearby images. Next, hypotheses of sign positions are obtained by pairwise combinations of detections with an identical sign code, where combinations of sign codes and wildcards give supporting evidence. Then, the final sign positions are obtained by clustering these hypotheses.

3. Sign detection

The first stage of our inventory system consists of localizing traffic signs within the individual panoramic images. As many traffic signs have similar color and shape (such as all blue circular direction signs), similar signs are grouped into sign classes (such as blue circular) and detection is performed for each class independently. The class division is displayed in Fig. 2. Since traffic signs are intended to attract attention based on their colors and shape, many traffic sign detection systems in the literature start by color filtering of the image, extracting regions with colors corresponding to the signs. However, we have found that the color and contrast of the signs varies significantly with the capturing conditions, and therefore we instead exploit color differences and shape information, which we have found to be more consistent over the varying circumstances. We apply detectors based on the popular Histogram of Oriented Gradients (HOG) algorithm, originally proposed by Dalal and Triggs [5]. As the standard algorithm extracts the maximum gradient over the color channels for each pixel, it neglects the correlation of gradients over the color channels, and thereby neglects the discriminative color of the traffic signs. We have extended the standard HOG approach with the use of color information, as described in [3].
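For illustration, the sketch below contrasts the standard HOG choice (keeping, per pixel, only the gradient of the color channel with the largest magnitude) with a color-aware variant that keeps an orientation histogram per color channel, so that the per-channel gradient structure is preserved. This is a minimal sketch under our own assumptions (cell size, bin count, simple finite differences), not the exact implementation of [3] or [5].

```python
import numpy as np

def cell_orientation_histograms(img, n_bins=9, cell=8):
    """Gradient-orientation histograms per cell and per color channel.

    img: array of shape (H, W, C) with C color channels.
    Returns an array of shape (H//cell, W//cell, C, n_bins): one histogram per
    channel, keeping the per-channel gradients (color-aware variant). A standard
    HOG formulation would instead keep only the channel with the largest gradient
    magnitude at each pixel.
    """
    img = np.asarray(img, dtype=np.float64)
    H, W, C = img.shape
    gy, gx = np.gradient(img, axis=(0, 1))           # simple finite differences
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

    out = np.zeros((H // cell, W // cell, C, n_bins))
    for cy in range(H // cell):
        for cx in range(W // cell):
            ys = slice(cy * cell, (cy + 1) * cell)
            xs = slice(cx * cell, (cx + 1) * cell)
            for c in range(C):
                # accumulate gradient magnitude into the orientation bins of this cell/channel
                np.add.at(out[cy, cx, c], bins[ys, xs, c].ravel(), mag[ys, xs, c].ravel())
    return out
```

In the detector described next, such per-cell histograms are normalized with respect to their neighbors, concatenated over a window of cells into one feature vector, and scored with a linear SVM in a sliding-window fashion at multiple scales.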

The modified HOG algorithm works as follows. First, the image is divided into cells of 8x8 pixels, where for each of these cells a histogram of the gradient orientation is calculated. These histograms are normalized w.r.t. adjacent histograms. Next, a sliding window is moved over the cells, covering 5x5 cells, and all included histograms are concatenated. As we perform detection on color images, the histograms of all the color channels are appended. The resulting 1,200-dimensional feature vector is used for classification by means of a linear Support Vector Machine (SVM). Since multiple classes of signs are detected independently from each other, the same features are exploited for all classes, where each class is found by an individual SVM. As the feature extraction stage is the most time-consuming task, adding additional classes does not affect processing time significantly. This procedure is repeated at multiple scales to obtain scale invariance, resulting in detections with a size ranging from 24x24 pixels up to 400x400 pixels, corresponding to a typical sign-to-car distance ranging from about 19 to 1 meters, respectively.

Although the different detectors operate independently, cross-overlap may exist between the detectors for the different classes, especially when the visual difference is low. This causes signs to be detected by multiple detectors, which is especially the case for the red circular and redblue circular signs, which are both circular with a red border. Therefore, we employ a specific step to distinguish between these two classes. Each sample detected by both detectors is analyzed and assigned a single class label. Since we strive for a generic system, we apply a learning-based approach, exploiting the differences in color distribution between the sign classes. This method first transforms the input samples into the HSV color space, then extracts a color histogram from the signs, followed by classification based on a linear SVM. Afterwards, each sample is assigned the appropriate class label, which is exploited during the classification.
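As an illustration of this fusion step, the sketch below re-labels samples found by both the red circular and redblue circular detectors using an HSV color histogram and a linear SVM, as just described. The function names, bin counts and SVM settings are assumptions made for the sketch, not the authors' implementation.

```python
import cv2
import numpy as np
from sklearn.svm import LinearSVC

def hsv_histogram(patch_bgr, bins=16):
    """L1-normalized, concatenated per-channel HSV histogram of a detected patch.
    Note: OpenCV stores 8-bit hue in [0, 179]; the fixed range keeps the code simple."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hists = [np.histogram(hsv[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    v = np.concatenate(hists).astype(np.float64)
    return v / max(v.sum(), 1.0)

def train_fusion_svm(patches_bgr, labels):
    """Train on samples detected by both detectors.
    labels: 0 = red circular, 1 = redblue circular (manually assigned)."""
    X = np.vstack([hsv_histogram(p) for p in patches_bgr])
    return LinearSVC(C=1.0).fit(X, np.asarray(labels))

def resolve_overlapping_detection(svm, patch_bgr):
    """Assign a single class label to a sample claimed by both detectors."""
    return int(svm.predict(hsv_histogram(patch_bgr)[None, :])[0])
```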
4. Sign classification

Each detector focuses on detecting a specific sign class, which typically consists of more than one sign type. Therefore, the detector output is analyzed to obtain the sign code, e.g. danger, crossing children, where detectors that directly locate a single sign type are not subject to classification. Whereas the detection stage exploits the generic characteristics of a sign class, such as the sign borders, the classification stage should discriminate between signs based upon the very minor differences within the inner template, as visualized in Fig. 4. This complicates the classification task, especially since the resolution of the discriminative part of the signs is quite low. Therefore, we ignore samples with insufficient resolution and assign them a wildcard code (its use is explained later), and only classify samples larger than 40x40 pixels.

Figure 4. Examples of signs with only very minor differences.

The employed classification approach is based on Bag of Words (BoW) [4], and is described in [7]. For completeness, we briefly describe the key features of the system below. The original BoW approach [4] represents each image by a histogram containing the occurrence frequency of the elements of the visual dictionary. These elements, called visual words, are obtained by clustering the features extracted from training samples, in order to obtain features that occur frequently in all training samples. However, these words may not be the most discriminative [8], which will especially be the case when words occur in all sign types, e.g. representing the sign borders. Therefore, we altered the construction of the visual dictionary: we generate a separate dictionary for each sign type to ensure that characteristics common to that sign type are extracted. The individual dictionaries are concatenated into a modular codebook. This approach also enables handling of unbalanced training data and allows for easy addition of extra classes without recomputing the complete codebook. As our sign detector also outputs false detections, we have extended the modular codebook with a large dictionary representing them. This prevents false detections from being assigned to a random class and instead maps them to the false detection class. The resulting modular codebook is visualized in Fig. 5 and is used as if it were the regular codebook in the standard BoW approach.

Figure 5. Visualization of the modular codebook, containing the concatenation of individual visual dictionaries.

Based on this modular codebook, new samples are classified as follows. First, the input sample is normalized to a standard size. Since the sign class is known, irrelevant parts are removed based on a standard shape, similar to [13]. As the pattern is more discriminative than the color, each sample is converted to a single-channel image, where for almost all classes grayscale is used. However, the redblue circular signs are converted to a custom color space, given by the difference between the red and blue color channels, as the red rim and blue inner part have similar grayscale values. Then, SIFT features [11] are extracted using dense sampling, which is shown to outperform the use of interest point operators [14]. We have applied SIFT, as these features are invariant against the commonly occurring rotation and illumination changes. Each extracted feature is matched against the modular codebook, and a word histogram is constructed, which is used for classification based on linear SVMs in a one-vs-all fashion. When none or multiple classifiers recognize the sample, we consider it unreliable and assign it a wildcard code.
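The following sketch outlines the modular-codebook idea under our own assumptions: a small k-means dictionary is built per sign type (plus a larger one for false detections), the dictionaries are concatenated, densely sampled descriptors are quantized against the concatenated codebook, and the resulting word histogram is fed to one-vs-all linear SVMs, with a wildcard returned when no classifier or more than one classifier fires. Dictionary sizes, the descriptor handling and the decision-rule details are illustrative, not the authors' exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_modular_codebook(descriptors_per_type, words_per_type=50, false_words=200):
    """descriptors_per_type: dict mapping a sign type (or 'false') to an (N_i, D) array
    of local descriptors. A separate dictionary is clustered per sign type; the
    false-detection class gets a larger one. All dictionaries are concatenated."""
    parts = []
    for sign_type, descs in descriptors_per_type.items():
        k = false_words if sign_type == "false" else words_per_type
        km = KMeans(n_clusters=k, n_init=10).fit(descs)
        parts.append(km.cluster_centers_)
    return np.vstack(parts)                                  # (total_words, D)

def word_histogram(codebook, descriptors):
    """Quantize each local descriptor to its nearest visual word; L1-normalized histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def classify(svms, codebook, descriptors, wildcard="wildcard"):
    """One-vs-all decision: exactly one positive SVM -> its sign code, otherwise a wildcard."""
    h = word_histogram(codebook, descriptors)[None, :]
    positives = [code for code, svm in svms.items() if svm.decision_function(h)[0] > 0]
    return positives[0] if len(positives) == 1 else wildcard

# Training of the one-vs-all SVMs (one per sign code, including the false-detection class),
# given word histograms X_hist and sign codes y:
#   svms = {code: LinearSVC().fit(X_hist, (y == code).astype(int)) for code in set(y)}
```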

5. Sign positioning

After the classification stage, all detections are assigned a sign code, either corresponding to a specific sign type or to a class-specific wildcard. Next, the positions of the signs are calculated by combining the detections corresponding to the same physical sign appearing in multiple images. For this, we exploit the fact that our source data is geometrically correct, i.e. the angular orientations are linear with the pixel coordinates. This is achieved by extensive calibration of the capturing system. Furthermore, each panoramic image is divided into two equal parts by both the northern direction and the horizon, as visualized in Fig. 3. As a result, when two points corresponding to the same object are known in two images, the position of that object can be calculated by straightforward geometric calculations.

Figure 3. Example panoramic image including a visualization of the geometric properties of the panorama, where ϕn denotes the angle w.r.t. the northern direction and ϕh the angle w.r.t. the horizon.

The sign positions are retrieved per sign class, where detections from individual classes are combined with other detections, either having an identical sign code or a wildcard code. Although the detections themselves can be correlated, multiple identical signs can result in wrong correspondences. Furthermore, the capturing interval of 5 meters causes large differences in perspective, scale and background, thereby complicating pixel correspondences. Therefore, we only exploit the geometric information, where all combined pairs give a position estimate, which is passed through when this estimate is closer than 45 meters w.r.t. both images. Each estimate gives a hypothesis of the sign location, and these hypotheses cluster around the real sign position, as visualized in Fig. 6. These clusters are recovered using the MeanShift algorithm [2]. Afterwards, the clusters are processed from large to small cardinality, where only clusters containing at least 3 detections are taken into account. For each cluster, the position estimates given by the combination of all contained detections should be close to the cluster center; when this is not valid, the detection with the largest mean position deviation w.r.t. this center is removed. The same rule applies when multiple detections from the same image are present. After a cluster is accepted, all contained detections are removed, and the procedure is repeated.

Figure 6. Visualization of the sign positioning process, where two identical give-way signs are present (indicated by the give-way symbols). The positions of the capturings are displayed by blue circles, the position estimates given by the pairwise combination of all detections are drawn as red crosses, and the actual positions of the traffic signs are indicated by the two sign symbols.
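A minimal sketch of this pairwise geometric step is given below, under our own assumptions about the panorama convention (a 360-degree horizontal field of view with north at the horizontal image center) and the coordinate frame (metric east/north ground-plane coordinates of the capture locations). It converts a detection's horizontal pixel coordinate to a bearing, intersects two bearings to obtain a position hypothesis, applies the 45-meter gate, and clusters the hypotheses with MeanShift; the clustering bandwidth is assumed, and the cluster refinement rules described above are omitted for brevity.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import MeanShift

def pixel_to_bearing(x_pix, image_width):
    """Horizontal pixel -> bearing w.r.t. north (radians).
    Assumed convention: the panorama spans 360 degrees, north at the horizontal centre."""
    return (x_pix / image_width - 0.5) * 2.0 * np.pi

def intersect_bearings(p1, theta1, p2, theta2):
    """Least-squares intersection of two ground-plane rays.
    p1, p2: (east, north) capture positions in metres; theta*: bearings w.r.t. north."""
    d1 = np.array([np.sin(theta1), np.cos(theta1)])
    d2 = np.array([np.sin(theta2), np.cos(theta2)])
    A = np.column_stack([d1, -d2])                       # solve p1 + t1*d1 = p2 + t2*d2
    b = np.asarray(p2, float) - np.asarray(p1, float)
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.asarray(p1, float) + t[0] * d1

def position_hypotheses(detections, max_dist=45.0):
    """detections: list of (capture_position (east, north), bearing, sign_code) tuples.
    Pairwise combination of detections with identical sign codes (wildcard matching
    is omitted here); estimates farther than 45 m from either capture are dropped."""
    hyps = []
    for (pa, ta, ca), (pb, tb, cb) in combinations(detections, 2):
        if ca != cb:
            continue
        est = intersect_bearings(pa, ta, pb, tb)
        if np.linalg.norm(est - pa) < max_dist and np.linalg.norm(est - pb) < max_dist:
            hyps.append(est)
    return np.array(hyps)

def cluster_sign_positions(hypotheses, bandwidth=2.0, min_support=3):
    """Cluster position hypotheses with MeanShift; keep clusters with at least 3 members."""
    ms = MeanShift(bandwidth=bandwidth).fit(hypotheses)
    centers = []
    for label, center in enumerate(ms.cluster_centers_):
        if np.sum(ms.labels_ == label) >= min_support:
            centers.append(center)
    return np.array(centers)
```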

Figure 7. Recall-precision curves for the individual traffic sign detectors. The classes indicated correspond to the classes in Fig. 2.

Figure 8. Recall-precision curves showing the effect of the detection fusion stage for the red and redblue circular signs.

6. Experiments and results

The described system is employed to perform inventories of multiple geographical regions, containing both rural and city environments. The output is manually verified, and the resulting ground-truth data is employed to assess the performance of the detection and classification stages of the system. Furthermore, the performance of the complete inventory system is evaluated for a representative geographical area, containing multiple towns within a rural surrounding.

6.1. Detection performance analysis

The performance of the sign-detection module is evaluated on about 33,000 panoramic images, each containing at least a single traffic sign. This set covers both rural and city environments, and all present signs are manually annotated. However, it should be noted that the ground truth was constructed subjectively, so that there is some mismatch for the smaller signs, which may be detected but not annotated. The set is processed within 8.5 hours on a cluster of 4 computers, and the resulting recall-precision curves for the different sign classes are shown in Fig. 7. As can be seen, the detectors are able to localize over 90% of the traffic signs in the individual panoramic images. This figure includes the detection fusion step for the red circular and redblue circular signs. The effect of this step is displayed in Fig. 8, which clearly shows that the additional fusion stage significantly reduces the number of false detections, especially for the redblue circular signs. This indicates that there is indeed a large overlap between the classes. As the percentage of detected signs is not significantly affected, our approach successfully assigns the correct class label in most cases.

6.2. Classification performance analysis

The performance of our classification module is evaluated using the detector output on a large set of panoramic images, which is constructed such that a maximum number of different sign types is included with sufficient cardinality. The minimum size of the considered detections is 40x40 pixels, and occurring false detections are included. The classification performance is analyzed using 10-fold cross-validation, as this approaches the system performance when all samples have been subject to training [9]. Since our test set contains multiple samples of the same physical sign, captured from different viewpoints, these occurrences are forced into the same partition to prevent testing with the training set. Table 1 summarizes the key performance numbers for the classes of interest. Classification of an unseen sample takes about 1 second. It can be concluded that our classification approach successfully discriminates between the different sign types for all sign classes, where about 97% of the returned sign types are correct (neglecting the samples classified as unreliable), regardless of whether 2 or 25 different sign types are involved.
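The partitioning constraint mentioned above (all crops of the same physical sign kept in one partition) corresponds to group-aware cross-validation. A minimal sketch, assuming each sample carries an identifier of the physical sign it was cropped from, is given below; the names and the choice of scikit-learn's GroupKFold are ours, not necessarily the authors' tooling.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import LinearSVC

def grouped_cross_validation(X, y, sign_ids, n_splits=10):
    """10-fold cross-validation where samples sharing a physical-sign id never appear
    in both the training and the test partition of the same fold.
    X: (N, D) feature array, y: (N,) labels, sign_ids: (N,) physical-sign identifiers."""
    accuracies = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=sign_ids):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))
```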
Next to this, only about 1-1.5% of the samples are incorrectly classified and around 1-2% are classified as unreliable. Furthermore, false detections are identified with high accuracy, while only a few real samples are labeled as background. Also, we have noted that skewed signs, which do occur quite often in practice, do not influence the classification accuracy, since rotation-invariant features are used. We are aware of the fact that this disables discrimination between rotated instances of identical signs, but this approach can be followed without complications, as the officially approved signs do not contain such instances.

6.3. Inventory system performance analysis

The performance of the complete inventory system is evaluated for a geographical region containing several towns within a rural surrounding intersected by a highway. Capturings are taken every 5 meters, covering all public roads within this region, resulting in about 147,000 panoramic images. The described inventory system is applied and the results are manually verified, where all traffic signs are considered for which the front side is visible in at least a single panoramic image. Table 2 lists, for the different classes, the number of correctly found signs, missed signs, falsely detected signs and signs obtained with an incorrect sign code.

Sign class:
#Total detections (signs+background)    17,344  19,319   7,430  10,296  4,251
#Signs detected (true positives)        15,938  15,935   6,752   7,565  3,627
#Signs correctly classified             14,843  15,309   6,661   7,049  3,508
#Signs falsely classified                  259     144      24      92      5
#Signs classified as unreliable            710     217      38     134      9
#Signs classified as background            126     265      29     290    105
#Background detected (false positives)   1,406   3,384     678   2,731    624
#Background correctly classified         1,224   3,161     649   2,568    585
#Background classified as sign              43     104      24      44     33
#Background classified as unreliable       139     119       5     119      6

Table 1. Key performance numbers of our classification module.

As can be noticed, over 85% of the present signs are localized correctly. We have observed that the performance varies slightly over the sign classes, but is significantly lower for the red circular signs. Therefore, we separately analyze the performance of this class and of the other classes below. Most signs obtained with wrong sign codes correspond to signs located at some distance to the road, and to signs with degraded visibility, e.g. damaged or besmeared signs, thereby complicating discrimination between similar sign types. Moreover, as shown in Fig. 9(a)-9(e), some red circular signs contain a custom template with an arbitrary number, contributing to a lowered classification score.

The falsely detected signs are mainly caused by two reasons. The first comes from GPS inaccuracies, where signs located along roads that are captured in both directions are identified twice, e.g. at 0.25 meter apart. Fusion of both signs is not a solution, since two identical signs may be present, as shown in Fig. 9(f). Second, objects very similar to traffic signs exist, which is especially the case for the red circular signs: there are not only custom prohibition signs, but also the red letter O is often recognized as a traffic sign. Examples of these are displayed in Fig. 9(g)-9(j). For the other sign classes, the number of falsely detected signs is quite low, indicating that our classification module successfully filters out almost all present false detections.

Figure 9. Examples. (a)-(e): identical signs with metadata; (f): example of two nearby, identical signs; (g)-(j): sign-like objects.

When analyzing the missed signs, we have noted that almost all missed signs (except for two very degraded signs) are detected in at least a single image. However, our approach requires 3 detections, which may be prevented by e.g. occlusion by other traffic, a sign orientation parallel to the road, or a sign position far away from the road. The latter is especially the case for the blue circular signs when they indicate the start or end of a bicycle path, as displayed in Fig. 10(a). For the red circular signs, another phenomenon causes the rather large number of missed signs. We have noted that framed speed signs, for which an example is displayed in Fig. 10(b), have a very small red circle, which complicates detection from a distance. This causes more than half of the missed signs for this class. Detection of the frame and sign combination would form a possible solution.

As a single detection is obtained for almost all signs, and over 85% of the signs are automatically located with a correct sign code, highly accurate inventories can be realized by the addition of a limited amount of manual interaction. This consists of checking all detections that are classified as a sign but are not part of a localized sign, and of checking all located signs for correctness.
Both checks can be performed efficiently, and allow for the addition of other added-value attributes, including subsign texts and sign states (skewed, stickered, etc.).

7. Conclusions and future work

This paper has described an inventory system for traffic signs from street-level panoramic images, which is a challenging problem as capturing conditions vary, signs may be deformed and many sign look-alike objects exist. The system starts with localizing the signs in the individual images, using independent detectors for the different sign classes. Then, each detection is classified to obtain the sign code, where also falsely detected signs are identified. Afterwards, detections from multiple images are combined to calculate the sign positions. The performance of the proposed system is evaluated by performing an inventory of a large geographical region,

where over 85% of the 3,341 signs are correctly localized. Despite the high number of sign look-alike objects, only a limited number of objects is falsely detected as a sign. Furthermore, nearly all missed signs are detected in at least a single image, where position retrieval is mainly limited by the capturing interval of 5 meters. As this performance is achieved at a large scale for a complete geographic region, where also signs not located directly along the road are taken into account, we consider this an accurate result, especially since signs may be damaged, besmeared or partly occluded. By allowing a limited amount of manual interaction, a highly accurate inventory can be realized, with additional added values such as indications about the sign state and possible subsign texts.

In the future, we will extend the system with additional sign types, including framed (speed) signs and subsigns. Furthermore, we will perform additional validation experiments, possibly including an evaluation with a lower capturing interval of e.g. 2.5 m.

Sign class:   Total signs   Correct signs   Wrong sign code   Falsely det. signs   Missed signs
                      703             627                35                   13             41
                      508             483                 0                   12             25
                      733             616                31                   32             86
                      698             489                69                   40            140
                       96              88                 0                    1              8
                      323             301                 2                    9             20
                      280             257                 1                    1             22
Total               3,341           2,861               138                  108            342

Table 2. Performance overview of the complete inventory system for the seven different sign classes.

Figure 10. Examples of specific situations. (a): missed sign, located far away from the capturing locations (there are no capturings present at the bicycle path); (b): framed speed sign.

References

[1] I. Bonaci, I. Kusalic, I. Kovacek, Z. Kalafatic, and S. Segvic. Addressing false alarms and localization inaccuracy in traffic sign detection and recognition. In 16th Computer Vision Winter Workshop, pages 1-8, 2011.
[2] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603-619, 2002.
[3] I. M. Creusen, R. G. J. Wijnhoven, E. Herbschleb, and P. H. N. de With. Color exploitation in HOG-based traffic sign detection. In Proc. IEEE International Conference on Image Processing (ICIP), pages 2669-2672, September 2010.
[4] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Proc. European Conference on Computer Vision (ECCV), May 2004.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886-893, June 2005.
[6] A. Frome, G. Cheung, A. Abdulkader, M. Zennaro, B. Wu, A. Bissacco, H. Adam, H. Neven, and L. Vincent. Large-scale privacy protection in Google Street View. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 2373-2380, October 2009.
[7] L. Hazelhoff, I. M. Creusen, D. W. J. M. van de Wouw, and P. H. N. de With. Large-scale classification of traffic signs under real-world conditions. In Proc. SPIE 8304B-34, 2012.
[8] F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 604-610, 2005.
[9] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 1995.
[10] S. Lafuente-Arroyo, S. Maldonado-Bascon, P. Gil-Jimenez, J. Acevedo-Rodriguez, and R. Lopez-Sastre. A tracking system for automated inventory of road signs. In IEEE Intelligent Vehicles Symposium, pages 166-171, June 2007.
[11] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2), January 2004.
[12] S. Maldonado-Bascon, S. Lafuente-Arroyo, P. Gil-Jimenez, H. Gomez-Moreno, and F. Lopez-Ferreras. Road-sign detection and recognition based on support vector machines. IEEE Transactions on Intelligent Transportation Systems, 8(2):264-278, June 2007.
[13] S. Maldonado-Bascon, S. Lafuente-Arroyo, P. Siegmann, H. Gomez-Moreno, and F. Acevedo-Rodriguez. Traffic sign recognition system for inventory purposes. In IEEE Intelligent Vehicles Symposium, pages 590-595, June 2008.
[14] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. In Proc. European Conference on Computer Vision (ECCV), pages 490-503. Springer, 2006.
[15] G. Overett and L. Petersson. Large scale sign detection using HOG feature variants. In IEEE Intelligent Vehicles Symposium (IV), pages 326-331, June 2011.
[16] R. Timofte, K. Zimmermann, and L. Van Gool. Multi-view traffic sign detection, recognition, and 3D localisation. In IEEE Workshop on Applications of Computer Vision (WACV), pages 1-8, December 2009.