Visual Object Detection


1 Visual Object Detection
Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208

2 Visual Object Detection
- Detecting an object in an image
  - output: the location and size of all instances of the object class
- Challenges
  - what is an object?
  - how to describe the object?
  - how likely is an image region an object of interest?
  - how to handle scale changes?
  - how to handle the orientation of the target?
  - how to handle all sorts of visual variabilities?

3 Outline
- Basics in Detection Theory
- Boosting-based Detection
- Feature Template-based Detection
- Deformable Parts Model (DPM) based Detection
- Deep Network based Detection

4 Action and Risk
- Classes: {ω_1, ω_2, ..., ω_c}
- Actions: {α_1, α_2, ..., α_a}
- Loss: λ(α_k | ω_i)
- Conditional risk:
  R(α_k | x) = Σ_{i=1}^{c} λ(α_k | ω_i) p(ω_i | x)
- A decision function α(x) specifies a decision rule.
- Overall risk:
  R = ∫_x R(α(x) | x) p(x) dx
  It is the expected loss associated with a given decision rule.

5 Bayesian Decision and Bayesian Risk
- Bayesian decision:
  α* = argmin_k R(α_k | x)
  This leads to the minimum overall risk. (why?)
- Bayesian risk: the minimum overall risk
  R* = ∫_x R(α* | x) p(x) dx
- Bayesian risk is the best one can achieve.
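
To make the decision rule concrete, here is a minimal sketch (illustrative Python, not from the original slides; all names are assumptions) that computes the conditional risk of each action and picks the Bayes-optimal one:

```python
import numpy as np

def bayes_decision(posteriors, loss):
    """Pick the action minimizing conditional risk R(alpha_k | x).

    posteriors: shape (c,), P(w_i | x) for each class
    loss: shape (a, c), loss[k, i] = lambda(alpha_k | w_i)
    Returns the index of the risk-minimizing action.
    """
    risks = loss @ posteriors  # R(alpha_k|x) = sum_i lambda(k|i) P(w_i|x)
    return int(np.argmin(risks))

# Example: two classes under zero-one loss -> picking the max posterior
posteriors = np.array([0.3, 0.7])
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
print(bayes_decision(posteriors, zero_one))  # -> 1, the higher-posterior class
```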

6 Example: Minimum-error-rate classification
- Let's have a specific example of Bayesian decision
- In classification problems, action α_k corresponds to deciding ω_k
- Let's define a zero-one loss function:
  λ(α_k | ω_i) = 0 if i = k, and 1 if i ≠ k   (i, k = 1, ..., c)
  This means: no loss for correct decisions, and all errors are equally costly
- It is easy to see that the conditional risk equals the error rate:
  R(α_k | x) = Σ_{i≠k} P(ω_i | x) = 1 − P(ω_k | x)
- Bayesian decision rule ⇒ minimum-error-rate classification:
  Decide ω_k if P(ω_k | x) > P(ω_i | x) for all i ≠ k

7 Classifier and Discriminant Functions
- Discriminant functions: g_i(x), i = 1, ..., C; assign ω_i to x
- Classifier:
  x → ω_i if g_i(x) > g_j(x) for all j ≠ i
- Examples:
  - g_i(x) = P(ω_i | x)
  - g_i(x) = P(x | ω_i) P(ω_i)
  - g_i(x) = ln P(x | ω_i) + ln P(ω_i)
- Note: the choice of discriminant function is not unique, but different choices may give equivalent classification results.
- Decision region: the partition of the feature space
  x ∈ R_i if g_i(x) > g_j(x) for all j ≠ i
- Decision boundary: the surface in feature space where the largest discriminant functions tie

8 Miss Detections vs. False Positives
- Two types of errors: miss detections AND false positives
- Reducing one type of error generally increases the other: no free lunch!

9 Visual Detection: Three Key Issues
- Target representation
  - rule-based models
  - shape template-based models
  - image appearance-based models
  - visual feature-based models
- Pattern classification
  - various choices of classifiers
  - training
- Effective search
  - determining the location: scanning all pixel locations
  - determining the scale: scanning the scale space
  - how to make the search faster?

10 Outline
- Basics in Detection Theory
- Boosting-based Detection
- Feature Template-based Detection
- Deformable Parts Model (DPM) based Detection
- Deep Network based Detection

11 Example: front-view face detection
- Locate the faces in an image
- Challenges: large variations in visual appearance due to:
  - scale and/or rotation
  - illumination
  - facial expression
  - partial occlusion

12 Viola-Jones Detector
- Feature: simple Haar wavelet features
- Classifier: AdaBoost feature selection
- Smart ideas to speed things up:
  - integral image
  - cascading classifiers

13 Feature: Haar-like wavelets
- A bank of Haar-like wavelet filters
- Applying a filter at a pixel location produces a feature
- How many such features does a detection window generate?
- How can such features be computed rapidly?

14 A Smart Idea: Integral Image
- The value of the integral image at (x, y) is the sum of all the pixels above and to the left:
  II(x, y) = Σ_{u ≤ x, v ≤ y} I(u, v)
- This is computed only once per image
- Then the sum of all pixels within any rectangular region costs constant time
- Ex: the sum within region D is obtained from the integral-image values at its four corner points (two subtractions and one addition)
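
A minimal sketch of the integral image and the four-corner rectangle sum (illustrative Python, not the original implementation):

```python
import numpy as np

def integral_image(img):
    """II(x, y) = sum of all pixels above and to the left, inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] via four lookups: O(1) per rectangle."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 2) == img[1:4, 1:3].sum()
```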

15 Weak Classifier
- Weak features and weak classifiers
- A weak classifier uses only one simple feature for classification:
  h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
- A weak classifier is a triple (f_j, θ_j, p_j): a feature, a threshold, and a polarity
- Why not combine multiple weak classifiers?

16 AdaBoost for Feature Selection
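
The slide shows the algorithm as a figure; below is a minimal sketch of the boosting loop used for feature selection (decision stumps over scalar feature responses, following the Viola-Jones weight update; illustrative and heavily simplified — the real training searches thresholds far more efficiently):

```python
import numpy as np

def adaboost_select(F, y, T):
    """F: (n_samples, n_features) feature responses; y in {0, 1}; T rounds.

    Each round picks the single feature/threshold/polarity (a weak
    classifier) with the lowest weighted error, then reweights samples
    so later rounds focus on the mistakes. Returns (j, theta, p, alpha)
    tuples defining the selected weak classifiers and their votes.
    """
    n, m = F.shape
    w = np.full(n, 1.0 / n)
    strong = []
    for _ in range(T):
        w = w / w.sum()
        best = None
        for j in range(m):                      # brute-force feature search
            for theta in np.unique(F[:, j]):    # candidate thresholds
                for p in (+1, -1):              # polarity
                    h = (p * F[:, j] < p * theta).astype(float)
                    err = np.sum(w * np.abs(h - y))
                    if best is None or err < best[0]:
                        best = (err, j, theta, p, h)
        err, j, theta, p, h = best
        beta = err / (1.0 - err + 1e-12)
        alpha = np.log(1.0 / (beta + 1e-12))
        w = w * beta ** (1.0 - np.abs(h - y))   # shrink weights of correct samples
        strong.append((j, theta, p, alpha))
    return strong
```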

17 Feature Selection and Combination
- Strong classifier:
  h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, and 0 otherwise
- Does the selection make sense?
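
Continuing the sketch above, applying the strong classifier is a weighted vote of the selected weak classifiers against half the total vote weight (illustrative; `strong` is the list returned by `adaboost_select`):

```python
def strong_classify(x_features, strong):
    """h(x): 1 if the weighted weak votes reach half the total weight."""
    s = sum(alpha * (p * x_features[j] < p * theta)
            for j, theta, p, alpha in strong)
    return int(s >= 0.5 * sum(alpha for *_, alpha in strong))
```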

18 Speeding Up: Attentional Cascade
- Motivation
  - most detection windows contain non-faces
  - thus most computation is wasted
- Idea?
  - can we save computation on non-faces?
  - early rejection? use simple classifiers for screening

19 Designing Cascade
- Design parameters:
  - # of cascade stages
  - # of features for each stage
  - parameters of each stage
- Example: a 32-stage classifier
  - S1: 2 features, detects 100% of faces and rejects 60% of non-faces
  - S2: 5 features, detects 100% of faces and rejects 80% of non-faces
  - S3-5: 20 features each
  - S6-7: 50 features each
  - S8-12: 100 features each
  - S13-32: 200 features each
- Designing a good cascade needs tremendous engineering effort
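
A minimal sketch of how such a cascade evaluates one window (illustrative; each stage is assumed to be a callable built from its own boosted classifier, e.g. `strong_classify` above with that stage's threshold):

```python
def cascade_detect(window, stages):
    """Return True only if the window passes every stage.

    stages: list of callables, each tuned so that almost all true faces
    pass while a large fraction of non-faces is rejected. Most windows
    are non-faces, so most exit after the cheap early stages.
    """
    for stage in stages:
        if not stage(window):
            return False  # early rejection: no further computation spent
    return True
```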

20 Cascade Performance
- A single 200-feature classifier vs. a 10-stage cascade with 20 features per stage
- Similar accuracy, but the cascade is 10 times faster

21 Training Images
- Data collection: positive data and negative data
- Validation set

22 Results

23 Summary
- Advantages
  - simple: easy to implement
  - fast: real-time performance
- Limitations and open problems
  - the cascade is difficult to design
  - cannot handle out-of-plane rotation
  - difficult to handle partial occlusion

24 Outline
- Basics in Detection Theory
- Boosting-based Detection
- Feature Template-based Detection
- Deformable Parts Model (DPM) based Detection
- Deep Network based Detection

25 Viola-Jones Detector for Pedestrian Detection
- Uses 45,000 possible features
- OK results, but still far from satisfactory

26 From Face to Pedestrian Detection
- articulated poses
- various views
- unpredictable clothing

27 Histogram of Gradient Orientations
- Binning of the gradient orientations within a cell
  - quantization of the orientations of the image gradient
  - weighted by the magnitude of the gradient (so not a plain histogram)
- Spatial combination (R-HoG and C-HoG) to form a block
  - the purpose is to normalize the local histograms within the block
  - leads to a normalized descriptor
- A HoG descriptor represents a block

28 HoG Feature
- HoG descriptor dimension (a 36-D vector)
  - 9 bins for the orientation quantization ([0, π))
  - cell size: 8×8 pixels; block size: 2×2 cells
- HoG-based human representation (an array of HoG vectors)
  - detection window size: 64×128 pixels
  - stride (i.e., block overlap): half of the block size (8 pixels)
  - a 7×15 array of blocks, i.e., a 3,780-D vector
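
A simplified HoG sketch with these settings (illustrative Python; the Dalal-Triggs descriptor additionally uses trilinear interpolation and L2-hys normalization, omitted here for brevity):

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9, eps=1e-6):
    """Simplified HoG for a grayscale window (e.g. 128x64 rows x cols).

    Gradient orientations in [0, pi) are quantized into `bins` bins per
    cell, weighted by gradient magnitude; 2x2-cell blocks with a one-cell
    stride are L2-normalized and concatenated -> 7 x 15 x 36 = 3780-D
    for a 64x128 window.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation
    ch, cw = img.shape[0] // cell, img.shape[1] // cell
    b = (ang / np.pi * bins).astype(int) % bins      # hard binning
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            hist[i, j] = np.bincount(b[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    blocks = []
    for i in range(ch - 1):                          # stride = 1 cell = 8 px
        for j in range(cw - 1):
            v = hist[i:i + 2, j:j + 2].ravel()       # 2x2 cells -> 36-D
            blocks.append(v / np.sqrt((v ** 2).sum() + eps ** 2))
    return np.concatenate(blocks)

print(hog_descriptor(np.random.rand(128, 64)).shape)  # (3780,)
```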

29 HoG + Linear SVM
- (a) the average gradient image over the training samples
- (b) each pixel shows the maximum positive SVM weight among the blocks containing it
- (c) each pixel shows the maximum negative SVM weight among the blocks containing it

30 HoG + Linear SVM
- (d) a test image
- (e) its R-HoG descriptor (7×15×36)
- (f) the descriptor weighted by the positive SVM weights
- (g) the descriptor weighted by the negative SVM weights

31 Examples: before clustering
- Search over scale (scaling factor 1.05)
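
A sketch of this multi-scale sliding-window search (illustrative; `score` is assumed to be the trained classifier, e.g. a linear SVM on the HoG descriptor, and nearest-neighbor downscaling stands in for proper image resampling):

```python
import numpy as np

def multiscale_detect(img, score, win=(128, 64), stride=8,
                      factor=1.05, thresh=0.0):
    """Scan a scale pyramid with a fixed-size window.

    Detections are reported as (score, row, col, height, width) in
    original-image coordinates; overlapping hits still need clustering.
    """
    detections, scale = [], 1.0
    while img.shape[0] >= win[0] and img.shape[1] >= win[1]:
        for r in range(0, img.shape[0] - win[0] + 1, stride):
            for c in range(0, img.shape[1] - win[1] + 1, stride):
                s = score(img[r:r + win[0], c:c + win[1]])
                if s > thresh:
                    detections.append((s, r * scale, c * scale,
                                       win[0] * scale, win[1] * scale))
        # downscale by the scale factor (nearest-neighbor for brevity)
        h, w = int(img.shape[0] / factor), int(img.shape[1] / factor)
        rows = (np.arange(h) * factor).astype(int)
        cols = (np.arange(w) * factor).astype(int)
        img = img[rows][:, cols]
        scale *= factor
    return detections
```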

32 Examples
- Clustering needs to be performed to (1) group multiple detections, and (2) reduce false positives
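
Greedy non-maximum suppression is one common way to do this grouping; a minimal sketch (illustrative; boxes are assumed given as (score, y0, x0, y1, x1)):

```python
def nms(detections, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much,
    repeat. detections: list of (score, y0, x0, y1, x1)."""
    def iou(a, b):
        ay0, ax0, ay1, ax1 = a
        by0, bx0, by1, bx1 = b
        iy = max(0.0, min(ay1, by1) - max(ay0, by0))
        ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
        inter = iy * ix
        union = (ay1 - ay0) * (ax1 - ax0) + (by1 - by0) * (bx1 - bx0) - inter
        return inter / union if union > 0 else 0.0
    boxes = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        boxes = [b for b in boxes if iou(best[1:], b[1:]) < iou_thresh]
    return kept
```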

33 Outline
- Basics in Detection Theory
- Boosting-based Detection
- Feature Template-based Detection
- Deformable Parts Model (DPM) based Detection
- Deep Network based Detection

34 Deformable Parts Model
- Large variations in visual appearance challenge object detection
- Such variations are induced by:
  - deformation of the target's shape
  - structural composition
  - large appearance changes
  - view changes
  - etc.
- These variations may be tremendous
- Rigid templates and single deformable models are not able to capture such huge variations
- Part-based deformable models
  - model the structural composition and its variations
  - strong expressive power
  - share computation
  - a rich model

35 Model: features and filters
- Scale-space image representation (an image pyramid)
  - p = (x, y, l) is a position (x, y) in the l-th level of the pyramid
  - H(p) is the raw visual feature pyramid (a tensor)
- Visual features: φ(H, p, w, h)
  - located at p and supported by the w×h subwindow whose top-left corner is p
  - built from the raw features H
  - represented as a vector by stacking the features in the subwindow
  - denoted by φ(p) for short; w and h are predefined parameters
  - φ(p) are the visual observations
- Filters: F
  - a 2D filter of size w×h
  - represented as a 1D vector by stacking its elements
  - concept: weighting the features in the w×h subwindow
  - filter response: ⟨F, φ(p)⟩
  - F is to be learned
- Existence of an object
  - an object is encoded by a filter
  - the response of this filter at p indicates how likely the object exists at p

36 Model: configuration and springs
- The DPM model of an object with n parts is defined by (F_0, P_1, ..., P_n, b)
  - a root filter F_0: covers the entire object at a lower resolution
  - n fine part filters P_i: cover smaller parts at a higher resolution
  - b is a real-valued bias term
- A part filter: P_i = (F_i, v_i, d_i)
  - F_i is the filter for the i-th part
  - v_i is a 2D vector: the anchor position of this part w.r.t. the root
  - d_i is a 4D vector: coefficients of the deformation cost
  - displacement: (dx_i, dy_i) = (x_i, y_i) − (2(x_0, y_0) + v_i)
  - define φ_d(p_i, p_0) = φ_d(dx_i, dy_i) = (dx_i, dy_i, dx_i², dy_i²)
- Configuration and springs
  - configuration: (p_0, p_1, p_2, ..., p_n)
  - "spring": the strength of the spring for part i is d_i
- A star topology
  - every part is connected to the root
  - no connections among parts
- The configuration is to be inferred (estimated) for each image
- The springs are to be learned from all training images

37 Model: parameters and the linear model
- Model parameters: Λ = (F_0, F_1, ..., F_n, d_1, ..., d_n, b)
- Model observations (evidence): Y = φ(p)
- Model target variable: X = p_0, the location of the root
- Model latent variable: Z = (p_1, ..., p_n), the part locations
- To evaluate a complete hypothesis (X, Z):
  s(X, Z | Y) = s(p_0, p_1, ..., p_n)
              = Σ_{i=0}^{n} ⟨F_i, φ(p_i)⟩ − Σ_{i=1}^{n} ⟨d_i, φ_d(p_i, p_0)⟩ + b
- Another way: a linear form
  s(X, Z | Y) = ⟨β, ψ(X, Z)⟩
  where
  β = [F_0, ..., F_n, d_1, ..., d_n, b]
  ψ(X, Z) = [φ(p_0), ..., φ(p_n), −φ_d(p_1, p_0), ..., −φ_d(p_n, p_0), 1]
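
A minimal sketch of this linear form (illustrative; `phi_parts` and `disp` are assumed to come from the feature pyramid and a chosen configuration):

```python
import numpy as np

def build_psi(phi_parts, disp):
    """psi(X, Z) = [phi(p_0),...,phi(p_n), -phi_d(p_1,p_0),...,-phi_d(p_n,p_0), 1].

    phi_parts: list of feature vectors, one per filter (root first)
    disp: list of (dx, dy) displacements of each part from its anchor
    The minus sign folds the deformation *cost* into one dot product.
    """
    phi_d = [np.array([dx, dy, dx * dx, dy * dy]) for dx, dy in disp]
    return np.concatenate(phi_parts + [-v for v in phi_d] + [np.ones(1)])

def dpm_score(beta, psi):
    """Score of a complete hypothesis (X, Z) in the linear form <beta, psi>."""
    return float(np.dot(beta, psi))
```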

38 Inference
- The objective of inference is to find
  s(p_0 | Y) = max_{(p_1, ..., p_n)} s(p_0, p_1, ..., p_n | Y)
- For each part, define
  D_i(p_0) = max_{p_i} { F_i^T φ(p_i) − d_i^T φ_d(p_i, p_0) }
- Then it is easy to see:
  s(p_0) = F_0^T φ(p_0) + Σ_{i=1}^{n} D_i(p_0) + b
- D_i(p_0) is the maximum contribution of the i-th part to the score of the root at p_0 (i.e., the optimal subpath property)
- This is a very simple dynamic programming problem
- Part localization is done via back-tracking in the DP:
  p_i* = argmax_{p_i} { F_i^T φ(p_i) − d_i^T φ_d(p_i, p_0) }

39 Inference: Computing D_i(p_0)
- The key in DPM inference is to compute
  D_i(p_0) = max_{p_i} { F_i^T φ(p_i) − d_i^T φ_d(p_i, p_0) }
- The first term, F_i^T φ(p_i):
  - the response map of the part filter F_i
  - independent of the root location p_0
  - easy to compute
- The second term, d_i^T φ_d(p_i, p_0):
  - the penalty for placing p_i given a root position p_0
  - easy to compute as well
- The major issue is the maximization over p_i
  - considering all possible choices of p_i is linear per root location, but wastes a lot of computation
- This is implemented via a generalized distance transform
  - it yields a transformed response map, turning the part filter response map F_i^T φ(p_i) into D_i(p_0)
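
To make the definition concrete, here is a naive 1-D sketch of the transformed response map (illustrative; the actual DPM code computes the same quantity in linear time with the generalized distance transform — a lower envelope of parabolas — and handles 2-D as two 1-D passes):

```python
import numpy as np

def transformed_response(R, d):
    """Naive 1-D version of D_i: D[p0] = max_p R[p] - d[0]*dx - d[1]*dx^2.

    R: part-filter response map along one axis; d = (d1, d2) are the
    deformation coefficients. This O(n^2) loop just spells out the
    definition; `best` keeps the argmax for DP back-tracking.
    """
    n = len(R)
    D = np.full(n, -np.inf)
    best = np.zeros(n, dtype=int)
    for p0 in range(n):
        for p in range(n):
            dx = p - p0
            s = R[p] - d[0] * dx - d[1] * dx * dx
            if s > D[p0]:
                D[p0], best[p0] = s, p
    return D, best
```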

40 Inference: process

41 Learning: Latent SVM
- Consider a classifier (discriminative function) of the following form:
  f_β(x) = max_{z ∈ Z(x)} β^T Φ(x, z)
  where β is the vector of model parameters and z are the latent values
- The set Z(x) defines the domain of z for a given x
- Classification is based on the sign of f_β(x)
- Given training data D = ((x_1, y_1), ..., (x_n, y_n)), where y_i ∈ {−1, 1}, minimize the objective function
  L_D(β) = (1/2) ||β||² + C Σ_{i=1}^{n} max(0, 1 − y_i f_β(x_i))
  where max(0, 1 − y_i f_β(x_i)) is the standard hinge loss and C controls the regularization.
- Note: if |Z(x_i)| = 1 for all i, this degenerates to a linear SVM.

42 Learning: Solving Latent SVM
- Denote by Z_p the latent values specified for the positive training samples
  - for a positive example, set Z(x_i) = {z_i}, where z_i is the latent value specified for x_i by Z_p
- Define an auxiliary objective function L_D(β, Z_p) = L_{D(Z_p)}(β):
  L_D(β, Z_p) = (1/2) ||β||² + C Σ_i max(0, 1 − y_i f_β(x_i)),
  with the positives' latent values fixed by Z_p
- Property: L_D(β) = min_{Z_p} L_D(β, Z_p), i.e., L_D(β, Z_p) bounds the LSVM objective from above
- Now, minimize L_D(β, Z_p) instead, by alternating two steps:
  1. Relabeling positive examples: optimize L_D(β, Z_p) over Z_p by selecting the highest-scoring latent value for each positive example:
     z_i = argmax_{z ∈ Z(x_i)} β^T Φ(x_i, z)
  2. Estimating β: optimize L_D(β, Z_p) over β by solving the convex optimization problem defined by L_{D(Z_p)}(β)
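
A schematic sketch of this coordinate descent (illustrative only; `infer_z` and `fit_beta` are assumed callables: the former computes argmax_z β^T Φ(x, z), the latter solves the standard convex SVM with the positives' latent values fixed):

```python
def train_latent_svm(data, infer_z, fit_beta, beta, rounds=10):
    """Alternate relabeling and convex SVM training for latent SVM.

    data: list of (x, y) with y in {-1, +1}
    infer_z(beta, x): best-scoring latent value for x under beta
    fit_beta(data, z_pos): minimizer of L_{D(Zp)}(beta)
    """
    for _ in range(rounds):
        # Step 1: relabel positives with their best-scoring latent values
        z_pos = {i: infer_z(beta, x)
                 for i, (x, y) in enumerate(data) if y == +1}
        # Step 2: hold Z_p fixed and solve the (now convex) problem for beta
        beta = fit_beta(data, z_pos)
    return beta
```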

43 Some DPM models

44 DPM on PASCAL 2007 Dataset

45 Outline
- Basics in Detection Theory
- Boosting-based Detection
- Feature Template-based Detection
- Deformable Parts Model (DPM) based Detection
- Deep Network based Detection

46 Rowley-Baluja-Kanade's Detector
- Trains a multilayer neural network (1998)
- Receptive fields
- An early attempt at using neural networks for face detection
- Nowadays, deep networks for face detection are tremendous

47 Some Results
