STATIONARY FEATURES AND FOLDED HIERARCHIES FOR EFFICIENT DETECTION
Donald Geman, joint work with François Fleuret

INTRODUCTION: DETECTION WITH A HIERARCHY OF CLASSIFIERS

Advantages:
- Highly efficient scene parsing;
- Organized focusing on hard examples.

Disadvantages:
- Many classifiers to learn;
- Inefficient learning (unless training examples can be synthesized).
INTRODUCTION: RELATED WORK

Part-based models:
- Constellation (M. Weber, M. Welling, P. Perona);
- Composite (D. J. Crandall, D. P. Huttenlocher);
- Implicit Shape (E. Seemann, M. Fritz, B. Schiele);
- Patchwork of Parts (Y. Amit, A. Trouvé);
- Compositional (S. Geman).

Wholistic:
- Convolutional Networks (Y. LeCun);
- Boosted Cascades (P. Viola, M. Jones);
- Bag-of-Features (A. Zisserman, F.-F. Li).

INTRODUCTION: POSE SPACE

Let Y be the space of poses and Y_1, ..., Y_K a partition of it. Let I be an image and Y = (Y_1, ..., Y_K) where, for each k, Y_k is a Boolean variable stating whether there is an object in I with pose in Y_k.

For faces, a pose is typically y = (u_c, v_c, theta, s). Cats are more complex; a coarse description is y = (u_h, v_h, s_h, u_b, v_b) (head location and scale, belly location).

TRAINING A DETECTOR: FRAGMENTATION

Given a training set {(I^(t), Y^(t))}_t, we are interested in building, for each k, a classifier

    f_k : I -> {0, 1}

for predicting Y_k. Without additional knowledge about the relationship between k, Y_k and I, we would train f_k with {(I^(t), Y_k^(t))}_t. Hence each f_k is trained with only a fragment (roughly 1/K) of the positive population.
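The fragmentation effect can be sketched numerically. Everything below (the scalar pose space in [0, 1), the uniform partition into K cells) is an illustrative assumption, not part of the talk:

```python
# Sketch of fragmentation: partition a scalar pose space into K cells,
# then count how many positive examples each per-cell classifier f_k
# would get to train on. All names here are made up for illustration.
import random

def pose_cell(pose, K):
    """Map a pose in [0, 1) to one of K partition cells Y_1, ..., Y_K."""
    return min(int(pose * K), K - 1)

rng = random.Random(0)
K = 16
positive_poses = [rng.random() for _ in range(1000)]

per_cell = [0] * K
for p in positive_poses:
    per_cell[pose_cell(p, K)] += 1

# On average, each f_k sees only a 1/K fraction of the positive population.
avg_fraction = sum(per_cell) / K / len(positive_poses)
print(avg_fraction)  # 1/16 = 0.0625
```

With K in the thousands, as needed for a fine pose partition, each f_k is left with a handful of positives, which is the problem the talk addresses.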
TRAINING A DETECTOR: AGGREGATION

To avoid fragmentation, samples are often normalized in pose. Let xi : I -> R^N denote an N-dimensional vector of features, and let psi : {1, ..., K} x I -> I be a transformation such that the conditional distribution of xi(psi(k, I)) given Y_k = 1 does not depend on k.

Then train a single classifier g with the training set

    {(xi(psi(k, I^(t))), Y_k^(t))}_{t,k}

and define f_k(I) = g(xi(psi(k, I))). Here samples are aggregated for training: all positive samples are used to build each f_k.

However:
- Evaluating psi is computationally intensive for any non-trivial transformation;
- The mapping psi does not exist for a complex pose.

Hence, in practice, fragmentation is still the norm for dealing with many deformations. But how does one deal with complex poses?

Instead, we directly index the features with the pose: rather than composing

    {1, ..., K} x I --psi--> I --xi--> R^N,

we define a single mapping X from {1, ..., K} x I to R^N.
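The identity X(k, I) = xi(psi(k, I)) can be made concrete in the simplest case, where the pose is a pure translation: psi crops a window, xi reads its pixels, and the pose-indexed X reads the same pixels directly without ever building the normalized image. All names below are illustrative:

```python
# For a translation pose, psi is a crop, xi is a trivial feature map, and
# the pose-indexed X short-circuits the composition xi(psi(k, I)).
W = 4  # window width (illustrative)

def psi(k, I):
    """Pose normalization: crop the window starting at position k."""
    return I[k:k + W]

def xi(J):
    """Feature vector of a normalized image (here, just its pixels)."""
    return tuple(J)

def X(k, I):
    """Pose-indexed features: index the image directly with the pose k."""
    return tuple(I[k + i] for i in range(W))

I = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
assert all(X(k, I) == xi(psi(k, I)) for k in range(len(I) - W + 1))
print(X(2, I))  # (4, 1, 5, 9)
```

The point of the talk is that X can be designed even when no normalizing psi exists, as long as the stationarity condition below holds.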
POSE-INDEXED FEATURES: DEFINITION

Hence, we define a family of pose-indexed features as a mapping

    X : {1, ..., K} x I -> R^N

which is stationary in the following sense: for every k in {1, ..., K}, the probability distribution

    P(X(k, I) = x | Y_k = 1),  x in R^N,

does not depend on k.

TOY EXAMPLE: 1D SIGNAL

    Y = {(theta_1, theta_2) in {1, ..., N}^2 : 1 < theta_1 < theta_2 < N}

    P(I | Y = (theta_1, theta_2)) = prod_{n < theta_1} phi_0(I_n) * prod_{theta_1 <= n <= theta_2} phi_1(I_n) * prod_{theta_2 < n} phi_0(I_n)

[Two sample signals of length 30 drawn from the model.]

    X((theta_1, theta_2), I) = (I(theta_1 - 1), I(theta_1), I(theta_2), I(theta_2 + 1))

    P(X = (x_1, x_2, x_3, x_4) | Y = (theta_1, theta_2)) = phi_0(x_1) phi_1(x_2) phi_1(x_3) phi_0(x_4)

TRAINING

We then define

    f_k(I) = g(X(k, I)),  k in {1, ..., K},

where one single classifier g : R^N -> {0, 1} is trained with {(X(k, I^(t)), Y_k^(t))}_{t,k}.

Notice that stationarity is the main condition required to ensure that the training samples are identically distributed.
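The 1D toy model can be simulated directly. The stand-in distributions phi_0 and phi_1 below (Gaussians around 0 and 1) are an assumption; the slides do not specify them:

```python
# Simulate the 1D toy model: I_n ~ phi_1 inside [theta1, theta2], phi_0
# outside, and read the pose-indexed feature vector at the boundaries.
import random

N = 30

def sample_signal(theta1, theta2, rng):
    """Draw I_n from phi_1 inside [theta1, theta2] and phi_0 outside."""
    return [1.0 + rng.gauss(0, 0.3) if theta1 <= n <= theta2
            else rng.gauss(0, 0.3) for n in range(N)]

def X(pose, I):
    """Pose-indexed features: the two pixels straddling each boundary."""
    theta1, theta2 = pose
    return (I[theta1 - 1], I[theta1], I[theta2], I[theta2 + 1])

# Whatever the pose, x_1 and x_4 come from phi_0 and x_2, x_3 from phi_1,
# so the conditional law of X given the object's presence does not depend
# on (theta_1, theta_2): this is exactly the stationarity property.
ramp = list(range(N))
print(X((10, 20), ramp))  # (9, 10, 20, 21)
```

Because the law of X(k, I) given Y_k = 1 is the same for every pose cell, the pairs (X(k, I^(t)), Y_k^(t)) can all be pooled to train the single classifier g.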
EDGE DETECTORS

Let e_{phi,sigma}(I, u, v) denote the presence of an edge in image I at location (u, v), with orientation phi in {0, ..., 7}, at scale sigma in {1, 2, 4}.

[Edge maps at scales sigma = 1, sigma = 2 and sigma = 4.]
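A crude stand-in for e_{phi,sigma} at scale sigma = 1 quantizes the finite-difference gradient direction into 8 orientation bins; the talk's actual detector is not specified here, and larger scales would smooth the image first:

```python
# Illustrative e_{phi,1}(I, u, v): an edge with orientation bin phi is
# present when the gradient magnitude exceeds a threshold and the gradient
# direction falls in bin phi of 8. Threshold and scheme are assumptions.
import math

def edge(I, u, v, phi, thresh=0.5):
    gx = I[v][u + 1] - I[v][u - 1]
    gy = I[v + 1][u] - I[v - 1][u]
    if math.hypot(gx, gy) < thresh:
        return False  # too flat to be an edge
    angle = math.atan2(gy, gx) % (2 * math.pi)
    return int(angle / (2 * math.pi) * 8) % 8 == phi

# A vertical step edge: the gradient points along +x, i.e. orientation bin 0.
I = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
print(edge(I, 1, 1, phi=0))  # True
```

Thresholded, quantized edges of this kind are robust primitives for the histogram-based features described next.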
BASE FEATURES

We use three types of stationary features:
- Proportion of an edge (phi, sigma) in a window W;
- L1-distance between the histograms of edge orientations at scale sigma in two windows W and W';
- L1-distance between the histograms of gray-levels in two windows W and W'.

The windows W and W' are indexed either with respect to the head (head registration) or with respect to the head-belly pair (head-belly registration).

CLASSIFIER

We build classifiers with AdaBoost and an asymmetric weighting by sampling. If the weighted error rate is

    L(h) = sum_{t,k} omega_{t,k} 1{h(k, I^(t)) != Z_k^(t)},

we use all the positive samples, and sub-sample negative ones with

    mu(k, t) proportional to omega_{t,k} 1{Y_k^(t) = 0}.

FOLDED HIERARCHIES: SUMMARY

Total of 2327 scenes containing 1683 cats, 85% used for training. Experiments (not shown) demonstrate:
- Fragmentation applied to complex poses is disastrous in practice;
- Naive, brute-force exploration of the pose space is computationally intractable.

Alternatively:
- Stationary features largely avoid fragmentation;
- Hierarchical search concentrates computation on ambiguous regions.

We call this alternative a folded hierarchy of classifiers.
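The asymmetric weighting by sampling used in the classifier can be sketched as follows; `resample` and the toy data are hypothetical, but the mechanism follows the slide: keep every positive sample, and draw negatives with probability proportional to their boosting weights omega:

```python
# Keep all positives, sub-sample negatives proportionally to omega.
import random

def resample(samples, labels, omega, n_neg, rng):
    """Return all positives plus n_neg negatives drawn with probability
    proportional to their boosting weights."""
    positives = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    neg_w = [w for w, y in zip(omega, labels) if y == 0]
    return positives + rng.choices(neg, weights=neg_w, k=n_neg)

samples = list(range(10))
labels  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
omega   = [1, 1, 8, 1, 1, 1, 1, 1, 1, 1]  # sample 2 is a hard negative
out = resample(samples, labels, omega, n_neg=4, rng=random.Random(1))
print(out[:2])  # the positives [0, 1] are always kept
```

Sampling rather than reweighting keeps each boosting round cheap while still focusing the weak learners on the hard negatives.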
FOLDED HIERARCHIES: SCENE PARSING STRATEGY

[Diagram of the scene parsing strategy with the classifiers g_1 and g_2.]

SCENE PARSING: ERROR CRITERION

SCENE PARSING: ERROR RATES

Average number of false alarms per 640 x 480 image (standard deviation in parentheses), at several true-positive rates:

    TP     H+B             HB
    85%    -               11.29 (3.61)
    80%    17.23 (3.87)     5.07 (1.08)
    70%    11.40 (2.89)     1.88 (0.32)
    60%     7.80 (1.98)     0.95 (0.22)
    50%     5.41 (1.62)     0.53 (0.13)
SCENE PARSING: ERROR RATES (CONT.)

[ROC curves for the H, B and HB detectors: true-positive rate (0 to 1) against the average number of false alarms per 640x480 image (0.001 to 100, log scale).]

SCENE PARSING: RESULTS (PICKED AT RANDOM)

SCENE PARSING: RESULTS (PICKED AT RANDOM, CONT.)

SCENE PARSING: RESULTS (SELECTED FALSE ALARMS)
CONCLUSION

A folded hierarchy of classifiers is highly efficient:
- For (offline) learning;
- For (online) scene processing.

It combines the strengths of:
- Template matching;
- Powerful, wholistic machine learning;
- Hierarchical search.

However, this is achieved at the expense of:
- Rich annotation of the training samples;
- Designing stationary features.

ACKNOWLEDGMENT