PCA FACE RECOGNITION. The slides draw on several sources: James Hays (Brown), Srinivasa Narasimhan (CMU), Silvio Savarese (U. of Michigan), and Shree Nayar (Columbia), including their own slides.
Goal of Principal Components Analysis We wish to explain/summarize the underlying variance-covariance structure of a large set of variables through a few linear combinations of these variables.
Rotate Coordinate Axes. Measure M points X_1, ..., X_M in an N-dimensional Cartesian coordinate system (M may be larger or smaller than N). Find N orthogonal axes pointing in the directions of greatest variability. This is accomplished by rotating the original axes.
Algebraic Interpretation (1D). Given M points in an N-dimensional space, how does one project onto a (say) one-dimensional space? Choose the line that fits the data so that the points are maximally spread out along it.
Assume the line passes through zero, i.e., the mean of all points has already been subtracted. We want axes x such that the variance of the given (zero-mean) points decreases as we go along the axes; Bx is (MxN)(Nx1). As we go from the first x to the N-th x, each axis corresponds to a smaller variance of the points in its direction. The last x corresponds to the TLS solution, minimizing the distance to the rest of the space.
Algebraic Solution. The algebraic solution starts from the relation below and has N solutions in x (L2 norm): maximize x^T B^T B x subject to x^T x = 1, with mutually orthogonal x's. B is the matrix with the points along its rows, MxN (number of points x coordinates per point, row i holding point i transposed); x is the unknown line, an Nx1 column vector.
Algebraic Solution. Rewriting this, with e a scalar: x^T B^T B x = e = e x^T x = x^T (e x), which holds iff x^T (B^T B x - e x) = 0. The value of x^T B^T B x is obtained each time by satisfying B^T B x = e x with x^T x = 1. Find the e's and associated x's such that the matrix B^T B applied to x yields the same x, scaled by e: the x's are eigenvectors and the e's are eigenvalues. All eigenvectors are mutually orthogonal and, if the eigenvalues are distinct, form a new N-dimensional basis.
Problem: Size of the Covariance Matrix A. Each data point has N coordinates, and the covariance matrix is B^T B = A, so A is NxN and the number of eigenvectors is N. Example: N = 256x256 pixels = 65536 in vector form, so A is 65536 x 65536 with 65536 eigenvectors. Typically, only 20-30 eigenvectors suffice. So this method is very inefficient!
Efficient Computation of Eigenvectors. If B is MxN with M << N (M = number of images, N = number of coordinates per point), then A = B^T B is NxN, much larger than MxM. Use B B^T (MxM) instead; an eigenvector of B B^T is easily converted to one of B^T B: (B B^T) y = e y implies B^T (B B^T) y = e (B^T y), i.e., (B^T B)(B^T y) = e (B^T y), so B^T y is an eigenvector of B^T B.
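The conversion above can be sketched in a few lines of numpy. The sizes M = 5, N = 100 and the random data are illustrative assumptions, not from the slides:

```python
# Small-matrix eigenvector trick: eigenvectors of the small MxM matrix
# B B^T are converted to eigenvectors of the large NxN matrix B^T B
# by multiplying with B^T.
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 100
B = rng.standard_normal((M, N))      # one (mean-subtracted) image per row

small = B @ B.T                      # MxM instead of NxN
evals, Y = np.linalg.eigh(small)     # columns of Y: eigenvectors y

V = B.T @ Y                          # B^T y: eigenvectors of B^T B
V /= np.linalg.norm(V, axis=0)       # renormalize to unit length

# Check (B^T B) v = e v for the largest eigenpair.
v, e = V[:, -1], evals[-1]
print(np.allclose(B.T @ B @ v, e * v))   # True
```

Note that renormalizing B^T y changes only its length, so it remains an eigenvector.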
PCA: Ignoring Eigenvectors. You can decide to ignore the components of lesser significance. You will lose some information, but if their eigenvalues are small, you don't lose more than 2-5%. With N dimensions in your data: calculate the N eigenvectors and eigenvalues, choose only the first p eigenvectors, and the final data set has only p dimensions. The matrix B goes from M x N to M x p, where M is the number of points.
2D example of PCA what we have to achieve
Step 1: Subtract the mean values. Covariance values are not affected by subtracting the mean.
Step 2: Calculate the 2x2 covariance matrix:

    | 0.616555556  0.615444444 |
    | 0.615444444  0.716555556 |

Since the off-diagonal elements of this covariance matrix are positive, we should expect that the x and y variables increase together.
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix:

    eigenvalues:  1.28402771   0.049083398

    eigenvectors (first, second, as columns):
    |  0.677873399  -0.735178656 |
    |  0.735178656   0.677873399 |

The eigenvalues are listed in decreasing order.
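Step 3 can be checked numerically from the Step 2 covariance matrix; only the matrix itself comes from the slides, while `eigh` and the reordering are implementation choices:

```python
# Eigen-decomposition of the Step 2 covariance matrix.
import numpy as np

cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])

evals, evecs = np.linalg.eigh(cov)          # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder into decreasing order

print(evals)   # approximately [1.28402771, 0.049083398], as on the slide
```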
Principal components overlaid on the data. Here the mean is still subtracted.
1D Reconstruction: along the eigenvector with the larger eigenvalue.
Face Recognition: applications. Digital photography, surveillance, album organization, person tracking/identification, emotions and expressions, security/warfare, teleconferencing, etc.
Space of Faces. An image is a point in a high-dimensional space. For example, an N x M image is a point in R^(NM), i.e., a point in the vectorized space. [Thanks to Chuck Dyer, Steve Seitz, Nishino]
Image space to face space: a linear approach. Compute the k-dimensional subspace such that the projection of the data points onto it has the largest variance among all k-dimensional subspaces, i.e., maximize the scatter of the training images in face space.
Eigenfaces [Turk and Pentland '91]. The images in the possible set {x̂} live in a Z-dimensional original vector space and are highly correlated. Compress them to a low-dimensional subspace that captures the key appearance characteristics of the visual features, using PCA to estimate the subspace. Two faces are compared in this subspace by measuring the Euclidean distance between them. Among the first face-recognition algorithms whose success was recognized outside computer vision; a linear approach, improved later.
Projecting onto the Eigenfaces. Each eigenface v_i is Zx1 dimensional. The eigenfaces v_1, ..., v_K span the space of faces; a face is converted to eigenface coordinates by projecting onto each of them.
Training Algorithm (here N images and a Z-dimensional vector space, not M images!):
1. Align the training images x_1, x_2, ..., x_N. Note that each image is unrolled into a long vector!
2. Compute the average face u = (1/N) Σ x_i.
3. Compute the difference images φ_i = x_i - u, i = 1, ..., N.
Algorithm (each of the N "points" is a column, not a row!):
4. Compute the covariance matrix (total scatter matrix) S_T = (1/N) Σ φ_i φ_i^T = B B^T, where B = [φ_1, φ_2, ..., φ_N].
5. Compute the eigenvectors of the covariance matrix S_T.
6. Compute the training projections in the subspace: a_1, a_2, ..., a_N, each of dimension k << Z.
Testing:
1. Take a query image X.
2. Project X into the eigenface space W = {eigenfaces}: compute the projection ω = W_(1...k)^T (X - u).
3. Compare the k-dimensional projection ω with all N training projections.
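A minimal sketch of steps 1-6 and the test procedure, using random vectors as stand-ins for aligned face images; all sizes, the data, and the tiny-noise query are illustrative assumptions:

```python
# Eigenfaces sketch: training steps 2-6 and the testing projection.
import numpy as np

rng = np.random.default_rng(1)
N, Z, k = 20, 64, 5                  # N images, Z pixels each, k eigenfaces

X = rng.standard_normal((Z, N))      # one image per COLUMN, as in the slides
u = X.mean(axis=1, keepdims=True)    # 2. average face
Phi = X - u                          # 3. difference images
# 4./5. eigenvectors of S_T = (1/N) B B^T via the small N x N matrix
small = Phi.T @ Phi
evals, Y = np.linalg.eigh(small)     # ascending eigenvalues
W = Phi @ Y[:, ::-1][:, :k]          # top-k eigenfaces, one per column
W /= np.linalg.norm(W, axis=0)
A = W.T @ Phi                        # 6. training projections (k x N)

# Testing: project a query image and find the nearest training projection.
x_query = X[:, [0]] + 0.01 * rng.standard_normal((Z, 1))
omega = W.T @ (x_query - u)
nearest = np.argmin(np.linalg.norm(A - omega, axis=0))
print(nearest)                        # 0: the query matches training image 0
```

The small-matrix trick from the earlier slide is reused here: eigenvectors of the N x N matrix Phi^T Phi are mapped through Phi into eigenfaces.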
Reconstruction and Errors (k = 4, k = 200, k = 400). Selecting only the top k eigenfaces reduces the dimensionality; fewer eigenfaces mean more information loss and less discrimination between faces.
Limitations. PCA assumes that the data has a distribution characterized by a mean µ and covariance matrix Σ. Example: the shape of this dataset is not well described by its principal components. Slide credit: S. Lazebnik
The space of faces is not convex: the average of two faces is not another face.
How Do Humans Detect Faces? We do not know yet! Some conjectures: a memory-prediction model (match faces against the face model in memory); parallel computing (detect faces at multiple location/scale combinations).
Face Detection in Computers. Basic idea: slide windows of different sizes across the image; at each location, match the window against a face model.
Basic Framework. For each window: extract features F and match them against the face model, outputting yes/no. Features: which features represent faces well? Classifier: how do we construct/match the face model?
Characteristics of Good Features: they discriminate face from non-face, and they are extremely fast to compute, since tens of thousands of windows must be evaluated in an image.
The Viola/Jones Face Detector. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001. P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004. A paradigmatic method for real-time object detection: training is slow, but detection is very fast. Three ideas interact: integral images for fast feature evaluation; boosting for feature selection; an attentional cascade for fast rejection of non-face windows.
Integral Image. A table that holds the sum of all pixel values to the left of and above a given pixel, inclusive. For example:

    Image:
     98 110 121 125 122 129
     99 110 120 116 116 129
     97 109 124 111 123 134
     98 112 132 108 123 133
     97 113 147 108 125 142
     95 111 168 122 130 137
     96 104 172 130 126 130

    Integral image:
     98  208  329  454  576  705
    197  417  658  899 1137 1395
    294  623  988 1340 1701 2093
    392  833 1330 1790 2274 2799
    489 1043 1687 2255 2864 3531
    584 1249 2061 2751 3490 4294
    680 1449 2433 3253 4118 5052
Summation Within a Rectangle. Fast summation of an arbitrary rectangle using the integral image (II) above, computed in constant time with only 4 references: the entry at the bottom-right corner P, the entry Q above the top-right corner, the entry S left of the bottom-left corner, and the entry R at the diagonal corner:

    Sum = II_P - II_Q - II_S + II_R = 3490 - 1137 - 1249 + 417 = 1521
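The slides' image and integral image can be used to check the four-reference sum; the `rect_sum` helper below is an illustrative implementation, not from the slides:

```python
# Integral image and constant-time rectangle sums on the slides' 7x6 image.
import numpy as np

img = np.array([[ 98, 110, 121, 125, 122, 129],
                [ 99, 110, 120, 116, 116, 129],
                [ 97, 109, 124, 111, 123, 134],
                [ 98, 112, 132, 108, 123, 133],
                [ 97, 113, 147, 108, 125, 142],
                [ 95, 111, 168, 122, 130, 137],
                [ 96, 104, 172, 130, 126, 130]])

ii = img.cumsum(axis=0).cumsum(axis=1)   # integral image

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] with 4 references: P - Q - S + R."""
    s = ii[r1, c1]                       # P: bottom-right
    if r0 > 0: s -= ii[r0 - 1, c1]       # Q: above the top-right
    if c0 > 0: s -= ii[r1, c0 - 1]       # S: left of the bottom-left
    if r0 > 0 and c0 > 0:
        s += ii[r0 - 1, c0 - 1]          # R: diagonal corner
    return s

# The rectangle from the slide (rows 3-6, columns 3-5 in 1-based terms):
print(rect_sum(ii, 2, 2, 5, 4))          # 1521 = 3490 - 1137 - 1249 + 417
```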
Boosting: designing a strong classifier from a set of weak classifiers. [Figure: "background" and "computer screen" examples separated by a decision boundary in some feature space.]
Boosting defines a classifier using an additive model: the strong classifier is a weighted sum of weak classifiers applied to the feature vector. We need to define a family of weak classifiers. This yields a simple algorithm for learning robust classifiers.
Boosting - mathematics. Example of a weak learner on the value of a rectangle feature:

    h_j(x) = 1 if f_j(x) > θ_j, 0 otherwise,

where θ_j is a threshold. The final strong classifier is

    h(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, 0 otherwise.
A Weak Classifier. Four kinds of rectangle filters: Value = (sum of pixels in white area) - (sum of pixels in black area). These are called Haar filters (features). Slide credit: S. Lazebnik
Haar Response Using the Integral Image (same image and integral image as above, with corner references T, R, Q, O, P, S marked on the figure):

    V_A = (pixels in white area) - (pixels in black area)
        = (II_O - II_T + II_R - II_S) - (II_P - II_Q + II_T - II_O)
        = (2061 - 329 + 98 - 584) - (3490 - 576 + 329 - 2061) = 64
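The same computation in numpy, with the two rectangle positions inferred from the slide's corner values (an assumption on our part):

```python
# Two-rectangle Haar response V_A = 64 on the slides' 7x6 image,
# computed via the integral image.
import numpy as np

img = np.array([[ 98, 110, 121, 125, 122, 129],
                [ 99, 110, 120, 116, 116, 129],
                [ 97, 109, 124, 111, 123, 134],
                [ 98, 112, 132, 108, 123, 133],
                [ 97, 113, 147, 108, 125, 142],
                [ 95, 111, 168, 122, 130, 137],
                [ 96, 104, 172, 130, 126, 130]])
ii = img.cumsum(axis=0).cumsum(axis=1)

# White rectangle: rows 2-6, cols 2-3; black rectangle: rows 2-6, cols 4-5
# (1-based, as read off the slide's corner references).
white = ii[5, 2] - ii[0, 2] - ii[5, 0] + ii[0, 0]   # 2061 - 329 - 584 + 98
black = ii[5, 4] - ii[0, 4] - ii[5, 2] + ii[0, 2]   # 3490 - 576 - 2061 + 329
print(white - black)                                 # 64, as on the slide
```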
Face Detection at Different Scales. Use filters of different sizes to find faces at the corresponding scales.
The weak classifier behaves this way: each rectangle filter is evaluated on each window, for each labeled example (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 0), (x_5, 0), (x_6, 0), ..., (x_n, y_n), giving feature responses such as 0.8, 0.7, 0.2, 0.3, 0.8, 0.1, ... A weak classifier thresholds the response: h_j(x) = 1 if f_j(x) > θ_j, 0 otherwise. There are T weak classifiers in total.
Viola-Jones detector: features. Considering all possible filter parameters (position, scale with a factor of 1.25, and type), there are 180,000+ possible features over 12 scales, at a base window of 24 x 24 pixels. At learning time, a 24x24 window is labeled a face if it is a positive example and a non-face if it is a negative example. Which subset of these features should we use to determine whether a window contains a face?
The Viola-Jones detector uses a simple boosting method: the AdaBoost procedure (Freund and Schapire, 1995). Learning: take negative (more numerous) and positive image examples, n images in total. For t = 1, ..., T, find the weak classifier h_t with the minimum weighted training error. At each iteration, the weights of incorrectly classified examples are increased and those of correctly classified examples decreased, so step t+1 concentrates on the images that step t classified wrongly. The error decreases at almost every step. The final strong classifier combines the T weak classifiers with weights that grow as the corresponding training errors shrink. Testing: apply the strong classifier to new images.
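A minimal AdaBoost sketch with threshold "stumps" on a one-dimensional feature, mirroring the slides' weak classifier and strong classifier; the toy data, candidate thresholds, and T = 3 are illustrative assumptions, not from the slides:

```python
# AdaBoost with threshold stumps h(x) = 1 if f(x) > theta, 0 otherwise.
import numpy as np

f = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.8])   # feature responses
y = np.array([0,   0,   0,   1,   1,   1  ])   # 1 = face, 0 = non-face
w = np.ones(len(y)) / len(y)                    # uniform initial weights

alphas, thetas = [], []
for t in range(3):                              # T = 3 boosting rounds
    # Pick the threshold with minimum weighted training error.
    candidates = np.unique(f) - 0.05
    errs = [np.sum(w * ((f > th).astype(int) != y)) for th in candidates]
    best = int(np.argmin(errs))
    theta, err = candidates[best], max(errs[best], 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)       # bigger weight for lower error
    # INCREASE weights of misclassified examples, decrease the rest.
    h = (f > theta).astype(int)
    w *= np.exp(alpha * np.where(h != y, 1.0, -1.0))
    w /= w.sum()
    alphas.append(alpha); thetas.append(theta)

def strong(x):
    """Strong classifier: weighted vote against half the total weight."""
    s = sum(a * (x > th) for a, th in zip(alphas, thetas))
    return int(s >= 0.5 * sum(alphas))

print([strong(v) for v in f])                   # [0, 0, 0, 1, 1, 1]
```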
Boosting for face detection. A 200-feature classifier can yield a 95% detection rate with a false positive rate of 1 in 14084. Not good enough! We want 1 in 1,000,000 (≈11 classifiers, ~10 operations). Receiver operating characteristic (ROC) curve shown.
Boosting: pros and cons Advantages Integrates classification with feature selection. Complexity of training is linear in the number of training examples. Flexibility in the choice of weak learners and boosting schemes. Easy to implement. Disadvantages Needs many training (pos./neg.) examples. Often found to work less well than alternative discriminative classifier, like support vector machine (SVM), especially for many class problems. Slide credit: S. Lazebnik
Cascading classifiers for detection. Form a cascade with low false negative rates early on: apply less accurate but faster classifiers first to discard windows that clearly appear to be negative. Slide credit: Kristen Grauman
Attentional cascade. We start with simple classifiers that reject many of the negative windows while detecting almost all positive windows. A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on; the classifiers have progressively lower false positive rates. The detection and false positive rates of the cascade are found by multiplying the individual stage rates. Example: a detection rate of 0.9 with a false positive rate of about 10^-6 can be achieved by a 10-stage cascade in which each stage has a detection rate of 0.99 (0.99^10 ≈ 0.9) and a false positive rate of 0.3 (0.3^10 ≈ 6 x 10^-6).
Viola-Jones detector: summary. Train a cascade of classifiers with AdaBoost on faces and non-faces; the result is the selected features, thresholds, and weights. Apply the cascade to new images (e.g., 384x288 frames) to detect new faces. [Implementation available in OpenCV: http://www.intel.com/technology/computing/opencv/] Slide credit: Kristen Grauman
The implemented system. Training data: 4916 faces, all frontal and rescaled to 24x24 pixels per face, plus 350 million non-face windows from 9500 non-face images. Faces are normalized for scale and translation, with many variations across individuals, illumination, and pose. The real-time detector uses a 38-layer cascade with 6060 features in total; training took about a week (circa 2002, on a 466 MHz machine). (Most slides from Paul Viola)
The two curves correspond to different numbers of windows examined: 75 million vs. 18 million. In each layer at most 6000 non-faces were collected. The first layer uses 2 features and rejects 50% of non-faces while accepting close to 100% of faces; the second layer uses 10 features, rejecting 80% of non-faces at ~100% of faces; the third and fourth layers use 25 features, and so on. On the test set, an average of 10 features is evaluated per window.
Output of VJ Face Detector: Test Images
Related tasks: facial feature localization, profile detection, male vs. female classification.
Face recognition is far from perfect. If a face is rotated, say, 30 degrees off frontal, performance decreases considerably. There are many face recognition systems by now, e.g., face recognition at secure entrances; they are much faster and hold many more faces in the database. But they are not perfect and, say, the first 20 frontal face images are examined for a query.