2D Image Processing Face Detection and Recognition

Prof. Didier Stricker, Kaiserslautern University, http://ags.cs.uni-kl.de/
DFKI, Deutsches Forschungszentrum für Künstliche Intelligenz, http://av.dfki.de

Long range and high-speed eye tracking
Video: IR long-range, high-speed eye tracking
Video: eye tracking from a mobile phone camera

Sliding window detection

Consumer application: Apple iphoto http://www.apple.com/ilife/iphoto/

Face Detection

Sliding Windows
1. Hypothesize: try all possible rectangle locations and sizes.
2. Test: classify whether the rectangle contains a face (and only the face).
Note: there are thousands more false windows than true ones. For computational efficiency, we should spend as little time as possible on the negative windows.
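The "hypothesize" step can be sketched as a generator over window positions and sizes. This is only an illustration: the 24-pixel base size and the 384x288 image come from later slides, while the scale step of 1.25, the stride fraction, and the function name `sliding_windows` are assumptions made for this example.

```python
def sliding_windows(img_w, img_h, base=24, scale_step=1.25, stride_frac=0.1):
    """Yield (x, y, size) for square windows at every location and scale.

    scale_step and stride_frac are illustrative choices, not values from the slides.
    """
    size = base
    while size <= min(img_w, img_h):
        # Move larger windows in larger steps so the work per scale stays bounded.
        stride = max(1, int(size * stride_frac))
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, size)
        size = int(round(size * scale_step))


windows = list(sliding_windows(384, 288))
print(len(windows), "candidate windows")
```

Even for this small image the list holds tens of thousands of candidates, nearly all of them non-faces, which is why the cost per negative window dominates.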

Classification (Discriminative)
Figure: background and face examples separated in some feature space.

The Viola/Jones Face Detector
A seminal approach to real-time object detection. Training is slow, but detection is very fast.
Key ideas: integral images for fast feature evaluation; boosting for feature selection; an attentional cascade for fast rejection of non-face windows.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Image Features
Rectangle filters: $\text{Value} = \sum (\text{pixels in white area}) - \sum (\text{pixels in black area})$

Fast computation with integral images
The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive. This can quickly be computed in one pass through the image.

Computing the integral image

Computing the integral image
Cumulative row sum: $s(x, y) = s(x-1, y) + i(x, y)$
Integral image: $ii(x, y) = ii(x, y-1) + s(x, y)$
MATLAB: ii = cumsum(cumsum(double(i)), 2);
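The two recurrences above can be written out as a one-pass loop. The sketch below uses plain Python lists and a hypothetical function name `integral_image`; it indexes the result as `ii[y][x]` (row first), which is a choice made for this example.

```python
def integral_image(img):
    """Return ii where ii[y][x] is the sum of img over rows <= y and columns <= x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # s(x, y): cumulative sum along the current row
        for x in range(w):
            row_sum += img[y][x]                   # s(x, y) = s(x-1, y) + i(x, y)
            above = ii[y - 1][x] if y > 0 else 0   # ii(x, y-1), zero above the first row
            ii[y][x] = above + row_sum             # ii(x, y) = ii(x, y-1) + s(x, y)
    return ii


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(ii)  # [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
```

The bottom-right entry equals the sum of all pixels, which is a quick sanity check on any implementation.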

Computing sum within a rectangle
Let A, B, C, D be the values of the integral image at the corners of a rectangle (A and D diagonally opposite, B and C diagonally opposite). What is the sum of pixel values within the rectangle?
$\text{sum} = A - B - C + D$
Only 3 additions are required for any size of rectangle!
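A small sketch of the four-corner lookup follows. It assumes A is the bottom-right corner and D the top-left (the slide's figure is not available, so this assignment is an assumption), and it pads the integral image with a zero row and column so that no edge cases are needed. The helper `two_rect_feature`, with white on the left and black on the right, is likewise an illustrative choice.

```python
def padded_integral_image(img):
    """ii[y][x] = sum of img rows < y and columns < x (extra zero row/column)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    A = ii[y + h][x + w]  # bottom-right corner (assumed labeling)
    B = ii[y][x + w]      # top-right corner
    C = ii[y + h][x]      # bottom-left corner
    D = ii[y][x]          # top-left corner
    return A - B - C + D


def two_rect_feature(ii, x, y, w, h):
    """White (left half) minus black (right half); w must be even."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = padded_integral_image(img)
total = rect_sum(ii, 1, 1, 2, 2)            # 5 + 6 + 8 + 9
feature = two_rect_feature(ii, 0, 0, 2, 3)  # (1 + 4 + 7) - (2 + 5 + 8)
print(total, feature)  # 28 -3
```

The cost of `rect_sum` does not depend on `w` or `h`, which is what makes scaling the features cheaper than scaling the image.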

Scaling
The integral image enables us to evaluate all rectangle sizes in constant time. Therefore, no image scaling is necessary. Scale the rectangular features instead!

Feature selection
For a 24x24 detection region, the number of possible rectangle features is ~160,000! At test time, it is impractical to evaluate the entire feature set. Can we create a good classifier using just a small subset of all possible features?
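The ~160,000 figure can be checked by brute-force counting. The sketch below assumes five feature shapes (two-rectangle horizontal and vertical, three-rectangle horizontal and vertical, and a four-rectangle 2x2 pattern), each scaled by integer multiples of its base size; the slide does not list the shapes, so this set is an assumption.

```python
def count_features(window=24, shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count placements of each base shape at every integer scale inside the window."""
    total = 0
    for base_w, base_h in shapes:
        for w in range(base_w, window + 1, base_w):      # widths: multiples of base_w
            for h in range(base_h, window + 1, base_h):  # heights: multiples of base_h
                total += (window - w + 1) * (window - h + 1)  # top-left positions
    return total


n_features = count_features()
print(n_features)  # 162336
```

Under these assumptions the count is 162,336, far more than the 576 pixels in the window, so the feature set is heavily overcomplete.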

AdaBoost
Adaptive Boosting: a learning algorithm that builds a strong classifier from a lot of weaker ones.

Boosting
Boosting is a classification scheme that works by combining weak learners into a more accurate ensemble classifier. A weak learner need only do better than chance.
Training consists of multiple boosting rounds. During each boosting round, we select a weak learner that does well on examples that were hard for the previous weak learners. "Hardness" is captured by weights attached to training examples.
Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September, 1999.

AdaBoost Concept
Weak classifiers, slightly better than random: $h_1(x) \in \{-1, +1\}$, $h_2(x) \in \{-1, +1\}$, ..., $h_T(x) \in \{-1, +1\}$
Strong classifier: $H_T(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Weaker Classifiers
Weak classifiers, slightly better than random: $h_1(x) \in \{-1, +1\}$, $h_2(x) \in \{-1, +1\}$, ..., $h_T(x) \in \{-1, +1\}$
Strong classifier: $H_T(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
Each weak classifier learns by considering one simple feature. The T most beneficial features for classification should be selected.
How to: define features? select beneficial features? train weak classifiers? manage (weight) training samples? associate a weight to each weak classifier?

The Strong Classifiers
Weak classifiers, slightly better than random: $h_1(x) \in \{-1, +1\}$, $h_2(x) \in \{-1, +1\}$, ..., $h_T(x) \in \{-1, +1\}$
Strong classifier: $H_T(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
How good will the strong one be?

Boosting
It is a sequential procedure over rounds t = 1, 2, ...
Each data point $x_t$ has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$.

Toy example
Weak learners from the family of lines.
Each data point has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$.
A line h with p(error) = 0.5 is at chance.

Toy example
Each data point has a class label $y_t \in \{+1, -1\}$ and a weight $w_t = 1$.
This one seems to be the best. This is a "weak classifier": it performs slightly better than chance.

Toy example
Each data point has a class label $y_t \in \{+1, -1\}$.
We update the weights: $w_t \leftarrow w_t \exp\{-y_t H_t\}$
We set a new problem for which the previous weak classifier performs at chance again.

Toy example
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers $f_1, f_2, f_3, f_4$.

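AdaBoost Algorithm
Given: m examples $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
Initialize $w_1(i) = 1/m$. The normalized weights $w_t$ define the distribution $D_t$ over the examples.
For t = 1 to T:
1. Train learner $h_t$ with minimum error $\varepsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$. The goodness of $h_t$ is calculated over $D_t$ and the bad guesses.
2. Compute the hypothesis weight $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$. The weight adapts: the bigger $\varepsilon_t$ becomes, the smaller $\alpha_t$ becomes.
3. For each example i = 1 to m: $w_{t+1}(i) = \frac{w_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases}$. Boost the example if incorrectly predicted. $Z_t$ is a normalization factor.
Output: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$, a linear combination of models.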
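The three steps of the loop can be followed in a compact sketch. It assumes 1-D inputs and threshold "stumps" as the weak learners, which is a simplification chosen for this example rather than the rectangle features used for faces; the names `train_stump`, `adaboost`, and `predict` are hypothetical.

```python
import math


def train_stump(xs, ys, w):
    """Step 1: return (error, threshold, polarity) of the best weighted stump."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (+1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, T):
    m = len(xs)
    w = [1.0 / m] * m                            # w_1(i) = 1/m
    model = []
    for _ in range(T):
        err, thr, pol = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # step 2: hypothesis weight
        preds = [pol if x >= thr else -pol for x in xs]
        # Step 3: exp(-alpha) when correct (y*p = +1), exp(+alpha) when wrong.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        z = sum(w)                               # Z_t, the normalization factor
        w = [wi / z for wi in w]
        model.append((alpha, thr, pol))
    return model


def predict(model, x):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in model)
    return 1 if score >= 0 else -1


# Toy data: positives on both sides of a negative interval, so no single stump fits.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
model = adaboost(xs, ys, T=3)
print([predict(model, x) for x in xs] == ys)
```

On this toy set one stump misclassifies at least three points, while the weighted vote of three stumps separates the data, mirroring the slide's point that a non-linear decision can come from combining simple classifiers.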

Weak Learners for Face Detection
$h_t(x) = 1$ if $p_t f_t(x) > p_t \theta_t$, and $0$ otherwise, where $x$ is the window, $f_t(x)$ is the value of the rectangle feature, $p_t$ is the parity (indicating the direction of the inequality sign), and $\theta_t$ is the threshold.

Boosting
The training set contains face and non-face examples, initially with equal weight.
For each round of boosting: evaluate each rectangle filter on each example; select the best threshold for each filter; select the best filter/threshold combination; reweight the examples.
Computational complexity of learning: O(MNK) for M rounds, N examples, K features.
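One round's "select best threshold for each filter" and "select best filter/threshold combination" steps might look like the sketch below. It uses the weak-learner form h(x) = 1 if p f(x) > p θ from the slides; the candidate thresholds (midpoints between sorted feature values) and the function names are assumptions, and the quadratic scan is kept for clarity where a single sorted pass would be faster.

```python
def best_threshold(values, labels, weights):
    """Return (error, theta, p) minimizing the weighted error of
    h(x) = 1 if p * f(x) > p * theta else 0, for one filter."""
    distinct = sorted(set(values))
    # Candidates: one threshold below all values, then midpoints between neighbors.
    thetas = [distinct[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    best = None
    for theta in thetas:
        for p in (+1, -1):
            err = 0.0
            for f, y, w in zip(values, labels, weights):
                h = 1 if p * f > p * theta else 0
                if h != y:
                    err += w
            if best is None or err < best[0]:
                best = (err, theta, p)
    return best


def select_best_filter(feature_values, labels, weights):
    """feature_values[j][i] is filter j on example i; return (error, theta, p, j)."""
    results = [best_threshold(vals, labels, weights) + (j,)
               for j, vals in enumerate(feature_values)]
    return min(results, key=lambda r: r[0])


labels = [0, 0, 0, 1, 1, 1]            # 1 = face, 0 = non-face
weights = [1.0 / 6] * 6                # equal initial weights
feature_values = [
    [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],    # hypothetical uninformative filter
    [0.2, 0.5, 1.1, 1.4, 2.0, 2.3],    # hypothetical filter that separates the classes
]
err, theta, p, j = select_best_filter(feature_values, labels, weights)
print(err, theta, p, j)
```

Each round repeats this scan over all K filters and N examples, which is where the O(MNK) training cost comes from.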

Features Selected by Boosting
First two features selected by boosting: this feature combination can yield 100% detection rate and 50% false positive rate.

ROC Curve for 200-Feature Classifier
A 200-feature classifier can yield 95% detection rate and a false positive rate of 1 in 14,084. Not good enough! To be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.

Attentional Cascade
We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows. A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.
Diagram: IMAGE SUB-WINDOW → Classifier 1 → Classifier 2 → Classifier 3 → FACE; T (true) passes the sub-window on, F (false) rejects it as NON-FACE.

Attentional Cascade
Chain classifiers that are progressively more complex and have lower false positive rates.
Figure: ROC curve, % detection (0 to 100) against % false positives (0 to 50).
Diagram: IMAGE SUB-WINDOW → Classifier 1 → Classifier 2 → Classifier 3 → FACE; T (true) passes the sub-window on, F (false) rejects it as NON-FACE.

Cascaded Classifier
A 1-feature classifier achieves 100% detection rate and about 50% false positive rate. A 5-feature classifier achieves 100% detection rate and 40% false positive rate (20% cumulative) using data from the previous stage. A 20-feature classifier achieves 100% detection rate with 10% false positive rate (2% cumulative).
Diagram: IMAGE SUB-WINDOW → 1 feature (50% pass) → 5 features (20% pass) → 20 features (2% pass) → FACE; F at any stage → NON-FACE.
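The early-exit evaluation and the cumulative rates quoted above can be sketched as follows. The stage functions here are stand-in lambdas on a single number, purely for illustration; real stages would be boosted classifiers over rectangle features.

```python
def cascade_classify(window, stages):
    """Return True only if every stage accepts; stop at the first rejection."""
    for stage in stages:
        if not stage(window):
            return False  # immediate rejection, later stages are never evaluated
    return True


def cumulative_rates(stage_rates):
    """Running product of per-stage rates (each measured on windows reaching that stage)."""
    out, total = [], 1.0
    for rate in stage_rates:
        total *= rate
        out.append(total)
    return out


# Per-stage false positive rates from the slide: 50%, 40%, 10%.
fp = cumulative_rates([0.5, 0.4, 0.1])
print([round(r, 4) for r in fp])  # [0.5, 0.2, 0.02]

# Hypothetical stages of increasing strictness on a toy scalar "window".
stages = [lambda v: v > 0, lambda v: v > 5, lambda v: v > 9]
print(cascade_classify(10, stages), cascade_classify(3, stages))  # True False
```

Because most windows fail an early, cheap stage, the average cost per window stays close to the cost of the first stage or two.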

Training the Cascade
Set target detection and false positive rates for each stage.
Keep adding features to the current stage until its target rates have been met. We need to lower the AdaBoost threshold to maximize detection (as opposed to minimizing total classification error). Test on a validation set.
If the overall false positive rate is not low enough, then add another stage.
Use false positives from the current stage as the negative training examples for the next stage.
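To see how per-stage targets relate to the overall rates, a short worked calculation helps. It assumes every stage has the same false positive rate $f$ and detection rate $d$ (real cascades use different rates per stage), takes the 38 stages and the 1-in-1,000,000 target from the slides, and assumes an overall detection target of 0.9 purely for illustration.

```latex
% Overall rates of a K-stage cascade, with f_i and d_i measured on the
% windows that reach stage i:
F = \prod_{i=1}^{K} f_i , \qquad D = \prod_{i=1}^{K} d_i .

% Equal per-stage rates, K = 38, target F = 10^{-6}:
f = F^{1/K} = 10^{-6/38} \approx 0.70 .

% Assumed overall detection target D = 0.9:
d = D^{1/K} = 0.9^{1/38} \approx 0.997 .
```

Under these assumptions each stage only has to reject about 30% of the non-faces it sees, but it must keep nearly every face, which is why the AdaBoost threshold is lowered to favor detection.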

The Implemented System
Training data: 5000 faces (all frontal, rescaled to 24x24 pixels); 300 million non-faces (from 9500 non-face images).
Faces are normalized: scale, translation.
Many variations: across individuals, illumination, pose.

Structure of the Detector Cascade
Combining successively more complex classifiers in cascade: 38 stages, including a total of 6060 features.
Diagram: all sub-windows pass through stages 1 to 38; T (true) continues to the next stage and finally to Face, F (false) rejects the sub-window.

Structure of the Detector Cascade
Successive stages: 2 features (reject 50% non-faces, detect 100% faces); 10 features (reject 80% non-faces, detect 100% faces); 25 features; 50 features; then by algorithm.
Diagram: all sub-windows pass through stages 1 to 38; T (true) continues to the next stage and finally to Face, F (false) rejects the sub-window.

Speed of the Final Detector
On a 700 MHz Pentium III processor, the face detector can process a 384x288 pixel image in about 0.067 seconds (15 Hz). This is 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998). An average of 8 features is evaluated per window on the test set.

Output of Face Detector on Test Images

Other Detection Tasks
Facial feature localization; profile detection; male vs. female.

Viola/Jones face detector
AdaBoost for face detection: rectangle features; integral images for fast computation; boosting for feature selection; attentional cascade for fast rejection of negative windows.

Face detection and recognition
Figure: detection locates the face; recognition names it ("Sally").

Face recognition: overview
Typical scenario: few examples per face; identify or verify a test example.
What's hard: changes in expression, lighting, age, occlusion, viewpoint.
Basic approaches (all nearest neighbor):
1. Project into a new subspace (or kernel space) (e.g., "Eigenfaces" = PCA)
2. Measure face features
3. Make a 3D face model, compare shape + appearance (e.g., AAM)

Typical face recognition scenarios
Verification: a person is claiming a particular identity; verify whether that is true (e.g., security).
Closed-world identification: assign a face to one person from among a known set.
General identification: assign a face to a known person or to "unknown".

What makes face recognition hard? Expression

What makes face recognition hard? Lighting

What makes face recognition hard? Occlusion

What makes face recognition hard? Viewpoint

Simple idea for face recognition
1. Treat the face image as a vector of intensities $x$.
2. Recognize the face by its nearest neighbor in the database $y_1, \ldots, y_n$: $k = \arg\min_k \|y_k - x\|$
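The nearest-neighbor rule is short enough to write directly. The sketch below treats each "image" as a flat list of intensities; the 4-pixel vectors and the names in `labels` are made up for illustration.

```python
def nearest_neighbor(x, database):
    """Return the index k minimizing ||y_k - x|| over the stored vectors y_k."""
    def sq_dist(a, b):
        # The squared distance has the same argmin as the distance itself.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(database)), key=lambda k: sq_dist(database[k], x))


# Hypothetical 4-pixel "face images" and their labels.
database = [[10, 10, 200, 200],
            [200, 200, 10, 10],
            [100, 100, 100, 100]]
labels = ["Sally", "Bob", "Ann"]

query = [20, 5, 190, 210]
k = nearest_neighbor(query, database)
print(labels[k])  # Sally
```

Every query costs one pass over all stored pixels, which is the storage and speed problem that motivates working in a low-dimensional subspace.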

The space of all face images
When viewed as vectors of pixel values, face images are extremely high-dimensional: a 100x100 image has 10,000 dimensions. This is slow and needs lots of storage.
But very few 10,000-dimensional vectors are valid face images. We want to effectively model the subspace of face images.

The space of all face images Eigenface idea: construct a low-dimensional linear subspace that best explains the variation in the set of face images

Principal Component Analysis (PCA)
Given: N data points $x_1, \ldots, x_N$ in $\mathbb{R}^d$.
We want to find a new set of features that are linear combinations of the original ones: $u(x_i) = u^T (x_i - \mu)$ ($\mu$: mean of data points).
Choose the unit vector $u$ in $\mathbb{R}^d$ that captures the most data variance.
Forsyth & Ponce, Sec. 22.3.1, 22.3.2

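Principal Component Analysis
Direction that maximizes the variance of the projected data: maximize $\frac{1}{N} \sum_{i=1}^{N} \left(u^T (x_i - \mu)\right)^2 = u^T \Sigma u$ subject to $\|u\| = 1$, where $u^T (x_i - \mu)$ is the projection of data point $x_i$ and $\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T$ is the covariance matrix of the data.
The direction that maximizes the variance is the eigenvector associated with the largest eigenvalue of $\Sigma$ (can be derived using Rayleigh's quotient or a Lagrange multiplier).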

Principal component analysis
The direction that captures the maximum variance of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix.
Furthermore, the top k orthogonal directions that capture the most variance of the data are the k eigenvectors corresponding to the k largest eigenvalues.
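As a small numerical illustration, the top principal direction can be found with power iteration on the covariance matrix. Power iteration is a substitute chosen here to keep the sketch dependency-free (a library eigen-solver would normally be used), and the 2-D points are made-up data lying close to the line y = x.

```python
import math


def covariance(points):
    """Return (mean, covariance matrix) using the 1/N convention."""
    n, d = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / n
            for b in range(d)] for a in range(d)]
    return mu, cov


def matvec(m, v):
    return [sum(m[a][b] * v[b] for b in range(len(v))) for a in range(len(m))]


def top_eigenvector(cov, iters=200):
    """Power iteration: repeatedly apply cov and renormalize."""
    d = len(cov)
    u = [1.0 / math.sqrt(d)] * d  # start vector; must not be orthogonal to the answer
    for _ in range(iters):
        v = matvec(cov, u)
        norm = math.sqrt(sum(c * c for c in v))
        u = [c / norm for c in v]
    eigenvalue = sum(ui * vi for ui, vi in zip(u, matvec(cov, u)))  # u^T cov u
    return eigenvalue, u


points = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 1.0), (2.0, 1.9)]
mu, cov = covariance(points)
lam, u = top_eigenvector(cov)
print(round(lam, 3), [round(c, 3) for c in u])
```

The recovered direction is close to (0.71, 0.71), and the eigenvalue is the variance of the data projected onto it, nearly the whole trace of the covariance matrix for this data.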

Eigenfaces: Key idea
Assume that most face images lie on a low-dimensional subspace determined by the first k (k < d) directions of maximum variance.
Use PCA to determine the vectors or "eigenfaces" $u_1, \ldots, u_k$ that span that subspace.
Represent all face images in the dataset as linear combinations of eigenfaces.
M. Turk and A. Pentland, Face Recognition using Eigenfaces, CVPR 1991

Eigenfaces example
Training images $x_1, \ldots, x_N$

Eigenfaces example
Top eigenvectors: $u_1, \ldots, u_k$
Mean: $\mu$

Eigenfaces example
Face x in face space coordinates: $x \rightarrow (u_1^T (x - \mu), \ldots, u_k^T (x - \mu)) = (w_1, \ldots, w_k)$

Eigenfaces example
Face x in face space coordinates: $x \rightarrow (u_1^T (x - \mu), \ldots, u_k^T (x - \mu)) = (w_1, \ldots, w_k)$
Reconstruction: $\hat{x} = \mu + w_1 u_1 + w_2 u_2 + w_3 u_3 + w_4 u_4 + \ldots$

Recognition with eigenfaces
Process labeled training images:
1. Find the mean $\mu$ and covariance matrix $\Sigma$.
2. Find the k principal components (eigenvectors of $\Sigma$) $u_1, \ldots, u_k$.
3. Project each training image $x_i$ onto the subspace spanned by the principal components: $(w_{i1}, \ldots, w_{ik}) = (u_1^T (x_i - \mu), \ldots, u_k^T (x_i - \mu))$
Given a novel image $x$:
1. Project onto the subspace: $(w_1, \ldots, w_k) = (u_1^T (x - \mu), \ldots, u_k^T (x - \mu))$
2. Optional: check the reconstruction error $\|x - \hat{x}\|$ to determine whether the image is really a face.
3. Classify as the closest training face in the k-dimensional subspace.
M. Turk and A. Pentland, Face Recognition using Eigenfaces, CVPR 1991
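The test-time steps (project, check reconstruction error, classify) can be sketched as below. The sketch assumes the mean and an orthonormal basis have already been computed by PCA; here they are hand-picked 3-D toy values, and the labels and function names are hypothetical.

```python
def project(x, mu, basis):
    """w_j = u_j^T (x - mu) for each basis vector u_j."""
    centered = [xi - mi for xi, mi in zip(x, mu)]
    return [sum(uj * c for uj, c in zip(u, centered)) for u in basis]


def reconstruct(w, mu, basis):
    """x_hat = mu + sum_j w_j u_j."""
    x_hat = list(mu)
    for wj, u in zip(w, basis):
        x_hat = [xh + wj * uj for xh, uj in zip(x_hat, u)]
    return x_hat


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def classify(x, mu, basis, train_coords, train_labels):
    """Return (label of the nearest training face in the subspace, reconstruction error)."""
    w = project(x, mu, basis)
    k = min(range(len(train_coords)), key=lambda i: sq_dist(train_coords[i], w))
    error = sq_dist(x, reconstruct(w, mu, basis)) ** 0.5  # ||x - x_hat||
    return train_labels[k], error


# Toy setup: 3-D "images", k = 2 assumed principal components.
mu = [0.0, 0.0, 0.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
train_images = [[2.0, 0.0, 0.1], [0.0, 2.0, -0.1]]
train_labels = ["Sally", "Bob"]
train_coords = [project(x, mu, basis) for x in train_images]

label, err = classify([1.8, 0.3, 0.0], mu, basis, train_coords, train_labels)
far_label, far_err = classify([0.1, 1.9, 5.0], mu, basis, train_coords, train_labels)
print(label, err, far_label, far_err)
```

The second query still gets a nearest label, but its large reconstruction error signals that it lies far from the face subspace, which is the purpose of the optional check.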

Limitations Global appearance method: not robust to misalignment, background variation

Limitations
PCA assumes that the data has a Gaussian distribution (mean $\mu$, covariance matrix $\Sigma$). The shape of this dataset is not well described by its principal components.

Limitations The direction of maximum variance is not always good for classification

Face verification using deep networks
Facebook's DeepFace approach: elaborate 2D-3D alignment followed by a deep network achieves near-human accuracy (for cropped faces) on face verification.
Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014.

Face verification using deep networks
Alignment pipeline. (a) The detected face, with 6 initial fiducial points. (b) The induced 2D-aligned crop. (c) 67 fiducial points on the 2D-aligned crop with their corresponding Delaunay triangulation; we added triangles on the contour to avoid discontinuities. (d) The reference 3D shape transformed to the 2D-aligned crop image-plane. (e) Triangle visibility w.r.t. the fitted 3D-2D camera; darker triangles are less visible. (f) The 67 fiducial points induced by the 3D model that are used to direct the piece-wise affine warping. (g) The final frontalized crop. (h) A new view generated by the 3D model (not used in this paper).
Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014.

Face verification using deep networks
Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Colors illustrate feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014.

Face verification using deep networks
This method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 27%, closely approaching human-level performance.
Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014.

Thank you!