Boosting: Algorithms and Applications. Lecture 11, ENGN 4522/6520, Statistical Pattern Recognition and Its Applications in Computer Vision, ANU, 2nd Semester, 2008. Chunhua Shen, NICTA/RSISE
Boosting Definition of Boosting: Boosting refers to the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules. Boosting procedure: Given a set of labeled training examples, on each round the booster devises a distribution (importance weighting) over the example set, then requests a weak hypothesis/classifier/learner with low error under that distribution. Upon convergence, the booster combines the weak hypotheses into a single prediction rule.
Boosting (Freund & Schapire, 1997)
Boosting: 1st iteration
Boosting: Update Distribution
Boosting as Entropy Projection Minimizing relative entropy to last distribution s.t. linear constraints
Boosting: 2nd Hypothesis
Boosting: 3rd Hypothesis
Boosting: 4th Hypothesis
All hypotheses
AdaBoost
Properties of AdaBoost AdaBoost adapts to the errors of the weak hypotheses returned by the weak learner. Unlike conventional boosting algorithms, a prior bound on the weak learner's error need not be known ahead of time. The update rule reduces the probability assigned to those examples on which the hypothesis makes good predictions and increases the probability of the examples on which the prediction is poor.
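The rounds described above can be sketched as a minimal discrete AdaBoost in NumPy. This is an illustrative sketch, not the paper's implementation; the function names and the small epsilon guard are assumptions.

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Minimal discrete AdaBoost sketch (after Freund & Schapire, 1997).
    y in {-1, +1}; weak_learners: callables mapping X to predictions in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)              # initial uniform distribution over examples
    alphas, chosen = [], []
    for _ in range(T):
        # pick the weak hypothesis with the lowest weighted error under D
        errs = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errs))
        eps = errs[best]
        if eps >= 0.5:                   # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        pred = weak_learners[best](X)
        # decrease weight on correctly classified examples, increase on mistakes
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        alphas.append(alpha)
        chosen.append(best)
    def strong(Xq):                      # final weighted-vote prediction rule
        return np.sign(sum(a * weak_learners[i](Xq) for a, i in zip(alphas, chosen)))
    return strong
```

On a toy 1-D problem with threshold stumps as weak learners, the returned strong classifier reproduces the labels after a few rounds.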
Multi-class Extensions The previous discussion is restricted to binary classification problems. The training data could have any number of labels, giving a multi-class problem. The multi-class case (AdaBoost.M1) requires the accuracy of every weak hypothesis to be greater than 1/2. This condition is stronger in the multi-class case than in the binary classification case.
Detecting Pedestrians Using Patterns of Motion and Appearance. Paul Viola, Michael J. Jones, Daniel Snow. IEEE ICCV
The System A pedestrian detection system using image intensity information and motion information, with the detectors trained by AdaBoost. This was the first approach to combine both appearance and motion information in a single detector. Advantages: high efficiency; high detection rate and low false positive rate.
Rectangle Filters Rectangle filters measure the difference between region averages at various scales, orientations and aspect ratios. However, each filter on its own carries limited information and must be combined by boosting to perform accurate classification.
Motion Information Information about the direction of motion can be extracted from the difference between shifted versions of the second image in time and the first image. Motion filters (direction, shear, magnitude) operate on 5 images:
Δ = abs(I_t − I_{t+1})
U = abs(I_t − I_{t+1}↑)
L = abs(I_t − I_{t+1}←)
R = abs(I_t − I_{t+1}→)
D = abs(I_t − I_{t+1}↓)
where ↑, ←, →, ↓ denote the second image shifted up, left, right or down by one pixel.
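The five motion images can be sketched in NumPy as follows. This is an illustrative sketch: the paper does not specify border handling, so `np.roll` (wrap-around) is used here purely for simplicity.

```python
import numpy as np

def motion_images(I_t, I_t1):
    """Compute the five motion images: the absolute frame difference Delta and
    the differences between I_t and I_{t+1} shifted one pixel up/left/right/down."""
    shift = lambda img, dy, dx: np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    delta = np.abs(I_t - I_t1)
    U = np.abs(I_t - shift(I_t1, -1, 0))   # I_{t+1} shifted up
    D = np.abs(I_t - shift(I_t1, 1, 0))    # shifted down
    L = np.abs(I_t - shift(I_t1, 0, -1))   # shifted left
    R = np.abs(I_t - shift(I_t1, 0, 1))    # shifted right
    return delta, U, L, R, D
```

For content moving right between the two frames, shifting the second frame back to the left realigns it with the first, so L is small while R is large; comparing the shifted differences thus reveals the motion direction.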
An Example
Appearance Filters Appearance filters are rectangular filters that operate on the first input image: f_m = φ_m(I_t)
Integral Image The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:
ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)
where ii(x, y) is the integral image and i(x, y) is the original image. It can be computed in a single pass using
s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)
where s(x, y) is the cumulative row sum, with s(x, −1) = 0 and ii(−1, y) = 0.
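A minimal sketch of the integral image and a constant-time rectangle sum (function names are illustrative; NumPy's cumulative sum stands in for the row-by-row recurrence):

```python
import numpy as np

def integral_image(i):
    """ii(x, y) = sum of i over all pixels with x' <= x and y' <= y (inclusive)."""
    return i.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the rectangle with corners (x0, y0)..(x1, y1) inclusive,
    using at most four array references."""
    s = ii[y1, x1]
    if x0 > 0:
        s -= ii[y1, x0 - 1]
    if y0 > 0:
        s -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s
```

Once the integral image is built, any rectangle filter response costs a fixed number of lookups regardless of the rectangle's size, which is what makes exhaustive filter evaluation feasible.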
Training Filters The rectangle filters can have any size, aspect ratio or position as long as they fit in the detection window; therefore, there are quite a number of possible motion and appearance filters, from which a learning algorithm selects to build classifiers.
Training Process The training process uses AdaBoost to select a subset of features (F) that minimizes the weighted error, and so constructs the classifier. In each round, the learning algorithm chooses one filter from the pool of motion and appearance filters; it also picks the optimal threshold (t) for each feature as well as its linear weight. The output of AdaBoost is a linear combination of the selected features.
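The per-round selection step can be sketched as an exhaustive decision-stump search. This is a simplified illustration (the exact stump parameterization and threshold scan used in the paper are not specified here); `F` holds one row of filter responses per candidate filter.

```python
import numpy as np

def best_stump(F, y, D):
    """Scan every candidate filter (row of F), every threshold and both
    polarities; return the stump with the lowest weighted error under D.
    y in {-1, +1}; D is the current AdaBoost distribution."""
    best = (None, None, None, np.inf)    # (filter index, threshold, polarity, error)
    for j, f in enumerate(F):
        for t in np.unique(f):           # candidate thresholds from observed responses
            for p in (+1, -1):
                pred = np.where(p * f < p * t, -1, 1)
                err = np.sum(D * (pred != y))
                if err < best[3]:
                    best = (j, t, p, err)
    return best
```

The search naturally prefers the discriminative filter: a filter whose responses separate the classes yields zero weighted error and is picked over uninformative ones.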
Training Process A cascade architecture is used to raise the efficiency of the system. The true and false positives passed at the current stage are used to train the next stage of the cascade. The goal is to reduce the false positive rate much faster than the detection rate as windows pass through the stages.
Overview of the Cascaded Structure Each stage is a strong classifier: a weighted vote over weak classifiers compared against a threshold. Example with four weak classifiers of weights 0.9, 0.7, 0.5, 0.3 and threshold 1.0:
Classifier 1: weak classifiers 1, 2 and 4 fire → 0.9 + 0.7 + 0.3 = 1.9 > 1.0 (threshold) → the window passes this stage.
Classifier 2: only weak classifiers 3 and 4 fire → 0.5 + 0.3 = 0.8 < 1.0 (threshold) → the window is rejected.
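The cascade's early-rejection logic can be sketched in a few lines (a toy sketch: stage contents and the representation of weak classifiers as boolean callables are assumptions for illustration):

```python
def cascade_classify(x, stages):
    """Each stage is (weak_classifiers, weights, threshold). A window is
    rejected as soon as one stage's weighted vote falls below its threshold,
    so most negative windows exit after only a few cheap tests."""
    for weaks, weights, thresh in stages:
        score = sum(w for h, w in zip(weaks, weights) if h(x))
        if score < thresh:
            return False        # rejected at this stage
    return True                 # survived every stage: detection
```

With the slide's numbers, a stage where classifiers with weights 0.9, 0.7 and 0.3 fire scores 1.9 and passes the 1.0 threshold, while a stage scoring only 0.8 rejects the window immediately.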
Experiments Each classifier in the cascade is trained using the original positive examples and the same number of false positives from the previous stage (or negative examples for the first stage). The classifier from the previous stage thus supplies the input to the current stage, which builds a new classifier with a lower false positive rate. The detection threshold is set using a validation set of image pairs.
Training samples A small sample of positive training examples: a pair of image patterns comprises a single training example.
Training the cascade A large number of motion and appearance filters are used for training the dynamic pedestrian detector; a smaller number of appearance filters are used for training the static pedestrian detector.
Training results The first five filters learned for the dynamic pedestrian detector. The six images used in the motion and appearance representation are shown for each filter The first five filters learned for the static pedestrian detector
Testing Detection results of the dynamic detector
Testing Detection results of the static detector
Pedestrian Detection Using Boosting and Covariance Features Sakrapee Paisitkriangkrai, Chunhua Shen, and Jian Zhang, IEEE T-CSVT
Covariance Features The image is divided into small overlapping regions. Each pixel in a region is converted to an eight-dimensional feature vector
F(x, y) = [ x, y, |I_x|, |I_y|, √(I_x² + I_y²), |I_xx|, |I_yy|, arctan(|I_x| / |I_y|) ]
where I_x, I_y (I_xx, I_yy) are the first (second) order intensity derivatives. The covariance matrix of the region is calculated from
cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = 1/(n − 1) Σ_{k=1}^{n} (x_k − μ_X)(y_k − μ_Y)
To improve the calculation time, a technique employing integral images is applied: in other words, we compute integral images of the pairwise products x_k y_k of the feature dimensions.
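A direct (non-integral-image) sketch of a region covariance descriptor, using a reduced per-pixel feature set for brevity; the function name, feature subset and gradient operator are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def covariance_descriptor(img, y0, y1, x0, x1):
    """Covariance of per-pixel feature vectors over the region
    [y0:y1, x0:x1), with features [x, y, |Ix|, |Iy|, gradient magnitude].
    Returns the upper triangle of the covariance matrix as a vector."""
    Iy, Ix = np.gradient(img.astype(float))          # first-order derivatives
    ys, xs = np.mgrid[y0:y1, x0:x1]                  # pixel coordinates
    feats = np.stack([
        xs.ravel().astype(float), ys.ravel().astype(float),
        np.abs(Ix[y0:y1, x0:x1]).ravel(),
        np.abs(Iy[y0:y1, x0:x1]).ravel(),
        np.hypot(Ix[y0:y1, x0:x1], Iy[y0:y1, x0:x1]).ravel(),
    ])
    C = np.cov(feats)                                # unbiased: divides by n - 1
    iu = np.triu_indices_from(C)
    return C[iu]                                     # stack upper triangle into a vector
```

For a d-dimensional feature set the descriptor has d(d + 1)/2 entries; with the full 8-dimensional vector of the slide this gives 36 values per region.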
Experimental Results [ROC plot, feature comparison: detection rate vs. false positive rate for COV and HoG features, each with an RBF SVM (g = 0.01), a quadratic SVM, and a linear SVM.]
Remarks Although covariance features with a non-linear SVM outperform many state-of-the-art techniques, they have the following disadvantages: the block size used in the SVM is fixed (7x7 pixels), so it cannot capture human body parts with other rectangular shapes, e.g. limbs, torso, etc.; the parameter tuning process of the SVM is rather tedious; and the computation time of a non-linear SVM is high. We therefore build a new, simpler pedestrian detector using covariance features, AdaBoost with weighted Fisher linear discriminant analysis (WLDA) based weak classifiers, and a cascaded structure.
Linear Discriminant Analysis (LDA) Motivation: project data onto a line (Rⁿ → R¹) such that the patterns become well separated (in a least-squares sense). Two-dimensional example: the projection achieving the best separation between the two classes.
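The Rⁿ → R¹ projection can be sketched via Fisher's classical closed form (an unweighted sketch; the weighted variant used later in the lecture additionally incorporates the boosting sample weights, and the small ridge term is an assumption for numerical stability):

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Fisher's linear discriminant: w = Sw^{-1} (mu_pos - mu_neg), the 1-D
    projection direction maximizing between-class over within-class scatter."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(len(mu_p)), mu_p - mu_n)
    return w / np.linalg.norm(w)
```

Thresholding the scalar projection X @ w then yields a one-dimensional classifier, which is exactly the role the weak learner plays inside the boosting loop.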
Covariance features with LDA Combine covariance features with LDA and compare them against Haar-like features. The upper triangle of the covariance matrix is used:
[ var[X]  cov[X, Y]  …  cov[X, I_yy] ]
[         var[Y]     …  cov[Y, I_yy] ]
[                    …  …            ]
[                       var[I_yy]    ]
Observations: it is possible to achieve a 5% test error rate using either 25 covariance features or 100 Haar-like features.
Components Combine multi-dimensional covariance features with weighted LDA. Each strong classifier is a weighted vote of weak classifiers (e.g. weights 0.9, 0.7, 0.5, 0.3: if weak classifiers 1, 2 and 4 fire, 0.9 + 0.7 + 0.3 = 1.9 > 1.0, the threshold). Train the new features in the AdaBoost framework for faster speed and higher accuracy. Apply multiple-layer boosting with heterogeneous features in a cascaded structure.
Architecture Architecture of the pedestrian detection system using boosted covariance features:
1. Take the training dataset and a complete set of rectangular filters (weak classifiers).
2. Calculate the region covariance matrix and stack the upper triangle of the matrix into a vector (Rⁿ).
3. Apply the weighted Fisher linear discriminant (Rⁿ → R¹).
4. AdaBoost selects the best weak learner with respect to the weighted error.
5. Update the sample weights.
6. Test the predefined objective (hit rate 99.5%, false positive rate 50% per stage); if not met, return to step 4, otherwise output the strong classifier.
Observations Covariance features: the combined covariance features each represent a distinct part of the human body. The 1st covariance feature represents the human legs (two parallel vertical bars); the 2nd covariance feature captures information about the head and the torso. Compare with Haar features: the 1st Haar feature represents the human head/shoulder contour; the 2nd Haar feature represents the human left leg.
Experimental Results The proposed boosted covariance detector achieves about ten times faster detection speed than the conventional covariance detector (Tuzel et al. 2007). On a 360 × 288 pixel image, our system can process around 4 frames per second. This is the first real-time covariance feature based pedestrian detector.
Experimental Results
Face Detection Applications Summary of the Viola & Jones face detector: use the integral image for efficient feature extraction; use AdaBoost for feature selection; apply a cascaded classifier for efficient elimination of non-faces. Pros: a fast and robust face detector; the system runs in real time. Cons: the training stage is time consuming (1–2 weeks, depending on the number of training samples and the number of features used); it requires a large number of face training samples. Discussion: the performance of face detection depends crucially on the features used to represent the objects. Good features not only give better generalization ability but also require a smaller training database.
Face Detection Applications Proposed work: similar to the previous experiment, we apply covariance features to face detection. The differences between our work and the Viola & Jones framework: we use covariance features, and we adopt the weighted FDA as weak classifiers. To show the better classification capability, we trained a boosted classifier on the banana dataset with multi-dimensional decision stumps and with FDA as weak classifiers. [Plots: train and test error vs. number of weak classifiers, for the multi-dimensional stump (up to 1000 weak classifiers) and for Fisher discriminant analysis (up to 200 weak classifiers).]
Observations / Experimental results ROC curves show that covariance features significantly outperform Haar-like wavelet features when the training database size is small. As the number of samples grows, the performance difference between the two techniques decreases. [ROC curves for our algorithm on the MIT + CMU test set: correct detection rate vs. number of false positives, for COV features and Haar features trained with 250 faces and with 500 faces.]
Experimental Results Some detection results of our face detectors trained using 250 frontal faces on MIT + CMU test images
Summary Boosting AdaBoost AdaBoost for pedestrian detection using Haar features and dynamic temporal information AdaBoost for pedestrian detection using new covariance features Face detection using new covariance features
Questions?