CS 231A Section 1: Linear Algebra & Probability Review


1 CS 231A Section 1: Linear Algebra & Probability Review

2 Topics: Support Vector Machines; Boosting (Viola-Jones face detector); Linear Algebra Review (Notation, Operations & Properties, Matrix Calculus); Probability (Axioms, Basic Properties, Bayes' Theorem, Chain Rule)

3 Linear classifiers: find a linear function (hyperplane) to separate the positive and negative examples, i.e. x_i positive: x_i · w + b ≥ 0; x_i negative: x_i · w + b < 0. Which hyperplane (w, b) is best?

4 Support vector machines: find the hyperplane that maximizes the margin between the positive and negative examples. [Figure: support vectors and margin]

5 Support Vector Machines (SVM): we wish to perform binary classification, i.e. find a linear classifier. Given data x_1, ..., x_n and labels y_i ∈ {−1, +1}, when the data is linearly separable we can solve the optimization problem min_{w,b} (1/2)||w||² subject to y_i (x_i · w + b) ≥ 1 to find our linear classifier.

6 Nonlinear SVMs: datasets that are linearly separable work out great, but what if the dataset is just too hard? We can map it to a higher-dimensional space, e.g. x ↦ (x, x²). [Figure: 1-D dataset lifted to 2-D] Slide credit: Andrew Moore

7 Nonlinear SVMs: general idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable, via the "lifting" transformation Φ: x → φ(x). Slide credit: Andrew Moore

8 SVM with ℓ1 regularization: what if the data is not linearly separable? We can use regularization to solve this problem: we solve the new optimization problem min_{w,b,ξ} (1/2)||w||² + C Σ_i ξ_i subject to y_i (x_i · w + b) ≥ 1 − ξ_i and ξ_i ≥ 0, and tune our regularization parameter C.
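As a concrete illustration of this soft-margin objective (a minimal numpy sketch, not the liblinear/LibSVM solvers used in the course; the synthetic data, learning rate, and iteration count are all made up), one can minimize the equivalent unconstrained hinge-loss form (1/2)||w||² + C Σ_i max(0, 1 − y_i(x_i · w + b)) by subgradient descent:

```python
import numpy as np

# Illustrative sketch: soft-margin linear SVM by subgradient descent on
# the primal objective (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)),   # class -1 cluster
               rng.normal(2.0, 1.0, (20, 2))])   # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)

w, b = np.zeros(2), 0.0
C, lr = 1.0, 0.01
for _ in range(2000):
    viol = y * (X @ w + b) < 1                   # points violating the margin
    # subgradient: w minus C times sum of y_i x_i over margin violators
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

train_acc = np.mean(np.sign(X @ w + b) == y)
```

Larger C penalizes slack more heavily (closer to the hard-margin problem); smaller C tolerates more margin violations.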

9 Solving the SVM: there are many different packages for solving SVMs. In PS0 we have you use the liblinear package; it is an efficient implementation, but it can only use a linear kernel. If you wish to have more flexibility with your choice of kernel, you can use the LibSVM package.

10 Topics: Support Vector Machines; Boosting (Viola-Jones face detector); Linear Algebra Review (Notation, Operations & Properties, Matrix Calculus); Probability (Axioms, Basic Properties, Bayes' Theorem, Chain Rule)

11 Boosting. Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999. Each data point x_t has a class label y_t = +1 or −1 and a weight w_t = 1. Boosting is a sequential procedure.

12 Toy example: weak learners from the family of lines. Each data point has a class label y_t = ±1 and a weight w_t = 1. A line h that predicts no better than random has p(error) = 0.5: it is at chance.

13 Toy example: each data point has a class label y_t = ±1 and a weight w_t = 1. This line seems to be the best. This is a "weak classifier": it performs slightly better than chance.

14-17 Toy example: at each round we update the weights, w_t ← w_t exp{−y_t H_t}, so misclassified points gain weight and correctly classified points lose weight, and the next weak classifier is fit to the reweighted data.

18 Toy example: the strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, ..., f_4.

19-20 Boosting defines a classifier using an additive model: H(x) = Σ_t α_t h_t(x), where H is the strong classifier, x is the feature vector, α_t are the weights, and h_t are the weak classifiers. We need to define a family of weak classifiers h_t.

21 Why boosting? A simple algorithm for learning robust classifiers (Freund & Schapire, 1995; Friedman, Hastie, Tibshirani, 1998). Provides an efficient algorithm for sparse visual feature selection (Tieu & Viola, 2000; Viola & Jones, 2003). Easy to implement, doesn't require external optimization tools.

22 Boosting mathematics. Weak learners: h_j(x) = 1 if f_j(x) < θ_j, 0 otherwise, where f_j(x) is the value of a rectangle feature and θ_j is a threshold. Final strong classifier: h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ (1/2) Σ_{t=1}^T α_t, 0 otherwise.
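The boosting loop from the toy example can be sketched in a few lines (an illustrative numpy toy with 1-D decision stumps and made-up data, using the standard AdaBoost ±1 formulation with update w_i ← w_i exp{−α_t y_i h_t(x_i)}, rather than the exact Viola-Jones variant shown later):

```python
import numpy as np

# Toy AdaBoost with threshold "stump" weak learners on 1-D data.
rng = np.random.default_rng(1)
X = np.concatenate([rng.uniform(0.0, 1.0, 30),   # class -1 in [0, 1)
                    rng.uniform(2.0, 3.0, 30)])  # class +1 in [2, 3)
y = np.array([-1] * 30 + [1] * 30)
w = np.full(len(X), 1.0 / len(X))                # initial weights w_i = 1/n

def stump(th, s):
    return s * np.where(X > th, 1.0, -1.0)       # weak classifier h(x)

stumps, alphas = [], []
for _ in range(5):
    # pick the threshold/sign combination with the lowest weighted error
    th, s = min(((th, s) for th in X for s in (1, -1)),
                key=lambda p: w[stump(*p) != y].sum())
    pred = stump(th, s)
    err = max(w[pred != y].sum(), 1e-10)
    alpha = 0.5 * np.log((1.0 - err) / err)      # weight of this weak learner
    w = w * np.exp(-alpha * y * pred)            # up-weight the mistakes
    w = w / w.sum()
    stumps.append((th, s))
    alphas.append(alpha)

# strong classifier: sign of the weighted sum of weak classifiers
H = np.sign(sum(a * stump(th, s) for a, (th, s) in zip(alphas, stumps)))
acc = np.mean(H == y)
```

Because the two clusters are disjoint, a single stump already separates them; on harder data the reweighting forces later stumps to focus on earlier mistakes.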

23 Weak classifier: 4 kinds of rectangle filters. Value = Σ(pixels in white area) − Σ(pixels in black area). Slide credit: S. Lazebnik
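Rectangle features are cheap to evaluate because any rectangle sum takes only four lookups in an integral image (summed-area table). A small sketch on a made-up 24x24 patch (the region coordinates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.integers(0, 256, (24, 24)).astype(np.int64)  # toy image patch

# Integral image with a zero border: ii[r, c] = sum of img[:r, :c].
ii = np.zeros((25, 25), dtype=np.int64)
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] using four integral-image lookups.
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# Two-rectangle (edge) feature: white region minus black region.
feature = rect_sum(4, 4, 12, 12) - rect_sum(12, 4, 20, 12)
```

The integral image is built once per window, after which every rectangle filter costs constant time regardless of its size.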

24 Weak classifier. [Figure: source image and filter result] Slide credit: S. Lazebnik

25 Viola & Jones algorithm. 1. Evaluate each rectangle filter on each example (x_1, y_1), ..., (x_n, y_n), with labels y_i ∈ {0, 1}. Weak classifier: h_j(x) = 1 if f_j(x) < θ_j, 0 otherwise, where θ_j is a threshold. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

26 Viola & Jones algorithm: for a 24x24 detection region, the set of possible rectangle features is very large (over 160,000). P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

27 Viola & Jones algorithm. 2. Select the best filter/threshold combination: (a) normalize the weights, w_{t,i} ← w_{t,i} / Σ_{j=1}^n w_{t,j}; (b) for each feature j, compute the weighted error ε_j = Σ_i w_i |h_j(x_i) − y_i|; (c) choose the classifier h_t with the lowest error ε_t. 3. Reweight examples: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if example i is classified correctly and e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t). P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

28 Viola & Jones algorithm. 4. The final strong classifier is h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ (1/2) Σ_{t=1}^T α_t, 0 otherwise, where α_t = log(1/β_t). The final hypothesis is a weighted linear combination of the T hypotheses, where the weights are inversely proportional to the training errors. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

29 Boosting for face detection. For each round of boosting: 1. evaluate each rectangle filter on each example; 2. select the best filter/threshold combination; 3. reweight the examples.

30 The implemented system. Training data: 5000 faces (all frontal, rescaled to 24x24 pixels) and 300 million non-face subwindows drawn from 9500 non-face images. Faces are normalized for scale and translation; there are many variations across individuals, illumination, and pose. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

31 System performance. Training time: weeks on a 466 MHz Sun workstation. 38 layers, 6061 features in total; an average of 10 features evaluated per window on the test set. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds (≈15 Hz), 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998). P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

32 Output of face detector on test images. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.

33 Topics: Support Vector Machines; Boosting (Viola-Jones face detector); Linear Algebra Review (Notation, Operations & Properties, Matrix Calculus); Probability (Axioms, Basic Properties, Bayes' Theorem, Chain Rule)

34 Linear algebra in computer vision. Representation: 3D points in the scene, 2D points in the image (images are matrices). Transformations: mapping 2D to 2D, mapping 3D to 2D.

35 Notation: we write A ∈ R^{m×n} for a real-valued matrix with m rows and n columns, and x ∈ R^n for a column vector (x^T denotes the corresponding row vector).

36 Notation: to indicate the element in the i-th row and j-th column of a matrix we write A_{ij}; similarly, to indicate the i-th entry in a vector we write x_i.

37 Norms: intuitively, the norm of a vector is a measure of its length. The ℓ2 norm is defined as ||x||_2 = sqrt(Σ_i x_i²); in this class we will use the ℓ2 norm unless otherwise noted, so we drop the subscript 2 for convenience. Note that ||x||² = x^T x.
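A quick numerical check of the ℓ2 norm definition and the identity ||x||² = x^T x (an illustrative numpy snippet):

```python
import numpy as np

x = np.array([3.0, 4.0])
norm = np.linalg.norm(x)      # l2 norm: sqrt(3**2 + 4**2) = 5.0
squared = x @ x               # x^T x = ||x||^2 = 25.0
```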

38 Linear independence and rank: a set of vectors is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors in the set. The rank of a matrix is the maximal number of linearly independent columns (equivalently, rows) of the matrix.
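For example (an illustrative numpy snippet; the matrix is made up), a 3x3 matrix with one row that is a multiple of another has rank 2:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # 2 x the first row: not independent
              [0.0, 1.0, 1.0]])
r = np.linalg.matrix_rank(A)     # maximal number of independent rows/columns
```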

39 Range and nullspace: the range of a matrix A ∈ R^{m×n} is the span of the columns of the matrix, denoted by the set R(A) = {v ∈ R^m : v = Ax for some x ∈ R^n}. The nullspace of A is the set of vectors that when multiplied by the matrix result in 0, given by the set N(A) = {x ∈ R^n : Ax = 0}.

40 Eigenvalues and eigenvectors: given a matrix A ∈ R^{n×n}, λ and x ≠ 0 are said to be an eigenvalue and the corresponding eigenvector of the matrix if Ax = λx. We can solve for the eigenvalues as the roots of the characteristic polynomial generated by det(λI − A) = 0.
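The defining relation Ax = λx can be verified numerically for each computed eigenpair (an illustrative numpy snippet; the triangular matrix is made up, so its eigenvalues are its diagonal entries, 2 and 3):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
vals, vecs = np.linalg.eig(A)    # eigenvalues and eigenvectors (as columns)
# check A v = lambda v for every eigenpair
ok = all(np.allclose(A @ vecs[:, i], vals[i] * vecs[:, i]) for i in range(2))
```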

41 Eigenvalue properties: for a diagonalizable matrix, the rank equals the number of its non-zero eigenvalues. The eigenvalues of a diagonal matrix D = diag(d_1, ..., d_n) are simply the diagonal entries. A matrix A is said to be diagonalizable if we can write A = V Λ V^{-1}, where the columns of V are eigenvectors and Λ is the diagonal matrix of eigenvalues.

42 Eigenvalues & eigenvectors of symmetric matrices: eigenvalues of symmetric matrices are real, and their eigenvectors can be chosen orthonormal. Consider the optimization problem max_x x^T A x subject to ||x|| = 1 involving the symmetric matrix A: the maximizing x is the eigenvector corresponding to the largest eigenvalue.
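This can be checked numerically: for a symmetric A, the Rayleigh quotient x^T A x at the top eigenvector equals the largest eigenvalue (an illustrative numpy snippet; the matrix is made up and has eigenvalues 1 and 3):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)   # eigh: real eigenvalues (ascending), orthonormal vecs
x_star = vecs[:, -1]             # eigenvector of the largest eigenvalue
rayleigh = x_star @ A @ x_star   # x^T A x with ||x|| = 1
```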

43 Generalized eigenvalues: the generalized eigenvalue problem is A x = λ B x. Generalized eigenvalues must satisfy det(A − λB) = 0. This reduces to the original eigenvalue problem when B^{-1} exists, since then B^{-1}A x = λ x. Generalized eigenvalues are used in Fisherfaces.
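The reduction to the standard problem can be demonstrated directly (an illustrative numpy snippet with made-up diagonal matrices; forming B^{-1}A explicitly is fine for a demo, though dedicated generalized-eigenvalue routines are preferred numerically):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
B = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# solve the standard problem for B^{-1} A
vals, vecs = np.linalg.eig(np.linalg.inv(B) @ A)
# verify the generalized relation A v = lambda B v for each pair
ok = all(np.allclose(A @ vecs[:, i], vals[i] * (B @ vecs[:, i])) for i in range(2))
```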

44 Singular Value Decomposition (SVD): the SVD of a matrix A ∈ R^{m×n} is given by A = U Σ V^T, where the columns u_i of U are called the left singular vectors, Σ is a diagonal matrix whose values σ_1 ≥ σ_2 ≥ ... ≥ 0 are called the singular values, and the columns v_i of V are called the right singular vectors.

45 SVD: if the matrix A has rank r, then A has r nonzero singular values, u_1, ..., u_r are an orthonormal basis for R(A), and v_{r+1}, ..., v_n are an orthonormal basis for N(A). The singular values of A are the square roots of the non-zero eigenvalues of A^T A or A A^T.
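Both the decomposition and the eigenvalue relation can be checked numerically (an illustrative numpy snippet on a random made-up 4x3 matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))
U, S, Vt = np.linalg.svd(A)      # S holds singular values, descending

# singular values = square roots of the eigenvalues of A^T A
eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]      # descending order
match = np.allclose(S, np.sqrt(np.clip(eigs, 0.0, None)))

# reconstruction A = U Sigma V^T (thin product with the first 3 columns of U)
recon = np.allclose(A, (U[:, :3] * S) @ Vt)
```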

46 Matlab. [V,D] = eig(A): the eigenvectors of A are the columns of V; D is a diagonal matrix whose entries are the eigenvalues of A. [V,D] = eig(A,B): the generalized eigenvectors are the columns of V; D is a diagonal matrix whose entries are the generalized eigenvalues. [U,S,V] = svd(X): the columns of U are the left singular vectors of X; S is a diagonal matrix whose entries are the singular values of X; the columns of V are the right singular vectors of X. Recall X = U*S*V';

47 Matrix calculus: gradient. Let f : R^{m×n} → R; then the gradient ∇_A f(A) ∈ R^{m×n} is given by (∇_A f(A))_{ij} = ∂f(A)/∂A_{ij}. The gradient is always the same size as A; thus if we just have a vector x ∈ R^n, the gradient is simply (∇_x f(x))_i = ∂f(x)/∂x_i.

48 Gradients are built from partial derivatives. Some common gradients: ∇_x (b^T x) = b; ∇_x (x^T A x) = (A + A^T) x, which equals 2Ax when A is symmetric.
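The identity ∇_x (x^T A x) = (A + A^T) x can be sanity-checked against finite differences (an illustrative numpy snippet on a random made-up matrix and vector):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

analytic = (A + A.T) @ x          # gradient of f(x) = x^T A x
eps = 1e-6
# central finite differences along each coordinate direction
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
ok = np.allclose(analytic, numeric, atol=1e-5)
```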

49 Topics: Support Vector Machines; Boosting (Viola-Jones face detector); Linear Algebra Review (Notation, Operations & Properties, Matrix Calculus); Probability (Axioms, Basic Properties, Bayes' Theorem, Chain Rule)

50 Probability in computer vision: the foundation for algorithms to solve tracking problems, human activity recognition, object recognition, and segmentation.

51 Probability axioms. Sample space Ω: the set of all the outcomes of a random experiment. Event space F: a set whose elements A ∈ F (events) are subsets of Ω; for example, F may be the set of all subsets of Ω. Probability measure: a function P : F → R that satisfies (1) P(A) ≥ 0 for all A ∈ F, (2) P(Ω) = 1, and (3) if A_1, A_2, ... are disjoint events then P(∪_i A_i) = Σ_i P(A_i).

52 Basic properties: for events A, B ∈ F, 0 ≤ P(A) ≤ 1; P(A^c) = 1 − P(A); if A ⊆ B then P(A) ≤ P(B); and P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

53 Conditional probability: P(A | B) = P(A ∩ B) / P(B). Two events are independent if P(A ∩ B) = P(A) P(B). Conditional independence: A and B are conditionally independent given C if P(A ∩ B | C) = P(A | C) P(B | C).

54 Product rule: from the definition of conditional probability we can write P(A, B) = P(A | B) P(B). From the product rule we can derive the chain rule of probability: P(A_1, ..., A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) ··· P(A_n | A_1, ..., A_{n−1}).

55 Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B), where P(B | A) is the likelihood, P(A | B) is the posterior probability, P(B) is the normalizing constant, and P(A) is the prior probability.
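A worked numerical example of Bayes' theorem (all numbers here are made up for illustration): a screening test for a rare condition, where the prior is small enough that most positives are false positives.

```python
# Made-up numbers: 1% prevalence, 99% sensitivity, 5% false-positive rate.
prior = 0.01        # P(condition)
sens = 0.99         # likelihood P(positive | condition)
false_pos = 0.05    # P(positive | no condition)

# normalizing constant P(positive) via the law of total probability
evidence = sens * prior + false_pos * (1 - prior)
# Bayes' theorem: posterior = likelihood * prior / evidence
posterior = sens * prior / evidence   # P(condition | positive), about 1/6
```

Despite the accurate test, the posterior is only about 0.17, because the prior P(condition) is so small: the normalizing constant is dominated by false positives.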

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang, 9/30/2011.


Expectation Maximization

Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same

CS 143 Linear Algebra Review

CS 143 Linear Algebra Review Stefan Roth September 29, 2003 Introductory Remarks This review does not aim at mathematical rigor very much, but instead at ease of understanding and conciseness. Please see

Machine Learning Lecture 10

Machine Learning Lecture 10 Neural Networks 26.11.2018 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Today s Topic Deep Learning 2 Course Outline Fundamentals Bayes

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

Evaluation requires to define performance measures to be optimized

Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

Math 1553, Introduction to Linear Algebra

Learning goals articulate what students are expected to be able to do in a course that can be measured. This course has course-level learning goals that pertain to the entire course, and section-level

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University

Lecture 4: Principal Component Analysis Aykut Erdem May 016 Hacettepe University This week Motivation PCA algorithms Applications PCA shortcomings Autoencoders Kernel PCA PCA Applications Data Visualization