Foundations of Large-Scale Multimedia Information Management and Retrieval
Lecture #3: Machine Learning
Edward Y. Chang
Machine Learning Approaches
- Introduction
- Linear Models
- Large D
- D >> N
- Generative vs. Discriminative Models
- Non-Linear Models
Statistical Learning
- Program computers to learn! Computers improve performance with experience at some task.
- Example:
  - Task: playing checkers
  - Performance: % of games it wins
  - Experience: games against expert players
Statistical Learning
- Task: ŷ = f(u), represented by some model(s); implies a hypothesis
- Performance: measured by error functions
- Experience (L): characterized by training data
- Algorithm (Φ)
Supervised Learning
- X: data; U: unlabeled pool; L: labeled pool; G: labels
- Tasks: regression, classification
- Φ: learning algorithm; f = Φ(L); ŷ = f(u)
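A minimal sketch of this abstraction, assuming a toy 1-D regression task (the function names and data here are illustrative, not from the slides):

```python
import numpy as np

def phi(L):
    """Learning algorithm Φ: fit a least-squares line to the labeled pool L."""
    X, y = L
    A = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimize ||Aw - y||^2
    return lambda u: w[0] + w[1] * u            # the learned hypothesis f

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * X + 1                                   # labels G for the labeled pool L
f = phi((X, y))                                 # f = Φ(L)
print(f(4.0))                                   # ŷ = f(u) on an unlabeled point: 9.0
```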
Learning Algorithms Φ
- Linear Model
- K-NN
- Kernel Methods
- Neural Networks
- Probabilistic Graphical Models
- Decision Trees
- Etc.
Linear Model
Linear Model
- ŷ = w_0 + Σ_{j=1}^{d} x_j w_j
- X is an n × d matrix (d: data dimension; n: number of training instances)
- ŷ = Xw
- L(w, S) = RSS(w) = (y − Xw)^T (y − Xw)   (RSS: Residual Sum of Squares)
- ∂L(w, S)/∂w = −2X^T y + 2X^T X w = 0
- w = (X^T X)^{−1} X^T y
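A minimal sketch of this closed-form solution on synthetic data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)      # y = Xw plus small noise

# Solve the normal equations X^T X w = X^T y; solve() is more stable than inv().
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                     # close to [1.0, -2.0, 0.5]
```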
Three Challenges
- D is too large: curse of dimensionality
- D > N: insufficient samples
- N is too large: addressed later
Gene Profiling Example
- N = 59 cases, D = 4026 genes
Subset Selection & Shrinkage
- Least squares often suffers from large variance
- Subset selection sets some coefficients exactly to zero; shrinkage pulls coefficients toward zero
- Algorithms: Forward Stepwise Selection, Backward Stepwise Selection, Ridge Regression
Ridge Regression
- w = argmin_w { Σ_{i=1}^{n} (y_i − w_0 − Σ_{j=1}^{d} x_{ij} w_j)^2 + λ Σ_{j=1}^{d} w_j^2 }
- Why would this help?
  - Regularization: remedying an ill-posed model
  - Correlated variables
- Data preparation: normalize the input; center the input (removing w_0)
- w = (X^T X + λ I_d)^{−1} X^T y
Ridge Regression (Tikhonov Regularization)
- min L_λ(w, S) = min λ‖w‖^2 + Σ_i (y_i − f(x_i))^2, as opposed to min (y − Xw)^T (y − Xw)
- w = (X^T X + λ I_d)^{−1} X^T y, as opposed to w = (X^T X)^{−1} X^T y
- Dual form: w = X^T α, with α = (G + λ I_n)^{−1} y (G: the n × n Gram matrix)
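A minimal sketch verifying that the primal and dual ridge solutions on this slide agree, with G = X X^T (synthetic data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 10, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Primal: w = (X^T X + λ I_d)^{-1} X^T y  (a d x d solve)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: α = (G + λ I_n)^{-1} y, then w = X^T α  (an n x n solve)
G = X @ X.T                                     # n x n Gram matrix
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))            # True: the two forms agree
```

Which form is cheaper depends on whether n or d is smaller, echoing the primal-vs-dual cost comparison later in the deck.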
Regularization (figure from Wikipedia)
SVD Interpretation
PCR: Principal Component Regression
- SVD: discard the components with the smallest singular values (PCA)
- Linear multivariate regression then reduces to a sum of univariate regressions
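A minimal sketch of this recipe, assuming centered inputs, PCA via SVD, and regression on the top-k component scores (since the scores are orthogonal, the multivariate fit is a sum of univariate fits; all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 100, 20, 5
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)                        # center the input first
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                           # scores on the top-k principal components
# Columns of Z are orthogonal, so the multivariate regression on Z reduces to
# independent univariate regressions: theta_j = (z_j . y) / (z_j . z_j).
theta = (Z * y[:, None]).sum(axis=0) / (Z ** 2).sum(axis=0)
w_pcr = Vt[:k].T @ theta                   # map coefficients back to the features
print(w_pcr[:3])                           # approximate, since only k of d directions are kept
```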
Limitations & Treatments
- High bias
- Low variance
Linear Model
Limitations & Treatments
- High bias, low variance
- High-dimensional or overfitting: Ridge, Subset Selection, Lasso, PCR, PLS
- In general: regularization
Generative vs. Discriminative Models
- Generative Models
  - Model the entire distribution, one class at a time
  - Look for maximum likelihood
- Discriminative Models
  - Model class boundaries; ignore the distribution
  - Support Vector Machines (SVMs)
  - Perhaps better for large problems!
Maximum Likelihood View
- ŷ = w_0 + Σ_{j=1}^{d} w_j x_j, i.e., ŷ = Xw
- y = Xw + ε, where the noise terms ε are independent, ε ~ N(0, σ²)
- P(y | w, x) has a normal distribution with mean ŷ = wx and variance σ²
Derivation
- P(y | w, x) ~ N(ŷ, σ²)
- Training: given (x_1, y_1), (x_2, y_2), …, (x_n, y_n), infer w from the training data
- By Bayes' rule, or by the Maximum Likelihood Estimate
Maximum Likelihood
- For what w is P(y_1, y_2, …, y_n | x_1, x_2, …, x_n, w) maximized?
- Π_i P(y_i | w, x_i) maximized?
- Π_i exp(−½ ((y_i − wx_i)/σ)²) maximized?
- Σ_i −½ ((y_i − wx_i)/σ)² maximized?
- Σ_i (y_i − wx_i)² minimized?
Observations
- Minimizing RSS = the maximum likelihood (MAP under a flat prior) estimate
- What if n < d?
- Gradient Descent (Perceptron): converges only when instances are linearly separable; otherwise it behaves erratically
- Dual Formulation
Dual View (Duality)
- Primal: w = (X^T X)^{−1} X^T y   [(d × d)(d × n)(n × 1)]
- Dual (if (X^T X)^{−1} exists):
  w = X^T X (X^T X)^{−2} X^T y = X^T (X (X^T X)^{−2} X^T y) = X^T α
  so w = Σ_{i=1..n} α_i x_i, with α = X (X^T X)^{−2} X^T y   [(n × d)(d × d)(d × n)(n × 1); compare the n × n Gram matrix G = X X^T]
- When n < d, (X^T X)^{−1} does not exist: restrict (bias) the choice of functions, i.e., regularization
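A minimal sketch checking the primal/dual identity above, assuming X^T X is invertible (n ≥ d with full column rank; data illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
w_primal = XtX_inv @ X.T @ y                   # primal: w = (X^T X)^{-1} X^T y
alpha = X @ XtX_inv @ XtX_inv @ X.T @ y        # α = X (X^T X)^{-2} X^T y
w_dual = X.T @ alpha                           # dual: w = X^T α = Σ_i α_i x_i
print(np.allclose(w_primal, w_dual))           # True
```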
Ridge Regression
- min L_λ(w, S) = min λ‖w‖^2 + Σ_i (y_i − f(x_i))^2, as opposed to min (y − Xw)^T (y − Xw)
- w = (X^T X + λ I_d)^{−1} X^T y, as opposed to w = (X^T X)^{−1} X^T y
- Dual form: w = X^T α, with α = (G + λ I_n)^{−1} y (G: the n × n Gram matrix)
Primal vs. Dual

                    Primal    Dual
  Training cost     O(d³)     O(n³)
  Classification    O(d)      O(nd)
Primal vs. Dual (when n < d, the dual is the choice)

                    Primal    Dual
  Training cost     O(d³)     O(n³)
  Classification    O(d)      O(nd)
Models & Linearity
- Generative Models
  - Model the entire distribution, one class at a time
  - Look for maximum likelihood
- Discriminative Models
  - Model class boundaries; ignore the distribution
  - Support Vector Machines (SVMs)
  - Perhaps better for large problems!
Gaussian Mixture Model (figure from http://neural.cs.nthu.edu.tw/jang/matlab/toolbox/dcpr/image/gmmtraindemo2dcovtype01a.gif)
Support Vector Machine: Linear
Support Vector Machine: Nonlinear
Decision Tree (figure from http://upload.wikimedia.org/wikipedia/commons/f/ff/decision_tree_model.png)
Decision Tree Output (figure from http://prsysdesign.net/prsd/blog/dectree/dectree1.png)
Boosted Decision Tree
- Multiple weak classifiers: strength in numbers
- Emphasize mistakes: put resources on the hard cases
- Provable: yields a strong classifier; converges
AdaBoost Example
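A minimal sketch of the boosting recipe on the previous slide; scikit-learn and its AdaBoostClassifier (whose default weak learner is a depth-1 decision stump) are assumptions here, since the slides name no library:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)          # one weak classifier
boosted = AdaBoostClassifier(n_estimators=100,                 # strength in numbers
                             random_state=0).fit(X, y)

print(stump.score(X, y))     # a single stump underfits
print(boosted.score(X, y))   # reweighting the hard cases yields a far stronger fit
```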
Machine Learning Approaches
- Introduction
- Linear Models
- Large D
- D >> N
- Generative vs. Discriminative Models
- Non-Linear Models
Classical Model
- N: number of training instances (N₊ positive, N₋ negative)
- D: dimensionality
- N >> D, e.g., PAC learnability
Emerging MM Applications
- N < D
- N₊ << N₋
- Examples: information retrieval with relevance feedback; surveillance event detection
IR → A Classification Problem
Apple Search
Relevance Feedback
Fruit
Text-based image search limitations...
VIMA Visual Search
Step #1: Solicit Labels
Step #2: Compute Boundary
Step #3: Identify Useful Samples
Step #4: Solicit More Feedback
Step #5: Refine Boundary
Step #6: Ranking
Observations
- Identify good samples
- Collect diversified samples
- Is a linear model sufficient?
IR → A Classification Problem
Non-Linear Boundary
Separating Hyperplane
Separating Hyperplane
Maximum Margin Hyperplane
Linear Model Fits All Data?
Linear Model Fits All?
How about Joining the Dots?
- ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
- With k = 1, the fit joins the dots
NN with k = 1
Nearest Neighbor
Four things make an NN (memory-based) learner:
- A distance function
- k: how many neighbors to consider?
- A weighting function (optional)
- How to fit the local points?
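A minimal k-NN regression sketch of this recipe, assuming Euclidean distance, uniform weights, and a locally constant fit ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i (all names and data are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    dist = np.linalg.norm(X_train - x, axis=1)   # the distance function
    nearest = np.argsort(dist)[:k]               # indices of the k nearest neighbors
    return y_train[nearest].mean()               # locally constant fit (average)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_predict(X_train, y_train, np.array([1.6]), k=1))  # 4.0: joins the dots
print(knn_predict(X_train, y_train, np.array([1.6]), k=3))  # (4+1+9)/3 ≈ 4.67: smoother
```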
Problems
- Fitting noise
- Jagged boundaries
Solutions
- Fitting noise: pick a larger k?
- Jagged boundaries: introduce a kernel as a weighting function
NN with k = 15
NN
Solutions
- Fitting noise: pick a larger k?
- Jagged boundaries: introduce a kernel as a weighting function
Nearest Neighbor → Kernel Method
Four things make a memory-based learner:
- A distance function
- k: how many neighbors to consider? All of them
- A weighting function: RBF kernels
- How to fit the local points? Predict with the weights
Kernel Method
- RBF weighting function
  - The kernel width holds the key
  - Use cross-validation to find the optimal width
- Fitting with the local points: where NN meets the linear model
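A minimal sketch of RBF-weighted prediction (Nadaraya-Watson style kernel regression), assuming the recipe above: weight all training points by an RBF kernel and predict with a weighted average; the width is the key knob (data illustrative):

```python
import numpy as np

def rbf_predict(X_train, y_train, x, width=0.5):
    d2 = ((X_train - x) ** 2).sum(axis=1)      # squared distances to the query
    w = np.exp(-d2 / (2 * width ** 2))         # RBF weights: nearby points dominate
    return (w * y_train).sum() / w.sum()       # weighted-average prediction

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
# A small width behaves like 1-NN; a large width flattens toward the global mean.
for width in (0.1, 0.5, 5.0):
    print(width, rbf_predict(X_train, y_train, np.array([1.6]), width))
```

In practice the width would be picked by cross-validation, as the slide says.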
LM vs. NN
- Linear Model: f(x) is approximated by a global linear function; more stable, less flexible
- Nearest Neighbor: k-NN assumes f(x) is well approximated by a locally constant function; less stable, more flexible
- Between LM and NN: the other models
Where Are We, and Where Are We Heading?
- LM and NN
- The kernel method from three views: the LM view, the NN view, the geometric view
Linear Model View
- y = β_0 + Σ β x
- Separating hyperplane: max_{‖β‖=1} C subject to y_i f(x_i) ≥ C, i.e., y_i (β_0 + β·x_i) ≥ C
Maximum Margin Hyperplane
Classifier Margin
- Margin: defined as the width the boundary can grow before hitting a data object
- Maximum margin tends to minimize classification variance (no formal theory for this yet)
Separating Hyperplane
M's Mathematical Representation
- Plus-plane: {x : w·x + b = +1}
- Minus-plane: {x : w·x + b = −1}
- w ⊥ plus-plane: w·(u − v) = 0 if u and v are both on the plus-plane
- w ⊥ minus-plane
Separating Hyperplane
M
- Let x⁻ be any point on the minus-plane
- Let x⁺ be the closest plus-plane point to x⁻
- x⁺ = x⁻ + λw. Why? Because the segment (x⁺ − x⁻) is ⊥ to the minus-plane, and so is w
- M = ‖x⁺ − x⁻‖
M
1. w·x⁻ + b = −1
2. w·x⁺ + b = +1
3. x⁺ = x⁻ + λw
4. M = ‖x⁺ − x⁻‖
5. w·(x⁻ + λw) + b = 1   (from 2 and 3)
6. w·x⁻ + b + λ(w·w) = 1
7. λ(w·w) = 2   (from 1 and 6)
M
1. λ(w·w) = 2
2. λ = 2 / (w·w)
3. M = ‖x⁺ − x⁻‖ = ‖λw‖ = λ‖w‖ = 2/‖w‖
4. Max M: gradient descent, simulated annealing, EM, Newton's method?
Max M
- Max M = 2/‖w‖  ⇔  min ‖w‖/2  ⇔  min ‖w‖²/2, subject to y_i (x_i·w + b) ≥ 1, i = 1, …, N
- A quadratic criterion with linear inequality constraints
Max M
- min ‖w‖²/2 subject to y_i (x_i·w + b) ≥ 1, i = 1, …, N
- Lagrangian: L_p = min_{w,b} ‖w‖²/2 − Σ_{i=1..N} α_i [y_i (x_i·w + b) − 1]
- Setting ∂L_p/∂w = 0: w = Σ_{i=1..N} α_i y_i x_i
- Setting ∂L_p/∂b = 0: 0 = Σ_{i=1..N} α_i y_i
Wolfe Dual
- L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j (x_i·x_j)
- Subject to α_i ≥ 0
- KKT conditions: α_i [y_i (x_i·w + b) − 1] = 0
  - α_i > 0 ⟹ y_i (x_i·w + b) = 1 (support vectors)
  - α_i = 0 ⟹ y_i (x_i·w + b) > 1
Class Prediction
- y_q = w·x_q + b
- w = Σ_{i=1..N} α_i y_i x_i
- y_q = sign(Σ_{i=1..N} α_i y_i (x_i·x_q) + b)
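A minimal sketch tying these formulas to a solver; scikit-learn's SVC is an assumption here (the slides name no library). It verifies w = Σ α_i y_i x_i and that the prediction uses only the support vectors (data illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=10.0).fit(X, y)
# dual_coef_ stores α_i y_i for the support vectors only (α_i = 0 elsewhere).
w = clf.dual_coef_ @ clf.support_vectors_            # w = Σ α_i y_i x_i
print(np.allclose(w, clf.coef_))                     # True

x_q = np.array([[0.5, 0.5]])
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_q.T) + clf.intercept_
print(np.sign(score.item()) == clf.predict(x_q)[0])  # same decision as the solver
```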
Non-separable Classes
- Soft-margin hyperplane
- Basis expansion
Non-separable Case
Soft-Margin SVMs
- Hard margin: min ‖w‖²/2 subject to y_i (x_i·w + b) ≥ 1, i = 1, …, N
- Soft margin: min ‖w‖²/2 + C Σ_i ε_i subject to
  - x_i·w + b ≥ +1 − ε_i if y_i = +1
  - x_i·w + b ≤ −1 + ε_i if y_i = −1
  - ε_i ≥ 0
Non-separable Case
Wolfe Dual (soft margin)
- L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j (x_i·x_j)
- Subject to C ≥ α_i ≥ 0, Σ_i α_i y_i = 0, and the KKT conditions
- y_q = sign(Σ_{i=1..N} α_i y_i (x_i·x_q) + b)
Basis Function
Harder 1D Example
Basis Function
- Φ(x) = (x, x²)
Harder 1D Example
Some Basis Functions
- Φ(X) = Σ_m γ_m h_m(X), where h_m: ℝ^p → ℝ
- Common functions: polynomial, radial basis functions, sigmoid functions
Wolfe Dual (with basis expansion)
- L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j (Φ(x_i)·Φ(x_j))
- Subject to C ≥ α_i ≥ 0, Σ_i α_i y_i = 0, and the KKT conditions
- y_q = sign(Σ_{i=1..N} α_i y_i (Φ(x_i)·Φ(x_q)) + b)
- K(x_i, x_j) = Φ(x_i)·Φ(x_j): the kernel function!
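A minimal sketch of the kernel trick on a harder 1-D example like the slides': the negatives sit between the positives, so no single threshold on x separates them, but a classifier linear in Φ(x) = (x, x²) can (the data and the use of scikit-learn are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([1, 1, -1, -1, -1, 1, 1])       # not linearly separable in 1-D

linear = SVC(kernel="linear", C=10.0).fit(x, y)
# K(xi, xj) = (γ xi·xj + 1)²: its implicit feature map contains x and x².
poly = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(x, y)

print(linear.score(x, y))   # below 1.0: no threshold on x splits the classes
print(poly.score(x, y))     # 1.0: linear in the expanded (x, x²) space
```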
Nearest Neighbor View
- Z: a set of zero-mean, jointly Gaussian random variables; each z_i corresponds to one example x_i
- Cov(z_i, z_j) = K(x_i, x_j)
- y_i, the label of z_i, is +1 or −1
- P(y_i | z_i) = σ(y_i z_i)
Training Data
General Kernel Classifier [Jaakkola et al., 1999]
- MAP classification for x_t: y_t = sign(Σ_i α_i y_i K(x_t, x_i))
- K(x_i, x_j) = Cov(z_i, z_j) (some similarity function)
- Supervised training: compute the α_i given X and y, and an error function such as
  J(α) = −½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) + Σ_i F(α_i)
Leave One Out
SVMs
- y_t = sign(Σ_i α_i y_i K(x_t, x_i))
- (y_i, x_i): training data; α_i nonnegative; kernel K positive definite
- The α_i are obtained by maximizing J(α) = −½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) + Σ_i F(α_i)
- With F(α_i) = α_i, α_i ≥ 0, and Σ_i y_i α_i = 0, this is exactly the SVM dual
SVMs
Important Insight
- K(x_i, x_j) = Cov(z_i, z_j)
- To design a kernel is to design a similarity function that produces a positive definite covariance matrix on the training instances
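A minimal check of this insight: a valid kernel must yield a positive (semi-)definite matrix on the training set. The sketch below tests the RBF kernel on random points by inspecting the eigenvalues of its Gram matrix (data illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-d2 / 2.0)                                 # RBF kernel (Gram) matrix

eigvals = np.linalg.eigvalsh(K)                       # symmetric -> real eigenvalues
print(eigvals.min() >= -1e-10)                        # True: K is positive semi-definite
```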
Basis Function Selection
Three general approaches:
- Restriction methods: limit the class of functions
- Selection methods: scan the dictionary adaptively (boosting)
- Regularization methods: use the entire dictionary but restrict the coefficients (ridge regression)
Overfitting? Probably not, because:
- N free parameters (not D)
- Maximizing the margin
Summary of ML
- Introduction
- Linear Models
- Large D
- D >> N
- Generative vs. Discriminative Models
- Nearest Neighbors
- Non-Linear Models
- Chapters 10, 11, 12: Large N
Reading
- Foundations of Large-Scale Multimedia Information Management and Retrieval, E. Y. Chang, Springer, 2011
  - Chapter #3: Query-Concept Learning
  - Chapter #9: Imbalanced Data Learning