Foundations of Large-Scale Multimedia Information Management and Retrieval
Lecture #3: Machine Learning
Edward Y. Chang
Machine Learning Approaches
- Introduction
- Linear Models
- Large D
- D >> N
- Generative vs. Discriminative Models
- Non-Linear Models
Statistical Learning
- Program computers to learn! Computers improve performance with experience at some task.
- Example:
  - Task: playing checkers
  - Performance: % of games it wins
  - Experience: games against expert players
Statistical Learning
- Task: ŷ = f(u), represented by some model(s); implies a hypothesis
- Performance: measured by error functions
- Experience (L): characterized by training data
- Algorithm (Φ)
Supervised Learning
- X: data; U: unlabeled pool; L: labeled pool; G: labels
- Tasks: regression, classification
- Φ: learning algorithm; f = Φ(L); ŷ = f(u)
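A minimal sketch of this abstraction, assuming a toy 1-D regression task (the function names and data here are illustrative, not from the slides):

```python
import numpy as np

def phi(L):
    """Learning algorithm Φ: fit a least-squares line to the labeled pool L."""
    X, y = L
    A = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimize ||Aw - y||^2
    return lambda u: w[0] + w[1] * u            # the learned hypothesis f

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * X + 1                                   # labels G for the labeled pool L
f = phi((X, y))                                 # f = Φ(L)
print(f(4.0))                                   # ŷ = f(u) on an unlabeled point: 9.0
```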
Learning Algorithms Φ
- Linear Model
- K-NN
- Kernel Methods
- Neural Networks
- Probabilistic Graphical Models
- Decision Trees
- Etc.
Linear Model
Linear Model
- ŷ = w_0 + Σ_{j=1}^{d} x_j w_j
- X is an n × d matrix (d: data dimension; n: number of training instances)
- ŷ = Xw
- L(w, S) = RSS(w) = (y − Xw)^T (y − Xw)   (RSS: Residual Sum of Squares)
- ∂L(w, S)/∂w = −2X^T y + 2X^T X w = 0
- w = (X^T X)^{−1} X^T y
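A minimal sketch of this closed-form solution on synthetic data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)      # y = Xw plus small noise

# Solve the normal equations X^T X w = X^T y; solve() is more stable than inv().
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                     # close to [1.0, -2.0, 0.5]
```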
Three Challenges
- D is too large: curse of dimensionality
- D > N: insufficient samples
- N is too large: addressed later
Gene Profiling Example
- N = 59 cases, D = 4026 genes
Subset Selection & Shrinkage
- Least squares often suffers from large variance
- Subset selection sets some coefficients exactly to zero; shrinkage pulls coefficients toward zero
- Algorithms: Forward Stepwise Selection, Backward Stepwise Selection, Ridge Regression
Ridge Regression
- w = argmin_w { Σ_{i=1}^{n} (y_i − w_0 − Σ_{j=1}^{d} x_{ij} w_j)^2 + λ Σ_{j=1}^{d} w_j^2 }
- Why would this help?
  - Regularization: remedying an ill-posed model
  - Correlated variables
- Data preparation: normalize the input; center the input (removing w_0)
- w = (X^T X + λ I_d)^{−1} X^T y
Ridge Regression (Tikhonov Regularization)
- min L_λ(w, S) = min λ‖w‖^2 + Σ_i (y_i − f(x_i))^2, as opposed to min (y − Xw)^T (y − Xw)
- w = (X^T X + λ I_d)^{−1} X^T y, as opposed to w = (X^T X)^{−1} X^T y
- Dual form: w = X^T α, with α = (G + λ I_n)^{−1} y (G: the n × n Gram matrix)
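A minimal sketch verifying that the primal and dual ridge solutions on this slide agree, with G = X X^T (synthetic data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 10, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Primal: w = (X^T X + λ I_d)^{-1} X^T y  (a d x d solve)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: α = (G + λ I_n)^{-1} y, then w = X^T α  (an n x n solve)
G = X @ X.T                                     # n x n Gram matrix
alpha = np.linalg.solve(G + lam * np.eye(n), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))            # True: the two forms agree
```

Which form is cheaper depends on whether n or d is smaller, echoing the primal-vs-dual cost comparison later in the deck.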
Regularization (figure from Wikipedia)
SVD Interpretation
PCR: Principal Component Regression
- SVD: discard the components with the smallest singular values (PCA)
- Linear multivariate regression then reduces to a sum of univariate regressions
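A minimal sketch of this recipe, assuming centered inputs, PCA via SVD, and regression on the top-k component scores (since the scores are orthogonal, the multivariate fit is a sum of univariate fits; all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 100, 20, 5
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)                        # center the input first
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                           # scores on the top-k principal components
# Columns of Z are orthogonal, so the multivariate regression on Z reduces to
# independent univariate regressions: theta_j = (z_j . y) / (z_j . z_j).
theta = (Z * y[:, None]).sum(axis=0) / (Z ** 2).sum(axis=0)
w_pcr = Vt[:k].T @ theta                   # map coefficients back to the features
print(w_pcr[:3])                           # approximate, since only k of d directions are kept
```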
Limitations & Treatments
- High bias
- Low variance
Linear Model
Limitations & Treatments
- High bias, low variance
- High-dimensional or overfitting: Ridge, Subset Selection, Lasso, PCR, PLS
- In general: regularization
Generative vs. Discriminative Models
- Generative Models
  - Model the entire distribution, one class at a time
  - Look for maximum likelihood
- Discriminative Models
  - Model class boundaries; ignore the distribution
  - Support Vector Machines (SVMs)
  - Perhaps better for large problems!
Maximum Likelihood View
- ŷ = w_0 + Σ_{j=1}^{d} w_j x_j, i.e., ŷ = Xw
- y = Xw + ε, where the noise terms ε are independent, ε ~ N(0, σ²)
- P(y | w, x) has a normal distribution with mean ŷ = wx and variance σ²
Derivation
- P(y | w, x) ~ N(ŷ, σ²)
- Training: given (x_1, y_1), (x_2, y_2), …, (x_n, y_n), infer w from the training data
- By Bayes' rule, or by the Maximum Likelihood Estimate
Maximum Likelihood
- For what w is P(y_1, y_2, …, y_n | x_1, x_2, …, x_n, w) maximized?
- Π_i P(y_i | w, x_i) maximized?
- Π_i exp(−½ ((y_i − wx_i)/σ)²) maximized?
- Σ_i −½ ((y_i − wx_i)/σ)² maximized?
- Σ_i (y_i − wx_i)² minimized?
Observations
- Minimizing RSS = the maximum likelihood (MAP under a flat prior) estimate
- What if n < d?
- Gradient Descent (Perceptron): converges only when instances are linearly separable; otherwise it behaves erratically
- Dual Formulation
Dual View (Duality)
- Primal: w = (X^T X)^{−1} X^T y   [(d × d)(d × n)(n × 1)]
- Dual (if (X^T X)^{−1} exists):
  w = X^T X (X^T X)^{−2} X^T y = X^T (X (X^T X)^{−2} X^T y) = X^T α
  so w = Σ_{i=1..n} α_i x_i, with α = X (X^T X)^{−2} X^T y   [(n × d)(d × d)(d × n)(n × 1); compare the n × n Gram matrix G = X X^T]
- When n < d, (X^T X)^{−1} does not exist: restrict (bias) the choice of functions, i.e., regularization
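A minimal sketch checking the primal/dual identity above, assuming X^T X is invertible (n ≥ d with full column rank; data illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
w_primal = XtX_inv @ X.T @ y                   # primal: w = (X^T X)^{-1} X^T y
alpha = X @ XtX_inv @ XtX_inv @ X.T @ y        # α = X (X^T X)^{-2} X^T y
w_dual = X.T @ alpha                           # dual: w = X^T α = Σ_i α_i x_i
print(np.allclose(w_primal, w_dual))           # True
```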
Ridge Regression
- min L_λ(w, S) = min λ‖w‖^2 + Σ_i (y_i − f(x_i))^2, as opposed to min (y − Xw)^T (y − Xw)
- w = (X^T X + λ I_d)^{−1} X^T y, as opposed to w = (X^T X)^{−1} X^T y
- Dual form: w = X^T α, with α = (G + λ I_n)^{−1} y (G: the n × n Gram matrix)
Primal vs. Dual

                    Primal    Dual
  Training cost     O(d³)     O(n³)
  Classification    O(d)      O(nd)
Primal vs. Dual (when n < d, the dual is the choice)

                    Primal    Dual
  Training cost     O(d³)     O(n³)
  Classification    O(d)      O(nd)
Models & Linearity
- Generative Models
  - Model the entire distribution, one class at a time
  - Look for maximum likelihood
- Discriminative Models
  - Model class boundaries; ignore the distribution
  - Support Vector Machines (SVMs)
  - Perhaps better for large problems!
Gaussian Mixture Model (figure from http://neural.cs.nthu.edu.tw/jang/matlab/toolbox/dcpr/image/gmmtraindemo2dcovtype01a.gif)
Support Vector Machine: Linear
Support Vector Machine: Nonlinear
Decision Tree (figure from http://upload.wikimedia.org/wikipedia/commons/f/ff/decision_tree_model.png)
Decision Tree Output (figure from http://prsysdesign.net/prsd/blog/dectree/dectree1.png)
Boosted Decision Tree
- Multiple weak classifiers: strength in numbers
- Emphasize mistakes: put resources on the hard cases
- Provable: yields a strong classifier; converges
AdaBoost Example
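A minimal sketch of the boosting recipe on the previous slide; scikit-learn and its AdaBoostClassifier (whose default weak learner is a depth-1 decision stump) are assumptions here, since the slides name no library:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)          # one weak classifier
boosted = AdaBoostClassifier(n_estimators=100,                 # strength in numbers
                             random_state=0).fit(X, y)

print(stump.score(X, y))     # a single stump underfits
print(boosted.score(X, y))   # reweighting the hard cases yields a far stronger fit
```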
Machine Learning Approaches
- Introduction
- Linear Models
- Large D
- D >> N
- Generative vs. Discriminative Models
- Non-Linear Models
Classical Model
- N: number of training instances (N₊ positive, N₋ negative)
- D: dimensionality
- N >> D, e.g., PAC learnability
Emerging MM Applications
- N < D
- N₊ << N₋
- Examples: information retrieval with relevance feedback; surveillance event detection
IR → A Classification Problem
Apple Search
Relevance Feedback
Fruit
Text-based image search limitations...
VIMA Visual Search
Step #1: Solicit Labels
Step #2: Compute Boundary
Step #3: Identify Useful Samples
Step #4: Solicit More Feedback
Step #5: Refine Boundary
Step #6: Ranking
Observations
- Identify good samples
- Collect diversified samples
- Is a linear model sufficient?
IR → A Classification Problem
Non-Linear Boundary
Separating Hyperplane
Separating Hyperplane
Maximum Margin Hyperplane
Linear Model Fits All Data?
Linear Model Fits All?
How about Joining the Dots?
- ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
- With k = 1, the fit joins the dots
NN with k = 1
Nearest Neighbor
Four things make an NN (memory-based) learner:
- A distance function
- k: how many neighbors to consider?
- A weighting function (optional)
- How to fit the local points?
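A minimal k-NN regression sketch of this recipe, assuming Euclidean distance, uniform weights, and a locally constant fit ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i (all names and data are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    dist = np.linalg.norm(X_train - x, axis=1)   # the distance function
    nearest = np.argsort(dist)[:k]               # indices of the k nearest neighbors
    return y_train[nearest].mean()               # locally constant fit (average)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_predict(X_train, y_train, np.array([1.6]), k=1))  # 4.0: joins the dots
print(knn_predict(X_train, y_train, np.array([1.6]), k=3))  # (4+1+9)/3 ≈ 4.67: smoother
```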
Problems
- Fitting noise
- Jagged boundaries
Solutions
- Fitting noise: pick a larger k?
- Jagged boundaries: introduce a kernel as a weighting function
NN with k = 15
NN
Solutions
- Fitting noise: pick a larger k?
- Jagged boundaries: introduce a kernel as a weighting function
Nearest Neighbor → Kernel Method
Four things make a memory-based learner:
- A distance function
- k: how many neighbors to consider? All of them
- A weighting function: RBF kernels
- How to fit the local points? Predict with the weights
Kernel Method
- RBF weighting function
  - The kernel width holds the key
  - Use cross-validation to find the optimal width
- Fitting with the local points: where NN meets the linear model
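A minimal sketch of RBF-weighted prediction (Nadaraya-Watson style kernel regression), assuming the recipe above: weight all training points by an RBF kernel and predict with a weighted average; the width is the key knob (data illustrative):

```python
import numpy as np

def rbf_predict(X_train, y_train, x, width=0.5):
    d2 = ((X_train - x) ** 2).sum(axis=1)      # squared distances to the query
    w = np.exp(-d2 / (2 * width ** 2))         # RBF weights: nearby points dominate
    return (w * y_train).sum() / w.sum()       # weighted-average prediction

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
# A small width behaves like 1-NN; a large width flattens toward the global mean.
for width in (0.1, 0.5, 5.0):
    print(width, rbf_predict(X_train, y_train, np.array([1.6]), width))
```

In practice the width would be picked by cross-validation, as the slide says.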
LM vs. NN
- Linear Model: f(x) is approximated by a global linear function; more stable, less flexible
- Nearest Neighbor: k-NN assumes f(x) is well approximated by a locally constant function; less stable, more flexible
- Between LM and NN: the other models
Where Are We, and Where Are We Heading?
- LM and NN
- The kernel method from three views: the LM view, the NN view, the geometric view
Linear Model View
- y = β_0 + Σ β x
- Separating hyperplane: max_{‖β‖=1} C subject to y_i f(x_i) ≥ C, i.e., y_i (β_0 + β·x_i) ≥ C
Maximum Margin Hyperplane
Classifier Margin
- Margin: defined as the width the boundary can grow before hitting a data object
- Maximum margin tends to minimize classification variance (no formal theory for this yet)
Separating Hyperplane
M's Mathematical Representation
- Plus-plane: {x : w·x + b = +1}
- Minus-plane: {x : w·x + b = −1}
- w ⊥ plus-plane: w·(u − v) = 0 if u and v are both on the plus-plane
- w ⊥ minus-plane
Separating Hyperplane
M
- Let x⁻ be any point on the minus-plane
- Let x⁺ be the closest plus-plane point to x⁻
- x⁺ = x⁻ + λw. Why? Because the segment (x⁺ − x⁻) is ⊥ to the minus-plane, and so is w
- M = ‖x⁺ − x⁻‖
M
1. w·x⁻ + b = −1
2. w·x⁺ + b = +1
3. x⁺ = x⁻ + λw
4. M = ‖x⁺ − x⁻‖
5. w·(x⁻ + λw) + b = 1   (from 2 and 3)
6. w·x⁻ + b + λ(w·w) = 1
7. λ(w·w) = 2   (from 1 and 6)
M
1. λ(w·w) = 2
2. λ = 2 / (w·w)
3. M = ‖x⁺ − x⁻‖ = ‖λw‖ = λ‖w‖ = 2/‖w‖
4. Max M: gradient descent, simulated annealing, EM, Newton's method?
Max M
- Max M = 2/‖w‖  ⇔  min ‖w‖/2  ⇔  min ‖w‖²/2, subject to y_i (x_i·w + b) ≥ 1, i = 1, …, N
- A quadratic criterion with linear inequality constraints
Max M
- min ‖w‖²/2 subject to y_i (x_i·w + b) ≥ 1, i = 1, …, N
- Lagrangian: L_p = min_{w,b} ‖w‖²/2 − Σ_{i=1..N} α_i [y_i (x_i·w + b) − 1]
- Setting ∂L_p/∂w = 0: w = Σ_{i=1..N} α_i y_i x_i
- Setting ∂L_p/∂b = 0: 0 = Σ_{i=1..N} α_i y_i
Wolfe Dual
- L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j (x_i·x_j)
- Subject to α_i ≥ 0
- KKT conditions: α_i [y_i (x_i·w + b) − 1] = 0
  - α_i > 0 ⟹ y_i (x_i·w + b) = 1 (support vectors)
  - α_i = 0 ⟹ y_i (x_i·w + b) > 1
Class Prediction
- y_q = w·x_q + b
- w = Σ_{i=1..N} α_i y_i x_i
- y_q = sign(Σ_{i=1..N} α_i y_i (x_i·x_q) + b)
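A minimal sketch tying these formulas to a solver; scikit-learn's SVC is an assumption here (the slides name no library). It verifies w = Σ α_i y_i x_i and that the prediction uses only the support vectors (data illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=10.0).fit(X, y)
# dual_coef_ stores α_i y_i for the support vectors only (α_i = 0 elsewhere).
w = clf.dual_coef_ @ clf.support_vectors_            # w = Σ α_i y_i x_i
print(np.allclose(w, clf.coef_))                     # True

x_q = np.array([[0.5, 0.5]])
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_q.T) + clf.intercept_
print(np.sign(score.item()) == clf.predict(x_q)[0])  # same decision as the solver
```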
Non-separable Classes
- Soft-margin hyperplane
- Basis expansion
Non-separable Case
Soft-Margin SVMs
- Hard margin: min ‖w‖²/2 subject to y_i (x_i·w + b) ≥ 1, i = 1, …, N
- Soft margin: min ‖w‖²/2 + C Σ_i ε_i subject to
  - x_i·w + b ≥ +1 − ε_i if y_i = +1
  - x_i·w + b ≤ −1 + ε_i if y_i = −1
  - ε_i ≥ 0
Non-separable Case
Wolfe Dual (soft margin)
- L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j (x_i·x_j)
- Subject to C ≥ α_i ≥ 0, Σ_i α_i y_i = 0, and the KKT conditions
- y_q = sign(Σ_{i=1..N} α_i y_i (x_i·x_q) + b)
Basis Function
Harder 1D Example
Basis Function
- Φ(x) = (x, x²)
Harder 1D Example
Some Basis Functions
- Φ(X) = Σ_m γ_m h_m(X), where h_m: ℝ^p → ℝ
- Common functions: polynomial, radial basis functions, sigmoid functions
Wolfe Dual (with basis expansion)
- L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j (Φ(x_i)·Φ(x_j))
- Subject to C ≥ α_i ≥ 0, Σ_i α_i y_i = 0, and the KKT conditions
- y_q = sign(Σ_{i=1..N} α_i y_i (Φ(x_i)·Φ(x_q)) + b)
- K(x_i, x_j) = Φ(x_i)·Φ(x_j): the kernel function!
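A minimal sketch of the kernel trick on a harder 1-D example like the slides': the negatives sit between the positives, so no single threshold on x separates them, but a classifier linear in Φ(x) = (x, x²) can (the data and the use of scikit-learn are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([1, 1, -1, -1, -1, 1, 1])       # not linearly separable in 1-D

linear = SVC(kernel="linear", C=10.0).fit(x, y)
# K(xi, xj) = (γ xi·xj + 1)²: its implicit feature map contains x and x².
poly = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(x, y)

print(linear.score(x, y))   # below 1.0: no threshold on x splits the classes
print(poly.score(x, y))     # 1.0: linear in the expanded (x, x²) space
```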
Nearest Neighbor View
- Z: a set of zero-mean, jointly Gaussian random variables; each z_i corresponds to one example x_i
- Cov(z_i, z_j) = K(x_i, x_j)
- y_i, the label of z_i, is +1 or −1
- P(y_i | z_i) = σ(y_i z_i)
Training Data
General Kernel Classifier [Jaakkola et al., 1999]
- MAP classification for x_t: y_t = sign(Σ_i α_i y_i K(x_t, x_i))
- K(x_i, x_j) = Cov(z_i, z_j) (some similarity function)
- Supervised training: compute the α_i given X and y, and an error function such as
  J(α) = −½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) + Σ_i F(α_i)
Leave One Out
SVMs
- y_t = sign(Σ_i α_i y_i K(x_t, x_i))
- (y_i, x_i): training data; α_i nonnegative; kernel K positive definite
- The α_i are obtained by maximizing J(α) = −½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) + Σ_i F(α_i)
- With F(α_i) = α_i, α_i ≥ 0, and Σ_i y_i α_i = 0, this is exactly the SVM dual
SVMs
Important Insight
- K(x_i, x_j) = Cov(z_i, z_j)
- To design a kernel is to design a similarity function that produces a positive definite covariance matrix on the training instances
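A minimal check of this insight: a valid kernel must yield a positive (semi-)definite matrix on the training set. The sketch below tests the RBF kernel on random points by inspecting the eigenvalues of its Gram matrix (data illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-d2 / 2.0)                                 # RBF kernel (Gram) matrix

eigvals = np.linalg.eigvalsh(K)                       # symmetric -> real eigenvalues
print(eigvals.min() >= -1e-10)                        # True: K is positive semi-definite
```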
Basis Function Selection
Three general approaches:
- Restriction methods: limit the class of functions
- Selection methods: scan the dictionary adaptively (boosting)
- Regularization methods: use the entire dictionary but restrict the coefficients (ridge regression)
Overfitting? Probably not, because:
- N free parameters (not D)
- Maximizing the margin
Summary of ML
- Introduction
- Linear Models
- Large D
- D >> N
- Generative vs. Discriminative Models
- Nearest Neighbors
- Non-Linear Models
- Chapters 10, 11, 12: Large N
Reading
- Foundations of Large-Scale Multimedia Information Management and Retrieval, E. Y. Chang, Springer, 2011
  - Chapter #3: Query-Concept Learning
  - Chapter #9: Imbalanced Data Learning