Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval. Lecture #3 Machine Learning. Edward Chang

Size: px

Start display at page:

Download "Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval. Lecture #3 Machine Learning. Edward Chang"

Eugenia Oliver
5 years ago
Views:

1 Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval Lecture #3 Machine Learning Edward Y. Chang Edward Chang Founda'ons of LSMM 1

2 Edward Chang Foundations of LSMM 2

3 Machine Learning Approaches Introduc'on Linear Models Large D D >> N Genera've vs. Discrimina've Models Non- Linear Models 6/5/12 Ed Founda'ons of MM 3

4 Sta's'cal Learning Program the computers to learn! Computers improve performance with experience at some task Example: Task: playing checkers Performance: % games it wins Experience: expert players 6/5/12 Ed Founda'ons of MM 4

5 Sta's'cal Learning Task Ŷ = f(u) Represented by some model(s) Implies hypothesis Performance Measured by error func'ons Experience (L) Characterized by training data Algorithm (Φ) 6/5/12 Ed Founda'ons of MM 5

6 Supervised Learning X: Data U: Unlabeled pool L: Labeled pool G: Labels Regression Classifica'on Φ: Learning algorithm f = Φ(L) Ŷ = f(u) 6/5/12 Ed Founda'ons of MM 6

7 Learning Algorithms Φ Linear Model K- NN Kernel Methods Neural Networks Probabilis'c Graphic Models Decision Trees Etc. Ed Founda'ons of MM 7 6/5/12

8 Linear Model Ed Founda'ons of MM 8 6/5/12

9 Linear Model ŷ = w0 + Σ Xj wj (j = 1 to d) X is an n d matrix d: data dimension n: number of training instances ŷ = Xw L(w, S) = RSS(w) = (y Xw) T (y Xw) RSS: Residual Sum of Square L(w, S)/ w = - 2 X T y + 2X T Xw = 0 w = (X T X) - 1 X T y Ed Founda'ons of MM 9 6/5/12

10 Three Challenges D is too large Curse of dimensionality D > N Insufficient samples N is too large Later 6/5/12 Ed Founda'ons of MM 10

11 Gene Profiling Example N = 59 cases, D = 4026 genes 6/5/12 Ed Founda'ons of MM 11

12 Subset Selec'on & Shrinkage Least Square ooen suffers from large variance Shrinkage sets some coefficients to zero Algorithms Forward Stepwise Selec'on Backward Stepwise Selec'on Ridge Regression Ed Founda'ons of MM 12 6/5/12

13 Ridge Regression w = argmin w {Σ n (y i - w 0 - Σ d x ij w j ) 2 + λ Σ d w j 2 } Why would this help? Regulariza'on: remedying an ill- posed model Correlated variables Data prepara'on Normalize input Centralize input (removing w 0 ) w = (X T X+ λi d ) - 1 X T y Ed Founda'ons of MM 13 6/5/12

14 Ridge Regression Tikhonov Regulariza'on min L λ (w S) = min λ w 2 + (y i f (x i )) 2 As oppose to min (y Xw) T (y Xw) w = (X T X+ λi d ) - 1 X T y As oppose to w = (X T X) - 1 X T y w = X T α α = (G + λi n ) - 1 y (G: n n Gram matrix) Ed Founda'ons of MM 14 6/5/12

15 Regulariza'on wikipedia 6/5/12 Ed Founda'ons of MM 15

16 SVD Interpreta'on Ed Founda'ons of MM 16 6/5/12

17 PCR: Principal Component Regression SVD Discard components with smallest Eigen coefficients PCA Linear Mul'variate Regression Sum of univariate regressions Ed Founda'ons of MM 17 6/5/12

18 Limita'ons & Treatments High bias Low variance Ed Founda'ons of MM 18 6/5/12

19 Linear Model Ed Founda'ons of MM 19 6/5/12

20 Limita'ons & Treatments High bias Low variance High- dimensional or overfiyng Ridge, Subset, Lasso PCR, PLS In general, Regulariza'on Ed Founda'ons of MM 20 6/5/12

21 Genera've vs. Discrimina've Models Genera've Models Model en're distribu'on One class at a 'me Look for maximum likelihood Discrimina've Models Model class boundaries Ignore distribu'on Support Vector Machines (SVMs) Perhaps be{er for large problems! 6/5/12 Ed Founda'ons of MM 21

22 Maximum Likelihood View ŷ = w0 + Σ wj Xj (j = 1 to d) ŷ = Xw ŷ = Xw + ε ε (noise signals) are independent ε N (ŷ, 2 ) P(ŷ wx) has a normal dist. with Mean at ŷ = wx Variance 2 Ed Founda'ons of MM 22 6/5/12

23 Deriva'on P(ŷ w x) N (ŷ, 2 ) Training Given (x 1,y 1 ) (x 2,y 2 ) (x n,y n ) Infer w by training data By Bayes rule, or Maximum Likelihood Es'mate Ed Founda'ons of MM 23 6/5/12

24 Maximum Likelihood For what w is P(y 1, y 2, y n x 1, x 2, x n, w) maximized? Π P(y i wx i ) maximized? Π exp(- ½(y i - wx i / ) 2 ) maximized? Σ (- ½(y i - wx i / ) 2 maximized? Σ (y i - wx i ) 2 minimized? Ed Founda'ons of MM 24 6/5/12

25 Observa'ons RSS = MAP What if n < d? Gradient Decent (Perceptron) Converges only when instances are linearly separable, or behaves erra'cally Dual Formula'on Ed Founda'ons of MM 25 6/5/12

26 Dual View (Duality) Primal w = (X T X) - 1 X T y (d d) (d n) (n 1) Dual (if (X T X) - 1 exists) w = X T X(X T X) - 2 X T y = X T (X(X T X) - 2 X T y) = X T α w = α i x i, i = 1 to n α = X(X T X) - 2 X T y = G y G: (n d) (d d) (d n) à n n Gram matrix When n < d ((X T X) - 1 does not exists) Restrict (bias) the choice of func'ons Regulariza'on Ed Founda'ons of MM 26 6/5/12

27 Ridge Regression min L λ (w S) = min λ w 2 + (y i f (x i )) 2 As oppose to min (y Xw) T (y Xw) w = (X T X+ λi d ) - 1 X T y As oppose to w = (X T X) - 1 X T y w = X T α α = (G + λi n ) - 1 y (G: n n Gram matrix) Ed Founda'ons of MM 27 6/5/12

28 Primal vs. Dual Primal Dual Training Cost O(d 3 ) O(n 3 ) Classifica'on O(d) O(nd) 6/5/12 Ed Founda'ons of MM 28

29 Primal vs. Dual Primal Dual is the choice Training Cost O(d 3 ) O(n 3 ) Classifica'on O(d) O(nd) 6/5/12 Ed Founda'ons of MM 29

30 Models & Linearity Genera've Models Model en're distribu'on One class at a 'me Look for maximum likelihood Discrimina've Models Model class boundaries Ignore distribu'on Support Vector Machines (SVMs) Perhaps be{er for large problems! 6/5/12 Ed Founda'ons of MM 30

Gaussian Mixture Model 6/5/12 Ed Chang @ Founda'ons of MM 31 From:

31 Gaussian Mixture Model 6/5/12 Ed Founda'ons of MM 31 From: h{p://neural.cs.nthu.edu.tw/jang/matlab/toolbox/dcpr/image/gmmtraindemo2dcovtype01a.gif

32 Support Vector Machine - Linear 6/5/12 Ed Founda'ons of MM 32

33 Support Vector Machine Nonlinear 6/5/12 Ed Founda'ons of MM 33

34 Decision Tree 6/5/12 Ed Founda'ons of MM 34 From: h{p://upload.wikimedia.org/wikipedia/commons/f/ff/decision_tree_model.png

Decision Tree Output 6/5/12 Ed Chang @ Founda'ons of MM

35 Decision Tree Output 6/5/12 Ed Founda'ons of MM 35 h{p://prsysdesign.net/prsd/blog/dectree/dectree1.png

36 Boosted Decision Tree Mul'ple weak classifiers Strength in numbers Emphasize mistakes Put resources at hard cases Provable Strong classifier Converges 6/5/12 Ed Founda'ons of MM 36

37 AdaBoost Example 6/5/12 Ed Founda'ons of MM 37

38 Machine Learning Approaches Introduc'on Linear Models Large D D >> N Genera've vs. Discrimina've Models Non- Linear Models 6/5/12 Ed Founda'ons of MM 38

39 Classical Model N: Number of training instances N +, N - D: Dimensionality N >> D N E.g., PAC learnability N - N + 6/5/12 Ed Founda'ons of MM 39

40 Emerging MM Applica'ons N < D N + << N - Examples Informa'on Retrieval with relevance feedback Surveillance event detec'on 6/5/12 Ed Founda'ons of MM 40

41 IR à A Classifica'on Problem Ed Foundations of MM 41 6/5/12

42 Apple Search 6/5/12 Ed Foundations of MM 42

43 Relevance Feedback 6/5/12 Ed Foundations of MM 43

44 Fruit 6/5/12 Ed Foundations of MM 44

45 Text- based image search limita4ons... Ed Foundations of MM 45 6/5/12

46 VIMA Visual Search Ed Foundations of MM 46 6/5/12

47 Step #1: Solicit Labels Ed Foundations of MM 47 6/5/12

48 Step #2: Compute Boundary Ed Foundations of MM 48 6/5/12

49 Step #3: Iden'fy Useful Samples Ed Foundations of MM 49 6/5/12

50 Step #4: Solicit More Feedback Ed Foundations of MM 50 6/5/12

51 Step #5: Refine Boundary Ed Foundations of MM 51 6/5/12

52 Step #6: Ranking Ed Foundations of MM 52 6/5/12

53 Observa'ons Iden'fy good samples Collect diversified samples Is linear model sufficient? Ed Foundations of MM 53 6/5/12

54 IR à A Classifica'on Problem Ed Foundations of MM 54 6/5/12

55 Non- Linear Boundary Ed Foundations of MM 55 6/5/12

56 Separa'ng Hyperplane 6/5/12 Ed Founda'ons of MM 56

57 Separa'ng Hyperplane 6/5/12 Ed Founda'ons of MM 57

58 Maximum Margin Hyperplane 6/5/12 Ed Founda'ons of MM 58

59 Linear Model Fits All Data? 6/5/12 Ed Founda'ons of MM 59

60 Linear Model Fits All? 6/5/12 Ed Founda'ons of MM 60

61 How about Joining the Dots? ŷ(x) = 1/k Σ yi, xi Nk(x) K = 1 6/5/12 Ed Founda'ons of MM 61

62 NN with k = 1 6/5/12 Ed Founda'ons of MM 62

63 Nearest Neighbor Four Things Make a NN Memory- Based Learner A distance func'on K: number of neighbors to consider? A weighted func'on (op'onal) How to fit with the local points? 6/5/12 Ed Founda'ons of MM 63

64 Problems Fiyng Noise Jagged Boundaries 6/5/12 Ed Founda'ons of MM 64

65 Solu'ons Fiyng Noise Pick a Larger K? Jagged Boundaries Introducing Kernel as a weigh'ng func'on 6/5/12 Ed Founda'ons of MM 65

66 NN with k = 15 6/5/12 Ed Founda'ons of MM 66

67 NN 6/5/12 Ed Founda'ons of MM 67

68 Solu'ons Fiyng Noise Pick a larger K? Jagged Boundaries Introducing Kernel as a weigh'ng func'on 6/5/12 Ed Founda'ons of MM 68

69 Nearest Neighbor - > Kernel Method Four Things Make a Memory Based Learner A distance func'on K: number of neighbors to consider? All A weighted func'on: RBF kernels How to fit with the local points? Predict weights 6/5/12 Ed Founda'ons of MM 69

70 Kernel Method RBF Weighted Func'on Kernel width holds the key Use cross valida'on to find the op'mal width Fiyng with the Local Points Where NN meets Linear Model 6/5/12 Ed Founda'ons of MM 70

71 LM vs. NN Linear Model f(x) is approximated by a global linear func'on More stable, less flexible Nearest Neighbor K- NN assumes f(x) is well approximated by a locally constant func'on Less stable, more flexible Between LM and NN The other models 6/5/12 Ed Founda'ons of MM 71

72 Where Are We and Where Am I Heading To? LM and NN Kernel Method of Three Views LM view NN view Geometric view 6/5/12 Ed Founda'ons of MM 72

73 Linear Model View Y = β0 + Σ β X Separa'ng Hyperplane Max β =1 C Subject to yi f(xi) C, or yi (β0 +β xi) C 6/5/12 Ed Founda'ons of MM 73

74 Maximum Margin Hyperplane 6/5/12 Ed Founda'ons of MM 74

75 Classifier Margin Margin Defined as with of the boundary before hiyng a data object Maximum Margin Tends to minimize classifica'on variance No formal theory for this yet 6/5/12 Ed Founda'ons of MM 75

76 Separa'ng Hyperplane 6/5/12 Ed Founda'ons of MM 76

77 M s Mathema'cal Representa'on Plus- plane {x: wx+b = +1} Minus- plane {x: wx+b = - 1} w Plus- plane w(u v) = 0, if u and v on plus- plane w Minus- plane 6/5/12 Ed Founda'ons of MM 77

78 Separa'ng Hyperplane 6/5/12 Ed Founda'ons of MM 78

79 M Let x - be any point on minus- plane Let x + be the closest plus- plane- point to x - x + = x - + λw, why The line (x + x - ) minus- plane M = x + - x - 6/5/12 Ed Founda'ons of MM 79

80 M 1. wx - + b = wx + + b = 1 3. x + = x - + λw 4. M = x + - x - 5. w(x - + λw) + b = 1 (from 2 & 3) 6. wx - + b + λww = 1 7. λww = 2 6/5/12 Ed Founda'ons of MM 80

81 M 1. λww = 2 2. λ = 2/ww 3. M = x + - x - = λw = λ w = 2/ w 4. Max M Gradient decent, simulated annealing, EM, Newton s method? 6/5/12 Ed Founda'ons of MM 81

82 Max M Max M = 2/ w Min w /2 Min w 2 /2 subject to y i (x i w+b) 1 i = 1,,N Quadra'c criterion with linear inequality constraints 6/5/12 Ed Founda'ons of MM 82

83 Max M Min w 2 /2 subject to y i (x i w+b) 1 i = 1,,N Lp = min w,b w 2 /2 + Σ i=1..n α i [y i (x i w+b)- 1] w = Σ i=1..n α i y i x i 0 = Σ i=1..n α i y i 6/5/12 Ed Founda'ons of MM 83

84 Wolfe Dual Ld = Σ i=1..n α - 1/2 ΣΣ i,j=1..n α i α j y i y j x i x j Subject to α i 0 α i [y i (x i w+b)- 1] = 0 KKT condi'ons α i > 0, y i (x i w+b) = 1 (Support Vectors) α i = 0, y i (x i w+b) > 1 6/5/12 Ed Founda'ons of MM 84

85 Class Predic'on yq = w xq + b w = Σ i=1..n α i y i x i yq = sign(σ i=1..n α i y i (x i X q ) + b) 6/5/12 Ed Founda'ons of MM 85

86 Non- seperatable Classes Soo Margin Hyperplane Basis Expansion 6/5/12 Ed Founda'ons of MM 86

87 Non- separa'ng Case 6/5/12 Ed Founda'ons of MM 87

88 Soo Margin SVMs Min w 2 /2 subject to y i (x i w+b) 1 i = 1,,N Min w 2 /2 + C ε i x i w+b 1 - ε i if y i = 1 x i w+b ε i if y i = - 1 ε i 0 6/5/12 Ed Founda'ons of MM 88

89 Non- separa'ng Case 6/5/12 Ed Founda'ons of MM 89

90 Wolfe Dual Ld = Σ i=1..n α - 1/2 ΣΣ i,j=1..n α i α j y i y j x i x j Subject to C α i 0 Σ α i y i = 0 KKT condi'ons yq = sign (Σ i=1..n α i y i (x i X q ) + b) 6/5/12 Ed Founda'ons of MM 90

91 Basis Func'on 6/5/12 Ed Founda'ons of MM 91

92 Harder 1D Example 6/5/12 Ed Founda'ons of MM 92

93 Basis Func'on Φ(X) = (x, x 2 ) 6/5/12 Ed Founda'ons of MM 93

94 Harder 1D Example 6/5/12 Ed Founda'ons of MM 94

95 Some Basis Func'ons Φ(X) = Σ γ m h m (X) h m (X) R p R Common Func'ons Polynomial Radial basis func'ons Sigmoid func'ons 6/5/12 Ed Founda'ons of MM 95

96 Wolfe Dual Ld = Σ i=1..n α - 1/2 ΣΣ i,j=1..n α i α j y i y j Φ(x i )Φ (x j ) Subject to C α i 0 Σ α i y i = 0 KKT condi'ons yq = sign (Σ i=1..n α i y i (Φ(x i ) Φ(X q )) + b) K(x i, x j ) = Φ(x i ) Φ(Xj) Kernel func'on! 6/5/12 Ed Founda'ons of MM 96

97 Nearest Neighbor View Z, a set of zero mean jointly Gaussian random variables, Each Zi corresponds to one example Xi Cov (zi, zj) = K(xi, xj) y i, the lable of zi, +1 or - 1 P(yi zi) = σ(yi,zi) 6/5/12 Ed Founda'ons of MM 97

98 Training Data 6/5/12 Ed Founda'ons of MM 98

99 General Kernel Classifier [Jaakkola, etc. 99] MAP Classifica'on for x t y t = sign (Σ αi yi K(x t,x i )) K(xi, xj) = Cov (zi, zj) (some similarity func'on) Supervised Training: Compute αi Given X and y, and An error func'on such as J(α) = - ½ Σ αi αj yi yj K(xi,xj) + Σ F(αi) 6/5/12 Ed Founda'ons of MM 99

100 Leave One Out 6/5/12 Ed Founda'ons of MM 100

101 SVMs yt = sign (Σ αi yi K(xt,xi)) (yi xi) training data, αi nonnega've, and kernel K posi've definite αi is obtained by minimizing J(α) = - ½ Σ αi αj yi yj K(xi,xj) + Σ F(αi) F(αi) = αi αi 0, Σyiαi = 0 6/5/12 Ed Founda'ons of MM 101

102 SVMs 6/5/12 Ed Founda'ons of MM 102

103 Important Insight K(xi, xj) = Cov (zi, zj) To design of a kernel is to design a similarity func'on that produces a posi've definite covariance matrix on the training instances 6/5/12 Ed Founda'ons of MM 103

104 Basis Func'on Selec'on Three General Approaches Restric'on methods Limit the class of func'ons Selec'on methods Scan the dic'onary adap'vely (Boos'ng) Regulariza'on methods Use the en're dic'onary but restrict coefficients (Ridge Regression) 6/5/12 Ed Founda'ons of MM 104

105 Overfiyng? Probably Not Because N free parameters (not D) Maximizing margin 6/5/12 Ed Founda'ons of MM 105

106 Summary of ML Introduc'on Linear Models Large D D >> N Genera've vs. Discrimina've Models Nearest Neighbors Non- Linear Models Chapters 10, 11, 12: Large N 6/5/12 Ed Founda'ons of MM 106

107 Reading Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval, E. Y. Chang, Springer, 2011 Chapter #3 Query- Concept Learning Chapter #9 Imbalanced Data Learning Edward Chang Founda'ons of LSMM 107

COMP 562: Introduction to Machine Learning

COMP 562: Introduction to Machine Learning Lecture 20 : Support Vector Machines, Kernels Mahmoud Mostapha 1 Department of Computer Science University of North Carolina at Chapel Hill mahmoudm@cs.unc.edu