Foundations of Large-Scale Multimedia Information Management and Retrieval. Lecture #3: Machine Learning. Edward Chang


Foundations of Large-Scale Multimedia Information Management and Retrieval. Lecture #3: Machine Learning. Edward Y. Chang

Machine Learning Approaches: Introduction; Linear Models; Large D; D >> N; Generative vs. Discriminative Models; Non-Linear Models

Statistical Learning. Program the computers to learn! Computers improve performance with experience at some task. Example: Task: playing checkers; Performance: % of games it wins; Experience: expert players.

Statistical Learning. Task: Ŷ = f(u), represented by some model(s); implies a hypothesis. Performance: measured by error functions. Experience (L): characterized by training data. Algorithm (Φ).

Supervised Learning. X: data; U: unlabeled pool; L: labeled pool; G: labels. Regression, classification. Φ: learning algorithm; f = Φ(L); Ŷ = f(u).

Learning Algorithms Φ: Linear Model, k-NN, Kernel Methods, Neural Networks, Probabilistic Graphical Models, Decision Trees, etc.

Linear Model

Linear Model. ŷ = w_0 + Σ_{j=1..d} x_j w_j. X is an n × d matrix (d: data dimension; n: number of training instances). ŷ = Xw. L(w, S) = RSS(w) = (y − Xw)^T (y − Xw), where RSS is the residual sum of squares. Setting ∂L(w, S)/∂w = −2 X^T y + 2 X^T X w = 0 gives w = (X^T X)^{-1} X^T y.
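
As an aside, a minimal numpy sketch of this closed-form least-squares fit (the toy data and variable names are illustrative, not from the lecture):

    import numpy as np

    # Illustrative data: n = 100 instances, d = 3 dimensions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=100)

    # RSS(w) = (y - Xw)^T (y - Xw); setting its gradient to zero gives
    # the normal equation w = (X^T X)^{-1} X^T y.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ w) ** 2)
    print(w, rss)   # w should be close to true_w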

Three Challenges. D is too large: curse of dimensionality. D > N: insufficient samples. N is too large: later.

Gene Profiling Example: N = 59 cases, D = 4026 genes.

Subset Selection & Shrinkage. Least squares often suffers from large variance. Shrinkage sets some coefficients to zero. Algorithms: Forward Stepwise Selection, Backward Stepwise Selection, Ridge Regression.

Ridge Regression. w = argmin_w { Σ_{i=1..n} (y_i − w_0 − Σ_{j=1..d} x_ij w_j)^2 + λ Σ_{j=1..d} w_j^2 }. Why would this help? Regularization: remedying an ill-posed model; correlated variables. Data preparation: normalize the input; center the input (removing w_0). w = (X^T X + λI_d)^{-1} X^T y.
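
A brief sketch of this closed-form ridge solution under the slide's data-preparation assumptions (input already normalized and centered so w_0 can be dropped); the function name is illustrative:

    import numpy as np

    def ridge_fit(X, y, lam):
        """Ridge regression: w = (X^T X + lam * I_d)^{-1} X^T y.
        Assumes X has been normalized and centered, so w_0 is omitted.
        Larger lam shrinks the coefficients toward zero."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)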

Ridge Regression (Tikhonov Regularization). min L_λ(w, S) = min λ‖w‖^2 + Σ_i (y_i − f(x_i))^2, as opposed to min (y − Xw)^T (y − Xw). w = (X^T X + λI_d)^{-1} X^T y, as opposed to w = (X^T X)^{-1} X^T y. Dual form: w = X^T α, with α = (G + λI_n)^{-1} y (G: n × n Gram matrix).

Regularization (illustration from Wikipedia)

SVD Interpretation

PCR: Principal Component Regression. SVD: discard the components with the smallest eigenvalues (PCA). Linear multivariate regression becomes a sum of univariate regressions.
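
A rough numpy sketch of PCR as described here (SVD, keep the top-k components, regress on the component scores); the cutoff k and the assumption of centered X are mine:

    import numpy as np

    def pcr_fit(X, y, k):
        """Principal Component Regression on centered X: keep the top-k
        right singular vectors, regress y on the scores, map back to d dims."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Vk = Vt[:k].T                              # d x k principal directions
        Z = X @ Vk                                 # n x k component scores
        gamma = np.linalg.solve(Z.T @ Z, Z.T @ y)  # regression in component space
        return Vk @ gamma                          # d-dimensional weight vector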

Limitations & Treatments. High bias, low variance.

Linear Model

Limitations & Treatments. High bias, low variance. High-dimensional data or overfitting: Ridge, Subset Selection, Lasso, PCR, PLS; in general, regularization.

Generative vs. Discriminative Models. Generative models: model the entire distribution, one class at a time; look for maximum likelihood. Discriminative models: model the class boundaries and ignore the distribution; Support Vector Machines (SVMs); perhaps better for large problems!

Maximum Likelihood View. ŷ = w_0 + Σ_{j=1..d} w_j x_j; ŷ = Xw. y = Xw + ε, where the noise terms ε are independent; y ~ N(ŷ, σ^2). P(y | w, x) has a normal distribution with mean ŷ = wx and variance σ^2.

Derivation. P(y | w, x) ~ N(ŷ, σ^2). Training: given (x_1, y_1), (x_2, y_2), …, (x_n, y_n), infer w from the training data by Bayes' rule or the Maximum Likelihood Estimate.

Maximum Likelihood. For what w is P(y_1, y_2, …, y_n | x_1, x_2, …, x_n, w) maximized? Π_i P(y_i | w, x_i) maximized? Π_i exp(−½((y_i − wx_i)/σ)^2) maximized? Σ_i −½((y_i − wx_i)/σ)^2 maximized? Σ_i (y_i − wx_i)^2 minimized?

Observations. RSS = MAP. What if n < d? Gradient descent (Perceptron): converges only when the instances are linearly separable, otherwise it behaves erratically. Dual formulation.

Dual View (Duality). Primal: w = (X^T X)^{-1} X^T y, with dimensions (d × d)(d × n)(n × 1). Dual (if (X^T X)^{-1} exists): w = X^T X (X^T X)^{-2} X^T y = X^T (X (X^T X)^{-2} X^T y) = X^T α, i.e., w = Σ_{i=1..n} α_i x_i, where α = X (X^T X)^{-2} X^T y = G y; the product (n × d)(d × d)(d × n) gives an n × n Gram-type matrix G. When n < d ((X^T X)^{-1} does not exist): restrict (bias) the choice of functions; regularization.

Ridge Regression (Tikhonov Regularization). min L_λ(w, S) = min λ‖w‖^2 + Σ_i (y_i − f(x_i))^2, as opposed to min (y − Xw)^T (y − Xw). w = (X^T X + λI_d)^{-1} X^T y, as opposed to w = (X^T X)^{-1} X^T y. Dual form: w = X^T α, with α = (G + λI_n)^{-1} y (G: n × n Gram matrix).
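
A small numpy check of this primal/dual equivalence for ridge regression (sizes and λ are arbitrary); the dual solve is an n × n problem, which is the win when n < d:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 50))   # n = 20 instances, d = 50 dimensions (n < d)
    y = rng.normal(size=20)
    lam = 0.1

    # Primal: w = (X^T X + lam I_d)^{-1} X^T y   -- a d x d solve
    w_primal = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T @ y)

    # Dual: alpha = (G + lam I_n)^{-1} y with Gram matrix G = X X^T, then w = X^T alpha
    G = X @ X.T
    alpha = np.linalg.solve(G + lam * np.eye(20), y)
    w_dual = X.T @ alpha

    print(np.allclose(w_primal, w_dual))   # True: identical solutions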

Primal vs. Dual. Training cost: primal O(d^3), dual O(n^3). Classification cost: primal O(d), dual O(nd).

Primal vs. Dual: the dual is the choice. Training cost: primal O(d^3), dual O(n^3). Classification cost: primal O(d), dual O(nd).

Models & Linearity. Generative models: model the entire distribution, one class at a time; look for maximum likelihood. Discriminative models: model the class boundaries and ignore the distribution; Support Vector Machines (SVMs); perhaps better for large problems!

Gaussian Mixture Model (figure from: http://neural.cs.nthu.edu.tw/jang/matlab/toolbox/dcpr/image/gmmtraindemo2dcovtype01a.gif)

Support Vector Machine: Linear

Support Vector Machine: Nonlinear

Decision Tree (figure from: http://upload.wikimedia.org/wikipedia/commons/f/ff/decision_tree_model.png)

Decision Tree Output (figure from: http://prsysdesign.net/prsd/blog/dectree/dectree1.png)

Boosted Decision Tree. Multiple weak classifiers: strength in numbers. Emphasize mistakes: put resources on the hard cases. Provable: strong classifier; converges.

AdaBoost Example
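
A hedged scikit-learn sketch of boosting in this spirit (the dataset and parameter values are placeholders; by default AdaBoostClassifier uses depth-1 decision stumps as its weak learners):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Each round re-weights the training set to emphasize the mistakes,
    # so later weak learners concentrate on the hard cases.
    clf = AdaBoostClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))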

Machine Learning Approaches: Introduction; Linear Models; Large D; D >> N; Generative vs. Discriminative Models; Non-Linear Models

Classical Model. N: number of training instances (N+, N−). D: dimensionality. N >> D; e.g., PAC learnability.

Emerging MM Applications. N < D; N+ << N−. Examples: information retrieval with relevance feedback; surveillance event detection.

IR → A Classification Problem

Apple Search

Relevance Feedback

Fruit

Text-based image search limitations...

VIMA Visual Search

Step #1: Solicit Labels

Step #2: Compute Boundary

Step #3: Identify Useful Samples

Step #4: Solicit More Feedback

Step #5: Refine Boundary

Step #6: Ranking

Observations. Identify good samples. Collect diversified samples. Is a linear model sufficient?

IR → A Classification Problem

Non-Linear Boundary

Separating Hyperplane

Separating Hyperplane

Maximum Margin Hyperplane

Linear Model Fits All Data?

Linear Model Fits All?

How about Joining the Dots? ŷ(x) = (1/k) Σ y_i for x_i ∈ N_k(x); k = 1.

NN with k = 1

Nearest Neighbor. Four things make a NN (memory-based) learner: a distance function; k: the number of neighbors to consider; a weighting function (optional); how to fit with the local points.
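
A minimal sketch that makes those four ingredients explicit (Euclidean distance, k neighbors, uniform weighting, majority vote as the local fit); the names are illustrative:

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=1):
        """1) distance function: Euclidean; 2) k: neighbors considered;
        3) weighting: uniform (the optional step); 4) local fit: majority vote."""
        dists = np.linalg.norm(X_train - x_query, axis=1)
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]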

Problems: Fitting Noise; Jagged Boundaries.

Solutions. Fitting noise: pick a larger k? Jagged boundaries: introduce a kernel as a weighting function.

NN with k = 15

NN

Solutions. Fitting noise: pick a larger k? Jagged boundaries: introduce a kernel as a weighting function.

Nearest Neighbor → Kernel Method. Four things make a memory-based learner: a distance function; k: the number of neighbors to consider? All of them; a weighting function: RBF kernels; how to fit with the local points? Predict weights.

Kernel Method. RBF weighting function: the kernel width holds the key; use cross-validation to find the optimal width. Fitting with the local points: where NN meets the linear model.
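
A sketch of this idea, assuming an RBF weighting over all training points and a leave-one-out search for the width (the candidate widths are arbitrary):

    import numpy as np

    def rbf_predict(X_train, y_train, x_query, width):
        """Weight every training point by exp(-||x_i - x_q||^2 / (2 width^2))
        and return the weighted average of its labels."""
        d2 = np.sum((X_train - x_query) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * width ** 2))
        return np.dot(w, y_train) / np.sum(w)

    def pick_width(X, y, widths=(0.1, 0.3, 1.0, 3.0)):
        """Leave-one-out cross-validation over a small grid of kernel widths."""
        errs = []
        for h in widths:
            err = 0.0
            for i in range(len(X)):
                mask = np.arange(len(X)) != i
                err += (y[i] - rbf_predict(X[mask], y[mask], X[i], h)) ** 2
            errs.append(err)
        return widths[int(np.argmin(errs))]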

LM vs. NN. Linear model: f(x) is approximated by a global linear function; more stable, less flexible. Nearest neighbor: k-NN assumes f(x) is well approximated by a locally constant function; less stable, more flexible. Between LM and NN: the other models.

Where Are We and Where Are We Heading? LM and NN. Kernel methods from three views: the LM view, the NN view, the geometric view.

Linear Model View. y = β_0 + Σ β x. Separating hyperplane: max_{‖β‖=1} C subject to y_i f(x_i) ≥ C, i.e., y_i (β_0 + β·x_i) ≥ C.

Maximum Margin Hyperplane

Classifier Margin. Margin: defined as the width of the boundary before hitting a data object. Maximum margin: tends to minimize classification variance; no formal theory for this yet.

Separating Hyperplane

M's Mathematical Representation. Plus-plane: {x: w·x + b = +1}. Minus-plane: {x: w·x + b = −1}. w ⊥ plus-plane: w·(u − v) = 0 if u and v are on the plus-plane. w ⊥ minus-plane.

Separating Hyperplane

M. Let x⁻ be any point on the minus-plane. Let x⁺ be the closest plus-plane point to x⁻. Then x⁺ = x⁻ + λw (why? the line from x⁺ to x⁻ is ⊥ to the minus-plane). M = ‖x⁺ − x⁻‖.

M. 1. w·x⁻ + b = −1. 2. w·x⁺ + b = 1. 3. x⁺ = x⁻ + λw. 4. M = ‖x⁺ − x⁻‖. 5. w·(x⁻ + λw) + b = 1 (from 2 & 3). 6. w·x⁻ + b + λ w·w = 1. 7. λ w·w = 2.

M. 1. λ w·w = 2. 2. λ = 2/(w·w). 3. M = ‖x⁺ − x⁻‖ = ‖λw‖ = λ‖w‖ = 2/‖w‖. 4. Max M: gradient descent, simulated annealing, EM, Newton's method?

Max M. Maximizing M = 2/‖w‖ is equivalent to minimizing ‖w‖/2, i.e., minimizing ‖w‖²/2, subject to y_i (x_i·w + b) ≥ 1, i = 1,…,N: a quadratic criterion with linear inequality constraints.

Max M. Minimize ‖w‖²/2 subject to y_i (x_i·w + b) ≥ 1, i = 1,…,N. L_p = min_{w,b} ‖w‖²/2 − Σ_{i=1..N} α_i [y_i (x_i·w + b) − 1]. Setting the derivatives to zero: w = Σ_{i=1..N} α_i y_i x_i and 0 = Σ_{i=1..N} α_i y_i.

Wolfe Dual. L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j x_i·x_j, subject to α_i ≥ 0 and α_i [y_i (x_i·w + b) − 1] = 0 (KKT conditions): α_i > 0 ⇒ y_i (x_i·w + b) = 1 (support vectors); α_i = 0 ⇒ y_i (x_i·w + b) > 1.

Class Prediction. y_q = w·x_q + b, with w = Σ_{i=1..N} α_i y_i x_i, so y_q = sign(Σ_{i=1..N} α_i y_i (x_i·x_q) + b).
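
A quick check of this prediction rule with scikit-learn's linear SVC (toy data; a very large C approximates the hard margin): its dual_coef_ stores α_i y_i for the support vectors, so the sign of the dual sum should match predict().

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    clf = SVC(kernel='linear', C=1e6).fit(X, y)   # large C ~ hard margin

    x_q = np.array([0.5, 0.3])
    # y_q = sign( sum_i alpha_i y_i (x_i . x_q) + b )
    score = clf.dual_coef_ @ (clf.support_vectors_ @ x_q) + clf.intercept_
    print(np.sign(score), clf.predict([x_q]))     # the two should agree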

Non-separable Classes: Soft Margin Hyperplane; Basis Expansion.

Non-separable Case

Soft Margin SVMs. Hard margin: min ‖w‖²/2 subject to y_i (x_i·w + b) ≥ 1, i = 1,…,N. Soft margin: min ‖w‖²/2 + C Σ_i ε_i, subject to x_i·w + b ≥ 1 − ε_i if y_i = 1; x_i·w + b ≤ −1 + ε_i if y_i = −1; ε_i ≥ 0.
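
A brief soft-margin illustration with scikit-learn (overlapping toy classes; the C values are arbitrary): small C tolerates larger slacks ε_i and keeps many support vectors, while large C penalizes violations heavily.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 1.5, (50, 2)), rng.normal(1, 1.5, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)            # overlapping, not separable

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel='linear', C=C).fit(X, y)
        print(C, int(clf.n_support_.sum()))       # number of support vectors per C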

Non-separable Case

Wolfe Dual. L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j x_i·x_j, subject to C ≥ α_i ≥ 0 and Σ_i α_i y_i = 0 (KKT conditions). y_q = sign(Σ_{i=1..N} α_i y_i (x_i·x_q) + b).

Basis Function

Harder 1D Example

Basis Function Φ(x) = (x, x²)
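
A tiny illustration of this 1-D expansion with made-up points: no threshold on x separates the classes, but after Φ(x) = (x, x²) a linear rule on the second coordinate does.

    import numpy as np

    x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
    y = np.array([1, 1, -1, -1, -1, 1, 1])       # +1 outside, -1 inside

    phi = np.column_stack([x, x ** 2])           # map to (x, x^2)
    pred = np.where(phi[:, 1] > 2.0, 1, -1)      # linear rule in the new space
    print(np.all(pred == y))                     # True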

Harder 1D Example

Some Basis Functions. Φ(X) = Σ_m γ_m h_m(X), where h_m: R^p → R. Common functions: polynomials, radial basis functions, sigmoid functions.

Wolfe Dual. L_d = Σ_{i=1..N} α_i − ½ Σ_{i,j=1..N} α_i α_j y_i y_j Φ(x_i)·Φ(x_j), subject to C ≥ α_i ≥ 0 and Σ_i α_i y_i = 0 (KKT conditions). y_q = sign(Σ_{i=1..N} α_i y_i (Φ(x_i)·Φ(x_q)) + b). K(x_i, x_j) = Φ(x_i)·Φ(x_j) is the kernel function!
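
A quick numerical check of K(x_i, x_j) = Φ(x_i)·Φ(x_j) for one concrete choice, the degree-2 polynomial kernel on 2-D inputs (this kernel and feature map are an example, not the lecture's specific choice):

    import numpy as np

    def phi(x):
        """Explicit feature map whose inner product equals (x.z + 1)^2 in 2-D."""
        x1, x2 = x
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    def K(x, z):
        return (np.dot(x, z) + 1.0) ** 2

    x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
    print(np.isclose(K(x, z), np.dot(phi(x), phi(z))))   # True: same value, no explicit mapping needed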

Nearest Neighbor View. Z: a set of zero-mean jointly Gaussian random variables; each z_i corresponds to one example x_i. Cov(z_i, z_j) = K(x_i, x_j). y_i, the label of z_i, is +1 or −1. P(y_i | z_i) = σ(y_i z_i).

Training Data

General Kernel Classifier [Jaakkola et al. 99]. MAP classification for x_t: y_t = sign(Σ_i α_i y_i K(x_t, x_i)). K(x_i, x_j) = Cov(z_i, z_j) (some similarity function). Supervised training: compute α_i given X and y and an error function such as J(α) = −½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) + Σ_i F(α_i).

Leave One Out

SVMs. y_t = sign(Σ_i α_i y_i K(x_t, x_i)), with (y_i, x_i) the training data, α_i nonnegative, and the kernel K positive definite. α_i is obtained by maximizing J(α) = −½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) + Σ_i F(α_i), with F(α_i) = α_i, α_i ≥ 0, Σ_i y_i α_i = 0.

SVMs

Important Insight. K(x_i, x_j) = Cov(z_i, z_j). To design a kernel is to design a similarity function that produces a positive definite covariance matrix on the training instances.
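
A small numpy check of that requirement, using an RBF similarity as the example kernel: its Gram matrix on any set of training instances should be positive (semi-)definite, i.e., have no negative eigenvalues.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 5))                  # 30 illustrative training instances

    # RBF Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), sigma = 1
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    K = np.exp(-d2 / 2.0)

    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-10)                # True up to numerical error: K is PSD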

Basis Function Selection. Three general approaches. Restriction methods: limit the class of functions. Selection methods: scan the dictionary adaptively (Boosting). Regularization methods: use the entire dictionary but restrict the coefficients (Ridge Regression).

Overfitting? Probably not, because: N free parameters (not D); maximizing the margin.

Summary of ML: Introduction; Linear Models; Large D; D >> N; Generative vs. Discriminative Models; Nearest Neighbors; Non-Linear Models. Chapters 10, 11, 12: Large N.

Reading: Foundations of Large-Scale Multimedia Information Management and Retrieval, E. Y. Chang, Springer, 2011. Chapter #3: Query-Concept Learning. Chapter #9: Imbalanced Data Learning.