Brief Introduction to Machine Learning

Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49

Outline
1 Introduction
2 Binary Classification
3 Support Vector Machine
4 Nonlinear Support Vector Machine
5 Instance-based Learning
2 / 49

What is Machine Learning? Representation + Optimization + Evaluation. Pedro Domingos, "A Few Useful Things to Know about Machine Learning," Communications of the ACM, Vol. 55, Issue 10, pp. 78-87, October 2012. 3 / 49

AlphaGo 4 / 49

Mayhem Wins DARPA Cyber Grand Challenge 5 / 49

The Master Algorithm 6 / 49

The Master Algorithm 7 / 49

The Plan of My Lecture
Focus only on Supervised Learning. I will give you four basic algorithms:
Online Perceptron Algorithm
Support Vector Machines
k-Nearest Neighbor
Naive Bayes Classifier
plus the basic concepts of learning theory and evaluation.
8 / 49

Binary Classification Problem (A Fundamental Problem in Data Mining)
Find a decision function (classifier) that discriminates between two categories of data.
Supervised learning in Machine Learning: Decision Tree, Deep Neural Network, k-NN, Support Vector Machines, etc.
Discriminant Analysis in Statistics: Fisher Linear Discriminant.
Successful applications: marketing, bioinformatics, fraud detection.
9 / 49

Binary Classification Problem
Given a training dataset $S = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n,\ y_i \in \{-1, 1\},\ i = 1, \ldots, l\}$, with $x_i \in A_+ \Leftrightarrow y_i = 1$ and $x_i \in A_- \Leftrightarrow y_i = -1$.
Main goal: predict the unseen class label for new data.
Find a function $f : \mathbb{R}^n \to \mathbb{R}$ by learning from the data such that $f(x) \geq 0 \Rightarrow x \in A_+$ and $f(x) < 0 \Rightarrow x \in A_-$.
The simplest function is linear: $f(x) = w^\top x + b$.
10 / 49

Binary Classification Problem: Linearly Separable Case
[Figure: two classes, A+ (Benign) and A- (Malignant), separated by the hyperplane $x^\top w + b = 0$ with bounding planes $x^\top w + b = +1$ and $x^\top w + b = -1$.]
11 / 49

People of ACM: David Blei (Sept. 9, 2014)
The recipient of the 2013 ACM-Infosys Foundation Award in the Computing Sciences, he is joining Columbia University this fall as a Professor of Statistics and Computer Science, and will become a member of Columbia's Institute for Data Sciences and Engineering.
12 / 49

What is the most important recent innovation in machine learning? [A]: One of the main recent innovations in ML research has been that we (the ML community) can now scale up our algorithms to massive data, and I think that this has fueled the modern renaissance of ML ideas in industry. The main idea is called stochastic optimization, which is an adaptation of an old algorithm invented by statisticians in the 1950s. 13 / 49

What is the most important recent innovation in machine learning? [A]: In short, many machine learning problems can be boiled down to trying to find parameters that maximize (or minimize) a function. A common way to do this is gradient ascent, iteratively following the steepest direction to climb a function to its top. This technique requires repeatedly calculating the steepest direction, and the problem is that this calculation can be expensive. Stochastic optimization lets us use cheaper approximate calculations. It has transformed modern machine learning. 14 / 49

Linear Learning Machines
The simplest function is linear: $f(x) = w^\top x + b$.
Find this simplest function via an on-line, mistake-driven procedure: update the weight vector and the bias whenever a point is misclassified.
15 / 49

Perceptron Algorithm (Primal Form), Rosenblatt, 1956
Given a training dataset $S$, an initial weight vector $w_0 = 0$ and bias $b_0 = 0$, and $R = \max_{1 \leq i \leq l} \|x_i\|$:
Repeat:
  for $i = 1$ to $l$
    if $y_i(\langle w_k, x_i \rangle + b_k) \leq 0$ then
      $w_{k+1} \leftarrow w_k + \eta y_i x_i$
      $b_{k+1} \leftarrow b_k + \eta y_i R^2$
      $k \leftarrow k + 1$
    end if
  end for
Until no mistakes are made within the for loop.
Return: $k$, $(w_k, b_k)$. What is $k$?
16 / 49
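A minimal NumPy sketch of the primal perceptron above (an illustration, not part of the original slides), assuming the data come as an array X of shape (l, n) with labels y in {-1, +1}; the learning rate eta and the $R^2$ bias update follow the slide.

```python
import numpy as np

def perceptron_primal(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron: update (w, b) whenever a point is misclassified."""
    l, n = X.shape
    w, b, k = np.zeros(n), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))      # R = max_i ||x_i||
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(l):
            if y[i] * (w @ X[i] + b) <= 0:     # mistake (or point on the boundary)
                w += eta * y[i] * X[i]
                b += eta * y[i] * R ** 2
                k += 1
                mistakes += 1
        if mistakes == 0:                      # a full clean pass: done
            break
    return w, b, k                             # k = number of updates made
```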

Does $y_i(\langle w_{k+1}, x_i\rangle + b_{k+1}) > y_i(\langle w_k, x_i\rangle + b_k)$?
With $w_{k+1} \leftarrow w_k + \eta y_i x_i$ and $b_{k+1} \leftarrow b_k + \eta y_i R^2$, where $R = \max_{1 \leq i \leq l} \|x_i\|$:
$y_i(\langle w_{k+1}, x_i\rangle + b_{k+1}) = y_i(\langle w_k + \eta y_i x_i, x_i\rangle + b_k + \eta y_i R^2)$
$= y_i(\langle w_k, x_i\rangle + b_k) + y_i \eta y_i (\langle x_i, x_i\rangle + R^2)$
$= y_i(\langle w_k, x_i\rangle + b_k) + \eta(\langle x_i, x_i\rangle + R^2)$
17 / 49

Perceptron Algorithm Stops in Finitely Many Steps
Theorem (Novikoff): Let $S$ be a non-trivial training set and let $R = \max_{1 \leq i \leq l} \|x_i\|$. Suppose there exists a vector $w_{opt}$ such that $\|w_{opt}\| = 1$ and $y_i(\langle w_{opt}, x_i\rangle + b_{opt}) \geq \gamma$ for $1 \leq i \leq l$. Then the number of mistakes made by the on-line perceptron algorithm on $S$ is at most $(\frac{2R}{\gamma})^2$.
19 / 49

Perceptron Algorithm (Dual Form)
$w = \sum_{i=1}^{l} \alpha_i y_i x_i$
Given a linearly separable training set $S$, set $\alpha = 0 \in \mathbb{R}^l$, $b = 0$, $R = \max_{1 \leq i \leq l} \|x_i\|$.
Repeat:
  for $i = 1$ to $l$
    if $y_i\big(\sum_{j=1}^{l} \alpha_j y_j \langle x_j, x_i\rangle + b\big) \leq 0$ then
      $\alpha_i \leftarrow \alpha_i + 1$; $b \leftarrow b + y_i R^2$
    end if
  end for
Until no mistakes are made within the for loop.
Return: $(\alpha, b)$
20 / 49
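A matching NumPy sketch of the dual form (again an illustration, with X and y as above); note the data enter only through the Gram matrix $G_{ij} = \langle x_i, x_j\rangle$, as the next slide emphasizes.

```python
import numpy as np

def perceptron_dual(X, y, max_epochs=100):
    """Dual perceptron: count updates alpha_i per point instead of storing w."""
    l = X.shape[0]
    alpha, b = np.zeros(l), 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    G = X @ X.T                                        # Gram matrix G_ij = <x_i, x_j>
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(l):
            if y[i] * ((alpha * y) @ G[:, i] + b) <= 0:
                alpha[i] += 1                          # one more update for point i
                b += y[i] * R ** 2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b                                    # w = X.T @ (alpha * y) if needed
```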

What Do We Get from the Dual Form of the Perceptron Algorithm?
The number of updates equals $\sum_{i=1}^{l} \alpha_i = \|\alpha\|_1 \leq (\frac{2R}{\gamma})^2$.
$\alpha_i > 0$ implies that the training point $(x_i, y_i)$ has been misclassified at least once during training.
$\alpha_i = 0$ implies that removing the training point $(x_i, y_i)$ will not affect the final result.
The training data appear in the algorithm only through the entries of the Gram matrix $G \in \mathbb{R}^{l \times l}$, defined by $G_{ij} = \langle x_i, x_j\rangle$.
21 / 49

Outline
1 Introduction
2 Binary Classification
3 Support Vector Machine
4 Nonlinear Support Vector Machine
5 Instance-based Learning
22 / 49

Binary Classification Problem: Linearly Separable Case
[Figure: two classes, A+ (Benign) and A- (Malignant), separated by the hyperplane $x^\top w + b = 0$ with bounding planes $x^\top w + b = +1$ and $x^\top w + b = -1$.]
23 / 49

Support Vector Machines: Maximizing the Margin between the Bounding Planes
[Figure: bounding planes $x^\top w + b = +1$ and $x^\top w + b = -1$ for classes A+ and A-; the distance between them, $\frac{2}{\|w\|_2}$, is the margin.]
24 / 49

Why Use Support Vector Machines?
Powerful tools for Data Mining.
The SVM classifier is an optimally defined surface.
SVMs have a good geometric interpretation.
SVMs can be generated very efficiently.
Can be extended from the linear to the nonlinear case: typically nonlinear in the input space, linear in a higher dimensional feature space, implicitly defined by a kernel function.
Have a sound theoretical foundation, based on Statistical Learning Theory.
25 / 49

Why Do We Maximize the Margin? (Based on Statistical Learning Theory)
Structural Risk Minimization (SRM): the expected risk is less than or equal to the empirical risk (training error) plus a VC bound, and $\|w\|_2$ controls the VC bound.
Minimizing the VC bound $\Leftrightarrow$ minimizing $\frac{1}{2}\|w\|_2^2$ $\Leftrightarrow$ maximizing the margin.
26 / 49

Summary of the Notation
Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$ be a training dataset, represented by the matrices
$A = \begin{bmatrix} (x_1)^\top \\ (x_2)^\top \\ \vdots \\ (x_l)^\top \end{bmatrix} \in \mathbb{R}^{l \times n}, \qquad D = \begin{bmatrix} y_1 & & 0 \\ & \ddots & \\ 0 & & y_l \end{bmatrix} \in \mathbb{R}^{l \times l}$
$A_i w + b \geq +1$ for $D_{ii} = +1$ and $A_i w + b \leq -1$ for $D_{ii} = -1$, which is equivalent to $D(Aw + \mathbf{1}b) \geq \mathbf{1}$, where $\mathbf{1} = [1, 1, \ldots, 1]^\top \in \mathbb{R}^l$.
27 / 49

Support Vector Classification (Linearly Separable Case, Primal)
The hyperplane $(w, b)$ is determined by solving the minimization problem:
$\min_{(w,b) \in \mathbb{R}^{n+1}} \frac{1}{2}\|w\|_2^2$ subject to $D(Aw + \mathbf{1}b) \geq \mathbf{1}$.
It realizes the maximal margin hyperplane with geometric margin $\gamma = \frac{1}{\|w\|_2}$.
28 / 49

Support Vector Classification (Linearly Separable Case, Dual Form)
The dual problem of the previous mathematical program:
$\max_{\alpha \in \mathbb{R}^l} \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top DAA^\top D\alpha$ subject to $\mathbf{1}^\top D\alpha = 0$, $\alpha \geq 0$.
Applying the KKT optimality conditions, we have $w = A^\top D\alpha$. But where is $b$? Don't forget the complementarity condition $0 \leq \alpha \perp D(Aw + \mathbf{1}b) - \mathbf{1} \geq 0$.
29 / 49
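A small sketch (not from the slides) of solving this dual numerically with SciPy's SLSQP solver and then recovering $w = A^\top D\alpha$ and $b$ from the active constraints; it assumes a small, linearly separable dataset whose points are the rows of A, with labels y in {-1, +1}.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_hard_margin(A, y):
    """Hard-margin dual: max 1'a - 0.5 a' D A A' D a  s.t.  y'a = 0, a >= 0."""
    y = np.asarray(y, dtype=float)
    l = A.shape[0]
    Q = (y[:, None] * A) @ (y[:, None] * A).T          # Q_ij = y_i y_j <x_i, x_j>
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()          # negate to minimize
    grad = lambda a: Q @ a - np.ones(l)
    cons = {'type': 'eq', 'fun': lambda a: y @ a, 'jac': lambda a: y}
    res = minimize(obj, np.zeros(l), jac=grad, bounds=[(0, None)] * l,
                   constraints=[cons], method='SLSQP')
    alpha = res.x
    w = A.T @ (alpha * y)                              # w = A' D alpha
    sv = alpha > 1e-6                                  # support vectors: alpha_i > 0
    b = np.mean(y[sv] - A[sv] @ w)                     # from y_i (w'x_i + b) = 1
    return w, b, alpha
```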

Dual Representation of SVM
(Key of Kernel Methods: $w^* = A^\top D\alpha^* = \sum_{i=1}^{l} y_i \alpha_i^* A_i^\top$)
The hypothesis is determined by $(\alpha^*, b^*)$. Remember: $A_i^\top = x_i$.
$h(x) = \operatorname{sgn}(\langle x, A^\top D\alpha^*\rangle + b^*) = \operatorname{sgn}\big(\sum_{i=1}^{l} y_i \alpha_i^* \langle x_i, x\rangle + b^*\big) = \operatorname{sgn}\big(\sum_{\alpha_i^* > 0} y_i \alpha_i^* \langle x_i, x\rangle + b^*\big)$
30 / 49

Soft Margin SVM (Nonseparable Case)
If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above.
Introduce a slack variable $\xi_i$ for each training point: $y_i(w^\top x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$, for all $i$.
The inequality system is then always feasible, e.g. $w = 0$, $b = 0$, $\xi = \mathbf{1}$.
31 / 49

[Figure: the nonseparable case; bounding planes $x^\top w + b = +1$ and $x^\top w + b = -1$ for classes A+ and A-, with slack variables $\xi_i$, $\xi_j$ measuring the violations; the margin is $\frac{2}{\|w\|_2}$.]
32 / 49

Robust Linear Programming (Preliminary Approach to SVM)
$\min_{w,b,\xi} \mathbf{1}^\top\xi$ s.t. $D(Aw + \mathbf{1}b) + \xi \geq \mathbf{1}$, $\xi \geq 0$ (LP)
where $\xi$ is a nonnegative slack (error) vector.
The term $\mathbf{1}^\top\xi$, the 1-norm measure of the error vector, is called the training error.
For the linearly separable case, at the solution of (LP): $\xi = 0$.
33 / 49

Support Vector Machine Formulations (Two Different Measures of Training Error)
2-Norm Soft Margin:
$\min_{(w,b,\xi) \in \mathbb{R}^{n+1+l}} \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\|\xi\|_2^2$ s.t. $D(Aw + \mathbf{1}b) + \xi \geq \mathbf{1}$
1-Norm Soft Margin (Conventional SVM):
$\min_{(w,b,\xi) \in \mathbb{R}^{n+1+l}} \frac{1}{2}\|w\|_2^2 + C\mathbf{1}^\top\xi$ s.t. $D(Aw + \mathbf{1}b) + \xi \geq \mathbf{1}$, $\xi \geq 0$
34 / 49

Tuning Procedure: How to Determine C?
[Figure: classifier behavior as C varies; an overly large C leads to overfitting.]
The final value of the parameter is the one with the maximum testing set correctness.
35 / 49
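A hedged sketch of such a tuning loop using scikit-learn's cross-validated grid search over C (cross-validation stands in for the slide's testing-set correctness; the toy data below are an illustrative assumption).

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for a real training set (an assumption for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Search a grid of C values; the chosen C is the one with the best
# held-out (cross-validated) correctness, in the spirit of the slide.
grid = GridSearchCV(SVC(kernel='linear'),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```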

1-Norm SVM (a Different Measure of Margin)
1-Norm SVM:
$\min_{(w,b,\xi) \in \mathbb{R}^{n+1+l}} \|w\|_1 + C\mathbf{1}^\top\xi$ s.t. $D(Aw + \mathbf{1}b) + \xi \geq \mathbf{1}$, $\xi \geq 0$
Equivalent to:
$\min_{(s,w,b,\xi) \in \mathbb{R}^{2n+1+l}} \mathbf{1}^\top s + C\mathbf{1}^\top\xi$ s.t. $D(Aw + \mathbf{1}b) + \xi \geq \mathbf{1}$, $-s \leq w \leq s$, $\xi \geq 0$
Good for feature selection; similar to the LASSO.
36 / 49
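For a quick experiment with this sparsity effect, scikit-learn's LinearSVC offers an l1-penalized linear SVM; note it uses the squared hinge loss, a close relative of the LP above rather than the exact formulation, and the toy data and C value are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy data with many irrelevant features (an illustrative assumption).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# The l1 penalty on w drives many coefficients exactly to zero -> feature selection.
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=0.5).fit(X, y)
kept = np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-8)
print("number of selected features:", kept.size)
```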

Outline
1 Introduction
2 Binary Classification
3 Support Vector Machine
4 Nonlinear Support Vector Machine
5 Instance-based Learning
37 / 49

Two-spiral Dataset
[Figure: 94 white dots and 94 red dots forming two interleaved spirals.]
38 / 49

Learning in Feature Space (Could Simplify the Classification Task)
Learning in a high dimensional space could degrade generalization performance; this phenomenon is called the curse of dimensionality.
By using a kernel function, which represents the inner product of training examples in the feature space, we never need to know the nonlinear map explicitly; we do not even need to know the dimensionality of the feature space.
There is no free lunch: we have to deal with a huge and dense kernel matrix. Reduced kernel techniques can avoid this difficulty.
39 / 49

[Figure: feature map $\Phi : X \to F$; a nonlinear pattern in the data space becomes an approximately linear pattern in the feature space.]
40 / 49

Linear Machine in Feature Space
Let $\phi : X \to F$ be a nonlinear map from the input space to some feature space.
The classifier will be of the form (primal): $f(x) = \sum_{j=1}^{?} w_j \phi_j(x) + b$.
Make it into the dual form: $f(x) = \sum_{i=1}^{l} \alpha_i y_i \langle \phi(x_i), \phi(x)\rangle + b$.
41 / 49

Kernel: Represent the Inner Product in Feature Space
Definition: a kernel is a function $K : X \times X \to \mathbb{R}$ such that for all $x, z \in X$, $K(x, z) = \langle \phi(x), \phi(z)\rangle$, where $\phi : X \to F$.
The classifier becomes: $f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$.
42 / 49
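A brief sketch of a kernelized classifier with the Gaussian (RBF) kernel, using scikit-learn on a two-moons toy set as a stand-in for the two-spiral data (the dataset and the gamma and C values here are illustrative assumptions).

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Nonlinear toy data: not separable by any hyperplane in the input space.
X, y = make_moons(n_samples=188, noise=0.1, random_state=0)

# RBF kernel K(x, z) = exp(-gamma * ||x - z||^2); the map phi is never formed.
clf = SVC(kernel='rbf', gamma=2.0, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_.size)
```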

Outline
1 Introduction
2 Binary Classification
3 Support Vector Machine
4 Nonlinear Support Vector Machine
5 Instance-based Learning
43 / 49

Instance-based Learning
Fundamental philosophy: two instances that are close to each other, or similar to each other, should share the same label.
Also known as memory-based learning, since these methods store the training instances in a lookup table and interpolate from them; this requires $O(N)$ memory. Given an input, similar instances must be found, and finding them requires $O(N)$ computation.
Such methods are also called lazy learning algorithms, because they do NOT compute a model when given a training set but postpone the computation of the model until they are given a new test instance (query point).
44 / 49

k-nearest Neighbors Classifier Given a query point x o, we find the k training points x ( i), i = 1, 2,..., k closest in distance to x o Then classify using majority vote among these k neighbors. Choose k as an odd number will avoid the tie. Ties are broken at random If all attributes (features) are real-valued, we can use Euclidean distance. That is d(x, x o ) = x x o 2 If the attribute values are discrete, we can use Hamming distance, which counts the number of nonmatching attributes d(x, x o ) = n 1(x j xj o ) j=1 45 / 49

1-Nearest Neighbor Decision Boundary (Voronoi) 46 / 49

Distance Measure
Using different distance measures can give very different results in the k-NN algorithm, so be careful when you compute the distance.
We might need to normalize the scales of different attributes, for example yearly income vs. daily spending.
Typically we first standardize each attribute to have mean zero and variance one: $\hat{x}_j = \frac{x_j - \mu_j}{\sigma_j}$.
47 / 49

Learning a Distance Measure
Find a distance function $d(x_i, x_j)$ such that the distance is small if $x_i$ and $x_j$ belong to the same class and large if they belong to different classes.
Euclidean distance: $\|x_i - x_j\|_2^2 = (x_i - x_j)^\top (x_i - x_j)$
Mahalanobis distance: $d(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)$, where $M$ is a positive semi-definite matrix.
$(x_i - x_j)^\top M (x_i - x_j) = (x_i - x_j)^\top L^\top L (x_i - x_j) = (Lx_i - Lx_j)^\top (Lx_i - Lx_j)$
The matrix $L$ can have size $k \times n$ with $k \ll n$.
48 / 49
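A short NumPy sketch of this factored Mahalanobis distance with a low-rank $k \times n$ projection $L$, so that $M = L^\top L$ as in the identity above; the random $L$ here is merely a placeholder for a learned metric.

```python
import numpy as np

def mahalanobis_sq(xi, xj, L):
    """Squared Mahalanobis distance (xi - xj)' M (xi - xj) with M = L'L."""
    diff = L @ (xi - xj)          # equivalently L xi - L xj
    return diff @ diff

rng = np.random.default_rng(0)
n, k = 10, 3                      # k << n gives a low-rank metric
L = rng.normal(size=(k, n))       # placeholder for a learned projection
xi, xj = rng.normal(size=n), rng.normal(size=n)
print(mahalanobis_sq(xi, xj, L))
```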

References
C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 121-167, 1998.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
49 / 49