
CS145: INTRODUCTION TO DATA MINING. 5: Vector Data: Support Vector Machine. Instructor: Yizhou Sun, yzsun@cs.ucla.edu. October 18, 2017

Announcements: Homework 1 is due at the end of the day this Thursday (11:59pm). Reminder of the late submission policy: the maximum obtainable score decreases with how late you are; e.g., if you are t = 12 hours late, at most half of the original score can be obtained, and if you are 24 hours late, a score of 0 will be given. 2

Methods to Learn: Last Lecture
Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
Prediction: Linear Regression; GLM* (vector data)
Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
Similarity Search: DTW (sequence data)
3

Methods to Learn: the same matrix as above; this lecture covers SVM (classification of vector data). 4

Support Vector Machine (outline): Introduction; Linear SVM; Non-linear SVM; Scalability Issues*; Summary. 5

Math Review. Vector: x = (x_1, x_2, ..., x_n). Subtracting two vectors: x = b − a. Dot product: a · b = Σ_i a_i b_i; geometric interpretation: projection. If a and b are orthogonal, a · b = 0. 6

Math Review (Cont.). Plane/Hyperplane: a_1 x_1 + a_2 x_2 + ... + a_n x_n = c; this is a line for n = 2, a plane for n = 3, and a hyperplane in higher dimensions. Normal of a plane: n = (a_1, a_2, ..., a_n), a vector perpendicular to the surface. 7

Math Review (Cont.). Define a plane using a normal n = (a, b, c) and a point (x_0, y_0, z_0) in the plane: (a, b, c) · (x_0 − x, y_0 − y, z_0 − z) = 0, i.e., ax + by + cz = ax_0 + by_0 + cz_0 (= d). Distance from a point (x_0, y_0, z_0) to a plane ax + by + cz = d: project the offset onto the unit normal (a, b, c)/‖(a, b, c)‖, giving |ax_0 + by_0 + cz_0 − d| / √(a² + b² + c²). 8
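
To make the distance formula concrete, here is a minimal numpy sketch (an illustration, not from the slides; the plane and point are made up):

```python
import numpy as np

# Plane a.x = d with normal a = (1, 2, 2) and offset d = 6; query point p.
a = np.array([1.0, 2.0, 2.0])
d = 6.0
p = np.array([3.0, 1.0, 4.0])

# Distance = |a.p - d| / ||a||  (project the offset onto the unit normal).
dist = np.abs(a @ p - d) / np.linalg.norm(a)
print(dist)   # |3 + 2 + 8 - 6| / 3 = 7/3 ≈ 2.333
```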

Linear Classifier. Given a training dataset {(x_i, y_i)}_{i=1}^N, a separating hyperplane can be written as a linear combination of the attributes, W · X + b = 0, where W = {w_1, w_2, ..., w_n} is a weight vector and b a scalar (bias). For 2-D data it can be written as w_0 + w_1 x_1 + w_2 x_2 = 0. Classification: w_0 + w_1 x_1 + w_2 x_2 > 0 ⇒ y_i = +1; w_0 + w_1 x_1 + w_2 x_2 ≤ 0 ⇒ y_i = −1. 9
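
As a quick illustration of the classification rule (the weights below are made-up values, not learned): predict +1 when w · x + b > 0 and −1 otherwise.

```python
import numpy as np

w = np.array([1.0, -2.0])   # hypothetical weight vector
b = 0.5                     # hypothetical bias

def classify(x):
    return 1 if w @ x + b > 0 else -1

print(classify(np.array([3.0, 1.0])))   # 3 - 2 + 0.5 = 1.5 > 0  -> +1
print(classify(np.array([0.0, 1.0])))   # -2 + 0.5 = -1.5 <= 0   -> -1
```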

Recall: Is the decision boundary for logistic regression linear? Is the decision boundary for a decision tree linear? 10

Simple Linear Classifier: Perceptron. Loss function: max{0, −y_i w^T x_i}. 11
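
A minimal sketch of how this loss drives the classic perceptron update (the learning rate and the data point are illustrative assumptions; the slide only states the loss):

```python
import numpy as np

def perceptron_step(w, x, y, eta=1.0):
    """One perceptron step: loss max(0, -y * w.x); update w when x is misclassified."""
    score = y * (w @ x)
    loss = max(0.0, -score)
    if score <= 0:              # misclassified (or exactly on the boundary)
        w = w + eta * y * x     # nudge w so that y * w.x increases
    return w, loss

w = np.zeros(2)
w, loss = perceptron_step(w, np.array([1.0, 2.0]), +1)
print(w, loss)                  # [1. 2.] 0.0
```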

More on Sign Function 12

Example 13

Support Vector Machine (outline): Introduction; Linear SVM; Non-linear SVM; Scalability Issues*; Summary. 14

Can we do better? Which hyperplane to choose? 15

SVM Margins and Support Vectors (figure: a small-margin separator vs. a large-margin separator, with the support vectors highlighted). 16

SVM When Data Is Linearly Separable. Let the data D be {(X_1, y_1), ..., (X_|D|, y_|D|)}, where X_i is a training tuple and y_i its associated class label. There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data). SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH). 17

SVM Linearly Separable. A separating hyperplane can be written as W · X + b = 0. The hyperplanes defining the sides of the margin are, e.g.: H_1: w_0 + w_1 x_1 + w_2 x_2 ≥ 1 for y_i = +1, and H_2: w_0 + w_1 x_1 + w_2 x_2 ≤ −1 for y_i = −1. Any training tuples that fall on hyperplanes H_1 or H_2 (i.e., the sides defining the margin) are support vectors. This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved by Quadratic Programming (QP) using Lagrange multipliers. 18

Maximum Margin Calculation. w: decision hyperplane normal vector; x_i: data point i; y_i: class of data point i (+1 or −1). Take a point x_a on the margin hyperplane w^T x + b = 1 and a point x_b on w^T x + b = −1; the decision boundary is w^T x + b = 0. Margin: ρ = 2/‖w‖. Hint: what is the distance between x_a and the hyperplane w^T x + b = −1? 19
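
A worked version of the hinted step (a sketch that reuses the point-to-hyperplane distance from the math review; x_a is any point satisfying w^T x_a + b = 1):

```latex
\rho
= \operatorname{dist}\bigl(x_a,\ \{x : w^{T}x + b = -1\}\bigr)
= \frac{\lvert w^{T}x_a + b - (-1) \rvert}{\lVert w \rVert}
= \frac{1 + 1}{\lVert w \rVert}
= \frac{2}{\lVert w \rVert}
```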

SVM as a Quadratic Program (QP). Objective: find w and b such that ρ = 2/‖w‖ is maximized; Constraints: for all {(x_i, y_i)}: w^T x_i + b ≥ 1 if y_i = +1, and w^T x_i + b ≤ −1 if y_i = −1. A better form: Objective: find w and b such that Φ(w) = ½ w^T w is minimized; Constraints: for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1. 20
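
A minimal sketch of solving the second (primal) form numerically, assuming the cvxpy modeling library and a tiny made-up linearly separable dataset (neither appears in the slides):

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable 2-D dataset (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Primal hard-margin SVM: minimize 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximum-margin separator for this toy data
```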

Solve the QP. This is now optimizing a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (including many specialized ones built for SVMs). The solution involves constructing a dual problem in which a Lagrange multiplier α_i is associated with every constraint in the primal problem: 21

Lagrange Formulation 22

Primal Form and Dual Form. Primal: Objective: find w and b such that Φ(w) = ½ w^T w is minimized; Constraints: for all {(x_i, y_i)}: y_i (w^T x_i + b) ≥ 1. The two forms are equivalent under some conditions (the KKT conditions). Dual: Objective: find α_1 ... α_n such that Q(α) = Σα_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, subject to the constraints (1) Σ α_i y_i = 0 and (2) α_i ≥ 0 for all α_i. More derivations: http://cs229.stanford.edu/notes/cs229-notes3.pdf 23
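
One intermediate step that the deck leaves to the linked notes (a standard derivation sketch, not spelled out on the slide): form the Lagrangian of the primal and set its gradients to zero; substituting the results back into L yields the dual objective Q(α) above.

```latex
L(w, b, \alpha) = \tfrac{1}{2}\, w^{T} w \;-\; \sum_{i} \alpha_i \bigl[ y_i (w^{T} x_i + b) - 1 \bigr], \qquad \alpha_i \ge 0

\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0
```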

The Optimization Problem Solution. The solution has the form w = Σ α_i y_i x_i and b = y_k − w^T x_k for any x_k such that α_k > 0. Each non-zero α_i indicates that the corresponding x_i is a support vector. The classifying function then has the form f(x) = Σ α_i y_i x_i^T x + b. Notice that it relies on an inner product between the test point x and the support vectors x_i (we will return to this later). Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points. 24
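
A small sketch of this solution form in practice, assuming scikit-learn's SVC (the slides do not reference any library): SVC exposes α_i y_i as dual_coef_, the support vectors, and the intercept b, so f(x) = Σ α_i y_i x_i^T x + b can be recomputed by hand and checked against decision_function.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin

x_new = np.array([1.0, 0.5])
# f(x) = sum over support vectors of (alpha_i y_i) x_i^T x, plus b.
f = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(f, clf.decision_function([x_new]))      # the two values agree
```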

Sec. 15.2.1 Soft Margin Classification. If the training data is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples. Allow some errors: let some points be moved to where they belong, at a cost. Still, try to minimize training set errors, and to place the hyperplane far from each class (large margin). (figure: slack variables ξ_i, ξ_j marked for points on the wrong side of their margin) 25

Soft Margin Classification, Sec. 15.2.1, Mathematically. The old formulation: find w and b such that Φ(w) = ½ w^T w is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1. The new formulation incorporating slack variables: find w and b such that Φ(w) = ½ w^T w + C Σ ξ_i is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1 − ξ_i with ξ_i ≥ 0 for all i. The parameter C can be viewed as a way to control overfitting; C Σ ξ_i is a regularization term (an L1 penalty on the slacks). 26
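
A brief sketch of the role of C, again assuming scikit-learn's SVC and made-up overlapping data: a small C tolerates more margin violations (a wider, softer margin), while a large C penalizes slack heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs, so some slack is unavoidable.
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)), rng.normal(1.0, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C -> typically more support vectors (more points inside the margin).
    print(C, clf.n_support_.sum(), np.linalg.norm(clf.coef_))
```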

Sec. 15.2.1 Soft Margin Classification Solution. The dual problem for soft margin classification: find α_1 ... α_N such that Q(α) = Σα_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i. Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem! Again, the x_i with non-zero α_i will be support vectors: if 0 < α_i < C, then ξ_i = 0; if α_i = C, then ξ_i > 0. The solution to the problem is w = Σ α_i y_i x_i and b = y_k − w^T x_k for any x_k such that 0 < α_k < C. Note that w is not needed explicitly for classification: f(x) = Σ α_i y_i x_i^T x + b. 27

Sec. 15.1 Classification with SVMs. Given a new point x, we can score its projection onto the hyperplane normal, i.e., compute the score w^T x + b = Σ α_i y_i x_i^T x + b, and decide the class based on whether it is < or > 0. We can also set a confidence threshold t: score > t: yes; score < −t: no; else: don't know. 28

Sec. 15.2.1 Linear SVMs: Summary. The classifier is a separating hyperplane. The most important training points are the support vectors; they define the hyperplane. Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e., those with non-zero Lagrange multipliers α_i. Both in the dual formulation of the problem and in the solution, the training points appear only inside inner products: find α_1 ... α_N such that Q(α) = Σα_i − ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i; f(x) = Σ α_i y_i x_i^T x + b. 29

Support Vector Machine (outline): Introduction; Linear SVM; Non-linear SVM; Scalability Issues*; Summary. 30

Sec. 15.2.3 Non-linear SVMs. Datasets that are linearly separable (with some noise) work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? (figures: 1-D data on the x axis, first separable, then not; the same data mapped to (x, x²) becomes separable) 31

Sec. 15.2.3 Non-linear SVMs: Feature spaces. General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x ↦ φ(x). 32
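
A small sketch of this idea (the 1-D dataset and the particular map x ↦ (x, x²) are illustrative assumptions, echoing the figure on the previous slide): data that cannot be split by a threshold on the line becomes linearly separable after an explicit quadratic feature map.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: class +1 in the middle, class -1 on both sides (not separable on the line).
x = np.array([-3.0, -2.5, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 2.5, 3.0])
y = np.where(np.abs(x) < 1.5, 1, -1)

phi = np.column_stack([x, x ** 2])   # explicit map x -> (x, x^2)

clf = SVC(kernel="linear").fit(phi, y)
print(clf.score(phi, y))             # 1.0: separable in the mapped space
```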

Sec. 15.2.3 The Kernel Trick. The linear classifier relies on an inner product between vectors, K(x_i, x_j) = x_i^T x_j. If every data point is mapped into a high-dimensional space via some transformation Φ: x ↦ φ(x), the inner product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j). A kernel function is some function that corresponds to an inner product in some expanded feature space. Example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_i^T x_j)². We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j): K(x_i, x_j) = (1 + x_i^T x_j)² = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2 = [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]^T [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2] = φ(x_i)^T φ(x_j), where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]. 33
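
The algebra above can be checked numerically; here is a short sketch (the test vectors are arbitrary) comparing the kernel computed on the original 2-D inputs with the inner product in the explicit 6-D feature space:

```python
import numpy as np

def K(xi, xj):
    """Polynomial kernel (1 + xi.xj)^2 on the original 2-D inputs."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit 6-D feature map whose inner product reproduces K."""
    s = np.sqrt(2.0)
    return np.array([1.0, x[0]**2, s * x[0] * x[1], x[1]**2, s * x[0], s * x[1]])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print(K(xi, xj), phi(xi) @ phi(xj))   # both equal (1 + 3 - 2)^2 = 4.0
```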

SVM: Different Kernel Functions. Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(X_i, X_j) to the original data, i.e., K(X_i, X_j) = Φ(X_i)^T Φ(X_j). Typical kernel functions are listed on the slide. *SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters). 34
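
For reference, since the slide's table of typical kernel functions is an image that does not survive in this transcript, these are the standard examples in the accompanying textbook treatment (the exact parameterizations here are therefore an assumption):

```latex
\text{Polynomial of degree } h:\quad K(X_i, X_j) = (X_i \cdot X_j + 1)^{h}

\text{Gaussian RBF:}\quad K(X_i, X_j) = e^{-\lVert X_i - X_j \rVert^{2} / 2\sigma^{2}}

\text{Sigmoid:}\quad K(X_i, X_j) = \tanh(\kappa\, X_i \cdot X_j - \delta)
```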

Sec. 15.2.1 Non-linear SVM. Replace the inner product with a kernel function. Optimization problem: find α_1 ... α_N such that Q(α) = Σα_i − ½ ΣΣ α_i α_j y_i y_j K(x_i, x_j) is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i. Decision boundary: f(x) = Σ α_i y_i K(x_i, x) + b. 35
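
A short sketch tying this back to the earlier polynomial-kernel example, again assuming scikit-learn: SVC accepts a callable kernel that returns the Gram matrix K(x_i, x_j), so the decision boundary f(x) = Σ α_i y_i K(x_i, x) + b is handled without ever forming φ(x).

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(A, B):
    # Gram matrix of K(a, b) = (1 + a.b)^2 between the rows of A and the rows of B.
    return (1.0 + A @ B.T) ** 2

# XOR-like data: not linearly separable in the original 2-D space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel=poly_kernel).fit(X, y)
print(clf.predict(X))   # recovers the XOR-style labels: [ 1  1 -1 -1]
```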

Support Vector Machine (outline): Introduction; Linear SVM; Non-linear SVM; Scalability Issues*; Summary. 36

*Scaling SVM by Hierarchical Micro-Clustering. SVM is not scalable in the number of data objects, in terms of both training time and memory usage (H. Yu, J. Yang, and J. Han, "Classifying Large Data Sets Using SVM with Hierarchical Clusters", KDD'03). CB-SVM (Clustering-Based SVM): given a limited amount of system resources (e.g., memory), maximize the SVM performance in terms of accuracy and training speed. Use micro-clustering to effectively reduce the number of points to be considered. When deriving support vectors, de-cluster the micro-clusters near the candidate support vectors to ensure high classification accuracy. 37

*CF-Tree: Hierarchical Micro-clusters. Read the data set once and construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory. Micro-clustering: the hierarchical indexing structure provides finer samples closer to the boundary and coarser samples farther from the boundary. 38

*Selective Declustering: Ensure High Accuracy. The CF-tree is a suitable base structure for selective declustering. De-cluster only the clusters E_i such that D_i − R_i < D_s, where D_i is the distance from the boundary to the center point of E_i and R_i is the radius of E_i; that is, de-cluster only the clusters whose subclusters could be support clusters of the boundary. Support cluster: a cluster whose centroid is a support vector. 39

*CB-SVM Algorithm: Outline. (1) Construct two CF-trees from the positive and negative data sets independently; this needs one scan of the data set. (2) Train an SVM from the centroids of the root entries. (3) De-cluster the entries near the boundary into the next level; the child entries de-clustered from the parent entries are accumulated into the training set, together with the non-declustered parent entries. (4) Train an SVM again from the centroids of the entries in the training set. (5) Repeat until nothing is accumulated. 40
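
A rough structural sketch of that loop, paraphrased from the outline above (build_cf_tree, train_svm, near_boundary, and the entry attributes are hypothetical placeholders, not the authors' code or any real API):

```python
def cb_svm(positive_data, negative_data):
    # One scan per data set to build the hierarchical micro-cluster summaries.
    pos_tree = build_cf_tree(positive_data)      # hypothetical helper
    neg_tree = build_cf_tree(negative_data)      # hypothetical helper
    training = list(pos_tree.root_entries) + list(neg_tree.root_entries)

    while True:
        svm = train_svm([e.centroid for e in training], [e.label for e in training])
        # De-cluster entries near the boundary that still have children.
        expandable = [e for e in training if near_boundary(e, svm) and e.children]
        if not expandable:            # nothing new accumulated: done
            return svm
        training = [e for e in training if e not in expandable]
        training += [c for e in expandable for c in e.children]
```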

*Accuracy and Scalability on Synthetic Datasets. Experiments on large synthetic data sets show better accuracy than random sampling approaches and far better scalability than the original SVM algorithm. 41

Support Vector Machine (outline): Introduction; Linear SVM; Non-linear SVM; Scalability Issues*; Summary. 42

Summary. Support Vector Machine: linear classifier; support vectors; kernel SVM. 43

SVM Related Links. SVM website: http://www.kernel-machines.org/ Representative implementations: LIBSVM: an efficient implementation of SVM with multi-class classification, nu-SVM, one-class SVM, and various interfaces (Java, Python, etc.). SVM-light: simpler, but its performance is not better than LIBSVM; it supports only binary classification and is written only in C. SVM-torch: another implementation, also written in C. From classification to regression and ranking: http://www.dainf.ct.utfpr.edu.br/~kaestner/mineracao/hwanjoyusvmtutorial.pdf 44