COMS 4771 Introduction to Machine Learning. Nakul Verma


Announcements: HW1 is due next lecture. Project details are available; decide on your group and topic by Thursday.

Last time: generative vs. discriminative classifiers; nearest neighbor (NN) classification; optimality of k-NN; coping with the drawbacks of k-NN; decision trees; the notion of overfitting in machine learning.

A Closer Look at Classification (figure: two classes of points, marked x and o, on either side of a boundary). Knowing the boundary is enough for classification.

Linear Decision Boundary (figure: height vs. weight scatter plot, with male data and female data separated by a line). Assume binary classification, y ∈ {-1, +1}. (What happens in the multi-class case?)

Learning Linear Decision Boundaries. g = decision boundary; d=1 case: g(x) = w_1 x + w_0; general: g(x) = w · x + w_0. f = linear classifier: f(x) = +1 if g(x) ≥ 0, -1 if g(x) < 0. How many parameters are there to learn in R^d? (d+1: the weights w and the bias w_0.)

Dealing with w_0 (the bias): homogeneous lifting. Append a constant 1 to each input, x → x̃ = (1, x), and fold the bias into the weight vector, w̃ = (w_0, w), so that g(x) = w · x + w_0 becomes the single dot product w̃ · x̃ in R^{d+1}.
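To make the lifting concrete, here is a minimal NumPy sketch of a linear classifier with the bias folded into the weights (the function names, the toy weight vector, and the example points are illustrative, not from the slides):

import numpy as np

def lift(X):
    # Homogeneous lifting: append a constant-1 coordinate to each row, x -> (1, x).
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w_tilde, X):
    # Linear classifier in the lifted space: sign(w_tilde . x_tilde).
    return np.where(lift(X) @ w_tilde >= 0, 1, -1)

# Toy example with w_0 = -1 and w = (2, 1):
w_tilde = np.array([-1.0, 2.0, 1.0])
X = np.array([[1.0, 0.0], [-1.0, -1.0]])
print(predict(w_tilde, X))   # -> [ 1 -1]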

The Linear Classifier: a basic computational unit in a neuron. (Figure: inputs x_1, ..., x_d plus a constant bias input 1, a weighted sum Σ_i w_i x_i, then a nonlinearity; popular nonlinearities include the threshold, which gives the linear classifier, and the sigmoid.)

Can Be Combined to Make a Network: an artificial neural network (figure: inputs x_1, x_2, x_3, ..., x_d feeding into layers of such units). Amazing fact: such a network can approximate any smooth function!

How to Learn the Weights? Given labeled training data (bias included) S = {(x_i, y_i)}_{i=1}^n, we want the w which minimizes the training error, i.e. argmin_w (1/n) Σ_i 1[sign(w · x_i) ≠ y_i]. How do we minimize it? We cannot use the standard technique (take the derivative and examine the stationary points). Why? The 0/1 training error is piecewise constant in w, so its derivative is zero wherever it is defined. Unfortunately, the problem is NP-hard to solve, or even to approximate!

Finding Weights (Relaxed Assumptions) Can we approximate the weights if we make reasonable assumptions? What if the training data is linearly separable?

Linear Separability. Say there is a linear decision boundary which can perfectly separate the training data. The distance of the closest point to the boundary is the margin γ.

Finding Weights. Given: labeled training data S = {(x_i, y_i)}_{i=1}^n. Want to determine: is there a w which satisfies y_i (w · x_i) > 0 for all i? That is, is the training data linearly separable? Since there are d+1 variables and |S| constraints, it is possible to solve this efficiently via a (constrained) optimization program. (How?) But we can find it in a much simpler way!
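As a side illustration of the optimization route (the slides only pose the question), separability can be checked as a linear feasibility problem with an off-the-shelf LP solver. The formulation below, the function name, and the use of scipy.optimize.linprog are my own choices; requiring y_i (w · x_i) ≥ 1 is equivalent to strict separability up to rescaling of w.

import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    # Find w in R^{d+1} (bias folded in) with y_i (w . x_i) >= 1 for all i,
    # written as the LP constraints -y_i (w . x_i) <= -1 with a zero objective.
    n, d = X.shape
    X_lift = np.hstack([np.ones((n, 1)), X])    # homogeneous lifting
    A_ub = -y[:, None] * X_lift                 # one constraint row per sample
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=(None, None), method="highs")
    return res.success, (res.x if res.success else None)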

The Perceptron Algorithm. Given: labelled training data S = {(x_i, y_i)}_{i=1}^n. Initialize w_0 = 0. For t = 1, 2, 3, ...: if there exists (x_i, y_i) ∈ S s.t. y_i (w_{t-1} · x_i) ≤ 0, update w_t = w_{t-1} + y_i x_i. (Terminate when no such training sample exists.)

Perceptron Algorithm: Geometry

Perceptron Algorithm: Geometry

The Perceptron Algorithm. Input: labelled training data S = {(x_i, y_i)}_{i=1}^n. Initialize w_0 = 0. For t = 1, 2, 3, ...: if there exists (x_i, y_i) ∈ S s.t. y_i (w_{t-1} · x_i) ≤ 0, update w_t = w_{t-1} + y_i x_i. (Terminate when no such training sample exists.) Question: does the perceptron algorithm terminate? If so, when?
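A minimal NumPy sketch of the algorithm as stated above, assuming the bias is already folded into X via homogeneous lifting (the function and variable names are mine):

import numpy as np

def perceptron(X, y, max_rounds=10000):
    # X is n x d with a constant-1 column, y is in {-1, +1}.
    # Returns the learned weight vector, or None if max_rounds is exhausted.
    n, d = X.shape
    w = np.zeros(d)                        # w_0 = 0
    for _ in range(max_rounds):
        mistakes = np.flatnonzero(y * (X @ w) <= 0)
        if mistakes.size == 0:             # no misclassified sample: done
            return w
        i = mistakes[0]                    # pick any misclassified sample
        w = w + y[i] * X[i]                # w_t = w_{t-1} + y_i x_i
    return None                            # data may not be linearly separable

On linearly separable data this loop terminates; the next slide quantifies how quickly.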

Perceptron Algorithm: Guarantee. Theorem (perceptron mistake bound): assume there is a (unit length) w* that can separate the training sample S with margin γ. Let R = max_i ||x_i||. Then the perceptron algorithm will make at most (R/γ)^2 mistakes. Thus, the algorithm will terminate in T ≤ (R/γ)^2 rounds! Umm... but what about the generalization or the test error?

Proof. Key quantity to analyze: how far is w_t from w*? Suppose the perceptron algorithm makes a mistake on (x_i, y_i) in iteration t. Then w_t · w* = (w_{t-1} + y_i x_i) · w* ≥ w_{t-1} · w* + γ, and ||w_t||^2 = ||w_{t-1}||^2 + 2 y_i (w_{t-1} · x_i) + ||x_i||^2 ≤ ||w_{t-1}||^2 + R^2, since the middle term is non-positive on a mistake.

Proof (contd.). Both inequalities hold for all iterations t. So, after T rounds, w_T · w* ≥ Tγ and ||w_T||^2 ≤ T R^2. Therefore T ≤ (R/γ)^2, by chaining the inequalities as written out below.
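The full chain, written out in one place (with w_0 = 0 and ||w*|| = 1, restating the argument above):

\begin{align*}
w_T \cdot w^* &\ge w_{T-1} \cdot w^* + \gamma \ge \cdots \ge T\gamma, \\
\|w_T\|^2 &\le \|w_{T-1}\|^2 + R^2 \le \cdots \le T R^2, \\
T\gamma \;\le\; w_T \cdot w^* \;&\le\; \|w_T\|\,\|w^*\| \;=\; \|w_T\| \;\le\; R\sqrt{T}
\;\;\Longrightarrow\;\; T \le \left(\tfrac{R}{\gamma}\right)^2.
\end{align*}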

What Good is a Mistake Bound? It's an upper bound on the number of mistakes made by an online algorithm on an arbitrary sequence of examples, i.e., there is no i.i.d. assumption and no need to load all the data at once! Online algorithms with small mistake bounds can be used to develop classifiers with good generalization error!

Other Simple Variants of the Perceptron: the voted perceptron, the averaged perceptron, and Winnow.

Linear Classification. Linear classification is simple, but when is real data (even approximately) linearly separable?

What about non-linear decision boundaries? Non-linear decision boundaries are common in practice (figure: data that no single line can separate).

Generalizing Linear Classification. Suppose we have the following training data (d=2 case), and say the decision boundary is some sort of ellipse, e.g. a circle of radius r: x_1^2 + x_2^2 = r^2. The data is separable via a circular decision boundary, which is not linear in (x_1, x_2)!

But g is Linear in Some Space! The boundary g(x) = x_1^2 + x_2^2 - r^2 is non-linear in x_1 and x_2, but linear in χ_1 = x_1^2 and χ_2 = x_2^2. So if we apply the feature transformation φ(x) = (x_1^2, x_2^2) to our data, then g becomes linear in the φ-transformed feature space!

Feature Transformation Geometrically

Feature Transform for Quadratic Boundaries. R^2 case (generic quadratic boundary): g(x) = w_1 x_1^2 + w_2 x_2^2 + w_3 x_1 x_2 + w_4 x_1 + w_5 x_2 + w_0, with feature transformation φ(x) = (x_1^2, x_2^2, x_1 x_2, x_1, x_2). R^d case (generic quadratic boundary): the feature transformation φ(x) consists of all monomials of degree at most two, i.e. (x_1, ..., x_d, x_1^2, x_1 x_2, ..., x_d^2). This captures all pairwise interactions between variables.
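A small NumPy sketch of this quadratic feature map for general R^d (the function name and the ordering of the features are my own; only the set of monomials matters):

import numpy as np
from itertools import combinations_with_replacement

def quadratic_features(x):
    # Map x in R^d to all monomials of degree <= 2 (without the constant term):
    # (x_1, ..., x_d, x_1^2, x_1 x_2, ..., x_d^2), i.e. d + d(d+1)/2 features.
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, np.array(pairs)])

# Example: in R^2 the circle x_1^2 + x_2^2 = r^2 becomes the linear boundary
# w . phi(x) + w_0 = 0 with w = (0, 0, 1, 0, 1) and w_0 = -r**2.
print(quadratic_features([1.0, 2.0]))   # -> [1. 2. 1. 2. 4.]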

Data is Linearly Separable in Some Space! Theorem: given n labeled points (x_i, y_i) with y_i ∈ {-1, +1}, there exists a feature transform under which the data points are linearly separable. (This feature transform is sometimes called the kernel transform.) The proof is almost trivial!

Proof. Given n points, consider the mapping into R^n that sends the i-th point to the vector (0, ..., 0, y_i, 0, ..., 0), which is zero in all coordinates except coordinate i. Then the decision boundary induced by the all-ones linear weighting perfectly separates the input data!
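A tiny numerical check of this construction (purely illustrative; the function name is mine):

import numpy as np

def trivial_kernel_transform(y):
    # phi(x_i) = y_i * e_i in R^n: row i is zero except for y_i in coordinate i.
    return np.diag(y).astype(float)

y = np.array([1, -1, -1, 1])
Phi = trivial_kernel_transform(y)
w = np.ones(len(y))                  # the all-ones linear weighting
print(np.sign(Phi @ w) == y)         # -> [ True  True  True  True]

Note that this transform depends only on the index and label of each training point, which is exactly why it proves existence but says nothing about new data.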

Transforming the Data into Kernel Space. Pros: any problem becomes linearly separable! Cons: what about computation? The generic kernel transform is Ω(n)-dimensional, and some useful kernel transforms map the input space into an infinite-dimensional space! And what about model complexity? Generalization performance typically degrades with model complexity.

The Kernel Trick (to Deal with Computation). Explicitly working in the generic kernel space takes time Ω(n). But the dot product between two data points in kernel space can often be computed quickly, without ever forming the transform explicitly. Example: for the quadratic kernel transform on data in R^d, the explicit transform costs O(d^2) but the dot products cost only O(d). For the RBF (radial basis function) kernel transform on data in R^d, the explicit transform is infinite-dimensional, yet the dot products still cost only O(d).
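A numerical check of this, under the standard formulas K(x, z) = (x · z)^2 for the (homogeneous) quadratic kernel and K(x, z) = exp(-||x - z||^2 / (2σ^2)) for the RBF kernel; the slides state only the costs, so these exact formulas are an assumption on my part:

import numpy as np

def phi_quad(x):
    # Explicit homogeneous quadratic map: all products x_i x_j, i.e. O(d^2) features.
    return np.outer(x, x).ravel()

def k_quad(x, z):
    # Same inner product computed in O(d): (x . z)^2 = phi_quad(x) . phi_quad(z).
    return (x @ z) ** 2

def k_rbf(x, z, sigma=1.0):
    # RBF kernel: an inner product in an infinite-dimensional feature space, in O(d) time.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi_quad(x) @ phi_quad(z), k_quad(x, z)))   # -> True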

The Kernel Trick. The trick is to perform classification in such a way that the data is accessed only in terms of dot products (so it can be done quicker). Example: the kernel perceptron. Recall that the perceptron's weight vector is a sum of the examples it made mistakes on; equivalently w = Σ_k α_k y_k x_k, where α_k = # of times a mistake was made on x_k. Thus classification becomes sign(w · x) = sign(Σ_k α_k y_k (x_k · x)), which accesses the data only in terms of dot products!

The Kernel Trick for the Perceptron. Classification in the original space: sign(Σ_k α_k y_k (x_k · x)). If we were working in the transformed kernel space, it would have been sign(Σ_k α_k y_k K(x_k, x)), where K(x_k, x) = φ(x_k) · φ(x). Algorithm: initialize α = 0. For t = 1, 2, 3, ..., T: if there exists (x_i, y_i) s.t. y_i Σ_k α_k y_k K(x_k, x_i) ≤ 0, increment α_i. We are implicitly working in a non-linear kernel space!
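A minimal sketch of the kernelized perceptron as described above (the function names and the order in which mistakes are scanned are my own choices):

import numpy as np

def kernel_perceptron(X, y, kernel, max_rounds=1000):
    # Learns mistake counts alpha instead of an explicit weight vector;
    # the data is accessed only through kernel(x_k, x_i).
    n = len(y)
    K = np.array([[kernel(X[k], X[i]) for i in range(n)] for k in range(n)])
    alpha = np.zeros(n)
    for _ in range(max_rounds):
        scores = (alpha * y) @ K               # score_i = sum_k alpha_k y_k K(x_k, x_i)
        mistakes = np.flatnonzero(y * scores <= 0)
        if mistakes.size == 0:
            return alpha
        alpha[mistakes[0]] += 1                # one more mistake on that point
    return alpha

def kernel_predict(alpha, X_train, y_train, kernel, x):
    score = sum(a * yk * kernel(xk, x) for a, yk, xk in zip(alpha, y_train, X_train))
    return 1 if score >= 0 else -1

Plugging in k_quad or k_rbf from the previous sketch yields non-linear decision boundaries while only ever evaluating K.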

The Kernel Trick: Significance. Dot products are a measure of similarity, and they can be replaced by any user-defined measure of similarity! So we can work in any user-defined non-linear space implicitly, without the potentially heavy computational cost.

What We Learned: decision boundaries for classification; the linear decision boundary (linear classification); the Perceptron algorithm; the mistake bound for the perceptron; generalizing to non-linear boundaries (via kernel space); problems become linear in kernel space; the kernel trick to speed up computation.

Questions?

Next time: Support Vector Machines (SVMs)!
