Lecture 4: Linear predictors and the Perceptron


1 Lecture 4: Linear predictors and the Perceptron. Introduction to Learning and Analysis of Big Data. Kontorovich and Sabato (BGU).

2 Inductive Bias. Inductive bias is critical to prevent overfitting; here, the inductive bias is a relatively simple hypothesis class H. What if we don't know which H is suitable for our learning problem? Choose a good representation (relevant features) and use a general-purpose hypothesis class. One of the most popular choices: linear predictors.

3 Linear predictors. Recall that in many learning problems X = R^d: each example is a vector with d coordinates (features). In binary classification problems, Y consists of two labels. In linear prediction, H is the class of all linear separators. (Illustration with d = 2 on the slide.)

4 Why restrict to linear predictors? If the algorithm could choose any separating boundary, we could get overfitting: labels of unseen examples would not be predicted correctly, e.g. if using a squiggly line.

5 Preventing overfitting. Linear predictors prevent overfitting: in dimension d, if the training sample size is Θ(d), then training error ≈ true prediction error. (Dimension d = number of features.) So, if we use linear predictors with enough training samples, does this guarantee low prediction error? Recall the No-Free-Lunch theorem...

6 Preventing overfitting is not enough. No overfitting: error on the training sample ≈ true prediction error on the distribution. But we can still have high error on the training sample.

7 Preventing overfitting is not enough. Recall the decomposition of the prediction error err(ĥ_S, D):
Approximation error: err_app := inf_{h∈H} err(h, D).
Estimation error: err_est := err(ĥ_S, D) − inf_{h∈H} err(h, D).
No overfitting ⇒ the estimation error is low. But we also need a low approximation error: if all linear predictors have high error, no sample size will help. In practice, linear predictors often do have a low approximation error, provided good features are chosen to represent the examples!

8 Formalizing linear predictors. To formalize linear predictors we will use inner products. Definition: for vectors x, z ∈ R^d, ⟨x, z⟩ := Σ_{i=1}^d x(i)·z(i). The length of a vector x ∈ R^d can be defined via its inner product: ‖x‖ = √(Σ_{i=1}^d x(i)²) = √⟨x, x⟩. The angle between two vectors is defined by their inner product: cos(θ) = ⟨x, z⟩ / (‖x‖·‖z‖). (Large value = small angle.) Inner products are commutative: ⟨x, z⟩ = ⟨z, x⟩. Inner products are linear: if a ∈ R and x, x′, z ∈ R^d, then ⟨a·x, z⟩ = a·⟨x, z⟩ and ⟨x + x′, z⟩ = ⟨x, z⟩ + ⟨x′, z⟩.
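To make these definitions concrete, here is a minimal numpy sketch (numpy is an assumption; the lecture itself is language-agnostic) computing an inner product, a norm, and the cosine of the angle between two vectors:

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

inner = np.dot(x, z)                     # <x, z> = sum_i x(i) z(i)
norm_x = np.sqrt(np.dot(x, x))           # ||x|| = sqrt(<x, x>)
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(z))   # cosine of the angle

print(inner, norm_x, cos_theta)          # 1.0, ~2.236, ~0.141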

9 Formalizing linear predictors. Call the labels Y = {−1, +1}. In 2 dimensions the separating line is a·x(1) + b = x(2), and the linear prediction rule is: y = +1 if a·x(1) + b ≥ x(2), and y = −1 if a·x(1) + b < x(2).

10 Formalizing linear predictors. In two dimensions: y = +1 if a·x(1) + b ≥ x(2), and y = −1 if a·x(1) + b < x(2). Define a vector w = (w(1), w(2)) = (a, −1). Then we can rewrite this as y = sign(w(1)·x(1) + w(2)·x(2) + b) = sign(⟨w, x⟩ + b). For a vector w ∈ R^d and a number b ∈ R, define the linear predictor h_{w,b}: for x ∈ R^d, h_{w,b}(x) := sign(⟨w, x⟩ + b); b is called the bias of the predictor. The hypothesis class of all linear predictors in dimension d: H^d_L := {h_{w,b} : w ∈ R^d, b ∈ R}.
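A minimal sketch of the predictor h_{w,b} (the function name h and mapping sign(0) to +1 are my choices; the slides leave the boundary case unspecified):

import numpy as np

def h(w, b, x):
    """Linear predictor h_{w,b}(x) = sign(<w, x> + b); here sign(0) is taken as +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Two-dimensional example with w = (a, -1): predicts +1 exactly when a*x(1) + b >= x(2).
a, b = 2.0, 0.5
print(h(np.array([a, -1.0]), b, np.array([1.0, 1.0])))   # +1, since 2*1 + 0.5 >= 1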

11 Formalizing linear predictors. h_{w,b}(x) := sign(⟨w, x⟩ + b), H^d_L := {h_{w,b} : w ∈ R^d, b ∈ R}. In 3 dimensions, the linear boundary ⟨w, x⟩ + b = 0 is a plane; in higher dimensions, it is a hyperplane. The vector w is the normal to the hyperplane, and |b|/‖w‖ is the distance from the origin to the hyperplane.

12 The bias b is not needed. Suppose we have a classification problem with X = R^d. For every example x ∈ R^d, define x′ ∈ R^{d+1} by x = (x(1), ..., x(d)) ↦ x′ := (x(1), ..., x(d), 1). For every linear predictor with a bias, h_{w,b} on R^d, define a linear predictor h_{w′} without a bias on R^{d+1}, where w′ := (w(1), ..., w(d), b). We get h_{w,b}(x) = h_{w′}(x′) for all x, w, b: h_{w′}(x′) = sign(⟨x′, w′⟩) = sign(⟨x, w⟩ + b) = h_{w,b}(x). Conclusion: by adding a coordinate which is always 1, we can discard the bias term.
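A small sketch of this reduction (the helper names are mine): appending a constant-1 feature to each example and folding b into the last coordinate of w gives the same predictions.

import numpy as np

def add_constant_feature(X):
    """Map each row x in R^d to x' = (x(1), ..., x(d), 1) in R^{d+1}."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fold_bias(w, b):
    """Map (w, b) to w' = (w(1), ..., w(d), b), so that <w', x'> = <w, x> + b."""
    return np.append(w, b)

X = np.array([[1.0, 2.0], [0.5, -1.0]])
w, b = np.array([2.0, -1.0]), 0.5
assert np.allclose(add_constant_feature(X) @ fold_bias(w, b), X @ w + b)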

13 Removing the bias term. From one dimension with a bias to two dimensions without a bias (illustration on the slide). Linear predictors without a bias are called homogeneous. Linear predictors are also called halfspaces.

14 Implementing the ERM for linear predictors. Implementing ERM: find a linear predictor h_w with minimal empirical error, err(h_w, S) = (1/m)·|{i : sign(⟨x_i, w⟩) ≠ y_i}|. This problem is NP-hard. There are workarounds (later in the course). Today: an efficient algorithm if the problem is realizable. Definition: D is realizable by H if there exists some h* ∈ H such that err(h*, D) = 0.

15 ERM in the realizable case. Definition: D is realizable by H if there exists some h* ∈ H such that err(h*, D) = 0. Then for any x_i in the training sample, y_i = h*(x_i). So min_{h∈H} err(h, S) ≤ err(h*, S) = 0. ERM in the realizable case: find some h ∈ H such that err(h, S) = 0. For linear predictors: find an h_{w,b} that separates the positive and negative labels in the training sample. This can be done efficiently; we will see two efficient methods. For linear predictors: realizable = separable.

16 ERM for separable linear predictors: Linear Programming. A linear program (LP) is a problem of the following form: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Here w ∈ R^d is the vector we wish to find, and u ∈ R^d, v ∈ R^m, A ∈ R^{m×d}; the values of u, v, A define the specific linear program. LPs can be solved efficiently and many solvers are available. In Matlab: w = linprog(-u, -A, -v).

17 ERM for separable linear predictors: Linear Programming. Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. ERM for the separable case: find a linear predictor with zero error on the training sample {(x_i, y_i)}_{i≤m}. Recall y_i ∈ {−1, +1}. Our goal: find w ∈ R^d s.t. ∀i ≤ m, sign(⟨x_i, w⟩) = y_i. This is equivalent to: find w ∈ R^d s.t. ∀i ≤ m, y_i·⟨x_i, w⟩ > 0. Problem: in the linear program we have a weak inequality ≥, not a strict >. If we use y_i·⟨x_i, w⟩ ≥ 0 here, w = 0 satisfies the constraints.

18 ERM for separable linear predictors: Linear Programming. Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Our goal: find w ∈ R^d s.t. ∀i ≤ m, y_i·⟨x_i, w⟩ > 0. We need to change the strict inequality to a weak one. If the problem is separable, there exists a solution; name one of the solutions w*. Denote γ := min_i y_i·⟨x_i, w*⟩ and note γ > 0. Define w̄ = w*/γ. Then for all i ≤ m, y_i·⟨x_i, w̄⟩ = y_i·⟨x_i, w*⟩/γ ≥ 1. Conclusion: there is a predictor w ∈ R^d such that ∀i ≤ m, y_i·⟨x_i, w⟩ ≥ 1. Also, any predictor that satisfies this has zero error on the sample, so it is a good solution.

19 ERM for separable linear predictors: Linear Programming. Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Our goal can be re-written as: find w ∈ R^d s.t. ∀i ≤ m, y_i·⟨x_i, w⟩ ≥ 1. Turn this into the form of a linear program: u = (0, ..., 0) (nothing is maximized), v = (1, ..., 1), and row i of the matrix A is y_i·x_i = (y_i·x_i(1), ..., y_i·x_i(d)). The resulting linear program: maximize 0 over w ∈ R^d, subject to A·w ≥ (1, ..., 1), where the rows of A are y_1·x_1, ..., y_m·x_m.
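A sketch of this construction using scipy.optimize.linprog (scipy and the function name erm_separable_lp are assumptions, not part of the lecture). scipy minimizes c·w subject to A_ub·w ≤ b_ub, so the slide's "maximize 0 subject to Aw ≥ v" is passed with both sides negated, mirroring the Matlab call linprog(-u, -A, -v):

import numpy as np
from scipy.optimize import linprog

def erm_separable_lp(X, y):
    """Find w with y_i <x_i, w> >= 1 for all i, assuming S is separable."""
    m, d = X.shape
    A = y[:, None] * X                         # row i of A is y_i x_i
    res = linprog(c=np.zeros(d),               # u = 0: nothing is maximized
                  A_ub=-A, b_ub=-np.ones(m),   # -A w <= -v  <=>  A w >= v = (1, ..., 1)
                  bounds=[(None, None)] * d)   # w may have negative coordinates
    if not res.success:
        raise ValueError("no feasible w found; sample may not be separable")
    return res.x

# Tiny separable sample (constant-1 feature already appended, so no bias term).
X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w = erm_separable_lp(X, y)
assert np.all(y * (X @ w) >= 1 - 1e-6)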

20 ERM for separable linear predictors: Linear Programming. The LP approach is very easy to implement, but it can be slow, and it fails completely if there is even one bad label.

21 The Perceptron. The Perceptron algorithm was invented in 1958 by Rosenblatt. This version is called the Batch Perceptron. The goal remains as before: find a linear predictor with zero error on the training sample {(x_i, y_i)}_{i≤m}. Perceptron idea: work in rounds; start with a default predictor; in each round, look at a single training example; if the current predictor is wrong on this example, move the predictor in the right direction; stop when the predictor assigns the correct label to every example.

22 The Perceptron.
Batch Perceptron
input: a training sample S = ((x_1, y_1), ..., (x_m, y_m))
output: w ∈ R^d such that ∀i ≤ m, h_w(x_i) = y_i
1: w^(1) ← (0, ..., 0), t ← 1
2: while ∃i s.t. y_i·⟨w^(t), x_i⟩ ≤ 0 do
3:   w^(t+1) ← w^(t) + y_i·x_i
4:   t ← t + 1
5: end while
6: return w^(t)
Why does the update rule make sense?
y_i·⟨w^(t+1), x_i⟩ = y_i·⟨w^(t) + y_i·x_i, x_i⟩ = y_i·⟨w^(t), x_i⟩ + y_i²·⟨x_i, x_i⟩ = y_i·⟨w^(t), x_i⟩ + ‖x_i‖².
Each update moves y_i·⟨w^(t), x_i⟩ closer to being positive.
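A runnable sketch of the Batch Perceptron (numpy and the safety cap max_updates are my additions; the cap only guards against being handed a non-separable sample):

import numpy as np

def batch_perceptron(X, y, max_updates=1_000_000):
    """Batch Perceptron: returns w with y_i <w, x_i> > 0 for all i on a separable sample.

    X: (m, d) array of examples (append a constant-1 coordinate first if a bias is needed).
    y: (m,) array of labels in {-1, +1}.
    """
    m, d = X.shape
    w = np.zeros(d)                             # w^(1) = (0, ..., 0)
    for _ in range(max_updates):
        violated = np.where(y * (X @ w) <= 0)[0]
        if violated.size == 0:                  # all examples correctly classified: stop
            return w
        i = violated[0]                         # any i with y_i <w, x_i> <= 0
        w = w + y[i] * X[i]                     # update: w^(t+1) = w^(t) + y_i x_i
    raise RuntimeError("update budget exhausted; sample may not be separable")

# Example: the same tiny separable sample as in the LP sketch above.
X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w = batch_perceptron(X, y)
assert np.all(y * (X @ w) > 0)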

23 The Perceptron. Illustration in two-dimensional space (d = 2) on the board: the separator tilts in the right direction in each update, and the same example can be repeated several times. Does this always work? How many updates does it take to get an error-free separator?

24 The separation margin. Intuitively, separation is easier if the positive and negative points are far apart: far apart ⇒ there is a separator which is far from all points. We will show that the Perceptron is indeed faster in this case. First, let's make this formal. Claim: |⟨w, x⟩| / ‖w‖ is the distance between x ∈ R^d and the separator defined by w.

25 The separation margin. Claim: the distance between x and the hyperplane defined by w is |⟨w, x⟩| / ‖w‖.
Define w̄ := w/‖w‖. Then ‖w̄‖ = √⟨w̄, w̄⟩ = 1. The hyperplane is H = {v : ⟨v, w⟩ = 0} = {v : ⟨v, w̄⟩ = 0}. The distance between the hyperplane and x is min_{v∈H} ‖x − v‖.
Take v = x − ⟨w̄, x⟩·w̄. Then v ∈ H, because ⟨v, w̄⟩ = ⟨x, w̄⟩ − ⟨⟨w̄, x⟩·w̄, w̄⟩ = ⟨x, w̄⟩ − ⟨w̄, x⟩·⟨w̄, w̄⟩ = 0. So the distance is at most ‖x − v‖ = ‖⟨w̄, x⟩·w̄‖ = |⟨w̄, x⟩|·‖w̄‖ = |⟨w̄, x⟩| = |⟨w, x⟩| / ‖w‖.
Also, for any u ∈ H: ‖x − u‖² = ‖(x − v) + (v − u)‖² = ⟨(x − v) + (v − u), (x − v) + (v − u)⟩ = ‖x − v‖² + 2⟨x − v, v − u⟩ + ‖v − u‖², and ⟨x − v, v − u⟩ = ⟨⟨w̄, x⟩·w̄, v − u⟩ = ⟨w̄, x⟩·⟨w̄, v − u⟩ = 0, since both v and u lie in H. Hence ‖x − u‖ ≥ ‖x − v‖, so the distance is exactly ‖x − v‖ = |⟨w, x⟩| / ‖w‖.
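A quick numerical sanity check of the claim (the particular w and x below are arbitrary; this only illustrates the identity, it is not part of the proof):

import numpy as np

w = np.array([3.0, 4.0])
x = np.array([2.0, 1.0])

w_bar = w / np.linalg.norm(w)            # unit normal to the hyperplane {v : <v, w> = 0}
v = x - np.dot(w_bar, x) * w_bar         # the minimizer used in the proof
assert abs(np.dot(v, w)) < 1e-12         # v indeed lies on the hyperplane

dist_by_projection = np.linalg.norm(x - v)
dist_by_formula = abs(np.dot(w, x)) / np.linalg.norm(w)
assert np.isclose(dist_by_projection, dist_by_formula)   # both equal 2.0 here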

26 The separation margin. Denote R := max_i ‖x_i‖; we will normalize by this value. The minimal normalized distance of any x_i in S from the separator w is called the margin of w: γ(w) := (1/R) · min_{i≤m} y_i·⟨w, x_i⟩ / ‖w‖. (For a w that labels S correctly, y_i·⟨w, x_i⟩ = |⟨w, x_i⟩|, so this is the normalized distance of the closest point.)
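A one-function sketch of this definition (the function name is mine; it assumes w labels S correctly, so that y_i·⟨w, x_i⟩ is nonnegative):

import numpy as np

def margin(w, X, y):
    """gamma(w) = (1/R) * min_i y_i <w, x_i> / ||w||, with R = max_i ||x_i||."""
    R = np.max(np.linalg.norm(X, axis=1))
    return np.min(y * (X @ w)) / (R * np.linalg.norm(w))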

27 The separation margin. Which separators have a large margin? For any α > 0 and w ∈ R^d, α·w defines the same separator, with the same margin. So we can look at separators w such that min_{i≤m} y_i·⟨w, x_i⟩ = 1; this does not lose generality. Then (with R := max_i ‖x_i‖): γ(w) = 1 / (R·‖w‖). Small norm ‖w‖ and small R = large margin γ(w).

28 The Perceptron: Guarantee. Theorem (Theorem 9.1 in the course book). Assume that S = ((x_1, y_1), ..., (x_m, y_m)) is separable. Then: 1. When the Perceptron stops and returns w^(t), ∀i ≤ m, y_i·⟨w^(t), x_i⟩ > 0. 2. Define R := max_{i∈[m]} ‖x_i‖ and B := min{‖w‖ : ∀i ≤ m, y_i·⟨w, x_i⟩ ≥ 1}. The Perceptron performs at most (RB)² updates. Part (1) is trivial: the Perceptron never stops unless this holds. Recall γ(w) = 1/(R·‖w‖) under the normalization min_i y_i·⟨w, x_i⟩ = 1. Let γ_S be the largest margin achievable on S. Then γ_S = 1/(RB), so the number of updates is at most 1/γ_S².

29 The Perceptron: Proving the theorem. We will show that if the Perceptron runs for at least T iterations, then T ≤ (RB)². So the total number of iterations is at most (RB)². Let w* be such that ∀i, y_i·⟨w*, x_i⟩ ≥ 1 and ‖w*‖ = B. We will keep track of two quantities: ‖w^(t)‖ and ⟨w*, w^(t)⟩. We will show that the norm grows slowly while the inner product grows fast. More precisely: ‖w^(t+1)‖ ≤ √t·R and ⟨w*, w^(t+1)⟩ ≥ t. Recall that a larger ⟨w*, w^(t+1)⟩ / (‖w*‖·‖w^(t+1)‖) means a smaller angle between w* and w^(t+1). Reminder (Cauchy-Schwarz inequality): for all u, v ∈ R^d, ⟨u, v⟩ ≤ ‖u‖·‖v‖. So we will conclude: T ≤ ⟨w*, w^(T+1)⟩ ≤ ‖w*‖·‖w^(T+1)‖ ≤ B·√T·R, hence T ≤ (RB)².

30 The Perceptron: Proving the theorem. Upper bounding the norm ‖w^(T+1)‖: in iteration t, let i be the example that was used to update w^(t). Recall the Perceptron update w^(t+1) ← w^(t) + y_i·x_i, and note that y_i·⟨w^(t), x_i⟩ ≤ 0 (otherwise no update is made). Also ‖w^(1)‖² = 0. So ‖w^(t+1)‖² = ‖w^(t) + y_i·x_i‖² = ‖w^(t)‖² + 2·y_i·⟨w^(t), x_i⟩ + y_i²·‖x_i‖² ≤ ‖w^(t)‖² + R². By induction, ‖w^(T+1)‖² ≤ T·R², so ‖w^(T+1)‖ ≤ √T·R.

31 The Perceptron: Proving the theorem. Lower bounding the inner product ⟨w*, w^(T+1)⟩: since w^(1) = (0, ..., 0), we have ⟨w*, w^(1)⟩ = 0. Recall the Perceptron update w^(t+1) ← w^(t) + y_i·x_i. In each iteration ⟨w*, w^(t)⟩ increases by at least one: ⟨w*, w^(t+1)⟩ − ⟨w*, w^(t)⟩ = ⟨w*, w^(t+1) − w^(t)⟩ = ⟨w*, y_i·x_i⟩ = y_i·⟨w*, x_i⟩ ≥ 1 (from the definition of w*). After T iterations: ⟨w*, w^(T+1)⟩ = Σ_{t=1}^T (⟨w*, w^(t+1)⟩ − ⟨w*, w^(t)⟩) ≥ T. This means that our w gets closer in angle to w* at each iteration.

32 The Perceptron: Proving the theorem. We showed: ‖w^(T+1)‖ ≤ √T·R and ⟨w*, w^(T+1)⟩ ≥ T. Using Cauchy-Schwarz: T ≤ ⟨w*, w^(T+1)⟩ ≤ ‖w*‖·‖w^(T+1)‖ ≤ B·√T·R. Conclusion: T ≤ (RB)². So the Perceptron runs for at most (RB)² iterations, and when it stops, the separator it returns separates the examples in S. With γ_S := the best possible margin on S, we have 1/(RB) = γ_S, so the number of iterations is O(1/γ_S²).

33 Perceptron properties. The Perceptron processes one example at a time: low working memory. The number of updates depends on the margin; if the margin is very small, the Perceptron might take Ω(2^d) time to converge. In practice, in many natural problems the margin is large and the Perceptron is faster than LP. What if the training sample is not separable? LP will completely fail; the Perceptron can still run, but will not terminate on its own. There is no guarantee for the Perceptron in this case.

34 Linear predictors: Intermediate summary. Linear predictors are very popular because: if the sample size is Θ(d) (e.g. 10 times the dimension), the training error and the true error will probably be similar; and for many natural problems, there are linear predictors with low error. Computing the ERM for linear predictors is NP-hard in general, but in the realizable/separable case there are efficient algorithms: linear programming, and the Batch Perceptron algorithm if the margin is not too small. Next: what to do when the problem is not separable.
