Lecture 4: Linear predictors and the Perceptron
1 Lecture 4: Linear predictors and the Perceptron Introduction to Learning and Analysis of Big Data Kontorovich and Sabato (BGU) Lecture 4 1 / 34
2 Inductive Bias Inductive bias is critical to prevent overfitting. Inductive bias = a relatively simple hypothesis class H. What if we don't know which H is suitable for our learning problem? Choose a good representation (relevant features) and use a general-purpose hypothesis class. One of the most popular choices: linear predictors. Kontorovich and Sabato (BGU) Lecture 4 2 / 34
3 Linear predictors Recall that in many learning problems X = R^d. Each example is a vector with d coordinates (features). In binary classification problems, Y consists of two labels. In linear prediction, H is the class of all linear separators. Illustration with d = 2 (figure on the slide). Kontorovich and Sabato (BGU) Lecture 4 3 / 34
4 Why restrict to linear predictors? If the algorithm could choose any separator, we could get overfitting: labels of unseen examples would not be predicted correctly, e.g. if using a squiggly line. Kontorovich and Sabato (BGU) Lecture 4 4 / 34
5 Preventing overfitting Linear predictors prevent overfitting: in dimension d, if the training sample size is Θ(d), then training error ≈ true prediction error. Dimension d = number of features. So, if we use linear predictors with enough training samples, does this guarantee low prediction error? Recall the No-Free-Lunch theorem... Kontorovich and Sabato (BGU) Lecture 4 5 / 34
6 Preventing overfitting is not enough No overfitting: error on training sample ≈ true prediction error on distribution. We can still have high error on the training sample. Kontorovich and Sabato (BGU) Lecture 4 6 / 34
7 Preventing overfitting is not enough Recall the decomposition of the prediction error err(ĥ_S, D): Approximation error: err_app := inf_{h ∈ H} err(h, D). Estimation error: err_est := err(ĥ_S, D) − inf_{h ∈ H} err(h, D). No overfitting ⇒ estimation error is low. But we also need low approximation error: if all linear predictors have high error, no sample size will help. In practice, linear predictors often do achieve low approximation error, provided good features are chosen to represent the examples! Kontorovich and Sabato (BGU) Lecture 4 7 / 34
8 Formalizing linear predictors To formalize linear predictors we will use inner products. Definition: for vectors x, z ∈ R^d, ⟨x, z⟩ := Σ_{i=1}^d x(i)·z(i). The length of a vector x ∈ R^d can be defined via its inner product: ‖x‖ = √(Σ_{i=1}^d x(i)²) = √⟨x, x⟩. The angle between two vectors is defined by their inner product: cos(θ) = ⟨x, z⟩ / (‖x‖·‖z‖). (Large value = small angle.) Inner products are commutative: ⟨x, z⟩ = ⟨z, x⟩. Inner products are linear: if a ∈ R and x, x′, z ∈ R^d, then ⟨a·x, z⟩ = a·⟨x, z⟩ and ⟨x + x′, z⟩ = ⟨x, z⟩ + ⟨x′, z⟩. Kontorovich and Sabato (BGU) Lecture 4 8 / 34
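These definitions map one-to-one onto numpy operations. A minimal sketch (the example vectors are made up, not from the slides):

```python
import numpy as np

x = np.array([3.0, 4.0])
z = np.array([4.0, 3.0])

inner = np.dot(x, z)                   # <x, z> = sum_i x(i)*z(i) = 24
norm_x = np.sqrt(np.dot(x, x))         # ||x|| = sqrt(<x, x>) = 5
norm_z = np.linalg.norm(z)             # the same quantity via the library helper
cos_theta = inner / (norm_x * norm_z)  # cosine of the angle between x and z

print(inner, norm_x, cos_theta)        # 24.0 5.0 0.96
```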
9 Formalizing linear predictors Call the labels Y = {−1, +1}. In 2 dimensions: the separating line is a·x(1) + b = x(2). The linear prediction rule is: y = +1 if a·x(1) + b ≥ x(2), and y = −1 if a·x(1) + b < x(2). Kontorovich and Sabato (BGU) Lecture 4 9 / 34
10 Formalizing linear predictors In two dimensions: y = +1 if a·x(1) + b ≥ x(2), and y = −1 if a·x(1) + b < x(2). Define a vector w = (w(1), w(2)) = (a, −1). We can rewrite the rule as: y = sign(w(1)·x(1) + w(2)·x(2) + b) = sign(⟨w, x⟩ + b). For a vector w ∈ R^d and a number b ∈ R, define the linear predictor h_{w,b}(x) := sign(⟨w, x⟩ + b) for x ∈ R^d; b is called the bias of the predictor. Hypothesis class of all linear predictors in dimension d: H^d_L := {h_{w,b} : w ∈ R^d, b ∈ R}. Kontorovich and Sabato (BGU) Lecture 4 10 / 34
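As a small illustrative sketch (the function name and numbers below are mine, not the course's), h_{w,b} is a two-line function:

```python
import numpy as np

def h(w, b, x):
    """Linear predictor h_{w,b}(x) = sign(<w, x> + b), with labels in {-1, +1}."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# The two-dimensional example from the slide: with w = (a, -1),
# h_{w,b}(x) = +1 exactly when a*x(1) + b >= x(2).
a, b = 2.0, 1.0
w = np.array([a, -1.0])
print(h(w, b, np.array([1.0, 2.0])))  # a*1 + b = 3 >= 2  ->  +1
print(h(w, b, np.array([1.0, 5.0])))  # a*1 + b = 3 < 5   ->  -1
```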
11 Formalizing linear predictors h_{w,b}(x) := sign(⟨w, x⟩ + b), H^d_L := {h_{w,b} : w ∈ R^d, b ∈ R}. In 3 dimensions, the linear boundary ⟨w, x⟩ + b = 0 is a plane. In higher dimensions, it is a hyperplane. The vector w is the normal to the hyperplane; |b|/‖w‖ is the distance from the origin to the hyperplane. Kontorovich and Sabato (BGU) Lecture 4 11 / 34
12 The bias b is not needed Suppose we have a classification problem with X = R^d. For every example x ∈ R^d, define x′ ∈ R^{d+1}: for x = (x(1), ..., x(d)), let x′ := (x(1), ..., x(d), 1). For every linear predictor with a bias h_{w,b} on R^d, define a linear predictor h_{w′} without a bias on R^{d+1}: w′ := (w(1), ..., w(d), b). We get h_{w,b}(x) = h_{w′}(x′) for all x, w, b: h_{w′}(x′) = sign(⟨x′, w′⟩) = sign(⟨x, w⟩ + b) = h_{w,b}(x). Conclusion: by adding a coordinate which is always 1, we can discard the bias term. Kontorovich and Sabato (BGU) Lecture 4 12 / 34
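A sketch of the same construction in code (helper names are mine): append a constant 1 to every example and append b to w, and the homogeneous predictor agrees with h_{w,b}.

```python
import numpy as np

def add_one_coordinate(x):
    """Map x in R^d to x' = (x(1), ..., x(d), 1) in R^{d+1}."""
    return np.append(x, 1.0)

def absorb_bias(w, b):
    """Map (w, b) to w' = (w(1), ..., w(d), b), a homogeneous predictor in R^{d+1}."""
    return np.append(w, b)

# Sanity check: <x', w'> = <x, w> + b, so sign(<x', w'>) = h_{w,b}(x).
x = np.array([1.0, 5.0])
w, b = np.array([2.0, -1.0]), 1.0
print(np.isclose(np.dot(add_one_coordinate(x), absorb_bias(w, b)), np.dot(w, x) + b))  # True
```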
13 Removing the bias term From one dimension with a bias to two dimensions without a bias (illustration on the slide). Linear predictors without a bias are called homogeneous. Linear predictors are also called halfspaces. Kontorovich and Sabato (BGU) Lecture 4 13 / 34
14 Implementing the ERM for linear predictors Implementing ERM: find a linear predictor h_w with minimal empirical error: err(h_w, S) = (1/m)·|{i : sign(⟨x_i, w⟩) ≠ y_i}|. This problem is NP-hard. There are workarounds (later in the course). Today: an efficient algorithm if the problem is realizable. Definition: D is realizable by H if there exists some h* ∈ H such that err(h*, D) = 0. Kontorovich and Sabato (BGU) Lecture 4 14 / 34
15 ERM in the realizable case Definition: D is realizable by H if there exists some h* ∈ H such that err(h*, D) = 0. For any x_i in the training sample, y_i = h*(x_i). So, min_{h ∈ H} err(h, S) ≤ err(h*, S) = 0. ERM in the realizable case: find some h ∈ H such that err(h, S) = 0. For linear predictors: find h_{w,b} that separates the positive and negative labels in the training sample. This can be done efficiently; we will see two efficient methods. For linear predictors: realizable = separable. Kontorovich and Sabato (BGU) Lecture 4 15 / 34
16 ERM for separable linear predictors: Linear Programming A linear program (LP) is a problem of the following form: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. w ∈ R^d: a vector we wish to find. u ∈ R^d, v ∈ R^m, A ∈ R^{m×d}. The values of u, v, A define the specific linear program. LPs can be solved efficiently; many solvers are available. In Matlab: w = linprog(-u, -A, -v). Kontorovich and Sabato (BGU) Lecture 4 16 / 34
17 ERM for separable linear predictors: Linear Programming Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. ERM for the separable case: find a linear predictor with zero error on the training sample {(x_i, y_i)}_{i ≤ m}. Recall y_i ∈ {−1, +1}. Our goal: find w ∈ R^d s.t. for all i ≤ m, sign(⟨x_i, w⟩) = y_i. This is equivalent to: find w ∈ R^d s.t. for all i ≤ m, y_i·⟨x_i, w⟩ > 0. Problem: in the linear program we have a weak inequality ≥, not a strict >. If we use ≥ 0 here, w = 0 satisfies the constraints. Kontorovich and Sabato (BGU) Lecture 4 17 / 34
18 ERM for separable linear predictors: Linear Programming Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Our goal: find w ∈ R^d s.t. for all i ≤ m, y_i·⟨x_i, w⟩ > 0. We need to change the strict inequality to a weak one. If the problem is separable, there exists a solution; name one of the solutions w*. Denote γ := min_i y_i·⟨x_i, w*⟩. Note γ > 0. Define w̄ = w*/γ. For all i ≤ m, y_i·⟨x_i, w̄⟩ = y_i·⟨x_i, w*⟩/γ ≥ 1. Conclusion: there is a predictor w ∈ R^d such that for all i ≤ m, y_i·⟨x_i, w⟩ ≥ 1. Also, any predictor that satisfies this is a good solution. Kontorovich and Sabato (BGU) Lecture 4 18 / 34
19 ERM for separable linear predictors: Linear Programming Linear program: maximize ⟨u, w⟩ over w ∈ R^d, subject to Aw ≥ v. Our goal can be re-written as: find w ∈ R^d s.t. for all i ≤ m, y_i·⟨x_i, w⟩ ≥ 1. Turn this into the form of a linear program: u = (0, ..., 0) (nothing is maximized), v = (1, ..., 1), and row i of the matrix A is y_i·x_i = (y_i·x_i(1), ..., y_i·x_i(d)). The linear program: maximize ⟨0, w⟩ over w ∈ R^d, subject to Aw ≥ (1, ..., 1), where the i-th row of A is (y_i·x_i(1), ..., y_i·x_i(d)). Kontorovich and Sabato (BGU) Lecture 4 19 / 34
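A hedged sketch of handing this feasibility problem to an off-the-shelf solver (the toy data and the choice of scipy are mine, not the slides'). scipy.optimize.linprog minimizes c·w subject to A_ub·w ≤ b_ub, so the constraints y_i·⟨x_i, w⟩ ≥ 1 are passed as (−y_i·x_i)·w ≤ −1, with unrestricted variable bounds:

```python
import numpy as np
from scipy.optimize import linprog

# A tiny separable sample (made-up data); labels are in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

# Constraints y_i * <x_i, w> >= 1, rewritten as (-y_i * x_i) . w <= -1.
A_ub = -(y[:, None] * X)
b_ub = -np.ones(m)
c = np.zeros(d)  # nothing is maximized: this is a pure feasibility problem

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
w = res.x
print(res.success, w, np.all(y * (X @ w) >= 1 - 1e-9))  # a separating w with all margins >= 1
```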
20 ERM for separable linear predictors: Linear Programming The LP approach is very easy to implement, but it can be slow. And it fails completely if there is even one bad label (the constraints become infeasible). Kontorovich and Sabato (BGU) Lecture 4 20 / 34
21 The Perceptron The Perceptron algorithm was invented in 1958 by Rosenblatt. This version is called the Batch Perceptron. The goal remains as before: find a linear predictor with zero error on the training sample {(x_i, y_i)}_{i ≤ m}. Perceptron idea: work in rounds. Start with a default predictor. In each round, look at a single training example. If the current predictor is wrong on this example, move the predictor in the right direction. Stop when the predictor assigns the correct label to all examples. Kontorovich and Sabato (BGU) Lecture 4 21 / 34
22 The Perceptron Batch Perceptron
input: a training sample S = {(x_1, y_1), ..., (x_m, y_m)}
output: w ∈ R^d such that for all i ≤ m, h_w(x_i) = y_i.
1: t ← 1; w^(1) ← (0, ..., 0)
2: while ∃ i s.t. y_i·⟨w^(t), x_i⟩ ≤ 0 do
3:   w^(t+1) ← w^(t) + y_i·x_i
4:   t ← t + 1
5: end while
6: return w^(t).
Why does the update rule make sense? y_i·⟨w^(t+1), x_i⟩ = y_i·⟨w^(t) + y_i·x_i, x_i⟩ = y_i·⟨w^(t), x_i⟩ + y_i²·⟨x_i, x_i⟩ = y_i·⟨w^(t), x_i⟩ + ‖x_i‖². Each update moves y_i·⟨w^(t), x_i⟩ closer to being positive. Kontorovich and Sabato (BGU) Lecture 4 22 / 34
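A direct translation of the pseudocode into Python (a sketch; the function and variable names are mine, and the guard against non-separable input is an addition not present on the slide):

```python
import numpy as np

def batch_perceptron(X, y, max_updates=100_000):
    """Batch Perceptron: returns w with y_i * <w, x_i> > 0 for all i (if S is separable).

    X is an (m, d) array of examples, y an (m,) array of labels in {-1, +1}.
    max_updates guards against non-separable input, on which the loop would never stop.
    """
    m, d = X.shape
    w = np.zeros(d)                          # w^(1) = (0, ..., 0)
    for _ in range(max_updates):
        margins = y * (X @ w)                # y_i * <w^(t), x_i> for every i
        mistakes = np.flatnonzero(margins <= 0)
        if mistakes.size == 0:               # every example is classified correctly
            return w
        i = mistakes[0]                      # any mistaken example will do
        w = w + y[i] * X[i]                  # w^(t+1) = w^(t) + y_i * x_i
    raise RuntimeError("no separating w found within max_updates updates")

# Usage on a tiny separable sample (made-up data):
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = batch_perceptron(X, y)
print(w, np.all(y * (X @ w) > 0))            # a separator, and True
```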
23 The Perceptron Illustration in two-dimensional space (d = 2) - on the board. The separator tilts in the right direction in each update. The same example can be repeated several times. Does this always work? How many updates does it take to get an error-free separator? Kontorovich and Sabato (BGU) Lecture 4 23 / 34
24 The separation margin Intuitively, separation is easier if positive and negative points are far apart: far apart ⇒ there is a separator which is far from all points. We will show that the Perceptron is indeed faster in this case. First, let's make this formal. Claim: |⟨w, x⟩|/‖w‖ is the distance between x ∈ R^d and the separator defined by w. Kontorovich and Sabato (BGU) Lecture 4 24 / 34
25 The separation margin Claim: the distance between x and the hyperplane defined by w is |⟨w, x⟩|/‖w‖. Define ŵ := w/‖w‖. Then ‖ŵ‖ = √⟨ŵ, ŵ⟩ = 1. The hyperplane is H = {v : ⟨v, w⟩ = 0} = {v : ⟨v, ŵ⟩ = 0}. The distance between the hyperplane and x is Δ := min_{v ∈ H} ‖x − v‖. Take v* = x − ⟨ŵ, x⟩·ŵ. Then v* ∈ H, because ⟨v*, ŵ⟩ = ⟨x, ŵ⟩ − ⟨ŵ, x⟩·⟨ŵ, ŵ⟩ = ⟨x, ŵ⟩ − ⟨ŵ, x⟩ = 0. The distance is at most ‖x − v*‖ = ‖⟨ŵ, x⟩·ŵ‖ = |⟨ŵ, x⟩|·‖ŵ‖ = |⟨ŵ, x⟩| = |⟨w, x⟩|/‖w‖. Also, for any u ∈ H: ‖x − u‖² = ‖(x − v*) + (v* − u)‖² = ⟨(x − v*) + (v* − u), (x − v*) + (v* − u)⟩ = ‖x − v*‖² + 2·⟨x − v*, v* − u⟩ + ‖v* − u‖². And ⟨x − v*, v* − u⟩ = ⟨⟨ŵ, x⟩·ŵ, v* − u⟩ = 0, since v* − u lies in H. So ‖x − u‖² ≥ ‖x − v*‖² = ⟨w, x⟩²/‖w‖², and the distance is exactly |⟨w, x⟩|/‖w‖. Kontorovich and Sabato (BGU) Lecture 4 25 / 34
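A quick numeric sanity check of the claim (made-up numbers, not from the slides): the distance from x to its projection onto the hyperplane equals |⟨w, x⟩|/‖w‖.

```python
import numpy as np

w = np.array([3.0, 4.0])
x = np.array([2.0, 1.0])

w_hat = w / np.linalg.norm(w)             # unit normal to the hyperplane
v = x - np.dot(w_hat, x) * w_hat          # projection of x onto {v : <v, w> = 0}

print(np.isclose(np.dot(v, w), 0.0))                                 # v lies on the hyperplane
print(np.linalg.norm(x - v), abs(np.dot(w, x)) / np.linalg.norm(w))  # both equal 2.0
```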
26 The separation margin Denote R := max_i ‖x_i‖. We will normalize by this value. The minimal normalized distance of any x_i in S from the separator w is called the margin of w: γ(w) := (1/R)·min_{i ≤ m} |⟨w, x_i⟩|/‖w‖. Kontorovich and Sabato (BGU) Lecture 4 26 / 34
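The margin of a given separator can be computed straight from this definition; a minimal sketch with made-up data (names are mine):

```python
import numpy as np

def margin(w, X):
    """gamma(w) = (1/R) * min_i |<w, x_i>| / ||w||, where R = max_i ||x_i||."""
    R = np.max(np.linalg.norm(X, axis=1))
    distances = np.abs(X @ w) / np.linalg.norm(w)
    return distances.min() / R

# Made-up sample; w = (1, 1) separates it, and this prints its normalized margin.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
print(margin(np.array([1.0, 1.0]), X))
```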
27 The separation margin Which separators have a large margin? For any α > 0 and w ∈ R^d, α·w defines the same separator, with the same margin. So we can look only at separators w such that min_{i ≤ m} y_i·⟨w, x_i⟩ = 1. This does not lose generality! Then (with R := max_i ‖x_i‖) γ(w) = 1/(R·‖w‖). Small norm ‖w‖ and small R = large margin γ(w). Kontorovich and Sabato (BGU) Lecture 4 27 / 34
28 The Perceptron: Guarantee Theorem (Theorem 9.1 in the course book). Assume that S = ((x_1, y_1), ..., (x_m, y_m)) is separable. Then: 1. When the Perceptron stops and returns w^(t), for all i ≤ m, y_i·⟨w^(t), x_i⟩ > 0. 2. Define R := max_{i ∈ [m]} ‖x_i‖ and B := min{‖w‖ : for all i ≤ m, y_i·⟨w, x_i⟩ ≥ 1}. The Perceptron performs at most (RB)² updates. Part (1) is trivial: the Perceptron never stops unless this holds. Recall γ(w) = 1/(R·‖w‖). Let γ_S be the largest margin achievable on S. Then γ_S = 1/(RB). Number of updates ≤ 1/γ_S². Kontorovich and Sabato (BGU) Lecture 4 28 / 34
29 The Perceptron: Proving the theorem We will show that if the Perceptron runs for at least T iterations, then T ≤ (RB)². So the total number of iterations is at most (RB)². Let w* be such that for all i, y_i·⟨w*, x_i⟩ ≥ 1 and ‖w*‖ = B. We will keep track of two quantities: ‖w^(t)‖ and ⟨w*, w^(t)⟩. We will show that the norm grows slowly, while the inner product grows fast. More precisely: ‖w^(t+1)‖ ≤ √t·R and ⟨w*, w^(t+1)⟩ ≥ t. Recall that a larger ⟨w*, w^(t+1)⟩/(‖w*‖·‖w^(t+1)‖) = a smaller angle between w* and w^(t+1). Reminder: the Cauchy-Schwarz inequality: for all u, v ∈ R^d, ⟨u, v⟩ ≤ ‖u‖·‖v‖. So we will conclude: T ≤ ⟨w*, w^(T+1)⟩ ≤ ‖w*‖·‖w^(T+1)‖ ≤ B·√T·R, hence T ≤ (RB)². Kontorovich and Sabato (BGU) Lecture 4 29 / 34
30 The Perceptron: Proving the theorem Upper bounding the norm ‖w^(T+1)‖: In iteration t, let i be the example that was used to update w^(t). Recall the Perceptron update: w^(t+1) ← w^(t) + y_i·x_i. We have y_i·⟨w^(t), x_i⟩ ≤ 0, and ‖w^(1)‖² = 0. So ‖w^(t+1)‖² = ‖w^(t) + y_i·x_i‖² = ‖w^(t)‖² + 2·y_i·⟨w^(t), x_i⟩ + y_i²·‖x_i‖² ≤ ‖w^(t)‖² + R². By induction, ‖w^(T+1)‖² ≤ T·R², so ‖w^(T+1)‖ ≤ √T·R. Kontorovich and Sabato (BGU) Lecture 4 30 / 34
31 The Perceptron: Proving the theorem Lower bounding the inner product ⟨w*, w^(T+1)⟩: w^(1) = (0, ..., 0), so ⟨w*, w^(1)⟩ = 0. Recall the Perceptron update: w^(t+1) ← w^(t) + y_i·x_i. In each iteration ⟨w*, w^(t)⟩ increases by at least one: ⟨w*, w^(t+1)⟩ − ⟨w*, w^(t)⟩ = ⟨w*, w^(t+1) − w^(t)⟩ = ⟨w*, y_i·x_i⟩ = y_i·⟨w*, x_i⟩ ≥ 1 (from the definition of w*). After T iterations: ⟨w*, w^(T+1)⟩ = Σ_{t=1}^T (⟨w*, w^(t+1)⟩ − ⟨w*, w^(t)⟩) ≥ T. This means that our w^(t) gets closer to w* at each iteration. Kontorovich and Sabato (BGU) Lecture 4 31 / 34
32 The Perceptron: Proving the theorem We showed: ‖w^(T+1)‖ ≤ √T·R and ⟨w*, w^(T+1)⟩ ≥ T. Using Cauchy-Schwarz: T ≤ ⟨w*, w^(T+1)⟩ ≤ ‖w*‖·‖w^(T+1)‖ ≤ B·√T·R. Conclusion: T ≤ (RB)². We showed: the Perceptron runs for at most (RB)² iterations, and when it stops, the separator it returns separates the examples in S. With γ_S := the best possible margin on S, we have 1/(RB) = γ_S. So, number of iterations: O(1/γ_S²). Kontorovich and Sabato (BGU) Lecture 4 32 / 34
33 Perceptron properties Processes one example at a time: low working memory. Number of updates depends on the margin. If the margin is very small, the Perceptron might take Ω(2^d) time to converge. In practice, in many natural problems, the margin is large and the Perceptron is faster than LP. What if the training sample is not separable? LP will completely fail. The Perceptron can still run, but will not terminate on its own. No guarantee for the Perceptron in this case. Kontorovich and Sabato (BGU) Lecture 4 33 / 34
34 Linear predictors: Intermediate summary Linear predictors are very popular, because: if the sample size is Θ(d) (e.g. 10 times the dimension), the training error and the true error will probably be similar; and for many natural problems, there are linear predictors with low error. Computing the ERM for linear predictors is NP-hard. But in the realizable/separable case, there are efficient algorithms: using linear programming, or the Batch Perceptron algorithm if the margin is not too small. Next: what to do when the problem is not separable. Kontorovich and Sabato (BGU) Lecture 4 34 / 34