EE 511 Online Learning Perceptron

Instructor: Hanna Hajishirzi (hannaneh@washington.edu)
Slides adapted from Ali Farhadi, Mari Ostendorf, Pedro Domingos, Carlos Guestrin, Luke Zettlemoyer, and Kevin Jamison

Who needs probabilities?

Previously: model data with distributions
- Joint: P(X,Y), e.g. Naïve Bayes (later)
- Conditional: P(Y|X), e.g. Logistic Regression

But wait, why probabilities? Let's try to be error-driven!

mpg    cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good   4          97            75          2265    18.2          77         asia
bad    6          199           90          2648    15            70         america
bad    4          121           110         2600    12.8          77         europe
bad    8          350           175         4100    13            73         america
bad    6          198           95          3102    16.5          74         america
bad    4          108           94          2379    16.5          73         asia
bad    4          113           95          2228    14            71         asia
bad    8          302           139         3570    12.8          78         america
...
good   4          120           79          2625    18.6          82         america
bad    8          455           225         4425    10            70         america
good   4          107           86          2464    15.5          76         europe
bad    5          131           103         2830    15.9          78         europe

Generative vs. Discriminative

Generative classifiers:
- A joint probability model with evidence variables
- Query the model for causes given evidence

Discriminative classifiers:
- No generative model, no Bayes rule, maybe no probabilities at all!
- Try to predict the label Y directly from X
- Robust, accurate with varied features
- Loosely: mistake-driven rather than model-driven

Discriminative vs. generative

[Figure: three panels over x = data, "Generative model (the artist)", "Discriminative model (the lousy painter)", and "Classification function".]

Linear Classifiers

- Inputs are feature values
- Each feature has a weight
- Sum is the activation:

    activation_w(x) = Σ_i w_i x_i = w · x

- If the activation is:
  - Positive, output class 1
  - Negative, output class 2

[Figure: inputs x_1, x_2, x_3 feed a weighted sum Σ with weights w_1, w_2, w_3, followed by the test w · x > 0?]
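
A minimal sketch of this scoring rule in Python/NumPy (the function names are mine, and it uses the +1/-1 output convention of the later slides rather than "class 1 / class 2"):

```python
import numpy as np

def activation(w, x):
    """Weighted sum of feature values: activation_w(x) = sum_i w_i * x_i = w . x"""
    return np.dot(w, x)

def classify(w, x):
    """Positive activation -> +1, otherwise -1."""
    return 1 if activation(w, x) > 0 else -1
```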

Example: Spam

Imagine 3 features (spam is the positive class):
- free (number of occurrences of "free")
- money (number of occurrences of "money")
- BIAS (intercept, always has value 1)

Input "free money":  x = (BIAS: 1, free: 1, money: 1, ...)
Weights:             w = (BIAS: -3, free: 4, money: 2, ...)

w · x > 0  ⇒  SPAM!!!
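
This score can be checked with a few lines of Python over the sparse feature dictionaries (a quick sketch, with the feature names taken from the slide):

```python
# "free money" under the weights above: w . x = -3*1 + 4*1 + 2*1 = 3 > 0
x = {"BIAS": 1, "free": 1, "money": 1}
w = {"BIAS": -3, "free": 4, "money": 2}

score = sum(w[f] * v for f, v in x.items())
print(score, "SPAM" if score > 0 else "HAM")   # 3 SPAM
```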

Binary Decision Rule

In the space of feature vectors:
- Examples are points
- Any weight vector is a hyperplane
- One side corresponds to y = +1, the other to y = -1

[Figure: the free/money plane for w = (BIAS: -3, free: 4, money: 2, ...); the line w · x = 0 separates the +1 = SPAM region from the -1 = HAM region.]

Binary Perceptron Algorithm

- Start with zero weights
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.

Binary Perceptron Algorithm

- Start with zero weights: w = 0
- For t = 1..T (T passes over the data):
  - For i = 1..n (each training example):
    - Classify with current weights: y = sign(w · x_i), where sign(x) is +1 if x > 0, else -1
    - If correct (i.e., y = y_i), no change!
    - If wrong: update w = w + y_i x_i  (for y_i = -1 this is w + (-1) x_i)
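
A compact sketch of this loop in Python/NumPy, assuming labels in {-1, +1} (names are illustrative, not from the slides):

```python
import numpy as np

def perceptron_train(X, y, T):
    """Binary perceptron. X is an (n, d) array, y has entries in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)                                   # start with zero weights
    for t in range(T):                                # T passes over the data
        for i in range(n):                            # each training example
            y_hat = 1 if np.dot(w, X[i]) > 0 else -1  # y = sign(w . x_i)
            if y_hat != y[i]:                         # wrong: update
                w = w + y[i] * X[i]
    return w
```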

Examples: Perceptron Separable Case http://isl.ira.uka.de/neuralnetcourse/2004/vl_11_5/perceptron.html

Examples: Perceptron Inseparable Case http://isl.ira.uka.de/neuralnetcourse/2004/vl_11_5/perceptron.html

For t = 1..T, i = 1..n: y = sign(w · x_i); if y ≠ y_i, update w = w + y_i x_i

Data:
  x_1   x_2    y
   3     2    +1
  -2     2    -1
  -2    -3    +1

Initial: w = [0, 0]
t=1, i=1: [0,0]·[3,2] = 0, sign(0) = -1, wrong, so w = [0,0] + [3,2] = [3,2]
t=1, i=2: [3,2]·[-2,2] = -2, sign(-2) = -1, correct
t=1, i=3: [3,2]·[-2,-3] = -12, sign(-12) = -1, wrong, so w = [3,2] + [-2,-3] = [1,-1]
t=2, i=1: [1,-1]·[3,2] = 1, sign(1) = +1, correct
t=2, i=2: [1,-1]·[-2,2] = -4, sign(-4) = -1, correct
t=2, i=3: [1,-1]·[-2,-3] = 1, sign(1) = +1, correct
Converged!!!

Decision boundary: w_1 x_1 + w_2 x_2 = 0, i.e. x_1 - x_2 = 0, so the boundary is the line x_2 = x_1.
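
This trace can be reproduced with the perceptron_train sketch above:

```python
import numpy as np

X = np.array([[3, 2], [-2, 2], [-2, -3]])
y = np.array([1, -1, 1])

w = perceptron_train(X, y, T=2)   # from the sketch above
print(w)                          # [ 1. -1.], so the boundary is x_1 - x_2 = 0
```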

Multiclass Decision Rule

If we have more than two classes:
- Have a weight vector for each class
- Calculate an activation for each class: activation_w(x, y) = w_y · x
- Highest activation wins: y = argmax_y activation_w(x, y)

Example: y is {1, 2, 3}. We are fitting three planes w_1, w_2, w_3 and predict class i when w_i · x is highest.

[Figure: the three activations w_1 · x, w_2 · x, w_3 · x partition the space.]

Example: "win the vote"

Feature   x    w_SPORTS   w_POLITICS   w_TECH
BIAS      1       -2          1           2
win       1        4          2           0
game      0        4          0           2
vote      1        0          4           0
the       1        0          0           0
...

x · w_SPORTS = 2,  x · w_POLITICS = 7,  x · w_TECH = 2  ⇒  POLITICS wins!!!
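
The three activations are easy to verify (feature names and weights as on the slide):

```python
x = {"BIAS": 1, "win": 1, "game": 0, "vote": 1, "the": 1}
w = {
    "SPORTS":   {"BIAS": -2, "win": 4, "game": 4, "vote": 0, "the": 0},
    "POLITICS": {"BIAS":  1, "win": 2, "game": 0, "vote": 4, "the": 0},
    "TECH":     {"BIAS":  2, "win": 0, "game": 2, "vote": 0, "the": 0},
}

scores = {c: sum(wc[f] * v for f, v in x.items()) for c, wc in w.items()}
print(scores)                       # {'SPORTS': 2, 'POLITICS': 7, 'TECH': 2}
print(max(scores, key=scores.get))  # POLITICS
```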

The Multi-class Perceptron Alg.

- Start with zero weights
- For t = 1..T, i = 1..n (T passes over the data):
  - Classify with current weights: y = argmax_y w_y · x_i
  - If correct (y = y_i), no change!
  - If wrong: subtract the features from the weights of the predicted class and add them to the weights of the correct class:

      w_y     = w_y     - x_i
      w_{y_i} = w_{y_i} + x_i
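
A sketch of this multiclass loop with one weight vector per class stored as rows of a matrix (an illustrative implementation, assuming integer labels 0..K-1):

```python
import numpy as np

def multiclass_perceptron_train(X, y, num_classes, T):
    """Multiclass perceptron. X is (n, d); y holds integer labels 0..num_classes-1."""
    n, d = X.shape
    W = np.zeros((num_classes, d))             # one weight vector per class
    for t in range(T):
        for i in range(n):
            y_hat = int(np.argmax(W @ X[i]))   # highest activation wins
            if y_hat != y[i]:                  # wrong:
                W[y_hat] -= X[i]               #   subtract from predicted class
                W[y[i]] += X[i]                #   add to correct class
    return W
```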

Perceptron vs. logistic regression

Update rule:
- Logistic regression: needs all of the training data.
- Perceptron: if mistake, w = w + y_i x_i. Updates with only one example, ideal for online algorithms.

Linearly Separable (binary case)

The data is linearly separable with margin γ > 0 if:

    ∃w. ∀t. y_t (w · x_t) ≥ γ

For y_t = +1:  w · x_t ≥ γ
For y_t = -1:  w · x_t ≤ -γ

Mistake Bound for Perceptron

Assume the data is separable with margin γ:

    ∃w* s.t. ‖w*‖₂ = 1 and ∀t. y_t (w* · x_t) ≥ γ

(where ‖x‖₂ = sqrt(Σ_i x_i²))

Also assume there is a number R such that:

    ∀t. ‖x_t‖₂ ≤ R

Theorem: the number of mistakes (parameter updates) made by the perceptron is bounded:

    mistakes ≤ R² / γ²

Constant with respect to the number of examples!

Perceptron Convergence (by Induction)

Let w^k be the weights after the k-th update (mistake). We will show that:

    k² γ² ≤ ‖w^k‖₂² ≤ k R²

Therefore:

    k ≤ R² / γ²

Because R and γ are fixed constants that do not change as you learn, there are a finite number of updates! The proof does each bound separately (next two slides).

Lower bound

Perceptron update: w = w + y_t x_t

Remember our margin assumption: ∃w* s.t. ‖w*‖₂ = 1 and ∀t. y_t (w* · x_t) ≥ γ

By the definition of the perceptron update, for the k-th mistake, made on the t-th training example:

    w^{k+1} · w* = (w^k + y_t x_t) · w*
                 = w^k · w* + y_t (w* · x_t)
                 ≥ w^k · w* + γ

So, by induction with w^0 = 0, for all k:

    kγ ≤ w^k · w* ≤ ‖w^k‖₂,  and therefore  k² γ² ≤ ‖w^k‖₂²

because w^k · w* ≤ ‖w^k‖₂ ‖w*‖₂ and ‖w*‖₂ = 1.

Upper Bound

Perceptron update: w = w + y_t x_t

Data assumption: ∀t. ‖x_t‖₂ ≤ R

By the definition of the perceptron update, for the k-th mistake, made on the t-th training example:

    ‖w^{k+1}‖₂² = ‖w^k + y_t x_t‖₂²
                = ‖w^k‖₂² + (y_t)² ‖x_t‖₂² + 2 y_t x_t · w^k
                ≤ ‖w^k‖₂² + R²

because (y_t)² = 1 and ‖x_t‖₂ ≤ R, and 2 y_t x_t · w^k < 0 since the perceptron made an error (y_t has a different sign than x_t · w^k).

So, by induction with w^0 = 0, for all k:

    ‖w^k‖₂² ≤ k R²

Perceptron Convergence (by Induction)

Let w^k be the weights after the k-th update (mistake). We have shown that:

    k² γ² ≤ ‖w^k‖₂² ≤ k R²

Therefore:

    k ≤ R² / γ²

Because R and γ are fixed constants that do not change as you learn, there are a finite number of updates! If there is a linear separator, the perceptron will find it!!!
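
As a sanity check on the toy dataset from the trace above, any unit-norm separator gives a valid (if not necessarily tight) value of γ and hence a valid bound; here w* = [1, -1]/√2 is used as an assumed separator:

```python
import numpy as np

X = np.array([[3, 2], [-2, 2], [-2, -3]])
y = np.array([1, -1, 1])

R = max(np.linalg.norm(x_t) for x_t in X)                         # ~3.61
w_star = np.array([1.0, -1.0]) / np.sqrt(2)                       # ||w*|| = 1
gamma = min(y_t * np.dot(w_star, x_t) for x_t, y_t in zip(X, y))  # ~0.71

print(R**2 / gamma**2)   # ~26 allowed mistakes; the trace above made only 2
```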

From Logistic Regression to the Perceptron: 2 easy steps!

Logistic regression (in vector notation, y is {0,1}):

    w = w + Σ_j [y_j - P(Y = 1 | x_j, w)] x_j

Perceptron, when y is {0,1}:

    w = w + [y_j - sign₀(w · x_j)] x_j,   where sign₀(x) = +1 if x > 0 and 0 otherwise

(Equivalently, the perceptron update when y is {-1, +1} is w = w + y_j x_j.)

Differences?
- Drop the Σ_j over training examples: online vs. batch learning
- Drop the distribution: probabilistic vs. error-driven learning
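
A minimal sketch contrasting the two updates for y in {0,1} (the names and the explicit step size lr for the logistic-regression step are my additions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_step(w, X, y, lr=0.1):
    """Batch step: w += lr * sum_j [y_j - P(Y=1 | x_j, w)] x_j."""
    grad = sum((y_j - sigmoid(np.dot(w, x_j))) * x_j for x_j, y_j in zip(X, y))
    return w + lr * grad

def perceptron_step(w, x_j, y_j):
    """Online, error-driven step: w += [y_j - sign0(w . x_j)] x_j, with y_j in {0,1}."""
    y_hat = 1 if np.dot(w, x_j) > 0 else 0    # sign0
    return w + (y_j - y_hat) * x_j
```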

Properties of Perceptrons

- Separability: some parameters get the training set perfectly correct
- Convergence: if the training data is separable, the perceptron will eventually converge (binary case)
- Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability:

      mistakes ≤ R² / γ²

[Figure: a linearly separable point set vs. a non-separable one.]

Problems with the Perceptron

- Noise: if the data isn't separable, weights might thrash. Averaging weight vectors over time can help (averaged perceptron); see the sketch below.
- Mediocre generalization: finds a barely separating solution
- Overtraining: test / held-out accuracy usually rises, then falls. Overtraining is a kind of overfitting.
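
One common fix mentioned above is the averaged perceptron; a minimal sketch, assuming the simple variant that averages w over every example seen (these details are my assumption, not from the slides):

```python
import numpy as np

def averaged_perceptron_train(X, y, T):
    """Averaged perceptron: return the average of w over all steps, y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(T):
        for i in range(n):
            y_hat = 1 if np.dot(w, X[i]) > 0 else -1
            if y_hat != y[i]:
                w = w + y[i] * X[i]
            w_sum += w               # accumulate after every example
    return w_sum / (T * n)           # averaged weights are less noisy
```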

Linear Separators Which of these linear separators is optimal?

Support Vector Machines

- Maximizing the margin: good according to intuition, theory, and practice
- Support vector machines (SVMs) find the separator with the max margin:

      SVM:  ∀i, y:  w_{y_i} · x_i ≥ w_y · x_i + 1
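
As an aside, a max-margin linear separator for the toy dataset above can be fit with an off-the-shelf solver; scikit-learn's LinearSVC is an assumption here, not something the slides use:

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[3, 2], [-2, 2], [-2, -3]])
y = np.array([1, -1, 1])

svm = LinearSVC(C=1000.0)         # large C: nearly hard-margin on separable data
svm.fit(X, y)
print(svm.coef_, svm.intercept_)  # compare with the perceptron's barely separating [1, -1]
```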

Two Views of Logistic Regression: Classification (more to come later in the course!)

Logistic regression:
- Parameters from gradient ascent
- Parameters: linear, probabilistic model, and discriminative
- Training: gradient ascent (usually batch), regularize to stop overfitting

The perceptron:
- Parameters from reactions to mistakes
- Parameters: discriminative interpretation
- Training: go through the data until held-out accuracy maxes out

[Figure: the data split into Training Data, Held-Out Data, and Test Data.]