Linear classification models: Perceptron. CS534-Machine learning

Classification problem
Given input $x$, the goal is to predict $y$, which is a categorical variable. $y$ is called the class label; $x$ is the feature vector.
Examples: $x$: monthly income and bank saving amount, $y$: risky or not risky. $x$: review text for a product, $y$: sentiment (positive, negative or neutral).

Linear Classifier
We will begin with the simplest choice: linear classifiers.
[Figure: positive and negative points in the plane separated by a linear decision boundary.]

Why linear models?

Binary classification: General Setup
Given a set of training examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$, where each $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.
Learn a linear function $g(\mathbf{w}, \mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_d x_d$.
Given an example $\mathbf{x} = (x_1, \ldots, x_d)^T$: predict $y = 1$ if $g(\mathbf{w}, \mathbf{x}) \ge 0$, predict $y = -1$ otherwise.
Compactly, the classifier can be represented as $\hat{y} = \mathrm{sgn}(w_0 + w_1 x_1 + \cdots + w_d x_d) = \mathrm{sgn}(\mathbf{w}^T \mathbf{x})$, where $\mathbf{w} = (w_0, w_1, \ldots, w_d)^T$ and $\mathbf{x} = (1, x_1, \ldots, x_d)^T$.
Goal: find a good $\mathbf{w}$ that minimizes some loss function $J(\mathbf{w})$.
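To make the compact form concrete, here is a minimal NumPy sketch of the $\hat{y} = \mathrm{sgn}(\mathbf{w}^T \mathbf{x})$ classifier; the function name predict_linear and the convention of prepending the constant feature $x_0 = 1$ inside the function are ours, chosen to match the augmented $\mathbf{w}$ and $\mathbf{x}$ above.

```python
import numpy as np

def predict_linear(w, X):
    """Predict labels in {-1, +1} as sgn(w^T x) for each row of X.

    w: (d+1,) weight vector (w_0, w_1, ..., w_d)
    X: (n, d) matrix of raw feature vectors (without the constant feature)
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1
    scores = X_aug @ w                                 # g(w, x) = w^T x
    return np.where(scores >= 0, 1, -1)                # sgn, ties mapped to +1
```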

0/1 Loss
$J_{0/1}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} L_{0/1}(\mathrm{sgn}(\mathbf{w}^T \mathbf{x}_i), y_i)$, where $L(\hat{y}, y) = 0$ when $\hat{y} = y$, otherwise $L(\hat{y}, y) = 1$.
[Figure: the 0/1 loss surface over $\mathbf{w}$; each flat region corresponds to a fixed number of mistakes.]
Issue: it does not produce a useful gradient, since the surface of $J_{0/1}$ is piece-wise flat.

0/1 loss vs. the Perceptron criterion
Perceptron Loss: $J_p(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \max(0, -y_i \mathbf{w}^T \mathbf{x}_i)$
If the prediction is correct, $y_i \mathbf{w}^T \mathbf{x}_i > 0$, so $\max(0, -y_i \mathbf{w}^T \mathbf{x}_i) = 0$.
If incorrect, $y_i \mathbf{w}^T \mathbf{x}_i \le 0$, so $\max(0, -y_i \mathbf{w}^T \mathbf{x}_i) = -y_i \mathbf{w}^T \mathbf{x}_i \ge 0$, a linear function of the input features.
$J_p$ is piece-wise linear and has a nice gradient leading to the solution region.
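For illustration, the two losses might be computed side by side as in the sketch below; the names zero_one_loss and perceptron_loss are ours, and X is assumed to already include the constant feature.

```python
import numpy as np

def zero_one_loss(w, X, y):
    """J_0/1(w): fraction of examples with sgn(w^T x_i) != y_i."""
    preds = np.where(X @ w >= 0, 1, -1)
    return np.mean(preds != y)

def perceptron_loss(w, X, y):
    """J_p(w): mean of max(0, -y_i w^T x_i); zero on correctly classified points."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, -margins))
```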

Stochastic Gradient Descent
The objective function consists of a sum over the data points: $J_p(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \max(0, -y_i \mathbf{w}^T \mathbf{x}_i)$.
Stochastic gradient descent updates the parameters after observing each example. The per-example gradient is $-y_i \mathbf{x}_i$ if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$, and $0$ otherwise.
Update rule: after observing $(\mathbf{x}_i, y_i)$, if it is a mistake, $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$.

Online Perceptron (stochastic gradient descent)
Let $\mathbf{w} \leftarrow (0, 0, 0, \ldots, 0)$
Repeat until convergence:
  for every training example $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$:
    if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$: $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$
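A direct rendering of this pseudocode as a sketch; the max_epochs cap is an assumption added so the loop terminates even on non-separable data, and the step size is fixed at 1 as in the update above.

```python
import numpy as np

def online_perceptron(X, y, max_epochs=100):
    """X: (n, d+1) array with the constant feature included; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # mistake (or exactly on the boundary)
                w = w + y_i * x_i      # move w in the direction that corrects it
                mistakes += 1
        if mistakes == 0:              # a full pass with no mistakes: converged
            break
    return w
```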

When an error is made, the update moves the weight vector in a direction that corrects the error.
[Figure: decision boundaries 1, 2 and 3 after successive updates; red points belong to the positive class, blue points belong to the negative class.]

Convergence Theorem (Block, 1962; Novikoff, 1962)
Given a training example sequence $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$. If $\|\mathbf{x}_i\| \le D$ for all $i$, and there exists a weight vector $\mathbf{u}$ with $\|\mathbf{u}\| = 1$ and $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma > 0$ for all $i$, then the number of mistakes the perceptron algorithm makes is at most $D^2 / \gamma^2$. Note that $\|\cdot\|$ denotes the Euclidean norm of a vector.

Proof
Let $\mathbf{u}$ be a solution vector, i.e. $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for all $i$. We can assume $\|\mathbf{u}\| = 1$, since the scaling is an arbitrary factor and a scaled $\mathbf{u}$ is also a solution.
Let $\mathbf{w}_k$ be the weight vector when the $k$-th mistake is made, on example $(\mathbf{x}_i, y_i)$, so $\mathbf{w}_{k+1} = \mathbf{w}_k + y_i \mathbf{x}_i$, with $\mathbf{w}_1 = \mathbf{0}$.
Then $\mathbf{w}_{k+1} \cdot \mathbf{u} = \mathbf{w}_k \cdot \mathbf{u} + y_i \mathbf{x}_i \cdot \mathbf{u} \ge \mathbf{w}_k \cdot \mathbf{u} + \gamma$ (because $y_i \mathbf{x}_i \cdot \mathbf{u} \ge \gamma$).
Also $\|\mathbf{w}_{k+1}\|^2 = \|\mathbf{w}_k\|^2 + 2\, y_i \mathbf{w}_k \cdot \mathbf{x}_i + \|\mathbf{x}_i\|^2 \le \|\mathbf{w}_k\|^2 + D^2$ (because $y_i \mathbf{w}_k \cdot \mathbf{x}_i \le 0$ on a mistake, and because $\|\mathbf{x}_i\| \le D$).

Proof (cont.)
By induction on $k$: $\mathbf{w}_{k+1} \cdot \mathbf{u} \ge k\gamma$ and $\|\mathbf{w}_{k+1}\|^2 \le k D^2$.
Combining the two, $k\gamma \le \mathbf{w}_{k+1} \cdot \mathbf{u} \le \|\mathbf{w}_{k+1}\| \|\mathbf{u}\| = \|\mathbf{w}_{k+1}\| \le \sqrt{k}\, D$, hence $k \le D^2 / \gamma^2$.

Margin
$\gamma$ is referred to as the margin: the minimum distance from the data points to the decision boundary.
Bigger margin -> easier classification problem.
Bigger margin -> more confidence in our prediction.
This concept will be utilized in later methods: support vector machines.

Batch Perceptron Algorithm
Given: training examples $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$
Let $\mathbf{w} \leftarrow (0, 0, 0, \ldots, 0)$
repeat {
  delta $\leftarrow (0, 0, 0, \ldots, 0)$
  for $i = 1$ to $n$ {
    if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$: delta $\leftarrow$ delta $+\, y_i \mathbf{x}_i$
  }
  delta $\leftarrow$ delta$/n$
  $\mathbf{w} \leftarrow \mathbf{w} + \lambda\,$delta
} until $\|$delta$\| < \varepsilon$
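The same pseudocode written out as a sketch; lam and eps play the roles of $\lambda$ and $\varepsilon$ above, and their default values plus the max_iters safety cap are illustrative assumptions.

```python
import numpy as np

def batch_perceptron(X, y, lam=1.0, eps=1e-6, max_iters=1000):
    """Batch perceptron: accumulate corrections over a full pass, then apply them."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):                 # safety cap, not in the pseudocode
        delta = np.zeros_like(w)
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                delta += y_i * x_i             # accumulate the correction
        delta /= len(y)
        w = w + lam * delta                    # apply all corrections at once
        if np.linalg.norm(delta) < eps:        # no net correction left
            break
    return w
```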

Online vs. Batch Perceptron
Batch learning learns from a batch of examples collectively; online learning learns from one example at a time. Both learning mechanisms are useful in practice.
The online perceptron is sensitive to the order in which training examples are received.
In batch training, the corrections are accumulated and applied at once. In online training, each correction is applied immediately once a mistake is encountered, which changes the decision boundary; thus online and batch training may encounter different mistakes.
Online training performs stochastic gradient descent, an approximation to the true gradient descent used by batch training.

Not linearly separable case
In such cases the algorithm will never converge! How to fix it? Look for the decision boundary that makes as few mistakes as possible: NP-hard!

Fixing the Perceptron
Idea one: only go through the data once, or a fixed number of times.
Let $\mathbf{w} \leftarrow (0, 0, 0, \ldots, 0)$
Repeat for $T$ times:
  for each training example $(\mathbf{x}_i, y_i)$:
    if $y_i \mathbf{w}^T \mathbf{x}_i \le 0$: $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$
At least this stops. Problem: the final $\mathbf{w}$ might not be good, e.g. the last update could be on a total outlier.

Voted Perceptron
Keep the intermediate hypotheses and have them vote [Freund and Schapire 1998].
Let $\mathbf{w}_0 \leftarrow (0, 0, 0, \ldots, 0)$, $c_0 = 0$, $m = 0$
Repeat for $T$ times:
  for each training example $(\mathbf{x}_i, y_i)$:
    if $y_i \mathbf{w}_m^T \mathbf{x}_i \le 0$:
      $\mathbf{w}_{m+1} \leftarrow \mathbf{w}_m + y_i \mathbf{x}_i$; $c_{m+1} \leftarrow 0$; $m \leftarrow m + 1$
    else:
      $c_m \leftarrow c_m + 1$
The output is a collection of linear separators $\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_M$ along with their survival times $c_0, c_1, \ldots, c_M$. The $c$'s can be viewed as measures of the reliability of the $\mathbf{w}$'s.
For classification, take a weighted vote among all separators: $\hat{y} = \mathrm{sgn}\left(\sum_{m=0}^{M} c_m\, \mathrm{sgn}(\mathbf{w}_m^T \mathbf{x})\right)$
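A sketch of the voted perceptron along these lines; storing the $(\mathbf{w}_m, c_m)$ pairs in a Python list is our choice of representation, not prescribed by the slide.

```python
import numpy as np

def voted_perceptron(X, y, T=10):
    """Return the list of separators (w_m, c_m) produced over T passes."""
    w = np.zeros(X.shape[1])
    c = 0
    separators = []
    for _ in range(T):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                separators.append((w.copy(), c))  # retire w_m with its survival time
                w = w + y_i * x_i                  # w_{m+1}
                c = 0
            else:
                c += 1
    separators.append((w.copy(), c))               # the final surviving separator
    return separators

def voted_predict(separators, x):
    """Weighted vote: sgn( sum_m c_m * sgn(w_m^T x) )."""
    vote = sum(c * np.sign(w @ x) for w, c in separators)
    return 1 if vote >= 0 else -1
```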

Average Perceptron
The voted perceptron requires storing all intermediate weights: large memory consumption and slow prediction time.
Average perceptron: $\hat{y} = \mathrm{sgn}\left(\sum_{m=0}^{M} c_m\, \mathbf{w}_m^T \mathbf{x}\right)$: take the weighted average of all the intermediate weights.
Can be implemented by maintaining a running average, no need to store all weights. Fast prediction time.
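One way the running average might be maintained is sketched below: summing the current $\mathbf{w}$ after every example weights each intermediate vector by its survival time while keeping only two vectors in memory (an assumption about the implementation, not stated on the slide).

```python
import numpy as np

def average_perceptron(X, y, T=10):
    """Return the averaged weight vector; only two vectors are kept in memory."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(T):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w = w + y_i * x_i
            w_sum += w                  # each step adds the currently surviving w
    return w_sum / (T * len(y))

# Prediction is then a single dot product, sgn(w_avg^T x), instead of a
# vote over every stored separator as in the voted perceptron.
```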

Final Discussion
The perceptron learns $\hat{y} = f(\mathbf{x})$ directly: a discriminative method.
Gradient descent is used to optimize the perceptron loss; the online version performs stochastic gradient descent.
Guaranteed to converge in a finite number of steps if the data is linearly separable. The upper bound on the number of corrections needed is inversely proportional to the (squared) margin of the optimal decision boundary.
If not linearly separable, the voted or average perceptron can be used.
Hyper-parameter: the number of epochs $T$. A very large $T$ could still lead to overfitting.

Beyond the Basic Perceptron

Structured Prediction with Perceptrons
[Figure: several candidate parse trees (S, NP, VP, PP, N, V, P, D) for the sentence "Time flies like an arrow".]
Based on Jason Eisner's notes.

A general problem (structured prediction)
Given some input $x$: an email, a sentence. Consider a set of candidate outputs $y$:
classifications of $x$ (small number: often just 2),
taggings of $x$ (exponentially many),
parses of $x$ (exponentially many),
translations of $x$ (exponentially many).
We want to find the best $y$, given $x$.
Based on Jason Eisner's notes.

Scoring by Linear Models
Given some input $x$, consider a set of candidate outputs $y$. Define a scoring function score$(x, y)$.
Linear function: a sum of feature weights, $\mathrm{score}(x, y) = \sum_k \theta_k f_k(x, y)$ (you pick the features!).
$\theta_k$: the weight of feature $k$, learned or set by hand. $k$ ranges over all features, e.g. $k = 5$ (numbered features) or $k$ = "see Det Noun" (named features).
$f_k(x, y)$: whether $(x, y)$ has feature $k$ (0 or 1), or how many times it fires ($\ge 0$), or how strongly it fires (a real number).
Choose the $y$ that maximizes score$(x, y)$.
Based on Jason Eisner's notes.
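As a toy illustration of such a scoring function over named features, a sketch is given below; the helper name feats and the dictionary representation are hypothetical, not from the notes.

```python
def score(x, y, theta, feats):
    """Linear score: sum over features k of theta[k] * f_k(x, y).

    theta: dict mapping feature name -> weight
    feats(x, y): dict mapping feature name -> firing value (0/1, a count, or a real)
    """
    return sum(theta.get(k, 0.0) * v for k, v in feats(x, y).items())
```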

Scoring by Linear Models (cont.)
Given some input $x$, consider a set of candidate outputs $y$, and define a linear scoring function score$(x, y)$: a sum of feature weights (you pick the features; the weights are learned or set by hand).
This linear decision rule is sometimes called a perceptron. It is a structured perceptron if it does structured prediction (the number of candidates is unbounded, e.g. grows with the size of $x$).
Choose the $y$ that maximizes score$(x, y)$.
Based on Jason Eisner's notes.

Perceptron Training Algorithm
Initialize $\theta$ (usually to the zero vector).
Repeat: pick a training example $(x, y)$; the model predicts the $y^*$ that maximizes score$(x, y^*)$; update the weights by a step of size $\varepsilon > 0$: $\theta \leftarrow \theta + \varepsilon\,(f(x, y) - f(x, y^*))$.
If the model prediction was wrong ($y^* \ne y$), then we must have score$(x, y) \le$ score$(x, y^*)$ instead of $>$ as we want. Equivalently, $\theta \cdot f(x, y) \le \theta \cdot f(x, y^*)$; equivalently, $\theta \cdot (f(x, y) - f(x, y^*)) \le 0$, but we want it positive. Our update increases it by $\varepsilon\, \|f(x, y) - f(x, y^*)\|^2 \ge 0$.
Based on Jason Eisner's notes.
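Schematically, the training loop might look like the sketch below; feats(x, y) and argmax_y(theta, x) are assumed user-supplied hooks (the feature map and the inference routine discussed on the next slide), and their names are placeholders, not part of the notes.

```python
import numpy as np

def structured_perceptron(data, feats, argmax_y, dim, epochs=5, eps=1.0):
    """data: list of (x, y) pairs; feats returns a length-dim numpy feature vector.

    Outputs y are assumed comparable with != (e.g. tuples of tags).
    """
    theta = np.zeros(dim)                        # initialize theta to the zero vector
    for _ in range(epochs):
        for x, y in data:
            y_star = argmax_y(theta, x)          # inference: model's highest-scoring y*
            if y_star != y:                      # wrong prediction
                theta += eps * (feats(x, y) - feats(x, y_star))
    return theta
```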

Perceptron for Structured Prediction
What we see here is the same as the regular perceptron, with a similar convergence guarantee.
The challenge is the inference part: finding the $y$ that maximizes the score for a given $x$; we cannot resort to brute-force enumeration.
Much research goes into: how to devise proper features and efficient algorithms for inference; how to perform approximate inference; and how to learn when inference is approximate.