Machine Learning: A Geometric Approach


Machine Learning: A Geometric Approach. Linear Classification: Perceptron. Professor Liang Huang (some slides from Alex Smola, CMU).

Perceptron Frank Rosenblatt

(concept map: the perceptron and its relatives: linear regression, SVM, multilayer perceptron / deep learning, CRF, structured perceptron)

Brief History of the Perceptron
- 1959 Rosenblatt: invention (online)
- 1962 Novikoff: convergence proof
- 1969* Minsky/Papert: book killed it (DEAD: the inseparable case)
- 1997 Cortes/Vapnik: SVM (batch; + max margin, + kernels, + soft-margin)
- 1999 Freund/Schapire: voted/averaged perceptron revived it
- 2002 Collins (AT&T Research): structured perceptron
- 2003 Crammer/Singer: MIRA (online approximation of max margin; conservative updates)
- 2005* McDonald/Crammer/Pereira: structured MIRA (ex-AT&T researchers and students)
- 2006 Singer's group: aggressive
- 2007-2010* Singer's group: Pegasos (subgradient descent, minibatch)
* mentioned in lectures but optional (the other papers are all covered in detail)

Neurons
- Soma (CPU): cell body, combines the incoming signals
- Dendrite (input bus): collects the inputs from several other nerve cells
- Synapse (interface): interface and parameter store between neurons
- Axon (output cable): may be up to 1 m long and transports the activation signal to neurons at different locations

Neurons (artificial model): inputs x_1, x_2, x_3, ..., x_n with synaptic weights w_1, ..., w_n; the output is f(x) = σ(∑_i w_i x_i) = σ(⟨w, x⟩).

Frank Rosenblatt's Perceptron

Multilayer Perceptron (Neural Net)

Perceptron w/ bias: inputs x_1, x_2, x_3, ..., x_n with synaptic weights w_1, ..., w_n; a weighted linear combination, a nonlinear decision function, and a linear offset (bias): f(x) = σ(⟨w, x⟩ + b). This gives linear separating hyperplanes; learning estimates w and b.

Perceptron w/o bias: add a constant input x_0 = 1 with weight w_0 to the inputs x_1, x_2, x_3, ..., x_n and synaptic weights w_1, ..., w_n; a weighted linear combination and a nonlinear decision function, but no linear offset (bias), so the hyperplane passes through the origin: f(x) = σ(⟨w, x⟩). This gives linear separating hyperplanes; learning estimates w only.

Augmented Space: with the extra constant dimension, a dataset that can't be separated in 1D by a hyperplane through the origin can be separated in 2D through the origin, and a dataset that can't be separated in 2D through the origin can be separated in 3D through the origin.
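A one-line identity (added here as a clarification; it is not spelled out on the slide) shows why the augmented space works: absorbing the bias as an extra weight on a constant feature turns a hyperplane with offset into one through the origin.

$$\langle w, x\rangle + b \;=\; \sum_{j=1}^{d} w_j x_j + b\cdot 1 \;=\; \big\langle (w_1,\dots,w_d,\,b),\ (x_1,\dots,x_d,\,1)\big\rangle$$

So separability with a bias in d dimensions is exactly separability through the origin in d+1 dimensions.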

Perceptron (example: ham vs. spam classification)

The Perceptron w/o bias
initialize w = 0
repeat
  if y_i ⟨w, x_i⟩ ≤ 0 then w ← w + y_i x_i end if
until all classified correctly
Nothing happens if the example is classified correctly. The weight vector is a linear combination of the examples it was updated on, w = ∑_{i∈I} y_i x_i, so the classifier is a linear combination of inner products: f(x) = σ(∑_{i∈I} y_i ⟨x_i, x⟩).

The Perceptron w/ bias
initialize w = 0 and b = 0
repeat
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then w ← w + y_i x_i and b ← b + y_i end if
until all classified correctly
Nothing happens if the example is classified correctly. The weight vector is a linear combination of the examples it was updated on, w = ∑_{i∈I} y_i x_i, so the classifier is a linear combination of inner products: f(x) = σ(∑_{i∈I} y_i ⟨x_i, x⟩ + b).
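As a concrete illustration (not part of the original slides), here is a minimal Python/NumPy sketch of the perceptron with bias; the function name and the max_epochs cap are illustrative assumptions.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # mistake (or exactly on the boundary)
                w += y_i * x_i                    # w <- w + y_i x_i
                b += y_i                          # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                         # "until all classified correctly"
            break
    return w, b
```

On separable data this loop stops after a bounded number of updates, matching the convergence theorem later in the lecture.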

Demo: update on x_i and the weight vector w (bias = 0)

Demo

Demo

Demo

Convergence Theorem: if there exists some oracle unit vector u with ‖u‖ = 1 such that y_i (u · x_i) ≥ δ for all i, then the perceptron converges to a linear separator after a number of updates bounded by R²/δ², where R = max_i ‖x_i‖. The bound is independent of the dimensionality, independent of the order of the examples (although the order matters for the output), and independent of the dataset size; it scales with the difficulty of the problem.

Geometry of the Proof, part 1: progress (alignment) on the oracle projection. Assume w_i is the weight vector before the i-th update (on ⟨x_i, y_i⟩) and assume the initial w_0 = 0. The update is w_{i+1} = w_i + y_i x_i, so
u · w_{i+1} = u · w_i + y_i (u · x_i) ≥ u · w_i + δ,
and therefore u · w_{i+1} ≥ iδ: the projection onto u increases with every update (more agreement with the oracle). Since ‖u‖ = 1, ‖w_{i+1}‖ = ‖u‖ ‖w_{i+1}‖ ≥ u · w_{i+1} ≥ iδ.

Geometry of the Proof, part 2: bound the norm of the weight vector. From w_{i+1} = w_i + y_i x_i,
‖w_{i+1}‖² = ‖w_i + y_i x_i‖² = ‖w_i‖² + ‖x_i‖² + 2 y_i (w_i · x_i) ≤ ‖w_i‖² + R²,
because we made a mistake on x_i (so y_i (w_i · x_i) ≤ 0) and ‖x_i‖ ≤ R (the radius). By induction, ‖w_{i+1}‖² ≤ iR². Combining with part 1: iδ ≤ ‖w_{i+1}‖ ≤ √i · R, hence i ≤ R²/δ².

Convergence Bound: the bound R²/δ² is independent of
- the dimensionality,
- the number of examples,
- the starting weight vector,
- the order of examples,
- a constant learning rate,
and it does depend on
- the separation difficulty (the margin δ),
- the feature scale (the radius R).
Test accuracy, however, does depend on
- the order of examples (shuffling helps),
- a variable learning rate (1/(total # errors) helps; can you still prove convergence?).

Hardness: margin vs. size (figure: hard vs. easy cases)

XOR: XOR is not linearly separable, but nonlinear separation is trivial. Caveat from Perceptrons (Minsky & Papert, 1969): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s).

Brief History of the Perceptron (the timeline slide from the start of the lecture, shown again)

Extensions of the Perceptron
Problems with the perceptron:
- doesn't converge on inseparable data
- updates are often too bold
- doesn't optimize the margin
- is sensitive to the order of examples
Ways to alleviate these problems:
- voted perceptron and averaged perceptron
- MIRA (margin-infused relaxation algorithm)

Voted/Averaged Perceptron. Motivation: updates on later examples take over!
Voted perceptron (Freund and Schapire, 1999):
- record the weight vector after each example in D (not just after each update)
- vote on a new example using the |D| models
- shown to have better generalization power
Averaged perceptron (from the same paper):
- an approximation of the voted perceptron
- just use the average of all the weight vectors
- can be implemented efficiently

Voted Perceptron

Voted/Averaged Perceptron (low dimension, less separable): test-error curves (figure)

Voted/Averaged Perceptron (high dimension, more separable): test-error curves (figure)

Averaged Perceptron: the voted perceptron is not scalable and does not output a single model; the averaged perceptron is an approximation of the voted perceptron.
initialize w = 0, w_0 = 0, and b = 0
repeat
  c ← c + 1
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then w ← w + y_i x_i and b ← b + y_i end if
  w_0 ← w_0 + w    (after each example, not each update)
until all classified correctly
output w_0 / c

Efficient Implementation of Averaging: the naive implementation (keeping a running sum of the weight vectors) doesn't scale. A very clever trick from Daumé (2006, PhD thesis):
initialize w = 0, w_a = 0, and b = 0
repeat
  c ← c + 1
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then
    w ← w + y_i x_i and b ← b + y_i
    w_a ← w_a + c · y_i x_i
  end if
until all classified correctly
output w − w_a / c
(figure: the running sum of the intermediate weight vectors w^(1), w^(2), w^(3), ... telescopes, which is why subtracting w_a / c recovers the average)
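The same trick in a short Python/NumPy sketch (an illustrative sketch of the pseudocode above, not the course's reference implementation; names and the epoch cap are assumptions):

```python
import numpy as np

def averaged_perceptron_train(X, y, max_epochs=100):
    """Averaged perceptron with the lazy-averaging trick (Daume, 2006)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    w_a, b_a = np.zeros(d), 0.0      # updates weighted by their "time stamp" c
    c = 0                            # counts examples seen, not just updates
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            c += 1
            if y_i * (np.dot(w, x_i) + b) <= 0:
                w += y_i * x_i
                b += y_i
                w_a += c * y_i * x_i   # record the update, scaled by when it happened
                b_a += c * y_i
                mistakes += 1
        if mistakes == 0:
            break
    # the average of all intermediate weight vectors equals w - w_a / c
    return w - w_a / c, b - b_a / c
```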

MIRA: the perceptron often makes too bold updates, yet the learning rate is hard to tune. What is the smallest update that corrects the mistake?
w_{i+1} = w_i + ((y_i − w_i · x_i) / ‖x_i‖²) · x_i
Easy to show: y_i (w_{i+1} · x_i) = y_i (w_i · x_i + (y_i − w_i · x_i)) = y_i · y_i = 1, i.e., after the update the functional margin on x_i is exactly 1 (a geometric margin of 1/‖x_i‖).
(figure: the perceptron overcorrects this mistake; MIRA does not)

Perceptron (figure: update on x_i with bias = 0; the perceptron undercorrects this mistake)

MIRA as minimal change to ensure a margin:
min_{w′} ‖w′ − w‖²  s.t.  y_i (w′ · x_i) ≥ 1
MIRA makes sure that after the update y_i (w · x_i) = 1, i.e., a margin of 1/‖x_i‖; MIRA is a "1-step SVM".
(figure: MIRA vs. perceptron update on x_i with bias = 0; the perceptron undercorrects this mistake)

Aggressive MIRA: the aggressive version of MIRA also updates when the example is classified correctly but the margin isn't big enough.
- functional margin: y_i (w · x_i)
- geometric margin: y_i (w · x_i) / ‖w‖
Update if the functional margin is ≤ p (with 0 ≤ p < 1); the update rule is the same as MIRA. This is called p-aggressive MIRA (plain MIRA is the special case p = 0). A larger p leads to a larger geometric margin but slower convergence.
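A hedged sketch of p-aggressive MIRA in Python/NumPy (p = 0 recovers plain MIRA); it assumes the update formula reconstructed above and omits the bias term for brevity:

```python
import numpy as np

def mira_train(X, y, p=0.0, max_epochs=100):
    """p-aggressive MIRA: update whenever the functional margin is <= p."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        updates = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= p:          # mistake, or margin not big enough
                # smallest change to w that makes y_i * (w . x_i) exactly 1
                w += (y_i - np.dot(w, x_i)) / np.dot(x_i, x_i) * x_i
                updates += 1
        if updates == 0:
            break
    return w
```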

Aggressive MIRA (figure: decision boundaries for the perceptron, p = 0.2, and p = 0.9)

Demo: perceptron vs. 0.2-aggressive vs. 0.9-aggressive MIRA (figure: decision boundaries for the perceptron, p = 0.2, and p = 0.9)

Demo: perceptron vs. 0.2-aggressive vs. 0.9-aggressive MIRA. Why is this dataset so slow to converge? (perceptron: 22 epochs, p = 0.2: 87 epochs, p = 0.9: 2,518 epochs) Answer: the margin shrinks in augmented space! A big margin in 1D becomes a small margin in 2D.

Demo: perceptron vs. 0.2-aggressive vs. 0.9-aggressive MIRA. Why does this dataset converge so fast? (perceptron: 3 epochs, p = 0.2: 1 epoch, p = 0.9: 5 epochs) Answer: the margin shrinks in augmented space here too, but not by much: a big margin in 1D stays an OK margin in 2D.

What if the data is not separable? In practice, data is almost always inseparable (wait, what does that mean??).
- perceptron cycling theorem (1970): the weights remain bounded and do not diverge
- use a dev set to decide when to stop (prevents overfitting)
- use higher-order features built by combining atomic ones
- kernels ⇒ separable in higher dimensions

Solving XOR: map (x_1, x_2) ↦ (x_1, x_2, x_1·x_2). XOR is not linearly separable in the original space, but mapping into 3 dimensions makes it easily solvable.
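A tiny sketch (the separating weights below are an illustrative choice, not from the slides) confirming that the mapped XOR points are linearly separable:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])                        # XOR labels

X3 = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])     # (x1, x2) -> (x1, x2, x1*x2)

w, b = np.array([1.0, 1.0, -2.0]), -0.5               # one hyperplane that works in 3D
print(np.sign(X3 @ w + b) == y)                       # [ True  True  True  True]
```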

Useful Engineering Tips: averaging, shuffling, variable learning rate, fixing the feature scale.
- averaging helps significantly; MIRA helps a tiny little bit: perceptron < MIRA < avg. perceptron ≈ avg. MIRA
- shuffling the data helps hugely if the classes were ordered; shuffling before each epoch helps a little bit
- a variable learning rate often helps a little: 1/(total # updates) or 1/(total # examples) helps (is any requirement needed in order to converge? how do you prove convergence now?)
- centering each feature dimension helps (why? ⇒ R smaller, margin bigger)
- unit variance also helps (why?); 0-mean, 1-variance ⇒ each feature is a unit Gaussian
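A minimal sketch of the centering / unit-variance step (the helper name is an illustrative assumption; statistics come from the training set only so that test data is transformed consistently):

```python
import numpy as np

def standardize(X_train, X_test):
    """Make every feature 0-mean and unit-variance using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0                  # leave constant features unscaled
    return (X_train - mean) / std, (X_test - mean) / std
```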

Useful Engineering Tips: categorical ⇒ binary features, feature bucketing (binning / quantization).
HW1, Adult income dataset: predict <=50K or >50K from
Age, Workclass, Education, Marital_Status, Occupation, Race, Sex, Hours, Country, Target
e.g. 40, Private, Doctorate, Married-civ-spouse, Prof-specialty, White, Male, 60, United-States, >50K
2 numerical features, age and hours-per-week:
- option 1: treat them as numerical features (but is older and more hours always better?)
- option 2: treat them as binary features, e.g., age=22, hours=38, ...
- option 3: bin them (e.g., age=0-25, hours=41-60, ...)
7 categorical features (country, race, occupation, etc.): convert them to binary features, e.g., country:United_States, education:Doctorate, ...
Optional: you can probably add a numerical feature edu_level.
Typical results: perceptron ~20% error, avg. perceptron ~16% error.
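One possible way to turn a row of the dataset into binary features (the field names, bin edges, and feature strings below are illustrative assumptions, not the homework's specification):

```python
def featurize(age, hours, country, education):
    """Return a set of binary feature names for one person."""
    feats = set()
    # option 2: numerical values as indicator features
    feats.add(f"age={age}")
    feats.add(f"hours={hours}")
    # option 3: binned versions of the numerical features
    feats.add("age=0-25" if age <= 25 else "age=26-45" if age <= 45 else "age=46+")
    feats.add("hours=0-40" if hours <= 40 else "hours=41-60" if hours <= 60 else "hours=61+")
    # categorical features become binary features directly
    feats.add(f"country:{country}")
    feats.add(f"education:{education}")
    return feats

print(featurize(40, 60, "United-States", "Doctorate"))
```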

Brief History of the Perceptron (the timeline slide from the start of the lecture, shown once more)