Linear smoother: ŷ = S y, where sᵢⱼ = sᵢⱼ(x) (the smoothing weights depend only on the inputs), e.g. S = diag(lᵢ(x)).
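
As a concrete illustration of ŷ = S y, here is a minimal numpy sketch in which S is built from the inputs alone using k-nearest-neighbor averaging; the k-NN choice (and k = 3) is my own example, not something specified above.

```python
import numpy as np

# Minimal linear-smoother sketch: yhat = S @ y, where S depends only on x.
# The k-nearest-neighbor weights are an illustrative choice (k = 3 assumed).
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)

k = 3
S = np.zeros((len(x), len(x)))
for i in range(len(x)):
    nearest = np.argsort(np.abs(x - x[i]))[:k]  # k inputs closest to x_i
    S[i, nearest] = 1.0 / k                     # row i averages their targets

yhat = S @ y  # the smoother matrix never looks at y
```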

Online Learning: LMS and Perceptrons. Partially adapted from slides by Ryan Gabbard and Mitch Marcus (and lots of original slides by Lyle Ungar). Note: today's supplemental material is just that, supplemental; it is not required!

Why do online learning? Batch learning can be expensive for big datasets. How expensive is it to compute (XᵀX)⁻¹ for an n×p matrix X? It is also tricky to parallelize. A) n³ B) p³ C) np² D) n²p

Why do online learning? Batch learning can be expensive for big datasets. How hard is it to compute (XᵀX)⁻¹? np² to form XᵀX, p³ to invert; the inversion is tricky to parallelize. Online methods are easy in a map-reduce environment; they are often clever versions of stochastic gradient descent. Have you seen map-reduce/Hadoop? A) Yes B) No
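
To make the operation counts above concrete, here is a small numpy sketch of batch least squares via the normal equations; the sizes n and p are arbitrary choices for illustration.

```python
import numpy as np

# Batch least squares via the normal equations: forming X^T X costs O(n p^2),
# solving the p x p system costs O(p^3).  Sizes below are arbitrary.
n, p = 10_000, 50
X = np.random.randn(n, p)
y = np.random.randn(n)

XtX = X.T @ X                     # O(n p^2), the dominant cost when n >> p
Xty = X.T @ y                     # O(n p)
w = np.linalg.solve(XtX, Xty)     # O(p^3); solve rather than explicitly invert
```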

Online linear regression. Minimize Err = Σᵢ (yᵢ − wᵀxᵢ)² using stochastic gradient descent, where we look at each observation (xᵢ, yᵢ) sequentially and decrease its error Errᵢ = (yᵢ − wᵀxᵢ)². LMS (Least Mean Squares) algorithm: wᵢ₊₁ = wᵢ − (η/2) dErrᵢ/dw, and since dErrᵢ/dw = −2 (yᵢ − wᵢᵀxᵢ) xᵢ = −2 rᵢ xᵢ, this gives wᵢ₊₁ = wᵢ + η rᵢ xᵢ. Note that i is the index for both the iteration and the observation, since there is one update per observation. How do you pick the learning rate η?
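
A minimal numpy sketch of the LMS update above; the learning rate, the synthetic data, and the single pass over the data are my own illustrative choices.

```python
import numpy as np

def lms(X, y, eta=0.01):
    """One pass of the LMS update: w <- w + eta * r_i * x_i, with r_i = y_i - w.x_i."""
    w = np.zeros(X.shape[1])
    for i in range(len(y)):           # one update per observation
        r_i = y[i] - w @ X[i]         # residual on the current example
        w = w + eta * r_i * X[i]      # step that shrinks this example's error
    return w

# Usage on synthetic data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(500)
print(lms(X, y, eta=0.05))            # should land near w_true
```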

Online linear regression. LMS (Least Mean Squares) algorithm: wᵢ₊₁ = wᵢ + η rᵢ xᵢ. Converges for 0 < η < 2/λmax, where λmax is the largest eigenvalue of the covariance matrix XᵀX. The convergence rate is inversely proportional to λmax/λmin (the ratio of the extreme eigenvalues of XᵀX).
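
A small self-contained check of the stability condition (my own illustration): pick η from the spectrum of the input covariance and run one LMS pass. Normalizing XᵀX by n and the particular feature scaling are assumptions made here for the demo.

```python
import numpy as np

# Choosing the LMS step size from the stability condition 0 < eta < 2/lambda_max.
# Normalizing X^T X by n is an assumption for this illustration.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 3)) * np.array([1.0, 5.0, 0.2])  # badly scaled features
y = X @ np.array([1.0, -2.0, 0.5])

lam = np.linalg.eigvalsh(X.T @ X / len(X))
print("eigenvalue ratio:", lam.max() / lam.min())   # large ratio => slow convergence

eta = 1.0 / lam.max()          # comfortably inside (0, 2/lambda_max)
w = np.zeros(3)
for x_i, y_i in zip(X, y):     # one LMS pass; eta above 2/lambda_max would diverge
    w = w + eta * (y_i - w @ x_i) * x_i
print(w)                       # approaches the true weights [1, -2, 0.5]
```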

Online learning methods: Least Mean Squares (LMS), i.e. online regression with L2 error; the Perceptron, i.e. an online SVM with hinge loss.

Perceptron Learning Algorithm. If you were wrong, make w look more like x. What do we do if the error is zero? Of course, this only converges for linearly separable data.

Perceptron Learning Algorithm. For each observation (yᵢ, xᵢ): wᵢ₊₁ = wᵢ + η rᵢ xᵢ, where rᵢ = yᵢ − sign(wᵢᵀxᵢ) and η = ½. I.e., if we get it right: no change; if we got it wrong: wᵢ₊₁ = wᵢ + yᵢ xᵢ.
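
A minimal numpy sketch of the update above for ±1 labels; the number of epochs is an arbitrary choice.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron update from the slide: on a mistake, w <- w + y_i * x_i
    (eta = 1/2 applied to r_i = y_i - sign(w.x_i)).  Labels are +/-1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                  # several passes over the data
        for x_i, y_i in zip(X, y):
            if y_i * np.sign(w @ x_i) <= 0:  # wrong (or undecided) prediction
                w = w + y_i * x_i            # make w look more like y_i * x_i
    return w
```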

Perceptron Update. If the prediction at x₁ is wrong, what is the true label y₁? How do you update w?

Perceptron Update, Example II: for a misclassified negative example (y = −1), w = w + (−1) x.

Properties of the Simple Perceptron. You can prove that if it's possible to separate the data with a hyperplane (i.e. if the data are linearly separable), then the algorithm will converge to a separating hyperplane. Moreover, the number of mistakes M it makes is bounded by M ≤ R²/γ², where R = maxᵢ ‖xᵢ‖₂ is the size of the biggest x, and γ ≤ yᵢ w*ᵀxᵢ for all i is the margin of a unit-norm separator w* (γ > 0 if the data are separable).
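
A small empirical check of the mistake bound (my own experiment, not from the slides): generate separable data from a known unit-norm w*, count the perceptron's mistakes, and compare against R²/γ².

```python
import numpy as np

rng = np.random.default_rng(2)
w_star = np.array([0.6, 0.8])                 # unit-norm separator (assumed known)
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ w_star) > 0.1]               # keep only points with some margin
y = np.sign(X @ w_star)

w, mistakes = np.zeros(2), 0
for _ in range(50):                           # enough passes to converge here
    for x_i, y_i in zip(X, y):
        if y_i * np.sign(w @ x_i) <= 0:
            w = w + y_i * x_i
            mistakes += 1

R = np.max(np.linalg.norm(X, axis=1))         # size of the biggest x
gamma = np.min(y * (X @ w_star))              # margin of w_star
print(mistakes, (R / gamma) ** 2)             # mistakes stay below the bound
```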

Properties of the Simple Perceptron. But what if the data aren't separable? Then the perceptron is unstable and bounces around.

Voted Perceptron. Works just like a regular perceptron, except you keep track of all the intermediate models you created. When you want to classify something, you let each of the many models vote on the answer and take the majority. Often implemented after a burn-in period.
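
A sketch of the voted perceptron; weighting each intermediate model by the number of examples it survived follows the usual formulation (the slide just says the models vote), and the epoch count is an arbitrary choice.

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=10):
    """Keep every intermediate model plus the count of examples it survived."""
    models = []                         # list of (w, survival_count)
    w, c = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.sign(w @ x_i) <= 0:
                models.append((w, c))   # retire the current model
                w, c = w + y_i * x_i, 1
            else:
                c += 1
    models.append((w, c))
    return models

def voted_perceptron_predict(models, x):
    # Every stored model casts a (survival-weighted) vote; take the majority sign.
    return np.sign(sum(c * np.sign(w @ x) for w, c in models))
```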

Properties of the Voted Perceptron. Simple! Much better generalization performance than the regular perceptron, almost as good as SVMs. Can use the kernel trick. Training is as fast as the regular perceptron, but run-time is slower, since we need n models.

Averaged Perceptron. Return as your final model the average of all your intermediate models; an approximation to the voted perceptron. Again extremely simple, and it can use kernels. Nearly as fast to train and exactly as fast to run as the regular perceptron.
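
A sketch of the averaged perceptron; averaging the weight vector after every example (rather than only after updates) is my own bookkeeping choice.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Train like the regular perceptron but return the average weight vector."""
    w = np.zeros(X.shape[1])
    w_sum, steps = np.zeros_like(w), 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.sign(w @ x_i) <= 0:
                w = w + y_i * x_i          # ordinary perceptron update
            w_sum += w                     # accumulate the current model
            steps += 1
    return w_sum / steps                   # one averaged model: as fast to run as usual
```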

Many possible Perceptrons. If point xᵢ is misclassified: wᵢ₊₁ = wᵢ + η yᵢ xᵢ. There are different ways of picking the learning rate η. Standard perceptron: η = 1; guaranteed to converge to a correct answer in finite time if the points are separable (but oscillates otherwise). Alternatively, pick η to maximize the margin (wᵢᵀxᵢ) in some fashion; then you can get bounds on the error even in the non-separable case.

Can we do a better job of picking η? Perceptron: for each observation (yᵢ, xᵢ), wᵢ₊₁ = wᵢ + η rᵢ xᵢ, where rᵢ = yᵢ − sign(wᵢᵀxᵢ) and η = ½. Let's use the fact that we are actually trying to minimize a loss function.

Passive Aggressive Perceptron. Minimize the hinge loss at each observation: L(wᵢ; xᵢ, yᵢ) = 0 if yᵢ wᵢᵀxᵢ ≥ 1 (the loss is 0 if the example is correct with a margin of at least 1), and 1 − yᵢ wᵢᵀxᵢ otherwise. Pick wᵢ₊₁ to be as close as possible to wᵢ while still setting the hinge loss to zero. If point xᵢ is correctly classified with a margin of at least 1: no change. Otherwise: wᵢ₊₁ = wᵢ + η yᵢ xᵢ, where η = L(wᵢ; xᵢ, yᵢ)/‖xᵢ‖². Can prove bounds on the total hinge loss.
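
A sketch of the passive-aggressive update above; the step η = L/‖xᵢ‖² is applied only when the hinge loss is positive, which brings that example's loss exactly to zero.

```python
import numpy as np

def passive_aggressive(X, y, epochs=10):
    """Passive-aggressive perceptron: zero out each example's hinge loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            loss = max(0.0, 1.0 - y_i * (w @ x_i))   # hinge loss at this example
            if loss > 0.0:                           # passive if margin already >= 1
                eta = loss / (x_i @ x_i)             # smallest step that zeros the loss
                w = w + eta * y_i * x_i
    return w
```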

Passive-Aggressive = MIRA.

Margin-Infused Relaxed Algorithm (MIRA). Multiclass; each class has a prototype vector. Note that a prototype w lives in the same space as a feature vector x. Classify an instance by choosing the class whose prototype vector is most similar to the instance, i.e. has the greatest dot product with it. During training, when updating, make the smallest change to the prototype vectors that guarantees correct classification by a minimum margin; hence "passive-aggressive".
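
A simplified MIRA-style sketch (my own, handling only the single most-violating class): class prototypes score by dot product, and on an error the true and predicted prototypes get the smallest change that restores a margin of 1, in the spirit of the passive-aggressive update above.

```python
import numpy as np

def mira_train(X, y, n_classes, epochs=10):
    """Prototype vectors W[c]; on an error, make the smallest change to the true
    and predicted prototypes that enforces a margin of 1 (single-constraint MIRA)."""
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            scores = W @ x_i
            pred = int(np.argmax(scores))
            if pred != y_i:
                loss = 1.0 - (scores[y_i] - scores[pred])  # margin violation
                tau = loss / (2.0 * (x_i @ x_i))           # smallest sufficient step
                W[y_i] += tau * x_i                        # pull the true prototype in
                W[pred] -= tau * x_i                       # push the wrong one away
    return W

def mira_predict(W, x):
    # Choose the class whose prototype has the greatest dot product with x.
    return int(np.argmax(W @ x))
```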

What you should know. LMS: online regression. Perceptrons: an online SVM with large margin / hinge loss; nice mistake bounds (for the separable case), see the wiki; in practice, use averaged perceptrons. Passive-Aggressive perceptrons and MIRA: change w just enough to set its hinge loss to zero. What we didn't cover: feature selection.