Machine Learning Basics Lecture 3: Perceptron. Princeton University COS 495 Instructor: Yingyu Liang


Perceptron

Overview. Previous lectures (a principle for the loss function): use MLE to derive the loss; examples: linear regression and some linear classification models. This lecture (a principle for optimization): local improvement; examples: Perceptron and SGD.

Task: find a weight vector w* whose hyperplane (w*)^T x = 0 separates the classes, with (w*)^T x > 0 for class +1 and (w*)^T x < 0 for class -1.

Attempt. Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D. Hypothesis: f_w(x) = w^T x, with y = +1 if w^T x > 0 and y = -1 if w^T x < 0. Prediction: y = sign(f_w(x)) = sign(w^T x). Goal: minimize classification error.
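A one-line version of this prediction rule in Python/NumPy (a minimal sketch; the function name predict is my own, not from the slides):

```python
import numpy as np

def predict(w, x):
    """Prediction: y_hat = sign(f_w(x)) = sign(w^T x)."""
    return 1 if np.dot(w, x) > 0 else -1
```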

Perceptron Algorithm. Assume for simplicity that all x_i have length 1. Perceptron: figure from the lecture notes of Nina Balcan.

Intuition: correct the current mistake. If mistake on a positive example: w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x = w_t^T x + 1. If mistake on a negative example: w_{t+1}^T x = (w_t - x)^T x = w_t^T x - x^T x = w_t^T x - 1.
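Since the algorithm itself appears only in the (omitted) figure, here is a minimal sketch of the mistake-driven update in Python/NumPy, based on the update rule described above; the function name perceptron_train, the zero initialization, and the single in-order pass over the data are my own assumptions, not from the slides.

```python
import numpy as np

def perceptron_train(X, y, n_passes=1):
    """Mistake-driven Perceptron. Rows of X are assumed to have length 1; labels y are in {+1, -1}."""
    w = np.zeros(X.shape[1])                 # start from the zero vector
    mistakes = 0
    for _ in range(n_passes):
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:    # mistake (prediction disagrees with the label)
                w = w + y_t * x_t            # add x on a positive mistake, subtract x on a negative one
                mistakes += 1
    return w, mistakes
```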

The Perceptron Theorem. Suppose there exists w* that correctly classifies (x_i, y_i). W.l.o.g., all x_i and w* have length 1, so the minimum distance of any example to the decision boundary is γ = min_i |(w*)^T x_i|. Then Perceptron makes at most 1/γ^2 mistakes.

The Perceptron Theorem (remarks). The examples (x_i, y_i) need not be i.i.d.! And the mistake bound 1/γ^2 does not depend on n, the length of the data sequence.

Analysis. First look at the quantity w_t^T w*. Claim 1: w_{t+1}^T w* ≥ w_t^T w* + γ. Proof: if mistake on a positive example x, then w_{t+1}^T w* = (w_t + x)^T w* = w_t^T w* + x^T w* ≥ w_t^T w* + γ. If mistake on a negative example, then w_{t+1}^T w* = (w_t - x)^T w* = w_t^T w* - x^T w* ≥ w_t^T w* + γ.

Analysis. Next look at the quantity ||w_t||. Claim 2: ||w_{t+1}||^2 ≤ ||w_t||^2 + 1. Proof: if mistake on a positive example x, then ||w_{t+1}||^2 = ||w_t + x||^2 = ||w_t||^2 + ||x||^2 + 2 w_t^T x ≤ ||w_t||^2 + 1, since ||x||^2 = 1 and w_t^T x ≤ 0 (negative, because we made a mistake on x). The negative-example case is analogous.

Analysis: putting things together. Claim 1: w_{t+1}^T w* ≥ w_t^T w* + γ. Claim 2: ||w_{t+1}||^2 ≤ ||w_t||^2 + 1. After M mistakes: w_{M+1}^T w* ≥ γM, ||w_{M+1}|| ≤ √M, and w_{M+1}^T w* ≤ ||w_{M+1}|| (by Cauchy-Schwarz, since ||w*|| = 1). So γM ≤ √M, and thus M ≤ 1/γ^2.
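In display form, and assuming the standard initialization w_1 = 0 (so that w_1^T w* = 0 and ||w_1|| = 0), the chain of inequalities is:

```latex
\gamma M \;\le\; w_{M+1}^T w^* \;\le\; \|w_{M+1}\|\,\|w^*\| \;=\; \|w_{M+1}\| \;\le\; \sqrt{M}
\qquad\Longrightarrow\qquad M \le \frac{1}{\gamma^2}.
```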

Intuition. Claim 1 (w_{t+1}^T w* ≥ w_t^T w* + γ) says the correlation with w* gets larger. This could be because: 1. w_{t+1} gets closer to w*, or 2. w_{t+1} gets much longer. Claim 2 (||w_{t+1}||^2 ≤ ||w_t||^2 + 1) rules out the bad case 2, that w_{t+1} gets much longer.

Some side notes on Perceptron

History Figure from Pattern Recognition and Machine Learning, Bishop

Note: connectionism vs. symbolism. Symbolism: AI can be achieved by representing concepts as symbols. Examples: rule-based expert systems, formal grammars. Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks). Examples: the perceptron, larger-scale neural networks.

Symbolism example: Credit Risk Analysis. Example from the Machine Learning lecture notes by Tom Mitchell.

Connectionism example: neuron/perceptron. Figure from Pattern Recognition and Machine Learning, Bishop.

Note: connectionism vs. symbolism. "Formal theories of logical reasoning, grammar, and other higher mental faculties compel us to think of the mind as a machine for rule-based manipulation of highly structured arrays of symbols. What we know of the brain compels us to think of human information processing in terms of manipulation of a large unstructured set of numbers, the activity levels of interconnected neurons." ---- The Central Paradox of Cognition (Smolensky et al., 1992)

Note: online vs. batch. Batch: given training data (x_i, y_i), 1 ≤ i ≤ n, typically i.i.d. Online: data points arrive one by one: 1. The algorithm receives an unlabeled example x_i. 2. The algorithm predicts a classification of this example. 3. The algorithm is then told the correct answer y_i, and updates its model.
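A minimal sketch of this online protocol in Python; the functions predict and update stand in for whatever model is being learned (e.g., the Perceptron above) and are assumptions of this sketch, not part of the slides.

```python
def online_learning(stream, predict, update, model):
    """Run the online protocol: receive x, predict, see the true label, update."""
    mistakes = 0
    for x_t, y_t in stream:                  # data points arrive one by one
        y_hat = predict(model, x_t)          # 1.-2. receive x_t and predict a label
        if y_hat != y_t:                     # 3. the correct answer y_t is revealed
            mistakes += 1
        model = update(model, x_t, y_t)      # update the model using (x_t, y_t)
    return model, mistakes
```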

Stochastic gradient descent (SGD)

Gradient descent. Minimize the loss L(θ), where the hypothesis is parametrized by θ. Gradient descent: initialize θ_0; then θ_{t+1} = θ_t - η_t ∇L(θ_t).
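A minimal sketch of this update in Python/NumPy, assuming the full gradient is available as a function grad_L (a hypothetical name) and using a fixed step size for simplicity:

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, n_steps=100):
    """Gradient descent: theta_{t+1} = theta_t - eta * grad_L(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad_L(theta)
    return theta
```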

Stochastic gradient descent (SGD). Suppose data points arrive one by one. L(θ) = (1/n) Σ_{t=1}^n l(θ, x_t, y_t), but we only know l(θ, x_t, y_t) at time t. Idea: simply do what you can based on local information. Initialize θ_0; then θ_{t+1} = θ_t - η_t ∇l(θ_t, x_t, y_t).
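A minimal sketch of SGD in Python/NumPy, assuming the per-example gradient is available as a function grad_l (a hypothetical name) and again using a fixed step size:

```python
import numpy as np

def sgd(grad_l, data, theta0, eta=0.1):
    """SGD: step along the gradient of the current example only."""
    theta = np.asarray(theta0, dtype=float)
    for x_t, y_t in data:                                # data points arrive one by one
        theta = theta - eta * grad_l(theta, x_t, y_t)    # theta_{t+1} = theta_t - eta_t * grad l(theta_t, x_t, y_t)
    return theta
```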

Example 1: linear regression. Find f_w(x) = w^T x that minimizes L(f_w) = (1/n) Σ_{t=1}^n (w^T x_t - y_t)^2. Per-example loss: l(w, x_t, y_t) = (1/n)(w^T x_t - y_t)^2. Update: w_{t+1} = w_t - η_t ∇l(w_t, x_t, y_t) = w_t - (2η_t/n)(w_t^T x_t - y_t) x_t.
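As a concrete instance of the SGD sketch above, one update step for this squared loss might look as follows (the 1/n factor follows the slide's definition of l; the function name is hypothetical):

```python
import numpy as np

def sgd_step_linear_regression(w, x_t, y_t, eta, n):
    """One SGD step for l(w, x_t, y_t) = (1/n) * (w^T x_t - y_t)^2."""
    residual = np.dot(w, x_t) - y_t
    return w - eta * (2.0 / n) * residual * x_t   # w_{t+1} = w_t - (2*eta/n)(w_t^T x_t - y_t) x_t
```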

Example 2: logistic regression. Find w that minimizes L(w) = -(1/n) Σ_{t: y_t=1} log σ(w^T x_t) - (1/n) Σ_{t: y_t=-1} log[1 - σ(w^T x_t)]. Equivalently, L(w) = -(1/n) Σ_t log σ(y_t w^T x_t), so the per-example loss is l(w, x_t, y_t) = -(1/n) log σ(y_t w^T x_t).

Example 2: logistic regression (continued). With l(w, x_t, y_t) = -(1/n) log σ(y_t w^T x_t), the update is w_{t+1} = w_t - η_t ∇l(w_t, x_t, y_t) = w_t + (η_t/n) · [σ(a)(1 - σ(a))/σ(a)] · y_t x_t, where a = y_t w_t^T x_t.
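A sketch of this step in Python/NumPy, using the simplification σ'(a)/σ(a) = 1 - σ(a) (function names are my own):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step_logistic_regression(w, x_t, y_t, eta, n):
    """One SGD step for l(w, x_t, y_t) = -(1/n) * log(sigmoid(y_t * w^T x_t))."""
    a = y_t * np.dot(w, x_t)
    return w + (eta / n) * (1.0 - sigmoid(a)) * y_t * x_t   # since sigma'(a)/sigma(a) = 1 - sigma(a)
```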

Example 3: Perceptron. Hypothesis: y = sign(w^T x). Define the hinge loss l(w, x_t, y_t) = -y_t w^T x_t · I[mistake on x_t], so that L(w) = -Σ_t y_t w^T x_t · I[mistake on x_t]. Update: w_{t+1} = w_t - η_t ∇l(w_t, x_t, y_t) = w_t + η_t y_t x_t · I[mistake on x_t].

Example 3: Perceptron (continued). Hypothesis: y = sign(w^T x); update w_{t+1} = w_t - η_t ∇l(w_t, x_t, y_t) = w_t + η_t y_t x_t · I[mistake on x_t]. Set η_t = 1. If mistake on a positive example: w_{t+1} = w_t + y_t x_t = w_t + x. If mistake on a negative example: w_{t+1} = w_t + y_t x_t = w_t - x. This recovers exactly the Perceptron update from earlier.
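A sketch of this SGD view of the Perceptron in Python/NumPy; with eta = 1 it reproduces the update used in perceptron_train above (the function name is my own):

```python
import numpy as np

def sgd_step_perceptron(w, x_t, y_t, eta=1.0):
    """One SGD step for l(w, x_t, y_t) = -y_t * w^T x_t * I[mistake on x_t]."""
    mistake = y_t * np.dot(w, x_t) <= 0          # indicator of a mistake on x_t
    return w + eta * y_t * x_t if mistake else w
```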

Pros & Cons. Pros: widely applicable; easy to implement in most cases; guarantees for many losses; good performance (error, running time, memory, etc.). Cons: no guarantees for non-convex optimization (e.g., the problems arising in deep learning); hyper-parameters to tune (initialization, learning rate).

Mini-batch. Instead of one data point, work with a small batch of b points (x_{tb+1}, y_{tb+1}), ..., (x_{tb+b}, y_{tb+b}). Update rule: θ_{t+1} = θ_t - η_t (1/b) Σ_{1≤i≤b} ∇l(θ_t, x_{tb+i}, y_{tb+i}). Other variants: variance reduction, etc.
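A sketch of the mini-batch update in Python/NumPy, again assuming a per-example gradient function grad_l (a hypothetical name):

```python
import numpy as np

def minibatch_sgd(grad_l, X, y, theta0, eta=0.1, b=32):
    """Mini-batch SGD: average the per-example gradients over b points per update."""
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for start in range(0, n - b + 1, b):      # batch t uses points (x_{tb+1}, y_{tb+1}), ..., (x_{tb+b}, y_{tb+b})
        grad = sum(grad_l(theta, X[i], y[i]) for i in range(start, start + b)) / b
        theta = theta - eta * grad
    return theta
```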

Homework

Homework 1. Assignment online. Course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/ Piazza: https://piazza.com/princeton/spring2016/cos495 Due date: Feb 17th (one week). Submission: math part, hand-written or printed, submit to the TA (office: EE, C319B); coding part, in Matlab/Python, submit the .m/.py file on Piazza.

Homework 1. Grading policy: every late day reduces the attainable credit for the exercise by 10%. Collaboration: discussion on the problem sets is allowed, but students are expected to finish the homework by themselves. The people you discussed with on assignments should be clearly detailed: before the solution to each question, list all people that you discussed that particular question with.