Perceptron (Theory) + Linear Regression

10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Perceptron (Theory) + Linear Regression. Matt Gormley, Lecture 6, Feb. 5, 2018

Q&A. Q: I can't read the chalkboard, can you write larger? A: Sure. Just raise your hand and let me know if you can't read something. Q: I'm concerned that you won't be able to read my solution in the homework template because it's so tiny, can I use my own template? A: No. However, we do all of our grading online and can zoom in to view your solution! Make it as small as you need to.

Reminders. Homework 2: Decision Trees. Out: Wed, Jan 24; Due: Mon, Feb 5 at 11:59pm. Homework 3: KNN, Perceptron, Linear Regression. Out: Mon, Feb 5; Due: Mon, Feb 12 at 11:59pm (possibly delayed by two days).

ANALYSIS OF PERCEPTRON

Geometric Margin. Definition: The margin of example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative of that distance if $x$ is on the wrong side). (Figure: margin of a positive example $x^+$ and of a negative example $x^-$ relative to the separator $w$.) Slide from Nina Balcan

Geometric Margin. Definition: The margin of example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative of that distance if $x$ is on the wrong side). Definition: The margin $\gamma_w$ of a set of examples $S$ w.r.t. a linear separator $w$ is the smallest margin over points $x \in S$. Slide from Nina Balcan

Geometric Margin. Definition: The margin of example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative of that distance if $x$ is on the wrong side). Definition: The margin $\gamma_w$ of a set of examples $S$ w.r.t. a linear separator $w$ is the smallest margin over points $x \in S$. Definition: The margin $\gamma$ of a set of examples $S$ is the maximum $\gamma_w$ over all linear separators $w$. Slide from Nina Balcan
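To make these definitions concrete, here is a minimal NumPy sketch (not from the lecture; the helper names example_margin and dataset_margin are my own) that computes the signed margin of one example and the dataset margin $\gamma_w$ for a fixed separator $w$:

```python
import numpy as np

def example_margin(w, x, y):
    """Signed distance from x to the hyperplane w . x = 0:
    positive if x lies on the correct side for label y, negative otherwise."""
    return y * np.dot(w, x) / np.linalg.norm(w)

def dataset_margin(w, X, Y):
    """Margin gamma_w of a set of examples w.r.t. w:
    the smallest per-example margin."""
    return min(example_margin(w, x, y) for x, y in zip(X, Y))

# Toy usage: two positive and two negative points in 2D.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
Y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])
print(dataset_margin(w, X, Y))  # smallest signed distance to the plane w . x = 0
```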

Linear Separability. Def: For a binary classification problem, a set of examples $S$ is linearly separable if there exists a linear decision boundary that can separate the points. (Figure: four example datasets, Cases 1 through 4.) A separability check is sketched below.
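One way to test the definition computationally (my own sketch, not part of the slides): a dataset is linearly separable exactly when the constraints $y_i(w \cdot x_i + b) \ge 1$ are feasible, which can be checked with a zero-objective linear program via scipy.optimize.linprog. The helper name is_linearly_separable is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, Y):
    """Feasibility check for y_i * (w . x_i + b) >= 1 for all i,
    phrased as a linear program with a zero objective over (w, b)."""
    N, M = X.shape
    # Each constraint row: -y_i * [x_i, 1] . [w, b] <= -1
    A_ub = -Y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    c = np.zeros(M + 1)                      # no objective, just feasibility
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    return res.success

# XOR-style labels are not linearly separable.
X_xor = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
Y_xor = np.array([+1, +1, -1, -1])
print(is_linearly_separable(X_xor, Y_xor))   # False
```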

Analysis: Perceptron. Perceptron Mistake Bound. Guarantee: If the data has margin $\gamma$ and all points lie inside a ball of radius $R$, then Perceptron makes at most $(R/\gamma)^2$ mistakes. (Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.) Slide adapted from Nina Balcan

Analysis: Perceptron. Perceptron Mistake Bound. Guarantee: If the data has margin $\gamma$ and all points lie inside a ball of radius $R$, then Perceptron makes at most $(R/\gamma)^2$ mistakes. (Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.) Def: We say that the (batch) perceptron algorithm has converged if it stops making mistakes on the training data (perfectly classifies the training data). Main Takeaway: For linearly separable data, if the perceptron algorithm cycles repeatedly through the data, it will converge in a finite number of steps. Slide adapted from Nina Balcan

Analysis: Perceptron. Perceptron Mistake Bound. Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$. Suppose: 1. Finite size inputs: $\|x^{(i)}\| \le R$. 2. Linearly separable data: $\exists \theta^*$ s.t. $\|\theta^*\| = 1$ and $y^{(i)}(\theta^* \cdot x^{(i)}) \ge \gamma$, $\forall i$. Then: The number of mistakes $k$ made by the Perceptron algorithm on this dataset satisfies $k \le (R/\gamma)^2$. Figure from Nina Balcan

Analysis: Perceptron. Proof of Perceptron Mistake Bound: We will show that there exist constants $A$ and $B$ s.t. $Ak \le \|\theta^{(k+1)}\| \le B\sqrt{k}$.

Analysis: Perceptron. Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$. Suppose: 1. Finite size inputs: $\|x^{(i)}\| \le R$. 2. Linearly separable data: $\exists \theta^*$ s.t. $\|\theta^*\| = 1$ and $y^{(i)}(\theta^* \cdot x^{(i)}) \ge \gamma$, $\forall i$. Then: The number of mistakes $k$ made by the Perceptron algorithm on this dataset satisfies $k \le (R/\gamma)^2$.

Algorithm 1: Perceptron Learning Algorithm (Online)
procedure PERCEPTRON($D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}$)
  $\theta \leftarrow 0$, $k = 1$ (initialize parameters)
  for $i \in \{1, 2, \ldots\}$ do (for each example)
    if $y^{(i)}(\theta^{(k)} \cdot x^{(i)}) \le 0$ then (if mistake)
      $\theta^{(k+1)} \leftarrow \theta^{(k)} + y^{(i)} x^{(i)}$ (update parameters)
      $k \leftarrow k + 1$
  return $\theta$
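A minimal NumPy sketch of Algorithm 1 (my own code, not from the lecture; the function name and the max_passes cutoff are added for convenience). It repeatedly cycles through the data, updating on mistakes, until a full pass is mistake-free:

```python
import numpy as np

def perceptron_train(X, Y, max_passes=100):
    """Online perceptron. X: (N, M) inputs, Y: (N,) labels in {-1, +1}.
    Returns the final weight vector and the total number of mistakes."""
    theta = np.zeros(X.shape[1])            # theta^(1) = 0
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in zip(X, Y):
            if y * np.dot(theta, x) <= 0:   # mistake (wrong side or on the boundary)
                theta = theta + y * x       # theta^(k+1) = theta^(k) + y^(i) x^(i)
                mistakes += 1
                clean_pass = False
        if clean_pass:                      # converged: no mistakes on a full pass
            break
    return theta, mistakes
```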

Analysis: Perceptron. Proof of Perceptron Mistake Bound. Part 1: for some $A$, $Ak \le \|\theta^{(k+1)}\|$.
$\theta^* \cdot \theta^{(k+1)} = \theta^* \cdot (\theta^{(k)} + y^{(i)} x^{(i)})$ by Perceptron algorithm update
$= \theta^* \cdot \theta^{(k)} + y^{(i)} (\theta^* \cdot x^{(i)})$
$\ge \theta^* \cdot \theta^{(k)} + \gamma$ by assumption
$\Rightarrow \theta^* \cdot \theta^{(k+1)} \ge k\gamma$ by induction on $k$, since $\theta^{(1)} = 0$
$\Rightarrow \|\theta^{(k+1)}\| \ge k\gamma$ since $\|\theta^*\| = 1$, by the Cauchy-Schwarz inequality

Analysis: Perceptron. Proof of Perceptron Mistake Bound. Part 2: for some $B$, $\|\theta^{(k+1)}\| \le B\sqrt{k}$.
$\|\theta^{(k+1)}\|^2 = \|\theta^{(k)} + y^{(i)} x^{(i)}\|^2$ by Perceptron algorithm update
$= \|\theta^{(k)}\|^2 + (y^{(i)})^2 \|x^{(i)}\|^2 + 2 y^{(i)} (\theta^{(k)} \cdot x^{(i)})$
$\le \|\theta^{(k)}\|^2 + (y^{(i)})^2 \|x^{(i)}\|^2$ since the $k$th mistake means $y^{(i)}(\theta^{(k)} \cdot x^{(i)}) \le 0$
$\le \|\theta^{(k)}\|^2 + R^2$ since $(y^{(i)})^2 \|x^{(i)}\|^2 = \|x^{(i)}\|^2 \le R^2$ by assumption and $(y^{(i)})^2 = 1$
$\Rightarrow \|\theta^{(k+1)}\|^2 \le k R^2$ by induction on $k$, since $\|\theta^{(1)}\|^2 = 0$
$\Rightarrow \|\theta^{(k+1)}\| \le \sqrt{k}\, R$

Analysis: Perceptron. Proof of Perceptron Mistake Bound. Part 3: Combining the bounds finishes the proof: $k\gamma \le \|\theta^{(k+1)}\| \le \sqrt{k}\, R \Rightarrow k \le (R/\gamma)^2$. The total number of mistakes must be less than this.
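As a sanity check (my own experiment, not from the slides), the sketch below reuses the hypothetical dataset_margin and perceptron_train helpers defined earlier to confirm that the number of mistakes on a random separable dataset stays below $(R/\gamma)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -1.0]) / np.sqrt(2.0)  # ||theta*|| = 1
X = rng.uniform(-1.0, 1.0, size=(200, 2))
X = X[np.abs(X @ theta_star) > 0.1]                # keep only points with margin > 0.1
Y = np.sign(X @ theta_star)

R = np.linalg.norm(X, axis=1).max()                # radius of a ball (centered at 0) containing all points
gamma = dataset_margin(theta_star, X, Y)           # margin w.r.t. theta*
theta, mistakes = perceptron_train(X, Y)
print(mistakes, "<=", (R / gamma) ** 2)            # Theorem 0.1 guarantees this holds
```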

Analysis: Perceptron. What if the data is not linearly separable? 1. Perceptron will not converge in this case (it can't!). 2. However, Freund & Schapire (1999) show that by projecting the points (hypothetically) into a higher dimensional space, we can achieve a similar bound on the number of mistakes made on one pass through the sequence of examples. Theorem 2. Let $(x_1, y_1), \ldots, (x_m, y_m)$ be a sequence of labeled examples with $\|x_i\| \le R$. Let $u$ be any vector with $\|u\| = 1$ and let $\gamma > 0$. Define the deviation of each example as $d_i = \max\{0, \gamma - y_i(u \cdot x_i)\}$, and define $D = \sqrt{\sum_{i=1}^m d_i^2}$. Then the number of mistakes of the online perceptron algorithm on this sequence is bounded by $\left(\frac{R + D}{\gamma}\right)^2$.
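For completeness, a small sketch (mine, not from the paper or the slides) of how the quantities in Theorem 2 would be computed for a chosen reference vector $u$ and margin $\gamma$; the helper name inseparable_mistake_bound is hypothetical:

```python
import numpy as np

def inseparable_mistake_bound(X, Y, u, gamma):
    """Freund & Schapire (1999) bound ((R + D) / gamma)^2, where
    d_i = max(0, gamma - y_i * (u . x_i)) and D = sqrt(sum_i d_i^2)."""
    R = np.linalg.norm(X, axis=1).max()
    d = np.maximum(0.0, gamma - Y * (X @ u))
    D = np.sqrt(np.sum(d ** 2))
    return ((R + D) / gamma) ** 2
```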

Summary: Perceptron. Perceptron is a linear classifier. Simple learning algorithm: when a mistake is made, add or subtract the features. Perceptron will converge if the data are linearly separable; it will not converge if the data are linearly inseparable. For both linearly separable and inseparable data, we can bound the number of mistakes (geometric argument). Extensions support nonlinear separators and structured prediction.

Perceptron Learning Objectives. You should be able to: explain the difference between online learning and batch learning; implement the perceptron algorithm for binary classification [CIML]; determine whether the perceptron algorithm will converge based on properties of the dataset and the limitations of the convergence guarantees; describe the inductive bias of perceptron and the limitations of linear models; draw the decision boundary of a linear model; identify whether a dataset is linearly separable or not; defend the use of a bias term in perceptron.

LINEAR REGRESSION

Linear Regression Outline. Regression Problems: definition, linear functions, residuals, notation trick: fold in the intercept. Linear Regression as Function Approximation: objective function: mean squared error; hypothesis space: linear functions. Optimization for Linear Regression: Normal Equations (closed-form solution), computational complexity, stability; SGD for Linear Regression: partial derivatives, update rule; Gradient Descent for Linear Regression. Probabilistic Interpretation of Linear Regression: generative vs. discriminative, conditional likelihood, background: Gaussian distribution, Case #1: 1D Linear Regression, Case #2: Multiple Linear Regression.

Whiteboard: Regression Problems. Definition, linear functions, residuals, notation trick: fold in the intercept.

Whiteboard: Linear Regression as Function Approximation. Objective function: mean squared error. Hypothesis space: linear functions.
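Since the whiteboard content is not transcribed, here is a small sketch (my own; one common convention, averaging the squared residuals over $N$) of the mean squared error objective for a linear hypothesis with the intercept folded in as a constant-one feature:

```python
import numpy as np

def mean_squared_error(theta, X, y):
    """J(theta) = (1/N) * sum_i (theta . x^(i) - y^(i))^2
    for the linear hypothesis h(x) = theta . x."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

# Notation trick: fold in the intercept by prepending a column of ones.
X_raw = np.array([[1.0], [2.0], [3.0]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
y = np.array([1.1, 1.9, 3.2])
print(mean_squared_error(np.array([0.0, 1.0]), X, y))  # MSE of the line y = x
```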

OPTIMIZATION FOR ML

Optimization for ML. Not quite the same setting as other fields: the function we are optimizing might not be the true goal (e.g., likelihood vs. generalization error); precision might not matter (e.g., the data is noisy, so being optimal up to 1e-16 might not help); stopping early can help generalization error (i.e., early stopping is a technique for regularization, discussed more next time).

Topographical Maps

Topographical Maps

Calculus In-Class Exercise. Plot three functions. Answer here:

Optimization for ML. Whiteboard: unconstrained optimization; convex, concave, nonconvex; derivatives; zero derivatives; gradient and Hessian.

Convexity. If the function is convex, every local optimum is a global optimum, so there is effectively only one optimum to find. Slide adapted from William Cohen

OPTIMIZATION FOR LINEAR REGRESSION

Optimization for Linear Regression. Whiteboard: closed-form solution (Normal Equations).
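The derivation itself is on the whiteboard; as a sketch of where it lands (my own code, assuming the standard normal equations $\theta = (X^\top X)^{-1} X^\top y$), the closed-form solution can be computed with a linear solve rather than an explicit matrix inverse:

```python
import numpy as np

def linreg_normal_equations(X, y):
    """Closed-form least-squares solution of the normal equations
    X^T X theta = X^T y, using a linear solve for numerical stability."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage with the intercept folded in as a constant-one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 1.9, 3.2])
theta = linreg_normal_equations(X, y)
print(theta)  # [intercept, slope] minimizing the mean squared error
```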