Machine Learning Linear Models


Outline II - Linear Models

1. Linear Regression
   (a) Linear regression: History
   (b) Linear regression with Least Squares
   (c) Matrix representation and the Normal Equation method
   (d) Iterative method: Gradient descent
   (e) Pros and cons of both methods
2. Linear Classification: Perceptron
   (a) Definition and history
   (b) Example
   (c) Algorithm

Supervised Learning

Training data: examples $x$ with labels $y$: $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$.

Regression: $y$ is a real value, $y \in \mathbb{R}$. We learn $f : \mathbb{R}^d \to \mathbb{R}$; $f$ is called a regressor.

Classification: $y$ is discrete. To simplify, $y \in \{-1, +1\}$. We learn $f : \mathbb{R}^d \to \{-1, +1\}$; $f$ is called a binary classifier.

Linear Regression: History

A very popular technique, rooted in statistics. The method of least squares was used as early as 1795 by Gauss and re-invented in 1805 by Legendre. It was frequently applied in astronomy to study the large-scale structure of the universe, and it is still a very useful tool today. (Portrait: Carl Friedrich Gauss.)

Given: training data $(x_1, y_1), \ldots, (x_n, y_n)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$:

$$
\begin{array}{lccccl}
\text{example } x_1 \to & x_{11} & x_{12} & \cdots & x_{1d} & y_1 \ \text{(label)} \\
 & \vdots & \vdots & & \vdots & \vdots \\
\text{example } x_i \to & x_{i1} & x_{i2} & \cdots & x_{id} & y_i \ \text{(label)} \\
 & \vdots & \vdots & & \vdots & \vdots \\
\text{example } x_n \to & x_{n1} & x_{n2} & \cdots & x_{nd} & y_n \ \text{(label)}
\end{array}
$$

Task: learn a regression function $f : \mathbb{R}^d \to \mathbb{R}$ with $f(x) = y$.

Linear Regression: a regression model is said to be linear if it is represented by a linear function.

[Figure: for $d = 1$ the model is a line in $\mathbb{R}^2$; for $d = 2$ it is a hyperplane in $\mathbb{R}^3$. Credit: Introduction to Statistical Learning.]

Linear Regression Model:

$$f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j, \quad \text{with } \beta_j \in \mathbb{R},\ j \in \{1, \ldots, d\}$$

The $\beta$'s are called parameters, coefficients, or weights. Learning the linear model means learning the $\beta$'s.

Estimation with least squares: use the squared loss

$$\ell(y_i, f(x_i)) = (y_i - f(x_i))^2$$

We want to minimize the loss over all examples, that is, minimize the risk or cost function $R$:

$$R = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
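To make the loss and risk concrete, here is a minimal NumPy sketch (not from the lecture; the names `predict` and `risk` and the toy data are illustrative) that evaluates the linear model and the cost $R$ defined above.

```python
import numpy as np

def predict(X, beta0, beta):
    """Linear model f(x) = beta0 + sum_j beta_j * x_j, applied row-wise to X (n x d)."""
    return beta0 + X @ beta

def risk(X, y, beta0, beta):
    """Least-squares risk R = 1/(2n) * sum_i (y_i - f(x_i))^2."""
    residuals = y - predict(X, beta0, beta)
    return (residuals ** 2).sum() / (2 * len(y))

# toy data: n = 3 examples, d = 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])
print(risk(X, y, beta0=0.0, beta=np.array([1.0, 0.5])))
```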

A simple case with one feature ($d = 1$):

$$f(x) = \beta_0 + \beta_1 x$$

We want to minimize:

$$R(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2 = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Find $\beta_0$ and $\beta_1$ that minimize $R(\beta)$.

[Figure: contour and surface plots of the RSS as a function of $\beta_0$ and $\beta_1$. Credit: Introduction to Statistical Learning.]

Find $\beta_0$ and $\beta_1$ so that:

$$\operatorname*{argmin}_{\beta_0, \beta_1} \ \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Minimize $R(\beta_0, \beta_1)$, that is:

$$\frac{\partial R}{\partial \beta_0} = 0 \qquad \frac{\partial R}{\partial \beta_1} = 0$$

$$\frac{\partial R}{\partial \beta_0} = 2 \cdot \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \cdot \frac{\partial (y_i - \beta_0 - \beta_1 x_i)}{\partial \beta_0}$$

$$\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \cdot (-1) = 0$$

$$\Rightarrow \quad \beta_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i$$

Similarly, for $\beta_1$:

$$\frac{\partial R}{\partial \beta_1} = 2 \cdot \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \cdot \frac{\partial (y_i - \beta_0 - \beta_1 x_i)}{\partial \beta_1}$$

$$\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \cdot (-x_i) = 0$$

$$\Rightarrow \quad \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} \beta_0 x_i$$

Plugging $\beta_0$ into the equation for $\beta_1$:

$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$$
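As a sanity check on this closed-form solution for $d = 1$, here is a small NumPy sketch (illustrative, not part of the lecture) that computes $\beta_1$ and then $\beta_0$ exactly as derived above.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Closed-form least-squares fit for f(x) = beta0 + beta1 * x."""
    n = len(x)
    # beta1 = (sum x_i y_i - (1/n) sum y_i sum x_i) / (sum x_i^2 - (1/n) (sum x_i)^2)
    beta1 = (np.sum(x * y) - np.sum(y) * np.sum(x) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    # beta0 = (1/n) sum y_i - beta1 * (1/n) sum x_i
    beta0 = np.mean(y) - beta1 * np.mean(x)
    return beta0, beta1

# toy example: y is roughly 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(fit_simple_linear(x, y))  # approximately (1.09, 1.94)
```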

With more than one feature:

$$f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j$$

Find the $\beta_j$ that minimize:

$$R = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{d} \beta_j x_{ij} \right)^2$$

Let's write it more elegantly with matrices!

Matrix representation

Let $X$ be an $n \times (d + 1)$ matrix where each row starts with a 1 followed by a feature vector. Let $y$ be the label vector of the training set. Let $\beta$ be the vector of weights (that we want to estimate!).

$$
X := \begin{pmatrix}
1 & x_{11} & \cdots & x_{1j} & \cdots & x_{1d} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{i1} & \cdots & x_{ij} & \cdots & x_{id} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{nj} & \cdots & x_{nd}
\end{pmatrix}
\qquad
y := \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
\qquad
\beta := \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_d \end{pmatrix}
$$

Normal Equation

We want to find the $(d + 1)$ $\beta$'s that minimize $R$. We write $R$ in matrix form:

$$R(\beta) = \frac{1}{2n} \| y - X\beta \|^2 = \frac{1}{2n} (y - X\beta)^T (y - X\beta)$$

We have that:

$$\frac{\partial R}{\partial \beta} = -\frac{1}{n} X^T (y - X\beta) \qquad \frac{\partial^2 R}{\partial \beta \, \partial \beta^T} = \frac{1}{n} X^T X$$

$X^T X$ is positive definite (when $X$ has full column rank), which ensures that the stationary point is a minimum. We solve:

$$X^T (y - X\beta) = 0$$

The unique solution is:

$$\beta = (X^T X)^{-1} X^T y$$
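A minimal NumPy sketch of the normal-equation fit, under the assumption that $X^T X$ is invertible (the function name and toy data are my own). Solving the linear system $X^T X \beta = X^T y$ is numerically preferable to forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

def fit_normal_equation(X_raw, y):
    """Least-squares fit via the normal equation X^T X beta = X^T y.

    X_raw is n x d (features only); a column of ones is prepended for beta_0.
    """
    n = X_raw.shape[0]
    X = np.hstack([np.ones((n, 1)), X_raw])     # n x (d+1) design matrix
    beta = np.linalg.solve(X.T @ X, X.T @ y)    # assumes X^T X is invertible
    return beta                                 # beta[0] is the intercept

# toy example with d = 2 features
X_raw = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.0, 7.0])
print(fit_normal_equation(X_raw, y))
```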

Gradient descent!"#$%&'() ' 4.5' +,-./0$'12,3$' *$%&'( '

Gradient descent

Gradient descent is an optimization method. Repeat until convergence, updating all $\beta_j$ simultaneously (here $j = 0$ and $j = 1$):

$$\beta_0 := \beta_0 - \alpha \frac{\partial}{\partial \beta_0} R(\beta_0, \beta_1)$$
$$\beta_1 := \beta_1 - \alpha \frac{\partial}{\partial \beta_1} R(\beta_0, \beta_1)$$

$\alpha$ is a learning rate.

Gradient descent

In the linear case:

$$\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \cdot (-1)
\qquad
\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) \cdot (-x_i)$$

Plugging these gradients into the update rule, repeat until convergence, updating all $\beta_j$ simultaneously (here $j = 0$ and $j = 1$):

$$\beta_0 := \beta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)$$
$$\beta_1 := \beta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i) \, x_i$$
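A minimal NumPy sketch of batch gradient descent for the $d = 1$ case with these updates (the learning rate, iteration count, and function name are illustrative choices, not from the lecture):

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for f(x) = beta0 + beta1 * x with squared loss."""
    beta0, beta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = beta0 + beta1 * x - y      # (f(x_i) - y_i) for all i
        grad0 = error.sum() / n            # dR/dbeta0
        grad1 = (error * x).sum() / n      # dR/dbeta1
        beta0 -= alpha * grad0             # simultaneous update
        beta1 -= alpha * grad1
    return beta0, beta1

# same toy data as before: y is roughly 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(gradient_descent_1d(x, y))  # close to the closed-form solution (~1.09, ~1.94)
```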

Pros and Cons

Analytical approach: Normal Equation
+ No need to choose a learning rate or iterate.
- Works only if $X^T X$ is invertible.
- Very slow if $d$ is large: $O(d^3)$ to compute $(X^T X)^{-1}$.

Iterative approach: Gradient Descent
+ Effective and efficient even in high dimensions.
- Iterative (sometimes needs many iterations to converge).
- Needs a choice of learning rate.

Practical considerations

1. Scaling: bring your features to a similar scale (see the sketch below): $x_i := \dfrac{x_i - \mu_i}{\mathrm{stdev}(x_i)}$
2. Learning rate: don't use a rate that is too small or too large.
3. $R$ should decrease after each iteration.
4. Declare convergence when $R$ starts decreasing by less than a small threshold.
5. What if $X^T X$ is not invertible?
   (a) Too many features compared to the number of examples (e.g., 50 examples and 500 features).
   (b) Linearly dependent features: e.g., weight in pounds and in kilograms.
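As referenced in item 1, here is a minimal sketch of the feature standardization (illustrative names; it assumes the per-feature mean and standard deviation are computed on the training set and reused for new data):

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale each feature to zero mean and unit standard deviation,
    using statistics computed on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0.0] = 1.0                  # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

X_train = np.array([[150.0, 2.0], [160.0, 3.0], [170.0, 4.0]])
X_test = np.array([[155.0, 2.5]])
X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))  # ~[0, 0] and ~[1, 1]
```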

Credit

- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 10th printing, 2009. T. Hastie, R. Tibshirani, J. Friedman.
- Machine Learning. 1997. Tom Mitchell.