Machine Learning. Regression. Manfred Huber

Machine Learning: Regression. Manfred Huber, 2015

Regression

Regression refers to supervised learning problems where the target output is one or more continuous values. Continuous output values imply that outputs can be interpolated and that they can be (at least partially) ordered.

Training data: D = \{ (x^{(i)}, y^{(i)}) : i \in \{1,\dots,n\} \}

The task is to learn a function (hypothesis) h such that h(x^{(i)}) is a good predictor for y^{(i)}.

Regression

Regression is often better formulated as a discriminative learning task. A continuous output would require learning a conditional probability density function with continuous conditions in the generative scenario. For discriminative learning, we explicitly form a representation for the function h that estimates the MLE/MAP value for the input x. This is a lower-dimensional function: it does not need to represent the full distribution, only the estimate.

Linear Regression

The simplest representation for regression is a linear function with a set of parameters. In a linear representation, the function is linear in terms of the function parameters but not necessarily in the features of the input x:

h_\theta(x) = \sum_j \theta_j \phi_j(x)

A special case of this is the traditional linear representation in the original data features:

h_\theta(x) = \theta_0 + \sum_{j=1}^{m} \theta_j x_j

This represents a line (more generally, a hyperplane) in the original feature space.
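As a concrete illustration (not part of the original slides), here is a minimal sketch of the basis-function form of the hypothesis, assuming a simple polynomial basis \phi_j(x) = x^j for a scalar input:

import numpy as np

def phi(x, m=3):
    # Polynomial basis functions phi_j(x) = x**j for j = 0..m
    # (phi_0(x) = 1 provides the bias term theta_0).
    return np.array([x**j for j in range(m + 1)])

def h(theta, x):
    # Hypothesis h_theta(x) = sum_j theta_j * phi_j(x) = theta^T phi(x):
    # linear in the parameters theta, non-linear in the raw input x.
    return theta @ phi(x, m=len(theta) - 1)

theta = np.array([1.0, 0.5, -0.2, 0.05])   # example parameter vector
print(h(theta, 2.0))                        # prediction for x = 2

The parameters and basis are illustrative; any set of fixed feature functions phi_j can be plugged in without changing the linear-in-the-parameters structure.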

Linear Regression

Often we will write this in vector/matrix notation:

h_\theta(x) = \sum_j \theta_j \phi_j(x) = \theta^T \phi(x)

Since we do not have a probabilistic representation, we need a different performance function. Squared error:

E(\theta) = \sum_{i=1}^{n} E(\theta, (x^{(i)}, y^{(i)})) = \sum_{i=1}^{n} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

Error Minimization

There are different ways we can minimize an error (or other performance) function.

Analytic optimization: find parameters where the derivative is 0,
\frac{\partial E(\theta)}{\partial \theta} = 0
and ensure it is a minimum and not a maximum.

Batch gradient descent:
\theta := \theta - \alpha \frac{\partial E(\theta)}{\partial \theta}

Stochastic/incremental gradient descent:
\theta := \theta - \alpha \frac{\partial E(\theta, (x^{(i)}, y^{(i)}))}{\partial \theta}

Linear Regression

Using the least squares error function we can derive a stochastic gradient descent method:

\theta := \theta - \alpha \frac{\partial}{\partial \theta} E(\theta, (x^{(i)}, y^{(i)}))
        = \theta - \alpha \frac{\partial}{\partial \theta} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
        = \theta - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta} h_\theta(x^{(i)})
        = \theta - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) \phi(x^{(i)})

This is also called the Widrow-Hoff learning rule.

Converting this to batch gradient descent:

\theta := \theta - \alpha \frac{\partial}{\partial \theta} E(\theta) = \theta - \alpha \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \phi(x^{(i)})
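These update rules translate almost directly into code. The following is a minimal sketch (an illustration, not code from the slides) of the stochastic (Widrow-Hoff) update and its batch counterpart, assuming the polynomial basis used above and hypothetical data arrays xs, ys:

import numpy as np

def phi(x, m):
    # Polynomial basis phi_j(x) = x**j, j = 0..m (same form as above).
    return np.array([x**j for j in range(m + 1)])

def sgd_step(theta, x, y, alpha=0.01):
    # Stochastic (Widrow-Hoff) update for a single example (x, y):
    # theta := theta - alpha * (h_theta(x) - y) * phi(x)
    p = phi(x, len(theta) - 1)
    return theta - alpha * (theta @ p - y) * p

def batch_gd_step(theta, xs, ys, alpha=0.01):
    # Batch gradient descent: sum the per-example gradients over the data set.
    grad = np.zeros_like(theta)
    for x, y in zip(xs, ys):
        p = phi(x, len(theta) - 1)
        grad += (theta @ p - y) * p
    return theta - alpha * grad

In practice the learning rate alpha and the number of passes over the data are tuning choices; the stochastic rule updates after every example, while the batch rule uses the full gradient per step.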

Linear Regression Example

Example from Andrew Ng, using a simple representation that is linear in the features (housing prices):

Living area (feet^2)   Price ($1000s)
2104                   400
1600                   330
2400                   369
1416                   232
3000                   540
...

[Figure: scatter plot of price (in $1000s) versus square feet]

Linear Regression Example

Using batch gradient descent.

[Figure: progress of batch gradient descent on the housing data]

Linear Regression Example

Resulting in a linear predictor.

[Figure: fitted regression line over the housing-price scatter plot, price (in $1000s) versus square feet]

Least Squares Error and Maximum Likelihood

Why is least squares a good performance function for linear regression? Under certain assumptions, least squares for linear regression yields the maximum likelihood parameters for a linear function.

Assume that the real function is linear with noise:
y^{(i)} = \theta^T \phi(x^{(i)}) + \epsilon^{(i)}

Assume that the noise is normally distributed:
\epsilon^{(i)} \sim N(0, \sigma^2)

Least Squares Error and Maximum Likelihood

Under these assumptions we can express the likelihood of the parameters:

L(\theta) = p_\theta(Y \mid X) = \prod_{i=1}^{n} p_\theta(y^{(i)} \mid x^{(i)}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{\left( y^{(i)} - \theta^T \phi(x^{(i)}) \right)^2}{2\sigma^2}}

Converting this to the log likelihood:

\log L(\theta) = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{\left( y^{(i)} - \theta^T \phi(x^{(i)}) \right)^2}{2\sigma^2}} = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T \phi(x^{(i)}) \right)^2

Maximizing the likelihood is therefore the same as minimizing the least squares error.

Analytic Optimization

To simplify notation we convert the data and target values into matrices:

\Phi(X) = \begin{pmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \phi_3(x^{(1)}) & \cdots & \phi_m(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \phi_3(x^{(2)}) & \cdots & \phi_m(x^{(2)}) \\ \phi_1(x^{(3)}) & \phi_2(x^{(3)}) & \phi_3(x^{(3)}) & \cdots & \phi_m(x^{(3)}) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(n)}) & \phi_2(x^{(n)}) & \phi_3(x^{(n)}) & \cdots & \phi_m(x^{(n)}) \end{pmatrix} ; \quad Y = \begin{pmatrix} y^{(1)T} \\ y^{(2)T} \\ y^{(3)T} \\ \vdots \\ y^{(n)T} \end{pmatrix}

Using this, the vector of predicted values h(X) can be computed as

h(X) = \Phi(X)\,\theta

Analytic Optimization

The squared error function can then be rewritten as

E(\theta) = \frac{1}{2} \left( h(X) - Y \right)^T \left( h(X) - Y \right)

To optimize, we have to calculate the derivative. The derivative with respect to a matrix of parameters is the matrix of partial derivatives with respect to all the parameters; this is also called the Jacobian:

\nabla_\theta E(\theta) = \begin{pmatrix} \frac{\partial E}{\partial \theta_{1,1}} & \cdots & \frac{\partial E}{\partial \theta_{1,k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial E}{\partial \theta_{m,1}} & \cdots & \frac{\partial E}{\partial \theta_{m,k}} \end{pmatrix}

Analytic Optimization

For linear regression the derivative can be derived analytically:

\nabla_\theta E(\theta) = \nabla_\theta \frac{1}{2} \left( h(X) - Y \right)^T \left( h(X) - Y \right) = \nabla_\theta \frac{1}{2} \left( \Phi(X)\theta - Y \right)^T \left( \Phi(X)\theta - Y \right) = \Phi(X)^T \Phi(X)\,\theta - \Phi(X)^T Y

This leads to an analytic solution:

\Phi(X)^T \Phi(X)\,\theta - \Phi(X)^T Y = 0 \;\Rightarrow\; \Phi(X)^T \Phi(X)\,\theta = \Phi(X)^T Y \;\Rightarrow\; \theta = \left( \Phi(X)^T \Phi(X) \right)^{-1} \Phi(X)^T Y = \Phi(X)^+ Y

These are the normal equations of the least squares problem. A^+ is called the Moore-Penrose pseudoinverse.
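A minimal sketch of the normal-equation solution (an illustration, not code from the slides), assuming the data have already been mapped into a design matrix Phi and target vector Y:

import numpy as np

def fit_normal_equations(Phi, Y):
    # Solve Phi^T Phi theta = Phi^T Y for theta.
    # Uses np.linalg.solve on the normal equations rather than forming an explicit
    # inverse; np.linalg.pinv(Phi) @ Y computes the same pseudoinverse solution.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Example: fit a line to the housing-style data from the earlier slide.
X = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
Y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])
Phi = np.column_stack([np.ones_like(X), X])   # phi(x) = (1, x)
theta = fit_normal_equations(Phi, Y)
print(theta)    # (theta_0, theta_1) of the fitted line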

Analytic Optimization

Using the pseudoinverse poses challenges:
Computing the pseudoinverse for matrices with a large number of rows can be numerically unstable.
Pseudoinverses can be ill-conditioned, leading to numerically unstable or incorrect results.
Results should always be verified.

There are more stable ways to solve this optimization problem, e.g. use of an extended system and QR factorization, or incremental optimization.
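For comparison, a sketch of a more numerically robust route (a tooling choice of this illustration, not part of the slides): NumPy's np.linalg.lstsq solves the least squares problem directly via an SVD-based LAPACK routine instead of forming Phi^T Phi explicitly:

import numpy as np

def fit_least_squares(Phi, Y):
    # Solve min_theta ||Phi theta - Y||^2 without forming Phi^T Phi,
    # which avoids squaring the condition number of the problem.
    theta, residuals, rank, sing_vals = np.linalg.lstsq(Phi, Y, rcond=None)
    return theta

When Phi has full column rank this returns the same solution as the normal equations, but it behaves better when the columns of Phi are nearly collinear.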

Linear Regression

Choosing feature functions \phi(x) allows linear regression to represent non-linear functions:
It is still linear in the learned parameters.
It cannot learn the type of non-linearity (it can only be chosen).
Adding many feature functions leads to overfitting.

Locally linear regression learns different parameters for different regions of \phi(x).

[Figure: three example fits of y versus x]

Locally Weighted Linear Regression

Different ways exist to increase the classes of functions that linear regression can form. Locally weighted linear regression combines ideas of linear regression with ideas from nonparametric representations.

The error is a weighted squared error, where the weight determines the influence of a point and is based on similarity:

E(\theta, x) = \sum_{i=1}^{n} w^{(i)}(x)\, E(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \sum_{i=1}^{n} w^{(i)}(x) \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

The weights are not parameters to be learned but fixed functions of the query point x for which the error is calculated.

Locally Weighted Linear Regression

Common weights are local to the query point, e.g. (with bandwidth \tau):

w^{(i)}(x) = e^{-\frac{\left( x^{(i)} - x \right)^2}{2\tau^2}}

Since the weights are positive, they can be pulled into the design matrix \Phi(X) and the target output Y. The problem can then be solved the same way as linear regression once the query point x is known.
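A minimal sketch of this procedure (an illustration, not code from the slides), using the Gaussian weights above and pulling them into the design matrix and targets by scaling each row with the square root of its weight:

import numpy as np

def locally_weighted_prediction(x_query, X, Y, tau=0.5):
    # Design matrix for the original features with a bias term: phi(x) = (1, x).
    Phi = np.column_stack([np.ones_like(X), X])
    # Gaussian weights w_i = exp(-(x_i - x_query)^2 / (2 tau^2)).
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    # Pull the weights into the design matrix and targets (multiply rows by sqrt(w)),
    # then solve the resulting ordinary least squares problem.
    sw = np.sqrt(w)
    theta = np.linalg.lstsq(Phi * sw[:, None], Y * sw, rcond=None)[0]
    # Predict at the query point with the locally fitted parameters.
    return np.array([1.0, x_query]) @ theta

Each query point produces its own theta, which is why the method yields a smooth curve but requires solving a small regression problem per prediction.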

Locally Weighted Linear Regression

For each query point the weights can be computed, reducing the problem to a weighted least squares problem with fixed weights. This corresponds to optimizing a different error function for each query.

Locally weighted linear regression forms one regression line for each query point, leading to a smooth regression curve built from linear regression.

It can have problems when there is only limited data for which the weight function is sufficiently large.

Linear Regression

Linear regression is a powerful technique for building maximum likelihood predictors for regression problems.

It only has to be linear in the learned parameters and can thus represent a limited range of non-linear functions in terms of the input; the choice of feature functions is important for its expressiveness.

An analytic solution can be computed but is sometimes numerically unstable. Incremental techniques can be used for on-line learning or when the analytic solution is not stable.