Machine Learning - MT 2016
2. Linear Regression

Varun Kanade
University of Oxford
October 12, 2016

Announcements
- All students eligible to take the course for credit can sign up for classes and practicals
- Attempt Problem Sheet 0 (contact your class tutor if you intend to attend class in Week 2)
- Problem Sheet 1 is posted (submit by noon on 21 Oct at CS reception)

Announcement: Strachey Lecture
- Will finish 15-20 min early on Monday, October 31
- May run over by 5 minutes or so on a few other days

Outline

Goals:
- Review the supervised learning setting
- Describe the linear regression framework
- Apply the linear model to make predictions
- Derive the least squares estimate

Supervised Learning Setting
- Data consists of input and output pairs
- Inputs (also covariates, independent variables, predictors, features)
- Outputs (also variates, dependent variables, targets, labels)

Why study linear regression?
- Least squares is at least 200 years old, going back to Legendre and Gauss
- Francis Galton (1886): regression to the mean
- Often real processes can be approximated by linear models
- More complex models require understanding linear regression
- Closed-form analytic solutions can be obtained
- Many key notions of machine learning can be introduced

A toy example: Commute Times
- Want to predict commute time into the city centre
- What variables would be useful?
  - Distance to city centre
  - Day of the week

Data:

  dist (km)   day   commute time (min)
  2.7         fri   25
  4.1         mon   33
  1.0         sun   15
  5.2         tue   45
  2.8         sat   22

Linear Models

Suppose the input is a vector $x \in \mathbb{R}^D$ and the output is $y \in \mathbb{R}$. We have data $(x_i, y_i)_{i=1}^N$.
Notation: data dimension $D$, size of dataset $N$, column vectors.

Linear Model:
$$y = w_0 + x_1 w_1 + \cdots + x_D w_D + \epsilon$$
where $w_0$ is the bias/intercept and $\epsilon$ captures noise/uncertainty.

Linear Models: Commute Time

Linear Model: $y = w_0 + x_1 w_1 + \cdots + x_D w_D + \epsilon$ (bias/intercept $w_0$, noise/uncertainty $\epsilon$)

Input encoding: mon-sun has to be converted to a number
- monday: 0, tuesday: 1, ..., sunday: 6
- 0 if weekend, 1 if weekday

Say $x_1 \in \mathbb{R}$ (distance) and $x_2 \in \{0, 1\}$ (weekend/weekday). The linear model for commute time is
$$y = w_0 + w_1 x_1 + w_2 x_2 + \epsilon$$
Using 0-6 is a bad encoding. Use seven 0-1 features instead; this is called a one-hot encoding.
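
To make the encoding concrete, here is a minimal sketch in Python/NumPy (not part of the original slides; the array and variable names are illustrative) that builds both the weekday/weekend indicator and the seven-column one-hot encoding for the toy dataset's day column.

```python
import numpy as np

# Day column of the toy commute dataset (matches the table above)
days = ["fri", "mon", "sun", "tue", "sat"]
week = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Binary encoding: 1 if weekday, 0 if weekend
x_weekday = np.array([0.0 if d in ("sat", "sun") else 1.0 for d in days])

# One-hot encoding: seven 0-1 features, exactly one of which is 1 per example
one_hot = np.zeros((len(days), len(week)))
one_hot[np.arange(len(days)), [week.index(d) for d in days]] = 1.0

print(x_weekday)   # [1. 1. 0. 1. 0.]
print(one_hot)     # each row has a single 1 in the column for its day
```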

Linear Model: Adding a feature for the bias term

Original data and model:

  dist (x_1)   day (x_2)   commute time (y)
  2.7          fri         25
  4.1          mon         33
  1.0          sun         15
  5.2          tue         45
  2.8          sat         22

Model: $y = w_0 + w_1 x_1 + w_2 x_2 + \epsilon$

With an added constant feature:

  one (x_0)   dist (x_1)   day (x_2)   commute time (y)
  1           2.7          fri         25
  1           4.1          mon         33
  1           1.0          sun         15
  1           5.2          tue         45
  1           2.8          sat         22

Model: $y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \epsilon = w^\top x + \epsilon$
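
A short sketch (Python/NumPy, illustrative names) of adding the constant feature $x_0 = 1$ in code, so the bias is handled like any other weight:

```python
import numpy as np

# Distance and weekday-indicator features for the five commutes (toy data above)
X_raw = np.array([[2.7, 1.0],
                  [4.1, 1.0],
                  [1.0, 0.0],
                  [5.2, 1.0],
                  [2.8, 0.0]])

# Prepend a column of ones: X now has shape N x (D+1) and w_0 multiplies x_0 = 1
X = np.column_stack([np.ones(X_raw.shape[0]), X_raw])
print(X.shape)   # (5, 3)
```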

Learning Linear Models

Data: $(x_i, y_i)_{i=1}^N$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$
Model parameter: $w \in \mathbb{R}^D$

Training phase (learning/estimation of $w$ from data):
  data $(x_i, y_i)_{i=1}^N$ $\rightarrow$ Learning Algorithm $\rightarrow$ $w$ (estimate)

Testing/Deployment phase: predict $\hat{y}_{\mathrm{new}} = x_{\mathrm{new}}^\top w$
How different is $\hat{y}_{\mathrm{new}}$ from $y_{\mathrm{new}}$ (the actual observation)?
We should keep some data aside for testing before deploying a model.
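
One simple way to "keep some data aside for testing" is a random holdout split. The sketch below (Python/NumPy; the function name and split fraction are my own choices, not from the slides) illustrates the idea.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the examples for testing."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```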

$(x_i, y_i)_{i=1}^N$, where $x_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$
$\hat{y}(x) = w_0 + x w_1$ (no noise term in $\hat{y}$)

$$L(w) = L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^{N} (w_0 + x_i w_1 - y_i)^2$$

This is variously called a loss function, cost function, objective function, or energy function; common notation includes $L$, $J$, $E$, $R$.
This objective is known as the residual sum of squares (RSS).
The estimate $(w_0, w_1)$ is known as the least squares estimate.

$(x_i, y_i)_{i=1}^N$, where $x_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$
$\hat{y}(x) = w_0 + x w_1$ (no noise term in $\hat{y}$)

$$L(w) = L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^{N} (w_0 + x_i w_1 - y_i)^2$$

$$\frac{\partial L}{\partial w_0} = \frac{1}{N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i), \qquad \frac{\partial L}{\partial w_1} = \frac{1}{N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i)\, x_i$$

We obtain the solution for $(w_0, w_1)$ by setting the partial derivatives to 0 and solving the resulting system (the normal equations):

$$w_0 + w_1 \frac{\sum_i x_i}{N} = \frac{\sum_i y_i}{N} \qquad (1)$$
$$w_0 \frac{\sum_i x_i}{N} + w_1 \frac{\sum_i x_i^2}{N} = \frac{\sum_i x_i y_i}{N} \qquad (2)$$

Writing $\bar{x} = \frac{\sum_i x_i}{N}$, $\bar{y} = \frac{\sum_i y_i}{N}$, $\widehat{\mathrm{var}}(x) = \frac{\sum_i x_i^2}{N} - \bar{x}^2$ and $\widehat{\mathrm{cov}}(x, y) = \frac{\sum_i x_i y_i}{N} - \bar{x}\bar{y}$, the solution is

$$w_1 = \frac{\widehat{\mathrm{cov}}(x, y)}{\widehat{\mathrm{var}}(x)}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
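
The closed-form expressions above translate directly into code. A minimal sketch (Python/NumPy) fitting commute time from distance alone; this is an illustration, not a computation shown on the slides.

```python
import numpy as np

x = np.array([2.7, 4.1, 1.0, 5.2, 2.8])    # distance (km)
y = np.array([25., 33., 15., 45., 22.])    # commute time (min)

x_bar, y_bar = x.mean(), y.mean()
var_x = (x ** 2).mean() - x_bar ** 2        # empirical variance of x
cov_xy = (x * y).mean() - x_bar * y_bar     # empirical covariance of x and y

w1 = cov_xy / var_x          # slope:      w_1 = cov(x, y) / var(x)
w0 = y_bar - w1 * x_bar      # intercept:  w_0 = y_bar - w_1 * x_bar
print(w0, w1)
```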

Linear Regression: General Case

Recall that the linear model is $\hat{y}_i = \sum_{j=0}^{D} x_{ij} w_j$, where we assume that $x_{i0} = 1$ for all $x_i$, so that the bias term $w_0$ does not need to be treated separately.

Expressing everything in matrix notation: $\hat{y} = Xw$

Here $\hat{y} \in \mathbb{R}^{N \times 1}$, $X \in \mathbb{R}^{N \times (D+1)}$ and $w \in \mathbb{R}^{(D+1) \times 1}$:

$$\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{bmatrix} = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_D \end{bmatrix} = \begin{bmatrix} x_{10} & \cdots & x_{1D} \\ x_{20} & \cdots & x_{2D} \\ \vdots & & \vdots \\ x_{N0} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_D \end{bmatrix}$$

Back to the toy example

  one   dist (km)   weekday?   commute time (min)
  1     2.7         1 (fri)    25
  1     4.1         1 (mon)    33
  1     1.0         0 (sun)    15
  1     5.2         1 (tue)    45
  1     2.8         0 (sat)    22

We have $N = 5$, $D + 1 = 3$, and so we get

$$y = \begin{bmatrix} 25 \\ 33 \\ 15 \\ 45 \\ 22 \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & 2.7 & 1 \\ 1 & 4.1 & 1 \\ 1 & 1.0 & 0 \\ 1 & 5.2 & 1 \\ 1 & 2.8 & 0 \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

Suppose we get $w = [6.09, 6.53, 2.11]^\top$. Then our predictions would be

$$\hat{y} = Xw = \begin{bmatrix} 25.83 \\ 34.97 \\ 12.62 \\ 42.16 \\ 24.37 \end{bmatrix}$$
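
The numbers on this slide are easy to reproduce. A short sketch (Python/NumPy) that evaluates $\hat{y} = Xw$ for the given design matrix and weight vector:

```python
import numpy as np

X = np.array([[1, 2.7, 1],
              [1, 4.1, 1],
              [1, 1.0, 0],
              [1, 5.2, 1],
              [1, 2.8, 0]])
w = np.array([6.09, 6.53, 2.11])

y_hat = X @ w                    # all five predictions in one matrix-vector product
print(np.round(y_hat, 2))        # [25.83 34.97 12.62 42.16 24.37]
```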

Least Squares Estimate: Minimise the Squared Error

$$L(w) = \frac{1}{2N} \sum_{i=1}^{N} (x_i^\top w - y_i)^2 = \frac{1}{2N} (Xw - y)^\top (Xw - y)$$
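
A quick numerical check (Python/NumPy sketch, not from the slides) that the per-example sum and the matrix form of the objective agree:

```python
import numpy as np

def loss_sum(X, y, w):
    """L(w) as an average of per-example squared errors."""
    return sum((xi @ w - yi) ** 2 for xi, yi in zip(X, y)) / (2 * len(y))

def loss_matrix(X, y, w):
    """L(w) in matrix form: 1/(2N) (Xw - y)^T (Xw - y)."""
    r = X @ w - y
    return (r @ r) / (2 * len(y))

# Toy commute data and an arbitrary weight vector
X = np.array([[1, 2.7, 1], [1, 4.1, 1], [1, 1.0, 0], [1, 5.2, 1], [1, 2.8, 0]])
y = np.array([25., 33., 15., 45., 22.])
w = np.array([6.09, 6.53, 2.11])
assert np.isclose(loss_sum(X, y, w), loss_matrix(X, y, w))
```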

Finding Optimal Solutions using Calculus

$$L(w) = \frac{1}{2N} \sum_{i=1}^{N} (x_i^\top w - y_i)^2 = \frac{1}{2N} (Xw - y)^\top (Xw - y)$$
$$= \frac{1}{2N} \left( w^\top \left( X^\top X \right) w - w^\top X^\top y - y^\top X w + y^\top y \right)$$
$$= \frac{1}{2N} \left( w^\top \left( X^\top X \right) w - 2\, y^\top X w + y^\top y \right)$$

One option is to write out all the partial derivatives to form the gradient $\nabla_w L$:

$$\frac{\partial L}{\partial w_0} = \cdots, \qquad \frac{\partial L}{\partial w_1} = \cdots, \qquad \ldots, \qquad \frac{\partial L}{\partial w_D} = \cdots$$

Instead, we will develop tricks to differentiate using matrix notation directly.

Differentiating Matrix Expressions

Rules (Tricks)

(i) Linear form expressions: $\nabla_w (c^\top w) = c$

Since $c^\top w = \sum_{j=0}^{D} c_j w_j$, we have $\frac{\partial}{\partial w_j}(c^\top w) = c_j$, and so

$$\nabla_w (c^\top w) = c \qquad (3)$$

(ii) Quadratic form expressions: $\nabla_w (w^\top A w) = A w + A^\top w$ ($= 2Aw$ for symmetric $A$)

Since $w^\top A w = \sum_{i=0}^{D} \sum_{j=0}^{D} w_i w_j A_{ij}$, we have

$$\frac{\partial}{\partial w_k}(w^\top A w) = \sum_{i=0}^{D} w_i A_{ik} + \sum_{j=0}^{D} A_{kj} w_j = A_{[:,k]}^\top w + A_{[k,:]}\, w$$

and so

$$\nabla_w (w^\top A w) = A^\top w + A w \qquad (4)$$
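
Both rules can be sanity-checked numerically with finite differences. A small sketch (Python/NumPy; the matrix, vector and test point are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
A = rng.normal(size=(D, D))   # a general, not necessarily symmetric, matrix
c = rng.normal(size=D)
w = rng.normal(size=D)

def num_grad(f, w, eps=1e-6):
    """Central finite-difference approximation to the gradient of f at w."""
    g = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Rule (3): gradient of c^T w is c
assert np.allclose(num_grad(lambda v: c @ v, w), c, atol=1e-6)
# Rule (4): gradient of w^T A w is (A + A^T) w
assert np.allclose(num_grad(lambda v: v @ A @ v, w), (A + A.T) @ w, atol=1e-5)
```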

Deriving the Least Squares Estimate

$$L(w) = \frac{1}{2N} \sum_{i=1}^{N} (x_i^\top w - y_i)^2 = \frac{1}{2N} \left( w^\top \left( X^\top X \right) w - 2\, y^\top X w + y^\top y \right)$$

We compute the gradient $\nabla_w L$ using the matrix differentiation rules:

$$\nabla_w L = \frac{1}{N} \left( \left( X^\top X \right) w - X^\top y \right)$$

By setting $\nabla_w L = 0$ and solving we get

$$\left( X^\top X \right) w = X^\top y$$
$$w = \left( X^\top X \right)^{-1} X^\top y \quad \text{(assuming the inverse exists)}$$

The predictions made by the model on the data $X$ are given by

$$\hat{y} = Xw = X \left( X^\top X \right)^{-1} X^\top y$$

For this reason the matrix $X \left( X^\top X \right)^{-1} X^\top$ is called the hat matrix.
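
Putting the derivation into code: a minimal sketch (Python/NumPy, using the toy commute data for illustration) that solves the normal equations and forms the hat matrix. Solving $(X^\top X) w = X^\top y$ with np.linalg.solve is preferable to explicitly inverting $X^\top X$.

```python
import numpy as np

X = np.array([[1, 2.7, 1], [1, 4.1, 1], [1, 1.0, 0], [1, 5.2, 1], [1, 2.8, 0]])
y = np.array([25., 33., 15., 45., 22.])

# Least squares estimate: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H maps observations y to fitted values y_hat = H y
H = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(H @ y, X @ w)
print(w)
```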

Least Squares Estimate

$$w = \left( X^\top X \right)^{-1} X^\top y$$

When do we expect $X^\top X$ to be invertible?
- $\mathrm{rank}(X^\top X) = \mathrm{rank}(X) \le \min\{D + 1, N\}$
- As $X^\top X$ is $(D+1) \times (D+1)$, it is invertible if and only if $\mathrm{rank}(X) = D + 1$

What if we use a one-hot encoding for a feature like day?
- Suppose $x_{\mathrm{mon}}, \ldots, x_{\mathrm{sun}}$ stand for the 0-1 valued variables in the one-hot encoding
- We always have $x_{\mathrm{mon}} + \cdots + x_{\mathrm{sun}} = 1$
- This introduces a linear dependence among the columns of $X$, reducing the rank
- In this case, we can drop some features to adjust the rank. We'll see alternative approaches later in the course.

What is the computational complexity of computing $w$?
- It is relatively easy to get an $O(D^2 N)$ bound
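
The rank problem caused by a full one-hot encoding can be seen directly in code. In the sketch below (Python/NumPy, a hypothetical small example) the seven one-hot day columns sum to the intercept column, so $X^\top X$ is singular; np.linalg.lstsq still returns a minimum-norm least squares solution.

```python
import numpy as np

week = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
days = ["fri", "mon", "sun", "tue", "sat"]
dist = np.array([2.7, 4.1, 1.0, 5.2, 2.8])
y = np.array([25., 33., 15., 45., 22.])

one_hot = np.zeros((len(days), len(week)))
one_hot[np.arange(len(days)), [week.index(d) for d in days]] = 1.0

# Intercept + distance + all seven one-hot columns: the one-hot columns sum to
# the intercept column, so the columns of X are linearly dependent
X = np.column_stack([np.ones(len(days)), dist, one_hot])
print(np.linalg.matrix_rank(X.T @ X))   # strictly less than D + 1 = 9 (also capped by N = 5)

# X^T X is singular, so np.linalg.solve would fail; lstsq returns a
# minimum-norm least squares solution instead
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```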

Recap: Predicting Commute Time

Goal
- Predict the time taken for the commute given distance and day of week
- Do we only wish to make predictions, or also suggestions?

Model and Choice of Loss Function
- Use a linear model $y = w_0 + w_1 x_1 + \cdots + w_D x_D + \epsilon = \hat{y} + \epsilon$
- Minimise the average squared error $\frac{1}{2N} \sum_i (y_i - \hat{y}_i)^2$

Algorithm to Fit the Model
- Simple matrix operations using the closed-form solution

Model and Loss Function Choice

Optimisation View of Machine Learning
- Pick a model that you expect may fit the data well enough
- Pick a measure of performance that makes sense and can be optimised
- Run an optimisation algorithm to obtain the model parameters

Probabilistic View of Machine Learning
- Pick a model for the data and explicitly formulate the deviation (or uncertainty) from the model using the language of probability
- Use notions from probability to define the suitability of various models
- Find the parameters or make predictions on unseen data using these suitability criteria (frequentist vs Bayesian viewpoints)

Next Time
- Probabilistic view of machine learning (maximum likelihood)
- Non-linearity using basis expansion
- What to do when you have more features than data?
- Make sure you're familiar with the multivariate Gaussian distribution