Linear Models for Regression CS534

Prediction Problems
Predict housing price based on: house size, lot size, location, number of rooms.
Predict stock price based on: price history of the past month.
Predict the abundance of a species based on: environmental conditions.
General set-up: given a set of training examples $(\mathbf{x}_i, t_i)$, $i = 1, \dots, N$.
Goal: learn a function $\hat{y}(\mathbf{x})$ to minimize some loss function $L(\hat{y}, t)$.

An example regression problem
We want to learn to predict a person's height based on his/her knee height and/or arm span.
This is useful for patients who are bed-bound and cannot stand to take an accurate measurement of their height.
Training data: some measurements taken from past classes.

Target function
Let $y$ represent a person's height, and let $\mathbf{x}$ represent the measurements we use to predict $y$.
Here $\mathbf{x}$ contains two measurements, referred to as features: $x_1$ = knee height and $x_2$ = arm span, i.e., $\mathbf{x} = (x_1, x_2)^T$.

Hypothesis space
Linear functions (thus the name linear regression):
$y = w_1 x_1 + w_2 x_2 + b$
To simplify, we rename $b$ to $w_0$ and define $x_0 = 1$, giving $y = w_0 x_0 + w_1 x_1 + w_2 x_2$.
Letting $\mathbf{w} = (w_0, w_1, w_2)^T$ and $\mathbf{x} = (x_0, x_1, x_2)^T$, we have $y = \mathbf{w}^T \mathbf{x}$.

Mean Squared Loss
Given a set of training examples $(\mathbf{x}_1, t_1), \dots, (\mathbf{x}_N, t_N)$, our hypothesis space considers functions of the form $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}$.
Loss function: $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(t_n - \mathbf{w}^T \mathbf{x}_n\big)^2$.
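To make the objective concrete, here is a minimal Python/NumPy sketch of this sum-of-squares loss; the function name and the convention that the design matrix already contains the constant feature $x_0 = 1$ are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sum_of_squares_loss(w, X, t):
    """E(w) = 1/2 * sum_n (t_n - w^T x_n)^2 for the linear hypothesis y(x, w) = w^T x.

    X : (N, d+1) design matrix whose rows are the x_n (first column all ones)
    t : (N,) vector of targets
    """
    residuals = t - X @ w
    return 0.5 * np.sum(residuals ** 2)
```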

Optimization of the loss function
There are many ways to optimize this objective function.
One of the simplest approaches is called gradient descent. The idea is quite simple:
Start with some random guess of the parameters.
Improve the parameters by following the steepest-descent direction.

Gradient descent
1. Start from some initial guess.
2. Find the direction of steepest descent (opposite of the gradient direction).
3. Take a step in that direction.
4. Repeat until no local improvement is possible.
Different starting points may lead to different local minima.

Batch Gradient Descent for LMS
Start with some initial guess.
Take a gradient descent step: $\mathbf{w} \leftarrow \mathbf{w} + \eta \sum_{n=1}^{N} \big(t_n - \mathbf{w}^T \mathbf{x}_n\big)\,\mathbf{x}_n$.
Repeat until convergence.
Convergence is achieved when the magnitude of the gradient goes to zero. The learning rate $\eta$ controls how fast it converges.
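A minimal sketch of this batch update in NumPy, assuming a design matrix whose first column is the constant feature, a fixed learning rate eta, and a simple gradient-norm stopping rule (all illustrative choices, not prescribed by the slides):

```python
import numpy as np

def batch_gradient_descent(X, t, eta=0.01, tol=1e-6, max_iters=10000):
    """Minimize E(w) = 1/2 * sum_n (t_n - w^T x_n)^2 by batch gradient descent.

    X   : (N, d+1) design matrix, first column all ones (bias feature x_0 = 1)
    t   : (N,) vector of targets
    eta : learning rate (illustrative fixed value)
    """
    w = np.zeros(X.shape[1])              # initial guess
    for _ in range(max_iters):
        grad = -X.T @ (t - X @ w)         # gradient of the sum-of-squares loss
        if np.linalg.norm(grad) < tol:    # converged: gradient magnitude ~ 0
            break
        w -= eta * grad                   # step in the steepest-descent direction
    return w
```

With a sufficiently small eta this converges to the same minimizer that the normal equation gives later in the slides.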

Stochastic gradient descent
Start with some initial guess.
Loop until converged:
for $m = 1$ to $N$:
for $j = 0, 1, \dots, d$: $w_j \leftarrow w_j + \eta\,\big(t_m - \mathbf{w}^T \mathbf{x}_m\big)\,x_{mj}$
Stochastic gradient descent takes an update step every time it sees a single training example.
It is appropriate for very large data sets; in such cases it often converges faster than batch gradient descent.
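A corresponding sketch of the stochastic updates, again with illustrative choices (a fixed number of passes over the data and a random visiting order instead of an explicit convergence test):

```python
import numpy as np

def stochastic_gradient_descent(X, t, eta=0.01, n_epochs=50):
    """LMS-style stochastic gradient descent: update w after every single example."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for m in np.random.permutation(len(t)):   # visit examples in random order
            w += eta * (t[m] - w @ X[m]) * X[m]   # update using example m only
    return w
```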

Alternative way of optimizing
We now represent the problem in matrix/vector form. Define
$\mathbf{X} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} t_1 \\ \vdots \\ t_N \end{pmatrix},$
so the loss becomes $E(\mathbf{w}) = \frac{1}{2}\,\|\mathbf{Y} - \mathbf{X}\mathbf{w}\|^2$. Setting its gradient with respect to $\mathbf{w}$ to zero gives
$\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{Y}$ (the normal equation).

The normal equation $\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{Y}$ is a set of $d+1$ linear equations in $d+1$ variables.
Existence: there is always a solution to this equation.
Uniqueness: the solution is unique if the columns of $\mathbf{X}$ are linearly independent, i.e., the features are linearly independent, so that $\mathbf{X}^T \mathbf{X}$ is invertible.
In this case, we can directly solve for $\mathbf{w}$: $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$.
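For illustration, the closed-form solution can be computed with NumPy; the knee-height/arm-span numbers below are made up, and np.linalg.lstsq is used rather than forming the inverse explicitly:

```python
import numpy as np

# Toy data: predict height (cm) from knee height and arm span (illustrative numbers)
X_raw = np.array([[52.0, 168.0],
                  [55.0, 175.0],
                  [50.0, 160.0],
                  [57.0, 182.0]])
t = np.array([170.0, 178.0, 162.0, 185.0])

X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # prepend x_0 = 1 for the bias

# Solve the normal equation X^T X w = X^T t. np.linalg.lstsq is preferred over
# forming (X^T X)^{-1} explicitly and also handles linearly dependent columns.
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)                       # w[0] is the bias, w[1:] the feature weights
```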

Probabilistic Interpretation of the Least Squares Objective
One might ask: why is the least squares objective a good choice?
There is a natural probabilistic explanation for this objective when we make some simple assumptions.
Let us assume that the target variable $t$ and the input vector $\mathbf{x}$ are related as follows:
$t = \mathbf{w}^T \mathbf{x} + \epsilon$,
where $\epsilon$ is an error term distributed according to a Gaussian with zero mean and some variance $\sigma^2$, i.e., $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
This is equivalent to assuming $t \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}, \sigma^2)$.
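A quick sanity check of this model assumption (the weights, noise level, and sample size below are arbitrary illustrative values): data generated as $t = \mathbf{w}^T\mathbf{x} + \epsilon$ with Gaussian noise, fitted by least squares, recovers $\mathbf{w}$ approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, 2.0, -3.0])                 # [bias, w_1, w_2], made-up values
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
t = X @ w_true + rng.normal(0.0, 0.5, size=200)     # t = w^T x + eps, eps ~ N(0, 0.5^2)

w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)       # least squares = ML estimate under this model
print(w_hat)                                        # close to w_true
```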

Likelihood Function
Now we are given the observed data $(\mathbf{x}_1, t_1), \dots, (\mathbf{x}_N, t_N)$.
We assume the error terms of these examples are independently and identically distributed (i.i.d.).
Given the inputs $\mathbf{x}_1, \dots, \mathbf{x}_N$, what is the probability that their target outputs are $t_1, \dots, t_N$?
$p(\mathbf{Y} \mid \mathbf{X}; \mathbf{w}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n; \mathbf{w})$
One can view this as a function of $\mathbf{Y}$ for fixed $\mathbf{w}$ and $\mathbf{X}$.
Alternatively, it can be viewed as a function of $\mathbf{w}$, with $\mathbf{Y}$ and $\mathbf{X}$ fixed. This is called the likelihood function $L(\mathbf{w})$.

Maximum likelihood principle
Maximum likelihood estimation seeks the parameter that maximizes the likelihood function.
It is often easier to work with the log of the likelihood function:
$\log L(\mathbf{w}) = \sum_{n=1}^{N} \log p(t_n \mid \mathbf{x}_n; \mathbf{w}) = \sum_{n=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(t_n - \mathbf{w}^T \mathbf{x}_n)^2}{2\sigma^2} \right) \right] = -N \log\big(\sqrt{2\pi}\,\sigma\big) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(t_n - \mathbf{w}^T \mathbf{x}_n\big)^2$
Therefore
$\arg\max_{\mathbf{w}} L(\mathbf{w}) = \arg\min_{\mathbf{w}} \frac{1}{2} \sum_{n=1}^{N} \big(t_n - \mathbf{w}^T \mathbf{x}_n\big)^2$,
i.e., maximum likelihood under this model is exactly least squares.

Background: Maximum Likelihood Estimation

Parameter Estimation
Given observations of some random variable(s), and assuming the variable(s) follow some fixed distribution, how do we estimate the parameters?
Examples:
Coin tosses (Bernoulli distribution; parameter: $p$, the probability of heads).
Dice or grades (discrete distribution; parameters $p_i$ for $i = 1, \dots, k-1$ with $k$ categories).
Height and weight of a person (continuous variables; the parameters depend on the distribution, e.g., for a Gaussian distribution the parameters are $\mu$, the mean, and $\Sigma$, the covariance).

Maximum Likelihood Principle
We use a general symbol $\theta$ to denote the model (parameters) we try to estimate, and $D$ to denote the data that we observe. $D$ is a set of examples; let $d_i$ denote the $i$-th example in $D$.
Assuming a fixed model $\theta$, what is the probability of observing $D$? This can be written as $P(D; \theta)$.
The maximum likelihood principle seeks the $\theta$ that maximizes this probability; viewed as a function of $\theta$, $P(D; \theta)$ is called the likelihood function.
$P(D; \theta) = \prod_i P(d_i; \theta)$. Here we are making the i.i.d. assumption: the examples in $D$ are independently and identically distributed, so we can use the product.
It is often more convenient to work with the log of $P(D; \theta)$: $\log P(D; \theta) = \sum_i \log P(d_i; \theta)$.

Example: Coin Toss
You are given a coin, and want to determine the probability of heads, $p$, for this coin.
You toss it 500 times, and observe heads 242 times.
What is your estimate of $p$? $\hat{p} = \frac{242}{500} = 0.484$.
But why? This is doing maximum likelihood estimation.
In other words, $p = 0.484$ leads to the highest probability of observing 242 heads out of 500 tosses.
Now let's go over the math to convince ourselves.

Example (cont.)
We have $D = \{x_1, \dots, x_N\}$, where $x_i$ is the outcome of the $i$-th coin toss: $x_i = 1$ means heads and $x_i = 0$ means tails.
Let's write down the likelihood function: $L(p) = P(D; p) = \prod_{i=1}^{N} P(x_i; p)$.
For a binary variable, we have a nice compact form for its probability mass function: $P(x_i; p) = p^{x_i}(1-p)^{1-x_i}$.
So we have $L(p) = \prod_{i=1}^{N} p^{x_i}(1-p)^{1-x_i} = p^{n_H}(1-p)^{N-n_H}$, where $n_H = \sum_i x_i$ is the number of heads.
Taking the log: $\log L(p) = n_H \log p + (N - n_H)\log(1-p)$.

Maximizing the likelihood
We take the derivative of $\log L(p)$: $\frac{d}{dp}\log L(p) = \frac{n_H}{p} - \frac{N - n_H}{1-p}$.
Setting it to zero: $\frac{n_H}{p} = \frac{N - n_H}{1-p}$.
Solving this leads to: $\hat{p} = \frac{n_H}{N}$, which for our coin gives $\frac{242}{500} = 0.484$.
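A small numerical check of this result (the grid resolution is an arbitrary choice): evaluating the Bernoulli log-likelihood over a grid of candidate values of $p$ puts the maximum at 0.484, matching $n_H / N$.

```python
import numpy as np

n_heads, n_tosses = 242, 500

# Bernoulli log-likelihood: log L(p) = n_H log p + (N - n_H) log(1 - p)
def log_likelihood(p):
    return n_heads * np.log(p) + (n_tosses - n_heads) * np.log(1 - p)

# Evaluate on a grid of candidate p values and pick the maximizer.
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmax(log_likelihood(grid))]
print(p_hat)                      # ~0.484, i.e. n_heads / n_tosses
```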

Revisiting an earlier example: Polynomial Curve Fitting
In this example, there is only one feature $x$. We learn a polynomial function of order $M$.
Alternatively, we could view this as learning a linear model with $1, x, x^2, \dots, x^M$ as the features.
Note that this new feature space is derived from the original feature. We refer to these derived features as basis functions.

Linear Basis Function Models (1)
Generally, the linear models we have been discussing can be used to learn nonlinear models when using appropriate basis functions:
$y(\mathbf{x}, \mathbf{w}) = \sum_{j} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$,
where the $\phi_j(\mathbf{x})$ are known as basis functions. Typically $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as the bias term.
Simplest case: use linear basis functions, $\phi_j(\mathbf{x}) = x_j$.
Using nonlinear basis functions allows us to use the basic linear regression algorithm to learn nonlinear functions.

Linear Basis Function Models (2)
Polynomial basis functions: $\phi_j(x) = x^j$.
These are global basis functions: a small change in $x$ affects all basis functions.

Linear Basis Function Models (3)
Gaussian basis functions: $\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$.
These are local: a small change in $x$ only affects nearby basis functions. The $\mu_j$ and $s$ control the location and scale (width).

Linear Basis Function Models (4)
Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
Also local: a small change in $x$ only affects nearby basis functions. The $\mu_j$ and $s$ control location and scale (slope).
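The three families of basis functions above can be sketched in a few lines of NumPy and plugged into the same least-squares solver; the centers, widths, and the sinusoidal toy targets below are illustrative choices, not part of the slides.

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0..M (global: changing x moves every feature)."""
    return np.vander(x, M + 1, increasing=True)

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) (local basis functions)."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, centers, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma the logistic function."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

# Any of these feature maps can be fed to the same linear least-squares solver:
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x)                      # illustrative targets
Phi = np.hstack([np.ones((len(x), 1)),         # bias feature phi_0 = 1
                 gaussian_basis(x, np.linspace(0, 1, 9), s=0.1)])
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```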

Overfitting issue
What can we do to curb overfitting?
Use a less complex model.
Use more training examples.
Regularization.

Regularized Least Squares (1)
Consider the error function: data term + regularization term (which penalizes complex models).
With the sum-of-squares error function and a quadratic regularizer, we get
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\big)^2 + \frac{\lambda}{2}\, \mathbf{w}^T \mathbf{w}$,
which encourages small weight values and is minimized by
$\mathbf{w} = \big(\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi}\big)^{-1} \boldsymbol{\Phi}^T \mathbf{t}$.
$\lambda$ is called the regularization coefficient.
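A minimal sketch of this closed form (the helper name ridge_fit is ours): note that for $\lambda > 0$ the matrix $\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is always invertible, so the solution exists even when the plain normal equation is ill-conditioned.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize 1/2 ||t - Phi w||^2 + lam/2 ||w||^2.

    Closed form: w = (lam * I + Phi^T Phi)^{-1} Phi^T t.
    """
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)
```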

Regularized Least Squares (2)
With a more general regularizer, we have
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big(t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\big)^2 + \frac{\lambda}{2} \sum_{j} |w_j|^q$.
$q = 1$ gives the lasso; $q = 2$ gives the quadratic (ridge) regularizer.
Note this is equivalent to minimizing the sum-of-squares loss subject to a constraint that $\sum_j |w_j|^q$ is bounded.

Regularized Least Squares (3)
The lasso tends to generate sparser solutions than a quadratic regularizer: many of the weights are driven exactly to zero.
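One way to see this sparsity effect, sketched with scikit-learn (the data-generating weights, noise level, and the alpha values, which play the role of $\lambda$, are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                 # only 3 relevant features
t = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, t)
lasso = Lasso(alpha=0.1).fit(X, t)

print("ridge zero weights:", np.sum(np.isclose(ridge.coef_, 0.0)))   # typically 0
print("lasso zero weights:", np.sum(lasso.coef_ == 0.0))             # most of the 17 irrelevant ones
```

On data like this, ridge shrinks all 20 weights but leaves them nonzero, while the lasso typically sets most of the irrelevant weights exactly to zero.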

What we are doing next
We have seen that model complexity can be controlled by choosing different basis functions or by adjusting the parameter $\lambda$ in the regularization framework (larger $\lambda$, less complexity).
By controlling model complexity (via different mechanisms), we trade off the risk of overfitting against the risk of using an oversimplified model that is not flexible enough to capture the underlying function (underfitting).
In the next few slides, we will see some theoretical analysis to understand this trade-off from a statistical point of view.
In particular, we will analyze the expected loss of our prediction and see how it can be decomposed.

Analyzing the Expected Loss
Let $y(\mathbf{x})$ be our estimate of the true value $t$. The expected squared loss is
$\mathbb{E}[L] = \iint \big(y(\mathbf{x}) - t\big)^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$.
Writing $y(\mathbf{x}) - t = \big(y(\mathbf{x}) - h(\mathbf{x})\big) + \big(h(\mathbf{x}) - t\big)$, where $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]$, and expanding the square, the cross term $2\big(y(\mathbf{x}) - h(\mathbf{x})\big)\big(h(\mathbf{x}) - t\big)$ integrates to zero, giving
$\mathbb{E}[L] = \int \big(y(\mathbf{x}) - h(\mathbf{x})\big)^2 p(\mathbf{x})\, d\mathbf{x} + \iint \big(h(\mathbf{x}) - t\big)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$.

The Bias Variance Decomposition (1)
Consider the expected squared loss
$\mathbb{E}[L] = \int \big(y(\mathbf{x}) - h(\mathbf{x})\big)^2 p(\mathbf{x})\, d\mathbf{x} + \iint \big(h(\mathbf{x}) - t\big)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$.
The second term of $\mathbb{E}[L]$ corresponds to the noise inherent in the target variable. This is part of the data; we have no control over it.
What about the first term?

The Bias Variance Decomposition (2)
Suppose we were given multiple data sets, each of size $N$. Any particular training data set $D$ will give a particular function $y(\mathbf{x}; D)$.
Let $\bar{y}(\mathbf{x}) = \mathbb{E}_D[y(\mathbf{x}; D)]$. The integrand of the first term can be written as
$\big(y(\mathbf{x}; D) - h(\mathbf{x})\big)^2 = \big(y(\mathbf{x}; D) - \bar{y}(\mathbf{x}) + \bar{y}(\mathbf{x}) - h(\mathbf{x})\big)^2 = \big(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\big)^2 + \big(\bar{y}(\mathbf{x}) - h(\mathbf{x})\big)^2 + 2\big(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\big)\big(\bar{y}(\mathbf{x}) - h(\mathbf{x})\big)$.
Taking the expectation over $D$, the last term goes to zero.

The Bias Variance Decomposition (3)
Taking the expectation over $D$ yields
$\mathbb{E}_D\big[\big(y(\mathbf{x}; D) - h(\mathbf{x})\big)^2\big] = \underbrace{\big(\bar{y}(\mathbf{x}) - h(\mathbf{x})\big)^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\big[\big(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\big)^2\big]}_{\text{variance}}$.

The Bias Variance Decomposition (4)
Thus we can write
expected loss $=$ (bias)$^2$ $+$ variance $+$ noise,
where
$(\text{bias})^2 = \int \big(\bar{y}(\mathbf{x}) - h(\mathbf{x})\big)^2 p(\mathbf{x})\, d\mathbf{x}$,
$\text{variance} = \int \mathbb{E}_D\big[\big(y(\mathbf{x}; D) - \bar{y}(\mathbf{x})\big)^2\big]\, p(\mathbf{x})\, d\mathbf{x}$,
$\text{noise} = \iint \big(h(\mathbf{x}) - t\big)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt$.

The Bias Variance Decomposition (5)
Example: 25 training data sets generated from the sinusoidal function, varying the degree of regularization $\lambda$. The model uses 24 Gaussian basis functions.

The Bias Variance Decomposition (6), (7)
[Figure slides: plots of the fits obtained from the 25 data sets for different values of $\lambda$, discussed on the next slide.]

The Bias Variance Trade-off
From these plots, we note that an over-regularized model (large $\lambda$) will have a high bias, while an under-regularized model (small $\lambda$) will have a high variance.
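This behaviour can be reproduced with a small simulation in the spirit of the example above (25 data sets of 25 points from a noisy sinusoid, 24 Gaussian basis functions); the noise level, basis width, and the particular $\lambda$ values are illustrative assumptions.

```python
import numpy as np

def gaussian_basis(x, centers, s=0.1):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def ridge_fit(Phi, t, lam):
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)                  # the "true" function h(x)
centers = np.linspace(0, 1, 24)                 # 24 Gaussian basis functions, as in the example

for lam in [1e-3, 1.0, 100.0]:                  # under-, moderately, and over-regularized
    preds = []
    for _ in range(25):                         # 25 independent training sets of size 25
        x = rng.uniform(0, 1, 25)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 25)
        w = ridge_fit(gaussian_basis(x, centers), t, lam)
        preds.append(gaussian_basis(x_test, centers) @ w)
    preds = np.array(preds)
    y_bar = preds.mean(axis=0)                  # average prediction over data sets
    bias2 = np.mean((y_bar - h) ** 2)           # (bias)^2, averaged over the test inputs
    variance = np.mean(preds.var(axis=0))       # variance, averaged over the test inputs
    print(f"lambda={lam:>7}: bias^2={bias2:.3f}  variance={variance:.3f}")
```

Larger $\lambda$ values should report a larger bias$^2$ and a smaller variance, and vice versa.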

Summary
Linear regression with the least squares objective.
Gradient descent algorithm for optimization.
Closed-form solution by solving the normal equation.
Can be viewed as maximum likelihood estimation assuming i.i.d. Gaussian noise with zero mean.
Using basis functions allows us to learn nonlinear functions.
Controlling overfitting by regularization:
Ridge regression is equivalent to adding a diagonal matrix $\lambda \mathbf{I}$ to $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ in the normal equation.
LASSO encourages sparse solutions.