Linear regression

Reminders
Thought questions should be submitted on eclass. Please list the section related to the thought question. If it is a more general, open-ended question not exactly related to a section, label the question with a topic (e.g., Picking models).

Properties of distributions
The mean is the expected value, E[X]. The mode is the most likely value (i.e., the x with largest p(x)). The median m is the value such that X is equally likely to fall above or below m: P(X <= m) = P(X >= m). When we use a squared-error cost, we obtain f(x) = E[Y | x]. If we use an absolute-error cost, we obtain f(x) = median(p(y | x)).
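
A quick way to convince yourself of the mean/median connection is a small numerical check (a sketch; the sample below is made up, not course data): evaluate both costs over a grid of constant predictions and see which value minimizes each.

```python
import numpy as np

# Hypothetical sample standing in for draws of Y.
y = np.array([1.0, 1.5, 2.0, 2.5, 10.0])

# Candidate constant predictions c, evaluated under both costs.
c = np.linspace(0, 10, 1001)
squared_cost = ((y[None, :] - c[:, None]) ** 2).mean(axis=1)
absolute_cost = np.abs(y[None, :] - c[:, None]).mean(axis=1)

print(c[squared_cost.argmin()], y.mean())       # squared-error minimizer ~ mean (3.4)
print(c[absolute_cost.argmin()], np.median(y))  # absolute-error minimizer ~ median (2.0)
```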

Summary of optimal models
Expected cost was introduced to formalize our objective. The Bayes risk function indicates the best we could do. f(x) is specified for each x, rather than having some simpler (continuous) function class; you can think of it as a table of values.
Optimal solution for classification (with uniform cost): f̂(x) = argmax_{y ∈ Y} p(y | x).
Optimal solution for regression (with squared-error cost): f̂(x) = ∫_Y y p(y | x) dy.

Learning functions
Hypothesize a functional form, e.g.,
f(x) = Σ_{j=1}^d w_j x_j
f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2
f(x_1, x_2) = w x_1 x_2
Then we need to find the best parameters for this function; we will find the parameters that best approximate E[Y | x].

Exercise: Reducible error
Can f(x) = Σ_{j=1}^d w_j x_j always represent E[Y | x]? No. Imagine y = w x_1 x_2. This is deterministic, so there is enough information in x to predict y; i.e., the stochasticity is not the problem, and we have zero irreducible error. Rather, the simplistic functional form means we cannot predict y.

Linear versus polynomial function
[Plot of a linear fit f_1(x) and a cubic fit f_3(x) on the same data; see Figure 4.4 below.]

Linear Regression
e.g., x_i = size of house, y_i = cost of house.
[Plot of the fitted line f(x) = w_0 + w_1 x through points (x_1, y_1), (x_2, y_2), with the residual e_1 = f(x_1) - y_1 marked.]
Figure 4.1: An example of a linear regression fit on the data set D = {(1, 1.2), (2, 2.3), (3, 2.3), (4, 3.3)}. The task of the optimization process is to find the best linear function f(x) = w_0 + w_1 x so that the sum of squared errors e_1^2 + e_2^2 + e_3^2 + e_4^2 is minimized.
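
As a concrete companion to Figure 4.1 (a minimal sketch, not course-provided code), numpy's least-squares routine recovers the line that minimizes the sum of squared errors on these four points; the resulting coefficients match the w_1 = (0.7, 0.63) reported on the overfitting slide below.

```python
import numpy as np

# Data set D from Figure 4.1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.3, 2.3, 3.3])

# Design matrix with a column of 1s for the intercept w_0.
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of (w_0, w_1), minimizing the sum of squared errors.
w, residual_ss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w)            # ~ [0.7, 0.63]
print(residual_ss)  # sum of squared errors e_1^2 + ... + e_4^2 at the optimum
```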

(Multiple) Linear Regression
e.g., x_{i1} = size of house, x_{i2} = age of house, y_i = cost of house.

Linear regression importance
Many other techniques will use a linear weighting of features, including neural networks. Often, we will add non-linearity using non-linear transformations of the linear weighting, or non-linear transformations of the features. Becoming comfortable with linear weightings, for multiple inputs and outputs, is important.

Example: regression
Example 11: Consider again the data set D = {(1, 1.2), (2, 2.3), (3, 2.3), (4, 3.3)} from Figure 4.1. We want to find the optimal coefficients of the least-squares fit, with

X = [ 1  1
      1  2
      1  3
      1  4 ],    w = [ w_0
                       w_1 ],    y = [ 1.2
                                       2.3
                                       2.3
                                       3.3 ],

where x_{ij} is the j-th feature of the i-th example, row i of X is x_i^T, and the prediction for example i is ŷ_i = x_i^T w. Recall that the least-squares prediction ŷ = Xw is the projection of y onto the column space of X.

Matrix multiplication

Whiteboard
Maximum likelihood formulation (and assumptions). Solving the optimization. Weighted error functions, if certain data points matter more than others (see the sketch below). Predicting multiple outputs (multivariate y).
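
For the weighted-error case, the closed-form solution generalizes in the standard weighted least squares way, w = (X^T C X)^{-1} X^T C y with C = diag(c_1, ..., c_n). A minimal sketch (the helper name is mine, written for reference rather than as the exact whiteboard derivation):

```python
import numpy as np

def weighted_least_squares(X, y, c):
    """Minimize sum_i c_i * (x_i^T w - y_i)^2.

    X: (n, d) design matrix, y: (n,) targets, c: (n,) nonnegative weights.
    Solution: w = (X^T C X)^{-1} X^T C y with C = diag(c).
    """
    XtC = X.T * c  # same as X.T @ np.diag(c), without forming the n x n matrix
    return np.linalg.solve(XtC @ X, XtC @ y)
```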

Comments (Sep 26, 2017)
Assignment 1 due on Thursday. More review of linear algebra today.

SVD
M = U Σ V^T
Mx = U Σ V^T x = U (Σ (V^T x))
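
A minimal numpy check of the decomposition on the slide (the matrix M here is just an arbitrary example, chosen to match the design matrix used earlier):

```python
import numpy as np

M = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])

# Thin SVD: M = U diag(s) V^T
U, s, Vt = np.linalg.svd(M, full_matrices=False)

x = np.array([2.0, -1.0])
# Mx computed directly, and as U (Sigma (V^T x)) following the slide.
print(M @ x)
print(U @ (s * (Vt @ x)))
```

The same decomposition underlies the upcoming stability discussion: X^T X has eigenvalues equal to the squared singular values of X, so small singular values are what make (X^T X)^{-1} blow up.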

What we've done so far
Discussed the linear regression goal: obtain weights w such that ⟨x, w⟩ approximates E[Y | x]. Discussed the maximum likelihood formulation, with solution w = (X^T X)^{-1} X^T y and predictions ŷ = Xw. Started discussing the properties of the solution: when is it stable? Today: what does this mean for accuracy when predicting on new data?

Example: OLS
Example 11: Consider again the data set D = {(1, 1.2), (2, 2.3), (3, 2.3), (4, 3.3)} from Figure 4.1, with the same X = [1 1; 1 2; 1 3; 1 4], w = [w_0; w_1], and y = [1.2; 2.3; 2.3; 3.3] as above. In Matlab, we can compute
1. X^T X
2. (X^T X)^{-1}
3. (X^T X)^{-1} X^T y
where X^T X multiplies the 2x4 matrix X^T = [1 1 1 1; 1 2 3 4] by the 4x2 matrix X.
What if we did not add the column of 1s?
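
The same three steps translate directly into Python with numpy (a sketch mirroring the Matlab steps above, not the course's assignment code); in practice np.linalg.solve or np.linalg.lstsq is preferred over forming the inverse explicitly. The last two lines show what happens without the column of 1s: the line is forced through the origin.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.2, 2.3, 2.3, 3.3])

XtX = X.T @ X                   # step 1: X^T X
XtX_inv = np.linalg.inv(XtX)    # step 2: (X^T X)^{-1}
w = XtX_inv @ X.T @ y           # step 3: (X^T X)^{-1} X^T y
print(w)                        # ~ [0.7, 0.63]

# Without the column of 1s there is no intercept term.
X_no_bias = X[:, 1:]
w_no_bias = np.linalg.lstsq(X_no_bias, y, rcond=None)[0]
print(w_no_bias)
```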

Whiteboard
More about inverses of matrices. Refresher on the stability of the solution. Using regularization to fix the problem. Properties of the solution: bias (and underfitting), variance (and overfitting).

Overfitting
[Plot of the linear fit f_1(x) and the cubic polynomial fit f_3(x) on the data set from Figure 4.1.]
Figure 4.4: Example of a linear vs. polynomial fit on the data set shown in Figure 4.1. The linear fit, f_1(x), is shown as a solid green line, whereas the cubic polynomial fit, f_3(x), is shown as a solid blue line. The dotted red line indicates the target linear concept.
w_1 = (0.7, 0.63)
w_3 = (-3.1, 6.6, -2.65, 0.35)
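
To see where coefficients like w_3 come from, one can fit both a line and a cubic to the four points of D (a sketch using numpy.polyfit; a cubic through four distinct points interpolates them exactly, which is precisely the overfitting concern):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.3, 2.3, 3.3])

w1 = np.polyfit(x, y, deg=1)   # linear fit f_1
w3 = np.polyfit(x, y, deg=3)   # cubic fit f_3, interpolates all four points
print(w1)  # highest degree first: [0.63, 0.7]
print(w3)  # [0.35, -2.65, 6.6, -3.1], i.e., (-3.1, 6.6, -2.65, 0.35) in increasing degree
```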

Comments (Sep 28, 2017)
Assignment 1 due today. You need matplotlib for simulate.py. Today: finish off bias-variance.

Terminology clarification
What is a parameter? Any coefficients (i.e., scalars, vectors, or matrices) that define the function you care about. E.g., a and b are both parameters for the function
f(x) = a if x < 0, b if x >= 0.
Maximum likelihood solution: the parameters for the pmf or pdf that make the data the most likely. The following function is not the maximum likelihood solution:
f(x) = argmax_{y ∈ Y} p(y | x).

Linear regression for non-linear problems
e.g., f(x) = w_0 + w_1 x, or more generally f(x) = Σ_{j=0}^p w_j x^j
e.g., f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2
[Diagram mapping each input x_i of the data matrix X to the row (φ_0(x_i), ..., φ_p(x_i)) of the transformed matrix Φ.]
Figure 4.3: Transformation of an n x 1 data matrix X into an n x (p + 1) matrix Φ using a set of basis functions φ_j, j = 0, 1, ..., p.
w = (Φ^T Φ)^{-1} Φ^T y.
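
A sketch of the transformation in Figure 4.3, specialized to the polynomial basis φ_j(x) = x^j (the helper name below is mine, not course code):

```python
import numpy as np

def polynomial_design_matrix(x, p):
    """Map an n-vector x to the n x (p+1) matrix with columns x^0, x^1, ..., x^p."""
    return np.column_stack([x ** j for j in range(p + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.3, 2.3, 3.3])

Phi = polynomial_design_matrix(x, p=3)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]   # w = (Phi^T Phi)^{-1} Phi^T y
print(w)  # ~ (-3.1, 6.6, -2.65, 0.35), the cubic coefficients from the overfitting slide
```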

Polynomial representations
w_0 + w_1 x^1 + w_2 x^2 + ... + w_9 x^9

Whiteboard
Bias and variance of linear regression solutions.

Bias-variance trade-off
[Images illustrating the four combinations of low/high bias and low/high variance.]
*Nice images from: http://scott.fortmann-roe.com/docs/biasvariance.html

Example: regularization and bias
We picked a Gaussian prior and obtained l2 regularization. We discussed the bias of this regularization: with no regularization the estimator was unbiased (E[w] = true w), whereas with regularization E[w] was not equal to the true w. Previously, however, we mentioned that MAP and ML converge to the same estimate. Does that happen here?
w = (X^T X + λI)^{-1} X^T y
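
A minimal sketch of the regularized solution above (assuming numpy; the helper name is mine):

```python
import numpy as np

def ridge_solution(X, y, lam):
    """l2-regularized least squares: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Informally, as n grows the data term X^T X scales with n while lam * I stays
# fixed, so the regularized (MAP) and unregularized (ML) solutions approach
# each other.
```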

How do we pick lambda?
We discussed the goal of minimizing the bias-variance trade-off, i.e., minimizing the MSE of the estimator. But this involves knowing the true w! Recall our actual goal: learn w to get good prediction accuracy on new data, called generalization error. Alternative to directly minimizing the MSE: use data to determine which choice of lambda provides good prediction accuracy (see the sketch below).
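
A minimal sketch of that idea (the helper and its defaults are mine, not the course's prescribed procedure): split off a validation set, fit the l2-regularized solution for each candidate lambda on the training portion, and keep the lambda with the smallest validation error.

```python
import numpy as np

def pick_lambda(X, y, lambdas, train_frac=0.8, seed=0):
    """Pick lambda by squared error on a held-out validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_train = int(train_frac * len(idx))
    tr, va = idx[:n_train], idx[n_train:]
    d = X.shape[1]

    best_lam, best_err = None, np.inf
    for lam in lambdas:
        # l2-regularized solution on the training split only.
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d), X[tr].T @ y[tr])
        err = np.mean((X[va] @ w - y[va]) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# e.g., pick_lambda(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```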

How can we tell if it's a good model?
What if you train many different models on a batch of data, check their accuracy on that same data, and pick the best one? Imagine you are predicting how much energy your appliances will use today. You train your models on all previous data for energy use in your home. How well will this perform in the real world? What if the models you are testing differ only in the regularization parameter lambda that they use? What will you find?

Simulating generalization error
[Diagram: the labeled data set is split into a training set and a test set; the learning method is run on the training set to produce a learned model, which is evaluated on the test set to give an accuracy estimate.]

Simulating generalization error
Now we have one model, trained similarly to how it will be trained in practice, and a measure of accuracy on new data (data distributed identically to the training data). What if we pick the model with the best test accuracy? Any issues?

Picking other priors
We picked a Gaussian prior on the weights. This encodes that we want the weights to stay near zero, varying on the order of 1/lambda. What if we had picked a different prior, e.g., the Laplace prior?
p(x) = (1 / 2b) exp(-|x - µ| / b)
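
For reference, a short derivation (using the density above with µ = 0 and one independent Laplace factor per weight) showing that the negative log of this prior is exactly an l1 penalty, which is where the l1 regularization on the following slides comes from:

$$-\log \prod_{j=1}^{d} \frac{1}{2b} \exp\!\left(-\frac{|w_j|}{b}\right) = \frac{1}{b} \sum_{j=1}^{d} |w_j| + d \log(2b) = \frac{1}{b} \lVert w \rVert_1 + \text{const}.$$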

Regularization intuition
Figure 4.5: A comparison between Gaussian and Laplace priors. The Gaussian prior prefers the values to be near zero, whereas the Laplace prior more strongly prefers the values to equal zero.

Regularization intuition (continued)

l1 regularization
Feature selection, as well as preventing large weights.
[Illustration: a weight vector of length k = 7 with only 3 nonzero entries, i.e., only k = 3 of the features are actually used.]
How do we solve this optimization?
min_{w ∈ R^d} ||Xw - y||_2^2 + λ||w||_1

How do we solve with the l1 regularizer?
min_{w ∈ R^d} ||Xw - y||_2^2 + λ||w||_1
Is there a closed-form solution? What approaches can we take?
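
There is no closed-form solution in general, because the l1 term is not differentiable at zero. One standard approach, among several (e.g., coordinate descent), is proximal gradient descent (ISTA) with soft-thresholding; the sketch below is illustrative and not necessarily the algorithm covered in class.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=1000):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size from the Lipschitz constant of the smooth part, 2 * sigma_max(X)^2.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)          # gradient of ||Xw - y||^2
        w = soft_threshold(w - step * grad, step * lam)
    return w
```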

Practically solving the optimization
In general, what are the advantages and disadvantages of the closed-form linear regression solution?
+ Simple approach: no need to add additional requirements, like stopping rules
- A closed form is not usually possible
- Must compute an expensive inverse
- With a large number of features, we must invert a large matrix. What about a large number of samples?
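
When the closed form is unavailable or the inverse is too expensive, gradient-based methods are the usual alternative. A minimal gradient descent sketch for the unregularized squared error, assuming a suitably small fixed step size (with many samples, the gradient can instead be estimated on mini-batches, i.e., stochastic gradient descent):

```python
import numpy as np

def linear_regression_gd(X, y, step=0.01, n_iters=10000):
    """Gradient descent on the mean squared error (1/n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= step * grad
    return w
```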