Machine Learning and Data Mining. Linear regression. Prof. Alexander Ihler


Machine Learning and Data Mining. Linear regression. Prof. Alexander Ihler

Supervised learning. Notation: features x, targets y, predictions ŷ, parameters θ. The learning algorithm takes training data (examples with features and feedback / target values) and changes θ to improve performance. The program (the "learner") is characterized by the parameters θ: a procedure (using θ) that outputs a prediction, whose performance is scored by a cost function.

Linear regression. [Figure: scatter of Target y vs. Feature x with a fitted line; the predictor evaluates the line at x and returns the result r.] Define the form of the function f(x) explicitly, then find a good f(x) within that family.

Notation. Define the feature x_0 = 1 (constant); then the linear predictor can be written compactly as ŷ = θ_0 x_0 + θ_1 x_1 + ... + θ_n x_n = θ · x.

Measuring error. [Figure: an observation, its prediction on the fitted line, and the error (residual) between them.]

Mean squared error. How can we quantify the error? The mean squared error (MSE) is one choice; we could choose something else, of course. It is computationally convenient (more later), it measures the variance of the residuals, and it corresponds to the likelihood under a Gaussian model of the noise.
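A standard way to write the MSE, consistent with the notation used in this lecture (m training examples, parameters θ), is, in LaTeX:

J(\theta) = \frac{1}{m} \sum_{j=1}^{m} \left( y^{(j)} - \theta \cdot x^{(j)} \right)^2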

MSE cost function. Rewrite using matrix form:
# Python / NumPy:
e = Y - X.dot( theta.T )
J = e.T.dot( e ) / m   # = np.mean( e ** 2 )


Visualizing the cost function J(θ). [Figure: the MSE cost surface over (θ_0, θ_1) and its contour plot.]

Finding good parameters. We want to find the parameters which minimize our error. Think of a "cost surface": the error residual as a function of θ.

Machine Learning and Data Mining. Linear regression: Gradient descent & stochastic gradient descent. Prof. Alexander Ihler

Gradient descent. How do we change θ to improve J(θ)? Choose a direction in which J(θ) is decreasing. The derivative tells us: positive means J is increasing, negative means J is decreasing.

Gradient descent in more dimensions. The gradient vector ∇J(θ) = [∂J/∂θ_0, ∂J/∂θ_1, ...] indicates the direction of steepest ascent (its negative is the direction of steepest descent).

Gradient descent. Ingredients: initialization; step size (can change as a function of iteration); gradient direction; stopping condition.
Initialize θ
Do { θ ← θ - α ∇_θ J(θ) } while ( ||∇J|| > ε )

Gradient for the MSE: ∇J = ? Differentiating the MSE componentwise gives ∂J/∂θ_i = -(2/m) Σ_j ( y^(j) - θ·x^(j) ) x_i^(j).

Gradient descent (with the MSE gradient). Initialize θ
Do { θ ← θ - α ∇_θ J(θ) } while ( ||∇J|| > ε )
In each component of the MSE gradient, the residual ( y^(j) - θ·x^(j) ) gives the error magnitude and direction for datum j, and the factor x_i^(j) gives the sensitivity to each θ_i.

Derivative of MSE. As noted above, the residual carries the error magnitude and direction for datum j, and x_i^(j) the sensitivity to each θ_i. Rewrite using matrix form:
e = Y - X.dot( theta.T )      # error residual
DJ = - e.dot(X) * 2.0/m       # compute the gradient
theta -= alpha * DJ           # take a step
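As a concrete check of the update above, here is a minimal, self-contained NumPy sketch on synthetic data; the data, the step size alpha, and the iteration count are illustrative choices rather than values from the slides.

import numpy as np

# Synthetic data: y is approx. 2.0 + 1.5*x1 plus noise; X includes the constant feature x0 = 1
rng = np.random.default_rng(0)
m = 50
x1 = rng.uniform(0, 10, size=m)
Y = 2.0 + 1.5 * x1 + rng.normal(0, 1, size=m)
X = np.column_stack([np.ones(m), x1])

theta = np.zeros(2)              # parameters [theta_0, theta_1]
alpha = 0.02                     # step size (illustrative)
for it in range(2000):
    e = Y - X.dot(theta)         # residuals
    DJ = -e.dot(X) * 2.0 / m     # MSE gradient
    theta = theta - alpha * DJ   # gradient step

print(theta)                     # approaches roughly [2.0, 1.5]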

Gradient descent on the cost function. [Figure: the descent path on the cost contours and the corresponding fits to the data.]

Comments on gradient descent. It is a very general algorithm; we'll see it many times. Local minima: the result is sensitive to the starting point. Step size: too large? too small? automatic ways to choose? We may want the step size to decrease with iteration. Common choices: fixed; linear, C/(iteration); line search / backoff (Armijo, etc.); Newton's method.

Newton's method. We want to find the roots of f(x); a "root" is a value of x for which f(x) = 0. Initialize to some point x, compute the tangent at x, and compute where it crosses the x-axis: x ← x - f(x)/f'(x). For optimization, find the roots of ∇J(θ), i.e. θ ← θ - [∇²J(θ)]⁻¹ ∇J(θ) (step size = inverse curvature). It does not always converge and is sometimes unstable, but if it converges it is usually very fast. It works well for smooth, non-pathological functions that are locally quadratic. For large n it may be computationally hard: O(n²) storage, O(n³) time. (Multivariate case: ∇J(θ) is the gradient vector, ∇²J(θ) the matrix of second derivatives; "a/b" means a·b⁻¹, the matrix inverse.)
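Below is a minimal sketch of the one-dimensional root-finding view described above; the function f(x) = x**2 - 2 and the starting point are illustrative. Applied to optimization, f plays the role of ∇J and its derivative the role of ∇²J, giving the Newton update on θ mentioned in the slide.

def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    # Repeatedly jump to where the tangent at x crosses zero: x <- x - f(x)/f'(x)
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x^2 - 2, i.e. sqrt(2) = 1.4142...
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))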

Stochastic / online gradient descent. The MSE gradient is an average over the data of per-datum gradients. Stochastic (or "online") gradient descent uses updates based on an individual datum j, chosen at random. At the optimum, the average gradient over the data is zero, even though the individual per-datum gradients generally are not.

Online gradient descent. Update based on one datum at a time: find its residual and the gradient of its part of the error, and update.
Initialize θ
Do { for j = 1:m : θ ← θ - α ∇_θ J_j(θ) } while (not done)
[Figure: a sequence of frames showing the parameter trajectory on the cost contours and the corresponding fits to the data as successive single-datum updates are applied.]

Online gradient descent. Benefits: lots of data means many more updates per pass; computationally faster.
Initialize θ
Do { for j = 1:m : θ ← θ - α ∇_θ J_j(θ) } while (not converged)
Drawbacks: no longer strictly descent; stopping conditions may be harder to evaluate (can use running estimates of J(·), etc.). Related: mini-batch updates, etc.
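A minimal sketch of the per-datum update loop above, on synthetic data; the seed, the step size, and the number of passes are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
m = 200
x1 = rng.uniform(0, 10, size=m)
Y = 2.0 + 1.5 * x1 + rng.normal(0, 1, size=m)
X = np.column_stack([np.ones(m), x1])

theta = np.zeros(2)
alpha = 0.002                                    # step size (illustrative)
for epoch in range(50):                          # passes over the data
    for j in rng.permutation(m):                 # visit the data in random order
        e_j = Y[j] - X[j].dot(theta)             # residual for datum j
        theta = theta + alpha * 2.0 * e_j * X[j] # descent step on J_j only

print(theta)                                     # roughly [2.0, 1.5]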

Machine Learning and Data Mining. Linear regression: direct minimization. Prof. Alexander Ihler

MSE minimum. Consider a simple problem: one feature, two data points, so two unknowns (θ_0, θ_1) and two equations. We can solve this system directly. However, most of the time m > n: there may be no linear function that hits all the data exactly. Instead, solve directly for the minimum of the MSE function.

MSE minimum. Setting the gradient to zero and reordering, we have θ = yᵀ X (Xᵀ X)⁻¹. The matrix X (Xᵀ X)⁻¹ is called the pseudo-inverse; if X is square and invertible, it reduces to the ordinary inverse (of Xᵀ). If m > n the system is overdetermined, and this gives the minimum-MSE fit.

Python MSE. This is easy to solve in Python / NumPy:
# y = np.matrix( [[y1], ..., [ym]] )
# X = np.matrix( [[x1_0 ... x1_n], [x2_0 ... x2_n], ...] )
# Solution 1: "manual" normal equations
th = y.T * X * np.linalg.inv(X.T * X)
# Solution 2: least-squares solve (returns theta as a column, i.e. transposed relative to Solution 1)
th = np.linalg.lstsq(X, y, rcond=None)[0]
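The snippet above follows the slides' np.matrix convention; a self-contained version using ordinary arrays (np.matrix is discouraged in current NumPy) might look like the following, with illustrative synthetic data.

import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, size=30)
X = np.column_stack([np.ones_like(x1), x1])   # constant feature x0 = 1, then x1
Y = 2.0 + 1.5 * x1 + rng.normal(0, 1, size=30)

# Solution 1: normal equations, theta = (X^T X)^{-1} X^T y
th1 = np.linalg.inv(X.T @ X) @ X.T @ Y

# Solution 2: least-squares solver (numerically preferable)
th2, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(th1, th2)                               # both roughly [2.0, 1.5]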

Normal equations. Interpretation: (y - ŷ) is the vector of errors on each example, and the columns of X are the features we have to work with for each example. Their dot product is 0: at the MSE minimum the error vector is orthogonal to every feature column. [Figure: a small worked example illustrating this orthogonality.]

Effects of MSE choice. Sensitivity to outliers: a residual of 16 incurs a 16² cost for that one datum, a heavy penalty for large errors. [Figure: the quadratic cost as a function of the residual, and a fit pulled toward a single outlying point.]

L1 error. [Figure: fits compared on the same data: L2 on the original data, L1 on the original data, and L1 on the outlier data.]

Cost functions for regression: mean squared error (MSE), mean absolute error (MAE), or something else entirely. Arbitrary cost functions can't be solved in closed form, so use gradient descent.
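The formulas themselves appear only as images in the original slides; standard definitions consistent with the notation above, written in LaTeX, are:

J_{\mathrm{MSE}}(\theta) = \frac{1}{m}\sum_{j=1}^{m} \bigl( y^{(j)} - \theta \cdot x^{(j)} \bigr)^2,
\qquad
J_{\mathrm{MAE}}(\theta) = \frac{1}{m}\sum_{j=1}^{m} \bigl| y^{(j)} - \theta \cdot x^{(j)} \bigr|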

Machine Learning and Data Mining. Linear regression: nonlinear features. Prof. Alexander Ihler

More dimensions? [Figure: two 3-D views of the target y plotted over two features x_1 and x_2.]

Nonlinear functions. What if our hypotheses are not lines? Example: higher-order polynomials. [Figure: the same data fit with an order-1 polynomial and an order-3 polynomial.]

Nonlinear functions. Single feature x, predict target y. Add features, e.g. x², x³, ..., and do linear regression in the new features. It is sometimes useful to think of this as a feature transform x → Φ(x) = [1, x, x², ...].
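A minimal sketch of this feature-transform idea: expand a single feature into polynomial features, then run ordinary linear regression on the expanded matrix. The helper name poly_features and the toy target are illustrative, not from the slides.

import numpy as np

def poly_features(x, degree):
    # Map a 1-D feature x to the columns [1, x, x^2, ..., x^degree]
    return np.column_stack([x ** k for k in range(degree + 1)])

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=40)
y = np.sin(2 * x) + rng.normal(0, 0.1, size=40)   # a nonlinear target (illustrative)

Phi = poly_features(x, degree=3)                  # feature transform
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # still linear in the parameters
print(theta)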

Higher-order polynomials. Fit in the same way, just with more features. [Figure: the same data fit with order-1, order-2, and order-3 polynomials.]

Features. In general, we can use any features we think are useful: other information about the problem (square footage, location, age, ...); polynomial features [1, x, x², x³, ...]; other functions (1/x, sqrt(x), x_1 * x_2, ...). Linear regression means linear in the parameters; the features themselves we can make as complex as we want.

Higher-order polynomials. Are more features better? The hypotheses are nested: 2nd order is more general than 1st, 3rd order than 2nd, and so on, so a higher order fits the observed data better.

Overfitting and complexity. More complex models will always fit the training data better, but they may overfit the training data, learning complex relationships that are not really present. [Figure: a simple model vs. a complex model fit to the same data (Y vs. X).]

Test data. After training the model, we go out and get more data from the world: new observations (x, y). How well does our model perform on them?

Training versus test error. Plot MSE as a function of model complexity (polynomial order). The training error decreases: a more complex function fits the training data better. What about new data? From 0th to 1st order the test error decreases (underfitting); at higher orders the test error increases (overfitting). [Figure: mean squared error vs. polynomial order for the training data and for new test data.]

Machine Learning and Data Mining. Linear regression: bias and variance. Prof. Alexander Ihler

Inductive bias. The assumptions needed to predict examples we haven't seen; they make us prefer one model over another (polynomial functions, smooth functions, etc.). Some bias is necessary for learning! [Figure: a simple model vs. a complex model fit to the same data.]

Bias and variance. The world generates the data we observe; consider three different possible data sets drawn from it. Each would give different predictors for any polynomial degree. [Figure: three sampled data sets and the fitted curves of several polynomial degrees for each.]

Detecting overfitting. The overfitting effect: we do better on the training data than on future data, so we need to choose the right complexity. One solution: hold-out data. Separate our data into two sets, training and test; learn only on the training data and use the test data to estimate generalization quality, for model selection. All good competitions use this formulation, often with multiple splits: one by the judges, then another by you.

What to do about under/overfitting? Ways to increase complexity: add features or parameters (we'll see more). Ways to decrease complexity: remove features ("feature selection"), or fail to fully memorize the data (partial training, regularization). [Figure: predictive error vs. model complexity, showing error on training data, error on test data, the underfitting and overfitting regimes, and the ideal range of model complexity.]

Machine Learning and Data Mining. Linear regression: regularization. Prof. Alexander Ihler

Linear regression. A linear model with two data points is determined; a quadratic model with two data points has infinitely many settings with zero error. How do we choose among them? Set the higher-order coefficients to 0? That uses knowledge of where the features came from. We could instead choose, e.g., the parameters of minimum magnitude. Either way, this is a type of bias: it tells us which models to prefer.

Regularization. We can modify our cost function J to add a preference for certain parameter values, e.g. J(θ) = MSE + α ||θ||²; the new solution is derived the same way. The L2 penalty gives ridge regression, and the problem is now well-posed for any degree. Notes: it shrinks the parameters toward zero; for large alpha we prefer small θ over small MSE; the regularization term is independent of the data, so paying more attention to it reduces our model variance.
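A minimal sketch of the ridge (L2-penalized) solution; it assumes the common cost ||Y - X theta||^2 + alpha ||theta||^2 (if the data term is scaled by 1/m as elsewhere in the lecture, that only rescales alpha), and the toy data are illustrative.

import numpy as np

def ridge_fit(X, Y, alpha):
    # Closed-form minimizer of ||Y - X theta||^2 + alpha ||theta||^2
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ Y)

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 10))            # 5 examples, 10 features: underdetermined
Y = rng.normal(size=5)
print(ridge_fit(X, Y, alpha=1.0))       # a unique, well-posed solution even though m < n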

Regularization. Compare unregularized and regularized results. [Figure: fits with alpha = 0 (unregularized) and alpha = 1.]

Different regularization functions. More generally, for the Lp regularizer the isosurfaces are ||θ||_p = constant. [Figure: isosurfaces for p = 0.5, p = 1 (Lasso), p = 2 (quadratic), p = 4.] L0, the limit as p → 0, counts the number of nonzero weights, a natural notion of complexity; L∞, the limit as p → ∞, is the maximum parameter value.

Regularization: L1 vs L2. The estimate balances the data term and the regularization term. [Figure: the point minimizing the data term, the point minimizing the regularization term, and the estimate minimizing the combination.]

Regularization: L1 vs L2. The estimate balances the data term and the regularization term, and the Lasso tends to generate sparser solutions than a quadratic regularizer: with the data term only, all θ_i are non-zero, but in the L1-regularized estimate some θ_i may be exactly zero.

Machine Learning and Data Mining. Linear regression: hold-out, cross-validation. Prof. Alexander Ihler

Model selection. Which of these models fits the data best: p = 0 (constant), p = 1 (linear), p = 3 (cubic)? Or should we use kNN, or other methods? This is the model selection problem. We can't use the training data to decide (especially if the models are nested!). We want to estimate the expected loss J (here MSE) on new data, given the training data set D. [Figure: the data fit with p = 0, p = 1, and p = 3.]

Hold-out method. Hold out some data for evaluation (e.g., a 70/30 split) and train only on the remainder. Some problems if we have few data: few data in the hold-out set gives a noisy estimate of the error, but more hold-out data leaves less for training! [Example: data (x^(i), y^(i)) = (88, 79), (32, -2), (27, 30), (68, 73), (7, -16), (20, 43), (53, 77), (17, 16), (87, 94); training on one subset and validating on the held-out rest gives MSE = 331.8.]
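A minimal sketch of the hold-out procedure: split the data once, fit on the training part, and score MSE on the validation part. The 70/30 fraction follows the slide; the helper name holdout_mse, the seed, and the toy data are illustrative.

import numpy as np

def holdout_mse(X, Y, fit, frac_train=0.7, seed=0):
    # Random 70/30 split: fit on the first part, report MSE on the held-out part
    idx = np.random.default_rng(seed).permutation(len(Y))
    k = int(frac_train * len(Y))
    tr, va = idx[:k], idx[k:]
    theta = fit(X[tr], Y[tr])
    return np.mean((Y[va] - X[va].dot(theta)) ** 2)

# Illustrative use: a constant feature plus one real feature, fit by least squares
rng = np.random.default_rng(5)
x1 = rng.uniform(0, 100, size=9)
X = np.column_stack([np.ones_like(x1), x1])
Y = x1 - 10 + rng.normal(0, 15, size=9)
fit = lambda Xtr, Ytr: np.linalg.lstsq(Xtr, Ytr, rcond=None)[0]
print(holdout_mse(X, Y, fit))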

Cross-validation method: K-fold cross-validation. Divide the data into K disjoint sets; hold out one set (M/K data) for evaluation and train on the others (M(K-1)/K data). [Example, on the same data as above: Split 1 MSE = 331.8, Split 2 MSE = 361.2, Split 3 MSE = 669.8; 3-fold cross-validation MSE = 464.1.]

Cross-validation method: K-fold cross-validation, repeated for a different model. Divide the data into K disjoint sets; hold out one set (M/K data) for evaluation and train on the others (M(K-1)/K data). [Example, same data, different model: Split 1 MSE = 280.5, Split 2 MSE = 3081.3, Split 3 MSE = 1640.1; 3-fold cross-validation MSE = 1667.3.]
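A minimal sketch of K-fold cross-validation matching the procedure above (K = 3 as in the example); the helper name kfold_mse and the random fold assignment are illustrative. Setting K = M gives the leave-one-out case discussed further below.

import numpy as np

def kfold_mse(X, Y, fit, K=3, seed=0):
    # Split the indices into K disjoint folds; hold out each fold once and average the MSEs
    idx = np.random.default_rng(seed).permutation(len(Y))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        va = folds[k]
        tr = np.concatenate([folds[i] for i in range(K) if i != k])
        theta = fit(X[tr], Y[tr])
        scores.append(np.mean((Y[va] - X[va].dot(theta)) ** 2))
    return np.mean(scores)

# Example: reuse X, Y, and fit from the hold-out sketch above
# print(kfold_mse(X, Y, fit, K=3))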

Cross-validation. Advantages: lets us use more (all M) validation data, giving a less noisy estimate of test performance. Disadvantages: more work, training K models instead of just one; it doesn't evaluate any particular predictor, but rather evaluates K different models and averages, scoring the hyperparameters / procedure rather than an actual, specific predictor. Also: it is still estimating the error for M' < M data.

Learning curves. Plot performance as a function of training set size to assess the impact of having fewer data (e.g., 1/MSE for regression, or 1 - Err for classification). With few data, more data significantly improve performance; with enough data, performance saturates. [Figure: 1/MSE vs. training data size m.] If the slope is still high, decreasing m (to make room for validation / cross-validation) might have a big impact.

Leave-one-out cross-validation. When K = M (the number of data points), we get leave-one-out cross-validation: train on all of the data except one point, evaluate on the left-out point, repeat M times (each data point held out once), and average the resulting MSEs. [Example: the same data (x^(i), y^(i)) as above, with each point held out in turn.]

Cross-validation issues. We need to balance the computational burden (multiple trainings) against the accuracy of the estimated performance / error. Single hold-out set: estimates performance with M' < M data (important? check the learning curve); needs enough data to trust the performance estimate; estimates the performance of a particular, trained learner. K-fold cross-validation: K times as much work computationally; better estimates, though still of performance with M' < M data. Leave-one-out cross-validation: M times as much work computationally; M' ≈ M, but the overall error estimate may have high variance.