Linear Regression. Volker Tresp 2018


Learning Machine: The Linear Model / ADALINE

As with the Perceptron, we start with an activation that is a linearly weighted sum of the inputs: $h = \sum_{j=0}^{M} w_j x_j$ (note: $x_0 = 1$ is a constant input, so that $w_0$ is the bias). New: the activation is the output (no thresholding), $\hat{y} = f_w(x) = h$. Regression: the target function can take on real values.

Method of Least Squares

Squared-loss cost function: $\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2$. The parameters that minimize the cost function are called the least squares (LS) estimators: $w_{\mathrm{ls}} = \arg\min_w \mathrm{cost}(w)$. For visualization, we take $M = 1$ (although linear regression is often applied to high-dimensional inputs).

Least-squares Estimator for Regression

One-dimensional regression: $f_w(x) = w_0 + w_1 x$, with $w = (w_0, w_1)^T$. Squared error: $\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2$. Goal: $w_{\mathrm{ls}} = \arg\min_w \mathrm{cost}(w)$. (Example in the figure: $w_0 = 1$, $w_1 = 2$, $\mathrm{var}(\epsilon) = 1$.)

Least-squares Estimator in Several Dimensions

General model: $\hat{y}_i = f_w(x_i) = w_0 + \sum_{j=1}^{M} w_j x_{i,j} = x_i^T w$, with $w = (w_0, w_1, \ldots, w_M)^T$ and $x_i = (1, x_{i,1}, \ldots, x_{i,M})^T$.

Linear Regression with Several Inputs (figure)

Contribution to the Cost Function of One Data Point (figure)

Predictions as Matrix-Vector Product

The vector of all predictions at the training data is $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N)^T = Xw$.
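
To make the matrix-vector form concrete, here is a minimal NumPy sketch (not from the slides); the data, dimensions, and weight values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: N = 5 data points, M = 2 inputs.
rng = np.random.default_rng(0)
N, M = 5, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M))])  # prepend the constant column x_0 = 1
w = np.array([0.5, 1.0, -2.0])                             # (w_0, w_1, w_2)^T

y_hat = X @ w    # all N predictions in a single matrix-vector product
print(y_hat)
```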

Gradient Descent Learning

Initialize the parameters (typically using small random numbers). Adapt the parameters in the direction of the negative gradient. With $f_w(x_i) = \sum_{j=0}^{M} w_j x_{i,j}$ and $\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2$, the parameter gradient is (example: $w_j$)
$$\frac{\partial \mathrm{cost}}{\partial w_j} = -2 \sum_{i=1}^{N} (y_i - f_w(x_i))\, x_{i,j}.$$
A sensible learning rule is
$$w_j \leftarrow w_j + \eta \sum_{i=1}^{N} (y_i - f_w(x_i))\, x_{i,j}.$$
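
A minimal batch gradient descent sketch of this rule, assuming X already contains the bias column; the function name, learning rate, and step count are illustrative choices, not from the slides.

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, n_steps=500):
    """Batch gradient descent on the squared-error cost; X must already contain the bias column."""
    w = 0.01 * np.random.default_rng(0).normal(size=X.shape[1])  # small random initialization
    for _ in range(n_steps):
        residual = y - X @ w           # y_i - f_w(x_i) for all i
        w = w + eta * X.T @ residual   # w_j <- w_j + eta * sum_i (y_i - f_w(x_i)) x_{i,j}
    return w
```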

ADALINE Learning Rule

ADALINE: ADAptive LINear Element. The ADALINE uses stochastic gradient descent (SGD). Let $x_t$ and $y_t$ be the training pattern in iteration $t$. Then we adapt, for $t = 1, 2, \ldots$,
$$w_j \leftarrow w_j + \eta (y_t - \hat{y}_t)\, x_{t,j}, \qquad j = 0, 1, 2, \ldots, M.$$
$\eta > 0$ is the learning rate, typically $0 < \eta \ll 0.1$. This is identical to the Perceptron learning rule; for the Perceptron, $y_t \in \{-1, 1\}$ and $\hat{y}_t \in \{-1, 1\}$.
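
A corresponding SGD sketch; the function name, learning rate, epoch count, and the random ordering of patterns are illustrative assumptions.

```python
import numpy as np

def adaline_sgd(X, y, eta=0.01, n_epochs=20):
    """ADALINE / stochastic gradient descent: one training pattern (x_t, y_t) per update."""
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=X.shape[1])
    for _ in range(n_epochs):
        for t in rng.permutation(len(y)):          # visit the training patterns in random order
            y_hat_t = X[t] @ w                     # current prediction for pattern t
            w = w + eta * (y[t] - y_hat_t) * X[t]  # w_j <- w_j + eta (y_t - y_hat_t) x_{t,j}
    return w
```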

Analytic Solution

The ADALINE is optimized by SGD. Online adaptation: a physical system constantly produces new data, so the ADALINE (and SGD in general) can even track changes in the system. With a fixed training data set, the least-squares solution can be calculated analytically in one step (least-squares regression).

Cost Function in Matrix Form

$$\mathrm{cost}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2 = (y - Xw)^T (y - Xw)$$
with $y = (y_1, \ldots, y_N)^T$ and
$$X = \begin{pmatrix} x_{1,0} & \cdots & x_{1,M} \\ \vdots & \ddots & \vdots \\ x_{N,0} & \cdots & x_{N,M} \end{pmatrix}.$$

Necessary Condition for an Optimum

A necessary condition for an optimum is that
$$\left.\frac{\partial \mathrm{cost}(w)}{\partial w}\right|_{w = w_{\mathrm{opt}}} = 0.$$

One Parameter: Explicit

$f_w(x_1) = x_1 w_1$ and $\mathrm{cost}(w_1) = \sum_{i=1}^{N} (y_i - x_{i,1} w_1)^2$. (Chain rule: inner derivative times outer derivative.)
$$\frac{\partial \mathrm{cost}(w_1)}{\partial w_1} = \sum_{i=1}^{N} \frac{\partial (y_i - x_{i,1} w_1)}{\partial w_1}\, 2 (y_i - x_{i,1} w_1) = -2 \sum_{i=1}^{N} x_{i,1} (y_i - x_{i,1} w_1) = -2 \sum_{i=1}^{N} x_{i,1} y_i + 2 w_1 \sum_{i=1}^{N} x_{i,1} x_{i,1}$$
Thus
$$w_{1,\mathrm{ls}} = \left( \sum_{i=1}^{N} x_{i,1} x_{i,1} \right)^{-1} \sum_{i=1}^{N} x_{i,1} y_i.$$

One Parameter: in Vector Notation

$f_w(x_1) = x_1 w_1$ and $\mathrm{cost}(w_1) = (y - \tilde{x}_1 w_1)^T (y - \tilde{x}_1 w_1)$, where $\tilde{x}_1 = (x_{1,1}, \ldots, x_{N,1})^T$. (Chain rule: inner derivative times outer derivative.)
$$\frac{\partial \mathrm{cost}(w_1)}{\partial w_1} = \frac{\partial (y - \tilde{x}_1 w_1)}{\partial w_1}\, 2 (y - \tilde{x}_1 w_1) = -2\, \tilde{x}_1^T (y - \tilde{x}_1 w_1) = -2\, \tilde{x}_1^T y + 2 w_1\, \tilde{x}_1^T \tilde{x}_1$$
Thus
$$w_{1,\mathrm{ls}} = (\tilde{x}_1^T \tilde{x}_1)^{-1} \tilde{x}_1^T y.$$

General Case

$f_w(x) = x^T w$ and $\mathrm{cost}(w) = (y - Xw)^T (y - Xw)$. (Chain rule: inner derivative times outer derivative.)
$$\frac{\partial \mathrm{cost}(w)}{\partial w} = \frac{\partial (y - Xw)}{\partial w}\, 2 (y - Xw) = -2 X^T (y - Xw) = -2 X^T y + 2 X^T X w$$
Thus
$$w_{\mathrm{ls}} = (X^T X)^{-1} X^T y.$$

Setting First Derivative to Zero

$$\hat{w}_{\mathrm{ls}} = (X^T X)^{-1} X^T y$$
Complexity (linear in $N$): $O(M^3 + N M^2)$. (For the one-dimensional example: $\hat{w}_0 = 0.75$, $\hat{w}_1 = 2.13$.)
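
A minimal sketch of the closed-form solution in NumPy, assuming a data set generated like the one-dimensional example above ($w_0 = 1$, $w_1 = 2$, noise variance 1); exact numbers will differ from the slide's.

```python
import numpy as np

def least_squares(X, y):
    """Analytic LS solution w_ls = (X^T X)^{-1} X^T y; solve() avoids forming the explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical 1-D example: w_0 = 1, w_1 = 2, noise variance 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)
X = np.column_stack([np.ones_like(x), x])
print(least_squares(X, y))   # roughly [1, 2]
```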

Derivatives of Vector Products

We have used
$$\frac{\partial}{\partial x} Ax = A^T, \qquad \frac{\partial}{\partial x} x^T x = 2x, \qquad \frac{\partial}{\partial x} x^T A x = (A + A^T)\, x.$$
Comment: one also finds the conventions
$$\frac{\partial}{\partial x} Ax = A, \qquad \frac{\partial}{\partial x} x^T x = 2 x^T, \qquad \frac{\partial}{\partial x} x^T A x = x^T (A + A^T).$$
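
The quadratic-form identity can be checked numerically with central finite differences; the matrix, vector, and tolerance below are made up for illustration.

```python
import numpy as np

# Check d/dx (x^T A x) = (A + A^T) x by central finite differences.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

analytic = (A + A.T) @ x
eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)                     # perturb one coordinate at a time
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```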

Stability of the Solution

When $N \gg M$, the LS solution is stable (small changes in the data lead to small changes in the parameter estimates). When $N < M$, there are many solutions which all produce zero training error. Of all these solutions, one selects the one that minimizes $\sum_{i=0}^{M} w_i^2 = w^T w$ (regularised solution). Even with $N > M$ it is advantageous to regularize the solution, in particular with noise on the target.

Linear Regression and Regularisation

Regularised cost function (Penalized Least Squares (PLS), Ridge Regression, Weight Decay): the influence of a single data point should be small.
$$\mathrm{cost}_{\mathrm{pen}}(w) = \sum_{i=1}^{N} (y_i - f_w(x_i))^2 + \lambda \sum_{i=0}^{M} w_i^2$$
$$\hat{w}_{\mathrm{pen}} = (X^T X + \lambda I)^{-1} X^T y$$
Derivation:
$$\frac{\partial \mathrm{cost}_{\mathrm{pen}}(w)}{\partial w} = -2 X^T (y - Xw) + 2 \lambda w = 2 \left[ -X^T y + (X^T X + \lambda I)\, w \right]$$
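
A minimal ridge sketch; note that, as on the slide, the penalty includes the bias weight $w_0$, whereas in practice the bias is often left unpenalized.

```python
import numpy as np

def ridge(X, y, lam):
    """Penalized LS / ridge regression: w_pen = (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]                      # number of columns (bias column plus M inputs)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```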

ADALINE Learning Rule with Weight Decay

Let $x_t$ and $y_t$ be the training pattern in iteration $t$. Then we adapt, for $t = 1, 2, \ldots$,
$$w_j \leftarrow w_j + \eta \left[ (y_t - \hat{y}_t)\, x_{t,j} - \frac{\lambda}{N} w_j \right], \qquad j = 0, 1, 2, \ldots, M.$$
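
The same SGD sketch as before with the weight-decay term added to each update; names and defaults remain illustrative.

```python
import numpy as np

def adaline_sgd_weight_decay(X, y, lam, eta=0.01, n_epochs=20):
    """ADALINE updates with weight decay: w <- w + eta [ (y_t - y_hat_t) x_t - (lam/N) w ]."""
    rng = np.random.default_rng(0)
    N = len(y)
    w = 0.01 * rng.normal(size=X.shape[1])
    for _ in range(n_epochs):
        for t in rng.permutation(N):
            w = w + eta * ((y[t] - X[t] @ w) * X[t] - (lam / N) * w)
    return w
```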

Toy Example: Univariate Model (Pearson Correlation Coefficient)

We generated $N = 100$ data points with $M = 3$ (no bias). $x_1$ and $x_2$ are highly correlated; $x_3$ is independent of $x_1$, $x_2$, and $y$. We generate output data with $y = x_1 + \epsilon$, where $\epsilon$ stands for independent noise with standard deviation 0.2 and thus variance 0.04. Thus the ground-truth parameters are $w_{\mathrm{true}} = (1, 0, 0)^T$; note that $y$ causally depends only on $x_1$. All variables are normalized to zero mean and unit variance. In univariate models, with only one input, the weights are identical to the sample Pearson correlation coefficients between the output and the input (here: $r_j = \sum_i y_i x_{i,j} / N$). I get $r_1 = 0.99$, $r_2 = 0.96$, $r_3 = 0.21$. A deeper analysis (see Appendix) reveals that the estimate $r_1$ has a mean of 1 and a standard deviation of 0.02; $r_1$ reflects the dependency of $y$ on $x_1$.

The second coefficient, $r_2 = 0.96$, does not reflect a causal effect but rather the fact that $x_1$ and $x_2$ are highly correlated, and thus so are $y$ and $x_2$ (correlation does not imply causality). A deeper analysis (see Appendix) reveals that, with perfect correlation between $x_1$ and $x_2$, the estimate $r_2$ also would have a mean of 1 and a standard deviation of 0.02. The third value, $r_3$, is correctly closer to 0, but not really small in magnitude. A deeper analysis (see Appendix) reveals that the estimate $r_3$ has a mean of 0 and a standard deviation of approximately 0.1.

Toy Example: Least Squares Regression

We get experimentally:
$$X^T X = \begin{pmatrix} 100 & 98 & 18 \\ 98 & 100 & 16 \\ 18 & 16 & 100 \end{pmatrix}$$
This is approximately $N\,\mathrm{cov}(x)$; we see the strong correlation between $x_1$ and $x_2$.
$$(X^T X)^{-1} = \begin{pmatrix} 0.255 & -0.249 & -0.007 \\ -0.249 & 0.253 & 0.005 \\ -0.007 & 0.005 & 0.010 \end{pmatrix}$$
$$X^T y = (99, 97, 20)^T$$
This is approximately $N r$; we see the strong correlation of both $x_1$ and $x_2$ with $y$.

Toy Example: Least Squares Regression (cont'd)

We get $w_{\mathrm{ls}} = (X^T X)^{-1} X^T y = (1.137, -0.150, 0.018)^T$. Interestingly, linear regression pretty much identifies the correct causality, with $w_{\mathrm{ls},1} \approx 1$ and $w_{\mathrm{ls},2} \approx 0$! A deeper analysis (see Appendix) reveals that $w_{\mathrm{ls},1}$ has a mean of 1 and a standard deviation of 0.1. So the estimator is unbiased, but the uncertainty is larger than in the univariate analysis. $\hat{w}_2$ has a mean of zero and a standard deviation of 0.1; thus the bias is removed compared to Pearson! $w_{\mathrm{ls},1}$ and $w_{\mathrm{ls},2}$ are negatively correlated; note that $w_{\mathrm{ls},1} + w_{\mathrm{ls},2} = 0.987$, which is close to the true 1. $w_{\mathrm{ls},3} = 0.018$ is much closer to 0 than the sample Pearson correlation coefficient $r_3 = 0.21$.

A deeper analysis (see Appendix) reveals that $w_{\mathrm{ls},3}$ has a mean of 0 and a standard deviation of 0.02. Here it is important to see that the standard deviation of the spurious input is largely reduced! Intuitive explanation: the cost function can be written as $\sum_i ([y_i - w_1 x_{i,1} - w_2 x_{i,2}] - w_3 x_{i,3})^2$, so $w_3 x_{i,3}$ only needs to fit the residual target $[y_i - w_1 x_{i,1} - w_2 x_{i,2}]$ instead of the original target $y_i$. Overall, in regression, the causal influence of $x_1$ stands out much more clearly: both the influence of the correlated input $x_2$ and of the noise input $x_3$ are largely reduced. Application in healthcare: same data, but consider that $x_2$ is a medication and $y$ the outcome. In a univariate analysis, I would see a strong positive influence of $x_2$ on $y$ (the medication appears to work). Only if I include the so-called confounder $x_1$ in the regression model does it become clear that the confounder $x_1$ is the cause and not the treatment $x_2$: the treatment has no significant effect!

Toy Example: Penalized Least Squares Regression

With $\lambda = 0.6$ we get:
$$X^T X + \lambda I = \begin{pmatrix} 100.6 & 98 & 19 \\ 98 & 100.6 & 17 \\ 19 & 17 & 100.6 \end{pmatrix}$$
$$(X^T X + \lambda I)^{-1} = \begin{pmatrix} 0.197 & -0.191 & -0.005 \\ -0.191 & 0.195 & 0.003 \\ -0.005 & 0.003 & 0.010 \end{pmatrix}$$
$$X^T y = (99, 97, 20)^T$$
$$w_{\mathrm{pen}} = (X^T X + \lambda I)^{-1} X^T y = (0.990, 0.005, 0.021)^T$$
Note that $w_{\mathrm{pen},2}$ is even closer to the ground truth!
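
A sketch that roughly reproduces this toy experiment; the exact data-generating choices (correlation strength, random seed) are assumptions, so the numbers will not match the slides exactly.

```python
import numpy as np

# Assumed data-generating process as described above: x1, x2 highly correlated,
# x3 independent, y = x1 + noise with std 0.2, all columns standardized.
rng = np.random.default_rng(0)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 0.15 * rng.normal(size=N)        # strongly correlated with x1
x3 = rng.normal(size=N)                    # independent of x1, x2, and y
y = x1 + 0.2 * rng.normal(size=N)

X = np.column_stack([x1, x2, x3])
X = (X - X.mean(0)) / X.std(0)             # normalize to zero mean, unit variance
y = (y - y.mean()) / y.std()

r = X.T @ y / N                                              # univariate Pearson coefficients
w_ls = np.linalg.solve(X.T @ X, X.T @ y)                     # least squares
w_pen = np.linalg.solve(X.T @ X + 0.6 * np.eye(3), X.T @ y)  # ridge with lambda = 0.6
print(r, w_ls, w_pen)
```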

Remarks

The Pearson correlation coefficient does not reflect causality; the regression coefficients reflect the causal structure much more closely. If one is only interested in prediction accuracy, adding inputs liberally in regression can be beneficial if regularization is used (in ad placement and ad bidding, hundreds or thousands of features are used). The weight parameters of useless (noisy) features become close to zero with regularization (ill-conditioned parameters). Regularization is especially important when $N \approx M$ and when $N < M$. If parameter interpretation is essential, or if one wants to keep the number of inputs small for computational reasons: forward selection; start with the empty model and, at each step, add the input that reduces the error most.

Backward selection (pruning): start with the full model and, at each step, remove the input that increases the error the least. There is no guarantee, however, that one finds the best subset of inputs or that one finds the true inputs (a forward-selection sketch follows below).
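
A greedy forward-selection sketch based on training squared error; the function name and the use of training error (rather than validation error) as the selection criterion are simplifying assumptions.

```python
import numpy as np

def forward_selection(X, y, n_select):
    """Greedy forward selection: start with the empty model and, at each step, add the
    input that reduces the training squared error the most (no optimality guarantee)."""
    def sse(cols):
        Xs = X[:, cols]
        w = np.linalg.lstsq(Xs, y, rcond=None)[0]
        return np.sum((y - Xs @ w) ** 2)

    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best = min(remaining, key=lambda j: sse(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```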

Experiments with Real-World Data: Data from Prostate Cancer Patients

8 inputs, 97 data points; $y$: prostate-specific antigen. 10-fold cross-validation error: LS 0.586; Best Subset (3) 0.574; Ridge (Penalized) 0.540.

Examples Where High-Dimensional Linear Systems Are Used

Ranking in search engines (relevance of a web page to a query); ad placement (where to put which ad for a user); GWAS.

Genome-wide Association Study (GWAS)

The trait (here: the disease systemic sclerosis) is the output and the SNPs are the inputs. The major allele is encoded as 0 and the minor allele as 1; thus $w_j$ is the influence of SNP $j$ on the trait. Shown is the (log of the) p-value of $w_j$, ordered by the locations on the chromosomes. The weights can be calculated by penalized least squares (ridge regression). Based solely on the Pearson correlation, the plot would show many more (non-causal) associations; the regression analysis reduces the apparent influence of non-causal correlated inputs and the influence of uncorrelated inputs. In practice one often uses an elastic net penalty, $\lambda_2 \sum_j w_j^2 + \lambda_1 \sum_j |w_j|$, where the lasso penalty $\lambda_1 \sum_j |w_j|$ increases sparsity.
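
As a hedged illustration of the elastic net, scikit-learn's ElasticNet can be used; its alpha and l1_ratio parameters correspond to $(\lambda_1, \lambda_2)$ only up to scikit-learn's own scaling conventions, and the SNP-like data below are simulated.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Simulated high-dimensional data: many inputs, only a few truly relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:3] = (1.0, -0.5, 0.25)
y = X @ w_true + 0.1 * rng.normal(size=200)

model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
print(np.sum(model.coef_ != 0))   # the lasso part drives many coefficients exactly to zero
```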

Appendix: A Deeper Analysis of Pearson versus Regression*

The Pearson correlation coefficient is in the mean approximately $(1, 1, 0)$. The variance of $r_1$ and $r_2$ can be estimated as $\mathrm{var} = \sigma^2 / N = 0.04/100 = 0.0004$, giving a standard deviation of $\mathrm{stdev} = \sqrt{\mathrm{var}} = 0.02$. For $r_3$ we get a variance of $\mathrm{var}(r_3) = \mathrm{var}(y)/N = 1/N = 0.01$ and a standard deviation of $\mathrm{stdev}(r_3) = 0.1$. Comment: $r_2$ does not reflect the true dependency, and the variance of $r_3$ is relatively large. Since linear regression is unbiased, the parameter estimates have means $1, 0, 0$ (unbiased solutions). For the covariances we get $\mathrm{cov}(w_{\mathrm{ls}}) = \sigma^2 (X^T X)^{-1}$. The variances are then (we consider the diagonal terms)
$$\mathrm{var}(w_{\mathrm{ls},1}) = \mathrm{var}(w_{\mathrm{ls},2}) \approx 0.04 \cdot 0.25 = 0.01, \qquad \mathrm{var}(w_{\mathrm{ls},3}) \approx 0.04 \cdot 0.01 = 0.0004,$$

and
$$\mathrm{stdev}(w_{\mathrm{ls},1}) = \mathrm{stdev}(w_{\mathrm{ls},2}) \approx 0.1, \qquad \mathrm{stdev}(w_{\mathrm{ls},3}) \approx 0.02.$$
Thus the estimates are unbiased; the uncertainty of $w_{\mathrm{ls},3}$ is greatly reduced, and the estimate is thus closer to zero!
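
A Monte Carlo sketch of the covariance formula $\mathrm{cov}(w_{\mathrm{ls}}) = \sigma^2 (X^T X)^{-1}$ in the toy setting; the data-generating details are assumptions as before.

```python
import numpy as np

# Compare the empirical spread of the LS estimates with sigma^2 (X^T X)^{-1}.
rng = np.random.default_rng(0)
N, sigma = 100, 0.2
x1 = rng.normal(size=N)
x2 = x1 + 0.15 * rng.normal(size=N)
x3 = rng.normal(size=N)
X = np.column_stack([x1, x2, x3])
X = (X - X.mean(0)) / X.std(0)

estimates = []
for _ in range(2000):                      # redraw only the noise, keep X fixed
    y = X[:, 0] + sigma * rng.normal(size=N)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.std(estimates, axis=0))                            # empirical stdev of (w1, w2, w3)
print(np.sqrt(sigma**2 * np.diag(np.linalg.inv(X.T @ X))))  # theoretical stdev
```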