Linear Regression. Volker Tresp 2014

Size: px

Start display at page:

Download "Linear Regression. Volker Tresp 2014"

Morgan Snow
5 years ago
Views:

1 Linear Regression Volker Tresp

2 Learning Machine: The Linear Model / ADALINE As with the Perceptron we start with an activation functions that is a linearly weighted sum of the inputs h i = M 1 j=0 w i,j x i,j (Note: x i,0 = 1 is a constant input, so that w 0 is the bias) The activation is the the output (no thresholding) ŷ i = f(x i ) = h i Regression: when the target function can take on real values 2

3 Method of Least Squares Squared-loss cost function: cost(w) = N (y i f(x i, w)) 2 i=1 The parameters that minimize the cost function are called least squares (LS) estimators w ls = arg min w cost(w) For visualization, on chooses M = 2 (although linear regression is often applied to high-dimensional inputs) 3

4 One-dimensional regression: f(x, w) = w 0 + w 1 x w = (w 0, w 1 ) T Squared error: Least-squares Estimator for Regression Goal: cost(w) = N (y i f(x i, w)) 2 i=1 w ls = arg min w cost(w) w 0 = 1, w 1 = 2, var(ɛ) = 1 4

5 Least-squares Estimator in General General Model: f(x i, w) = w 0 + M 1 j=1 w j x i,j = x T i w w = (w 0, w 1,... w M 1 ) T x i = (1, x i,1,..., x i,m 1 ) T 5

6 Linear Regression with Several Inputs 6

7 Contribution to the Cost Function of one Data Point 7

8 Gradient Descent Learning Initialize parameters (typically using small random numbers) Adapt the parameters in the direction of the negative gradient With cost(w) = N yi M 1 2 w j x i,j i=1 j=0 The parameter gradient is (Example: w j ) cost w j N = 2 (y i f(x i ))x i,j i=1 A sensible learning rule is w j w j + η N (y i f(x i ))x i,j i=1 8

9 ADALINE-Learning Rule ADALINE: ADAptive LINear Element The ADALINE uses stochastic gradient descent (SGE) Let x t and y t be the training pattern in iteration t. The we adapt, t = 1, 2,... w j w j + η(y t ŷ t )x t,j j = 1, 2,..., M η > 0 is the learning rate, typically 0 < η << 0.1 Compare: the Perceptron learning rule (only applied to misclassified patterns) w j w j + ηy t x t,j j = 1,..., M 9

10 Analytic Solution The least-squares solution can be calculated in one step 10

11 Cost Function in Matrix Form cost(w) = N (y i f(x i, w)) 2 i=1 = (y Xw) T (y Xw) y = (y 1,..., y N ) T X = x 1,0... x 1,M x N,0... x N,M 1 11

12 Calculating the First Derivative Matrix calculus: Thus cost(w) w = (y Xw) w 2(y Xw) = 2X T (y Xw) 12

13 Setting First Derivative to Zero Calculating the LS-solution: cost(w) w = 2X T (y Xw) = 0 ŵ ls = (X T X) 1 X T y Complexity (linear in N!): O(M 3 + NM 2 ) ŵ 0 = 0.75, ŵ 1 =

14 Stability of the Solution When N >> M, the LS solution is stable (small changes in the data lead to small changes in the paramater estimates) When N < M then there are many solutions which all produce the zero training error Of all these solutions, one selects the one that minimizes M i=0 wi 2 solution) (regularised Even with N > M it is advantageous to regularize the solution, in particular with noise on the target 14

15 Linear Regression and Regularisation Regularised cost function (Penalized Least Squares (PLS), Ridge Regression, Weight Decay): the influence of a single data point should be small cost pen (w) = N (y i f(x i, w)) 2 + λ i=1 M 1 i=0 w 2 i ŵ pen = ( X T X + λi) 1 X T y Derivation: J pen N (w) w = 2X T (y Xw) + 2λw = 2[ X T y + (X T X + λi)w] 15

16 Example Three data points are generated as (true model) Here, ɛ i is independent noise y i = x i,1 + ɛ i (correct) model 1 f(x i ) = w 0 + w 1 x i,1 Training data for model 1: x 1 y The LS solution gives w ls = (0.58, 0.77) In comparison,the true parameters are: w = (0.50, 1.00) 16

17 Model 2 Here we generate a second correlated input x i,2 = x i,1 + δ i Again, δ i is uncorrelated noise Modell 2 f(x i ) = w 0 + w 1 x i,1 + w 2 x i,2 Daten, die Modell 2 sieht: x 1 x 2 y Die least squares solution gives w ls = (0.67, 136, 137)!!! 17

18 Model 2 with Regularisation All as before, only that large weights are penalized Die penalized least squares solution gives w pen = (0.58, 0.38, 0.39)!!! Compare: the LS-solution for model-2 gave w ls = (0.58, 0.77) The collinearity (strong correlation of the inputs) hurts the LS-solution but does not hurt the penalized LS solution. We even obtain a higher robustness with respect to errors in the inputs, since weights are smaller! 18

19 Training Data Training: y M1 : ŷ ML M2 : ŷ ML M2 : ŷ pen For Model 1 and Model 2 with regularization we have nonzero error on the training data For Model 2 without regularization, the training error is zero If we only consider the training error, we would prefer Model 2 19

20 Test Data Test Data: y M1 : ŷ ML M2 : ŷ ML M2 : ŷ pen On test data model 1 and model 2 with regularization give better results Even more dramatic: extrapolation 20

21 Experiments with real world data: data from Prostate Cancer 8 Inputs, 97 data points; y: Prostate-specific antigen 10-times cross validation LS Best Subset (3) Ridge (Weight Decay)

22 GWAS Study Correlation with disease (systemic sclerosis) versus location of SNPs on the gene. The regression weight of a single SNP as an input is calculated with other inputs representing general personal traits (male/femal, Caucasian, Asian, PCA features,...). Repeated for all SNPs (maybe 1 Mio). 22

Linear Regression. Volker Tresp 2018

Linear Regression. Volker Tresp 2018 Linear Regression Volker Tresp 2018 1 Learning Machine: The Linear Model / ADALINE As with the Perceptron we start with an activation functions that is a linearly weighted sum of the inputs h = M j=0 w