9. Least squares data fitting
L. Vandenberghe, EE133A (Spring 2017)

9. Least squares data fitting

- model fitting
- regression
- linear-in-parameters models
- time series examples
- validation
- least squares classification
- statistics interpretation

Least squares data fitting 9-1
Model fitting

suppose x and a scalar quantity y are related as $y \approx f(x)$

- x is the explanatory variable or independent variable
- y is the outcome, response variable, or dependent variable
- we don't know f, but have some idea about its general form

Model fitting: find an approximate model $\hat f$ for f, based on observations; we use the notation $\hat y$ for the model prediction of the outcome y: $\hat y = \hat f(x)$

Least squares data fitting 9-2
Prediction error

- we have data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$; data are also called observations, examples, samples, or measurements
- the model prediction for data sample i is $\hat y_i = \hat f(x_i)$
- the prediction error or residual for data sample i is $r_i = \hat y_i - y_i = \hat f(x_i) - y_i$
- the model $\hat f$ fits the data well if the N residuals $r_i$ are small
- prediction error can be quantified using the mean square error (MSE) $\frac{1}{N}\sum_{i=1}^N r_i^2$; the square root of the MSE is the RMS error

Least squares data fitting 9-3
Outline

- model fitting
- regression
- linear-in-parameters models
- time series examples
- validation
- least squares classification
- statistics interpretation
Regression

we first consider the regression model (page 1-32): $\hat f(x) = x^T\beta + v$

- the independent variable x is an n-vector; the elements of x are the regressors
- the model is parameterized by the weight vector $\beta$ and the offset (intercept) v
- the prediction error for data sample i is $r_i = \hat f(x_i) - y_i = x_i^T\beta + v - y_i$
- the MSE is $\frac{1}{N}\sum_{i=1}^N r_i^2 = \frac{1}{N}\sum_{i=1}^N (x_i^T\beta + v - y_i)^2$

Least squares data fitting 9-4
Least squares regression

choose the model parameters v, $\beta$ that minimize the MSE $\frac{1}{N}\sum_{i=1}^N (x_i^T\beta + v - y_i)^2$

this is a least squares problem: minimize $\|A\theta - y\|^2$ with

$A = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_N^T \end{bmatrix}, \quad \theta = \begin{bmatrix} v \\ \beta \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

we write the solution as $\hat\theta = (\hat v, \hat\beta)$

Least squares data fitting 9-5
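As a minimal sketch of this construction (on synthetic data, not the course's data set), one can stack the matrix $A = [\,1 \;\; x_i^T\,]$ and solve the least squares problem with NumPy:

```python
import numpy as np

# Hypothetical small data set: N samples of an n-vector x and a scalar outcome y.
rng = np.random.default_rng(0)
N, n = 50, 2
X = rng.normal(size=(N, n))                 # row i holds x_i^T
true_beta, true_v = np.array([1.5, -2.0]), 0.5
y = X @ true_beta + true_v + 0.01 * rng.normal(size=N)

# Stack A = [1  x_i^T] and solve  minimize ||A theta - y||^2.
A = np.column_stack([np.ones(N), X])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
v_hat, beta_hat = theta[0], theta[1:]
```

With small noise, `v_hat` and `beta_hat` recover the coefficients used to generate the data.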
Example: house price regression model

example of page 1-32: $\hat y = x^T\beta + v$

- $\hat y$ is predicted sales price (in 1000 dollars); y is actual sales price
- two regressors: $x_1$ is house area; $x_2$ is number of bedrooms
- data set of N = 774 house sales
- RMS error of least squares fit is 74.8

(figure: scatter plot of predicted price $\hat y$ versus actual price y)

Least squares data fitting 9-6
Example: house price regression model

regression model with additional regressors: $\hat y = x^T\beta + v$; the feature vector x has 7 elements

- $x_1$ is area of the house (in 1000 square feet)
- $x_2 = \max\{x_1 - 1.5, 0\}$, i.e., area in excess of 1.5 (in 1000 square feet)
- $x_3$ is number of bedrooms
- $x_4$ is one for a condo; zero otherwise
- $x_5$, $x_6$, $x_7$ are 0/1 indicators that specify the location (four groups of ZIP codes, labeled A, B, C, D)

Least squares data fitting 9-7
Example: house price regression model

use least squares to fit the eight model parameters v, $\beta$; RMS fitting error is …

(figure: scatter plot of predicted price $\hat y$ versus actual price y, in thousand dollars)

Least squares data fitting 9-8
Outline

- model fitting
- regression
- linear-in-parameters models
- time series examples
- validation
- least squares classification
- statistics interpretation
Linear-in-parameters model

we choose the model $\hat f(x)$ from a family of models $\hat f(x) = \theta_1 f_1(x) + \theta_2 f_2(x) + \cdots + \theta_p f_p(x)$

- the functions $f_i$ are scalar-valued basis functions (chosen by us)
- the basis functions often include a constant function (typically, $f_1(x) = 1$)
- the coefficients $\theta_1, \ldots, \theta_p$ are the model parameters
- the model $\hat f(x)$ is linear in the parameters $\theta_i$
- if $f_1(x) = 1$, this can be interpreted as a regression model $\hat y = \beta^T \tilde x + v$ with parameters $v = \theta_1$, $\beta = \theta_{2:p}$, and new features $\tilde x$ generated from x: $\tilde x_1 = f_2(x), \ldots, \tilde x_{p-1} = f_p(x)$

Least squares data fitting 9-9
Least squares model fitting

fit a linear-in-parameters model to the data set $(x_1, y_1), \ldots, (x_N, y_N)$

- the residual for data sample i is $r_i = \hat f(x_i) - y_i = \theta_1 f_1(x_i) + \cdots + \theta_p f_p(x_i) - y_i$
- least squares model fitting: choose the parameters $\theta$ by minimizing the MSE $\frac{1}{N}(r_1^2 + r_2^2 + \cdots + r_N^2) = \frac{1}{N}\|r\|^2$

this is a least squares problem: minimize $\|A\theta - y\|^2$ with

$A = \begin{bmatrix} f_1(x_1) & f_2(x_1) & \cdots & f_p(x_1) \\ f_1(x_2) & f_2(x_2) & \cdots & f_p(x_2) \\ \vdots & \vdots & & \vdots \\ f_1(x_N) & f_2(x_N) & \cdots & f_p(x_N) \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

Least squares data fitting 9-10
Example: polynomial approximation

$\hat f(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \cdots + \theta_p x^{p-1}$

- a linear-in-parameters model with basis functions $1, x, \ldots, x^{p-1}$
- least squares model fitting: choose the parameters $\theta$ by minimizing the MSE $\frac{1}{N}\left( (\hat f(x_1) - y_1)^2 + (\hat f(x_2) - y_2)^2 + \cdots + (\hat f(x_N) - y_N)^2 \right)$
- in matrix notation: minimize $\|A\theta - y\|^2$ with

$A = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{p-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^{p-1} \end{bmatrix}$

Least squares data fitting 9-11
Example

(figure: least squares polynomial fits $\hat f(x)$ of degree 2 (p = 3), 6, 10, and 15 to a data set of 100 samples)

Least squares data fitting 9-12
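A sketch of the polynomial least squares fit, using a synthetic data set (the matrix A above is exactly what `np.vander` with `increasing=True` produces):

```python
import numpy as np

# Fit a degree-2 polynomial (p = 3) by least squares to 100 synthetic samples.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.normal(size=100)

# Column j of A holds x_i^j, so the columns are 1, x, x^2.
A = np.vander(x, 3, increasing=True)
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
mse = np.mean((A @ theta - y) ** 2)
```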
Piecewise-affine function

- define knot points $a_1 < a_2 < \cdots < a_k$ on the real axis
- a piecewise-affine function is continuous, and affine on each interval between consecutive knot points
- a piecewise-affine function with knot points $a_1, \ldots, a_k$ can be written as $f(x) = \theta_1 + \theta_2 x + \theta_3 (x - a_1)_+ + \cdots + \theta_{k+2} (x - a_k)_+$ where $u_+ = \max\{u, 0\}$

(figure: the basis functions $(x+1)_+$ and $(x-1)_+$)

Least squares data fitting 9-13
Piecewise-affine function fitting

the piecewise-affine model is linear in the parameters $\theta$, with basis functions $f_1(x) = 1$, $f_2(x) = x$, $f_3(x) = (x - a_1)_+$, ..., $f_{k+2}(x) = (x - a_k)_+$

Example: fit a piecewise-affine function with knots $a_1 = -1$, $a_2 = 1$ to 100 points

(figure: the fitted function $\hat f(x)$ and the data points)

Least squares data fitting 9-14
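A sketch of this fit on synthetic data (the true function and noise level are assumptions for illustration):

```python
import numpy as np

# Piecewise-affine least squares fit with knots a1 = -1, a2 = 1.
rng = np.random.default_rng(2)
x = rng.uniform(-2.5, 2.5, size=100)
f_true = lambda u: 1.0 - 0.5 * u + 2.0 * np.maximum(u + 1, 0) - 3.0 * np.maximum(u - 1, 0)
y = f_true(x) + 0.05 * rng.normal(size=100)

# Basis functions: 1, x, (x - a1)_+, (x - a2)_+.
knots = [-1.0, 1.0]
A = np.column_stack([np.ones_like(x), x] + [np.maximum(x - a, 0) for a in knots])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
```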
Outline

- model fitting
- regression
- linear-in-parameters models
- time series examples
- validation
- least squares classification
- statistics interpretation
Time series trend

- N data samples from a time series: $y_i$ is the value at time i, for $i = 1, \ldots, N$
- the straight-line fit $\hat y_i = \theta_1 + \theta_2 i$ is the trend line
- the difference $y - \hat y$ is the de-trended time series
- least squares fitting of the trend line: minimize $\|A\theta - y\|^2$ with

$A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & N \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$

Least squares data fitting 9-15
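A minimal sketch of the trend-line fit on a synthetic series (slope and noise are illustrative assumptions):

```python
import numpy as np

# Trend line: y_hat_i = theta1 + theta2 * i.
rng = np.random.default_rng(3)
N = 200
t = np.arange(1, N + 1)
y = 10.0 + 0.3 * t + rng.normal(size=N)

A = np.column_stack([np.ones(N), t])   # columns: ones and the time index
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
detrended = y - A @ theta              # the de-trended time series
```

Because the ones column is in the span of A, the de-trended series has (numerically) zero mean.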
Example: world petroleum consumption

time series of world petroleum consumption (million barrels/day) versus year

(figures: left, data samples and trend line; right, de-trended time series)

Least squares data fitting 9-16
Trend plus seasonal component

model the time series as a linear trend plus a periodic component with period P: $\hat y = \hat y^{\text{trend}} + \hat y^{\text{seas}}$ with $\hat y^{\text{trend}} = \theta_1 (1, 2, \ldots, N)$ and $\hat y^{\text{seas}} = (\theta_2, \theta_3, \ldots, \theta_{P+1},\; \theta_2, \theta_3, \ldots, \theta_{P+1},\; \ldots,\; \theta_2, \theta_3, \ldots, \theta_{P+1})$

- the mean of $\hat y^{\text{seas}}$ serves as a constant offset
- the difference $y - \hat y$ is the de-trended, seasonally adjusted time series
- least squares formulation: minimize $\|A\theta - y\|^2$ with

$A_{1:N,1} = \begin{bmatrix} 1 \\ 2 \\ \vdots \\ N \end{bmatrix}, \quad A_{1:N,2:P+1} = \begin{bmatrix} I_P \\ I_P \\ \vdots \\ I_P \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

Least squares data fitting 9-17
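A sketch of the trend-plus-seasonal fit; here N is assumed to be a multiple of P, and the data are synthetic:

```python
import numpy as np

# Trend plus seasonal component with period P = 12.
rng = np.random.default_rng(4)
P, cycles = 12, 10
N = P * cycles
t = np.arange(1, N + 1)
true_seasonal = np.sin(2 * np.pi * np.arange(P) / P)
y = 0.05 * t + np.tile(true_seasonal, cycles) + 0.1 * rng.normal(size=N)

# A = [trend column | stacked P x P identity blocks], as on the slide.
A = np.column_stack([t, np.tile(np.eye(P), (cycles, 1))])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
trend_slope, seas = theta[0], theta[1:]
```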
Example: vehicle miles traveled in the US, per month

(figures: monthly miles traveled (millions); the data, and the least squares fit of a linear trend plus seasonal (12-month) component)

Least squares data fitting 9-18
Auto-regressive (AR) time series model

$\hat z_{t+1} = \beta_1 z_t + \cdots + \beta_M z_{t-M+1}, \quad t = M, M+1, \ldots$

- $z_1, z_2, \ldots$ is a time series
- $\hat z_{t+1}$ is a prediction of $z_{t+1}$, made at time t
- the prediction $\hat z_{t+1}$ is a linear function of the previous M values $z_t, \ldots, z_{t-M+1}$
- M is the memory of the model

Least squares fitting of the AR model: given observed data $z_1, \ldots, z_T$, minimize $(\hat z_{M+1} - z_{M+1})^2 + (\hat z_{M+2} - z_{M+2})^2 + \cdots + (\hat z_T - z_T)^2$

this is a least squares problem: minimize $\|A\beta - y\|^2$ with

$A = \begin{bmatrix} z_M & z_{M-1} & \cdots & z_1 \\ z_{M+1} & z_M & \cdots & z_2 \\ \vdots & \vdots & & \vdots \\ z_{T-1} & z_{T-2} & \cdots & z_{T-M} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_M \end{bmatrix}, \quad y = \begin{bmatrix} z_{M+1} \\ z_{M+2} \\ \vdots \\ z_T \end{bmatrix}$

Least squares data fitting 9-19
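A sketch of AR model fitting on a simulated series (the AR(2) coefficients below are assumptions used to generate test data):

```python
import numpy as np

# Least squares fit of an AR model with memory M to observed data z_1, ..., z_T.
rng = np.random.default_rng(5)
T, M = 2000, 2
z = np.zeros(T)
for t in range(2, T):                        # simulate a stable AR(2) process
    z[t] = 0.6 * z[t - 1] - 0.2 * z[t - 2] + 0.1 * rng.normal()

# Column j of A holds z_{t-1-j} for target z_t; y holds z_{M+1}, ..., z_T.
A = np.column_stack([z[M - 1 - j : T - 1 - j] for j in range(M)])
y = z[M:]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```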
Example: hourly temperature at LAX

- AR model of memory M = 8
- the model was fit on a time series of length T = 744 (May 1–31, 2016)
- (figure: temperature (°F) versus t; the plot shows the first five days)

Least squares data fitting 9-20
Outline

- model fitting
- regression
- linear-in-parameters models
- time series examples
- validation
- least squares classification
- statistics interpretation
Generalization and validation

Generalization ability: the ability of a model to predict outcomes for new data

Model validation: to assess generalization ability, divide the data into two sets: a training set and a test (or validation) set

- use the training set to fit the model
- use the test set to get an idea of the generalization ability
- this is also called out-of-sample validation

Over-fit model

- a model with low prediction error on the training set but bad generalization ability
- the prediction error on the training set is much smaller than on the test set

Least squares data fitting 9-21
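A sketch of out-of-sample validation (synthetic data and a hypothetical degree-5 polynomial model):

```python
import numpy as np

# Fit on a training set, evaluate RMS error on a held-out test set.
rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + 0.1 * rng.normal(size=200)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

def fit_poly(xs, ys, degree):
    # Least squares polynomial fit via the Vandermonde matrix.
    return np.linalg.lstsq(np.vander(xs, degree + 1), ys, rcond=None)[0]

def rms(xs, ys, coef):
    return np.sqrt(np.mean((np.vander(xs, len(coef)) @ coef - ys) ** 2))

coef = fit_poly(x_train, y_train, 5)
rms_train, rms_test = rms(x_train, y_train, coef), rms(x_test, y_test, coef)
```

Comparable train and test RMS errors suggest the model generalizes; a much larger test error signals over-fitting.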
Example: polynomial fitting

(figure: relative RMS error on the training and test sets versus polynomial degree)

- the training set is the data set of 100 points used on page 9-11
- the test set is a similar set of 100 points
- the plot suggests using degree 6

Least squares data fitting 9-22
Over-fitting

(figures: polynomial of degree 20 evaluated on the training set and on the test set)

over-fitting is evident at the left end of the interval

Least squares data fitting 9-23
Cross-validation

an extension of out-of-sample validation

- divide the data into K sets (folds); typical values are K = 5, K = 10
- for i = 1 to K, fit model i using fold i as the test set and the other data as the training set
- compare the parameters and the train/test RMS errors of the K models

House price model (page 9-7) with 5 folds (155 or 154 samples each)

(table: for each fold, the model parameters v, $\beta_1, \ldots, \beta_7$ and the train/test RMS errors)

Least squares data fitting 9-24
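A 5-fold cross-validation sketch on a synthetic regression problem (data and model are illustrative assumptions):

```python
import numpy as np

# K-fold cross-validation of a least squares regression model.
rng = np.random.default_rng(7)
N, K = 200, 5
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 2.0 + 0.1 * rng.normal(size=N)
A = np.column_stack([np.ones(N), X])

folds = np.array_split(np.arange(N), K)
test_rms = []
for i in range(K):
    test_idx = folds[i]
    train_idx = np.setdiff1d(np.arange(N), test_idx)
    # Fit on the other folds, evaluate on fold i.
    theta = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)[0]
    r = A[test_idx] @ theta - y[test_idx]
    test_rms.append(np.sqrt(np.mean(r ** 2)))
```

Similar test RMS errors across the K folds indicate the fitted parameters are stable.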
Outline

- model fitting
- regression
- linear-in-parameters models
- time series examples
- validation
- least squares classification
- statistics interpretation
Boolean (two-way) classification

- a data fitting problem where the outcome y can take two values, +1 and −1
- the values of y represent two categories (true/false, spam/not spam, ...)
- the model $\hat y = \hat f(x)$ is called a Boolean classifier

Least squares classifier

- use least squares to fit a model $\tilde f(x)$ to the training data set $(x_1, y_1), \ldots, (x_N, y_N)$
- $\tilde f(x)$ can be a regression model $\tilde f(x) = x^T\beta + v$ or linear in parameters $\tilde f(x) = \theta_1 f_1(x) + \cdots + \theta_p f_p(x)$
- take the sign of $\tilde f(x)$ to get a Boolean classifier: $\hat f(x) = \mathrm{sign}(\tilde f(x)) = \begin{cases} +1 & \text{if } \tilde f(x) \ge 0 \\ -1 & \text{if } \tilde f(x) < 0 \end{cases}$

Least squares data fitting 9-25
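A sketch of the least squares classifier on synthetic, nearly separable data (the label rule is an assumption for illustration):

```python
import numpy as np

# Fit x^T beta + v to +/-1 labels by least squares, then take the sign.
rng = np.random.default_rng(8)
N = 400
x = rng.normal(size=(N, 2))
y = np.where(x[:, 0] + x[:, 1] + 0.1 * rng.normal(size=N) >= 0, 1.0, -1.0)

A = np.column_stack([np.ones(N), x])
theta = np.linalg.lstsq(A, y, rcond=None)[0]
y_hat = np.sign(A @ theta)
y_hat[y_hat == 0] = 1.0            # map sign(0) to +1, matching the slide
error_rate = np.mean(y_hat != y)
```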
Example: handwritten digit classification

- MNIST data set used in homework
- images of handwritten digits ($n = 28^2 = 784$ pixels)
- data set contains 60,000 training examples and 10,000 test examples
- we only use the 493 pixels that are nonzero in at least 600 training examples
- the Boolean classifier distinguishes the digit zero (y = +1) from the other digits (y = −1)

Least squares data fitting 9-26
Classifier with basic regression model

$\hat f(x) = \mathrm{sign}(\tilde f(x)) = \mathrm{sign}(x^T\beta + v)$

- x is the vector of 493 pixel intensities
- (figure: distribution of $\tilde f(x_i) = x_i^T\hat\beta + \hat v$ on the training data set, for the positive class (digit 0) and the negative class (digits 1–9))
- blue bars to the left of the dashed line are false negatives (misclassified zeros)
- red bars to the right of the dashed line are false positives (misclassified non-zeros)

Least squares data fitting 9-27
Prediction error

for each data point (x, y) we have four combinations of prediction and outcome:

- y = +1, $\hat y$ = +1: true positive; y = +1, $\hat y$ = −1: false negative
- y = −1, $\hat y$ = +1: false positive; y = −1, $\hat y$ = −1: true negative

a classifier can be evaluated by counting the data points for each combination

(tables: confusion matrices for the training data set of 60,000 examples and the test data set of 10,000 examples; the error rate is 1.6% on both)

Least squares data fitting 9-28
Classifier with additional nonlinear features

$\hat f(x) = \mathrm{sign}(\tilde f(x)) = \mathrm{sign}\left( \sum_{i=1}^p \theta_i f_i(x) \right)$

- the basis functions include a constant, the 493 elements of x, plus 5000 functions $f_i(x) = \max\{0, r_i^T x + s_i\}$ with randomly generated $r_i$, $s_i$
- (figure: distribution of $\tilde f(x_i)$ on the training data set, for the positive class (digit 0) and the negative class (digits 1–9))

Least squares data fitting 9-29
Prediction error

- training data set: error rate 0.21%
- test data set: error rate 0.24%

(tables: confusion matrices of predictions $\hat y = \pm 1$ versus outcomes $y = \pm 1$ for the two data sets)

Least squares data fitting 9-30
Multi-class classification

- a data fitting problem where the outcome y can take values $1, \ldots, K$
- the values of y represent K labels or categories
- the multi-class classifier $\hat y = \hat f(x)$ maps x to an element of $\{1, 2, \ldots, K\}$

Least squares multi-class classifier

- for $k = 1, \ldots, K$, compute a Boolean classifier to distinguish class k from not-k: $\hat f_k(x) = \mathrm{sign}(\tilde f_k(x))$
- define the multi-class classifier as $\hat f(x) = \operatorname{argmax}_{k=1,\ldots,K} \tilde f_k(x)$

Least squares data fitting 9-31
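A one-vs-rest sketch with K = 3 synthetic, well-separated classes (centers and noise level are assumptions):

```python
import numpy as np

# Least squares multi-class classifier: one Boolean classifier per class,
# final prediction by argmax of the continuous scores f_k(x).
rng = np.random.default_rng(9)
K, per = 3, 100
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(per, 2)) for c in centers])
labels = np.repeat(np.arange(K), per)
A = np.column_stack([np.ones(K * per), X])

# Fit class-k-vs-rest with +1/-1 targets; stack the K parameter vectors.
Theta = np.column_stack([
    np.linalg.lstsq(A, np.where(labels == k, 1.0, -1.0), rcond=None)[0]
    for k in range(K)
])
pred = np.argmax(A @ Theta, axis=1)
error_rate = np.mean(pred != labels)
```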
Example: handwritten digit classification

we compute a least squares Boolean classifier for each digit versus the rest: $\hat f_k(x) = \mathrm{sign}(x^T\beta_k + v_k)$, $k = 1, \ldots, K$

(table: confusion matrix of predicted versus true digits on the test set; the error rate is 13.9%)

Least squares data fitting 9-32
Example: handwritten digit classification

the ten least squares Boolean classifiers use the 5000 new features (page 9-29)

(table: confusion matrix of predicted versus true digits on the test set; the error rate is 2.6%)

Least squares data fitting 9-33
Outline

- model fitting
- regression
- linear-in-parameters models
- time series examples
- validation
- least squares classification
- statistics interpretation
Linear regression model

$y = X\beta + \epsilon$

- $\beta$ is a (non-random) p-vector of unknown parameters
- X is n × p (the data matrix or design matrix, i.e., the result of the experiment design)
- if there is an offset v, we include it in $\beta$ and add a column of ones to X
- $\epsilon$ is a random n-vector (random error or disturbance)
- y is an observable random n-vector
- this notation differs from the previous sections but is common in statistics

we discuss methods for estimating the parameters $\beta$ from observations of y

Least squares data fitting 9-34
Assumptions

- X is tall (n > p) with linearly independent columns
- the random disturbances $\epsilon_i$ have zero mean: $\mathbf{E}\,\epsilon_i = 0$ for $i = 1, \ldots, n$
- the random disturbances have equal variances $\sigma^2$: $\mathbf{E}\,\epsilon_i^2 = \sigma^2$ for $i = 1, \ldots, n$
- the random disturbances are uncorrelated (have zero covariances): $\mathbf{E}(\epsilon_i \epsilon_j) = 0$ for $i, j = 1, \ldots, n$ and $i \ne j$

the last three assumptions can be combined using matrix and vector notation: $\mathbf{E}\,\epsilon = 0$, $\mathbf{E}\,\epsilon\epsilon^T = \sigma^2 I$

Least squares data fitting 9-35
Least squares estimator

the least squares estimate $\hat\beta$ of the parameters $\beta$, given the observations y, is $\hat\beta = X^\dagger y = (X^T X)^{-1} X^T y$

(figure: $y = X\beta + \epsilon$, its orthogonal projection $X\hat\beta$ on range(X), and the residual $e = y - X\hat\beta$)

- $X\hat\beta$ is the orthogonal projection of y on range(X)
- the residual $e = y - X\hat\beta$ is an (observable) random variable

Least squares data fitting 9-36
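A sketch on synthetic data: the normal-equations solution $(X^TX)^{-1}X^Ty$ agrees with NumPy's QR-based solver when X has independent columns, and the residual is orthogonal to range(X):

```python
import numpy as np

# Least squares estimate beta_hat = (X^T X)^{-1} X^T y on synthetic data.
rng = np.random.default_rng(10)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # via the normal equations
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # QR-based solver
residual = y - X @ beta_hat                          # e = y - X beta_hat
```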
Mean and covariance of least squares estimate

$\hat\beta = X^\dagger (X\beta + \epsilon) = \beta + X^\dagger \epsilon$

- the least squares estimator is unbiased: $\mathbf{E}\,\hat\beta = \beta$
- the covariance matrix of the least squares estimate is $\mathbf{E}\,(\hat\beta - \beta)(\hat\beta - \beta)^T = \mathbf{E}\left( (X^\dagger\epsilon)(X^\dagger\epsilon)^T \right) = \mathbf{E}\left( (X^TX)^{-1} X^T \epsilon\epsilon^T X (X^TX)^{-1} \right) = \sigma^2 (X^TX)^{-1}$
- the covariance of $\hat\beta_i$ and $\hat\beta_j$ ($i \ne j$) is $\mathbf{E}\left( (\hat\beta_i - \beta_i)(\hat\beta_j - \beta_j) \right) = \sigma^2 \left( (X^TX)^{-1} \right)_{ij}$; for i = j, this is the variance of $\hat\beta_i$

Least squares data fitting 9-37
Estimate of $\sigma^2$

- $\mathbf{E}\,\|\epsilon\|^2 = n\sigma^2$
- $\mathbf{E}\,\|e\|^2 = (n - p)\sigma^2$
- $\mathbf{E}\,\|X(\hat\beta - \beta)\|^2 = p\sigma^2$ (proof on next page)

define the estimate $\hat\sigma$ of $\sigma$ as $\hat\sigma = \|e\| / \sqrt{n - p}$

$\hat\sigma^2$ is an unbiased estimate of $\sigma^2$: $\mathbf{E}\,\hat\sigma^2 = \frac{1}{n-p}\,\mathbf{E}\,\|e\|^2 = \sigma^2$

Least squares data fitting 9-38
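A Monte Carlo sketch of the unbiasedness claim: averaging $\hat\sigma^2 = \|e\|^2/(n-p)$ over many simulated disturbances should approach $\sigma^2$ (all problem dimensions below are illustrative assumptions):

```python
import numpy as np

# Check empirically that sigma_hat^2 = ||e||^2 / (n - p) is unbiased.
rng = np.random.default_rng(11)
n, p, sigma = 30, 4, 0.5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

estimates = []
for _ in range(5000):
    eps = sigma * rng.normal(size=n)
    y = X @ beta + eps
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_hat
    estimates.append(e @ e / (n - p))     # one realization of sigma_hat^2
sigma2_hat_mean = np.mean(estimates)
```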
Proof

the first expression is immediate: $\mathbf{E}\,\|\epsilon\|^2 = \sum_{i=1}^n \mathbf{E}\,\epsilon_i^2 = n\sigma^2$

to show that $\mathbf{E}\,\|X(\hat\beta - \beta)\|^2 = p\sigma^2$, first note that $X(\hat\beta - \beta) = XX^\dagger y - X\beta = XX^\dagger (X\beta + \epsilon) - X\beta = XX^\dagger \epsilon = X(X^TX)^{-1}X^T\epsilon$

on line 3 we used $X^\dagger X = I$ (however, note that $XX^\dagger \ne I$ if X is tall!)

the squared norm of $X(\hat\beta - \beta)$ is $\|X(\hat\beta - \beta)\|^2 = \epsilon^T (XX^\dagger)^2 \epsilon = \epsilon^T XX^\dagger \epsilon$; the first step uses symmetry of $XX^\dagger$; the second step, $X^\dagger X = I$

Least squares data fitting 9-39
the expected value of the squared norm is

$\mathbf{E}\,\|X(\hat\beta - \beta)\|^2 = \mathbf{E}\left( \epsilon^T XX^\dagger \epsilon \right) = \sum_{i,j} \mathbf{E}(\epsilon_i\epsilon_j)(XX^\dagger)_{ij} = \sigma^2 \sum_{i=1}^n (XX^\dagger)_{ii} = \sigma^2 \sum_{i=1}^n \sum_{j=1}^p X_{ij} (X^\dagger)_{ji} = \sigma^2 \sum_{j=1}^p (X^\dagger X)_{jj} = p\sigma^2$

the expression $\mathbf{E}\,\|e\|^2 = (n-p)\sigma^2$ on page 9-38 now follows from $\|\epsilon\|^2 = \|e + X\hat\beta - X\beta\|^2 = \|e\|^2 + \|X(\hat\beta - \beta)\|^2$

Least squares data fitting 9-40
Linear estimator

the linear regression model (p. 9-34), with the same assumptions as before (p. 9-35): $y = X\beta + \epsilon$

- a linear estimator of $\beta$ maps the observations y to the estimate $\hat\beta = By$
- the estimator is defined by the p × n matrix B
- the least squares estimator is an example, with $B = X^\dagger$

Least squares data fitting 9-41
Unbiased linear estimator

if B is a left inverse of X, then the estimator $\hat\beta = By$ can be written as $\hat\beta = By = B(X\beta + \epsilon) = \beta + B\epsilon$

- this shows that the linear estimator is unbiased ($\mathbf{E}\,\hat\beta = \beta$) if $BX = I$
- the covariance matrix of an unbiased linear estimator is $\mathbf{E}\left( (\hat\beta - \beta)(\hat\beta - \beta)^T \right) = \mathbf{E}\left( B\epsilon\epsilon^T B^T \right) = \sigma^2 BB^T$
- if c is a (non-random) p-vector, then the estimate $c^T\hat\beta$ of $c^T\beta$ has variance $\mathbf{E}\,(c^T\hat\beta - c^T\beta)^2 = \sigma^2 c^T BB^T c$
- the least squares estimator is an example, with $B = X^\dagger$ and $BB^T = (X^TX)^{-1}$

Least squares data fitting 9-42
Best linear unbiased estimator

if B is a left inverse of X, then for all p-vectors c, $c^T BB^T c \ge c^T (X^TX)^{-1} c$ (proof on next page)

- the left-hand side gives the variance of $c^T\hat\beta$ for the linear unbiased estimator $\hat\beta = By$
- the right-hand side gives the variance of $c^T\hat\beta_{\text{ls}}$ for the least squares estimator $\hat\beta_{\text{ls}} = X^\dagger y$
- the least squares estimator is the best linear unbiased estimator (BLUE); this is known as the Gauss-Markov theorem

Least squares data fitting 9-43
Proof

use $BX = I$ to write $BB^T$ as $BB^T = (B - (X^TX)^{-1}X^T)(B^T - X(X^TX)^{-1}) + (X^TX)^{-1} = (B - X^\dagger)(B - X^\dagger)^T + (X^TX)^{-1}$

hence $c^T BB^T c = c^T (B - X^\dagger)(B - X^\dagger)^T c + c^T (X^TX)^{-1} c = \|(B - X^\dagger)^T c\|^2 + c^T (X^TX)^{-1} c \ge c^T (X^TX)^{-1} c$

with equality if $B = X^\dagger$

Least squares data fitting 9-44
More informationECON 5350 Class Notes Functional Form and Structural Change
ECON 5350 Class Notes Functional Form and Structural Change 1 Introduction Although OLS is considered a linear estimator, it does not mean that the relationship between Y and X needs to be linear. In this
More informationLinear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation
More informationRegression Models - Introduction
Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent
More informationThe outline for Unit 3
The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Merlise Clyde STA721 Linear Models Duke University August 31, 2017 Outline Topics Likelihood Function Projections Maximum Likelihood Estimates Readings: Christensen Chapter
More informationThe linear model is the most fundamental of all serious statistical models encompassing:
Linear Regression Models: A Bayesian perspective Ingredients of a linear model include an n 1 response vector y = (y 1,..., y n ) T and an n p design matrix (e.g. including regressors) X = [x 1,..., x
More informationWeighted Least Squares
Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w
More informationThe regression model with one fixed regressor cont d
The regression model with one fixed regressor cont d 3150/4150 Lecture 4 Ragnar Nymoen 27 January 2012 The model with transformed variables Regression with transformed variables I References HGL Ch 2.8
More informationSupport Vector Machines
Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two
More informationLecture 6: Linear Regression (continued)
Lecture 6: Linear Regression (continued) Reading: Sections 3.1-3.3 STATS 202: Data mining and analysis October 6, 2017 1 / 23 Multiple linear regression Y = β 0 + β 1 X 1 + + β p X p + ε Y ε N (0, σ) i.i.d.
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your
More informationThe Simple Regression Model. Part II. The Simple Regression Model
Part II The Simple Regression Model As of Sep 22, 2015 Definition 1 The Simple Regression Model Definition Estimation of the model, OLS OLS Statistics Algebraic properties Goodness-of-Fit, the R-square
More informationBusiness Statistics. Tommaso Proietti. Model Evaluation and Selection. DEF - Università di Roma 'Tor Vergata'
Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Model Evaluation and Selection Predictive Ability of a Model: Denition and Estimation We aim at achieving a balance between parsimony
More informationLinear Methods for Prediction
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More information4 Multiple Linear Regression
4 Multiple Linear Regression 4. The Model Definition 4.. random variable Y fits a Multiple Linear Regression Model, iff there exist β, β,..., β k R so that for all (x, x 2,..., x k ) R k where ε N (, σ
More informationFIRST MIDTERM EXAM ECON 7801 SPRING 2001
FIRST MIDTERM EXAM ECON 780 SPRING 200 ECONOMICS DEPARTMENT, UNIVERSITY OF UTAH Problem 2 points Let y be a n-vector (It may be a vector of observations of a random variable y, but it does not matter how
More informationLinear Regression Communication, skills, and understanding Calculator Use
Linear Regression Communication, skills, and understanding Title, scale and label the horizontal and vertical axes Comment on the direction, shape (form), and strength of the relationship and unusual features
More informationPeter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8
Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall
More informationCategorical Predictor Variables
Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively
More informationSection 3: Simple Linear Regression
Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction
More informationStatistical Methods for SVM
Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,
More informationEstimation of the Response Mean. Copyright c 2012 Dan Nettleton (Iowa State University) Statistics / 27
Estimation of the Response Mean Copyright c 202 Dan Nettleton (Iowa State University) Statistics 5 / 27 The Gauss-Markov Linear Model y = Xβ + ɛ y is an n random vector of responses. X is an n p matrix
More informationChapter 1. Linear Regression with One Predictor Variable
Chapter 1. Linear Regression with One Predictor Variable 1.1 Statistical Relation Between Two Variables To motivate statistical relationships, let us consider a mathematical relation between two mathematical
More informationMS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari
MS&E 226: Small Data Lecture 11: Maximum likelihood (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 18 The likelihood function 2 / 18 Estimating the parameter This lecture develops the methodology behind
More informationStatistics 135 Fall 2008 Final Exam
Name: SID: Statistics 135 Fall 2008 Final Exam Show your work. The number of points each question is worth is shown at the beginning of the question. There are 10 problems. 1. [2] The normal equations
More informationWISE International Masters
WISE International Masters ECONOMETRICS Instructor: Brett Graham INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This examination paper contains 32 questions. You are
More information6.867 Machine learning: lecture 2. Tommi S. Jaakkola MIT CSAIL
6.867 Machine learning: lecture 2 Tommi S. Jaakkola MIT CSAIL tommi@csail.mit.edu Topics The learning problem hypothesis class, estimation algorithm loss and estimation criterion sampling, empirical and
More informationThis model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that
Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear
More information1 Introduction to Generalized Least Squares
ECONOMICS 7344, Spring 2017 Bent E. Sørensen April 12, 2017 1 Introduction to Generalized Least Squares Consider the model Y = Xβ + ɛ, where the N K matrix of regressors X is fixed, independent of the
More informationPanel Data Models. James L. Powell Department of Economics University of California, Berkeley
Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel
More informationMultiple Linear Regression
Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from
More informationStatement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.
MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss
More information4. Nonlinear regression functions
4. Nonlinear regression functions Up to now: Population regression function was assumed to be linear The slope(s) of the population regression function is (are) constant The effect on Y of a unit-change
More informationBivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data.
Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January
More informationWeighted Least Squares
Weighted Least Squares The standard linear model assumes that Var(ε i ) = σ 2 for i = 1,..., n. As we have seen, however, there are instances where Var(Y X = x i ) = Var(ε i ) = σ2 w i. Here w 1,..., w
More informationPractical Econometrics. for. Finance and Economics. (Econometrics 2)
Practical Econometrics for Finance and Economics (Econometrics 2) Seppo Pynnönen and Bernd Pape Department of Mathematics and Statistics, University of Vaasa 1. Introduction 1.1 Econometrics Econometrics
More informationSTAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression
STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test
More informationInverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1
Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is
More information