Least Squares Data Fitting


1 Least Squares Data Fitting
Stephen Boyd
EE103 Stanford University
October 31, 2017

2 Outline
Least squares model fitting
Validation
Feature engineering

3 Setup
we believe a scalar y and an n-vector x are related by the model y ≈ f(x)
x is called the independent variable
y is called the outcome or response variable
f : R^n → R gives the relation between x and y
often x is a feature vector, and y is something we want to predict
we don't know f, which gives the true relationship between x and y

4 Data
we are given some data x^(1), ..., x^(N), y^(1), ..., y^(N)
also called observations, examples, samples, or measurements
(x^(i), y^(i)) is the ith data pair
x^(i)_j is the jth component of the ith data point x^(i)

5 Model
choose model ˆf : R^n → R, a guess or approximation of f
linear in the parameters model form: ˆf(x) = θ_1 f_1(x) + ··· + θ_p f_p(x)
f_i : R^n → R are basis functions that we choose
θ_i are model parameters that we choose
ŷ^(i) = ˆf(x^(i)) is (the model's) prediction of y^(i)
we'd like ŷ^(i) ≈ y^(i), i.e., model is consistent with observed data

6 Least squares data fitting
prediction error or residual is r^(i) = y^(i) - ŷ^(i)
least squares data fitting: choose model parameters θ_i to minimize RMS prediction error on data set,
( ((r^(1))^2 + ··· + (r^(N))^2) / N )^{1/2}
this can be formulated (and solved) as a least squares problem

7 Least squares data fitting
express y^(i), ŷ^(i), and r^(i) as N-vectors:
y_d = (y^(1), ..., y^(N)) is vector of outcomes
ŷ_d = (ŷ^(1), ..., ŷ^(N)) is vector of predictions
r_d = (r^(1), ..., r^(N)) is vector of residuals
rms(r_d) is RMS prediction error
define N × p matrix A with A_ij = f_j(x^(i)), so ŷ_d = Aθ
least squares data fitting: choose θ to minimize
||r_d||^2 = ||y_d - ŷ_d||^2 = ||y_d - Aθ||^2 = ||Aθ - y_d||^2
ˆθ = (A^T A)^{-1} A^T y_d (if columns of A are independent)
||Aˆθ - y_d||^2 / N is minimum mean-square (fitting) error
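as a concrete illustration, here is a minimal NumPy sketch of the procedure above, on hypothetical data with hand-picked basis functions (the data, seed, and basis choices are illustrative, not from the slides):

```python
import numpy as np

# hypothetical data: N pairs (x^(i), y^(i)) with scalar x
np.random.seed(0)
N = 100
x = np.random.uniform(-1, 1, N)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * np.random.randn(N)

# basis functions we choose (here: 1, x, x^2)
basis = [lambda u: np.ones_like(u), lambda u: u, lambda u: u**2]

# N x p matrix A with A_ij = f_j(x^(i))
A = np.column_stack([f(x) for f in basis])

# least squares fit: minimize ||A theta - y_d||^2
# (lstsq is numerically preferable to forming (A^T A)^{-1} A^T y_d)
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

yhat = A @ theta                              # predictions ŷ_d
rms_error = np.sqrt(np.mean((y - yhat)**2))   # RMS prediction error
```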

8 Fitting a constant model
simplest possible model: p = 1, f_1(x) = 1, so model ˆf(x) = θ_1 is a constant function
A = 1 (the N-vector of ones), so
ˆθ_1 = (1^T 1)^{-1} 1^T y_d = (1/N) 1^T y_d = avg(y_d)
the mean of y^(1), ..., y^(N) is the least squares fit by a constant
MMSE is std(y_d)^2; RMS error is std(y_d)
more sophisticated models are judged against the constant model
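a quick numerical check of these facts, reusing y from the sketch above:

```python
# constant model: A is the N-vector of ones
A1 = np.ones((N, 1))
theta1, *_ = np.linalg.lstsq(A1, y, rcond=None)

assert np.isclose(theta1[0], y.mean())   # least squares constant fit = mean
rms = np.sqrt(np.mean((y - theta1[0])**2))
assert np.isclose(rms, y.std())          # RMS error = std(y_d)
```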

9 Fitting univariate functions
when n = 1, we seek to approximate a function f : R → R
we can plot the data (x^(i), y^(i)) and the model function ŷ = ˆf(x)

10 Straight-line fit
p = 2, with f_1(x) = 1, f_2(x) = x
model has form ˆf(x) = θ_1 + θ_2 x
matrix A has form
A = [ 1  x^(1)
      1  x^(2)
      ...
      1  x^(N) ]
can work out ˆθ_1 and ˆθ_2 explicitly:
ˆf(x) = avg(y_d) + ρ (std(y_d)/std(x_d)) (x - avg(x_d))
where x_d = (x^(1), ..., x^(N)) and ρ is the correlation coefficient of x_d and y_d
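a short check (same hypothetical data as before) that the closed-form expressions match the generic least squares solution:

```python
# straight-line fit via generic least squares
A2 = np.column_stack([np.ones(N), x])
theta2, *_ = np.linalg.lstsq(A2, y, rcond=None)

# closed form: slope = rho * std(y)/std(x), intercept from the means
rho = np.corrcoef(x, y)[0, 1]
slope = rho * y.std() / x.std()
intercept = y.mean() - slope * x.mean()

assert np.allclose([intercept, slope], theta2)
```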

11 Example
[plot: data points and straight-line fit ˆf(x) versus x]

12 Asset α and β
x is return of whole market, y is return of a particular asset
write straight-line model as ŷ = (r_rf + α) + β(x - μ_mkt)
μ_mkt is the average market return
r_rf is the risk-free interest rate
several other slightly different definitions are used
called asset α and β; widely used

13 Time series trend
y^(i) is value of quantity at time x^(i) = i
ŷ^(i) = ˆθ_1 + ˆθ_2 i, i = 1, ..., N, is called the trend line
y_d - ŷ_d is called the de-trended time series
ˆθ_2 is the trend coefficient

14 World petroleum consumption
[plot: petroleum consumption (million barrels per day) versus year]

15 World petroleum consumption, de-trended
[plot: de-trended petroleum consumption (million barrels per day) versus year]

16 Polynomial fit
f_i(x) = x^{i-1}, i = 1, ..., p
model is a polynomial of degree less than p:
ˆf(x) = θ_1 + θ_2 x + ··· + θ_p x^{p-1}
(here x^i means scalar x to the ith power; x^(i) is the ith data point)
A is the Vandermonde matrix
A = [ 1  x^(1)  ···  (x^(1))^{p-1}
      1  x^(2)  ···  (x^(2))^{p-1}
      ...
      1  x^(N)  ···  (x^(N))^{p-1} ]
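a sketch of the polynomial fit using NumPy's Vandermonde constructor (degree p - 1 = 3 is an arbitrary choice here, on the same hypothetical data):

```python
p = 4  # number of coefficients, so polynomial degree is p - 1

# Vandermonde matrix: row i is (1, x^(i), ..., (x^(i))^{p-1})
A_poly = np.vander(x, p, increasing=True)
theta_poly, *_ = np.linalg.lstsq(A_poly, y, rcond=None)

# evaluate the fitted polynomial on a grid (e.g., for plotting)
xg = np.linspace(x.min(), x.max(), 200)
yg = np.vander(xg, p, increasing=True) @ theta_poly
```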

17 Example
N = 100 data points
[four plots: polynomial fits ˆf(x) of degree 2, 6, 10, and 15]

18 Regression as general data fitting
regression model is affine function ŷ = ˆf(x) = x^T β + v
fits general fitting form with basis functions f_1(x) = 1, f_i(x) = x_{i-1}, i = 2, ..., n + 1
so model is ŷ = θ_1 + θ_2 x_1 + ··· + θ_{n+1} x_n = x^T θ_{2:n+1} + θ_1
β = θ_{2:n+1}, v = θ_1

19 General data fitting as regression
general fitting model ˆf(x) = θ_1 f_1(x) + ··· + θ_p f_p(x)
common assumption: f_1(x) = 1
same as regression model ˆf(x̃) = x̃^T β + v, where
x̃ = (f_2(x), ..., f_p(x)) are the transformed features
v = θ_1, β = θ_{2:p}

20 Auto-regressive time series model
time series z_1, z_2, ...
auto-regressive (AR) prediction model:
ẑ_{t+1} = θ_1 z_t + ··· + θ_M z_{t-M+1}, t = M, M + 1, ...
M is the memory of the model
ẑ_{t+1} is prediction of next value, based on previous M values
we'll choose θ to minimize sum of squares of prediction errors,
(ẑ_{M+1} - z_{M+1})^2 + ··· + (ẑ_T - z_T)^2
put in general form with y^(i) = z_{M+i}, x^(i) = (z_{M+i-1}, ..., z_i), i = 1, ..., T - M
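a sketch of fitting an AR model by least squares on a hypothetical series z (a random walk stands in for real data such as the temperatures on the next slide):

```python
# hypothetical time series z_1, ..., z_T (0-indexed in code)
T = 500
z = 20.0 + np.cumsum(np.random.randn(T))

M = 8  # memory of the model

# row i holds x^(i) = (z_{M+i-1}, ..., z_i); target y^(i) = z_{M+i}
A_ar = np.column_stack([z[M - 1 - k : T - 1 - k] for k in range(M)])
y_ar = z[M:]

theta_ar, *_ = np.linalg.lstsq(A_ar, y_ar, rcond=None)
rms = np.sqrt(np.mean((A_ar @ theta_ar - y_ar)**2))
```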

21 Example
hourly temperature at LAX in May 2016, length 744
average is … °F, standard deviation 3.05 °F
predictor ẑ_{t+1} = z_t gives RMS error 1.16 °F
predictor ẑ_{t+1} = z_{t-23} gives RMS error 1.73 °F
AR model with M = 8 gives RMS error 0.98 °F

22 Example
[plot: temperature (°F) versus hour k; solid line shows one-hour ahead predictions from AR model, first 5 days]

23 Outline
Least squares model fitting
Validation
Feature engineering

24 Generalization
basic idea: goal of model is not to predict outcome for the given data
instead it is to predict the outcome on new, unseen data
a model that makes reasonable predictions on new, unseen data has generalization ability, or generalizes
a model that makes poor predictions on new, unseen data is said to suffer from over-fit

25 Validation
a simple and effective method to guess whether a model will generalize:
split original data into a training set and a test set
typical splits: 80%/20%, 90%/10%
build ('train') model on training data set
then check the model's predictions on the test data set
(can also compare RMS prediction error on train and test data)
if they are similar, we can guess the model will generalize
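a minimal sketch of such a split with plain NumPy, reusing the data matrix A and outcomes y from the earlier sketch:

```python
# random 80%/20% split of the data
perm = np.random.permutation(N)
n_train = int(0.8 * N)
train, test = perm[:n_train], perm[n_train:]

# train on the training set only
theta_tr, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

rms_train = np.sqrt(np.mean((A[train] @ theta_tr - y[train])**2))
rms_test = np.sqrt(np.mean((A[test] @ theta_tr - y[test])**2))
# similar rms_train and rms_test suggest the model generalizes
```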

26 Validation
can be used to choose among different candidate models, e.g.,
polynomials of different degrees
regression models with different sets of regressors
we'd use the one with low, or lowest, test error

27 Example
polynomials fit using training set of 100 points
plots below show performance on a test set of 100 points
[four plots: ˆf(x) of degree 2, 6, 10, and 15]

28 Example
suggests degree 4, 5, or 6 are reasonable choices
[plot: relative RMS error versus degree, on the data (training) set and the test set]

29 Cross validation
to carry out cross validation: divide data into 10 folds
for i = 1, ..., 10, build (train) model using all folds except i
test model on data in fold i
interpreting cross validation results:
if test RMS errors are much larger than train RMS errors, model is over-fit
if test and train RMS errors are similar and consistent, we can guess the model will have a similar RMS error on future data
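a sketch of 10-fold cross validation in the same NumPy style:

```python
# split shuffled indices into 10 folds
folds = np.array_split(np.random.permutation(N), 10)

for i, fold in enumerate(folds):
    train = np.setdiff1d(np.arange(N), fold)  # all folds except i
    th, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    rms_tr = np.sqrt(np.mean((A[train] @ th - y[train])**2))
    rms_te = np.sqrt(np.mean((A[fold] @ th - y[fold])**2))
    print(f"fold {i + 1}: train RMS {rms_tr:.3f}, test RMS {rms_te:.3f}")
```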

30 Example
house price, regression fit with x = (area/1000 ft², bedrooms)
774 sales, divided into 5 folds (four of 155 sales, one of 154)
fit 5 regression models, removing each fold
[table: model parameters v, β_1, β_2 and train/test RMS errors for each fold]

31 Outline
Least squares model fitting
Validation
Feature engineering

32 Feature engineering
start with original or base feature n-vector x
choose basis functions f_1, ..., f_p to create mapped feature p-vector (f_1(x), ..., f_p(x))
now fit linear in parameters model with mapped features:
ŷ = θ_1 f_1(x) + ··· + θ_p f_p(x)
check the model using validation

33 Transforming features
standardizing features: replace x_i with (x_i - b_i)/a_i
b_i is the mean value of the feature across the data
a_i is the standard deviation of the feature across the data
new features are called z-scores
log transform: if x_i is nonnegative and spans a wide range, replace it with log(1 + x_i)
hi and lo features: create new features given by max{x_i - b, 0}, min{x_i - a, 0}
(called hi and lo versions of original feature x_i)
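a sketch of these transforms on a hypothetical feature matrix X (rows are data points, columns are base features; the knot values a, b in the hi/lo features are made up):

```python
# hypothetical nonnegative base features, one per column
X = np.abs(np.random.randn(N, 3)) * np.array([1.0, 10.0, 1000.0])

# standardizing: z-scores, with b = per-feature mean, a = per-feature std
b = X.mean(axis=0)
a = X.std(axis=0)
Z = (X - b) / a

# log transform for a nonnegative feature spanning a wide range
x_log = np.log1p(X[:, 2])  # log(1 + x_i)

# hi and lo versions of a feature, with hypothetical knots a = 0.5, b = 2.0
x_hi = np.maximum(X[:, 0] - 2.0, 0)  # max{x_i - b, 0}
x_lo = np.minimum(X[:, 0] - 0.5, 0)  # min{x_i - a, 0}
```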

34 Example: house price prediction
start with base features:
x_1 is area of house (in 1000 ft²)
x_2 is number of bedrooms
x_3 is 1 for condo, 0 for house
x_4 is zip code of address (62 values)
we'll use p = 8 basis functions:
f_1(x) = 1, f_2(x) = x_1, f_3(x) = max{x_1 - 1.5, 0}
f_4(x) = x_2, f_5(x) = x_3
f_6(x), f_7(x), f_8(x) are Boolean functions of x_4 which encode 4 groups of nearby zip codes (i.e., neighborhood)
five-fold model validation

35 Example
[table: model parameters θ_1, ..., θ_8 and train/test RMS errors for each of the 5 folds]

More information