Least Squares. Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Winter UCSD


(Unweighted) Least Squares
Assume linearity in the unknown, deterministic model parameters $\theta$. Scalar, additive noise model:
$y = f(x;\theta) + \varepsilon = \theta^T \phi(x) + \varepsilon$
E.g., for a line (f affine in x),
$f(x;\theta) = \theta_1 x + \theta_0, \qquad \phi(x) = [1 \;\; x]^T, \qquad \theta = [\theta_0 \;\; \theta_1]^T$
This can be generalized to arbitrary functions of x:
$\phi(x) = [\phi_0(x) \;\cdots\; \phi_K(x)]^T, \qquad \theta = [\theta_0 \;\cdots\; \theta_K]^T$

Examples
Components of $\phi(x)$ can be arbitrary nonlinear functions of x that are linear in $\theta$:
- Line fitting (affine in x): $f(x;\theta) = \theta_1 x + \theta_0$, $\phi(x) = [1 \;\; x]^T$
- Polynomial fitting (nonlinear in x): $f(x;\theta) = \sum_{i=0}^{K} \theta_i x^i$, $\phi(x) = [1 \;\; x \;\cdots\; x^K]^T$
- Truncated Fourier series (nonlinear in x): $f(x;\theta) = \sum_{i=0}^{K} \theta_i \cos(ix)$, $\phi(x) = [1 \;\; \cos(x) \;\cdots\; \cos(Kx)]^T$
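A minimal numpy sketch of these feature maps (not from the slides; the helper names phi_poly and phi_fourier and the sample values are my own), showing how a batch of inputs becomes a design matrix:

```python
import numpy as np

def phi_poly(x, K):
    """Polynomial feature map: phi(x) = [1, x, x^2, ..., x^K]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**i for i in range(K + 1)], axis=-1)

def phi_fourier(x, K):
    """Truncated Fourier (cosine) feature map: phi(x) = [1, cos(x), ..., cos(Kx)]."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.cos(i * x) for i in range(K + 1)], axis=-1)

# Line fitting is just the K = 1 polynomial case: phi(x) = [1, x]
x = np.linspace(0.0, 1.0, 5)
print(phi_poly(x, 1).shape)      # (5, 2)  design matrix for the affine model
print(phi_poly(x, 3).shape)      # (5, 4)  design matrix for a cubic
print(phi_fourier(x, 2).shape)   # (5, 3)  design matrix for a truncated cosine series
```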

(Unweighted) Least Squares
Loss = squared Euclidean norm of the model error:
$L(\theta) = \| y - \Gamma(x)\,\theta \|^2$, where $y = [y_1 \;\cdots\; y_n]^T$ and $\Gamma(x)$ is the matrix whose rows are $\phi^T(x_1), \ldots, \phi^T(x_n)$.
The loss function can also be rewritten as
$L(\theta) = \sum_{i=1}^{n} \big( y_i - \theta^T \phi(x_i) \big)^2$
E.g., for the line, $L(\theta) = \sum_{i=1}^{n} \big( y_i - \theta_0 - \theta_1 x_i \big)^2$

Examples
The most important component is the design matrix $\Gamma(x)$:
Line fitting: $f(x;\theta) = \theta_1 x + \theta_0$, $\quad \Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$
Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{K} \theta_i x^i$, $\quad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^K \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^K \end{bmatrix}$
Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{K} \theta_i \cos(ix)$, $\quad \Gamma(x) = \begin{bmatrix} 1 & \cos(x_1) & \cdots & \cos(K x_1) \\ \vdots & & & \vdots \\ 1 & \cos(x_n) & \cdots & \cos(K x_n) \end{bmatrix}$

(Unweighted) Least Squares
One way to minimize $L(\theta) = \| y - \Gamma(x)\,\theta \|^2$ is to find a value $\hat\theta$ such that
$\nabla_\theta L(\hat\theta) = -2\,\Gamma^T(x)\big( y - \Gamma(x)\,\hat\theta \big) = 0 \quad$ and $\quad \nabla^2_\theta L(\hat\theta) = 2\,\Gamma^T(x)\Gamma(x) > 0$
These conditions will hold when $\Gamma(x)$ is one-to-one, yielding
$\Gamma^T(x)\Gamma(x)\,\hat\theta = \Gamma^T(x)\,y \quad\Longrightarrow\quad \hat\theta = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
Thus, the least squares solution is determined by the pseudoinverse of the design matrix $\Gamma(x)$:
$\Gamma^+(x) = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)$
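A short numpy sketch (my own toy data and variable names, not from the slides) of the pseudoinverse solution, together with the solver route that is normally preferred numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 2.0, n)
y = 0.5 + 1.5 * x + 0.1 * rng.standard_normal(n)   # noisy line, true theta = [0.5, 1.5]

Gamma = np.column_stack([np.ones(n), x])           # design matrix for the affine model

# Pseudoinverse solution: theta_hat = (Gamma^T Gamma)^{-1} Gamma^T y
theta_pinv = np.linalg.pinv(Gamma) @ y

# Numerically preferable: let a least-squares solver handle it (QR/SVD internally)
theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)

print(theta_pinv, theta_lstsq)                     # both close to [0.5, 1.5]
```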

(Unweighted) Least Squares
If the design matrix $\Gamma(x)$ is one-to-one (has full column rank), the least squares solution is conceptually easy to compute. This will nearly always be the case in practice.
Recall the example of fitting a line in the plane:
$\Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \Gamma^T(x)\Gamma(x) = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}, \qquad \Gamma^T(x)\,y = \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}$

(Unweighted) Least Squares
Pseudoinverse solution = LS solution: $\hat\theta = \Gamma^+(x)\,y = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
The LS line-fit solution is
$\hat\theta = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}$,
which (naively) requires a 2x2 matrix inversion followed by a 2x1 vector post-multiplication.
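A quick sketch of this 2x2 closed form, checked against numpy's generic solver (the data values are mine, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = -0.7 + 2.0 * x + 0.2 * rng.standard_normal(x.size)   # toy data, true theta = [-0.7, 2.0]

# Closed-form line fit: theta_hat = (Gamma^T Gamma)^{-1} Gamma^T y with the 2x2 matrices above
n = x.size
GtG = np.array([[n, x.sum()], [x.sum(), (x**2).sum()]])
Gty = np.array([y.sum(), (x * y).sum()])
theta_closed = np.linalg.solve(GtG, Gty)                  # solve, rather than invert explicitly

# Same answer from a generic least-squares solver
Gamma = np.column_stack([np.ones(n), x])
theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
print(theta_closed, theta_lstsq)                          # both approximately [-0.7, 2.0]
```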

(Unweighted) Least Squares
What about fitting a K-th order polynomial, $f(x;\theta) = \sum_{i=0}^{K} \theta_i x^i$?
$\Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^K \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^K \end{bmatrix}, \qquad \Gamma^T(x)\Gamma(x) = \begin{bmatrix} n & \sum_i x_i & \cdots & \sum_i x_i^K \\ \sum_i x_i & \sum_i x_i^2 & \cdots & \sum_i x_i^{K+1} \\ \vdots & & & \vdots \\ \sum_i x_i^K & \sum_i x_i^{K+1} & \cdots & \sum_i x_i^{2K} \end{bmatrix}$

(Unweighted) Least Squares
and
$\Gamma^T(x)\,y = \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \\ \vdots \\ \sum_i x_i^K y_i \end{bmatrix}$
Thus, when $\Gamma(x)$ is one-to-one (has full column rank):
$\hat\theta = \Gamma^+(x)\,y = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
Mathematically, this is a very straightforward procedure. Numerically, this is generally NOT how the solution is computed (viz. the sophisticated algorithms in Matlab).
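A short polynomial-fitting sketch (toy cubic data of my own choosing) that builds the polynomial design matrix and uses a numerically stable solver instead of forming the normal-equation inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(x.size)   # toy cubic data

K = 3
Gamma = np.vander(x, K + 1, increasing=True)     # columns [1, x, ..., x^K]

# Solver-based fit (QR/SVD under the hood) rather than (Gamma^T Gamma)^{-1} Gamma^T y
theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
print(theta_hat)                                 # approximately [1.0, -2.0, 0.0, 0.5]
```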

(Unweighted) Least Squares
Note the computational costs when (naively) fitting a line (1st order polynomial):
$\hat\theta = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}$:
a 2x2 matrix inversion, followed by a 2x1 vector post-multiplication.
When (naively) fitting a general K-th order polynomial:
$\hat\theta = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$:
a (K+1)x(K+1) matrix inversion, followed by a (K+1)x1 multiplication.
It is evident that the computational complexity is an increasing function of the number of parameters. More efficient (and numerically stable) algorithms are used in practice, but the complexity still scales with the number of parameters.

(Unweighted) Least Squares
Suppose the dataset {x_i} is constant in value over repeated, non-constant measurements of the dataset {y_i}; e.g., consider some process (e.g., temperature) where the measurements {y_i} are taken at the same locations {x_i} every day. Then the design matrix $\Gamma(x)$ is constant in value!
In advance (off-line), compute: $A = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)$
Every day (on-line), re-compute: $\hat\theta = A\,y$
Hence, least squares sometimes can be implemented very efficiently. There are also efficient recursive updates, e.g. the Kalman filter.
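A sketch of this off-line/on-line split (measurement locations, model, and values are my own toy choices):

```python
import numpy as np

# Fixed measurement locations {x_i}; the design matrix is therefore constant
x = np.linspace(0.0, 10.0, 24)
Gamma = np.column_stack([np.ones_like(x), x])

# Off-line: precompute A = (Gamma^T Gamma)^{-1} Gamma^T, i.e. the pseudoinverse of Gamma
A = np.linalg.pinv(Gamma)                        # shape (2, 24)

# On-line: each new day's measurements y only need one small matrix-vector product
rng = np.random.default_rng(3)
for day in range(3):
    y = 2.0 + 0.3 * x + 0.1 * rng.standard_normal(x.size)
    theta_hat = A @ y                            # cheap per-day estimate
    print(f"day {day}: theta_hat = {theta_hat}")
```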

Geometric Solution & Interpretation
Alternatively, we can derive the least squares solution using geometric (Hilbert space) considerations.
Goal: minimize the size (norm) of the model prediction error (aka residual error), $e(\theta) = y - \Gamma(x)\,\theta$:
$L(\theta) = \|e(\theta)\|^2 = \| y - \Gamma(x)\,\theta \|^2$
Note that, given a known design matrix $\Gamma(x)$, the vector $\Gamma(x)\,\theta = \sum_i \theta_i\,\gamma_i$ is a linear combination of the column vectors $\gamma_i$, i.e. $\Gamma(x)\,\theta$ is in the range space (column space) of $\Gamma(x)$.

(Hilbert Space) Geometric Interpretation
The vector $\Gamma\theta$ lives in the range (column space) of $\Gamma = \Gamma(x) = [\gamma_1, \ldots, \gamma_{K+1}]$.
Assume that y is as shown in the slide's figure. Equivalent statements are:
- $\Gamma\hat\theta$ is the value of $\Gamma\theta$ closest to y in the range of $\Gamma$.
- $\Gamma\hat\theta$ is the orthogonal projection of y onto the range of $\Gamma$.
- The residual $e = y - \Gamma\hat\theta$ is orthogonal ($\perp$) to the range of $\Gamma$.

Geometric Interpretation of LS
$e = y - \Gamma\hat\theta$ is $\perp$ to the range of $\Gamma$ iff it is orthogonal to every column of $\Gamma$:
$\gamma_i^T \big( y - \Gamma\hat\theta \big) = 0, \quad i = 1, \ldots, K+1$
Thus $e = y - \Gamma\hat\theta$ is in the nullspace of $\Gamma^T$:
$\Gamma^T \big( y - \Gamma\hat\theta \big) = 0 \quad\Longleftrightarrow\quad \Gamma^T y - \Gamma^T\Gamma\,\hat\theta = 0$

Geometric Interpretation of LS
Note: nullspace condition $\Longleftrightarrow$ normal equation:
$\Gamma^T \big( y - \Gamma\hat\theta \big) = 0 \quad\Longleftrightarrow\quad \Gamma^T\Gamma\,\hat\theta = \Gamma^T y$
Thus, if $\Gamma$ is one-to-one (has full column rank) we again get the pseudoinverse solution:
$\hat\theta = \Gamma^+(x)\,y = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
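A small numerical check of the normal equations and of the orthogonality of the residual (toy data and names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 40)
y = 0.3 + 1.2 * x + 0.05 * rng.standard_normal(x.size)

Gamma = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)

# The residual lies in the nullspace of Gamma^T: Gamma^T e = 0 (up to round-off)
e = y - Gamma @ theta_hat
print(Gamma.T @ e)                                                   # ~ [0, 0]

# Equivalently, theta_hat satisfies the normal equations Gamma^T Gamma theta = Gamma^T y
print(np.allclose(Gamma.T @ Gamma @ theta_hat, Gamma.T @ y))         # True
```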

Probabilistic Interpretation
We have seen that estimating a parameter by minimizing the least squares loss function is a special case of MLE. This interpretation holds for many loss functions.
First, note that
$\hat\theta = \arg\min_\theta L\big[y, f(x;\theta)\big] = \arg\max_\theta e^{-L[y, f(x;\theta)]}$
Now note that, because
$e^{-L[y, f(x;\theta)]} \ge 0, \quad \forall\, \theta, y, x$,
we can make this exponential function into a pdf $P_{Y|X}$ by an appropriate normalization.

Prob. Interpretation of Loss Minimization
I.e., by defining
$P_{Y|X}(y \mid x; \theta) = \frac{1}{Z(x;\theta)}\, e^{-L[y, f(x;\theta)]}, \qquad Z(x;\theta) = \int e^{-L[y, f(x;\theta)]}\, dy$
If the normalization constant $Z(x;\theta)$ does not depend on $\theta$, i.e. $Z(x;\theta) = Z(x)$, then
$\hat\theta = \arg\max_\theta e^{-L[y, f(x;\theta)]} = \arg\max_\theta P_{Y|X}(y \mid x; \theta)$,
which makes the problem a special case of MLE.

Prob. Interpretation of Loss Minimization
Note that for loss functions of the form
$L(\theta) = g\big[\, y - f(x;\theta) \,\big]$,
the model $f(x;\theta)$ only changes the mean of Y. A shift in mean does not change the shape of a pdf, and therefore cannot change the value of the normalization constant.
Hence, for loss functions of the type above, it is always true that
$\hat\theta = \arg\max_\theta e^{-L[y, f(x;\theta)]} = \arg\max_\theta P_{Y|X}(y \mid x; \theta)$

Regression
This leads to the interpretation that we saw last time, which is the usual definition of a regression problem:
- two random variables X and Y
- a dataset of examples D = {(x_1, y_1), ..., (x_n, y_n)}
- a parametric model of the form $y = f(x;\theta) + \varepsilon$, where $\theta$ is a parameter vector and $\varepsilon$ is a random variable that accounts for noise
- the pdf of the noise determines the loss in the other formulation

Regression
Error pdf and equivalent loss function:
- Gaussian, $P_\varepsilon(\varepsilon) \propto e^{-\varepsilon^2}$  <->  L2 distance, $L(x,y) = (y - x)^2$
- Laplacian, $P_\varepsilon(\varepsilon) \propto e^{-|\varepsilon|}$  <->  L1 distance, $L(x,y) = |y - x|$
- Rayleigh, $P_\varepsilon(\varepsilon) \propto \varepsilon\, e^{-\varepsilon^2}$  <->  Rayleigh distance, $L(x,y) = (y - x)^2 - \log(y - x)$
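A tiny sketch of these correspondences written as negative log-likelihoods (unit scale parameters assumed, additive constants dropped; the function names are mine):

```python
import numpy as np

# Losses implied by each noise model, applied to the residual r = y - f(x; theta)
def l2_loss(r):        # Gaussian error pdf  ->  squared error
    return r**2

def l1_loss(r):        # Laplacian error pdf ->  absolute error
    return np.abs(r)

def rayleigh_loss(r):  # Rayleigh error pdf  ->  r^2 - log(r), positive residuals only
    return r**2 - np.log(r)

r = np.array([0.5, 1.0, 2.0])
print(l2_loss(r), l1_loss(r), rayleigh_loss(r))
```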

Regression
We know how to solve the problem with L2 losses. Why would we want the added complexity incurred by introducing error probability models?
The main reason is that this allows a data-driven definition of the loss. One good way to see this is the problem of weighted least squares.
Suppose that you know that not all measurements (x_i, y_i) have the same importance. This can be encoded in the loss function. Remember that the unweighted loss is
$L(\theta) = \sum_i \big( y_i - \theta^T \phi(x_i) \big)^2$

Regression
To weigh different points differently we could use
$L(\theta) = \sum_i w_i \big( y_i - \theta^T \phi(x_i) \big)^2, \quad w_i \ge 0$,
or even the more generic form
$L(\theta) = \big( y - \Gamma(x)\,\theta \big)^T W \big( y - \Gamma(x)\,\theta \big), \quad W = W^T \ge 0$.
In the latter case the solution is (homework)
$\theta^* = \big(\Gamma^T(x)\, W\, \Gamma(x)\big)^{-1} \Gamma^T(x)\, W\, y$
The question is: how do I know these weights? Without a probabilistic model one has little guidance on this.
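A weighted least squares sketch with a diagonal W (heteroscedastic toy data of my own; the inverse-variance choice of weights anticipates the next two slides):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 60)
sigma = np.where(x < 0.5, 0.05, 0.5)            # noisier measurements for x >= 0.5
y = 1.0 + 3.0 * x + sigma * rng.standard_normal(x.size)

Gamma = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)                     # diagonal weights = inverse variances

# Weighted LS: theta* = (Gamma^T W Gamma)^{-1} Gamma^T W y
theta_wls = np.linalg.solve(Gamma.T @ W @ Gamma, Gamma.T @ W @ y)

# Ordinary (unweighted) LS for comparison
theta_ols, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
print(theta_wls, theta_ols)                     # WLS is typically closer to [1.0, 3.0]
```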

Regression
The probabilistic equivalent,
$\theta^* = \arg\max_\theta e^{-L[y, f(x;\theta)]} = \arg\max_\theta \exp\Big\{ -\big( y - \Gamma(x)\,\theta \big)^T W \big( y - \Gamma(x)\,\theta \big) \Big\}$,
is the MLE for a Gaussian pdf of known covariance $W^{-1}$.
In the case where the covariance $W^{-1}$ is diagonal, $W^{-1} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, we have
$L(\theta) = \sum_i \frac{\big( y_i - \theta^T \phi(x_i) \big)^2}{\sigma_i^2}$
In this case, each point is weighted by the inverse variance.

Regression
This makes sense. Under the probabilistic formulation, $\sigma_i^2$ is the variance of the error associated with the i-th observation. This means that it is a measure of the uncertainty of the observation.
When
$y_i = f(x_i;\theta) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2), \quad W = \mathrm{diag}\big(1/\sigma_1^2, \ldots, 1/\sigma_n^2\big)$,
we are weighting each point by the inverse of its uncertainty (variance).
We can also check the goodness of this weighting matrix by plotting the histogram of the errors: if W is chosen correctly, the errors should be Gaussian.

Model Validation
In fact, by analyzing the errors of the fit we can say a lot. This is called model validation:
- Leave some data on the side, and run it through the predictor.
- Analyze the errors to see if there is deviation from the assumed model (Gaussianity for least squares).
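A minimal held-out-residual sketch of this idea (my own toy data and split; in practice one would plot the histogram or use a normality test, e.g. from scipy.stats):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 2.0, 200)
y = 0.5 + 1.5 * x + 0.2 * rng.standard_normal(x.size)

# Hold out the last 50 points for validation
train, val = np.arange(150), np.arange(150, 200)
Gamma = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(Gamma[train], y[train], rcond=None)

# Residuals on the held-out set; under the assumed model they should look Gaussian
e_val = y[val] - Gamma[val] @ theta_hat
skew = ((e_val - e_val.mean())**3).mean() / e_val.std()**3
print(f"mean {e_val.mean():.3f}, std {e_val.std():.3f}, skewness {skew:.3f}")

# A crude check: histogram counts of the held-out errors
counts, edges = np.histogram(e_val, bins=10)
print(counts)
```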

Model Validation & Improvement
Many times this will give you hints to alternative models that may fit the data better. Typical problems for least squares (and their solutions) are illustrated in the following examples.

Model Validation & Improvement
Example error histogram: this does not look Gaussian. Looking at the scatter plot of the error $y - f(x, \theta^*)$, there is an increasing trend, so maybe we should try
$\log(y_i) = f(x_i;\theta) + \varepsilon_i$
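A sketch of this kind of target transformation (toy data with multiplicative noise, chosen by me so that the raw residuals fan out while the log-model residuals do not):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.1, 3.0, 150)
# Multiplicative noise: the spread of y grows with y, but log(y) has additive Gaussian noise
y = np.exp(0.4 + 0.8 * x) * np.exp(0.1 * rng.standard_normal(x.size))

Gamma = np.column_stack([np.ones_like(x), x])

# Original model y = f(x; theta) + eps: residual spread increases with x
theta_raw, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
e_raw = y - Gamma @ theta_raw

# Transformed model log(y) = f(x; theta) + eps: residuals look much more Gaussian
theta_log, *_ = np.linalg.lstsq(Gamma, np.log(y), rcond=None)
e_log = np.log(y) - Gamma @ theta_log

print(e_raw.std(), e_log.std())
```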

Model Validation & Improvement
Example error histogram for the new model: this looks Gaussian, so this model is probably better. There are statistical tests that you can use to check this objectively; these are covered in statistics classes.

Model Validation & Improvement
Example error histogram: this also does not look Gaussian. Checking the scatter plot now seems to suggest trying
$\sqrt{y_i} = f(x_i;\theta) + \varepsilon_i$

Model Validation & Improvement
Example error histogram for the new model: once again, this seems to work; the residual behavior looks Gaussian.
However, it is NOT always the case that such changes will work. If not, maybe the problem is the assumption of Gaussianity itself: move away from least squares and try MLE with other error pdfs.

END