Least Squares. Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Winter UCSD


(Unweighted) Least Squares
Assume linearity in the unknown, deterministic model parameters $\theta$. Scalar, additive noise model:
$y = f(x;\theta) + \varepsilon = \theta^T \phi(x) + \varepsilon$
E.g., for a line (f affine in x),
$f(x;\theta) = \theta_1 x + \theta_0, \qquad \phi(x) = [1 \;\; x]^T, \qquad \theta = [\theta_0 \;\; \theta_1]^T$
This can be generalized to arbitrary functions of x:
$\phi(x) = [\phi_0(x) \;\cdots\; \phi_K(x)]^T, \qquad \theta = [\theta_0 \;\cdots\; \theta_K]^T$

Examples
Components of $\phi(x)$ can be arbitrary nonlinear functions of x that are linear in $\theta$:
- Line fitting (affine in x): $f(x;\theta) = \theta_1 x + \theta_0$, $\phi(x) = [1 \;\; x]^T$
- Polynomial fitting (nonlinear in x): $f(x;\theta) = \sum_{i=0}^{K} \theta_i x^i$, $\phi(x) = [1 \;\; x \;\cdots\; x^K]^T$
- Truncated Fourier series (nonlinear in x): $f(x;\theta) = \sum_{i=0}^{K} \theta_i \cos(ix)$, $\phi(x) = [1 \;\; \cos(x) \;\cdots\; \cos(Kx)]^T$
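A minimal numpy sketch of these feature maps (not from the slides; the helper names phi_poly and phi_fourier and the sample values are my own), showing how a batch of inputs becomes a design matrix:

```python
import numpy as np

def phi_poly(x, K):
    """Polynomial feature map: phi(x) = [1, x, x^2, ..., x^K]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**i for i in range(K + 1)], axis=-1)

def phi_fourier(x, K):
    """Truncated Fourier (cosine) feature map: phi(x) = [1, cos(x), ..., cos(Kx)]."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.cos(i * x) for i in range(K + 1)], axis=-1)

# Line fitting is just the K = 1 polynomial case: phi(x) = [1, x]
x = np.linspace(0.0, 1.0, 5)
print(phi_poly(x, 1).shape)      # (5, 2)  design matrix for the affine model
print(phi_poly(x, 3).shape)      # (5, 4)  design matrix for a cubic
print(phi_fourier(x, 2).shape)   # (5, 3)  design matrix for a truncated cosine series
```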

(Unweighted) Least Squares
Loss = squared Euclidean norm of the model error:
$L(\theta) = \| y - \Gamma(x)\,\theta \|^2$, where $y = [y_1 \;\cdots\; y_n]^T$ and $\Gamma(x)$ is the matrix whose rows are $\phi^T(x_1), \ldots, \phi^T(x_n)$.
The loss function can also be rewritten as
$L(\theta) = \sum_{i=1}^{n} \big( y_i - \theta^T \phi(x_i) \big)^2$
E.g., for the line, $L(\theta) = \sum_{i=1}^{n} \big( y_i - \theta_0 - \theta_1 x_i \big)^2$

Examples
The most important component is the design matrix $\Gamma(x)$:
Line fitting: $f(x;\theta) = \theta_1 x + \theta_0$, $\quad \Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$
Polynomial fitting: $f(x;\theta) = \sum_{i=0}^{K} \theta_i x^i$, $\quad \Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^K \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^K \end{bmatrix}$
Truncated Fourier series: $f(x;\theta) = \sum_{i=0}^{K} \theta_i \cos(ix)$, $\quad \Gamma(x) = \begin{bmatrix} 1 & \cos(x_1) & \cdots & \cos(K x_1) \\ \vdots & & & \vdots \\ 1 & \cos(x_n) & \cdots & \cos(K x_n) \end{bmatrix}$

(Unweighted) Least Squares
One way to minimize $L(\theta) = \| y - \Gamma(x)\,\theta \|^2$ is to find a value $\hat\theta$ such that
$\nabla_\theta L(\hat\theta) = -2\,\Gamma^T(x)\big( y - \Gamma(x)\,\hat\theta \big) = 0 \quad$ and $\quad \nabla^2_\theta L(\hat\theta) = 2\,\Gamma^T(x)\Gamma(x) > 0$
These conditions will hold when $\Gamma(x)$ is one-to-one, yielding
$\Gamma^T(x)\Gamma(x)\,\hat\theta = \Gamma^T(x)\,y \quad\Longrightarrow\quad \hat\theta = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
Thus, the least squares solution is determined by the pseudoinverse of the design matrix $\Gamma(x)$:
$\Gamma^+(x) = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)$
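A short numpy sketch (my own toy data and variable names, not from the slides) of the pseudoinverse solution, together with the solver route that is normally preferred numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 2.0, n)
y = 0.5 + 1.5 * x + 0.1 * rng.standard_normal(n)   # noisy line, true theta = [0.5, 1.5]

Gamma = np.column_stack([np.ones(n), x])           # design matrix for the affine model

# Pseudoinverse solution: theta_hat = (Gamma^T Gamma)^{-1} Gamma^T y
theta_pinv = np.linalg.pinv(Gamma) @ y

# Numerically preferable: let a least-squares solver handle it (QR/SVD internally)
theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)

print(theta_pinv, theta_lstsq)                     # both close to [0.5, 1.5]
```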

(Unweighted) Least Squares
If the design matrix $\Gamma(x)$ is one-to-one (has full column rank), the least squares solution is conceptually easy to compute. This will nearly always be the case in practice.
Recall the example of fitting a line in the plane:
$\Gamma(x) = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad \Gamma^T(x)\Gamma(x) = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}, \qquad \Gamma^T(x)\,y = \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}$

(Unweighted) Least Squares
Pseudoinverse solution = LS solution: $\hat\theta = \Gamma^+(x)\,y = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
The LS line-fit solution is
$\hat\theta = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}$,
which (naively) requires a 2x2 matrix inversion followed by a 2x1 vector post-multiplication.
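A quick sketch of this 2x2 closed form, checked against numpy's generic solver (the data values are mine, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
y = -0.7 + 2.0 * x + 0.2 * rng.standard_normal(x.size)   # toy data, true theta = [-0.7, 2.0]

# Closed-form line fit: theta_hat = (Gamma^T Gamma)^{-1} Gamma^T y with the 2x2 matrices above
n = x.size
GtG = np.array([[n, x.sum()], [x.sum(), (x**2).sum()]])
Gty = np.array([y.sum(), (x * y).sum()])
theta_closed = np.linalg.solve(GtG, Gty)                  # solve, rather than invert explicitly

# Same answer from a generic least-squares solver
Gamma = np.column_stack([np.ones(n), x])
theta_lstsq, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
print(theta_closed, theta_lstsq)                          # both approximately [-0.7, 2.0]
```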

(Unweighted) Least Squares
What about fitting a K-th order polynomial, $f(x;\theta) = \sum_{i=0}^{K} \theta_i x^i$?
$\Gamma(x) = \begin{bmatrix} 1 & x_1 & \cdots & x_1^K \\ \vdots & & & \vdots \\ 1 & x_n & \cdots & x_n^K \end{bmatrix}, \qquad \Gamma^T(x)\Gamma(x) = \begin{bmatrix} n & \sum_i x_i & \cdots & \sum_i x_i^K \\ \sum_i x_i & \sum_i x_i^2 & \cdots & \sum_i x_i^{K+1} \\ \vdots & & & \vdots \\ \sum_i x_i^K & \sum_i x_i^{K+1} & \cdots & \sum_i x_i^{2K} \end{bmatrix}$

(Unweighted) Least Squares
and
$\Gamma^T(x)\,y = \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \\ \vdots \\ \sum_i x_i^K y_i \end{bmatrix}$
Thus, when $\Gamma(x)$ is one-to-one (has full column rank):
$\hat\theta = \Gamma^+(x)\,y = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
Mathematically, this is a very straightforward procedure. Numerically, this is generally NOT how the solution is computed (viz. the sophisticated algorithms in Matlab).
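A short polynomial-fitting sketch (toy cubic data of my own choosing) that builds the polynomial design matrix and uses a numerically stable solver instead of forming the normal-equation inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(x.size)   # toy cubic data

K = 3
Gamma = np.vander(x, K + 1, increasing=True)     # columns [1, x, ..., x^K]

# Solver-based fit (QR/SVD under the hood) rather than (Gamma^T Gamma)^{-1} Gamma^T y
theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
print(theta_hat)                                 # approximately [1.0, -2.0, 0.0, 0.5]
```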

(Unweighted) Least Squares
Note the computational costs when (naively) fitting a line (1st order polynomial):
$\hat\theta = \begin{bmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{bmatrix}$:
a 2x2 matrix inversion, followed by a 2x1 vector post-multiplication.
When (naively) fitting a general K-th order polynomial:
$\hat\theta = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$:
a (K+1)x(K+1) matrix inversion, followed by a (K+1)x1 multiplication.
It is evident that the computational complexity is an increasing function of the number of parameters. More efficient (and numerically stable) algorithms are used in practice, but the complexity still scales with the number of parameters.

(Unweighted) Least Squares
Suppose the dataset {x_i} is constant in value over repeated, non-constant measurements of the dataset {y_i}; e.g., consider some process (e.g., temperature) where the measurements {y_i} are taken at the same locations {x_i} every day. Then the design matrix $\Gamma(x)$ is constant in value!
In advance (off-line), compute: $A = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)$
Every day (on-line), re-compute: $\hat\theta = A\,y$
Hence, least squares sometimes can be implemented very efficiently. There are also efficient recursive updates, e.g. the Kalman filter.
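A sketch of this off-line/on-line split (measurement locations, model, and values are my own toy choices):

```python
import numpy as np

# Fixed measurement locations {x_i}; the design matrix is therefore constant
x = np.linspace(0.0, 10.0, 24)
Gamma = np.column_stack([np.ones_like(x), x])

# Off-line: precompute A = (Gamma^T Gamma)^{-1} Gamma^T, i.e. the pseudoinverse of Gamma
A = np.linalg.pinv(Gamma)                        # shape (2, 24)

# On-line: each new day's measurements y only need one small matrix-vector product
rng = np.random.default_rng(3)
for day in range(3):
    y = 2.0 + 0.3 * x + 0.1 * rng.standard_normal(x.size)
    theta_hat = A @ y                            # cheap per-day estimate
    print(f"day {day}: theta_hat = {theta_hat}")
```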

Geometric Solution & Interpretation
Alternatively, we can derive the least squares solution using geometric (Hilbert space) considerations.
Goal: minimize the size (norm) of the model prediction error (aka residual error), $e(\theta) = y - \Gamma(x)\,\theta$:
$L(\theta) = \|e(\theta)\|^2 = \| y - \Gamma(x)\,\theta \|^2$
Note that, given a known design matrix $\Gamma(x)$, the vector $\Gamma(x)\,\theta = \sum_i \theta_i\,\gamma_i$ is a linear combination of the column vectors $\gamma_i$, i.e. $\Gamma(x)\,\theta$ is in the range space (column space) of $\Gamma(x)$.

(Hilbert Space) Geometric Interpretation
The vector $\Gamma\theta$ lives in the range (column space) of $\Gamma = \Gamma(x) = [\gamma_1, \ldots, \gamma_{K+1}]$.
Assume that y is as shown in the slide's figure. Equivalent statements are:
- $\Gamma\hat\theta$ is the value of $\Gamma\theta$ closest to y in the range of $\Gamma$.
- $\Gamma\hat\theta$ is the orthogonal projection of y onto the range of $\Gamma$.
- The residual $e = y - \Gamma\hat\theta$ is orthogonal ($\perp$) to the range of $\Gamma$.

Geometric Interpretation of LS
$e = y - \Gamma\hat\theta$ is $\perp$ to the range of $\Gamma$ iff it is orthogonal to every column of $\Gamma$:
$\gamma_i^T \big( y - \Gamma\hat\theta \big) = 0, \quad i = 1, \ldots, K+1$
Thus $e = y - \Gamma\hat\theta$ is in the nullspace of $\Gamma^T$:
$\Gamma^T \big( y - \Gamma\hat\theta \big) = 0 \quad\Longleftrightarrow\quad \Gamma^T y - \Gamma^T\Gamma\,\hat\theta = 0$

Geometric Interpretation of LS
Note: nullspace condition $\Longleftrightarrow$ normal equation:
$\Gamma^T \big( y - \Gamma\hat\theta \big) = 0 \quad\Longleftrightarrow\quad \Gamma^T\Gamma\,\hat\theta = \Gamma^T y$
Thus, if $\Gamma$ is one-to-one (has full column rank) we again get the pseudoinverse solution:
$\hat\theta = \Gamma^+(x)\,y = \big(\Gamma^T(x)\Gamma(x)\big)^{-1}\Gamma^T(x)\,y$
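A small numerical check of the normal equations and of the orthogonality of the residual (toy data and names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 40)
y = 0.3 + 1.2 * x + 0.05 * rng.standard_normal(x.size)

Gamma = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(Gamma, y, rcond=None)

# The residual lies in the nullspace of Gamma^T: Gamma^T e = 0 (up to round-off)
e = y - Gamma @ theta_hat
print(Gamma.T @ e)                                                   # ~ [0, 0]

# Equivalently, theta_hat satisfies the normal equations Gamma^T Gamma theta = Gamma^T y
print(np.allclose(Gamma.T @ Gamma @ theta_hat, Gamma.T @ y))         # True
```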

Probabilistic Interpretation
We have seen that estimating a parameter by minimizing the least squares loss function is a special case of MLE. This interpretation holds for many loss functions.
First, note that
$\hat\theta = \arg\min_\theta L\big[y, f(x;\theta)\big] = \arg\max_\theta e^{-L[y, f(x;\theta)]}$
Now note that, because
$e^{-L[y, f(x;\theta)]} \ge 0, \quad \forall\, \theta, y, x$,
we can make this exponential function into a pdf $P_{Y|X}$ by an appropriate normalization.

Prob. Interpretation of Loss Minimization
I.e., by defining
$P_{Y|X}(y \mid x; \theta) = \frac{1}{Z(x;\theta)}\, e^{-L[y, f(x;\theta)]}, \qquad Z(x;\theta) = \int e^{-L[y, f(x;\theta)]}\, dy$
If the normalization constant $Z(x;\theta)$ does not depend on $\theta$, i.e. $Z(x;\theta) = Z(x)$, then
$\hat\theta = \arg\max_\theta e^{-L[y, f(x;\theta)]} = \arg\max_\theta P_{Y|X}(y \mid x; \theta)$,
which makes the problem a special case of MLE.

Prob. Interpretation of Loss Minimization
Note that for loss functions of the form
$L(\theta) = g\big[\, y - f(x;\theta) \,\big]$,
the model $f(x;\theta)$ only changes the mean of Y. A shift in mean does not change the shape of a pdf, and therefore cannot change the value of the normalization constant.
Hence, for loss functions of the type above, it is always true that
$\hat\theta = \arg\max_\theta e^{-L[y, f(x;\theta)]} = \arg\max_\theta P_{Y|X}(y \mid x; \theta)$

Regression
This leads to the interpretation that we saw last time, which is the usual definition of a regression problem:
- two random variables X and Y
- a dataset of examples D = {(x_1, y_1), ..., (x_n, y_n)}
- a parametric model of the form $y = f(x;\theta) + \varepsilon$, where $\theta$ is a parameter vector and $\varepsilon$ is a random variable that accounts for noise
- the pdf of the noise determines the loss in the other formulation

Regression
Error pdf and equivalent loss function:
- Gaussian, $P_\varepsilon(\varepsilon) \propto e^{-\varepsilon^2}$  <->  L2 distance, $L(x,y) = (y - x)^2$
- Laplacian, $P_\varepsilon(\varepsilon) \propto e^{-|\varepsilon|}$  <->  L1 distance, $L(x,y) = |y - x|$
- Rayleigh, $P_\varepsilon(\varepsilon) \propto \varepsilon\, e^{-\varepsilon^2}$  <->  Rayleigh distance, $L(x,y) = (y - x)^2 - \log(y - x)$
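A tiny sketch of these correspondences written as negative log-likelihoods (unit scale parameters assumed, additive constants dropped; the function names are mine):

```python
import numpy as np

# Losses implied by each noise model, applied to the residual r = y - f(x; theta)
def l2_loss(r):        # Gaussian error pdf  ->  squared error
    return r**2

def l1_loss(r):        # Laplacian error pdf ->  absolute error
    return np.abs(r)

def rayleigh_loss(r):  # Rayleigh error pdf  ->  r^2 - log(r), positive residuals only
    return r**2 - np.log(r)

r = np.array([0.5, 1.0, 2.0])
print(l2_loss(r), l1_loss(r), rayleigh_loss(r))
```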

Regression
We know how to solve the problem with L2 losses. Why would we want the added complexity incurred by introducing error probability models?
The main reason is that this allows a data-driven definition of the loss. One good way to see this is the problem of weighted least squares.
Suppose that you know that not all measurements (x_i, y_i) have the same importance. This can be encoded in the loss function. Remember that the unweighted loss is
$L(\theta) = \sum_i \big( y_i - \theta^T \phi(x_i) \big)^2$

Regression
To weigh different points differently we could use
$L(\theta) = \sum_i w_i \big( y_i - \theta^T \phi(x_i) \big)^2, \quad w_i \ge 0$,
or even the more generic form
$L(\theta) = \big( y - \Gamma(x)\,\theta \big)^T W \big( y - \Gamma(x)\,\theta \big), \quad W = W^T \ge 0$.
In the latter case the solution is (homework)
$\theta^* = \big(\Gamma^T(x)\, W\, \Gamma(x)\big)^{-1} \Gamma^T(x)\, W\, y$
The question is: how do I know these weights? Without a probabilistic model one has little guidance on this.
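A weighted least squares sketch with a diagonal W (heteroscedastic toy data of my own; the inverse-variance choice of weights anticipates the next two slides):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 60)
sigma = np.where(x < 0.5, 0.05, 0.5)            # noisier measurements for x >= 0.5
y = 1.0 + 3.0 * x + sigma * rng.standard_normal(x.size)

Gamma = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)                     # diagonal weights = inverse variances

# Weighted LS: theta* = (Gamma^T W Gamma)^{-1} Gamma^T W y
theta_wls = np.linalg.solve(Gamma.T @ W @ Gamma, Gamma.T @ W @ y)

# Ordinary (unweighted) LS for comparison
theta_ols, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
print(theta_wls, theta_ols)                     # WLS is typically closer to [1.0, 3.0]
```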

Regression
The probabilistic equivalent,
$\theta^* = \arg\max_\theta e^{-L[y, f(x;\theta)]} = \arg\max_\theta \exp\Big\{ -\big( y - \Gamma(x)\,\theta \big)^T W \big( y - \Gamma(x)\,\theta \big) \Big\}$,
is the MLE for a Gaussian pdf of known covariance $W^{-1}$.
In the case where the covariance $W^{-1}$ is diagonal, $W^{-1} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, we have
$L(\theta) = \sum_i \frac{\big( y_i - \theta^T \phi(x_i) \big)^2}{\sigma_i^2}$
In this case, each point is weighted by the inverse variance.

Regression
This makes sense. Under the probabilistic formulation, $\sigma_i^2$ is the variance of the error associated with the i-th observation. This means that it is a measure of the uncertainty of the observation.
When
$y_i = f(x_i;\theta) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2), \quad W = \mathrm{diag}\big(1/\sigma_1^2, \ldots, 1/\sigma_n^2\big)$,
we are weighting each point by the inverse of its uncertainty (variance).
We can also check the goodness of this weighting matrix by plotting the histogram of the errors: if W is chosen correctly, the errors should be Gaussian.

Model Validation
In fact, by analyzing the errors of the fit we can say a lot. This is called model validation:
- Leave some data on the side, and run it through the predictor.
- Analyze the errors to see if there is deviation from the assumed model (Gaussianity for least squares).
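A minimal held-out-residual sketch of this idea (my own toy data and split; in practice one would plot the histogram or use a normality test, e.g. from scipy.stats):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 2.0, 200)
y = 0.5 + 1.5 * x + 0.2 * rng.standard_normal(x.size)

# Hold out the last 50 points for validation
train, val = np.arange(150), np.arange(150, 200)
Gamma = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(Gamma[train], y[train], rcond=None)

# Residuals on the held-out set; under the assumed model they should look Gaussian
e_val = y[val] - Gamma[val] @ theta_hat
skew = ((e_val - e_val.mean())**3).mean() / e_val.std()**3
print(f"mean {e_val.mean():.3f}, std {e_val.std():.3f}, skewness {skew:.3f}")

# A crude check: histogram counts of the held-out errors
counts, edges = np.histogram(e_val, bins=10)
print(counts)
```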

Model Validation & Improvement
Many times this will give you hints to alternative models that may fit the data better. Typical problems for least squares (and their solutions) are illustrated in the following examples.

Model Validation & Improvement
Example error histogram: this does not look Gaussian. Looking at the scatter plot of the error $y - f(x, \theta^*)$, there is an increasing trend, so maybe we should try
$\log(y_i) = f(x_i;\theta) + \varepsilon_i$
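A sketch of this kind of target transformation (toy data with multiplicative noise, chosen by me so that the raw residuals fan out while the log-model residuals do not):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.1, 3.0, 150)
# Multiplicative noise: the spread of y grows with y, but log(y) has additive Gaussian noise
y = np.exp(0.4 + 0.8 * x) * np.exp(0.1 * rng.standard_normal(x.size))

Gamma = np.column_stack([np.ones_like(x), x])

# Original model y = f(x; theta) + eps: residual spread increases with x
theta_raw, *_ = np.linalg.lstsq(Gamma, y, rcond=None)
e_raw = y - Gamma @ theta_raw

# Transformed model log(y) = f(x; theta) + eps: residuals look much more Gaussian
theta_log, *_ = np.linalg.lstsq(Gamma, np.log(y), rcond=None)
e_log = np.log(y) - Gamma @ theta_log

print(e_raw.std(), e_log.std())
```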

Model Validation & Improvement
Example error histogram for the new model: this looks Gaussian, so this model is probably better. There are statistical tests that you can use to check this objectively; these are covered in statistics classes.

Model Validation & Improvement
Example error histogram: this also does not look Gaussian. Checking the scatter plot now seems to suggest trying
$\sqrt{y_i} = f(x_i;\theta) + \varepsilon_i$

Model Validation & Improvement
Example error histogram for the new model: once again, this seems to work; the residual behavior looks Gaussian.
However, it is NOT always the case that such changes will work. If not, maybe the problem is the assumption of Gaussianity itself: move away from least squares and try MLE with other error pdfs.

END