Statistical Filtering and Control for AI and Robotics. Part II. Linear methods for regression & Kalman filtering


1 Statistical Filtering and Control for AI and Robotics
Part II. Linear methods for regression & Kalman filtering
Riccardo Muradore

2 Outline
Linear Methods for Regression
Gaussian filter
Stochastic model
Kalman filtering
Kalman smoother

3 References
These lectures are based on the following books:
Sebastian Thrun, Wolfram Burgard and Dieter Fox, Probabilistic Robotics, MIT Press, 2005
Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009
Several pictures from those books have been copied and pasted here.

4 Linear Methods for Regression

5 Probability
Supervised learning: use the inputs (i.e. predictors, independent variables, features) to predict the values of the outputs (i.e. responses, dependent variables).
This distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs.
Notation:
x ∈ R^m: random variable (x_i ∈ R is its i-th component)
x ∈ R^m: an observation of the random variable x ∈ R^m (x_i ∈ R is its i-th component)
X ∈ R^{N×m}: a collection of N observations (X_i^T ∈ R^m is its i-th row)
We will focus on the regression problem: this means that input and output vectors consist of quantitative measurements.

6 Linear Models
Input: x ∈ R^m, x ∈ R^m, X ∈ R^{N×m}
Output: y ∈ R^p, y ∈ R^p, Y ∈ R^{N×p}
Prediction: ŷ ∈ R^p, ŷ ∈ R^p, Ŷ ∈ R^{N×p}
Linear model (from now on p = 1):
y = f(x) = x^T β, where β ∈ R^m
Prediction:
ŷ = x^T β̂, where β̂ ∈ R^m is the vector of coefficients that we have to determine.
Remark. If p = 1, the gradient ∇f(x) = ∂f(x)/∂x = β is a vector pointing in the steepest uphill direction.

7 Least Squares
Let X ∈ R^{N×m} and Y ∈ R^N be a training set of data (a collection of N pairs (x, y)).
How do we choose β? First of all we have to introduce an index as a function of β. Let RSS(β) be the residual sum of squares
RSS(β) := Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) = (Y − Xβ)^T (Y − Xβ)
We search for
β̂ := arg min_β RSS(β)
Computing the first and second derivatives we get
∂RSS(β)/∂β = −2 X^T (Y − Xβ)
∂²RSS(β)/∂β∂β^T = 2 X^T X

8 Least Squares
If X^T X is nonsingular (i.e. X has full column rank), the unique solution is given by the normal equations
∂RSS(β)/∂β = 0  ⇔  X^T (Y − Xβ) = 0
β̂ = (X^T X)^{-1} X^T Y
and the prediction of y given a new value x is
ŷ = x^T β̂
Observations:
We assume that the underlying model is linear.
Statistics of x and y do not play any role (it seems...).
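
A minimal NumPy sketch of the solution above (the synthetic data, beta_true and the seed are illustrative, not from the slides): it solves the normal equations for β̂ and then predicts ŷ = x^T β̂ for a new input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: N observations of an m-dimensional input.
N, m = 200, 3
X = rng.normal(size=(N, m))
beta_true = np.array([1.5, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal equations: beta_hat = (X^T X)^{-1} X^T Y.
# np.linalg.solve is preferred to forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Prediction for a new input x.
x_new = rng.normal(size=m)
y_hat = x_new @ beta_hat
print(beta_hat, y_hat)
```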

9 Least Squares, p > 1
Linear model
Y = XB + E
where X ∈ R^{N×m}, Y ∈ R^{N×p}, E ∈ R^{N×p} and B ∈ R^{m×p}.
The RSS takes the form
RSS(B) := trace{(Y − XB)^T (Y − XB)}
and the least squares estimate of B is written in the same way
B̂ = (X^T X)^{-1} X^T Y
Multiple outputs do not affect one another's least squares estimates.
If the components of the vector r.v. e are correlated, i.e. e ∼ N(0, Σ), then we can define a weighted RSS
RSS(B, Σ) := Σ_{i=1}^N (Y_i − X_i B)^T Σ^{-1} (Y_i − X_i B)

10 Geometric interpretation
The normal equations
X^T (Y − Xβ) = 0
mean that the estimate
Ŷ = X β̂ = X (X^T X)^{-1} X^T Y
is the orthogonal projection of Y onto the subspace spanned by the columns of X.

11 Statistical interpretation
We now consider the r.v. x and y as input and output, respectively, and we seek a function f(x) for predicting y. The criterion should now deal with stochastic quantities: we introduce the expected squared prediction error EPE (strictly related to the mean squared error MSE)
EPE(f) := E[(y − f(x))^T (y − f(x))] = ∫_{S_x, S_y} (y − f(x))^T (y − f(x)) p(x, y) dx dy
where we implicitly assumed that x and y have a joint PDF. EPE(f) is an L_2 loss function.
Conditioning on x we can rewrite EPE(f) as
EPE(f) := E_x[ E_{y|x}[ (y − f(x))^T (y − f(x)) | x ] ]

12 Statistical interpretation
We can determine f(·) pointwise, which means that
f(x) = arg min_c E_{y|x}[ (y − c)^T (y − c) | x = x ]
⇒ f(x) = E[y | x = x]
i.e. the best f(x) is the conditional mean (according to the EPE criterion).
Beautiful, but given the data X, Y how can we compute the conditional expectation?!?

13 Statistical interpretation
Let us assume again f(x) = x^T β; then
EPE(f) := E[(y − x^T β)^T (y − x^T β)]
Differentiating w.r.t. β we end up with
β = (E[x x^T])^{-1} E[x y]
Computing the auto- and cross-correlations from the data (i.e. using real numbers!):
E[x x^T] ≈ S_xx := (1/N) Σ_{i=1}^N X_i^T X_i = (1/N) X^T X
E[x y] ≈ S_xy := (1/N) Σ_{i=1}^N X_i^T Y_i = (1/N) X^T Y

14 Statistical interpretation
Then we get
β̂ = ((1/N) X^T X)^{-1} (1/N) X^T Y = (X^T X)^{-1} X^T Y
Again the normal equations!!! But now we can provide a statistical interpretation of β̂.
Let y = x^T β + e, e ∼ N(0, σ^2), be our model (p = 1); then β̂ is a Gaussian variable
β̂ ∼ N(β, (X^T X)^{-1} σ^2)
In fact, writing Y = Xβ + e with e the vector of noise samples,
β̂ = (X^T X)^{-1} X^T Y = β + (X^T X)^{-1} X^T e

15 Gauss-Markov theorem
Given the linear model y = x^T β, Y = Xβ, the least squares estimator φ̂(x_0) = x_0^T β̂ of φ(x_0) = x_0^T β is unbiased because E[x_0^T β̂] = x_0^T β.
Theorem. If φ̃(x_0) is any other unbiased estimator (E[φ̃(x_0)] = x_0^T β) then
Var(φ̂(x_0)) ≤ Var(φ̃(x_0))
Remark. Mean square error of a generic estimator φ̃ (p = 1):
MSE(φ̃) = E[(φ̃ − φ)^2] (*) = Var(φ̃) + (E[φ̃] − φ)^2
(variance plus squared bias)
(*) = sum and subtract E[φ̃].

16 Gauss-Markov theorem
Given the stochastic linear model y = x^T β + e, e ∼ N(0, σ^2), let φ̃(x_0) be an estimator of y_0 = φ(x_0) + e_0, with φ(x_0) = x_0^T β. The expected prediction error (EPE) of φ̃(x_0) is
EPE(φ̃(x_0)) = E[(y_0 − φ̃(x_0))^2]
= σ^2 + E[(x_0^T β − φ̃(x_0))^2]
= σ^2 + Var(φ̃) + (E[φ̃] − φ)^2
where the last two terms are the MSE.

17 Bias-variance trade-off
underfitting VS overfitting (figure)

18 Statistical models
Statistical model:
y = f(x) + e
where e is a random error with zero mean (E[e] = 0) and is independent of x. This means that the relationship between y and x is not deterministic (f(·)).
The additive r.v. e takes care of measurement noise, model uncertainty and non-measured variables correlated with y as well.
We often assume that the random variables e are independent and identically distributed (i.i.d.).

19 Statistical models
Assuming a linear basis expansion for f_θ(x), parametrized by the unknowns collected within the vector θ,
f_θ(x) = Σ_{k=1}^K h_k(x) θ_k
where examples of h_k(x) can be
h_k(x) = x_k
h_k(x) = (x_k)^2
h_k(x) = sin(x_k)
h_k(x) = e^{x^T β_k}
The optimization problem to solve is
θ̂ = arg min_{θ ∈ Θ} RSS(θ), RSS(θ) = Σ_{i=1}^N (y_i − f_θ(x_i))^2
where RSS stands for Residual Sum of Squares.
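
As an illustration of the linear basis expansion, here is a small sketch (the particular basis functions, data and seed are illustrative choices, not taken from the slides): the expanded feature matrix has columns h_k(x_i), and the θ̂ minimizing RSS(θ) is again an ordinary least squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar input, nonlinear target observed with additive noise.
x = np.linspace(0.0, 3.0, 100)
y = np.sin(2.0 * x) + 0.3 * x**2 + 0.1 * rng.normal(size=x.size)

# Linear basis expansion: h_1(x)=1, h_2(x)=x, h_3(x)=x^2, h_4(x)=sin(2x).
H = np.column_stack([np.ones_like(x), x, x**2, np.sin(2.0 * x)])

# RSS(theta) is minimized by least squares in the expanded features.
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta_hat)
```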

20 Statistical models
Are there other kinds of criteria besides RSS and EPE? YES. A more general principle for estimation is maximum likelihood estimation.
Let p_θ(y) be the PDF of the samples y_1, ..., y_N. The log-probability (or log-likelihood) of the observed samples is
L(θ) = Σ_{i=1}^N log p_θ(y_i)
Principle of maximum likelihood: the most reasonable values for θ are those for which the probability of the observed samples is largest.

21 Statistical models
If the error e in the statistical model
y = f_θ(x) + e
is Gaussian, e ∼ N(0, σ^2), then the conditional probability is
p(y | x, θ) ∼ N(f_θ(x), σ^2)
The log-likelihood of the data is then
L(θ) = Σ_{i=1}^N log p(y_i | x_i, θ) = −(N/2) log(2π) − N log σ − (1/(2σ^2)) Σ_{i=1}^N (y_i − f_θ(x_i))^2
Least squares for the additive error model is equivalent to maximum likelihood using this conditional probability (the last sum is the RSS(θ)).

22 Penalty function and Regularization methods
Penalty function, or regularization, methods introduce our knowledge about the type of functions f(x) we are looking for:
PRSS(f, λ) := RSS(f) + λ g(f)
where the functional g(f) will enforce our knowledge (or desiderata) on f.
Example. The one-dimensional cubic smoothing spline is the solution of
PRSS(f, λ) := Σ_{i=1}^N (y_i − f(x_i))^2 + λ ∫ [f''(s)]^2 ds
Remark. Penalty function methods have a Bayesian interpretation:
g(f) is the log-prior distribution
PRSS(f, λ) is the log-posterior distribution
the solution of arg min_f PRSS(f, λ) is the posterior mode

23 Kernel Methods and Local Regression
If we want a local regression estimate of f(x_0), we have to solve the problem
θ̂ = arg min_θ RSS(f_θ, x_0) = Σ_{i=1}^N K_λ(x_0, x_i) (y_i − f_θ(x_i))^2
where the kernel function K_λ(x_0, x) weights the points x around x_0. The optimal estimate is f_θ̂(x_0).
An example of kernel function is the Gaussian kernel
K_λ(x_0, x) = (1/λ) exp( −||x − x_0||^2 / (2λ) )
Examples of f_θ(x) are
f_θ(x) = θ_0, constant function
f_θ(x) = θ_0 + θ_1 x, linear regression
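
A sketch of local linear regression with the Gaussian kernel above, assuming a scalar input (the function names, data and bandwidth are made up for illustration): each query point x_0 gets its own weighted least squares fit.

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """K_lambda(x0, x) = exp(-|x - x0|^2 / (2*lam)) / lam for scalar inputs."""
    return np.exp(-np.abs(x - x0) ** 2 / (2.0 * lam)) / lam

def local_linear_fit(x0, x, y, lam):
    """Minimize the kernel-weighted RSS with f_theta(x) = theta0 + theta1*x around x0."""
    w = gaussian_kernel(x0, x, lam)
    H = np.column_stack([np.ones_like(x), x])          # local linear model
    W = np.diag(w)
    theta = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)  # weighted normal equations
    return theta[0] + theta[1] * x0                    # estimate of f(x0)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 3, 150))
y = np.sin(2 * x) + 0.2 * rng.normal(size=x.size)
print(local_linear_fit(1.0, x, y, lam=0.05))
```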

24 Basis functions
The function f can be approximated using a set of M basis functions h_m:
f_θ(x) = Σ_{m=1}^M θ_m h_m(x), θ = [θ_1 ... θ_M]
Examples of basis functions:
Radial basis functions:
f_θ(x) = Σ_{m=1}^M θ_m K_{λ_m}(µ_m, x), K_λ(µ, x) = e^{−||x − µ||^2 / (2λ)}
Single-layer feed-forward neural network:
f_θ(x) = Σ_{m=1}^M θ_m σ(α_m^T x + b_m), σ(x) = 1 / (1 + e^{−x})
Remark. Linear methods can then be used with nonlinear input-output transformations because the model is linear in the parameters θ.

25 Subset selection
The least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.

26 Ridge Regression
Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The coefficients β̂_ridge are obtained by solving the minimization problem
β̂_ridge = arg min_β { Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) + λ Σ_{i=1}^m β_i^2 }
(the first term is RSS(β), the second is the penalty g(β) = β^T β) with λ ≥ 0, or the equivalent constrained problem
β̂_ridge = arg min_β Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β)  s.to  Σ_{i=1}^m β_i^2 ≤ t
The solution is
β̂_ridge = (X^T X + λI)^{-1} X^T Y
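
A short sketch of the closed-form ridge solution (the data and the λ values are illustrative, and no intercept is treated separately here): as λ grows, the norm of β̂_ridge shrinks.

```python
import numpy as np

def ridge(X, Y, lam):
    """beta_ridge = (X^T X + lam*I)^{-1} X^T Y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
Y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=50)
for lam in (0.0, 1.0, 10.0):
    print(lam, np.linalg.norm(ridge(X, Y, lam)))  # the norm of beta shrinks as lam grows
```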

27 Lasso
The coefficients β̂_lasso are obtained by solving the minimization problem
β̂_lasso = arg min_β { Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) + λ Σ_{i=1}^m |β_i| }
(RSS(β) plus the penalty g(β)) with λ ≥ 0, or the equivalent constrained problem
β̂_lasso = arg min_β Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β)  s.to  Σ_{i=1}^m |β_i| ≤ t
There is no closed-form expression for β̂_lasso.
Remark 1. Ridge regression uses an L_2 norm on β, whereas the lasso uses the L_1 norm. This means that the lasso solution is nonlinear in the data.
Remark 2. Decreasing t forces some of the coefficients to be set to zero (exactly).
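
Since there is no closed form, the lasso is usually solved numerically. The sketch below uses cyclic coordinate descent with soft-thresholding on the objective 0.5·||Y − Xβ||^2 + λ Σ|β_i| (a standard algorithm chosen for illustration, not one prescribed by the slides; data and λ are synthetic).

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||Y - X b||^2 + lam*||b||_1."""
    N, m = X.shape
    beta = np.zeros(m)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(m):
            r_j = Y - X @ beta + X[:, j] * beta[j]   # partial residual excluding coordinate j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
Y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=100)
print(lasso_cd(X, Y, lam=5.0))   # most coefficients end up exactly zero
```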

28 Ridge regression VS Lasso
(figure) Lasso: constraint region |β_1| + |β_2| ≤ t. Ridge: constraint region β_1^2 + β_2^2 ≤ t^2.
The red ellipses are the contours of the least squares error function.

29 From r.v. to stochastic processes
Definition (random variable). A random variable x: Ω → E is a measurable function from the set of possible outcomes Ω to some set E. Ω is a probability space and E is a measurable space.
Roughly speaking: a random variable x is a rule for assigning to every outcome ω of an experiment a number x(ω).
Definition (stochastic process). Given a probability space (Ω, F, P) and a measurable space (S, Σ), an S-valued stochastic process is a collection of S-valued random variables on Ω, indexed by a totally ordered set T ("time"). That is, a stochastic process is a collection {x_t : t ∈ T} where each x_t is an S-valued random variable on Ω. The space S is then called the state space of the process.
Roughly speaking: a stochastic process x_t is a rule for assigning to every outcome ω a function x(t, ω).

30 From r.v. to stochastic processes
{x_t} has the following interpretations:
It is a family of functions x_t(ω) when t and ω are variables.
It is a single time function (or a realization of the given process) x_t(ω̄) when t is a variable and ω = ω̄ is fixed.
It is a random variable x_t̄(ω) if t = t̄ is fixed and ω is variable, i.e. the state of the process at time t̄.
It is a number if t and ω are fixed.
If T = R, {x_t} is a continuous-time process; if T = Z, {x_k} is a discrete-time process.
Even though the dynamics of the system is described by ODEs, in the following we will consider discrete-time processes because the sensing system provides measurements at discrete moments.
Remark. The r.v. x_k(ω) can be continuous even if k ∈ T = Z.

31 Gaussian filter

32 State estimation
Given the measurements y_k, k = 0, 1, ..., related to an unknown state variable x_k, we are interested in the following estimation problems:
- Filtering: y_0, y_1, ..., y_k → x̂_{k|k}
- h-step ahead Prediction: y_0, y_1, ..., y_k → x̂_{k+h|k}
- h-step backward Smoothing: y_0, y_1, ..., y_k → x̂_{k−h|k}
- Smoothing: y_0, y_1, ..., y_N → x̂_{k|N}

33 Gaussian filters
Gaussian filters assume that the underlying phenomena can be modeled by Gaussian distributions. This assumption allows us to solve the general Bayes filter formulation in a recursive way.
Why are Gaussian distributions so good?
Gaussians are unimodal: they have a single maximum.
The statistics (mean, variance and higher order moments) of a Gaussian are described by two parameters: its mean and variance.
The linear combination of Gaussians is still Gaussian.

34 Gaussian Random Variables
Definition (Gaussian r.v.). An n-dimensional random variable X is Gaussian with mean µ ∈ R^n and variance Σ ∈ R^{n×n}, Σ = Σ^T > 0, written X ∼ N(µ, Σ), if its probability density function (PDF) is given by
p(x) = 1 / sqrt((2π)^n det Σ) · e^{−(1/2)(x − µ)^T Σ^{-1} (x − µ)}
This means that
µ = E[X]
Σ = Var(X) = E[(X − E[X])(X − E[X])^T]
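
A direct transcription of the PDF above into NumPy (the test point, mean and covariance below are arbitrary illustrative values):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at x, following the formula above."""
    n = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```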

35 Gaussian r.v.
Theorem (Joint Gaussian r.v.). Let x ∈ R^n and y ∈ R^m be jointly Gaussian,
p(x, y) ∼ N( [µ_x; µ_y], [Σ_xx, Σ_xy; Σ_yx, Σ_yy] )
Then the r.v. z = Ax + By is still Gaussian, i.e. z ∼ N(µ_z, Σ_z), where
µ_z = E[Ax + By] = A µ_x + B µ_y
Σ_z = E[(Ax + By − Aµ_x − Bµ_y)(Ax + By − Aµ_x − Bµ_y)^T]
= [A B] E[ ([x; y] − [µ_x; µ_y]) ([x; y] − [µ_x; µ_y])^T ] [A B]^T
= [A B] [Σ_xx, Σ_xy; Σ_yx, Σ_yy] [A^T; B^T]

36 Gaussian r.v.
Theorem (...). The Gaussian random variable x conditioned on the Gaussian random variable y is still a Gaussian random variable. The PDF of x given y is
p(x | y) = N(µ_{x|y}, Σ_{x|y})   (1)
where
µ_{x|y} = µ_x + Σ_xy Σ_yy^{-1} (y − µ_y)   (2)
Σ_{x|y} = Σ_xx − Σ_xy Σ_yy^{-1} Σ_yx   (3)
if Σ_yy > 0.
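
The conditioning formulas (2)-(3) in code, assuming the joint mean and covariance blocks are given (all numerical values below are illustrative):

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, y):
    """Mean and covariance of x | y for jointly Gaussian (x, y), per (2)-(3)."""
    K = Sxy @ np.linalg.inv(Syy)          # Sigma_xy Sigma_yy^{-1}
    mu_cond = mu_x + K @ (y - mu_y)
    Sigma_cond = Sxx - K @ Sxy.T          # Sigma_xx - Sigma_xy Sigma_yy^{-1} Sigma_yx
    return mu_cond, Sigma_cond

mu_x = np.array([0.0, 0.0])
mu_y = np.array([1.0])
Sxx = np.array([[1.0, 0.2], [0.2, 1.0]])
Sxy = np.array([[0.5], [0.1]])
Syy = np.array([[2.0]])
print(condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, y=np.array([2.0])))
```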

37 Optimal Estimator
Theorem (Minimum Variance Estimator). Let x ∈ R^n, y ∈ R^m be two r.v. (not necessarily Gaussian), and g: R^m → R^n a measurable function. We define x̂_g = g(y) as the estimator of x given y through the function g, and e_g = x − g(y) = x − x̂_g the corresponding estimation error.
The estimator x̂ = E[x | y] = ĝ(y) is optimal because it minimizes the error variance, i.e.
Var(e) = E[(x − x̂)(x − x̂)^T] ≤ E[(x − x̂_g)(x − x̂_g)^T] = Var(e_g), for every g(·)
where e = x − x̂ is the error of the optimal estimator.
The error of the optimal estimator and the estimate are uncorrelated:
E[e ĝ(y)^T] = 0

38 Stochastic model

39 LTI Stochastic model
We focus now on the state-space representation of a generic Linear Time-Invariant (LTI) stochastic model:
x_{k+1} = A x_k + w_k
y_k = C x_k + v_k
where:
v_k ∼ N(0, R), E[v_k v_h^T] = R δ(k − h)
w_k ∼ N(0, Q), E[w_k w_h^T] = Q δ(k − h)
x_0 ∼ N(x̄_0, P_0)
and v_k, w_k, x_0 are mutually uncorrelated Gaussian r.v.:
E[v_k w_h^T] = 0, E[x_0 v_k^T] = 0, E[x_0 w_k^T] = 0
The state-space model is a way to describe the dynamical evolution of a stochastic process.

40 LTI Stochastic model
From the evolution of the state and of the output we also have
x_k = A^{k−k_0} x_{k_0} + Σ_{i=k_0}^{k−1} A^{k−i−1} w_i
y_k = C A^{k−k_0} x_{k_0} + Σ_{i=k_0}^{k−1} C A^{k−i−1} w_i + v_k
so that
E[x_k w_h^T] = 0 for h ≥ k
E[x_k v_h^T] = 0
E[y_k v_h^T] = R δ(k − h)
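
A small simulation sketch of the LTI stochastic model x_{k+1} = A x_k + w_k, y_k = C x_k + v_k (the particular A, C, Q, R, initial statistics and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# A scalar-output, 2-state example.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)       # Var(w_k)
R = np.array([[0.1]])      # Var(v_k)
x0_bar, P0 = np.zeros(2), np.eye(2)

n_steps = 100
x = rng.multivariate_normal(x0_bar, P0)
xs, ys = [], []
for _ in range(n_steps):
    y = C @ x + rng.multivariate_normal(np.zeros(1), R)   # y_k = C x_k + v_k
    xs.append(x)
    ys.append(y)
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)   # x_{k+1} = A x_k + w_k

xs, ys = np.array(xs), np.array(ys)
print(xs.shape, ys.shape)
```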

41 Kalman filtering

42 Kalman filter
The Kalman filter (or minimum variance filter) is defined as
x̂_{k+1|k+1} = E[x_{k+1} | y_0, ..., y_{k+1}] = E[x_{k+1} | y_{k+1}, Y^k]   (4)
where Y^k = (y_k, ..., y_1, y_0).
Goal: we need a recursive expression for x̂_{k+1|k+1} without recomputing
µ_{x|y} = µ_x + Σ_xy Σ_yy^{-1} (y − µ_y)   (5)
from scratch at every time instant k, i.e. whenever a new measurement is available.
The explicit expression for E[X | Y] is easy to derive from (5) if X and Y are jointly Gaussian with means µ_X, µ_Y and variances Σ_XX, Σ_XY, Σ_YX, Σ_YY.

43 Kalman filter
To rewrite x̂_{k+1|k+1} = E[x_{k+1} | y_0, ..., y_{k+1}] = E[x_{k+1} | y_{k+1}, Y^k] in the form E[X | Y] we introduce the following conditional random variables
X = x_{k+1} | Y^k
Y = y_{k+1} | Y^k
and compute the following means, variances and covariances:
µ_X = E[x_{k+1} | Y^k]
µ_Y = E[y_{k+1} | Y^k]
Σ_XX = Var[x_{k+1} | Y^k] = P_{k+1|k}
Σ_YY = Var[y_{k+1} | Y^k]
Σ_XY = Σ_YX^T = Cov[x_{k+1}, y_{k+1} | Y^k]

44 Kalman filter
The optimal estimate is given by
E[X | Y] = x̂_{k+1|k+1} = x̂_{k+1|k} + Σ_XY Σ_YY^{-1} (y_{k+1} − ŷ_{k+1|k})   (6)
and the variance of the estimation error is
Σ_{X|Y} = P_{k+1|k+1} = Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX   (7)

45 Conditional means
Mean µ_X:
µ_X = E[x_{k+1} | Y^k] = E[A x_k + w_k | Y^k] = A E[x_k | Y^k] + E[w_k | Y^k] = A x̂_{k|k} = x̂_{k+1|k}
Mean µ_Y:
µ_Y = E[y_{k+1} | Y^k] = E[C x_{k+1} + v_{k+1} | Y^k] = C E[x_{k+1} | Y^k] + E[v_{k+1} | Y^k] = C x̂_{k+1|k} = C A x̂_{k|k}

46 Conditional variances
Variance Σ_XX:
Σ_XX = Var[x_{k+1} | Y^k]
= E[(x_{k+1} − x̂_{k+1|k})(x_{k+1} − x̂_{k+1|k})^T | Y^k]
= E[(A x_k + w_k − A x̂_{k|k})(A x_k + w_k − A x̂_{k|k})^T | Y^k]
= A E[(x_k − x̂_{k|k})(x_k − x̂_{k|k})^T | Y^k] A^T + A E[(x_k − x̂_{k|k}) w_k^T | Y^k] + E[w_k (x_k − x̂_{k|k})^T | Y^k] A^T + E[w_k w_k^T | Y^k]
= A P_{k|k} A^T + Q = P_{k+1|k}

47 Conditional variances
Variance Σ_YY:
Σ_YY = Var[y_{k+1} | Y^k]
= E[(y_{k+1} − ŷ_{k+1|k})(y_{k+1} − ŷ_{k+1|k})^T | Y^k]
= E[(C x_{k+1} + v_{k+1} − C x̂_{k+1|k})(C x_{k+1} + v_{k+1} − C x̂_{k+1|k})^T | Y^k]
= C E[(x_{k+1} − x̂_{k+1|k})(x_{k+1} − x̂_{k+1|k})^T | Y^k] C^T + C E[(x_{k+1} − x̂_{k+1|k}) v_{k+1}^T | Y^k] + E[v_{k+1} (x_{k+1} − x̂_{k+1|k})^T | Y^k] C^T + E[v_{k+1} v_{k+1}^T | Y^k]
= C P_{k+1|k} C^T + R

48 Conditional variances
Covariance Σ_XY = Σ_YX^T:
Σ_XY = Cov[x_{k+1}, y_{k+1} | Y^k]
= E[(x_{k+1} − x̂_{k+1|k})(y_{k+1} − ŷ_{k+1|k})^T | Y^k]
= E[(A x_k − A x̂_{k|k} + w_k)(C A x_k − C A x̂_{k|k} + v_{k+1} + C w_k)^T | Y^k]
= A E[(x_k − x̂_{k|k})(x_k − x̂_{k|k})^T | Y^k] A^T C^T + E[w_k w_k^T | Y^k] C^T
= A P_{k|k} A^T C^T + Q C^T = P_{k+1|k} C^T

49 Summing up
The random variable z = [x_{k+1}; y_{k+1}] conditioned on Y^k has the following PDF:
p(z | Y^k) ∼ N( [x̂_{k+1|k}; C x̂_{k+1|k}], [P_{k+1|k}, P_{k+1|k} C^T; C P_{k+1|k}, C P_{k+1|k} C^T + R] )
with:
x̂_{k+1|k} = A x̂_{k|k}
P_{k+1|k} = A P_{k|k} A^T + Q

50 Recursive formulation
The last step is to compute
p(x_{k+1} | Y^{k+1}) ∼ N(x̂_{k+1|k+1}, P_{k+1|k+1})
where the mean x̂_{k+1|k+1} is the optimal estimate we are looking for and P_{k+1|k+1} is the variance of the corresponding estimation error. Substituting the previous expressions we end up with
x̂_{k+1|k+1} = x̂_{k+1|k} + P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1} (y_{k+1} − C x̂_{k+1|k})
The Kalman gain is the matrix
K_{k+1} = P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1}
mapping the output estimation error into the correction of the predicted state:
x̂_{k+1|k+1} = x̂_{k+1|k} + K_{k+1} (y_{k+1} − C x̂_{k+1|k})
The variance of the estimation error is
P_{k+1|k+1} = P_{k+1|k} − P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1} C P_{k+1|k}

51 Recursive formulation
Prediction step / a priori estimate:
x̂_{k+1|k} = A x̂_{k|k}
P_{k+1|k} = A P_{k|k} A^T + Q
Estimation step / a posteriori estimate:
x̂_{k+1|k+1} = x̂_{k+1|k} + K_{k+1} (y_{k+1} − C x̂_{k+1|k})
P_{k+1|k+1} = P_{k+1|k} − P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1} C P_{k+1|k}
Initial conditions:
x̂_{0|−1} = x̄_0
P_{0|−1} = P_0
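
The prediction/estimation recursion above, written as a plain NumPy routine (a sketch: the function name and the example matrices are mine, not from the slides).

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, x0_bar, P0):
    """Run the prediction / estimation steps of the slides over y_0, ..., y_N."""
    x_pred, P_pred = x0_bar, P0                      # x_hat_{0|-1}, P_{0|-1}
    x_filt, P_filt = [], []
    for y in ys:
        # Estimation step (a posteriori): correct the prediction with y_k.
        S = C @ P_pred @ C.T + R
        K = P_pred @ C.T @ np.linalg.inv(S)          # Kalman gain
        x_upd = x_pred + K @ (y - C @ x_pred)
        P_upd = P_pred - K @ C @ P_pred
        x_filt.append(x_upd)
        P_filt.append(P_upd)
        # Prediction step (a priori) for the next time instant.
        x_pred = A @ x_upd
        P_pred = A @ P_upd @ A.T + Q
    return np.array(x_filt), np.array(P_filt)

# Example usage with a 2-state model (matrices and data are illustrative).
A = np.array([[1.0, 0.1], [0.0, 0.9]]); C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2); R = np.array([[0.1]])
ys = [np.array([0.2]), np.array([0.15]), np.array([0.1])]
xf, Pf = kalman_filter(ys, A, C, Q, R, np.zeros(2), np.eye(2))
print(xf)
```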

52 Kalman filter/predictor
Kalman filter:
x̂_{k+1|k+1} = A x̂_{k|k} + K_{k+1} (y_{k+1} − C A x̂_{k|k})
P_{k+1|k} = A P_{k|k−1} A^T − A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1} C P_{k|k−1} A^T + Q
with K_{k+1} = P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1}.
The matrix recursive equation P_{k+1|k} = ... is called the Riccati equation.
Kalman predictor:
x̂_{k+1|k} = A x̂_{k|k−1} + K_k (y_k − C x̂_{k|k−1})
P_{k+1|k} = A P_{k|k−1} A^T − A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1} C P_{k|k−1} A^T + Q
with K_k = A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1}.

53 Kalman filter/predictor
Observations:
1. All the information in y(i), i ∈ [0, k−1], is contained in the estimated state x̂_{k−1|k−1}: the following conditional expectations are equal
E[x_k | y_k, y_{k−1}, ..., y_0] = E[x_k | x̂_{k−1|k−1}, y_k]
2. The optimal gain K_k is time-varying even if the stochastic model is LTI.
3. There is a more general formulation of the Kalman filter where w_k and v_k are correlated.
4. The same recursive equations of the Kalman filter can be used with linear time-varying stochastic systems.

54 Steady-state Kalman filter
What happens when k → ∞? Does the estimation error converge to zero with minimal variance?
Theorem. Given the stochastic LTI model
x_{k+1} = A x_k + w_k
y_k = C x_k + v_k   (8)
with
v_k ∼ N(0, R), E[v_k v_h^T] = R δ(k − h)
w_k ∼ N(0, Q), E[w_k w_h^T] = Q δ(k − h)
x_0 ∼ N(x̄_0, P_0)   (9)
where v_k, w_k, x_0 are mutually uncorrelated Gaussian random variables.

55 Steady-state Kalman filter
Theorem (...). Then:
1. the Algebraic Riccati Equation (ARE)
P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q
has a unique positive definite symmetric matrix solution P̄ = P̄^T > 0;
2. P̄ is stabilizing, i.e. (A − K̄ C) is asymptotically stable, with K̄ = P̄ C^T (C P̄ C^T + R)^{-1};
3. lim_{k→∞} P_{k|k−1} = P̄ holds for all initial conditions P_{0|−1} = P_0 = P_0^T ≥ 0;
if and only if
(A, C) is detectable, and
(A, Q^{1/2}) is stabilizable.

56 Steady-state Kalman filter
Kalman filter (LTI):
x̂_{k+1|k+1} = A x̂_{k|k} + K̄ (y_{k+1} − C A x̂_{k|k})
P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q
with K̄ = P̄ C^T (C P̄ C^T + R)^{-1}.
Kalman predictor (LTI):
x̂_{k+1|k} = A x̂_{k|k−1} + K̄ (y_k − C x̂_{k|k−1})
P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q
with K̄ = A P̄ C^T (C P̄ C^T + R)^{-1}.
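
One simple way to obtain P̄ and the steady-state predictor gain is to iterate the Riccati recursion until it converges to the ARE solution, which is guaranteed under the detectability/stabilizability conditions of the theorem. A sketch with an arbitrary test model (function name and matrices are illustrative):

```python
import numpy as np

def steady_state_gain(A, C, Q, R, n_iter=500, tol=1e-10):
    """Iterate the Riccati recursion for P_{k+1|k} until it converges to the ARE solution."""
    P = np.eye(A.shape[0])
    for _ in range(n_iter):
        S = C @ P @ C.T + R
        P_next = A @ P @ A.T - A @ P @ C.T @ np.linalg.inv(S) @ C @ P @ A.T + Q
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    K_pred = A @ P @ C.T @ np.linalg.inv(C @ P @ C.T + R)   # steady-state predictor gain
    return P, K_pred

A = np.array([[1.0, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
P_bar, K_bar = steady_state_gain(A, C, Q, R)
print(P_bar)
print(K_bar)
```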

57 Kalman smoother

58 Kalman Smoother
Model: {A, C, Q, R}
x_{k+1} = A x_k + w_k
y_k = C x_k + v_k
Data: a sequence of N samples of the output, y_0, y_1, ..., y_N.
STEP 1: standard Kalman filtering (forward step)
x̂^f_{k+1|k+1} = A x̂^f_{k|k} + K_{k+1} (y_{k+1} − C A x̂^f_{k|k}),  x̂^f_{0|0} = x̄_0
P^f_{k|k} = ...,  P^f_{0|0} = P_0

59 Kalman Smoother
STEP 2: smoothing (backward step)
x̂^s_{k|N} = x̂^f_{k|k} + K_k [x̂^s_{k+1|N} − x̂^f_{k+1|k}],  x̂^s_{N|N} = x̂^f_{N|N}
where the conditional covariance matrix P_{k|N} satisfies the time-backward matrix equation
P_{k|N} = P^f_{k|k} + K_k [P_{k+1|N} − P^f_{k+1|k}] K_k^T,  P_{N|N} = P^f_{N|N}
with
K_k = P^f_{k|k} A^T (P^f_{k+1|k})^{-1}
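
A sketch of the backward step in the Rauch-Tung-Striebel form, taking the forward filtered estimates x̂^f_{k|k}, P^f_{k|k} as input (the function name is mine; the forward quantities can come from a standard Kalman filter pass such as the one sketched earlier):

```python
import numpy as np

def rts_smoother(x_filt, P_filt, A, Q):
    """Backward pass of the Kalman smoother, starting from the forward (filtered) estimates."""
    N = len(x_filt)
    x_s = [None] * N
    P_s = [None] * N
    x_s[-1], P_s[-1] = x_filt[-1], P_filt[-1]           # x_hat^s_{N|N} = x_hat^f_{N|N}
    for k in range(N - 2, -1, -1):
        P_pred = A @ P_filt[k] @ A.T + Q                # P^f_{k+1|k}
        Kk = P_filt[k] @ A.T @ np.linalg.inv(P_pred)    # smoother gain K_k
        x_s[k] = x_filt[k] + Kk @ (x_s[k + 1] - A @ x_filt[k])
        P_s[k] = P_filt[k] + Kk @ (P_s[k + 1] - P_pred) @ Kk.T
    return x_s, P_s
```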

60 Problem: speed estimation
Given the measurement equation
y(t) = s(t) + v(t)
where
s(t) is the signal we are interested in (e.g. angular position),
y(t) is the measurement given by a sensor (e.g. an encoder),
v(t) is the additive measurement noise.
We can face different kinds of estimation problems:
- Filtering: determine the best estimate ŝ(t) of s(t) based on the measurements y(·) up to time t (i.e. y(0), y(1), ..., y(t))
- Prediction: determine the best estimate ŝ(t + h) of s(t + h), with h ≥ 1, based on the measurements y(·) up to time t (i.e. y(0), y(1), ..., y(t))
- Smoothing: determine the best estimate ŝ(t − h) of s(t − h), with h ≥ 1, based on the measurements y(·) up to time t (i.e. y(0), y(1), ..., y(t))

61 Problem: speed estimation
Let θ and ω be the angular position and velocity, respectively. Knowing nothing about the physical model that produces the signal s(t), we set the derivative of the velocity equal to a white noise.
A stochastic process n is called white noise if its values n(t_i) and n(t_j) are uncorrelated for every i ≠ j, i.e.
Corr{n(t_i), n(t_j)} = Q(t_i) δ(t_i − t_j)
We also assume that n(t) is Gaussian with zero mean and constant variance Q ∈ R for all t.
Kinematic model:
θ̇(t) = ω(t)
ω̇(t) = n(t)
Measurement equation:
y(t) = θ(t) + v(t)
where v is another white noise and v(t) is Gaussian with zero mean and constant variance R ∈ R.

62 Problem: speed estimation
Kinematic model:
θ̇(t) = ω(t)
ω̇(t) = n(t)
Measurement equation:
y(t) = θ(t) + v(t)
Remark. The process n takes into account the uncertainty on the model of the system, whereas v models the measurement noise superimposed on the real value θ.
Let x(t) be the state vector
x(t) := [θ(t); ω(t)]

63 Problem: speed estimation
The continuous-time state-space model of our basic system is
ẋ(t) = [0 1; 0 0] x(t) + [0; 1] n(t)
y(t) = [1 0] x(t) + v(t)
Its discrete-time approximation with sample time T_s is given by
x_{k+1} = [1 T_s; 0 1] x_k + [T_s^2/2; T_s] n_k
y_k = [1 0] x_k + v_k

64 Problem: speed estimation
The relationship between the two state-space models
x_{k+1} = [1 T_s; 0 1] x_k + [T_s^2/2; T_s] n_k,  y_k = [1 0] x_k + v_k
and
x_{k+1} = A x_k + w_k,  y_k = C x_k + v_k
is
A := [1 T_s; 0 1]
C := [1 0]
w_k := [T_s^2/2; T_s] n_k ∼ N( 0, [T_s^2/2; T_s] [T_s^2/2; T_s]^T Q )
v_k := v_k
Tuning of the filter. The variance R depends on the encoder resolution (we can read it on the datasheet), whereas the matrix Q is chosen by the designer to try to explain the measurements in the best way.
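
Putting the pieces together for the speed-estimation example: the sketch below builds A, C, Q, R from T_s and the tuning variances (all numerical values, the seed and the simulated trajectory are illustrative, not from the slides), simulates noisy encoder readings and runs the filter to recover ω.

```python
import numpy as np

Ts = 0.01     # sample time [s]
Q_n = 1.0     # variance of the white noise n_k (tuning knob)
R_v = 1e-6    # encoder noise variance (nominally from the datasheet; illustrative here)

A = np.array([[1.0, Ts],
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
G = np.array([[Ts**2 / 2.0],
              [Ts]])
Q = G @ G.T * Q_n          # Var(w_k) = G G^T Q_n
R = np.array([[R_v]])

rng = np.random.default_rng(6)
n_steps = 500
x = np.array([0.0, 1.0])                       # theta = 0 rad, omega = 1 rad/s
x_pred, P_pred = np.zeros(2), np.eye(2)
omega_hat = []
for _ in range(n_steps):
    y = C @ x + rng.normal(scale=np.sqrt(R_v), size=1)          # noisy encoder reading
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)      # Kalman gain
    x_upd = x_pred + K @ (y - C @ x_pred)
    P_upd = P_pred - K @ C @ P_pred
    omega_hat.append(x_upd[1])                                  # estimated angular velocity
    x_pred, P_pred = A @ x_upd, A @ P_upd @ A.T + Q             # prediction step
    x = A @ x + G @ rng.normal(scale=np.sqrt(Q_n), size=1)      # simulate the true state
print(omega_hat[-1])
```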

65 Model with inputs
If an input command u_k enters the stochastic model
x_{k+1} = A x_k + B u_k + w_k
y_k = C x_k + v_k
how do the filter equations change?
Fortunately, if u_k is a function of past measurements (e.g. u_k = f(y_{0:k})) then we can simply add the term B u_k in the recursive equations:
x̂_{k+1|k} = A x̂_{k|k−1} + B u_k + K_k (y_k − C x̂_{k|k−1})
P_{k+1|k} = A P_{k|k−1} A^T − A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1} C P_{k|k−1} A^T + Q
or, at steady state,
x̂_{k+1|k} = A x̂_{k|k−1} + B u_k + K̄ (y_k − C x̂_{k|k−1})
P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q
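
With a known input, the only change with respect to the earlier sketch is in the mean propagation of the prediction step; the covariance recursion is untouched. A minimal sketch (function name is mine):

```python
import numpy as np

def kf_predict_with_input(x_upd, P_upd, u, A, B, Q):
    """Prediction step when a known input enters the model x_{k+1} = A x_k + B u_k + w_k."""
    x_pred = A @ x_upd + B @ u      # only the mean propagation changes
    P_pred = A @ P_upd @ A.T + Q    # the covariance recursion is unchanged
    return x_pred, P_pred
```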

66 Important remark
In the book Probabilistic Robotics the Kalman equations are obtained following a different approach (using first and second derivatives of suitable quadratic functions). Another difference is that they start with the linear Gaussian system
x_k = A x_{k−1} + B u_k + w_k
which is the same as
x_{k+1} = A x_k + B u_{k+1} + w_{k+1}
Observations:
using w_{k+1} instead of w_k does not change anything because w is white noise;
on the other hand, it would make a big difference to use B u_{k+1} instead of B u_k as we did: for this reason the authors of the book introduce the assumption that the control input u is a random process independent of the state and the measurement.
