Statistical Filtering and Control for AI and Robotics. Part II. Linear methods for regression & Kalman filtering
|
|
- Julianna Beasley
- 5 years ago
- Views:
Transcription
1 Statistical Filtering and Control for AI and Robotics Part II. Linear methods for regression & Kalman filtering Riccardo Muradore 1 / 66
2 Outline Linear Methods for Regression Gaussian filter Stochastic model Kalman filtering Kalman smoother 2 / 66
3 References These lectures are based on the following books Sebastian Thrun, Wolfram Burgard and Dieter Fox, ProbabilisticRobotics, MIT Press, 2005 Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009 Several pictures from those books have been copied and pasted here 3 / 66
4 Linear Methods for Regression 4 / 66
5 probability Supervised learning: use the inputs (i.e. predictors, independent variables, features) to predict the values of the outputs (i.e. responses, dependent variables) This distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs. Notation: x R m random variable (x i R is its i-th component) x R m an observation of the random variable x R m (x i R is its i-th component) X R m N a collection of N observations (X T i R m is its i-th row) We will focus on the regression problem: this means that input and output vectors consist of qualitative measurements 5 / 66
6 Linear Models Input: x R m, x R m, X R N m Output: y R p, y R p, Y R N p Prediction: ŷ R p, ŷ R p, Ŷ Rp N Linear Model: ( from now on p = 1 ) where β R m Prediction y = f (x) = x T β ŷ = x T ˆβ where ˆβ R m is the matrix of coefficients that we have to determine Remark. If p = 1, the gradient f (x) = x f (x) = β is a vector pointing in the steepest uphill direction 6 / 66
7 Least Squares Let X R N m and Y R N a training set of data (collection of N pairs (x, y)) How to choice β? First of all we have to introduce an index as a function of β. Let RSS(β) be the residual sum of squares RSS(β) := We search for N (Y i X i β) T (Y i X i β) = (Y X β) T (Y X β) i=1 ˆβ := arg min RSS(β) β Computing the first and second derivative we get the normal equations β RSS(β) = 2X T (Y X β) 2 ββrss(β) = 2X T X 7 / 66
8 Least Squares If X T X is nonsingular (i.e. X has full column rank), the unique solution is given by the normal equations i.e. β RSS(β) = 0 X T (Y X β) = 0 ˆβ = (X T X ) 1 X T Y and the prediction of y given a new value x is ŷ = x T ˆβ Observations: We assume that the underlying model is linear Statistics of x and y do not play any role (it seems...) 8 / 66
9 Least Squares p > 1 Linear model Y = XB + E where X R N m, Y R N p, E R N p and B R m p The RSS takes the form RSS(B) := trace{(y XB) T (Y XB)} and the least square estimation of B is written in the same way ˆB = (X T X ) 1 X T Y Multiple outputs do not affect one another s least squares estimates If the component of the vector r.v e are correlated, i.e. e N (0, Σ), then we can define a weighted RSS RSS(B, Σ) := N (Y i X i B) T Σ 1 (Y i X i B) i=1 9 / 66
10 Geometric interpretation The normal equations X T (Y X β) = 0 means the estimation Ŷ = X ˆβ = X (X T X ) 1 X T Y is the orthogonal projection of Y into the subspace X 10 / 66
11 Statistical interpretation We now consider the r.v. x and y as input and output, respectively, and we seek a function f (x) for predicting y. The criterion should be now deal with stochastic quantities: we introduce the expected squared prediction error EPE (strictly related with the mean squared error MSE) EPE(f ) := E [ (y f (x)) T (y f (x)) ] = (y f (x)) T (y f (x))p(x, y)dxdy S x,s y where we implicitly assumed that x and y have a joint PDF. EPE(f ) is a L 2 loss function Conditioning on x we can re-write EPE(f ) as EPE(f ) := E x [ Ey x [ (y f (x)) T (y f (x)) x ] ] 11 / 66
12 Statistical interpretation We can determine f ( ) pointwise which means that f (x) = arg min c E y x [ (y c) T (y c) x = x ] f (x) = E [y x = x] i.e. the best f (x) is the conditional mean (according to the EPE criterion). Beautiful but, given the data X, Y how can we compute the conditional expectation?!? 12 / 66
13 Statistical interpretation Let us assume again f (x) = x T β then EPE(f ) := E [ (y x T β) T (y x T β) ] Differentiating w.r.t. β we end up with β = ( E[xx T ] ) 1 E[x T y] Computing the auto- and cross-correlation (i.e. using real numbers!) E[xx T ] N S xx := 1 N E[x T y] N S xy := 1 N N i=1 N i=1 Xi T X i = 1 N X T X X i Y T i = 1 N XY T 13 / 66
14 Statistical interpretation Then we get ˆβ = ( ) 1 1 N X T 1 X N XY T = ( X T X ) 1 XY T Again the normal equations!!! But now we can provide a statistical interpretation of ˆβ. Let y = x T β + e, e N (0, σ 2 ) be our model (p = 1), then ˆβ is a Gaussian variable ˆβ N (β, (X T X ) 1 σ 2 ) In fact, since ˆβ = ( X T X ) 1 X y ( X T X ) 1 X e ŷ = x T ˆβ + e 14 / 66
15 Gauss-Markov theorem Given the linear model y = x T β, Y = X β the least squares estimator ˆφ(x 0 ) = x0 T ˆβ of φ(x 0 ) = x0 T β is unbiased because E[x T 0 ˆβ] = x T 0 β Theorem If φ(x 0 ) is any other unbiased estimation (E[ φ(x 0 )] = x0 T β) then Var( ˆφ(x 0 )) Var( φ(x 0 )) Remark. Mean square error of a generic estimator φ (p = 1) MSE( φ) = E[( φ φ) 2 ] ( ) = Var( }{{ φ) + (E[ } φ] φ) 2 }{{} variance bias (*) = sum and subtract E[ φ]. 15 / 66
16 Gauss-Markov theorem Given the stochastic linear model y = x T β + e, e N (0, σ 2 ) and let φ(x 0 ) be the estimator for y 0 = φ(x 0 ) + e 0, φ(x 0 ) = x0 T β. The expected prediction error (EPE) of φ(x 0 ) is EPE( φ(x 0 )) = E[(y 0 φ(x 0 )) 2 ] = σ 2 + E[(x T 0 β φ(x 0 )) 2 ] = σ 2 + Var( φ) + (E[ φ] φ) 2 }{{} MSE 16 / 66
17 Bias-variance trade-off underfitting VS overfitting 17 / 66
18 Statistical models Statistical model: y = f (x) + e where y is a random error with zero mean (E[e] = 0) and is independent of x. This means that the relationship between y and x is not deterministic (f ( )) The additive r.v. e takes care of measurement noise, model uncertainty and non measured variables correlated with y as well We often assume that the random variables e are independent and identically distributed (i.i.d.) 18 / 66
19 Statistical models Assuming a linear basis expansion for f θ (x) parametrized by the unknowns collected within the vector θ K f θ (x) = h k (x)θ k where examples of h k (x) can be 1 h k (x) = x k h k (x) = (x k ) 2 h k (x) = sin(x k ) h k (x) = e x T β k The optimization problem to solve is ˆθ = arg min RSS(θ) = N (y i f θ (x i )) 2 θ Θ where RSS stands for Residual Sum of Squares 1 19 / 66
20 Statistical models Are there other kinds of criterion besides RSS, EPE? YES, A more general principle for estimation is maximum likelihood estimation Let p θ (y) be the PDF of the samples y 1,..., y N The log-probability (or log-likelihood) of the observed samples is L(θ) = N log p θ (y i ) 1 Principle of maximum likelihood: the most reasonable values for θ are those for which the probability of the observed samples is largest 20 / 66
21 Statistical models If the error e in the following statistical model y = f θ (x) + e is Gaussian, e N (0, σ 2 ), then the conditional probability is Then log-likelihood of the data is L(θ) = p(y x, θ) N (f θ (x), σ 2 ) N log p(y i f θ (x i ), θ) 1 = N 2 log(2π) N log σ 1 2σ 2 N i=1 (y i f θ (x i )) 2 Least squares for the additive error model is equivalent to maximum likelihood using the conditional probability (The yellow is the RSS(θ) ) 21 / 66
22 Penalty function and Regularization methods Penalty function, or regularization methods, introduces our knowledge about the type of functions f (x) we are looking for PRSS(f, λ) := RSS(f ) + λg(f ) where the functional g(f ) will force our knowledge (or desiderata) on f Example. One-dimension cubic smoothing spline is the solution of PRSS(f, λ) := N (y i f (x i )) 2 + λ [f (s)] 2 dx i=1 Remark. Penalty function methods have a Bayesian interpretation: g(f ) is the log-prior distribution PRSS(f, λ) is the log-posterior distribution the solution of arg min f PRSS(f, λ) is the posterior mode 22 / 66
23 Kernel Methods and Local Regression If we want a local regression estimation of f (x 0 ), we have to solve the problem ˆθ = arg min θ RSS(f θ, x 0 ) = N K λ (x 0, x i )(y i f θ (x i )) 2 i=1 where the kernel function K λ (x 0, x) weights the point x around x 0. The optimal estimation is fˆθ(x 0 ) An example of kernel function is the Gaussian kernel K λ (x 0, x) = 1 [ λ exp x x 0 2 ] Examples of f θ (x) are f θ (x) = θ 0, constant function f θ (x) = θ 0 + θ 1 x, linear regression 2λ 23 / 66
24 Basis functions The function f can be approximated using a set of M basis functions h m M f θ (x) = θ m h m (x) where θ = [θ 1 θ M ] Examples of basis functions: m=1 Radial basis functions: M f θ (x) = θ m K λm (µ m, x), K λ (µ, x) = e x µ 2 /2λ m=1 Single-layer feed-forward neural network f θ (x) = M θ m σ(αmx T 1 + b m ), σ(x) = 1 + e x m=1 Remark. Linear methods can then be used with nonlinear input-output transformation because the model is linear in the parameters θ 24 / 66
25 Subset selection The least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy. 25 / 66
26 Ridge Regression Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The coefficients ˆβ ridge are obtained solving the minimization problem ˆβ ridge = arg min β { N (Y i X i β) T (Y i X i β) +λ i=1 } {{ } RSS(β) with λ 0, or the equivalent constrained problem The solution is m i=1 β 2 i }{{} g(β)=β T β N ˆβ ridge = arg min β (Y i X i β) T (Y i X i β) s. to i=1 m i=1 β 2 i t ˆβ ridge = (X T X + λi ) 1 X T Y } 26 / 66
27 Lasso The coefficients ˆβ lasso are obtained solving the minimization problem N m ˆβ lasso = arg min{ (Y i X i β) T (Y i X i β) +λ β i } β i=1 } {{ } i=1 }{{} RSS(β) g(β) with λ 0, or the equivalent constrained problem N ˆβ lasso = arg min β (Y i X i β) T (Y i X i β) s. to i=1 m β i t i=1 The are no closed form expression for ˆβ lasso Remark 1. The Ridge Regression uses a L 2 norm on β, whereas Lasso the L 1 norm. This means that the solution is nonlinear in the data. Remark 1. Decreasing t forces some of the coefficients to be set to zero (exactly). 27 / 66
28 Ridge regression VS Lasso Lasso Ridge β 1 + β 2 t β β2 2 t2 The red ellipses are the contours of the least squares error function 28 / 66
29 From r.v. to stochastic processes Definition (random variable) A random variable x: Ω E is a measurable function from the set of possible outcomes Ω to some set E. Ω is a probability space and E is a measurable space. Roughly speaking: A random variable x is a rule for assigning to every outcome ω of an experiments a number x(ω) Definition (stochastic process) Given a probability space (Ω, F, P) and a measurable space (S, Σ), an S-valued stochastic process is a collection of S-valued random variables on Ω, indexed by a totally ordered set T ( time ). That is, a stochastic process is a collection {x t : t T } where each x t is an S-valued random variable on Ω. The space S is then called the state space of the process. Roughly speaking: A stochastic process x t is a rule for assigning to every outcome ω a function x(t, ω) 29 / 66
30 From r.v. to stochastic processes {x t } has the following interpretations: It is a family of functions x t (ω) when t and ω are variables. It is a single time function (or a realization of the given process) x t ( ω) when t is a variable and ω = ω is fixed. It is a random variable if t = t is fixed and ω is variable, i.e. x t(ω) state of the process at time t. It is a number if t and ω are fixed If T = R, {x t } is a continuous-time process If T = Z, {x k } is a discrete-time process Even though the dynamics of the system is described by ODE, in the following we will consider discrete-time processes because the sensing system provides measurements at discrete moments. Remark The r.v. x k(ω) can be continuous even if k T = Z 30 / 66
31 Gaussian filter 31 / 66
32 State estimation Given the measurement y k, k = 0, 1,... related to an unknown state variable x k, we are interested in the following estimation problems - Filtering - h-step ahead Prediction y 0, y 1,..., y k ˆx k k - h-step backward Smoothing y 0, y 1,..., y k ˆx k+h k - Smoothing y 0, y 1,..., y k ˆx k h k y 0, y 1,..., y N ˆx k N 32 / 66
33 Gaussian filters Gaussian filters assume that the undergoing phenomena can be modeled by Gaussian distributions. This assumption allows to solve in recursive way the general Bayes filters formulation Why are Gaussian distributions so good? Gaussians are unimodal: they have a single maximum The statistics (mean, variance and higher order moments) of a Gaussian are described by two parameters: its mean and variance The linear combination of Gaussians is still Gaussian 33 / 66
34 Gaussian Random Variables Definition (Gaussian r.v.) An n-dimensional random variable X is Gaussian with mean µ R n and variance Σ R n n, Σ = Σ T > 0, X N (µ, Σ), if its probability density function (PDF) is given by p(x) = 1 (2π)n det Σ e 1 2 (x µ)t Σ 1 (x µ). This means that µ = E[X ] Σ = Var(X ) = E[(X E[X ])(X E[X ]) T ]. 34 / 66
35 Gaussian r.v. Theorem (Joint Gaussian r.v.) Let x R n and y R m be ([ joint Gaussian ] [ ]) µx Σxx Σ p(x, y) N, xy µ y Σ yx Σ yy Then the r.v. z = Ax + By is still Gaussian, i.e. z N (µ z, Σ z ), where µ z = E[Ax + By] = Aµ x + Bµ y Σ z = E [(Ax ] + By Aµ x Bµ y ) (Ax + By Aµ x Bµ y ) T [ ( [ ]) ( [ ]) ] T x µx x µx = E [A B] [A B] y µ y y µ y [ [ ] [ ] ] T x µx x µx = [A B]E [A B] T y µ y y µ y [ ] [ ] Σxx Σ = [A B] xy A T Σ yx Σ yy B T 35 / 66
36 Gaussian r.v. Theorem (...) the Gaussian random variable x conditioned on the Gaussian random variable y is still a Gaussian random variable. The PDF of x given y is where if Σ yy > 0. p(x y) = N (µ x y, Σ x y ) (1) µ x y = µ x + Σ xy Σ 1 yy (y µ y ) (2) Σ x y = Σ xx Σ xy Σ 1 yy Σ yx (3) 36 / 66
37 Optimal Estimator Theorem (Minimum Variance Estimator) Let x R n, y R m be two r.v. (non necessariamente Gaussiane), and g : R m R n a measurable function. We define ˆx g = g(y) as the estimator of x given y through the function g, and e g = x g(y) = x ˆx g the corresponding estimation error. The estimator ˆx = E[x y] = ĝ(y) is optimal because it minimizes the error variance, i.e. Var(e) = E[(x ˆx)(x ˆx) T ] E[(x ˆx g )(x ˆx g ) T ] = Var(e g ), g( ) where e = x ˆx is the error of the optimal estimator. The error of the optimal estimator and the estimation are uncorrelated E[eĝ(y) T ] = / 66
38 Stochastic model 38 / 66
39 LTI Stochastic model We focus now on the state-space representation of a generic Linear Time-Invariant (LTI) stochastic model: { xk+1 = Ax k + w k where: y k = Cx k + v k v k N (0, R), E[v k vh T ] = Rδ(k h) w k N (0, Q), E[w k wh T ] = Qδ(k h) x 0 N ( x 0, P 0 ) and v k, w k, x 0 are uncorrelated zero-mean Gaussian r.v. E[v k w T h ] = 0 E[x 0 v T k ] = 0 E[x 0 w T k ] = 0 The state-space model is a way to describe the dynamical evolution of a stochastic process 39 / 66
40 LTI Stochastic model From the evolution of the state and of the output at time t we also have x k = A k k0 x 0 + y k = CA k k0 x 0 + k 1 i=k 0 A k i 1 w i k 1 i=k 0 CA k i 1 w i + v k E[x k w T h ] = 0, h k E[x k v T h ] = 0 E[y k v T h ] = Qδ(k h) 40 / 66
41 Kalman filtering 41 / 66
42 Kalman filter The Kalman filter (or minimum variance filter) is defined as: ˆx k+1 k+1 = E [x k+1 y 0,..., y k+1 ] = E [ x k+1 y k+1, Y k] (4) where Y k = (y k,..., y 1, y 0 ). Goal: we need a recursive expression for ˆx k+1 k+1 without using µ x y = µ x + Σ xy Σ 1 yy (y µ y ) (5) at any time instant k, i.e. when a new measurement is available. The explicit expression for E [X Y ] is easy to derive from [ (5) if X e ] Y ΣXX Σ are joint Gaussian with means µ X, µ Y and variances XY. Σ YX Σ YY 42 / 66
43 Kalman filter To rewrite ˆx k+1 k+1 = E [x k+1 y 0,..., y k+1 ] = E [ x k+1 y k+1, Y k] in the form E [X Y ] we introduce the following conditional random variables X = x k+1 Y k Y = y k+1 Y k and compute the following means, variances and covariances: µ X = E [ x k+1 Y k] µ Y = E [ y k+1 Y k] P k+1 k = Σ XX = Var [ x k+1 Y k] Σ YY = Var [ y k+1 Y k] Σ XY = Σ T YX = Cov [ x k+1, y k+1 Y k]. 43 / 66
44 Kalman filter The optimal estimator is given by: E [X Y ] = ˆx k+1 k+1 = ˆx k+1 k + Σ XY Σ 1 YY ( yk+1 ŷ k+1 k ) (6) and the variance of the estimation error is Σ X Y = P k+1 k+1 = Σ XX Σ XY Σ 1 YY Σ YX (7) 44 / 66
45 Conditional means Mean µ X µ X = E [ x k+1 Y k] = E [ Ax k + w k Y k] = AE [ x k Y k] + E [ w k Y k] = Aˆx k k = ˆx k+1 k Mean µ Y µ Y = E [ y k+1 Y k] = E [ Cx k+1 + v k+1 Y k] = CE [ x k+1 Y k] + E [ v k+1 Y k] = C ˆx k+1 k = CAˆx k k 45 / 66
46 Conditional variances Variance Σ XX Σ XX = Var [ x k+1 Y k] [ (xk+1 ) ( ) ] T = E ˆx k+1 k xk+1 ˆx k+1 k Y k [ (Axk ) ( ) ] T = E + w k Aˆx k k Axk + w k Aˆx k k Y k [ (xk ) ( ) ] T = AE ˆx k k xk ˆx k k Y k A T + +AE [( ) x k ˆx k k w T k Y k] + ( ) +E [w ] T k xk ˆx k k Y k A T + E [ w k wk T Y k] = AP k k A T + Q = P k+1 k 46 / 66
47 Conditional variances Variance Σ YY Σ YY = Var [ y k+1 Y k] [ (yk+1 ) ( ) ] T = E ŷ k+1 k yk+1 ŷ k+1 k Y k [ (Cxk+1 ) ( ) ] T = E + v k+1 C ˆx k+1 k Cxk+1 + v k+1 C ˆx k+1 k Y k [ (xk+1 ) ( ) ] T = CE ˆx k+1 k xk+1 ˆx k+1 k Y k C T + +CE [( ) x k+1 ˆx k+1 k v T k+1 Y k] + ( ) +E [v ] T k+1 xk+1 ˆx k+1 k Y k C T + E [ v k+1 vk+1 Y T k] = CP k+1 k C T + R 47 / 66
48 Conditional variances Covariance Σ XY = Σ T YX Σ XY = Cov [ x k+1, y k+1 Y k] [ (xk+1 ) ( ) ] T = E ˆx k+1 k yk+1 ŷ k+1 k Y k [ (Axk ) ( ) ] T = E Aˆx k k + w k CAxk CAˆx k k + v k+1 + Cw k Y k [ (xk ) ( ) ] T = AE ˆx k k xk ˆx k k Y k A T C T + E [ w k wk T Y k] C T = AP k k A T C T + QC T = P k+1 k C T 48 / 66
49 Summing up The random variable z = PDF p ( z Y k) N ([ ˆxk+1 k C ˆx k+1 k [ ] xk+1 conditioned on Y y k, has the following k+1 ] [ Pk+1 k P, k+1 k C T CP k+1 k CP k+1 k C T + R ]) with: ˆx k+1 k = Aˆx k k P k+1 k = AP k k A T + Q 49 / 66
50 Recursive formulation The last step is to compute p ( x k+1 Y k+1) N (ˆx k+1 k+1, P k+1 k+1 ) where the mean ˆx k+1 k+1 is the optimal estimation we are looking for and P k+1 k+1 the variance of the corresponding estimation error. Substituting the previous expression we end up with ˆx k+1 k+1 = ˆx k+1 k + P k+1 k C T ( CP k+1 k C T + R ) 1 ( yk+1 C ˆx k+1 k ) The Kalman gain is the matrix K k+1 = P k+1 k C T ( CP k+1 k C T + R ) 1 mapping the output estimation error into the correction of the prediction state ˆx k+1 k+1 = ˆx k+1 k + K k+1 ( yk+1 C ˆx k+1 k ) The variance of the estimation error is P k+1 k+1 = P k+1 k P k+1 k C T ( CP k+1 k C T + R ) 1 CPk+1 k 50 / 66
51 Recursive formulation Prediction step / A priori estimation ˆx k+1 k = Aˆx k k P k+1 k = AP k k A T + Q Estimation step / A posteriori estimation ˆx k+1 k+1 = ˆx k+1 k + K k+1 (y k+1 C ˆx k+1 k ) P k+1 k+1 = P k+1 k P k+1 k C T (CP k+1 k C T + R) 1 CP k+1 k Initial conditions ˆx 0 1 = x 0 P 0 1 = P 0 51 / 66
52 Kalman filter/predictor Kalman filter ˆx k+1 k+1 = Aˆx k k + K k+1 (y k+1 CAˆx k k ) P k+1 k = AP k k 1 A T AP k k 1 C T (CP k k 1 C T + R) 1 CP k k 1 A T + Q with ( ) 1 K k+1 = P k+1 k C T CP k+1 k C T + R The matrix recursive equation P k+1 k =... is called Riccati equation. Kalman predictor ˆx k+1 k = Aˆx k k 1 + K k (y k C ˆx k k 1 ) P k+1 k = AP k k 1 A T AP k k 1 C T (CP k k 1 C T + R) 1 CP k k 1 A T + Q with ( ) 1 K k = AP k k 1 C T CP k k 1 C T + R 52 / 66
53 Kalman filter/predictor Observations 1. All the information in y(i), i [0, k 1] is contained in the estimation state ˆx k 1 k 1 : the following conditional expectations are equal E[x k y k, y k 1,..., y 0 ] = E[x k ˆx k 1 k 1, y k ] 2. The optimal gain K k is time-varying even if the stochastic model is LTI. 3. There is a more general formulation of the Kalman filter where w k and v k are correlated. 4. The same recursive equation for the Kalman filter can be used with linear time-varying stochastic systems. 53 / 66
54 Steady-state Kalman filter What s happen when k? Does the estimation error converge to zero with minimal variance? Theorem Given the stochastic LTI model { xk+1 = Ax k + w k y k = Cx k + v k (8) with v k N (0, R), E[v k vh T ] = Rδ(k h) w k N (0, Q), E[w k wh T ] = Qδ(k h) x 0 N ( x 0, P 0) where v k, w k, x 0 are zero mean uncorrelated Gaussian random variables. (9) 54 / 66
55 Steady-state Kalman filter Theorem (...) Then 1. The Algebraic Riccati Equation (ARE): P = AP A T AP C T (CP C T + R) 1 CP A T + Q has a unique positive definite symmetric matrix solution P = P T > 0 2. P is stabilizable, i.e. (A K C) is asymptotically stable with ( 1 K = P C T CP C T + R). 3. lim k P(k k 1) = P holds for all initial conditions P(0 1) = P 0 = P T 0 0, if and only if 1. (A, C) is detectable, 2. (A, Q 1/2 ) is stabilizable. 55 / 66
56 Steady-state Kalman filter Kalman filter (LTI) ˆx k+1 k+1 = Aˆx k k + K (y k+1 CAˆx k k ) P = APA T APC T (CPC T + R) 1 CPA T + Q with ( ) 1 K = PC T CPC T + R Kalman predictor (LTI) ˆx k+1 k = Aˆx k k 1 + K (y k C ˆx k k 1 ) P = APA T APC T (CPC T + R) 1 CPA T + Q with K ( 1 = APC T CPC T + R). 56 / 66
57 Kalman smoother 57 / 66
58 Kalman Smoother Model: {A, C, Q, R} x k+1 = Ax k + w k y k = CX k + v k Data: sequence of N samples of the output y 0, y 1,..., y N STEP 1 Standard Kalman filtering (forward step) ˆx f k+1 k+1 = Aˆx f k k + K k+1(y k+1 CAˆx f k k ) ˆx f 0 0 = x 0 P f k k =... P f 0 0 = P 0 58 / 66
59 Kalman Smoother STEP 2 Smoothing (backward step) ˆx s k N = ˆx f k k + K k [ˆx s k+1 N ˆx f k+1 t ˆx s N N = ˆx f N N where the conditional covariance matrix P(t N) satisfies the time-backward matrix equation P k N = Pk k f + K ] k [P k+1 N Pk+1 k f P N N = PN N f. with K k = P f k k AT ( P f k+1 k ) 1 ] 59 / 66
60 Problem: speed estimation Given the measurement equation where y(t) = s(t) + v(t) s(t) is the signal we are interesting in (e.g. angular position), y(t) is the measurement given by a sensor (e.g. an encoder) v(t) is the addictive measurement noise We can face different kinds of estimation problems: - Filtering: determine the best estimation ŝ(t) of s(t) based on the measurements y( ) till time t (i.e. y(0), y(1),..., y(t)) - Prediction: determine the best estimation ŝ(t + h) of s(t + h) with h 1 based on the measurements y( ) till time t (i.e. y(0), y(1),..., y(t)) - Smoothing: determine the best estimation ŝ(t h) of s(t h) with h 1 based on the measurements y( ) till time t (i.e. y(0), y(1),..., y(t)) 60 / 66
61 Problem: speed estimation Let θ and ω be the angular position and velocity, respectively. Knowing nothing about the physical model that produces the signal s(t) we set the derivative of the velocity equal to a white noise. A stochastic process n is called white noise if its values n(t i ) and n(t j ) are uncorrelated i j, i.e. Corr{n(t i ), n(t j )} = Q(t i )δ(t i t j ) We also assume the n(t) is Gaussian with zero-mean and constant variance Q R for all t Kinematic model Measurement equation θ(t) = ω(t) ω(t) = n(t) y(t) = θ(t) + v(t) where v is another white noise and v(t) is Gaussian with zero-mean and constant variance R R. 61 / 66
62 Problem: speed estimation Kinematic model θ(t) = ω(t) ω(t) = n(t) Measurement equation y(t) = θ(t) + v(t) Remark. The process n takes into account the uncertainty on the model of the system, whereas v models the measurement noise superimposed to the real value θ Let x(t) be the vector state x(t) := [ ] θ(t) ω(t) 62 / 66
63 Problem: speed estimation The continuous-time state space model of our basic system is [ ] [ ẋ(t) = x(t) + n(t) 0 0 1] y(t) = [ 1 0 ] x(t) + v(t) Its discrete-time approximation with sample time T s is given by [ ] [ ] 1 T 2 Ts s x k+1 = x 0 1 k + 2 n k T s y k = [ 1 0 ] x k + v k 63 / 66
64 Problem: speed estimation The relationship between the two state space models [ ] [ ] 1 T 2 Ts s { x k+1 = x 0 1 k + 2 n k xk+1 = Ax T s y k = [ 1 0 ], k + w k y k = Cx k + v k x k + v k is [ ] 1 Ts A := 0 1 [ ] [ T 2 s w k := 2 n k N 0, T s 0] C := [ 1 0 ] v k := v k [ ] [ T 2 s 2 T s T 2 s 2 T s ] T Q Tuning of the filter The variance R depends on the encoder resolution (we can read it on the datasheet) whereas the matrix Q is chosen by the designer to try to explain the measurements in the best way. 64 / 66
65 Model with inputs If an input command u k enters within the stochastic model { xk+1 = Ax k + Bu k + w k y k = Cx k + v k, how do the filter equations change? Fortunately if u k is a function of past measurements (e.g. u k = f (y 0:k )) than we can simple add the term Bu k in the recursive equations: ˆx k+1 k = Aˆx k k 1 + Bu k + K k (y k C ˆx k k 1 ) P k+1 k = AP k k 1 A T AP k k 1 C T (CP k k 1 C T + R) 1 CP k k 1 A T + Q or ˆx k+1 k = Aˆx k k 1 + Bu k + K (y k C ˆx k k 1 ) P = APA T APC T (CPC T + R) 1 CPA T + Q 65 / 66
66 Important remark In the book Probabilistic Robotics the Kalman equations are obtained following a different approach (using first and second derivatives of opportune quadratic functions). Another difference is that they start with the linear Gaussian system which is the same of Observations: x k = Ax k 1 + Bu k + w k x k+1 = Ax k + Bu k+1 + w k+1 using w k+1 instead of w k does not change anything because w is white noise on the other hand, it would make a big difference using Bu k+1 instead of Bu k as we did: for this reason the authors of the book introduce the assumption that the control input u is a random process independent of the state and the measurement 66 / 66
RECURSIVE ESTIMATION AND KALMAN FILTERING
Chapter 3 RECURSIVE ESTIMATION AND KALMAN FILTERING 3. The Discrete Time Kalman Filter Consider the following estimation problem. Given the stochastic system with x k+ = Ax k + Gw k (3.) y k = Cx k + Hv
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationRegression, Ridge Regression, Lasso
Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.
More informationData Mining Stat 588
Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic
More informationState Estimation of Linear and Nonlinear Dynamic Systems
State Estimation of Linear and Nonlinear Dynamic Systems Part I: Linear Systems with Gaussian Noise James B. Rawlings and Fernando V. Lima Department of Chemical and Biological Engineering University of
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationReminders. Thought questions should be submitted on eclass. Please list the section related to the thought question
Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a
More informationBrief Review on Estimation Theory
Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationOverview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation
Overview Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Probabilistic Interpretation: Linear Regression Assume output y is generated
More information4 Derivations of the Discrete-Time Kalman Filter
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof N Shimkin 4 Derivations of the Discrete-Time
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationEstimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators
Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationAn Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
More informationCOMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)
COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationFactor Analysis and Kalman Filtering (11/2/04)
CS281A/Stat241A: Statistical Learning Theory Factor Analysis and Kalman Filtering (11/2/04) Lecturer: Michael I. Jordan Scribes: Byung-Gon Chun and Sunghoon Kim 1 Factor Analysis Factor analysis is used
More informationA Probability Review
A Probability Review Outline: A probability review Shorthand notation: RV stands for random variable EE 527, Detection and Estimation Theory, # 0b 1 A Probability Review Reading: Go over handouts 2 5 in
More informationCSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationMS-C1620 Statistical inference
MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents
More informationInverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1
Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationy(x) = x w + ε(x), (1)
Linear regression We are ready to consider our first machine-learning problem: linear regression. Suppose that e are interested in the values of a function y(x): R d R, here x is a d-dimensional vector-valued
More informationLecture 2: Repetition of probability theory and statistics
Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:
More informationReview (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology
Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationGaussian with mean ( µ ) and standard deviation ( σ)
Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (
More informationIntroduction to Probabilistic Graphical Models: Exercises
Introduction to Probabilistic Graphical Models: Exercises Cédric Archambeau Xerox Research Centre Europe cedric.archambeau@xrce.xerox.com Pascal Bootcamp Marseille, France, July 2010 Exercise 1: basics
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationPhysics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester
Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationStatistical Techniques in Robotics (16-831, F12) Lecture#17 (Wednesday October 31) Kalman Filters. Lecturer: Drew Bagnell Scribe:Greydon Foil 1
Statistical Techniques in Robotics (16-831, F12) Lecture#17 (Wednesday October 31) Kalman Filters Lecturer: Drew Bagnell Scribe:Greydon Foil 1 1 Gauss Markov Model Consider X 1, X 2,...X t, X t+1 to be
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2016 F, FRI Uppsala University, Information Technology 13 April 2016 SI-2016 K. Pelckmans
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationMachine Learning 4771
Machine Learning 4771 Instructor: ony Jebara Kalman Filtering Linear Dynamical Systems and Kalman Filtering Structure from Motion Linear Dynamical Systems Audio: x=pitch y=acoustic waveform Vision: x=object
More informationECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering
ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:
More informationLecture 4: Least Squares (LS) Estimation
ME 233, UC Berkeley, Spring 2014 Xu Chen Lecture 4: Least Squares (LS) Estimation Background and general solution Solution in the Gaussian case Properties Example Big picture general least squares estimation:
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationECE 541 Stochastic Signals and Systems Problem Set 9 Solutions
ECE 541 Stochastic Signals and Systems Problem Set 9 Solutions Problem Solutions : Yates and Goodman, 9.5.3 9.1.4 9.2.2 9.2.6 9.3.2 9.4.2 9.4.6 9.4.7 and Problem 9.1.4 Solution The joint PDF of X and Y
More informationCOS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION
COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationStochastic Processes. M. Sami Fadali Professor of Electrical Engineering University of Nevada, Reno
Stochastic Processes M. Sami Fadali Professor of Electrical Engineering University of Nevada, Reno 1 Outline Stochastic (random) processes. Autocorrelation. Crosscorrelation. Spectral density function.
More informationECE 636: Systems identification
ECE 636: Systems identification Lectures 3 4 Random variables/signals (continued) Random/stochastic vectors Random signals and linear systems Random signals in the frequency domain υ ε x S z + y Experimental
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationIntroduction to Simple Linear Regression
Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationStatistics 910, #15 1. Kalman Filter
Statistics 910, #15 1 Overview 1. Summary of Kalman filter 2. Derivations 3. ARMA likelihoods 4. Recursions for the variance Kalman Filter Summary of Kalman filter Simplifications To make the derivations
More informationLecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN
Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and
More informationOptimal control and estimation
Automatic Control 2 Optimal control and estimation Prof. Alberto Bemporad University of Trento Academic year 2010-2011 Prof. Alberto Bemporad (University of Trento) Automatic Control 2 Academic year 2010-2011
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationGaussian processes and bayesian optimization Stanisław Jastrzębski. kudkudak.github.io kudkudak
Gaussian processes and bayesian optimization Stanisław Jastrzębski kudkudak.github.io kudkudak Plan Goal: talk about modern hyperparameter optimization algorithms Bayes reminder: equivalent linear regression
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2012 F, FRI Uppsala University, Information Technology 30 Januari 2012 SI-2012 K. Pelckmans
More informationLecture Note 1: Probability Theory and Statistics
Univ. of Michigan - NAME 568/EECS 568/ROB 530 Winter 2018 Lecture Note 1: Probability Theory and Statistics Lecturer: Maani Ghaffari Jadidi Date: April 6, 2018 For this and all future notes, if you would
More informationEC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)
1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For
More informationChapter 17: Undirected Graphical Models
Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationReview (Probability & Linear Algebra)
Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint
More informationKalman Filter. Predict: Update: x k k 1 = F k x k 1 k 1 + B k u k P k k 1 = F k P k 1 k 1 F T k + Q
Kalman Filter Kalman Filter Predict: x k k 1 = F k x k 1 k 1 + B k u k P k k 1 = F k P k 1 k 1 F T k + Q Update: K = P k k 1 Hk T (H k P k k 1 Hk T + R) 1 x k k = x k k 1 + K(z k H k x k k 1 ) P k k =(I
More informationLeast Absolute Shrinkage is Equivalent to Quadratic Penalization
Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr
More informationRirdge Regression. Szymon Bobek. Institute of Applied Computer science AGH University of Science and Technology
Rirdge Regression Szymon Bobek Institute of Applied Computer science AGH University of Science and Technology Based on Carlos Guestrin adn Emily Fox slides from Coursera Specialization on Machine Learnign
More informationDirect Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II
1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationOutline lecture 2 2(30)
Outline lecture 2 2(3), Lecture 2 Linear Regression it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic Control
More informationIntroduction to Estimation Methods for Time Series models. Lecture 1
Introduction to Estimation Methods for Time Series models Lecture 1 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 1 SNS Pisa 1 / 19 Estimation
More informationLinear regression COMS 4771
Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between
More informationconditional cdf, conditional pdf, total probability theorem?
6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random
More informationLecture 2: From Linear Regression to Kalman Filter and Beyond
Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationA Crash Course on Kalman Filtering
A Crash Course on Kalman Filtering Dan Simon Cleveland State University Fall 2014 1 / 64 Outline Linear Systems Probability State Means and Covariances Least Squares Estimation The Kalman Filter Unknown
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationLinear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77
Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical
More informationCOS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10
COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationESTIMATION THEORY. Chapter Estimation of Random Variables
Chapter ESTIMATION THEORY. Estimation of Random Variables Suppose X,Y,Y 2,...,Y n are random variables defined on the same probability space (Ω, S,P). We consider Y,...,Y n to be the observed random variables
More informationData Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
More informationBayesian Interpretations of Regularization
Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S
More informationParameter Estimation in a Moving Horizon Perspective
Parameter Estimation in a Moving Horizon Perspective State and Parameter Estimation in Dynamical Systems Reglerteknik, ISY, Linköpings Universitet State and Parameter Estimation in Dynamical Systems OUTLINE
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More information01 Probability Theory and Statistics Review
NAVARCH/EECS 568, ROB 530 - Winter 2018 01 Probability Theory and Statistics Review Maani Ghaffari January 08, 2018 Last Time: Bayes Filters Given: Stream of observations z 1:t and action data u 1:t Sensor/measurement
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More information