Mathematical Formulation of Our Example
We define two binary random variables: x ∈ {open, closed} and z, where z is the sensor measurement ("light on" or "light off"). Our question is: what is P(open | z)?
Computer Vision 1
Combining Evidence
Suppose our robot obtains another observation z_t, where the index t is the point in time. Question: how can we integrate this new information? Formally, we want to estimate P(open | z_1, ..., z_t). Using Bayes' formula with background knowledge:
P(x | y, z) = P(y | x, z) P(x | z) / P(y | z)
Markov Assumption
If we know the state of the door at time t, then the measurement z_t does not give any further information about z_1, ..., z_{t-1}. Formally: z_t and z_1, ..., z_{t-1} are conditionally independent given x. This means:
P(z_t | x, z_1, ..., z_{t-1}) = P(z_t | x)
This is called the Markov assumption.
Example with Numbers
Assume we have a second sensor with a known measurement model. Combining its reading z_2 with P(open | z_1) from above, the second measurement lowers the probability that the door is open.
General Form
Measurements: z_1, ..., z_t. Markov assumption: z_t and z_1, ..., z_{t-1} are conditionally independent given the state x. Recursion:
P(x | z_1, ..., z_t) = η_t P(z_t | x) P(x | z_1, ..., z_{t-1})
where η_t is a normalizing constant.
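The recursion above can be sketched for the binary door example. This is a minimal illustration; the sensor-model probabilities, the uniform prior, and the three identical readings are assumptions made up for this sketch, not values from the lecture.

```python
# Recursive Bayesian sensor update for a binary state ("door open?").
# Implements P(open | z_1:t) = eta * p(z_t | open) * P(open | z_1:t-1).

def sensor_update(belief_open, p_z_given_open, p_z_given_closed):
    """One recursion step; returns the normalized posterior P(open | z_1:t)."""
    num = p_z_given_open * belief_open
    den = num + p_z_given_closed * (1.0 - belief_open)
    return num / den

# Hypothetical setup: uniform prior, then three "looks open" readings
# from a sensor that reports correctly 80% of the time.
belief = 0.5
for _ in range(3):
    belief = sensor_update(belief, p_z_given_open=0.8, p_z_given_closed=0.2)
# Each consistent reading pushes the belief closer to 1.
```

Note how normalization happens inside every step, so the belief always stays a valid probability.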
Example: Sensing and Acting Now the robot senses the door state and acts (it opens or closes the door).
State Transitions
The outcome of an action u is modeled as a random variable, where in our case u = "close door" means the state after closing the door. State transition example: if the door is open, the action "close door" succeeds in 90% of all cases:
P(closed | u = close, open) = 0.9,  P(open | u = close, open) = 0.1
The Outcome of Actions
For a given action u we want to know the probability P(x'). We obtain it by marginalizing over all possible previous states x.
If the state space is discrete: P(x' | u) = Σ_x P(x' | u, x) P(x)
If the state space is continuous: p(x' | u) = ∫ p(x' | u, x) p(x) dx
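The discrete marginalization can be sketched directly. The 90% success probability for closing an open door follows the slide's example; the prior belief and the remaining transition entries are illustrative assumptions.

```python
# Discrete action update: P(x' | u) = sum_x P(x' | u, x) P(x).

def action_update(transition, belief):
    """transition maps (x_next, x_prev) -> P(x_next | u, x_prev);
    belief maps state -> probability. Returns the predicted belief."""
    return {
        x_next: sum(transition[(x_next, x_prev)] * p
                    for x_prev, p in belief.items())
        for x_next in ("open", "closed")
    }

# P(x' | u = close, x): closing an open door works 90% of the time;
# a closed door stays closed (assumed).
p_close = {
    ("closed", "open"): 0.9, ("open", "open"): 0.1,
    ("closed", "closed"): 1.0, ("open", "closed"): 0.0,
}

belief = {"open": 0.6, "closed": 0.4}   # hypothetical prior belief
belief = action_update(p_close, belief)
```

The result stays normalized because every column of the transition model sums to one.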
Back to the Example
Sensor Update and Action Update
So far, we learned two different ways to update the system state:
Sensor update: P(x | z)
Action update: P(x' | u)
Now we want to combine both.
Definition 2.1: Let d_t = {u_1, z_1, ..., u_t, z_t} be a sequence of sensor measurements and actions until time t. Then the belief of the current state x_t is defined as
Bel(x_t) = P(x_t | u_1, z_1, ..., u_t, z_t)
Graphical Representation
We can describe the overall process using a Dynamic Bayes Network. This incorporates the following Markov assumptions:
p(z_t | x_0:t, u_1:t, z_1:t-1) = p(z_t | x_t)   (measurement)
p(x_t | x_0:t-1, u_1:t, z_1:t-1) = p(x_t | x_t-1, u_t)   (state)
The Overall Bayes Filter
Bel(x_t) = P(x_t | u_1, z_1, ..., u_t, z_t)
 = η P(z_t | x_t, u_1, z_1, ..., u_t) P(x_t | u_1, z_1, ..., u_t)   (Bayes)
 = η P(z_t | x_t) P(x_t | u_1, z_1, ..., u_t)   (Markov)
 = η P(z_t | x_t) ∫ P(x_t | u_1, z_1, ..., u_t, x_t-1) P(x_t-1 | u_1, z_1, ..., u_t) dx_t-1   (Tot. prob.)
 = η P(z_t | x_t) ∫ P(x_t | u_t, x_t-1) P(x_t-1 | u_1, z_1, ..., z_t-1) dx_t-1   (Markov)
 = η P(z_t | x_t) ∫ P(x_t | u_t, x_t-1) Bel(x_t-1) dx_t-1   (Markov)
The Bayes Filter Algorithm
Algorithm Bayes_filter(Bel(x), d):
1. if d is a sensor measurement z then
2.   η = 0
3.   for all x do
4.     Bel'(x) = P(z | x) Bel(x)
5.     η = η + Bel'(x)
6.   for all x do Bel'(x) = η⁻¹ Bel'(x)
7. else if d is an action u then
8.   for all x do Bel'(x) = Σ_x' P(x | u, x') Bel(x')
9. return Bel'(x)
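A minimal Python sketch of the Bayes filter algorithm for a discrete state space: normalize after sensor updates, marginalize over previous states for action updates. The state names and all model probabilities below are illustrative assumptions, not values from the lecture.

```python
# Discrete Bayes filter: one sensor step, then one action step.

STATES = ("open", "closed")

def bayes_filter(bel, d, sensor_model=None, motion_model=None):
    """bel: dict state -> prob. d: ('z', measurement) or ('u', action)."""
    kind, value = d
    if kind == "z":                               # sensor measurement
        new = {x: sensor_model[(value, x)] * bel[x] for x in STATES}
        eta = sum(new.values())                   # normalizer
        return {x: p / eta for x, p in new.items()}
    else:                                         # action
        return {x2: sum(motion_model[(x2, value, x1)] * bel[x1]
                        for x1 in STATES) for x2 in STATES}

# Hypothetical models: (measurement, state) -> likelihood,
# (next_state, action, prev_state) -> transition probability.
sensor = {("sees_open", "open"): 0.6, ("sees_open", "closed"): 0.3,
          ("sees_closed", "open"): 0.4, ("sees_closed", "closed"): 0.7}
motion = {("open", "close", "open"): 0.1, ("closed", "close", "open"): 0.9,
          ("open", "close", "closed"): 0.0, ("closed", "close", "closed"): 1.0}

bel = {"open": 0.5, "closed": 0.5}
bel = bayes_filter(bel, ("z", "sees_open"), sensor_model=sensor)
bel = bayes_filter(bel, ("u", "close"), motion_model=motion)
```

One function handles both update types, mirroring the if/else structure of the algorithm.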
Bayes Filter Variants
The Bayes filter principle is used in:
Kalman filters
Particle filters
Hidden Markov models
Dynamic Bayesian networks
Partially Observable Markov Decision Processes (POMDPs)
Summary
Probabilistic reasoning is necessary to deal with uncertain information, e.g. sensor measurements.
Using Bayes' rule, we can do diagnostic reasoning based on causal knowledge.
The outcome of a robot's action can be described by a state transition diagram.
Probabilistic state estimation can be done recursively using the Bayes filter, with a sensor update and a motion update.
A graphical representation for the state estimation problem is the Dynamic Bayes Network.
Prof. Daniel Cremers 2. Regression
Categories of Learning (Rep.)
Unsupervised learning: clustering, density estimation
Supervised learning: learning from a training data set, inference on the test data (discriminant function, discriminative model, generative model)
Reinforcement learning: no supervision, but a reward function
Mathematical Formulation (Rep.)
Suppose we are given a set of objects and a set of object categories (classes). In the learning task, we search for a mapping such that similar objects are mapped to similar categories.
Difference between regression and classification: in regression the target is continuous, in classification it is discrete. Regression learns a function; classification usually learns class labels. For now we will treat regression.
Basis Functions
In principle, the elements of the object set can be anything (e.g. real numbers, graphs, 3D objects). To be able to treat these objects mathematically, we need functions that map from the objects into R^M. We call these the basis functions. We can also interpret the basis functions as functions that extract features from the input data. Features reflect the properties of the objects (width, height, etc.).
Simple Example: Linear Regression
Assume: the identity basis function φ(x) = x
Given: data points (x_1, t_1), ..., (x_N, t_N)
Goal: predict the target value t of a new example x
Parametric formulation: f(x, w) = w_0 + w_1 x
(Figure: five data points (x_i, t_i) and the fitted line f(x, w).)
Linear Regression
To determine the function f, we need an error function (sum of squared errors):
E(w) = (1/2) Σ_{i=1}^N (f(x_i, w) - t_i)^2
We search for parameters w such that E(w) is minimal:
∇E(w) = Σ_{i=1}^N (f(x_i, w) - t_i) ∇f(x_i, w) = (0 0)
f(x, w) = w_0 + w_1 x  ⇒  ∇f(x_i, w) = (1 x_i)
Using vector notation: x_i = (1 x_i)^T  ⇒  f(x_i, w) = w^T x_i
∇E(w) = Σ_{i=1}^N w^T x_i x_i^T - Σ_{i=1}^N t_i x_i^T = (0 0)
⇒ w^T Σ_{i=1}^N x_i x_i^T = Σ_{i=1}^N t_i x_i^T, with A^T := Σ_i x_i x_i^T and b^T := Σ_i t_i x_i^T, i.e. A w = b.
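The derivation above ends in a linear system A w = b with A = Σ_i x_i x_i^T and b = Σ_i t_i x_i. A short numerical sketch, with made-up data chosen so the exact line t = 1 + 2x is recovered:

```python
import numpy as np

# Least-squares line fit via the normal equation A w = b,
# where the rows of X are the augmented inputs x_i = (1, x_i)^T.

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])          # exactly t = 1 + 2x

X = np.stack([np.ones_like(x), x], axis=1)  # row i is (1, x_i)
A = X.T @ X                                 # sum of outer products x_i x_i^T
b = X.T @ t                                 # sum of t_i x_i
w = np.linalg.solve(A, b)                   # w = (w_0, w_1)
```

With noise-free data the solver returns the generating coefficients exactly; with noisy data it returns the least-squares compromise.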
Polynomial Regression
Given: data points (x_1, t_1), ..., (x_N, t_N)
Assume we are given M basis functions. Now we have:
f(x, w) = w_0 + Σ_{j=1}^M w_j φ_j(x) = w^T φ(x)
(Figure: fits of varying model complexity for a fixed data set size.)
Polynomial Regression
We have defined: f(x, w) = w^T φ(x)
Therefore:
E(w) = (1/2) Σ_{i=1}^N (w^T φ(x_i) - t_i)^2
∇E(w) = w^T Σ_{i=1}^N φ(x_i) φ(x_i)^T - Σ_{i=1}^N t_i φ(x_i)^T
where each φ(x_i) φ(x_i)^T is an outer product of the basis-function vectors.
Polynomial Regression
Thus, we have: ∇E(w) = w^T Φ^T Φ - t^T Φ = 0^T, where the i-th row of Φ is φ(x_i)^T.
It follows: Φ^T Φ w = Φ^T t   (Normal Equation)
w = (Φ^T Φ)^{-1} Φ^T t = Φ^+ t   (Pseudoinverse Φ^+)
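The normal-equation solution can be sketched with polynomial basis functions φ_j(x) = x^j. The data (a noise-free quadratic) are an illustrative assumption, chosen so the exact coefficients are recovered:

```python
import numpy as np

# Polynomial regression: w = Phi^+ t, with design matrix
# Phi[i, j] = phi_j(x_i) = x_i ** j.

def design_matrix(x, M):
    """Rows are phi(x_i)^T for the monomial basis of degree M."""
    return np.stack([x**j for j in range(M + 1)], axis=1)

x = np.linspace(0.0, 1.0, 8)
t = 2.0 - x + 3.0 * x**2            # noiseless target: w = (2, -1, 3)

Phi = design_matrix(x, M=2)
w = np.linalg.pinv(Phi) @ t         # pseudoinverse solution
```

`np.linalg.pinv` computes Φ^+ internally via the SVD, which is exactly the numerically stable route the next slide describes.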
Computing the Pseudoinverse
Mathematically, a pseudoinverse Φ^+ exists for every matrix Φ.
However: if Φ^T Φ is (close to) singular, the direct solution w = (Φ^T Φ)^{-1} Φ^T t is numerically unstable.
Therefore, the Singular Value Decomposition (SVD) is used: Φ = U D V^T, where U and V are orthogonal matrices and D is a diagonal matrix.
Then: Φ^+ = V D^+ U^T, where D^+ contains the reciprocals of all non-zero elements of D.
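The SVD construction of Φ^+ can be written out explicitly and checked against NumPy's built-in pseudoinverse. The test matrix and the zero-tolerance are arbitrary illustrative choices:

```python
import numpy as np

# Pseudoinverse from the SVD: Phi = U D V^T  =>  Phi^+ = V D^+ U^T,
# where D^+ takes reciprocals of the non-zero singular values only.

def pinv_via_svd(Phi, tol=1e-12):
    U, d, Vt = np.linalg.svd(Phi, full_matrices=False)
    # Invert singular values above tol; map the rest to zero.
    safe = np.where(d > tol, d, 1.0)          # avoid division by zero
    d_plus = np.where(d > tol, 1.0 / safe, 0.0)
    return Vt.T @ np.diag(d_plus) @ U.T

Phi = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
P = pinv_via_svd(Phi)                         # matches np.linalg.pinv(Phi)
```

Zeroing (rather than inverting) tiny singular values is what makes the SVD route stable when Φ^T Φ is close to singular.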
A Simple Example
Basis functions: φ_j(x) = x^j
(Figure: polynomial fits for increasing degree.)
A Simple Example
Basis functions: φ_j(x) = x^j
(Figure: a high-degree polynomial fit passing through all data points but oscillating in between — overfitting.)
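The overfitting effect from the figure can be reproduced numerically: a high-degree polynomial drives the training error to (near) zero on a small noisy sample, while a moderate degree leaves a visible residual. The target function, noise level, seed, and sample count are illustrative assumptions:

```python
import numpy as np

# Overfitting sketch: fit monomial bases of degree 3 and 9 to ten
# noisy samples of sin(2*pi*x) and compare the training errors.

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

def fit(x, t, M):
    Phi = np.stack([x**j for j in range(M + 1)], axis=1)
    return np.linalg.pinv(Phi) @ t

def train_error(x, t, w):
    Phi = np.stack([x**j for j in range(len(w))], axis=1)
    return 0.5 * np.sum((Phi @ w - t) ** 2)

e3 = train_error(x, t, fit(x, t, M=3))   # moderate complexity
e9 = train_error(x, t, fit(x, t, M=9))   # 10 coefficients, 10 points:
                                         # (near-)interpolation
```

The low training error of the degree-9 fit says nothing about its behavior between the samples, which is exactly the overfitting problem.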
Varying the Sample Size
(Figure: fits of the same model for increasing numbers of samples.)
The Resulting Model Parameters
(Figure: bar plots of the learned coefficients w_j for increasing model complexity; their magnitudes grow drastically, up to the order of 10^5 for the most complex model.)
Observations
The higher the model complexity grows, the better the fit to the data.
If the model complexity is too high, all data points are explained well, but the resulting model oscillates strongly; it cannot generalize well. This is called overfitting.
By increasing the size of the data set (number of samples), we obtain a better fit of the model.
More complex models have larger parameters.
Problem: how can we find a good model complexity for a given data set of fixed size?
Regularization
We observed that complex models yield large parameters, leading to oscillation.
Idea: minimize the error function and the magnitude of the parameters simultaneously. We do this by adding a regularization term:
Ẽ(w) = (1/2) Σ_{i=1}^N (w^T φ(x_i) - t_i)^2 + (λ/2) ‖w‖²₂
where λ rules the influence of the regularization.
Regularization
As above, we set the derivative to zero:
∇Ẽ(w) = Σ_{i=1}^N (w^T φ(x_i) - t_i) φ(x_i)^T + λ w^T = 0^T
With regularization, we can find a complex model for a small data set. However, the problem now is to find an appropriate regularization coefficient λ.
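Setting the regularized gradient to zero yields the closed form w = (Φ^T Φ + λ I)^{-1} Φ^T t. A small sketch showing that a larger λ shrinks the parameter vector; the data, degree, and λ values are illustrative assumptions:

```python
import numpy as np

# Regularized least squares: w = (Phi^T Phi + lam * I)^{-1} Phi^T t.

def ridge_fit(x, t, M, lam):
    Phi = np.stack([x**j for j in range(M + 1)], axis=1)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)

w_small = ridge_fit(x, t, M=9, lam=1e-8)   # nearly unregularized
w_large = ridge_fit(x, t, M=9, lam=1.0)    # heavily regularized
shrinks = np.linalg.norm(w_large) < np.linalg.norm(w_small)
```

Adding λI also makes the system matrix well-conditioned, so the solve is numerically stable even when Φ^T Φ alone is nearly singular.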
Regularized Results
(Figure: fits of the complex model for different values of λ.)
The Problem from a Different View Point
Assume that the target t is affected by Gaussian noise ε:
t = f(x, w) + ε, where ε ~ N(0, σ²)
Thus, we have: p(t | x, w, σ) = N(t; f(x, w), σ²)
Maximum Likelihood Estimation
Aim: we want to find the w that maximizes p(t | x, w, σ). This is the likelihood of the measured data, given a model. Intuitively: find parameters w that maximize the probability of measuring the already measured data t (maximum likelihood estimation). We can think of this as fitting a model w to the data t.
Note: σ is also part of the model and can be estimated. For now, we assume σ is known.
Maximum Likelihood Estimation
Given data points: (x_1, t_1), ..., (x_N, t_N)
Assumption: the points are drawn independently from p:
p(t | X, w, σ) = Π_{i=1}^N N(t_i; f(x_i, w), σ²)
where X = (x_1, ..., x_N) and t = (t_1, ..., t_N)^T.
Instead of maximizing p we can also maximize its logarithm (monotonicity of the logarithm).
Maximum Likelihood Estimation
ln p(t | X, w, σ) = -1/(2σ²) Σ_{i=1}^N (f(x_i, w) - t_i)² - (N/2) ln(2πσ²)
The second term is constant for all w; the first term is equal to -E(w)/σ². Therefore: the parameters that maximize the likelihood are exactly those that minimize the sum of squared errors.
Maximum Likelihood Estimation
w_ML := argmax_w ln p(t | X, w, σ) = argmin_w E(w) = (Φ^T Φ)^{-1} Φ^T t
The ML solution is obtained using the pseudoinverse.
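The equivalence can be checked numerically: the ML / normal-equation solution and a generic least-squares solver must return the same coefficients. The noisy line data, noise level, and seed are illustrative assumptions:

```python
import numpy as np

# Check that the ML closed form (Phi^T Phi)^{-1} Phi^T t agrees with
# a generic least-squares solve of Phi w = t.

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
t = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(x.size)

Phi = np.stack([np.ones_like(x), x], axis=1)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # normal equation
w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]    # library least squares
```

Under the Gaussian-noise model both routes compute the same minimizer of E(w); they differ only in numerical strategy (lstsq uses the SVD internally).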