Mathematical Formulation of Our Example

Mathematical Formulation of Our Example: We define two binary random variables: open and $z$, where $z$ is "light on" or "light off". Our question is: What is $p(\text{open} \mid z)$?

Combining Evidence: Suppose our robot obtains another observation $z_2$, where the index is the point in time. Question: How can we integrate this new information? Formally, we want to estimate $p(\text{open} \mid z_1, z_2)$. Using Bayes' formula with background knowledge: $p(\text{open} \mid z_1, z_2) = \frac{p(z_2 \mid \text{open}, z_1)\, p(\text{open} \mid z_1)}{p(z_2 \mid z_1)}$.

Markov Assumption: If we know the state of the door at time $t$, then the earlier measurement $z_1$ does not give any further information about $z_2$. Formally: $z_1$ and $z_2$ are conditionally independent given open. This means: $p(z_2 \mid \text{open}, z_1) = p(z_2 \mid \text{open})$. This is called the Markov assumption.

Example with Numbers: Assume we have a second sensor with its own measurement model. Then, applying the update rule from above, the new posterior is lower than before: the second measurement $z_2$ lowers the probability that the door is open.
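
As an illustration of this recursive measurement update, here is a small Python sketch. The prior and both sensor models are assumed example values (not the lecture's numbers), chosen only so that the second reading lowers the belief that the door is open:

```python
# Recursive Bayes update for the binary door example.
# All probabilities below are assumed example values, not the lecture's numbers.
def bayes_update(prior_open, p_z_given_open, p_z_given_closed):
    """Return p(open | z) from the prior p(open) and the sensor model for z."""
    numerator = p_z_given_open * prior_open
    evidence = numerator + p_z_given_closed * (1.0 - prior_open)
    return numerator / evidence

p_open = 0.5                              # assumed prior p(open)
p_open = bayes_update(p_open, 0.6, 0.3)   # first sensor: p(z1|open)=0.6, p(z1|closed)=0.3
print(f"p(open | z1)     = {p_open:.3f}") # ~0.667
p_open = bayes_update(p_open, 0.5, 0.6)   # second sensor: p(z2|open)=0.5, p(z2|closed)=0.6
print(f"p(open | z1, z2) = {p_open:.3f}") # ~0.625: the second reading lowers the belief
```

The Markov assumption from the previous slide is what justifies reusing the posterior after $z_1$ as the prior for the $z_2$ update.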

General Form: Measurements $z_1, \dots, z_n$. Markov assumption: $z_n$ and $z_1, \dots, z_{n-1}$ are conditionally independent given the state $x$. Recursion: $p(x \mid z_1, \dots, z_n) = \eta\, p(z_n \mid x)\, p(x \mid z_1, \dots, z_{n-1})$, where $\eta$ is a normalizing constant.

Example: Sensing and Acting. Now the robot senses the door state and acts (it opens or closes the door).

State Transitions: The outcome of an action $u$ is modeled as a random variable; in our case, "closed" means the state after closing the door. State transition example: if the door is open, the action "close door" succeeds in 90% of all cases.

The Outcome of Actions: For a given action $u$ we want to know the probability $p(x \mid u)$. We do this by integrating over all possible previous states $x'$. If the state space is discrete: $p(x \mid u) = \sum_{x'} p(x \mid u, x')\, p(x')$. If the state space is continuous: $p(x \mid u) = \int p(x \mid u, x')\, p(x')\, dx'$.
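
The discrete case fits in a few lines of Python. This is only a sketch of the action update for the door example; the transition probabilities (closing succeeds with probability 0.9 when the door is open, a closed door stays closed) are assumed illustrative values:

```python
# Discrete action update: p(x | u) = sum_{x'} p(x | u, x') p(x').
# Transition probabilities are assumed illustrative values.
STATES = ["open", "closed"]

# p_transition[u][x_new][x_prev] = p(x_new | u, x_prev)
p_transition = {
    "close_door": {
        "open":   {"open": 0.1, "closed": 0.0},
        "closed": {"open": 0.9, "closed": 1.0},
    }
}

def action_update(belief, action):
    """Propagate a discrete belief p(x') through the transition model of an action."""
    return {
        x_new: sum(p_transition[action][x_new][x_prev] * belief[x_prev]
                   for x_prev in STATES)
        for x_new in STATES
    }

belief = {"open": 0.625, "closed": 0.375}   # e.g. the belief after sensing
print(action_update(belief, "close_door"))  # {'open': 0.0625, 'closed': 0.9375}
```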

Back to the Example

Sensor Update and Action Update: So far, we have learned two different ways to update the system state. Sensor update: $p(x \mid z)$. Action update: $p(x \mid u) = \sum_{x'} p(x \mid u, x')\, p(x')$. Now we want to combine both. Definition 2.1: Let $d_t = \{u_1, z_1, \dots, u_t, z_t\}$ be a sequence of sensor measurements and actions until time $t$. Then the belief of the current state $x_t$ is defined as $\mathrm{Bel}(x_t) = p(x_t \mid u_1, z_1, \dots, u_t, z_t)$.

Graphical Representation: We can describe the overall process using a Dynamic Bayes Network. This incorporates the following Markov assumptions: $p(z_t \mid x_{0:t}, u_{1:t}, z_{1:t-1}) = p(z_t \mid x_t)$ (measurement) and $p(x_t \mid x_{0:t-1}, u_{1:t}, z_{1:t-1}) = p(x_t \mid x_{t-1}, u_t)$ (state).

The Overall Bayes Filter:
$\mathrm{Bel}(x_t) = p(x_t \mid u_1, z_1, \dots, u_t, z_t)$
$= \eta\, p(z_t \mid x_t, u_1, z_1, \dots, u_t)\, p(x_t \mid u_1, z_1, \dots, u_t)$ (Bayes)
$= \eta\, p(z_t \mid x_t)\, p(x_t \mid u_1, z_1, \dots, u_t)$ (Markov)
$= \eta\, p(z_t \mid x_t) \int p(x_t \mid u_1, z_1, \dots, u_t, x_{t-1})\, p(x_{t-1} \mid u_1, z_1, \dots, u_t)\, dx_{t-1}$ (Tot. prob.)
$= \eta\, p(z_t \mid x_t) \int p(x_t \mid u_t, x_{t-1})\, p(x_{t-1} \mid u_1, z_1, \dots, u_t)\, dx_{t-1}$ (Markov)
$= \eta\, p(z_t \mid x_t) \int p(x_t \mid u_t, x_{t-1})\, p(x_{t-1} \mid u_1, z_1, \dots, z_{t-1})\, dx_{t-1}$ (Markov)
$= \eta\, p(z_t \mid x_t) \int p(x_t \mid u_t, x_{t-1})\, \mathrm{Bel}(x_{t-1})\, dx_{t-1}$

The Bayes Filter Algorithm. Algorithm Bayes_filter(Bel(x), d):
1. if d is a sensor measurement z then
2.   η = 0
3.   for all x do
4.     Bel'(x) = p(z | x) Bel(x)
5.     η = η + Bel'(x)
6.   for all x do Bel'(x) = η⁻¹ Bel'(x)
7. else if d is an action u then
8.   for all x do Bel'(x) = Σ_{x'} p(x | u, x') Bel(x')
9. return Bel'(x)
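
A compact Python version of this algorithm for the two-state door example is sketched below; the sensor model $p(z \mid x)$ and the transition model $p(x \mid u, x')$ are assumed illustrative values:

```python
# A minimal discrete Bayes filter for the two-state door example.
# Sensor and transition models are assumed illustrative values.
STATES = ["open", "closed"]

P_Z = {  # p(z | x): measurement model
    "sense_open":   {"open": 0.6, "closed": 0.3},
    "sense_closed": {"open": 0.4, "closed": 0.7},
}
P_X = {  # p(x | u, x'): transition model
    "close_door": {("open", "open"): 0.1, ("open", "closed"): 0.0,
                   ("closed", "open"): 0.9, ("closed", "closed"): 1.0},
}

def bayes_filter(belief, d):
    """One filter step; d is either ('z', measurement) or ('u', action)."""
    kind, value = d
    if kind == "z":                                   # sensor update + normalization
        unnormalized = {x: P_Z[value][x] * belief[x] for x in STATES}
        eta = sum(unnormalized.values())
        return {x: b / eta for x, b in unnormalized.items()}
    else:                                             # action update (total probability)
        return {x: sum(P_X[value][(x, xp)] * belief[xp] for xp in STATES)
                for x in STATES}

bel = {"open": 0.5, "closed": 0.5}
for d in [("z", "sense_open"), ("u", "close_door")]:
    bel = bayes_filter(bel, d)
print(bel)  # belief after one sensor update and one action update
```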

Bayes Filter Variants: The Bayes filter principle is used in Kalman filters, particle filters, hidden Markov models, dynamic Bayesian networks, and partially observable Markov decision processes (POMDPs).

Summary: Probabilistic reasoning is necessary to deal with uncertain information, e.g. sensor measurements. Using Bayes' rule, we can do diagnostic reasoning based on causal knowledge. The outcome of a robot's action can be described by a state transition diagram. Probabilistic state estimation can be done recursively using the Bayes filter, which combines a sensor update and a motion update. A graphical representation of the state estimation problem is the Dynamic Bayes Network.

Prof. Daniel Cremers 2. Regression

Categories of Learning (Rep.): Unsupervised learning (clustering, density estimation), supervised learning (learning from a training data set, inference on the test data), and reinforcement learning (no supervision, but a reward function). Supervised learning approaches can be further divided into discriminant functions, discriminative models, and generative models.

Mathematical Formulation (Rep.): Suppose we are given a set $X$ of objects and a set $Y$ of object categories (classes). In the learning task, we search for a mapping $f: X \to Y$ such that similar elements in $X$ are mapped to similar elements in $Y$. Difference between regression and classification: in regression, $Y$ is continuous; in classification, it is discrete. Regression learns a function, classification usually learns class labels. For now, we will treat regression.

Basis Functions: In principle, the elements of $X$ can be anything (e.g. real numbers, graphs, 3D objects). To be able to treat these objects mathematically, we need functions $\phi: X \to \mathbb{R}^M$ that map from $X$ to $\mathbb{R}^M$. We call these the basis functions. We can also interpret the basis functions as functions that extract features from the input data. Features reflect the properties of the objects (width, height, etc.).

Simple Example: Linear Regression. Assume: $\phi(x) = x$ (identity). Given: data points $(x_1, t_1), \dots, (x_N, t_N)$. Goal: predict the value $t$ of a new example $x$. Parametric formulation: $f(x, \mathbf{w}) = w_0 + w_1 x$. [Figure: data points $(x_i, t_i)$ with a fitted line $f(x, \mathbf{w}^*)$.]

Linear Regression: To determine the function $f$, we need an error function, the sum of squared errors: $E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^N (f(x_i, \mathbf{w}) - t_i)^2$. We search for parameters $\mathbf{w}^*$ such that $E(\mathbf{w}^*)$ is minimal: $\nabla E(\mathbf{w}) = \sum_{i=1}^N (f(x_i, \mathbf{w}) - t_i)\, \nabla f(x_i, \mathbf{w}) = (0\ \ 0)$. With $f(x, \mathbf{w}) = w_0 + w_1 x$ it follows that $\nabla f(x_i, \mathbf{w}) = (1\ \ x_i)$.

Linear Regression (cont.): Using vector notation: $\mathbf{x}_i = (1\ \ x_i)^T$, so that $f(\mathbf{x}_i, \mathbf{w}) = \mathbf{w}^T \mathbf{x}_i$.

Linear Regression (cont.): Setting the gradient to zero gives $\nabla E(\mathbf{w}) = \sum_{i=1}^N \mathbf{w}^T \mathbf{x}_i \mathbf{x}_i^T - \sum_{i=1}^N t_i \mathbf{x}_i^T = (0\ \ 0)$, and therefore $\mathbf{w}^T \underbrace{\sum_{i=1}^N \mathbf{x}_i \mathbf{x}_i^T}_{=:A^T} = \underbrace{\sum_{i=1}^N t_i \mathbf{x}_i^T}_{=:b^T}$, i.e. the linear system $A\mathbf{w} = \mathbf{b}$.
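
As a minimal sketch of this closed-form solution (the data values below are assumed for illustration), the linear system $A\mathbf{w} = \mathbf{b}$ can be built and solved directly with NumPy:

```python
# Closed-form linear regression: build A = sum_i x_i x_i^T and b = sum_i t_i x_i
# with x_i = (1, x_i)^T and solve A w = b. Data points are assumed example values.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])     # assumed inputs
t = np.array([0.1, 0.9, 2.2, 2.8, 4.1])     # assumed noisy targets

X = np.column_stack([np.ones_like(x), x])   # rows are x_i^T = (1, x_i)
A = X.T @ X                                 # sum_i x_i x_i^T
b = X.T @ t                                 # sum_i t_i x_i
w = np.linalg.solve(A, b)                   # w = (w_0, w_1)

print("w0, w1:", w)
print("prediction at x = 5:", w[0] + w[1] * 5.0)
```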

Polynomial Regression: Given: data points $(x_1, t_1), \dots, (x_N, t_N)$. Assume we are given $M$ basis functions $\phi_j$. Now we have: $f(x, \mathbf{w}) = w_0 + \sum_{j=1}^M w_j \phi_j(x) = \mathbf{w}^T \boldsymbol{\phi}(x)$. [Figure: a fitted curve $f$; $N$ is the data set size, $M$ the model complexity.]

Polynomial Regression: We have defined $f(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x)$ and $E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^N (\mathbf{w}^T \boldsymbol{\phi}(x_i) - t_i)^2$. Therefore: $\nabla E(\mathbf{w}) = \mathbf{w}^T \left( \sum_{i=1}^N \boldsymbol{\phi}(x_i)\, \boldsymbol{\phi}(x_i)^T \right) - \sum_{i=1}^N t_i\, \boldsymbol{\phi}(x_i)^T$. Each term $\boldsymbol{\phi}(x_i)\boldsymbol{\phi}(x_i)^T$ is an outer product of the basis function vector with itself; stacking the row vectors $\boldsymbol{\phi}(x_i)^T$ yields the design matrix $\Phi$.

Polynomial Regression: Thus, we have $\nabla E(\mathbf{w}) = \mathbf{w}^T \Phi^T \Phi - \mathbf{t}^T \Phi = \mathbf{0}^T$, where $\Phi$ is the design matrix with rows $\boldsymbol{\phi}(x_i)^T$ and $\mathbf{t} = (t_1, \dots, t_N)^T$. It follows the normal equation $\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t} = \Phi^+ \mathbf{t}$, where $\Phi^+$ is the pseudoinverse of $\Phi$.

Computing the Pseudoinverse: Mathematically, a pseudoinverse $\Phi^+$ exists for every matrix. However: if $\Phi^T\Phi$ is (close to) singular, the direct solution of the normal equation is numerically unstable. Therefore, the singular value decomposition (SVD) is used: $\Phi = U D V^T$, where the matrices $U$ and $V$ are orthogonal and $D$ is a diagonal matrix. Then: $\Phi^+ = V D^+ U^T$, where $D^+$ contains the reciprocals of all non-zero elements of $D$.
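
Putting the last few slides together, here is a small sketch of polynomial regression via the SVD-based pseudoinverse; the data set, noise level and degree are assumed example values:

```python
# Polynomial regression via the pseudoinverse (NumPy's pinv is SVD-based).
# Data, noise level and polynomial degree are assumed example values.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

M = 3                                       # model complexity (polynomial degree)
Phi = np.vander(x, M + 1, increasing=True)  # design matrix, rows phi(x_i)^T = (1, x_i, ..., x_i^M)

w = np.linalg.pinv(Phi) @ t                 # w = Phi^+ t  (normal equation via SVD)

print("fitted coefficients:", w)
print("prediction at x = 0.5:", np.vander([0.5], M + 1, increasing=True) @ w)
```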

A Simple Example: basis functions $\phi_j(x) = x^j$. [Figure: polynomial fits to a sample data set.]

A Simple Example: $\phi_j(x) = x^j$. For large $M$ the fitted curve shows overfitting. [Figure: overfitting for a high-degree polynomial.]

Varying the Sample Size

The Resulting Model Parameters. [Figure: the fitted coefficients for different model complexities; their magnitudes range from below 1 up to the order of $10^5$ for the most complex model.]

Observations: The higher the model complexity, the better the fit to the data. If the model complexity is too high, all data points are explained well, but the resulting model oscillates strongly and cannot generalize well; this is called overfitting. By increasing the size of the data set (number of samples), we obtain a better fit of the model. More complex models have larger parameters. Problem: how can we find a good model complexity for a given data set of fixed size?

Regularization: We observed that complex models yield large parameters, leading to oscillation. Idea: minimize the error function and the magnitude of the parameters simultaneously. We do this by adding a regularization term: $\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^N (\mathbf{w}^T \boldsymbol{\phi}(x_i) - t_i)^2 + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$, where $\lambda$ rules the influence of the regularization.

Regularization: As above, we set the derivative to zero: $\nabla\tilde{E}(\mathbf{w}) = \sum_{i=1}^N (\mathbf{w}^T \boldsymbol{\phi}(x_i) - t_i)\, \boldsymbol{\phi}(x_i)^T + \lambda \mathbf{w}^T = \mathbf{0}^T$. With regularization, we can find a complex model for a small data set. However, the problem now is to find an appropriate regularization coefficient $\lambda$.
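
Setting this gradient to zero leads to the regularized normal equation $\mathbf{w} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\mathbf{t}$. The sketch below compares it with the unregularized fit on assumed example data:

```python
# Regularized (ridge) regression: w = (Phi^T Phi + lambda I)^{-1} Phi^T t,
# compared against the unregularized pseudoinverse fit.
# Data, degree and lambda are assumed example values.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

M = 9                                         # deliberately high model complexity
lam = 1e-3                                    # regularization coefficient lambda
Phi = np.vander(x, M + 1, increasing=True)

w_plain = np.linalg.pinv(Phi) @ t                                        # unregularized
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)  # regularized

print("max |w| without regularization:", np.max(np.abs(w_plain)))
print("max |w| with regularization:   ", np.max(np.abs(w_ridge)))       # much smaller
```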

Regularized Results

The Problem from a Different Viewpoint: Assume that the target value is affected by Gaussian noise $\epsilon$: $t = f(x, \mathbf{w}) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Thus, we have $p(t \mid x, \mathbf{w}, \sigma) = \mathcal{N}(t; f(x, \mathbf{w}), \sigma^2)$.
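
To make this noise model concrete, here is a short sketch that generates targets according to $t = f(x, \mathbf{w}) + \epsilon$ and evaluates the Gaussian likelihood $p(t \mid x, \mathbf{w}, \sigma)$; the true parameters and $\sigma$ are assumed example values:

```python
# Generate noisy targets t = f(x, w) + eps and evaluate p(t | x, w, sigma).
# True parameters, noise level and comparison model are assumed example values.
import numpy as np

def f(x, w):
    """Linear model f(x, w) = w0 + w1 * x."""
    return w[0] + w[1] * x

def gaussian_pdf(t, mean, sigma):
    """Density of N(t; mean, sigma^2)."""
    return np.exp(-0.5 * ((t - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(2)
w_true, sigma = np.array([0.5, 2.0]), 0.3
x = np.linspace(0.0, 1.0, 5)
t = f(x, w_true) + sigma * rng.standard_normal(x.shape)   # noisy observations

# Likelihood of the data under the true parameters vs. an arbitrary bad guess:
print("p(t | x, w_true):", np.prod(gaussian_pdf(t, f(x, w_true), sigma)))
print("p(t | x, w_bad): ", np.prod(gaussian_pdf(t, f(x, np.array([0.0, 0.0])), sigma)))
```

Maximizing this product over $\mathbf{w}$ is exactly the maximum likelihood estimation discussed on the following slides.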

Maximum Likelihood Estimation: Aim: we want to find the $\mathbf{w}$ that maximizes $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma)$, the likelihood of the measured data given a model. Intuitively: find parameters $\mathbf{w}$ that maximize the probability of measuring the already measured data $\mathbf{t}$ (maximum likelihood estimation). We can think of this as fitting a model $\mathbf{w}$ to the data $\mathbf{t}$. Note: $\sigma$ is also part of the model and can be estimated; for now, we assume $\sigma$ is known.

Maximum Likelihood Estimation: Given data points $(x_1, t_1), \dots, (x_N, t_N)$. Assumption: the points are drawn independently from $p$: $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma) = \prod_{i=1}^N \mathcal{N}(t_i; \mathbf{w}^T \boldsymbol{\phi}(x_i), \sigma^2)$. Instead of maximizing $p$, we can also maximize its logarithm (monotonicity of the logarithm).

Maximum Likelihood Estimation: $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^N (\mathbf{w}^T \boldsymbol{\phi}(x_i) - t_i)^2 - \frac{N}{2} \ln(2\pi\sigma^2)$. The second term is constant for all $\mathbf{w}$, and the first sum is equal (up to a factor) to the sum-of-squared-errors function $E(\mathbf{w})$. Thus, the parameters that maximize the likelihood are equal to the minimizer of the sum of squared errors.

Maximum Likelihood Estimation: $\mathbf{w}_{\mathrm{ML}} := \arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma) = \arg\min_{\mathbf{w}} E(\mathbf{w}) = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$. The ML solution is obtained using the pseudoinverse.