Introduction to Systems Analysis and Decision Making
Prepared by: Jakub Tomczak


1 Introduction. Random variables

During the course we are interested in reasoning about a considered phenomenon. In other words, we want to make predictions for new observations. A prediction is based on understanding the whole phenomenon or on imitating it. To formalize our considerations we use random variables.

Figure 1: Idea of representing a state of the world by relationships among different quantities.

We would like to measure our belief about the world's state x. The Cox axioms for a belief function b(x):

1. Strengths of belief (degrees of plausibility) are represented by real numbers, e.g., 0 ≤ b(x) ≤ 1.

2. Qualitative correspondence with common sense, e.g., b(x) + b(¬x) = 1.

3. Consistency: if a conclusion can be reasoned in several ways, then each way should lead to the same answer, e.g., b(x, y | z) = b(x | z) b(y | x, z) = b(y | z) b(x | y, z).

It turns out that a belief function satisfying these axioms must obey the rules of probability theory:

sum rule: p(x) = Σ_y p(x, y)

product rule: p(x, y) = p(x | y) p(y)

Let us consider an example for discrete random variables:

p(x, y)   y = 1   y = 2   p(x)
x = 3     0.3     0.2     0.5
x = 2     0.2     0.1     0.3
x = 1     0.1     0.1     0.2
p(y)      0.6     0.4

Figure 2: Exemplary discrete distributions.

Example of application of the product rule: p(x | y = 2) = p(x, y = 2) / p(y = 2)

          p(x, y = 2)   p(x | y = 2)
x = 3     0.2           0.5
x = 2     0.1           0.25
x = 1     0.1           0.25
p(y = 2)  0.4

Figure 3: Exemplary application of the product rule.
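The sum and product rules applied to the table in Figure 2 can be checked numerically. A minimal sketch in Python using NumPy (the variable names are our own):

```python
import numpy as np

# Joint distribution p(x, y) from Figure 2; rows: x = 1, 2, 3, columns: y = 1, 2.
p_xy = np.array([[0.1, 0.1],
                 [0.2, 0.1],
                 [0.3, 0.2]])

# Sum rule: marginals are obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)  # marginal p(x): [0.2, 0.3, 0.5]
p_y = p_xy.sum(axis=0)  # marginal p(y): [0.6, 0.4]

# Product rule, rearranged: conditional p(x | y = 2) = p(x, y = 2) / p(y = 2).
p_x_given_y2 = p_xy[:, 1] / p_y[1]  # [0.25, 0.25, 0.5]

print(p_x, p_y, p_x_given_y2)
```

The conditional column sums to 1, as any distribution must.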

A probability distribution for continuous random variables is given by a probability density function (PDF). If we are interested in a random variable taking values in an interval (a, b):

p(x ∈ (a, b)) = ∫_a^b p(x) dx

The integral of the PDF p(x) over its whole domain equals 1, and the PDF fulfills the rules of probability theory:

sum rule: p(x) = ∫ p(x, y) dy

product rule: p(x, y) = p(x | y) p(y)

Figure 4: Exemplary PDF and cumulative distribution function (CDF).

2 Inference

We distinguish two kinds of random variables:

input variables: x,
output variables: y.

These variables have a joint distribution p(x, y), which is unknown. However, we assume that there is a dependency between x and y, and that this dependency can be approximated by a function y = f(x), i.e., for a given x there is exactly one value y.
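The interval probability p(x ∈ (a, b)) defined earlier can be approximated by numerical integration. A small sketch for a standard normal PDF (the quadrature routine is our own, chosen for simplicity):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=100_000):
    """Simple midpoint-rule quadrature of f over (a, b)."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# The PDF integrates to (approximately) 1 over a wide enough interval.
total = integrate(normal_pdf, -10, 10)

# p(x in (-1, 1)) for a standard normal is about 0.6827.
p_interval = integrate(normal_pdf, -1, 1)

print(round(total, 4), round(p_interval, 4))
```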

Figure 5: Idea of inference, i.e., there is a dependency between inputs and outputs.

Determining y based on x is called decision making, inference or prediction. In order to find f(x) we aim at minimizing the risk functional:

R[f] = ∫∫ L(y, f(x)) p(x, y) dx dy = E_{x,y}[L(y, f(x))],

where L denotes a loss function, for example:

L(y, f(x)) = 1 if y ≠ f(x), and 0 otherwise   (classification)

L(y, f(x)) = (y − f(x))²   (regression)

It can be shown that in order to minimize R[f] it is sufficient to minimize E_y[L(y, f(x)) | x] for each x separately. This yields the optimal predictions:

f*(x) = arg max_y p(y | x)   (classification)

f*(x) = E_y[y | x] = ∫ y p(y | x) dy   (regression)

3 Modeling

The most general fashion of representing the relation between x and y is the joint distribution p(x, y). The conditional distribution p(y | x), which is further used in inference, can be

expressed as follows:

p(y | x) = p(x, y) / p(x) = p(x, y) / Σ_y' p(x, y')

We assume that the real distribution p(x, y) can be modeled by p(x, y | θ*), where the parameters θ* are unknown; we know only the form of the model p(x, y | θ). For instance, p(x, y | θ) = N(x, y | μ, Σ) is a normal distribution with parameters θ = {μ, Σ}.

Figure 6: Idea of modeling.

Generative models: we aim at modeling p(x | y, θ) and p(y | θ). Then p(x, y | θ) = p(x | y, θ) p(y | θ), and

p(y | x, θ) = p(x | y, θ) p(y | θ) / Σ_y' p(x | y', θ) p(y' | θ).

Discriminative models: the conditional distribution of the output, p(y | x, θ), is modeled directly.

Discriminant functions: the considered dependency is modeled as a function y = f(x; θ); we do not use probabilities.

4 Learning

There are N independent examples D = {(x_1, y_1), ..., (x_N, y_N)}, generated from the real distribution p(x, y).
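The generative route described above can be sketched numerically: given class priors p(y | θ) and class-conditional densities p(x | y, θ), Bayes' rule yields the posterior p(y | x, θ) used for inference. All numbers below are illustrative assumptions, not taken from the notes:

```python
import math

# Assumed model: two classes with Gaussian class-conditionals p(x | y) and priors p(y).
priors = [0.6, 0.4]                    # p(y = 0), p(y = 1)
means, sigmas = [0.0, 2.0], [1.0, 1.0]

def gauss(x, mu, sigma):
    """Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    """p(y | x) = p(x | y) p(y) / sum_y' p(x | y') p(y')."""
    joint = [gauss(x, m, s) * p for m, s, p in zip(means, sigmas, priors)]
    z = sum(joint)  # evidence p(x)
    return [j / z for j in joint]

# Halfway between the class means the likelihoods are equal,
# so the posterior reduces to the priors.
post = posterior(1.0)
print(post)
```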

Learning aims at optimizing an objective function measuring the fit of p(x, y | θ) to the data D with respect to (wrt) θ. We define the likelihood of the parameters for the given data:

p(D | θ) = ∏_{n=1}^{N} p(x_n, y_n | θ)

The likelihood determines the plausibility of generating the data D from the considered model with parameters θ. The uncertainty about the parameters θ is modeled by an a priori distribution (prior) p(θ). The rules of probability theory (Bayes' rule) allow us to update the uncertainty about the parameters by including observations, i.e., one obtains an a posteriori distribution (posterior) of the following form:

p(θ | D) = p(D | θ) p(θ) / p(D),   i.e., posterior ∝ likelihood × prior

It can be shown that if the data D_n, consisting of n data points, was generated from some true θ*, then under some regularity conditions, as long as p(θ*) > 0:

lim_{n→∞} p(θ | D_n) = δ(θ − θ*)

Figure 7: Idea of including parameter uncertainty in modeling.

Frequentist learning: determination of a point estimate of the parameters θ:

maximum likelihood estimation (ML): θ_ML = arg max_θ p(D | θ),

maximum a posteriori estimation (MAP): θ_MAP = arg max_θ p(θ | D).

Bayesian learning: determination of the predictive distribution, i.e., marginalizing out the parameters:

p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ,

where p(y | x, θ) is the model and p(θ | D) is the posterior.

5 Dynamical systems

So far, we have focused primarily on phenomena which are time-independent, i.e., data that were assumed to be independent and identically distributed (i.i.d.). For many applications, however, the i.i.d. assumption will be a poor one. Here we consider a particularly important class of such data sets, namely those that describe sequential data. These often arise through measurements of time series, for example rainfall measurements on successive days at a particular location, the daily values of a currency exchange rate, or the acoustic features at successive time frames used for speech recognition. Sequential data can also arise in contexts other than time series, for example the sequence of nucleotide base pairs along a strand of DNA or the sequence of characters in an English sentence.

It is useful to distinguish between stationary and nonstationary sequential distributions. In the stationary case, the data evolve in time, but the distribution from which they are generated remains the same. In the more complex nonstationary situation, the generative distribution itself evolves with time.

There are different ways to model sequential data, for example:

deterministic modelling:

differential equations (continuous domain): dx/dt = f(x)

difference equations (discrete domain): x_{n+1} = f(x_n)

probabilistic modelling:

Markov models, i.e., the distribution over the current state depends on the previous ones; for instance, a first-order Markov model:

p(x_{n+1} | x_1, ..., x_n) = p(x_{n+1} | x_n)

and its likelihood function:

p(x_1, ..., x_N) = p(x_1) ∏_{n=2}^{N} p(x_n | x_{n−1})

Dynamical systems (with noises η_x, η_y):

x_{n+1} = f(x_n, η_x)
y_{n+1} = g(x_{n+1}, η_y)

and their special case, linear dynamical systems, where we assume Gaussian noises η_x and η_y:

p(x_{n+1} | x_n) = N(x_{n+1} | A x_n, Σ_x)
p(y_{n+1} | x_{n+1}) = N(y_{n+1} | B x_{n+1}, Σ_y)
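A linear dynamical system of this form can be simulated directly by sampling the two Gaussian noise terms at each step. A minimal one-dimensional sketch (A, B and the noise standard deviations below are illustrative choices, not from the notes):

```python
import random

random.seed(0)

# Illustrative 1-D linear dynamical system:
#   x_{n+1} = A * x_n + eta_x,      eta_x ~ N(0, sigma_x^2)   (latent state)
#   y_{n+1} = B * x_{n+1} + eta_y,  eta_y ~ N(0, sigma_y^2)   (observation)
A, B = 0.9, 1.0
sigma_x, sigma_y = 0.1, 0.2

def simulate(n_steps, x0=1.0):
    """Sample a state trajectory and its noisy observations."""
    xs, ys = [x0], []
    for _ in range(n_steps):
        x_next = A * xs[-1] + random.gauss(0.0, sigma_x)
        y_next = B * x_next + random.gauss(0.0, sigma_y)
        xs.append(x_next)
        ys.append(y_next)
    return xs, ys

xs, ys = simulate(50)
print(len(xs), len(ys))
```

Because |A| < 1 the state decays toward zero on average, so the sampled trajectory stays bounded; such a model is the setting in which the Kalman filter performs exact inference over the latent states.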