Introduction to Systems Analysis and Decision Making

Prepared by: Jakub Tomczak

1 Introduction. Random variables

During the course we are interested in reasoning about a considered phenomenon. In other words, we want to make a prediction for a new observation. The prediction is based on understanding the whole phenomenon or on imitating it. To formalize our considerations we use random variables.

Figure 1: Idea of representing a state of the world by relationships among different quantities.

We would like to measure our belief b(x) about the world's state x.

Cox axioms about b(x):

1. Strengths of belief (degrees of plausibility) are represented by real numbers, e.g., 0 ≤ b(x) ≤ 1.

2. Qualitative correspondence with common sense, i.e., b(x) + b(¬x) = 1.

3. Consistency: if a conclusion can be reasoned in several ways, then each way should lead to the same answer, i.e.,

    b(x, y | z) = b(x | z) b(y | x, z) = b(y | z) b(x | y, z).

It turns out that the belief function must satisfy the rules of probability theory:
sum rule:     p(x) = Σ_y p(x, y)

product rule: p(x, y) = p(x | y) p(y)

Let us consider an example for discrete random variables:

    p(x, y)   y = 1   y = 2 | p(x)
    x = 3      0.3     0.2  |  0.5
    x = 2      0.2     0.1  |  0.3
    x = 1      0.1     0.1  |  0.2
    p(y)       0.6     0.4  |

Figure 2: Exemplary discrete distributions.

Example of application of the product rule:

    p(x | y = 2) = p(x, y = 2) / p(y = 2)

    p(x, y)   y = 1   y = 2 | p(x | y = 2)
    x = 3      0.3     0.2  |    0.50
    x = 2      0.2     0.1  |    0.25
    x = 1      0.1     0.1  |    0.25
    p(y)       0.6     0.4  |

Figure 3: Exemplary application of the product rule.
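The sum and product rules applied to the tables above can be reproduced in a few lines. A minimal sketch in NumPy; the array simply transcribes Figure 2, with rows ordered x = 1, 2, 3 and columns y = 1, 2:

```python
import numpy as np

# joint distribution p(x, y) from Figure 2
p_xy = np.array([[0.1, 0.1],   # x = 1
                 [0.2, 0.1],   # x = 2
                 [0.3, 0.2]])  # x = 3

# sum rule: marginalize out the other variable
p_x = p_xy.sum(axis=1)   # [0.2, 0.3, 0.5]
p_y = p_xy.sum(axis=0)   # [0.6, 0.4]

# product rule, rearranged: p(x | y = 2) = p(x, y = 2) / p(y = 2)
p_x_given_y2 = p_xy[:, 1] / p_y[1]   # [0.25, 0.25, 0.5]

print(p_x, p_y, p_x_given_y2)
```

The conditional column matches Figure 3 once the rows are read in the same order.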
A probability distribution for continuous random variables is given by a probability density function (pdf). For a random variable taking values in (a, b):

    p(x ∈ (a, b)) = ∫_a^b p(x) dx

The integral of the pdf p(x) over its whole domain equals 1, and the pdf fulfills the rules of probability theory:

sum rule:     p(x) = ∫ p(x, y) dy

product rule: p(x, y) = p(x | y) p(y)

Figure 4: Exemplary pdf and cumulative distribution function (CDF).

2 Inference

We distinguish two kinds of random variables:

- input variables: x
- output variables: y

These variables have a joint distribution p(x, y), which is unknown. However, we assume that there is a dependency between x and y, and that this dependency can be approximated by a function y = f(x), i.e., for a given x there is exactly one value y.
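As a numerical aside to the continuous case above, the identity p(x ∈ (a, b)) = ∫_a^b p(x) dx, and the fact that the pdf integrates to 1, can be checked with a trapezoidal sum. The standard normal density is an assumed example, not part of the lecture:

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # density of N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def trapezoid(f_vals, dx):
    # composite trapezoidal rule on an evenly spaced grid
    return dx * (f_vals.sum() - 0.5 * (f_vals[0] + f_vals[-1]))

# whole-domain integral (the tails beyond +/-10 are negligible)
x = np.linspace(-10.0, 10.0, 200001)
total = trapezoid(normal_pdf(x), x[1] - x[0])        # ~ 1.0

# p(x in (-1, 1)) for the standard normal
x_ab = np.linspace(-1.0, 1.0, 20001)
p_ab = trapezoid(normal_pdf(x_ab), x_ab[1] - x_ab[0])  # ~ 0.6827

print(total, p_ab)
```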
Figure 5: Idea of inference, i.e., there is a dependency between inputs and outputs.

Determining y based on x is called decision making, inference or prediction. In order to find f(x) we aim at minimizing the risk functional:

    R[f] = ∫∫ L(y, f(x)) p(x, y) dx dy = E_{x,y}[ L(y, f(x)) ].

L denotes a loss function, e.g.:

    L(y, f(x)) = 1 if y ≠ f(x), 0 otherwise    (classification)

    L(y, f(x)) = (y − f(x))²                   (regression)

It can be shown that in order to minimize R[f] it is sufficient to minimize E_y[ L(y, f(x)) | x ] pointwise for each x, which yields:

    f*(x) = arg max_y p(y | x)                 (classification)

    f*(x) = E_y[ y | x ] = ∫ y p(y | x) dy     (regression)

3 Modeling

The most general fashion of representing the relation between x and y is the joint distribution p(x, y). The conditional distribution p(y | x), which is further used in inference, can be
expressed as follows:

    p(y | x) = p(x, y) / p(x) = p(x, y) / Σ_y p(x, y)

We assume that the real distribution p(x, y) can be modeled by p(x, y | θ), where the parameters θ are unknown. We know only the form of the model p(x, y | θ). For instance, p(x, y | θ) = N(x, y | µ, Σ) is a normal distribution with parameters θ = {µ, Σ}.

Figure 6: Idea of modeling.

Generative models: we aim at modeling p(x | y, θ) and p(y | θ). Then

    p(x, y | θ) = p(x | y, θ) p(y | θ),

and

    p(y | x, θ) = p(x | y, θ) p(y | θ) / Σ_y p(x | y, θ) p(y | θ).

Discriminative models: the conditional distribution of the output is modeled directly, p(y | x, θ).

Discriminant functions: the considered dependency is modeled as a function y = f(x; θ); we do not use probabilities.

4 Learning

There are N independent examples D = {(x_1, y_1), ..., (x_N, y_N)}, generated from the real distribution p(x, y).
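The generative recipe above (model p(x | y, θ) and p(y | θ), obtain p(y | x, θ) by Bayes' rule, then decide with f*(x) = arg max_y p(y | x) under the 0/1 loss from the previous section) can be sketched for one-dimensional Gaussian class-conditionals. All parameter values below are hypothetical:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # density of N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# hypothetical parameters theta for classes y in {0, 1}
mu = np.array([-1.0, 2.0])      # class-conditional means
sigma = np.array([1.0, 1.0])    # class-conditional std. deviations
prior = np.array([0.6, 0.4])    # p(y | theta)

def posterior(x):
    # p(y | x, theta) = p(x | y, theta) p(y | theta) / sum_y p(x | y, theta) p(y | theta)
    unnorm = normal_pdf(x, mu, sigma) * prior
    return unnorm / unnorm.sum()

def f_star(x):
    # optimal decision under the 0/1 loss: arg max_y p(y | x)
    return int(np.argmax(posterior(x)))

print(posterior(0.0), f_star(0.0), f_star(3.0))
```

Points near µ_0 = −1 are assigned class 0 and points near µ_1 = 2 class 1, with the prior shifting the boundary toward the less probable class.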
Learning aims at optimizing an objective function measuring the fit of p(x, y | θ) to the data D with respect to (w.r.t.) θ. We define the likelihood of the parameters for given data:

    p(D | θ) = ∏_{n=1}^N p(x_n, y_n | θ)

The likelihood determines the plausibility of generating the data D from the considered model with parameters θ.

The uncertainty of the parameters θ is modeled by an a priori distribution (prior) p(θ). The rules of probability theory (Bayes' rule) allow us to update the uncertainty of the parameters by including observations, i.e., one obtains an a posteriori distribution (posterior) of the following form:

    p(θ | D) = p(D | θ) p(θ) / p(D),

    posterior ∝ likelihood × prior.

It can be shown that if the data D_n, consisting of n data points, was generated from some true θ*, then under some regularity conditions, as long as p(θ*) > 0:

    lim_{n→∞} p(θ | D_n) = δ(θ − θ*)

Figure 7: Idea of including parameter uncertainty in modeling.

Frequentist learning: determination of a point estimate of the parameters θ:

maximum likelihood estimation (ML): θ_ML = arg max_θ p(D | θ),
maximum a posteriori estimation (MAP): θ_MAP = arg max_θ p(θ | D).

Bayesian learning: determination of the predictive distribution, i.e., marginalizing out the parameters:

    p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ,

where p(y | x, θ) is the model and p(θ | D) is the posterior.

5 Dynamical systems

So far, we have focused primarily on phenomena which are time-independent, i.e., data that were assumed to be independent and identically distributed (i.i.d.). For many applications, however, the i.i.d. assumption will be a poor one. Here we consider a particularly important class of such data sets, namely those that describe sequential data. These often arise through measurement of time series, for example the rainfall measurements on successive days at a particular location, or the daily values of a currency exchange rate, or the acoustic features at successive time frames used for speech recognition. Sequential data can also arise in contexts other than time series, for example the sequence of nucleotide base pairs along a strand of DNA or the sequence of characters in an English sentence.

It is useful to distinguish between stationary and nonstationary sequential distributions. In the stationary case, the data evolve in time, but the distribution from which they are generated remains the same. For the more complex nonstationary situation, the generative distribution itself is evolving with time.

There are different ways to model sequential data, for example:

deterministic modelling:

- differential equations (continuous domain): dx/dt = f(x)
- difference equations (discrete domain): x_{n+1} = f(x_n)

probabilistic modelling:
Markov models, i.e., the distribution over the current state depends on the previous ones; for instance, a first-order Markov model:

    p(x_{n+1} | x_1, ..., x_n) = p(x_{n+1} | x_n)

and its likelihood function:

    p(x_1, ..., x_N) = p(x_1) ∏_{n=2}^N p(x_n | x_{n−1})

Dynamical systems (noises: η_x, η_y):

    x_{n+1} = f(x_n, η_x)
    y_{n+1} = g(x_{n+1}, η_y)

and their special case, linear dynamical systems (where we assume Gaussian noises η_x and η_y):

    p(x_{n+1} | x_n) = N(x_{n+1} | A x_n, Σ_x)
    p(y_{n+1} | x_{n+1}) = N(y_{n+1} | B x_{n+1}, Σ_y)
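A linear dynamical system of this form can be simulated directly from its two equations. A minimal one-dimensional sketch; the scalar parameters A, B and the noise scales below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

A, B = 0.9, 1.0               # transition and emission coefficients
sigma_x, sigma_y = 0.1, 0.2   # std. deviations of the noises eta_x, eta_y
N = 100

x = np.zeros(N)               # latent states x_n
y = np.zeros(N)               # observations y_n
for n in range(1, N):
    # p(x_{n+1} | x_n) = N(x_{n+1} | A x_n, sigma_x^2)
    x[n] = A * x[n - 1] + rng.normal(0.0, sigma_x)
    # p(y_{n+1} | x_{n+1}) = N(y_{n+1} | B x_{n+1}, sigma_y^2)
    y[n] = B * x[n] + rng.normal(0.0, sigma_y)

print(x[:5], y[:5])
```

With |A| < 1 the latent process is stationary: it fluctuates around zero with variance σ_x² / (1 − A²) instead of drifting away.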