Machine Learning: Basic Principles. Bayesian Networks. Laboratory of Computer and Information Science (CIS), Department of Computer Science and Engineering, Helsinki University of Technology (TKK). Autumn 2007

Outline: Bayesian Networks (Reminders; Inference; Finding the Structure of the Network); Estimates from Posterior; Bias and Variance; Conclusion.

Rules of Probability. P(E, F) = P(F, E): probability of both E and F happening. P(E) = Σ_F P(E, F) (sum rule, marginalization). P(E, F) = P(F | E) P(E) (product rule, conditional probability). Consequence: P(F | E) = P(E | F) P(F) / P(E) (Bayes' formula). We say E and F are independent if P(E, F) = P(E) P(F) (for all E and F). We say E and F are conditionally independent given G if P(E, F | G) = P(E | G) P(F | G), or equivalently P(E | F, G) = P(E | G).
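The rules above can be checked mechanically on any small joint distribution. Below is a minimal Python sketch (not from the lecture; the joint table is made up) that verifies the sum rule, the product rule and Bayes' formula on a two-variable example.

```python
# Minimal sketch: checking the sum rule, product rule and Bayes' formula
# on a small hand-made joint distribution P(E, F). Numbers are illustrative.

import itertools

# Joint distribution over two binary events E and F, P(E, F).
P_EF = {(1, 1): 0.20, (1, 0): 0.30,
        (0, 1): 0.10, (0, 0): 0.40}

# Sum rule (marginalization): P(E) = sum_F P(E, F).
P_E = {e: sum(P_EF[(e, f)] for f in (0, 1)) for e in (0, 1)}
P_F = {f: sum(P_EF[(e, f)] for e in (0, 1)) for f in (0, 1)}

# Product rule: P(E, F) = P(F | E) P(E)  =>  P(F | E) = P(E, F) / P(E).
P_F_given_E = {(f, e): P_EF[(e, f)] / P_E[e] for e in (0, 1) for f in (0, 1)}
P_E_given_F = {(e, f): P_EF[(e, f)] / P_F[f] for e in (0, 1) for f in (0, 1)}

# Bayes' formula: P(F | E) = P(E | F) P(F) / P(E).
for e, f in itertools.product((0, 1), (0, 1)):
    lhs = P_F_given_E[(f, e)]
    rhs = P_E_given_F[(e, f)] * P_F[f] / P_E[e]
    assert abs(lhs - rhs) < 1e-12

print("P(E):", P_E)
print("P(F | E=1):", {f: P_F_given_E[(f, 1)] for f in (0, 1)})
```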

Bayesian Networks. A Bayesian network is a directed acyclic graph (DAG) that describes a joint distribution over the vertices X_1, ..., X_d such that P(X_1, ..., X_d) = Π_{i=1}^d P(X_i | parents(X_i)), where parents(X_i) is the set of vertices from which there is an edge to X_i. Example: the network C → A, C → B gives P(A, B, C) = P(A | C) P(B | C) P(C). (A and B are conditionally independent given C.)
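As an illustration of the factorization, the following sketch builds the joint P(A, B, C) = P(A | C) P(B | C) P(C) for the small network above from hypothetical CPT numbers, and checks that A and B are conditionally independent given C even though they are dependent marginally.

```python
# Minimal sketch: the network C -> A, C -> B from the slide. The joint
# factorizes as P(A, B, C) = P(A|C) P(B|C) P(C). CPT numbers are made up.

from itertools import product

P_C = {1: 0.4, 0: 0.6}
P_A_given_C = {1: 0.7, 0: 0.2}   # P(A=1 | C=c)
P_B_given_C = {1: 0.9, 0: 0.3}   # P(B=1 | C=c)

def bern(p, x):
    """P(X = x) for a binary variable with P(X=1) = p."""
    return p if x == 1 else 1.0 - p

# Build the full joint from the factorization.
joint = {(a, b, c): bern(P_A_given_C[c], a) * bern(P_B_given_C[c], b) * P_C[c]
         for a, b, c in product((0, 1), repeat=3)}
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Conditional independence: P(A, B | C) = P(A | C) P(B | C).
for a, b, c in product((0, 1), repeat=3):
    p_ab_given_c = joint[(a, b, c)] / P_C[c]
    assert abs(p_ab_given_c -
               bern(P_A_given_C[c], a) * bern(P_B_given_C[c], b)) < 1e-12

# But A and B are NOT independent marginally: P(A, B) != P(A) P(B).
pA = sum(joint[(1, b, c)] for b in (0, 1) for c in (0, 1))
pB = sum(joint[(a, 1, c)] for a in (0, 1) for c in (0, 1))
pAB = sum(joint[(1, 1, c)] for c in (0, 1))
print("P(A)P(B) =", round(pA * pB, 3), " vs  P(A, B) =", round(pAB, 3))
```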


Inference in Bayesian Networks. When the structure of the Bayesian network and the probability factors are known, one usually wants to do inference by computing conditional probabilities. This can be done with the help of the sum and product rules. Example: what is the probability of the cat being on the roof if it is cloudy, P(F | C)? The network has the vertices Cloudy (C), Sprinkler (S), Rain (R), Wet grass (W) and Roof (F); C is a parent of S and R, S and R are parents of W, and R is a parent of F. The probability factors are P(C) = 0.5, P(S | C) = 0.1, P(S | ¬C) = 0.5, P(R | C) = 0.8, P(R | ¬C) = 0.1, P(W | R, S) = 0.95, P(W | R, ¬S) = 0.90, P(W | ¬R, S) = 0.90, P(W | ¬R, ¬S) = 0.10, P(F | R) = 0.1, P(F | ¬R) = 0.7. Figure 3.5 of Alpaydin (2004): rain not only makes the grass wet but also disturbs the cat, who normally makes noise on the roof. From: E. Alpaydın, 2004, Introduction to Machine Learning, The MIT Press.

Inference in Bayesian Networks. Example: probability of the cat being on the roof if it is cloudy, P(F | C)? S, R and W are unknown or hidden variables; F and C are observed variables. Conventionally, we denote the observed variables by gray nodes. We use the product rule P(F | C) = P(F, C) / P(C), where P(C) = Σ_F P(F, C). We must sum over, or marginalize over, the hidden variables S, R and W: P(F, C) = Σ_{S,R,W} P(C, S, R, W, F), where P(C, S, R, W, F) = P(F | R) P(W | S, R) P(S | C) P(R | C) P(C).

Inference in Bayesian Networks. Writing the sum out over all configurations of the hidden variables, P(F, C) = P(C, S, R, W, F) + P(C, S, R, ¬W, F) + P(C, S, ¬R, W, F) + P(C, S, ¬R, ¬W, F) + P(C, ¬S, R, W, F) + P(C, ¬S, R, ¬W, F) + P(C, ¬S, ¬R, W, F) + P(C, ¬S, ¬R, ¬W, F). We obtain similar formulas for P(F, ¬C), P(¬F, C) and P(¬F, ¬C). Notice: we have used the shorthand F to denote F = 1 and ¬F to denote F = 0. In principle, we know the numeric value of each joint probability term, P(C, S, R, W, F) = P(F | R) P(W | S, R) P(S | C) P(R | C) P(C), hence we can compute the probabilities.

Inference in Bayesian Networks. There are 2^5 terms in the sums. Generally, marginalization is NP-hard; the most straightforward approach would involve a computation of O(2^d) terms. We can often do better by smartly re-arranging the sums and products. Behold: do the marginalization over W first: P(C, S, R, F) = Σ_W P(F | R) P(W | S, R) P(S | C) P(R | C) P(C) = P(F | R) [Σ_W P(W | S, R)] P(S | C) P(R | C) P(C) = P(F | R) P(S | C) P(R | C) P(C), since Σ_W P(W | S, R) = 1.

Inference in Bayesian Networks. Now we can marginalize over S easily: P(C, R, F) = Σ_S P(F | R) P(S | C) P(R | C) P(C) = P(F | R) [Σ_S P(S | C)] P(R | C) P(C) = P(F | R) P(R | C) P(C). We must still marginalize over R: P(C, F) = P(F | R) P(R | C) P(C) + P(F | ¬R) P(¬R | C) P(C) = 0.1 · 0.8 · 0.5 + 0.7 · 0.2 · 0.5 = 0.11. P(C, ¬F) = P(¬F | R) P(R | C) P(C) + P(¬F | ¬R) P(¬R | C) P(C) = 0.9 · 0.8 · 0.5 + 0.3 · 0.2 · 0.5 = 0.39. P(C) = P(C, F) + P(C, ¬F) = 0.5. P(F | C) = P(C, F) / P(C) = 0.22. P(¬F | C) = P(C, ¬F) / P(C) = 0.78.
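The whole calculation can be reproduced by brute-force enumeration. The sketch below uses the CPT values given on the slides and recovers P(F | C) = 0.22; it is a direct implementation of the sums above, not an efficient inference algorithm.

```python
# Minimal brute-force inference sketch for the cloudy/sprinkler/rain network,
# using the CPT values from the slides. It marginalizes over the hidden
# variables S, R, W by direct enumeration and recovers P(F | C) = 0.22.

from itertools import product

P_C = 0.5
P_S_given_C = {1: 0.1, 0: 0.5}               # P(S=1 | C)
P_R_given_C = {1: 0.8, 0: 0.1}               # P(R=1 | C)
P_W_given_RS = {(1, 1): 0.95, (1, 0): 0.90,  # P(W=1 | R, S)
                (0, 1): 0.90, (0, 0): 0.10}
P_F_given_R = {1: 0.1, 0: 0.7}               # P(F=1 | R)

def bern(p, x):
    return p if x == 1 else 1.0 - p

def joint(c, s, r, w, f):
    """P(C, S, R, W, F) = P(F|R) P(W|S,R) P(S|C) P(R|C) P(C)."""
    return (bern(P_F_given_R[r], f) * bern(P_W_given_RS[(r, s)], w) *
            bern(P_S_given_C[c], s) * bern(P_R_given_C[c], r) * bern(P_C, c))

def p_fc(f, c):
    """P(F, C): sum the 2^3 = 8 configurations of the hidden S, R, W."""
    return sum(joint(c, s, r, w, f) for s, r, w in product((0, 1), repeat=3))

p_C = p_fc(1, 1) + p_fc(0, 1)                     # = 0.5
print("P(F, C) =", round(p_fc(1, 1), 4))          # 0.11
print("P(F | C) =", round(p_fc(1, 1) / p_C, 4))   # 0.22
```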

Bayesian Networks: Inference. To do inference in Bayesian networks one has to marginalize over variables, for example P(X_1) = Σ_{X_2} ... Σ_{X_d} P(X_1, ..., X_d). If we have Boolean arguments, the sum has O(2^d) terms. This is inefficient! Generally, marginalization is an NP-hard problem. If the Bayesian network is a tree: Sum-Product Algorithm (a special case being Belief Propagation). If the Bayesian network is close to a tree: Junction Tree Algorithm. Otherwise: approximate methods (variational approximation, MCMC, etc.).

Sum-Product Algorithm. Idea: a sum of products is difficult to compute; a product of sums is easy to compute, if the sums have been re-arranged smartly. Example: a disconnected Bayesian network with d vertices, computing P(X_1). Sum of products: P(X_1) = Σ_{X_2} ... Σ_{X_d} P(X_1) ... P(X_d). Product of sums: P(X_1) = P(X_1) (Σ_{X_2} P(X_2)) ... (Σ_{X_d} P(X_d)) = P(X_1), since each sum equals one. The Sum-Product Algorithm works if the Bayesian network is a directed tree. For details, see e.g. Bishop (2006).

Sum-Product Algorithm Example. Network: D → A, D → B, D → C, so P(A, B, C, D) = P(A | D) P(B | D) P(C | D) P(D). Task: compute P(D) = Σ_A Σ_B Σ_C P(A, B, C, D).

Sum-Product Algorithm Example. The factor graph is composed of vertices (ellipses) and factors (squares), describing the factors P(A | D), P(B | D), P(C | D) and P(D) of the joint probability P(A, B, C, D) = P(A | D) P(B | D) P(C | D) P(D). The Sum-Product Algorithm re-arranges the product (check!): P(D) = (Σ_A P(A | D)) (Σ_B P(B | D)) (Σ_C P(C | D)) P(D) = Σ_A Σ_B Σ_C P(A, B, C, D). (1)
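A quick numerical check of equation (1): for the star-shaped network the sum of products over all leaf configurations equals the product of sums, each of which is one. The CPT numbers below are made up for illustration.

```python
# Minimal sketch of the rearrangement in equation (1): for the network
# D -> A, B, C each sum over a leaf can be pushed inside the product, so
# P(D) = (sum_A P(A|D)) (sum_B P(B|D)) (sum_C P(C|D)) P(D) = P(D).
# CPT numbers are made up; all variables are binary.

from itertools import product

P_D = {1: 0.3, 0: 0.7}
P_A_given_D = {1: 0.9, 0: 0.4}   # P(A=1 | D)
P_B_given_D = {1: 0.2, 0: 0.6}
P_C_given_D = {1: 0.5, 0: 0.8}

def bern(p, x):
    return p if x == 1 else 1.0 - p

for d in (0, 1):
    # Brute force: sum of products over the 2^3 leaf configurations.
    brute = sum(bern(P_A_given_D[d], a) * bern(P_B_given_D[d], b) *
                bern(P_C_given_D[d], c) * P_D[d]
                for a, b, c in product((0, 1), repeat=3))
    # Rearranged: product of sums; each leaf sum equals one.
    sums = [sum(bern(cpt[d], x) for x in (0, 1))
            for cpt in (P_A_given_D, P_B_given_D, P_C_given_D)]
    rearranged = sums[0] * sums[1] * sums[2] * P_D[d]
    assert abs(brute - rearranged) < 1e-12
    print(f"P(D={d}) = {brute:.4f}")
```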

Observations. A Bayesian network forms a partial order of the vertices. To find (one) total ordering of the vertices: remove a vertex with no outgoing edges (zero out-degree) from the network and output the vertex; iterate until the network is empty. (This way you can also check that the network is a DAG.) If all variables are Boolean, storing a full Bayesian network of d vertices, or the full joint distribution, as a look-up table takes O(2^d) bytes. If the highest number of incoming edges (in-degree) is k, then storing a Bayesian network of d vertices as a look-up table takes O(d 2^k) bytes. When computing marginals, disconnected parts of the network do not contribute. Conditional independence is easy to see.
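The ordering procedure described above is easy to implement. The sketch below repeatedly removes a zero out-degree vertex, returns None if none exists (the graph then has a cycle), and otherwise outputs a total order in which every vertex appears before its parents; the example graph is the cloudy/sprinkler network from the earlier slides.

```python
# Minimal sketch of the ordering procedure from the slide: repeatedly remove
# a vertex with zero out-degree and output it; if no such vertex exists before
# the graph is empty, the graph has a cycle and is not a DAG.

def sink_ordering(edges, vertices):
    """edges: (parent, child) pairs. Returns a total ordering with every
    vertex listed before its parents, or None if the graph has a cycle."""
    remaining = list(vertices)
    edges = set(edges)
    order = []
    while remaining:
        # A "sink" has no outgoing edge among the remaining vertices.
        sinks = [v for v in remaining if all(p != v for p, _ in edges)]
        if not sinks:
            return None  # every remaining vertex has an outgoing edge: cycle
        v = sinks[0]
        order.append(v)
        remaining.remove(v)
        edges = {(p, c) for p, c in edges if v not in (p, c)}
    return order

vertices = ["C", "S", "R", "W", "F"]
edges = [("C", "S"), ("C", "R"), ("S", "W"), ("R", "W"), ("R", "F")]
print(sink_ordering(edges, vertices))                       # ['W', 'S', 'F', 'R', 'C']
print(sink_ordering([("A", "B"), ("B", "A")], ["A", "B"]))  # None: not a DAG
```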

Bayesian Network: Classification. [Figure from the Alpaydin (2004) Ch. 3 lecture slides: a Bayesian network for classification. Lecture notes for E. Alpaydın, 2004, Introduction to Machine Learning, The MIT Press.]

Naive Bayes Classifier. The class C is the parent of the inputs x_1, x_2, ..., x_d, with factors P(C) and p(x_i | C). The X_i are conditionally independent given C: P(X, C) = P(x_1 | C) P(x_2 | C) ... P(x_d | C) P(C). Figure 3.7 of Alpaydin (2004): the naive Bayes classifier is a Bayesian network for classification assuming independent inputs. From: E. Alpaydın, 2004, Introduction to Machine Learning, The MIT Press.

Naive Bayes Classifier. The same model can be drawn either with the nodes X_1, ..., X_N written out, or equivalently with a plate around X. A plate is used as a shorthand notation for repetition; the number of repetitions is given in the bottom right corner. Gray nodes denote observed variables.
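A minimal naive Bayes classifier following this factorization can be written in a few lines. The training data and the add-one (Laplace) smoothing below are illustrative choices, not from the lecture.

```python
# Minimal sketch of a naive Bayes classifier with binary inputs, matching the
# factorization P(x, C) = P(x_1|C) ... P(x_d|C) P(C). Toy data and Laplace
# smoothing are illustrative choices.

import numpy as np

def fit_naive_bayes(X, y):
    """X: (N, d) array of 0/1 features, y: (N,) array of 0/1 class labels."""
    classes = np.unique(y)
    prior = np.array([np.mean(y == c) for c in classes])
    # P(x_j = 1 | C = c), with add-one (Laplace) smoothing.
    cond = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                     for c in classes])
    return classes, prior, cond

def predict(x, classes, prior, cond):
    # log P(C) + sum_j log P(x_j | C), then pick the most probable class.
    log_post = (np.log(prior) +
                (np.log(cond) * x + np.log(1 - cond) * (1 - x)).sum(axis=1))
    return classes[np.argmax(log_post)]

# Toy data: class 1 tends to have features 0 and 1 switched on.
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
model = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0]), *model))  # expected: 1
```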


Finding the Structure of the Network. Often the network structure is given by an expert. In probabilistic modeling, the network structure defines the structure of the model. Finding an optimal Bayesian network structure is NP-hard. Idea: go through all possible network structures M and compute the likelihood of the data X given the network structure, P(X | M). Choose the network complexity appropriately, and choose the network that, for a given complexity, gives the best likelihood. The Bayesian approach: choose the structure M that maximizes P(M | X) ∝ P(X | M) P(M), where P(M) is a prior probability for the network structure M (more complex networks should have smaller prior probability).
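A toy version of this scoring idea, for just two binary variables and two candidate structures (X1 and X2 independent, versus X1 → X2): assuming uniform Beta(1, 1) priors on all CPT parameters (an assumption made here for simplicity), the marginal likelihood P(X | M) has a closed form and the structures can be compared directly. The data set below is made up.

```python
# Minimal sketch of Bayesian structure scoring for two binary variables:
# compare M0 (X1 and X2 independent) with M1 (X1 -> X2) via the marginal
# likelihood P(X | M), using uniform Beta(1,1) priors on all CPT parameters
# (an assumption for this illustration). Then P(M|X) is proportional to
# P(X|M) P(M).

from math import lgamma, exp

def log_ml_bernoulli(n1, n0):
    """log of the integral of theta^n1 (1-theta)^n0 d(theta) with a uniform
    prior, i.e. log( n1! n0! / (n1+n0+1)! )."""
    return lgamma(n1 + 1) + lgamma(n0 + 1) - lgamma(n1 + n0 + 2)

# Toy data set of (x1, x2) pairs where x2 tends to follow x1.
data = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 0), (0, 0), (0, 1), (0, 0)]
n = len(data)
c1 = sum(x1 for x1, _ in data)                    # counts of x1 = 1
c2 = sum(x2 for _, x2 in data)                    # counts of x2 = 1
c2_given = {v: [sum(1 for x1, x2 in data if x1 == v and x2 == 1),
                sum(1 for x1, x2 in data if x1 == v and x2 == 0)]
            for v in (0, 1)}

# M0: P(X|M0) = ML(x1 counts) * ML(x2 counts).
log_m0 = log_ml_bernoulli(c1, n - c1) + log_ml_bernoulli(c2, n - c2)
# M1: P(X|M1) = ML(x1 counts) * product over x1 values of ML(x2 counts | x1).
log_m1 = (log_ml_bernoulli(c1, n - c1) +
          sum(log_ml_bernoulli(*c2_given[v]) for v in (0, 1)))

print("log P(X|M0) =", round(log_m0, 3))
print("log P(X|M1) =", round(log_m1, 3))
print("Bayes factor M1 vs M0 =", round(exp(log_m1 - log_m0), 3))
```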

Finding a Network. A full Bayesian network of d vertices and d(d−1)/2 edges describes the training set fully and the test set probably poorly. As before, in finding the network structure we must control the complexity so that the model generalizes. Usually one must resort to approximate solutions to find the network structure (e.g., the deal package in R). A feasible exact algorithm exists for up to d = 32 variables, with a running time of o(d^2 2^(d−2)); see Silander and Myllymäki (2006), A Simple Approach for Finding the Globally Optimal Bayesian Network Structure, in Proc. 22nd UAI.

Finding a Network. The network over the attributes AirTemp, Humidity, Wind, Sky, EnjoySport, Water and Forecast found by Bene at http://b-course.hiit.fi/bene from the training data:

t  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1  Sunny  Warm     Normal    Strong  Warm   Same      1
2  Sunny  Warm     High      Strong  Warm   Same      1
3  Rainy  Cold     High      Strong  Warm   Change    0
4  Sunny  Warm     High      Strong  Cool   Change    1


Boys or Girls? Figure: sex ratio by country for the population aged below 15. Blue represents more women, red more men, than the world average of 1.06 males per female. Image from Wikimedia Commons, author Dbachmann, GFDL v1.2.

The world average probability that a newborn child is a boy (X = 1) is about θ = 0.512; the probability of a girl (X = 0) is then 1 − θ = 0.488. Bernoulli process: P(X = x | θ) = θ^x (1 − θ)^(1−x), x ∈ {0, 1}. Assume we observe the genders of N newborn children, X = {x^t}_{t=1}^N. What is the sex ratio? Joint distribution: P(x^1, ..., x^N, θ) = P(x^1 | θ) ... P(x^N | θ) p(θ). Notice we must fix some prior for θ, p(θ). Graphically, θ is a parent of each x^t; equivalently, a plate over X repeated N times.


Comparing Models. The likelihood ratio (Bayes factor) is defined by BF(θ_2; θ_1) = P(X | θ_2) / P(X | θ_1). If we believe before seeing any data that the probability of model θ_1 is P(θ_1) and of model θ_2 is P(θ_2), then the ratio of their posterior probabilities is P(θ_2 | X) / P(θ_1 | X) = [P(θ_2) / P(θ_1)] BF(θ_2; θ_1). This ratio allows us to compare our degrees of belief in two models. The posterior probability density allows us to compare our degrees of belief among an infinite number of models after observing the data.
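A minimal sketch of a Bayes factor computation for two point hypotheses about the Bernoulli parameter, θ1 = 0.512 and θ2 = 0.55; the simulated data and the equal prior odds are illustrative assumptions.

```python
# Minimal sketch: Bayes factor between two point hypotheses for the Bernoulli
# parameter, theta1 = 0.512 (world average) and theta2 = 0.55. The simulated
# data and the 50/50 prior odds are illustrative assumptions.

import random
from math import log, exp

random.seed(0)
theta_true = 0.55
X = [1 if random.random() < theta_true else 0 for _ in range(1000)]

def log_lik(theta, data):
    """log P(X | theta) for Bernoulli observations."""
    return sum(log(theta) if x == 1 else log(1.0 - theta) for x in data)

theta1, theta2 = 0.512, 0.55
log_bf = log_lik(theta2, X) - log_lik(theta1, X)   # log BF(theta2; theta1)
prior_odds = 1.0                                   # P(theta2)/P(theta1), assumed equal
posterior_odds = prior_odds * exp(log_bf)          # P(theta2|X) / P(theta1|X)
print("log BF(theta2; theta1) =", round(log_bf, 3))
print("posterior odds =", round(posterior_odds, 3))
```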

Discrete vs. Continuous Random Variables. The Bernoulli parameter θ is a real number in [0, 1]. Previously we considered binary (0/1) random variables. Generalization to multinomial random variables that can take the values 1, 2, ..., K is straightforward. Generalization to a continuous random variable: divide the interval [0, 1] into K equally sized intervals of width Δθ = 1/K. Define a probability density p(θ) such that the probability of θ being in the interval S_i = [(i−1)Δθ, iΔθ], i ∈ {1, ..., K}, is P(θ ∈ S_i) = p(θ*) Δθ, where θ* is some point in S_i. In the limit Δθ → 0, the expectation E_{P(θ)}[f(θ)] = Σ_θ P(θ) f(θ) becomes E_{p(θ)}[f(θ)] = ∫ dθ p(θ) f(θ).

Discrete vs. Continuous Random Variables. [Figure: a discrete distribution P(θ) on a grid over [0, 1] and the corresponding continuous density p(θ).] P(θ ∈ [(i−1)Δθ, iΔθ]) = p(θ*) Δθ. In the limit Δθ → 0: E_{P(θ)}[f(θ)] = Σ_θ P(θ) f(θ) becomes E_{p(θ)}[f(θ)] = ∫ dθ p(θ) f(θ).

Estimating the Sex Ratio. Task: estimate the Bernoulli parameter θ, given N observations of the genders of newborns in an unnamed country. Assume the true Bernoulli parameter to be estimated in the unnamed country is θ = 0.55, the global average being 51.2%. Posterior probability density after seeing N newborns X = {x^t}_{t=1}^N: p(θ | X) = p(X | θ) p(θ) / p(X) ∝ p(θ) Π_{t=1}^N θ^{x^t} (1 − θ)^{1−x^t}.

Estimating the Sex Ratio. What is our degree of belief in the gender ratio before seeing any data, i.e., what is the prior probability density p(θ)? A very agnostic view: p(θ) = 1 (flat prior). Something similar to elsewhere in the world (empirical prior). A conspiracy-theory prior: newborns are almost all boys or almost all girls (boundary prior). [Figure, N = 0: the three prior densities over θ ∈ [0, 1], scaled to have a maximum of one; the true θ = 0.55 is shown by a red dotted line.]
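The evolution of the posterior in the following slides can be reproduced on a grid over θ. The flat prior is exactly p(θ) = 1, but the slides do not give the functional form of the empirical and boundary priors, so the Beta-shaped curves below are stand-ins chosen to match their verbal description.

```python
# Minimal sketch of the posterior p(theta | X) on a grid, for three priors.
# The "empirical" and "boundary" priors are stand-ins (assumptions), since the
# slides describe them only verbally.

import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.55
grid = np.linspace(0.001, 0.999, 999)
dtheta = grid[1] - grid[0]

priors = {
    "flat": np.ones_like(grid),
    "empirical": grid**50 * (1 - grid)**48,          # stand-in: peaked near 0.51
    "boundary": grid**(-0.9) * (1 - grid)**(-0.9),   # stand-in: mass near 0 and 1
}

x_all = rng.random(512) < theta_true                 # simulated genders, True = boy
for N in (0, 8, 64, 512):
    k = int(x_all[:N].sum())                         # number of boys so far
    log_lik = k * np.log(grid) + (N - k) * np.log(1 - grid)
    line = [f"N={N:3d} boys={k:3d}"]
    for name, prior in priors.items():
        post = prior * np.exp(log_lik - log_lik.max())   # unnormalized posterior
        post /= post.sum() * dtheta                      # normalize on the grid
        line.append(f"{name} mean={np.sum(grid * post) * dtheta:.3f}")
    print("  ".join(line))
```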

Estimating the Sex Ratio. [Figure series: the posterior probability density p(θ | X) over θ ∈ [0, 1] after N = 1, 2, 3, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 and 4096 observations, plotted for the flat, empirical and boundary priors; each legend also reports a probability P per prior.]

Observations. With few data points the results depend strongly on the prior assumptions (inductive bias). As the number of data points grows, the results converge to the same answer. The conspiracy theory fades out quickly once we notice that there are both male and female babies: the posterior probability is zero only at the hypotheses θ = 0 and θ = 1. It takes quite a lot of observations to pin the result down to a reasonable accuracy. The posterior probability can be a very small number; therefore we usually work with logs of probabilities.


Predictions from the Posterior. The posterior represents our best knowledge. Predictor for a new data point: p(x | X) = E_{p(θ|X)}[p(x | θ)] = ∫ dθ p(x | θ) p(θ | X). The calculation of the integral may be infeasible. Solution: estimate θ by θ̂ and use the predictor p(x | X) ≈ p(x | θ̂).
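A small numerical comparison of the two predictors for the Bernoulli model with a flat prior: the full posterior predictive (the integral, done here on a grid) versus the plug-in predictor with the ML estimate. The data below are illustrative.

```python
# Minimal sketch: posterior predictive p(x=1 | X) versus the plug-in
# approximation p(x=1 | theta_hat). Bernoulli model with a flat prior on
# theta; the integral is computed numerically on a grid.

import numpy as np

X = np.array([1, 1, 0, 1, 0, 1, 1, 0])      # 5 boys out of N = 8 (toy data)
N, k = len(X), int(X.sum())

grid = np.linspace(0.0005, 0.9995, 1000)
dtheta = grid[1] - grid[0]

# Posterior on the grid: p(theta|X) is proportional to theta^k (1-theta)^(N-k).
post = grid**k * (1 - grid)**(N - k)
post /= post.sum() * dtheta

# Full predictor: p(x=1 | X) = integral of theta * p(theta|X) d(theta).
pred_full = np.sum(grid * post) * dtheta
# Plug-in predictor with the ML estimate theta_hat = k/N.
pred_plugin = k / N

print("posterior predictive:", round(float(pred_full), 4))  # about (k+1)/(N+2) = 0.6
print("plug-in (ML):        ", round(pred_plugin, 4))        # 0.625
```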

Estimates from the Posterior. Definition (Maximum Likelihood Estimate): θ̂_ML = arg max_θ log p(X | θ). Definition (Maximum a Posteriori Estimate): θ̂_MAP = arg max_θ log p(θ | X). (With a flat prior the MAP estimate reduces to the ML estimate.) [Figure: posterior densities over θ at N = 8 for the flat, empirical and boundary priors.]

Bernoulli Density. Two states, x ∈ {0, 1}, one parameter θ ∈ [0, 1]. P(X = x | θ) = θ^x (1 − θ)^(1−x). P(X | θ) = Π_{t=1}^N θ^{x^t} (1 − θ)^{1−x^t}. L = log P(X | θ) = (Σ_t x^t) log θ + (N − Σ_t x^t) log(1 − θ). Setting ∂L/∂θ = 0 gives θ̂_ML = (1/N) Σ_t x^t.

Multinomial Density. K states, x ∈ {1, ..., K}, K real parameters θ_k ≥ 0 with the constraint Σ_{k=1}^K θ_k = 1. One observation is an integer i in {1, ..., K}, represented by the indicator vector x with x_k = δ_ik. P(X = i | θ) = Π_{k=1}^K θ_k^{x_k}. P(X | θ) = Π_{t=1}^N Π_{k=1}^K θ_k^{x_k^t}. L = log P(X | θ) = Σ_{t=1}^N Σ_{k=1}^K x_k^t log θ_k. Setting ∂L/∂θ_k = 0 (subject to the constraint) gives θ̂_{k,ML} = (1/N) Σ_t x_k^t.

Gaussian Density. A real number x is Gaussian (normal) distributed with mean µ and variance σ², x ∼ N(µ, σ²), if its density function is p(x | µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²)). L = log P(X | µ, σ²) = −(N/2) log(2π) − N log σ − Σ_{t=1}^N (x^t − µ)² / (2σ²). ML estimates: m = (1/N) Σ_{t=1}^N x^t and s² = (1/N) Σ_{t=1}^N (x^t − m)². [Figure: the standard normal density p(x | µ = 0, σ² = 1).]


Bias and Variance. Setup: an unknown parameter θ is estimated by d(X) based on a sample X. Example: estimate σ² by d = s². Bias: b_θ(d) = E[d] − θ. Variance: E[(d − E[d])²]. Mean square error of the estimator: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance. Figure 4.1 of Alpaydin (2004): θ is the parameter to be estimated and the d_i are several estimates over different samples. Bias is the difference between the expected value of d and θ; variance is how much the d_i are scattered around the expected value. We would like both to be small. From: E. Alpaydın, 2004, Introduction to Machine Learning, The MIT Press.

Bias and Variance: Unbiased Estimator of Variance. An estimator is unbiased if b_θ(d) = 0. Assume X is sampled from a Gaussian distribution and σ² is estimated by s² = (1/N) Σ_t (x^t − m)². We obtain E_{p(x|µ,σ²)}[s²] = ((N − 1)/N) σ². So s² is not an unbiased estimator, but σ̂² = (N/(N − 1)) s² = (1/(N − 1)) Σ_{t=1}^N (x^t − m)² is. s² is, however, asymptotically unbiased (that is, the bias vanishes as N → ∞).
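The bias of s² is easy to confirm by simulation. The Monte Carlo sketch below uses arbitrary values for N, σ² and the number of repetitions.

```python
# Minimal Monte Carlo sketch: the ML variance estimate s^2 = (1/N) sum (x-m)^2
# is biased, E[s^2] = ((N-1)/N) sigma^2, while the 1/(N-1) version is unbiased.
# Sample size and repetition count are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(0)
N, sigma2, reps = 5, 4.0, 200_000

s2_vals, s2hat_vals = [], []
for _ in range(reps):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=N)
    m = x.mean()
    s2_vals.append(np.sum((x - m) ** 2) / N)           # ML estimate
    s2hat_vals.append(np.sum((x - m) ** 2) / (N - 1))  # corrected estimate

print("E[s^2] (ML, 1/N)          ~", round(float(np.mean(s2_vals)), 3),
      "   theory:", round((N - 1) / N * sigma2, 3))
print("E[s^2] (corrected, 1/(N-1)) ~", round(float(np.mean(s2hat_vals)), 3),
      "   theory:", sigma2)
```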

Bayes Estimator. The Bayes estimator is θ̂_Bayes = E_{p(θ|X)}[θ] = ∫ dθ θ p(θ | X). Example: x^t ∼ N(θ, σ₀²) for t ∈ {1, ..., N}, and θ ∼ N(µ, σ²), where µ, σ² and σ₀² are known constants. Task: estimate θ. p(X | θ) = (1 / (2πσ₀²)^(N/2)) exp(−Σ_t (x^t − θ)² / (2σ₀²)) and p(θ) = (1 / √(2πσ²)) exp(−(θ − µ)² / (2σ²)). It can be shown that p(θ | X) is Gaussian with mean θ̂_Bayes = E_{p(θ|X)}[θ] = [(N/σ₀²) / (N/σ₀² + 1/σ²)] m + [(1/σ²) / (N/σ₀² + 1/σ²)] µ, where m is the sample mean.
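The closed-form Bayes estimator can be checked numerically. In the sketch below the constants µ, σ², σ0² and N are arbitrary illustration values; the posterior mean is computed both from the formula above and by numerical integration on a grid.

```python
# Minimal sketch of the Bayes estimator for a Gaussian mean: x^t ~ N(theta, sigma0^2),
# prior theta ~ N(mu, sigma^2). The closed-form posterior mean from the slide is
# compared with a numerical computation of E[theta|X] on a grid.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, sigma0_2, N = 0.0, 1.0, 4.0, 20   # arbitrary illustration values
theta_true = 1.5
x = rng.normal(theta_true, np.sqrt(sigma0_2), size=N)
m = x.mean()

# Closed form: weighted average of the sample mean m and the prior mean mu.
w = (N / sigma0_2) / (N / sigma0_2 + 1.0 / sigma2)
theta_bayes = w * m + (1.0 - w) * mu

# Numerical check: E[theta|X] = int theta p(X|theta) p(theta) d(theta) / normalizer.
grid = np.linspace(-5, 5, 20001)
log_post = (-np.sum((x[:, None] - grid[None, :]) ** 2, axis=0) / (2 * sigma0_2)
            - (grid - mu) ** 2 / (2 * sigma2))
post = np.exp(log_post - log_post.max())
theta_numeric = np.sum(grid * post) / np.sum(post)

print("closed form :", round(float(theta_bayes), 4))
print("numeric     :", round(float(theta_numeric), 4))
```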


About Estimators. Point estimates collapse the information contained in the posterior distribution into one point. Advantages of point estimates: computations are easier (no need to do the integral); a point estimate may be more interpretable; point estimates may be good enough (if the model is approximate anyway, it may make no sense to compute the integral exactly). Alternative to point estimates: do the integral analytically or using approximate methods (MCMC, variational methods, etc.). One should always use a test set to validate the results: the best estimate is the one performing best on the validation/test set.

Conclusion. Next lecture: more about model selection (Alpaydin (2004) Ch 4). Problem session on 5 October.