Learning Bayesian network : Given structure and completely observed data
|
|
- Oswald Rodgers
- 5 years ago
- Views:
Transcription
1 Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani
2 Learning problem Target: true distribution P that maybe correspond to M = K, θ Hypothesis space: specified probabilistic graphical models Data: set of instances sampled from P Learning goal: selecting a model M to construct the best approximation to M according to a performance metric 2
3 Learning tasks on graphical models Parameter learning / structure learning Completely observable / partially observable data Directed model / undirected model 3
4 Parameter learning in directed models Complete data We assume that the structure of the model is known consider learning parameters for a B with a given structure Goal: estimate CPDs from a dataset D = {x 1,..., x () } of independent, identically distributed (i.i.d.) training samples. Each training sample x n = x 1 (n),, xl (n) is a vector that every element x i (n) is known (no missing values, no hidden variables) 4
5 Density estimation review We use density estimation to solve this learning problem Density estimation: Estimating the probability density function P(x), given a set of data points x i drawn from it. Parametric methods: Assume that P(x) in terms of a specific functional form which has a number of adjustable parameters MLE and Bayesian estimate MLE: eed to determine θ given {x 1,, x () } MLE overfitting problem Bayesian estimate: Probability distribution P(θ) over spectrum of hypotheses eeds prior distribution on parameters i=1 5
6 Density estimation: Graphical model i.i.d assumption θ X (1) X (2) X () θ X (i) i = 1,..., α hyperparametrs α θ α θ X (1) X (2) X () X (n) n = 1,..., 6
7 Maximum Likelihood Estimation (MLE) Likelihood is the conditional probability of observations D = x (1), x (2),, x () given the value of parameters θ Assuming i.i.d. (independent, identically distributed) samples P D θ = P x (1),, x () θ = likelihood of θ w.r.t. the samples P(x (n) θ) Maximum Likelihood estimation 7 θ ML = argmax θ θ ML = argmax θ i=1 P D θ = argmax θ ln p x (i) θ P(x (n) θ) MLE has closed form solution for many parametric distributions
8 MLE: Bernoulli distribution Given: D = x (1), x (2),, x (), m heads (1), m tails (0): p x θ = θ x 1 θ 1 x p x = 1 θ = θ p D θ = p(x n θ) = x n θ 1 θ 1 x n ln p D θ = ln p(x n θ) = {x n ln θ + (1 x n ) ln 1 θ } ln p D θ θ = 0 θ ML = i=1 x (n) = m
9 MLE: Multinomial distribution Multinomial distribution (on variable with K state): Parameter space: θ = θ 1,, θ K θ i 0,1 K k=1 θ k = 1 P x θ = K x θ k k k=1 P x k = 1 = θ k θ 2 9 Variable: 1-of-K coding x = x 1,, x K x k {0,1} K k=1 x k = 1 θ 1 θ 3 θ 1 + θ 2 + θ 3 = 1 where θ i 0,1 that is a simplex showing the set of valid parameters
10 MLE: Multinomial distribution D = x (1), x (2),, x () P D θ = P(x n θ) = K k=1 θ k x k (n) = L θ, λ = ln p D θ + λ(1 θ k ) K k=1 K k=1 θ k x k (n) m k = K k=1 x k (n) m k = θ k = x k (n) = m k 10
11 MLE: Gaussian with unknown μ ln P(x n μ) = 1 2 ln 2πσ 1 2σ 2 x n μ 2 ln P D μ μ = 0 μ ln p x (n) μ = 0 1 σ 2 x(n) μ = 0 μ ML = 1 x n 12
12 Bayesian approach Parameters θ as random variables with a priori distribution utilizes the available prior information about the unknown parameter As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector θ Samples D convert the prior densities P θ density P θ D into a posterior Keep track of beliefs about θ s values and uses these beliefs for reaching conclusions 13
13 Maximum A Posteriori (MAP) estimation MAP estimation θ MAP = argmax p θ D θ Since p θ D p D θ p(θ) θ MAP = argmax θ p D θ p(θ) Example of prior distribution: p θ = (θ 0, σ 2 ) 14
14 Bayesian approach: Predictive distribution Given a set of samples D = x i i=1, a prior distribution on the parameters P(θ), and the form of the distribution P x θ We find P θ D and use it to specify P x = P(x D) on new data as an estimate of P(x): P x D = P x, θ D dθ = P x D, θ P θ D dθ = P x θ P θ D dθ Predictive distribution Analytical solutions exist for very special forms of the involved functions 15
15 Conjugate Priors We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties Choosing a prior such that the posterior distribution that is proportional to p(d θ)p(θ) will have the same functional form as the prior. α, D α P(θ α ) P D θ P(θ α) Having the same functional form 16
16 Prior for Bernoulli Likelihood Beta distribution over θ [0,1]: Beta θ α 1, α 0 θ α θ α 0 1 Beta θ α 1, α 0 = Γ(α 0 + α 1 ) Γ(α 0 )Γ(α 1 ) θα θ α 0 1 E θ = α 1 α 0 + α α 1 1 θ = α α 1 1 most probable θ Beta distribution is the conjugate prior of Bernoulli: P x θ = θ x 1 θ 1 x 17
17 Beta distribution 18
18 Benoulli likelihood: posterior Given: D = x (1), x (2),, x (), m heads (1), m tails (0) p θ D p D θ p(θ) = i=1 θ x i 1 θ 1 x i Beta θ α 1, α 0 θ m+α1 1 1 θ m+α 0 1 p θ D Beta θ α 1, α 0 α 1 = α 1 + m α 0 = α 0 + m θ α θ α 0 1 m = x (i) i=1 19
19 Example Prior Beta: α 0 = α 1 = 2 p x θ = θ x 1 θ 1 x Bernoulli p x = 1 θ θ Posterior Beta:α 1 = 5, α 0 = 2 Given: D = x (1), x (2),, x () : m heads (1), m tails (0) θ α 0 = α 1 = 2 20 θ θ MAP = argmax θ D = 1,1,1 = 3, m = 3 P θ D = α 1 1 α α 0 1 = 4 5
20 Benoulli: Predictive distribution Training samples: D = x (1),, x () P θ = Beta θ α 1, α 0 θ α1 1 1 θ α 0 1 P θ D = Beta θ α 1 + m, α 0 + m θ α1+m 1 1 θ α 0+ m 1 21 P x D = P x θ P θ D dθ = E P θ D P(x θ) P x = 1 D = E P θ D θ = α 1 + m α 0 + α 1 +
21 Dirichlet distribution Input space: θ = θ 1,, θ K T θ k 0,1 K k=1 θ k = 1 = P θ α Γ(α) K α θ k 1 k k=1 K α θ k 1 k Γ α 1 Γ(α K ) k=1 α = K k=1 α k E θ k = α α k θ k = α k 1 α K 22
22 Dirichlet distribution: Examples α = [0.1,0.1,0.1] α = [1,1,1] α = [10,10,10] Dirichlet parameters determine both the prior beliefs and their strength. The larger values of α correspond to more confidence on the prior belief (i.e., more imaginary samples) 23
23 Dirichlet distribution: Example α = [2,2,2] α = [20,2,2] 24
24 Multinomial distribution: Prior Dirichlet distribution is the conjugate prior of Multinomial P θ D, α P D θ P θ α K m θ k +α k 1 k k=1 m = m 1,, m K T sufficient statistics of data P θ D, α = Dir θ α + m P θ α θ~dir(α 1,, α K ) P θ D, α θ~dir(α 1 + m 1,, α K + m K ) 25
25 Multinomial: Predictive distribution Training samples: D = x (1),, x () P θ = Dir θ α 1,, α K P θ D = Dir θ α 1 + m 1,, α K + m K P x D = P x θ P θ D dθ = E P θ D P(x θ) P x k = 1 D = E P θ D θ k = α k + m k α + Larger α more confidence in our prior 26 α k + m k α + = α α + α k α + α + m k
26 Multinomial: Predictive distribution P x D = Multi α 1 + m 1 α +,, α K + m K α + Bayesian prediction combines sufficient statistics from imaginary Dirichlet samples and real data samples 27
27 Example: MLE vs. Bayesian Effect of different priors on smoothing parameter estimates θ~beta(10,10) θ~beta(1,1) 28 [Koller & Friedman Book]
28 Bayesian Estimation Gaussian distribution with unknown μ (known σ) P x μ ~(μ, σ 2 ) P(μ)~(μ 0, σ 0 2 ) P μ D P D μ P μ = P(x n μ P(μ) P μ D exp 1 2 σ σ 0 2 μ2 2 x n σ 2 + μ 0 σ 0 2 μ 29
29 Bayesian Estimation Gaussian distribution with unknown μ (known σ) P μ D ~(μ, σ 2 ) μ = 1 2 σ 0 σ σ 2 σ 2 = 1 σ σ 2 x n + σ 2 σ σ 2 μ 0 P x D = P x μ P μ D dμ exp 1 2 x μ 2 σ 2 + σ 2 P x D ~(μ, σ 2 + σ 2 ) 30
30 Conjugate prior for Gaussian distribution Known μ, unknown σ 2 The conjugate prior for λ = 1/σ 2 is a Gamma with shape a 0 and rate (inverse scale) b 0 The conjugate prior for σ 2 is Inverse-Gamma Unknown μ and unknown σ 2 The conjugate prior P μ, λ is ormal-gamma Multivariate case: The conjugate prior P μ, Λ is ormal-wishart Gam λ a, b = 1 Γ(a) ba x a 1 exp bλ InvGam σ 2 a, b = 1 Γ(a) ba x a 1 exp( b σ 2) P μ, λ = μ μ 0, βλ 1 Gam (λ a 0, b 0 ) 31
31 Bayesian estimation: Summary Bayesian approach treats parameters as random variables Learning is then a special case of inference Asymptotically equivalent to MLE For some parametric distributions, has closed form (that is obtained based on prior parameters and sufficient statistics from data) 32
32 Learning in Bayesian networks Learning MLE Likelihood decomposes on nodes conditional distributions Bayesian We can make some assumptions (global & local independencies) to simplify the learning 33
33 Likelihood decomposition: Example Directed factorization causes likelihood to locally decompose: P X θ = P X 1 θ 1 P X 2 X 1, θ 2 P X 3 X 1, θ 3 P X 4 X 2, X 3, θ 4 X 1 X 2 X 3 X 4 34
34 Decomposition of the likelihood If we assume the parameters for each CPD are disjoint (i.e., disjoint parameter sets θ Xi Pa Xi for different variables), and the data is complete, then the log-likelihood function decomposes on nodes: L θ; D = P D θ = = i V i V (n) (n) P X i PaXi, θi = L i θ i ; D = (n) (n) P X i PaXi, θi (n) P X (n) i PaXi, θ i L i (θ i ; D) i V 35 θ i = argmax L i θ i ; D θ = [ θ 1,, θ V ] θ i θ = argmax L θ; D θ θ i θ Xi Pa Xi
35 Local decomposition of the likelihood: Table-CPDs The choice of parameters given different value assignment to the parents are independent of each other Consider only data points that agree with the parent assignment: L i θ Xi Pa Xi ; D = θ (n) (n) Xi PaXi = u Val(Pa Xi ) x Val(X i ) θ x u (n) (n) I X i =x,paxi =u 36
36 Local decomposition of the likelihood: Table-CPDs Each row is multinomial distribution θ i θ Xi Pa Xi θ Xi u 1 θ Xi u L Pa Xi = u 1 Pa Xi = u L X i = 1 X i = K i θ 1 u1 i θ K u1 i θ 1 ul i θ K ul m i k, u = (n) (n) I X i = k, PaXi = u u k i θ k u = 1 i θ k u = m i k, u k m i k, u = (n) (n) I X i = k, PaXi = u (n) I Pa Xi = u 37
37 Overfitting For large instances Z Pa Xi Val(Z), most buckets will have very few umber of parameters will increases exponentially with Pa Xi Poor estimation With limited data, overfitting occurs we often get better generalization with simpler structures zero count problem or the sparse data problem e.g., consider a naïve bayes classifier and an d containing a word that does not occur in any training doc for the class c, then P d c = 0 38
38 ML Example: aïve Bayes classifier Training phase π Y (n) Y {1,, C} θ i = θ Xi Y = θ Xi 1,, θ Xi C Test phase θ i = θ Xi 1,, θ Xi C π Y θ i X i (n) n = 1,, X i i = 1,.., D π i = 1,, D θ 1 θ 2 Y θ D X 1 X 2 X D 39
39 ML Example: aïve Bayes classifier L θ; D = P(Y n π) D i=1 P(X i n Y n, θ Xi Y) π c = I(Y n = c) Discrete inputs or features (X i {1,, K}): i θ k c = p X i = k Y = c I(X i n =k,y n =c) I(Y n =c) 40
40 Recall: Maximum conditional likelihood Example: discriminative classifier eeds to learn the conditional distribution (not joint distribution) Given D = x n, y n and a parametric conditional distribution for P(y x; θ), we find: θ = argmax θ P(y n x n ; θ) We will also see maximum conditional likelihood for more general CPDs than tabular ones 41
41 Maximum conditional likelihood Linear regression P y x, w = y w T φ x, σ 2 Linear: φ x = [1, x 1,, x D ] D = x n, y n x w y L Y X w; D = w = argmax θ P Y (n) X n, w = Y n w T φ X n, σ 2 Y n w T φ X n, σ 2 = Φ T Φ 1 Φ T Y 42
42 Maximum conditional likelihood Logistic regression P y = 1 X = σ w T φ x Linear LR: φ x = [1, x 1,, x D ] P y x, w = Ber(y σ w T φ x ) D = x n, y n Y {0,1} w x y L Y X w, D = 43 = P y (n) x n, w = Ber y n σ w T φ x n σ w T φ x n y (n) 1 σ w T φ x n 1 y (n) σ z = exp z
43 MLE for Bayesian networks: Summary For a B with disjoint sets of parameters in CPDs, likelihood decomposes as product of local likelihood functions on nodes Thus, we optimize likelihoods on different nodes separately For table CPDs, local likelihood further decomposes as product of likelihood for multinomials, one for each parent combination Sparse data problem of MLE 44
44 Bayesian learning for Bs Global parameter independence Global parameter independence assumption: θ 1 P θ D = 1 P D = P θ = L i θ i ; D = P θ i D i V P D θ = P(θ i ) i V P D θ P θ = 1 P D (n) P X (n) i PaXi, θ i L i (θ i ; D) i V θ 2 L i θ i ; D P θ i i V Posteriors of θ are independent given complete data and independent priors X 2 (n) X 1 (n) X 3 (n) θ 3 θ i θ Xi Pa Xi 45
45 Bayesian learning for Bs Global parameter independence How can we find directly from the structure of the Bayesian network when P θ satisfies global independence? 46
46 Bayesian learning for Bs Global parameter independence Predictive distribution P x D = P x θ, D P θ D dθ P x D = P x θ P θ D dθ Instances are independent given the parameters (for complete data) = i V P x i Pa Xi, θ i P θ i D dθ i V = i V P x i Pa Xi, θ i P θ i D dθ i Solve the prediction problem for each CPD independently and then combine the results. 47 θ i θ Xi Pa Xi f x g y dxdy = f x dx g y dy
47 Local decomposition for table CPDs Prior P θ X PaX satisfies local parameter independence assumption if: P θ X PaX = P θ X u Pa X =u The groups of parameters in different rows of a CPD are independent The posterior of P θ X u D and P θ X u D are independent despite v structure on X, because Pa X acts like a multiplexer Y (n) Y {1,, C} θ X 1 θ X C 48 X (n) n = 1,,
48 Global and local parameter independence Let the data be complete and CPDs be tabular. If the prior P(θ) satisfies global and local parameter independence, then: Pa Xi = u K = Val(X i ) P θ D = i V P X i = k u, D = Pa Xi P(θ Xi Pa Xi D) i α k u K k =1 + m i (k, u) i α k u + m i (k, u) i θ Xi u~dir(α 1 u i,, α K u ) i θ Xi u D ~Dir(α 1 u i + m i (1, u),, α K u + m i (K, u)) 49
49 Priors for Bayesian learning Priors α Xi Pa Xi must be defined: α x u x Val X i, u Val(Pa Xi ) K 2 prior: a fixed prior for all hyper-parameters α x u = 1 We can store an α and a prior distribution P on the variables in the network: α Xi Pa Xi = αp X i, Pa Xi BDe prior: Define P as a set of independent marginals over X i s 50
50 aïve Bayes Example: Bayesian learning Y {1,, C} θ i = θ Xi Y = θ Xi 1,, θ Xi C Training phase α π α Xi c Y (n) θ Xi c c = 1,, C X i (n) i = 1,, D n = 1,, 51
51 aïve Bayes Example: Bayesian learning Y~Mult π (i.e., P Y = c π = π c ) P π α 1,, α C = Dir(π α 1,, α C ) P π D, α 1,, α C = Dir π α 1 + m 1,, α C + m c P Y = c D = C c =1 α c + m c α c + m c Discrete inputs or features (X i {1,, K}): P X i = k Y = c, D = K k =1 i α k c +m i (k,c) i α k c +m i (k,c) m c = I(Y n = c) m i k, c = I(X i n = k, Y n = c) 52
52 Global & local independence For nodes with no parents, parameters define a single distribution Bayesian or ML learning can be used as in the simple density estimation on a variable More generally, for tabular CPDs there are multiple categorical distributions per node, one for every combination of parent variables Learning objective decomposes into multiple terms, one for subset of training data with each parent configuration Apply independent Bayesian or ML learning to each 53
53 Shared parameters Sharing CPDs or sharing parameters within a single CPD: P X i Pa Xi, θ = P X j Pa Xj, θ P X i u 1, θ i s = P X j u 2, θ i s MLE: For networks with shared CPDs, sufficient statistics accumulate over all uses of CPD Aggregate sufficient statistics: collect all the instances generated from the same conditional distribution and combine sufficient statistics from them 54
54 Markov chain P X 1,, X T θ = P X 1 π T t=2 P(X t X t 1, A t 1,t ) X 1 X 2 X T 1 X T Shared Parameters: P X 1,, X T θ = P X 1 π T t=2 P(X t X t 1, A) 55
55 Markov chain π A = θ S S Initial state probability: X 1 X 2 X T 1 X T π i = P X 1 = i, 1 i K State transition probability: A ji = θ j i = P X t+1 = i X t = j, K i=1 A ji = 1 1 i, j K 56
56 Markov chain: Parameter learning by MLE P D θ = P X 1 (n) π T t=2 (n) (n) P(X t Xt 1, A) A ji = T (n) (n) t=2 I X t 1 = j, Xt = i T (n) t=2 I X t 1 = j π i = I X 1 (n) = i 57
57 Markov chain: Bayesian learning Dirichlet prior α on A A ji = P X t+1 = i X t = j A j,. ~Dir(α j,1,, α j,k ) Assign a Dirichlet prior to each row of the transition matrix A P X t+1 = i X t = j, D, α j,. = T (n) (n) t=2 I X t 1 = j, Xt = i + αj,i T (n) t=2 I X t 1 = j + K i =1 α j,i 58
58 Hyper-parameters in Bayesian approach We already consider either parameter independencies or parameter sharing Hierarchical priors gives us a flexible language to introduce dependencies in the priors over parameters. useful when we have small amount of examples relevant to each parameter but we believe that some parameters are reasonably similar. In such situations, hierarchical priors spread the effect of the observations between parameters with shared hyper-parameters. However, when we use hyper-parameters we transformed our learning problem into one that includes a hidden variable. 59
59 Hyper-parameters: Example [Koller & Friedman Book] The effect of the prior will be to shift both θ Y x 0 and θ Y x 1 to be closer to each other 60
60 Hyper-parameters: Example [Koller & Friedman Book] If we believe that two variables Y and Z depend on X in a similar (but not identical) way 61
61 References Koller and Friedman, Chapter
Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationDensity Estimation: ML, MAP, Bayesian estimation
Density Estimation: ML, MAP, Bayesian estimation CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Maximum-Likelihood Estimation Maximum
More informationBayesian Models in Machine Learning
Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationProbabilistic Graphical Models
Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially
More informationReadings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008
Readings: K&F: 16.3, 16.4, 17.3 Bayesian Param. Learning Bayesian Structure Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University October 6 th, 2008 10-708 Carlos Guestrin 2006-2008
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationOverview of Course. Nevin L. Zhang (HKUST) Bayesian Networks Fall / 58
Overview of Course So far, we have studied The concept of Bayesian network Independence and Separation in Bayesian networks Inference in Bayesian networks The rest of the course: Data analysis using Bayesian
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationAn Introduction to Bayesian Machine Learning
1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems
More information4: Parameter Estimation in Fully Observed BNs
10-708: Probabilistic Graphical Models 10-708, Spring 2015 4: Parameter Estimation in Fully Observed BNs Lecturer: Eric P. Xing Scribes: Purvasha Charavarti, Natalie Klein, Dipan Pal 1 Learning Graphical
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 5 Bayesian Learning of Bayesian Networks CS/CNS/EE 155 Andreas Krause Announcements Recitations: Every Tuesday 4-5:30 in 243 Annenberg Homework 1 out. Due in class
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationHierarchical Models & Bayesian Model Selection
Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationBayesian RL Seminar. Chris Mansley September 9, 2008
Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in
More informationNaïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824
Naïve Bayes Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative HW 1 out today. Please start early! Office hours Chen: Wed 4pm-5pm Shih-Yang: Fri 3pm-4pm Location: Whittemore 266
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationIntroduction to Bayesian Learning. Machine Learning Fall 2018
Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationIntroduction to Probability and Statistics (Continued)
Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning
More informationLecture 4. Generative Models for Discrete Data - Part 3. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza.
Lecture 4 Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza October 6, 2017 Luigi Freda ( La Sapienza University) Lecture 4 October 6, 2017 1 / 46 Outline
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationIntroduc)on to Bayesian methods (con)nued) - Lecture 16
Introduc)on to Bayesian methods (con)nued) - Lecture 16 David Sontag New York University Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate Outline of lectures Review of
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationIntroduction to Machine Learning
Introduction to Machine Learning Generative Models Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationMachine Learning, Fall 2012 Homework 2
0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationLecture 2: Priors and Conjugacy
Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 3 September 14, Readings: Mitchell Ch Murphy Ch.
School of Computer Science 10-701 Introduction to Machine Learning aïve Bayes Readings: Mitchell Ch. 6.1 6.10 Murphy Ch. 3 Matt Gormley Lecture 3 September 14, 2016 1 Homewor 1: due 9/26/16 Project Proposal:
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationNPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic
NPFL108 Bayesian inference Introduction Filip Jurčíček Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic Home page: http://ufal.mff.cuni.cz/~jurcicek Version: 21/02/2014
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More information{ p if x = 1 1 p if x = 0
Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =
More informationSome slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2
Logistics CSE 446: Point Estimation Winter 2012 PS2 out shortly Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Last Time Random variables, distributions Marginal, joint & conditional
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationRepresentation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2
Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 32 Learning a generative model We are given a training
More informationThe Monte Carlo Method: Bayesian Networks
The Method: Bayesian Networks Dieter W. Heermann Methods 2009 Dieter W. Heermann ( Methods)The Method: Bayesian Networks 2009 1 / 18 Outline 1 Bayesian Networks 2 Gene Expression Data 3 Bayesian Networks
More informationProbability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014
Probability Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh August 2014 (All of the slides in this course have been adapted from previous versions
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationIntroduction to Probabilistic Graphical Models
Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in
More informationBayesian Regression Linear and Logistic Regression
When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationMLE/MAP + Naïve Bayes
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)
More informationIntroduction to Bayesian inference
Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 4 Learning Bayesian Networks CS/CNS/EE 155 Andreas Krause Announcements Another TA: Hongchao Zhou Please fill out the questionnaire about recitations Homework 1 out.
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationStatistical learning. Chapter 20, Sections 1 3 1
Statistical learning Chapter 20, Sections 1 3 Chapter 20, Sections 1 3 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is
More informationGenerative and Discriminative Approaches to Graphical Models CMSC Topics in AI
Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single
More informationMachine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
More informationGraphical Models. Andrea Passerini Statistical relational learning. Graphical Models
Andrea Passerini passerini@disi.unitn.it Statistical relational learning Probability distributions Bernoulli distribution Two possible values (outcomes): 1 (success), 0 (failure). Parameters: p probability
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationNon-parametric Methods
Non-parametric Methods Machine Learning Alireza Ghane Non-Parametric Methods Alireza Ghane / Torsten Möller 1 Outline Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More information10708 Graphical Models: Homework 2
10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves
More informationChris Bishop s PRML Ch. 8: Graphical Models
Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning
More informationProbabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April
Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 18 Oct, 21, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models CPSC
More informationProbabilistic Graphical Models (I)
Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random
More informationSequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007
Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember
More informationLecture 2: Conjugate priors
(Spring ʼ) Lecture : Conjugate priors Julia Hockenmaier juliahmr@illinois.edu Siebel Center http://www.cs.uiuc.edu/class/sp/cs98jhm The binomial distribution If p is the probability of heads, the probability
More informationBayesian Methods. David S. Rosenberg. New York University. March 20, 2018
Bayesian Methods David S. Rosenberg New York University March 20, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 March 20, 2018 1 / 38 Contents 1 Classical Statistics 2 Bayesian
More informationECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4
ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationCurve Fitting Re-visited, Bishop1.2.5
Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the
More information