Learning Markov Networks. Presented by: Mark Berlin, Barak Gross

1 Learning Markov Networks Presented by: Mark Berlin, Barak Gross

2 Introduction "We shall begin, perhaps..." Eugene Onegin, Chapter VI. "Off did he take, I followed at his heels." Inferno, Canto II

3 Reminder Until now we considered Bayesian Networks: intuitive, easily decomposable, local. Markov Networks are a wholly different story

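4 Example Consider the simple Markov Network A - B - C. Factors: $\phi_1(A,B)$, $\phi_2(B,C)$. Partition function: $Z(\theta) = \sum_{a,b,c} \phi_1(a,b)\,\phi_2(b,c)$. Log-likelihood: $\ell(\theta : D) = \sum_m \left( \ln\phi_1(a[m],b[m]) + \ln\phi_2(b[m],c[m]) - \ln Z(\theta) \right) = \sum_{a,b} M[a,b]\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\ln\phi_2(b,c) - M\ln Z(\theta)$, where $M[a,b]$ counts the samples with $A=a$, $B=b$. $\ln Z$ couples the factors together and precludes decomposition. Transition to Bayesian representation is clearly not a panacea, as the MN and BN models are not equivalent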
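
The coupling through $\ln Z$ can be seen on a brute-force version of this example. The sketch below assumes binary variables and made-up factor tables (`phi1`, `phi2` and the three data points are illustrative, not from the slides); it enumerates all assignments to obtain the partition function and then evaluates the log-likelihood.

```python
import itertools
import math

# Hypothetical factor tables for the chain A - B - C (binary variables).
phi1 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # phi1(A, B)
phi2 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}  # phi2(B, C)


def partition_function(phi1, phi2):
    """Z = sum over all (a, b, c) of phi1(a, b) * phi2(b, c)."""
    return sum(phi1[a, b] * phi2[b, c]
               for a, b, c in itertools.product((0, 1), repeat=3))


def log_likelihood(data, phi1, phi2):
    """Sum over samples of ln phi1 + ln phi2 - ln Z."""
    log_z = math.log(partition_function(phi1, phi2))
    return sum(math.log(phi1[a, b]) + math.log(phi2[b, c]) - log_z
               for a, b, c in data)


data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]  # illustrative samples (a, b, c)
Z = partition_function(phi1, phi2)
ll = log_likelihood(data, phi1, phi2)
```

Changing any single entry of `phi1` changes `Z`, and through it the contribution of every sample, which is why the factors cannot be estimated independently.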

5 Programme Maximum Likelihood (spoiler: bad news ahead). Dealing with Missing Data. Alternative Objectives: Pseudo-Likelihood, Contrastive Divergence

6 Max likelihood "Who controls the past controls the future." George Orwell

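7 Log-linear models Let us now focus on the particular case where the features $f$ are indicators, i.e. of the form $f_{a^0,b^0}(a,b) = \mathbf{1}\{a = a^0, b = b^0\}$. The weights become the (log) factor values we look for, of the form $\theta_{a^0,b^0} = \ln\phi(a^0,b^0)$

8 Log likelihood The log-likelihood thus becomes: $\ell(\theta : D) = \sum_i \theta_i \left( \sum_{m=1}^{M} f_i(\xi[m]) \right) - M \ln Z(\theta)$, where $M$ is the number of samples and $\xi[m]$ is the $m$-th sample; each feature $f_i$ reads a set of values for a set of variables which is a clique in the Markov Network

9 Log likelihood, cont. $\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i\,E_D[f_i] - \ln Z(\theta)$, where $E_D[f_i]$ is the empirical expectation of $f_i$ over the data. The first term is linear in $\theta$. What about $\ln Z(\theta)$?

10 Deriving $\ln Z(\theta)$ With $Z(\theta) = \sum_\xi \exp\{\sum_j \theta_j f_j(\xi)\}$: $\frac{\partial}{\partial\theta_i}\ln Z(\theta) = \frac{1}{Z(\theta)}\sum_\xi \frac{\partial}{\partial\theta_i}\exp\{\sum_j \theta_j f_j(\xi)\} = \frac{1}{Z(\theta)}\sum_\xi f_i(\xi)\exp\{\sum_j \theta_j f_j(\xi)\} = E_\theta[f_i]$

11 Deriving, cont. With $Z(\theta) = \sum_\xi \exp\{\sum_j \theta_j f_j(\xi)\}$: $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \frac{\partial}{\partial\theta_j}\left[\frac{1}{Z(\theta)}\sum_\xi f_i(\xi)\exp\{\sum_k \theta_k f_k(\xi)\}\right] = E_\theta[f_i f_j] - E_\theta[f_i]\,E_\theta[f_j] = \mathrm{Cov}_\theta[f_i; f_j]$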

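12 Log likelihood, cont. We have derived that the Hessian of $\ln Z(\theta)$ has elements of the form $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]$. A covariance matrix is always positive semidefinite. Therefore, $\ln Z(\theta)$ is convex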
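
Both identities (gradient of $\ln Z$ equals the model expectation, Hessian equals the covariance) can be checked numerically on a model small enough to enumerate. The following is a minimal sketch; the three features over two binary variables and the value of `theta` are assumptions chosen for illustration.

```python
import itertools
import math

# Hypothetical features over two binary variables x = (x0, x1).
FEATURES = [
    lambda x: float(x[0] == 1),
    lambda x: float(x[1] == 1),
    lambda x: float(x[0] == x[1]),
]
STATES = list(itertools.product((0, 1), repeat=2))


def log_z(theta):
    """ln Z(theta), by enumerating every joint assignment."""
    return math.log(sum(
        math.exp(sum(t * f(x) for t, f in zip(theta, FEATURES)))
        for x in STATES))


def expectation(theta, g):
    """E_theta[g] under the log-linear model."""
    lz = log_z(theta)
    return sum(
        math.exp(sum(t * f(x) for t, f in zip(theta, FEATURES)) - lz) * g(x)
        for x in STATES)


def covariance(theta, i, j):
    """Cov_theta[f_i; f_j] = E[f_i f_j] - E[f_i] E[f_j]."""
    fi, fj = FEATURES[i], FEATURES[j]
    return (expectation(theta, lambda x: fi(x) * fj(x))
            - expectation(theta, fi) * expectation(theta, fj))


def numeric_grad(theta, i, eps=1e-5):
    """Central finite difference of ln Z with respect to theta_i."""
    up = list(theta)
    up[i] += eps
    down = list(theta)
    down[i] -= eps
    return (log_z(up) - log_z(down)) / (2 * eps)


theta = [0.3, -0.7, 1.1]
grad_gap = max(abs(numeric_grad(theta, i) - expectation(theta, FEATURES[i]))
               for i in range(len(FEATURES)))
```

Here `grad_gap` measures the largest disagreement between the finite-difference gradient and $E_\theta[f_i]$; it should be at the level of numerical noise.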

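13 Log likelihood, cont. $\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i\,E_D[f_i] - \ln Z(\theta)$. The log-likelihood function is a sum of a linear element and a concave element ($-\ln Z(\theta)$), so it is concave. Therefore, it has no local optima, only a global maximum (albeit possibly a non-unique one in parameter space)

14 Max likelihood $\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i\,E_D[f_i] - \ln Z(\theta)$. Apparently, what remains is to compute the gradient and find its zeroes: $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i] - E_\theta[f_i]$. Meaning the maximum is attained when $E_D[f_i] = E_{\hat\theta}[f_i]$ for every feature $f_i$

15 Max likelihood, cont. Theorem 20.1: $\hat\theta$ maximizes the likelihood if and only if $E_D[f_i] = E_{\hat\theta}[f_i]$ for all $i$

16 Max likelihood, cont. Good news: the maximum is global (the maximizing distribution is unique, even when the parameters attaining it are not). It is attained when the empirical expectation of every feature matches its expectation under the model. Bad news: impossible to compute analytically. Gradient ascent to the rescue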

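17 Max likelihood, cont. Gradient ascent to the rescue: guaranteed to converge to the maximum. $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i] - E_\theta[f_i]$. How to compute?

18 Max likelihood Gradient ascent $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i] - E_\theta[f_i]$. The first element, for indicator features, is the empirical frequency of the relevant event

19 Max likelihood Gradient ascent $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i] - E_\theta[f_i]$. The second element, for indicator features, is of the form $P_\theta(a^0, b^0)$, computed using inference on the graph. A single inference pass yields all probabilities

20 Max likelihood Gradient ascent $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i] - E_\theta[f_i]$. The second element, for indicator features, is of the form $P_\theta(a^0, b^0)$, computed using inference on the graph. But such computation is to be performed afresh for each step of the ascent!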
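
The ascent loop described on these slides can be sketched as follows, with exact inference replaced by brute-force enumeration (only viable for toy models). The features, the data set, the step size and the iteration count are all assumptions made for the example.

```python
import itertools
import math

STATES = list(itertools.product((0, 1), repeat=2))

# Hypothetical features over two binary variables x = (x0, x1).
FEATURES = [
    lambda x: float(x[0] == 1),
    lambda x: float(x[1] == 1),
    lambda x: float(x[0] == x[1]),
]


def model_expectations(theta):
    """E_theta[f_i] for every feature, by enumeration (stand-in for inference)."""
    weights = [math.exp(sum(t * f(x) for t, f in zip(theta, FEATURES)))
               for x in STATES]
    z = sum(weights)
    return [sum(w / z * f(x) for w, x in zip(weights, STATES))
            for f in FEATURES]


def empirical_expectations(data):
    """E_D[f_i]: the empirical frequency of each indicator feature."""
    return [sum(f(x) for x in data) / len(data) for f in FEATURES]


def fit_mle(data, step=1.0, iters=1000):
    """Plain gradient ascent on (1/M) * log-likelihood."""
    theta = [0.0] * len(FEATURES)
    target = empirical_expectations(data)      # computed once
    for _ in range(iters):
        current = model_expectations(theta)    # inference repeated every step
        theta = [t + step * (e_d - e_m)
                 for t, e_d, e_m in zip(theta, target, current)]
    return theta


# Illustrative data set containing all four joint states.
data = [(0, 0)] * 4 + [(0, 1)] * 1 + [(1, 0)] * 2 + [(1, 1)] * 3
theta_hat = fit_mle(data)
```

At convergence `model_expectations(theta_hat)` should agree with `empirical_expectations(data)`, which is the moment-matching condition stated above.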

21 Max likelihood = Max Entropy Idea: we want to re-construct the distribution corresponding to the given empirical expectations of the features We demand maximal entropy as a sign that we add no other constraints (no more information)

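22 Max likelihood = Max Entropy Formally: find $Q$ maximizing $H_Q$ subject to $E_Q[f_i] = E_D[f_i]$ for every feature $f_i$. Turns out that the maximizer is the MLE distribution, $Q^* = P_{\hat\theta}$ (Theorem 20.2)

23 Max Likelihood = Max Entropy Formally: find $Q$ maximizing $H_Q$ subject to $E_Q[f_i] = E_D[f_i]$. The distribution given by the MLE clearly satisfies the constraints. Now we need to prove it maximizes the entropy. Point of notation: $P_{\hat\theta}$ denotes the MLE distribution

24 Max Likelihood = Max Entropy Let $Q$ be another distribution adhering to the same constraints, and let $P_{\hat\theta}$ be the MLE distribution. Then: $E_Q[\ln P_{\hat\theta}(\xi)] = E_Q[\sum_i \hat\theta_i f_i(\xi) - \ln Z(\hat\theta)] = \sum_i \hat\theta_i E_Q[f_i] - \ln Z(\hat\theta) = \sum_i \hat\theta_i E_D[f_i] - \ln Z(\hat\theta) = \sum_i \hat\theta_i E_{P_{\hat\theta}}[f_i] - \ln Z(\hat\theta) = E_{P_{\hat\theta}}[\ln P_{\hat\theta}(\xi)]$

25 Max Likelihood = Max Entropy Let $Q \ne P_{\hat\theta}$ be another distribution adhering to the same constraints. Then, using $E_Q[\ln P_{\hat\theta}(\xi)] = E_{P_{\hat\theta}}[\ln P_{\hat\theta}(\xi)]$: $H_{P_{\hat\theta}} - H_Q = -E_{P_{\hat\theta}}[\ln P_{\hat\theta}(\xi)] + E_Q[\ln Q(\xi)] = -E_Q[\ln P_{\hat\theta}(\xi)] + E_Q[\ln Q(\xi)] = D(Q \,\|\, P_{\hat\theta}) > 0$. Meaning, $H_{P_{\hat\theta}} > H_Q$, Q.E.D.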

26 MLE Prior Reminder: MLE by itself is prone to overfitting Therefore, a prior distribution is taken in order to bias the solution toward a prior model

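27 MLE Prior Two priors: Gaussian ($L_2$): $P(\theta \mid \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-\frac{\theta_i^2}{2\sigma^2}\}$. Laplacian ($L_1$): $P(\theta \mid \beta) = \prod_i \frac{1}{2\beta}\exp\{-\frac{|\theta_i|}{\beta}\}$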

28 MLE Prior Idea: both priors penalize parameters $\theta_i$ that are too large, since we do not want to assume too much dependence on a single feature. Gaussian prior: penalizes large values more, but there is no incentive to get exactly to 0. Laplacian prior: exactly the opposite, resulting in sparser constructions

29 MLE Prior Note: in log form, both priors are concave. Therefore, they can be added to the log-likelihood $\ell(\theta : D)$ with no need to change the algorithm. The parameters regulating the width of the prior(s) reflect how important it is for us to drive the $\theta_i$ to 0. Method for selection: cross-validation (select values, run on part of the data, check vs. the remaining data)
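
The contrast between the two priors shows up directly in the gradient term each one adds to the ascent. The sketch below differentiates the log of the two densities above; the function names and the sample values of `sigma2` and `beta` are illustrative assumptions.

```python
def gaussian_log_prior_grad(theta, sigma2):
    """d/d theta_i of ln P(theta | sigma^2): the pull is proportional to theta_i."""
    return [-t / sigma2 for t in theta]


def laplacian_log_prior_grad(theta, beta):
    """Subgradient of ln P(theta | beta): a constant-size pull toward 0."""
    def sign(t):
        return (t > 0) - (t < 0)
    return [-sign(t) / beta for t in theta]


theta = [0.01, 5.0]  # one tiny weight, one large weight
gauss_pull = gaussian_log_prior_grad(theta, sigma2=1.0)
laplace_pull = laplacian_log_prior_grad(theta, beta=1.0)
```

The Gaussian pull nearly vanishes on the tiny weight and is strong on the large one, while the Laplacian pull has the same magnitude on both, which is what drives small weights all the way to zero.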

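30 MLE Conjugate Prior $P(D \mid \theta) = \exp\{M(\sum_i \theta_i E_D[f_i] - \ln Z(\theta))\}$. In order for the posterior to be of the same form, the prior has to be $P(\theta) \propto \exp\{M_0(\sum_i \theta_i \alpha_i - \ln Z(\theta))\}$, which might be construed as prior data of size $M_0$ yielding $\alpha_i$ as the expected value per feature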

31 Missing Data "All the business of war is to endeavour to find out what you don't know by what you do." The Duke of Wellington

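32 MLE with missing data As previously, let us denote: $o[m]$: vector of observed values in the $m$-th sample; $h[m]$: vector of hidden values in the $m$-th sample. The log likelihood then becomes: $\ell(\theta : D) = \sum_{m=1}^{M}\ln\sum_{h[m]} P(o[m], h[m] \mid \theta) = \sum_{m=1}^{M}\ln\sum_{h[m]}\tilde P(o[m], h[m] \mid \theta) - M\ln Z(\theta)$, where $\tilde P$ is the unnormalized measure

33 MLE with missing data, cont. Let us have a closer look at the term $\sum_{h[m]}\tilde P(o[m], h[m] \mid \theta)$. It has the form of a partition function: it is the partition function of the reduction of the original network by the observation $o[m]$, and thus adheres to the derivation of $\ln Z$ presented previously

34 MLE with missing data, cont. The term being differentiated is a partition function, and thus: $\frac{\partial}{\partial\theta_i}\ln\sum_{h[m]}\tilde P(o[m], h[m] \mid \theta) = E_{h[m]\sim P(H[m] \mid o[m],\theta)}[f_i(o[m], h[m])]$. Leading to the conclusion that the gradient of the log-likelihood is $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = \frac{1}{M}\sum_{m} E_{h[m]\sim P(H[m] \mid o[m],\theta)}[f_i(o[m], h[m])] - E_\theta[f_i]$

35 MLE with missing data, cont. $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = \frac{1}{M}\sum_{m} E_{h[m]\sim P(H[m] \mid o[m],\theta)}[f_i(o[m], h[m])] - E_\theta[f_i]$. The second term, as before, requires an inference computation. The first term requires a computation of inference on the reduced network per instance $m$, for a single iteration of the ascent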
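
As a sanity check of this gradient, the sketch below evaluates it on a toy model with one observed bit and one hidden bit, where every expectation can be computed by enumeration. The feature set, `theta` and the observations are assumptions for illustration only.

```python
import math

# Hypothetical features over one observed bit o and one hidden bit h.
FEATURES = [
    lambda o, h: float(o == 1),
    lambda o, h: float(h == 1),
    lambda o, h: float(o == h),
]


def score(theta, o, h):
    """Log of the unnormalized measure: sum_i theta_i * f_i(o, h)."""
    return sum(t * f(o, h) for t, f in zip(theta, FEATURES))


def avg_log_likelihood(theta, observed):
    """(1/M) * sum_m [ln sum_h P~(o[m], h) - ln Z]."""
    log_z = math.log(sum(math.exp(score(theta, o, h))
                         for o in (0, 1) for h in (0, 1)))
    return sum(
        math.log(sum(math.exp(score(theta, o, h)) for h in (0, 1))) - log_z
        for o in observed) / len(observed)


def gradient(theta, observed):
    """Average clamped expectation minus the unclamped expectation."""
    z = sum(math.exp(score(theta, o, h)) for o in (0, 1) for h in (0, 1))
    grad = []
    for f in FEATURES:
        clamped = 0.0
        for o in observed:  # one inference on the reduced network per sample
            w = [math.exp(score(theta, o, h)) for h in (0, 1)]
            clamped += sum(wh * f(o, h) for h, wh in enumerate(w)) / sum(w)
        free = sum(math.exp(score(theta, o, h)) / z * f(o, h)
                   for o in (0, 1) for h in (0, 1))
        grad.append(clamped / len(observed) - free)
    return grad


theta = [0.2, -0.4, 0.9]
observed = [0, 1, 1, 0, 1]
grad = gradient(theta, observed)
```

The inner loop over `observed` is the per-instance inference on the reduced network; the `free` term is the single inference on the full network.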

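36 Expectation Maximization EM: an alternative approach. Efficient for BNs (previous lecture). Main idea: bootstrapping. Start with some initial $\theta$. Compute the corresponding distribution for the missing data $H$. Based on the full data $\langle D, H \rangle$, compute a new $\theta$, etc., until convergence

37 Expectation Maximization, cont. For BNs: assess probabilities for each family assignment $x, \mathbf{u}$ (E-step). Based on these probabilities, compute (M-step): $\theta^{(t+1)}_{x \mid \mathbf{u}} = \frac{\bar M[x, \mathbf{u}]}{\bar M[\mathbf{u}]}$, where $\bar M$ denotes expected counts

38 Expectation Maximization, cont. For MNs: assess probabilities for each feature (E-step): $\bar E[f_i] = \frac{1}{M}\sum_m E_{h[m]\sim P(H[m] \mid o[m],\theta^{(t)})}[f_i(o[m], h[m])]$. Done using inference per each $m$. Compute the next $\theta$ based on that: how?

39 Expectation Maximization, cont. Computing an optimal $\theta$ from a set of full data in an MN is done using gradient ascent, involving running inference multiple times for a single iteration of the EM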

40 Missing Data: GA vs. EM In both methods, we need inference In GA: M+1 times per each step In EM: (M times + 1 time per step of the GA) per step of the algorithm
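
A back-of-the-envelope comparison of the two inference budgets, following the counts on this slide. The function names and the sample sizes and step counts are hypothetical; the real numbers depend on how many steps each method needs to converge.

```python
def ga_inference_calls(num_samples, ascent_steps):
    """Gradient ascent: M clamped inferences plus 1 unclamped one per step."""
    return (num_samples + 1) * ascent_steps


def em_inference_calls(num_samples, em_steps, ascent_steps_per_m_step):
    """EM: M inferences in the E-step, then 1 inference per inner ascent step."""
    return em_steps * (num_samples + ascent_steps_per_m_step)


# Illustrative setting: 100 samples, 50 ascent steps, 10 EM iterations.
ga_total = ga_inference_calls(num_samples=100, ascent_steps=50)
em_total = em_inference_calls(num_samples=100, em_steps=10,
                              ascent_steps_per_m_step=50)
```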

41 Alternative Objectives "A clever man goes not over a mountain, but rather around it." A Russian proverb

42 Log likelihood revisited Reminder: $\ell(\theta : \xi) = \ln\tilde P(\xi \mid \theta) - \ln\sum_{\xi'}\tilde P(\xi' \mid \theta)$. Interpretation: we strive to increase the difference between the log-measures of the data and the aggregate of all instances. Problem: the aggregate (second term, $\ln Z(\theta)$) sums over exponentially many instances. Idea: define a simpler objective

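43 Pseudolikelihood Consider the probability of a single instance $\xi = (x_1, \dots, x_n)$. By the chain rule: $P(\xi) = \prod_{j=1}^{n} P(x_j \mid x_1, \dots, x_{j-1})$

44 Pseudolikelihood Each factor is approximated by conditioning on all the other variables: $P(x_j \mid x_1, \dots, x_{j-1}) \approx P(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n) = P(x_j \mid x_{-j})$. From here we can derive the pseudo-likelihood: $\ell_{PL}(\theta : D) = \frac{1}{M}\sum_m\sum_j \ln P(x_j[m] \mid x_{-j}[m], \theta)$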

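45 Pseudolikelihood $\ell_{PL}(\theta : D) = \frac{1}{M}\sum_m\sum_j \ln P(x_j[m] \mid x_{-j}[m], \theta)$. The main advantage: it is easier to compute, as there is less coupling and summation, since $P(x_j \mid x_{-j}) = \frac{P(x_j, x_{-j})}{P(x_{-j})} = \frac{\tilde P(x_j, x_{-j})}{\sum_{x'_j}\tilde P(x'_j, x_{-j})}$. The number of elements to sum is $|\mathrm{Val}(X_j)|$, which is clearly sub-exponential

46 Pseudolikelihood Let us now further analyze the summands: $\ln P(x_j \mid x_{-j}) = \ln\tilde P(x_j, x_{-j}) - \ln\sum_{x'_j}\tilde P(x'_j, x_{-j}) = \sum_i \theta_i f_i(x_j, x_{-j}) - \ln\sum_{x'_j}\exp\{\sum_i \theta_i f_i(x'_j, x_{-j})\}$

47 Pseudolikelihood $\ln P(x_j \mid x_{-j}) = \sum_{i:\,X_j\in\mathrm{Scope}[f_i]}\theta_i f_i(x_j, x_{-j}) - \ln\sum_{x'_j}\exp\{\sum_{i:\,X_j\in\mathrm{Scope}[f_i]}\theta_i f_i(x'_j, x_{-j})\}$. The expression above is the log-likelihood of an MN over $X_j$ conditioned on the rest, meaning it is concave. And the pseudo-likelihood, being the sum of such terms, is concave as well! Gradient Ascent again

48 Pseudolikelihood $\ln P(x_j \mid x_{-j}) = \sum_{k:\,X_j\in\mathrm{Scope}[f_k]}\theta_k f_k(x_j, x_{-j}) - \ln\sum_{x'_j}\exp\{\sum_{k:\,X_j\in\mathrm{Scope}[f_k]}\theta_k f_k(x'_j, x_{-j})\}$, hence $\frac{\partial}{\partial\theta_i}\ln P(x_j \mid x_{-j}) = f_i(x_j, x_{-j}) - \frac{\sum_{x'_j} f_i(x'_j, x_{-j})\exp\{\sum_{k:\,X_j\in\mathrm{Scope}[f_k]}\theta_k f_k(x'_j, x_{-j})\}}{\sum_{x'_j}\exp\{\sum_{k:\,X_j\in\mathrm{Scope}[f_k]}\theta_k f_k(x'_j, x_{-j})\}} = f_i(x_j, x_{-j}) - E_{x'_j\sim P_\theta(X_j \mid x_{-j})}[f_i(x'_j, x_{-j})]$. Note: when $X_j \notin \mathrm{Scope}[f_i]$, it does not affect the value of the feature, and so the expression becomes 0

49 Pseudolikelihood $\frac{\partial}{\partial\theta_i}\ln P(x_j \mid x_{-j}) = f_i(x_j, x_{-j}) - E_{x'_j\sim P_\theta(X_j \mid x_{-j})}[f_i(x'_j, x_{-j})]$, so $\frac{\partial}{\partial\theta_i}\ell_{PL}(\theta : D) = \sum_{j:\,X_j\in\mathrm{Scope}[f_i]}\left(\frac{1}{M}\sum_m f_i(\xi[m]) - \frac{1}{M}\sum_m E_{x'_j\sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x'_j, x_{-j}[m])]\right)$. The computation is much simpler: all expectations (summations) are local. Finding the maximum of the PL is tractable. But does it do us any good?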
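
The locality of the pseudo-likelihood can be illustrated on a three-variable chain: each conditional only needs a sum over the two values of one variable. The sketch below is an assumption-laden toy (hypothetical features, data and `theta`); it sums over all `j`, relying on the fact that terms with $X_j$ outside the scope of $f_i$ vanish on their own.

```python
import math

# Hypothetical features on a binary chain X0 - X1 - X2.
FEATURES = [
    lambda x: float(x[0] == x[1]),
    lambda x: float(x[1] == x[2]),
    lambda x: float(x[1] == 1),
]


def score(theta, x):
    """Log of the unnormalized measure of the full assignment x."""
    return sum(t * f(x) for t, f in zip(theta, FEATURES))


def conditional(theta, x, j):
    """[(assignment, P(X_j = v | x_{-j}))] for v in (0, 1): a local sum only."""
    variants = [x[:j] + (v,) + x[j + 1:] for v in (0, 1)]
    weights = [math.exp(score(theta, y)) for y in variants]
    total = sum(weights)
    return [(y, w / total) for y, w in zip(variants, weights)]


def pseudo_log_likelihood(theta, data):
    """(1/M) * sum_m sum_j ln P(x_j[m] | x_{-j}[m])."""
    return sum(math.log(conditional(theta, x, j)[x[j]][1])
               for x in data for j in range(len(x))) / len(data)


def pl_gradient(theta, data):
    """(1/M) * sum_m sum_j (f_i(x[m]) - E_{x'_j}[f_i(x'_j, x_{-j}[m])])."""
    grad = [0.0] * len(FEATURES)
    for x in data:
        for j in range(len(x)):
            cond = conditional(theta, x, j)
            for i, f in enumerate(FEATURES):
                grad[i] += f(x) - sum(p * f(y) for y, p in cond)
    return [g / len(data) for g in grad]


theta = [0.5, -0.3, 0.8]
data = [(0, 0, 1), (1, 1, 1), (0, 1, 0), (1, 1, 0)]
grad = pl_gradient(theta, data)
```

No global partition function appears anywhere in this code, which is the whole point of the objective.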

50 Pseudolikelihood Theorem (20.3): Assume the data is generated by a log-linear model of the form described previously, with parameters $\theta^*$. Then, as $M$ goes to infinity, $\theta^*$ is the argument for the PL global optimum, with probability approaching 1. Idea of proof: show the gradient is 0 at $\theta^*$ (why is that sufficient?)

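51 Pseudolikelihood $\frac{\partial}{\partial\theta_i}\ell_{PL}(\theta : D) = \sum_{j:\,X_j\in\mathrm{Scope}[f_i]}\left(\frac{1}{M}\sum_m f_i(\xi[m]) - \frac{1}{M}\sum_m E_{x'_j\sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x'_j, x_{-j}[m])]\right)$. The first term is the empirical expectancy $E_D[f_i]$, which, as $M \to \infty$, goes to $E_{\theta^*}[f_i]$, where $\theta^*$ are the generating parameters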

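52 Pseudolikelihood $\frac{\partial}{\partial\theta_i}\ell_{PL}(\theta : D) = \sum_{j:\,X_j\in\mathrm{Scope}[f_i]}\left(\frac{1}{M}\sum_m f_i(\xi[m]) - \frac{1}{M}\sum_m E_{x'_j\sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x'_j, x_{-j}[m])]\right)$. The second term is: $\frac{1}{M}\sum_m E_{x'_j\sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x'_j, x_{-j}[m])] = \sum_{x_{-j}} P_D(x_{-j})\sum_{x'_j} P_\theta(x'_j \mid x_{-j})\,f_i(x'_j, x_{-j})$, where $P_D$ is the empirical distribution

53 Pseudolikelihood At $\theta = \theta^*$ (the generating parameters) and as $M \to \infty$, so that $P_D \to P_{\theta^*}$, the second term becomes: $\sum_{x_{-j}} P_{\theta^*}(x_{-j})\sum_{x'_j} P_{\theta^*}(x'_j \mid x_{-j})\,f_i(x'_j, x_{-j}) = \sum_{x'_j, x_{-j}} P_{\theta^*}(x'_j, x_{-j})\,f_i(x'_j, x_{-j}) = E_{\theta^*}[f_i]$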

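54 Pseudolikelihood $\frac{\partial}{\partial\theta_i}\ell_{PL}(\theta : D) = \sum_{j:\,X_j\in\mathrm{Scope}[f_i]}\left(\frac{1}{M}\sum_m f_i(\xi[m]) - \frac{1}{M}\sum_m E_{x'_j\sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x'_j, x_{-j}[m])]\right)$. The first term, as $M \to \infty$, goes to $E_{\theta^*}[f_i]$, where $\theta^*$ are the generating parameters. The second term, as $M \to \infty$ and $\theta = \theta^*$, goes to $E_{\theta^*}[f_i]$ as well. Ergo: as $M \to \infty$, the gradient at $\theta = \theta^*$ is 0, QED

55 Pseudolikelihood, concluded An alternative objective to the MLE. Tractable. The same as MLE as the data size increases. But! A large data sample is required for the PL optimum to reflect the real MLE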

56 Contrastive Divergence Main idea: increase the difference between the log-probability of the observed data and some other value, representing "the world": Global Partition Function (MLE); Single-Variable Partition Function (PL); Log-Probability of Perturbed Data (CD)

57 Contrastive Divergence CD is about the difference between the original data set $D$ and a perturbed data set $D^-$. Formally: $\ell_{CD}(\theta : D \,\|\, D^-) = E_{\xi\sim\hat P_D}[\ln\tilde P_\theta(\xi)] - E_{\xi\sim\hat P_{D^-}}[\ln\tilde P_\theta(\xi)]$, i.e. the difference between the two empirical expectations of the (unnormalized) log-probability

58 Contrastive Divergence CD is about the difference between the original data set and a perturbed data set. How to choose this data set? The contrasting data set needs to represent a data sample characteristic of the current $\theta$, so that we strive to increase the probability of the original sampled data relative to the current result, which serves as the contrast in this iterative process

59 Contrastive Divergence The contrasting data set needs to represent a data sample characteristic of the current $\theta$. How? Gibbs sampling starting from $D$. Long sampling until convergence: too expensive. A short chain is good enough and yields better convergence

60 Contrastive Divergence How do we compute the optimum? Gradient Ascent yet again: $\ell_{CD}(\theta : D \,\|\, D^-) = E_{\xi\sim\hat P_D}[\ln\tilde P_\theta(\xi)] - E_{\xi\sim\hat P_{D^-}}[\ln\tilde P_\theta(\xi)]$, and since $\ln\tilde P_\theta(\xi) = \sum_i \theta_i f_i(\xi)$: $\frac{\partial}{\partial\theta_i}\ell_{CD}(\theta : D \,\|\, D^-) = E_{\hat P_D}[f_i] - E_{\hat P_{D^-}}[f_i]$

61 Contrastive Divergence $\frac{\partial}{\partial\theta_i}\ell_{CD}(\theta : D \,\|\, D^-) = E_{\hat P_D}[f_i] - E_{\hat P_{D^-}}[f_i]$. Note: as $\hat P_{D^-} \to P_\theta$, the elements of the gradient converge to those of the log-likelihood gradient. At the limit of the Markov chain, the CD converges to the actual MLE
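
One ascent step of this scheme can be sketched as below: a short Gibbs chain started at the data produces the contrast set, and the update uses the difference of feature averages. Everything concrete here (features, data, learning rate, chain length, seed, number of steps) is an assumption for illustration, not the presenters' setup.

```python
import math
import random

# Hypothetical features on a binary chain X0 - X1 - X2.
FEATURES = [
    lambda x: float(x[0] == x[1]),
    lambda x: float(x[1] == x[2]),
    lambda x: float(x[1] == 1),
]


def score(theta, x):
    """Log of the unnormalized measure of the full assignment x."""
    return sum(t * f(x) for t, f in zip(theta, FEATURES))


def gibbs_sweep(theta, x, rng):
    """Resample each variable once from P(X_j | x_{-j}, theta)."""
    x = list(x)
    for j in range(len(x)):
        weights = []
        for v in (0, 1):
            x[j] = v
            weights.append(math.exp(score(theta, tuple(x))))
        p_one = weights[1] / (weights[0] + weights[1])
        x[j] = 1 if rng.random() < p_one else 0
    return tuple(x)


def cd_gradient(data, contrast):
    """E_D[f_i] - E_{D^-}[f_i] for every feature."""
    return [sum(f(x) for x in data) / len(data)
            - sum(f(x) for x in contrast) / len(contrast)
            for f in FEATURES]


def cd_step(theta, data, rng, k=1, rate=0.1):
    """Perturb the data with k Gibbs sweeps, then take one ascent step."""
    contrast = list(data)
    for _ in range(k):
        contrast = [gibbs_sweep(theta, x, rng) for x in contrast]
    grad = cd_gradient(data, contrast)
    return [t + rate * g for t, g in zip(theta, grad)]


rng = random.Random(0)
data = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
theta = [0.0, 0.0, 0.0]
for _ in range(200):
    theta = cd_step(theta, data, rng)
```

No inference on the full network is run at any point; the only model-dependent computation is the local conditional inside `gibbs_sweep`.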

62 Checkpoint "Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning." Winston Churchill

63 Checkpoint Maximum likelihood for MN: not easily decomposable due to the Partition Function. Solvable using Gradient Ascent, but requires running inference on the MN for each step. Priors: Gaussian vs. Laplacian; both aim to reduce too strong a dependency of the overall probability on a single feature. Conjugate Prior

64 Checkpoint MLE with missing data. GA: requires running inference per observation per step. EM: reduces to Gradient Ascent, requiring slightly fewer runs of inference. Alternative goals. Pseudo-likelihood: tractable, requires a sufficiently large data sample. Contrastive Divergence: tractable, does not require much sampling

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Learning MN Parameters with Approximation. Sargur Srihari

Learning MN Parameters with Approximation. Sargur Srihari Learning MN Parameters with Approximation Sargur srihari@cedar.buffalo.edu 1 Topics Iterative exact learning of MN parameters Difficulty with exact methods Approximate methods Approximate Inference Belief

More information

Learning Parameters of Undirected Models. Sargur Srihari

Learning Parameters of Undirected Models. Sargur Srihari Learning Parameters of Undirected Models Sargur srihari@cedar.buffalo.edu 1 Topics Log-linear Parameterization Likelihood Function Maximum Likelihood Parameter Estimation Simple and Conjugate Gradient

More information

Learning Parameters of Undirected Models. Sargur Srihari

Learning Parameters of Undirected Models. Sargur Srihari Learning Parameters of Undirected Models Sargur srihari@cedar.buffalo.edu 1 Topics Difficulties due to Global Normalization Likelihood Function Maximum Likelihood Parameter Estimation Simple and Conjugate

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Notes on Markov Networks

Notes on Markov Networks Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Markov Networks.

Markov Networks. Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1) HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm

IEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015 Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Aaron C. Courville Université de Montréal Note: Material for the slides is taken directly from a presentation prepared by Christopher M. Bishop Learning in DAGs Two things could

More information

Basic math for biology

Basic math for biology Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 10 Undirected Models CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due this Wednesday (Nov 4) in class Project milestones due next Monday (Nov 9) About half

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Structured Variational Inference

Structured Variational Inference Structured Variational Inference Sargur srihari@cedar.buffalo.edu 1 Topics 1. Structured Variational Approximations 1. The Mean Field Approximation 1. The Mean Field Energy 2. Maximizing the energy functional:

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Mixtures of Gaussians. Sargur Srihari

Mixtures of Gaussians. Sargur Srihari Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample

More information

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 10 Alternatives to Monte Carlo Computation Since about 1990, Markov chain Monte Carlo has been the dominant

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm. by Korbinian Schwinger

Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm. by Korbinian Schwinger Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm by Korbinian Schwinger Overview Exponential Family Maximum Likelihood The EM Algorithm Gaussian Mixture Models Exponential

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Expectation Maximization Algorithm

Expectation Maximization Algorithm Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters

More information

COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017

COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SOFT CLUSTERING VS HARD CLUSTERING

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information
