Introduction to Probability for Graphical Models

Introduction to Probability for Graphical Models. CSC 412. Kaustav Kundu. Thursday, January 14, 2016. *Most slides based on Kevin Swersky's slides, Inmar Givoni's slides, Danny Tarlow's slides, Jasper Snoek's slides, Sam Roweis's review of probability, Bishop's book, and some images from Wikipedia.

Outline: Basics; Probability rules; Exponential family models; Maximum likelihood; Conjugate Bayesian inference (time permitting).

Why Represent Uncertainty? The world is full of uncertainty: What will the weather be like today? Will I like this movie? Is there a person in this image? We're trying to build systems that understand and possibly interact with the real world. We often can't prove something is true, but we can still ask how likely different outcomes are, or ask for the most likely explanation. Sometimes probability gives a concise description of an otherwise complex phenomenon.

Why Use Probability to Represent Uncertainty? Write down simple, reasonable criteria that you'd want from a system of uncertainty (common sense stuff), and you always get probability. Cox's Axioms (Cox 1946); see Bishop, Section 1.2.3. We will restrict ourselves to a relatively informal discussion of probability theory.

Notation. A random variable X represents outcomes or states of the world. We will write p(x) to mean Probability(X = x). Sample space: the space of all possible outcomes (may be discrete, continuous, or mixed). p(x) is the probability mass (density) function: it assigns a number to each point in sample space, is non-negative, and sums (integrates) to 1. Intuitively: how often does x occur, how much do we believe in x.

Joint Probability Distribution: p(x, y) = Prob(X = x, Y = y), the probability of X = x and Y = y. Conditional Probability Distribution: p(x | y) = Prob(X = x | Y = y), the probability of X = x given Y = y, with p(x | y) = p(x, y) / p(y).

The Rules of Probability.
Sum Rule (marginalization / summing out): p(x) = Σ_y p(x, y).
Product/Chain Rule: p(x, y) = p(x | y) p(y), and more generally p(x_1, ..., x_N) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ··· p(x_N | x_1, ..., x_{N-1}).
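A minimal numerical sketch (not from the slides; the 2x3 joint table below is made up purely for illustration) showing the sum and product rules on a small discrete distribution:

import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],     # rows index x, columns index y
                 [0.25, 0.05, 0.30]])
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule (marginalization): p(x) = sum_y p(x, y), p(y) = sum_x p(x, y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(x, y) = p(x | y) p(y)
p_x_given_y = p_xy / p_y                  # divide each column by p(y)
assert np.allclose(p_x_given_y * p_y, p_xy)

print("p(x) =", p_x)                      # [0.4, 0.6]
print("p(y) =", p_y)                      # [0.35, 0.25, 0.40]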

Bayes Rule. One of the most important formulas in probability theory. This gives us a way of reversing conditional probabilities:
p(y | x) = p(x | y) p(y) / p(x) = p(x | y) p(y) / Σ_{y'} p(x | y') p(y').
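A small worked Bayes-rule sketch (all numbers are illustrative, not from the slides), reversing p(positive test | disease) into p(disease | positive test):

# Prior, likelihood, and false-positive rate are made-up values.
p_disease = 0.01                      # prior p(y)
p_pos_given_disease = 0.90            # likelihood p(x | y)
p_pos_given_healthy = 0.05            # p(x | y')

# Denominator: p(x) = sum over y' of p(x | y') p(y')
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.154: the disease is still fairly unlikely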

Independence. Two random variables are said to be independent iff their joint distribution factors: p(x, y) = p(x) p(y). Two random variables are conditionally independent given a third if they are independent after conditioning on the third: p(x, y | z) = p(x | z) p(y | z).

Continuous Random Variables. Outcomes are real values. Probability density functions define distributions, e.g.
p(x | μ, σ²) = (1 / √(2πσ²)) exp{−(x − μ)² / (2σ²)}.
Continuous joint distributions: replace sums with integrals, and everything holds. E.g., marginalization and conditional probability:
p(x | z) = ∫ p(x, y | z) dy = ∫ p(x | y, z) p(y | z) dy.

Summarizing Probability Distributions. It is often useful to give summaries of distributions without defining the whole distribution, e.g. mean and variance.
Mean: E[x] = ∫ x p(x) dx.
Variance: var(x) = E[(x − E[x])²] = ∫ (x − E[x])² p(x) dx.
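A quick Monte Carlo sanity check (not from the slides; μ and σ are illustrative) that the sample mean and variance of Gaussian draws approach μ and σ²:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
x = rng.normal(mu, sigma, size=200_000)

print(x.mean())          # approximately 2.0,  i.e. E[x]
print(x.var())           # approximately 0.25, i.e. E[(x - E[x])^2]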


Exponential Family. A family of probability distributions. Many of the standard distributions belong to this family: Bernoulli, Binomial/Multinomial, Poisson, Normal (Gaussian), Beta/Dirichlet. They share many important properties, e.g. they have a conjugate prior (we'll get to that later), which is important for Bayesian statistics.

Definition. The exponential family of distributions over x, given parameter η (eta), is the set of distributions of the form
p(x | η) = h(x) g(η) exp{η^T u(x)}
where x may be scalar or vector, discrete or continuous; η are the natural parameters; u(x) is some function of x (the sufficient statistic); g(η) is the normalizer, satisfying
g(η) ∫ h(x) exp{η^T u(x)} dx = 1;
and h(x) is the base measure (often constant).

Sufficient Statistics. Vague definition: called so because they completely summarize a distribution. Less vague: they are the only part of the distribution that interacts with the parameters and are therefore sufficient to estimate the parameters.

Example 1: Bernoulli. Binary random variable (x = 1: heads), e.g. a coin toss. x ∈ {0, 1}, μ ∈ [0, 1].

Example 1: Bernoulli.
p(x | μ) = μ^x (1 − μ)^(1−x) = exp{x ln μ + (1 − x) ln(1 − μ)} = (1 − μ) exp{x ln(μ / (1 − μ))}.
Matching this to p(x | η) = h(x) g(η) exp{η^T u(x)} gives
η = ln(μ / (1 − μ)), so μ = σ(η) = 1 / (1 + e^{−η}), u(x) = x, h(x) = 1, g(η) = σ(−η) = 1 − μ.
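A short numerical check (assuming the parameterization above; μ = 0.3 is an illustrative value) that the exponential-family form reproduces the Bernoulli probabilities exactly:

import numpy as np

mu = 0.3
eta = np.log(mu / (1 - mu))           # natural parameter
g = 1 - mu                            # g(eta), equivalently sigmoid(-eta)

for x in (0, 1):
    standard = mu**x * (1 - mu)**(1 - x)
    exp_family = g * np.exp(eta * x)  # h(x) = 1
    assert np.isclose(standard, exp_family)
    print(x, standard, exp_family)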

Example 2: Multinomial. x takes one value k out of M. For a single observation (e.g. a die toss) this is sometimes called Categorical. For multiple observations: integer counts m_k on N trials, with μ_k ∈ [0, 1] and Σ_k μ_k = 1. E.g., the probability that 1 came out 3 times, 2 came out once, ..., and 6 came out 7 times if I tossed a die N times.
P(m_1, ..., m_M | μ, N) = (N! / (m_1! ··· m_M!)) ∏_{k=1}^M μ_k^{m_k}, where Σ_k m_k = N.

Example 2: Multinomial. For a single (one-hot) observation x,
p(x | μ) = ∏_{k=1}^M μ_k^{x_k} = exp{Σ_{k=1}^M x_k ln μ_k}.
Matching this to p(x | η) = h(x) g(η) exp{η^T u(x)} gives
η = (ln μ_1, ..., ln μ_M)^T, i.e. μ_k = e^{η_k}, u(x) = x, h(x) = 1, g(η) = 1.
The parameters are not independent due to the constraint of summing to 1; there is a slightly more involved notation to address that, see Bishop 2.4.

Example 3: Normal (Gaussian) Distribution.
p(x | μ, σ²) = (1 / √(2πσ²)) exp{−(x − μ)² / (2σ²)}.
μ is the mean, σ² is the variance. Can verify these by computing integrals, e.g.
∫ x (1 / √(2πσ²)) exp{−(x − μ)² / (2σ²)} dx = μ.
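As a sketch of "verify by computing integrals" (not from the slides; μ = 1.5 and σ = 2 are illustrative), a crude numerical integration confirming the density integrates to 1 and has mean μ:

import numpy as np

mu, sigma = 1.5, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

dx = x[1] - x[0]
print((p * dx).sum())        # ~1.0, the density normalizes
print((x * p * dx).sum())    # ~1.5, i.e. E[x] = mu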

Example 3: Normal (Gaussian) Distribution, multivariate case.
p(x | μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{−(1/2)(x − μ)^T Σ^{−1} (x − μ)}.
x is now a D-dimensional vector, μ is the mean vector, Σ is the covariance matrix.

Important Properties of Gaussians. All marginals of a Gaussian are again Gaussian. Any conditional of a Gaussian is Gaussian. The product of two Gaussian densities is (up to normalization) again Gaussian. Even the sum of two independent Gaussian RVs is a Gaussian. Beyond the scope of this tutorial, but very important: the marginalization and conditioning rules for multivariate Gaussians.

Gaussian marginalization visualization

Exponential Family Representation of the Gaussian.
p(x | μ, σ²) = (1 / √(2πσ²)) exp{−(x − μ)² / (2σ²)} = (1 / √(2πσ²)) exp{−x²/(2σ²) + μx/σ² − μ²/(2σ²)}.
Matching this to p(x | η) = h(x) g(η) exp{η^T u(x)} gives
η = (μ/σ², −1/(2σ²))^T, u(x) = (x, x²)^T, h(x) = 1/√(2π), g(η) = (−2η_2)^{1/2} exp{η_1² / (4η_2)}.

Example: Maximum Likelihood for a 1D Gaussian. Suppose we are given a data set of samples of a Gaussian random variable X, d = {x_1, ..., x_N}, and told that the variance of the data is σ². What is our best guess of μ? (*Need to assume the data is independent and identically distributed, i.i.d.)

Example: Maximum Likelihood for a 1D Gaussian. What is our best guess of μ? We can write down the likelihood function:
p(d | μ, σ²) = ∏_{i=1}^N p(x_i | μ, σ²) = ∏_{i=1}^N (1 / √(2πσ²)) exp{−(x_i − μ)² / (2σ²)}.
We want to choose the μ that maximizes this expression. Take the log, then basic calculus: differentiate w.r.t. μ, set the derivative to 0, solve for μ to get the sample mean
μ_ML = (1/N) Σ_{i=1}^N x_i.
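A brief sketch of this (synthetic data; the true mean 3.0, σ = 2, and N = 500 are illustrative) comparing the closed-form ML estimate, i.e. the sample mean, with a brute-force search over the log-likelihood:

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
data = rng.normal(3.0, sigma, size=500)        # synthetic samples

mu_ml = data.mean()                            # closed-form ML estimate

def log_lik(mu):
    # log of the Gaussian likelihood of the whole data set at a given mu
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

grid = np.linspace(0, 6, 6001)
mu_grid = grid[np.argmax([log_lik(m) for m in grid])]
print(mu_ml, mu_grid)                          # essentially equal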

Example: Maximum Likelihood for a 1D Gaussian. [Figure: the data points with the maximum likelihood fit, showing μ_ML and σ.]

ML Estimation of Model Parameters for the Exponential Family.
p(d | η) = (∏_{n=1}^N h(x_n)) g(η)^N exp{η^T Σ_{n=1}^N u(x_n)}.
Take the gradient of ln p(d | η) w.r.t. η, set it to 0, and solve:
−∇ ln g(η_ML) = (1/N) Σ_{n=1}^N u(x_n).
This can in principle be solved to get an estimate for η. The solution for the ML estimator depends on the data only through Σ_n u(x_n), which is therefore called the sufficient statistic: what we need to store in order to estimate the parameters.
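A tiny illustration (made-up coin-flip data, not from the slides): for the Bernoulli, u(x) = x, so two data sets with the same Σ_n u(x_n) give exactly the same ML estimate.

import numpy as np

d1 = np.array([1, 1, 0, 0, 0, 1, 0, 0])   # 3 heads out of 8
d2 = np.array([0, 0, 0, 1, 1, 1, 0, 0])   # different order, same sufficient statistic

assert d1.sum() == d2.sum()               # same sum_n u(x_n)
print(d1.mean(), d2.mean())               # same mu_ML = 0.375 for both data sets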

Bayesian Probabilities.
p(θ | d) = p(d | θ) p(θ) / p(d).
p(d | θ) is the likelihood function. p(θ) is the prior probability of θ, or our prior belief over θ: our beliefs over what models are likely or not before seeing any data. p(d) = ∫ p(d | θ) p(θ) dθ is the normalization constant or partition function. p(θ | d) is the posterior distribution: the readjustment of our prior beliefs in the face of data.

Example: Bayesian Inference for a 1D Gaussian. Suppose we have a prior belief that the mean of some random variable X is μ_0 and the variance of our belief is σ_0². We are then given a data set of samples of X, d = {x_1, ..., x_N}, and somehow know that the variance of the data is σ². What is the posterior distribution over our belief about the value of μ?

Example: Bayesian Inference for a 1D Gaussian. [Figure: the N observed data points.]

Example: Bayesian Inference for a 1D Gaussian. [Figure: the prior belief N(μ_0, σ_0²) over μ.]

Example: Bayesian Inference for a 1D Gaussian. Remember from earlier: p(μ | d) = p(d | μ) p(μ) / p(d).
p(d | μ) is the likelihood function: p(d | μ) = ∏_{i=1}^N p(x_i | μ, σ²) = ∏_{i=1}^N (1 / √(2πσ²)) exp{−(x_i − μ)² / (2σ²)}.
p(μ) is the prior probability of μ, or our prior belief over μ: p(μ) = p(μ | μ_0, σ_0²) = (1 / √(2πσ_0²)) exp{−(μ − μ_0)² / (2σ_0²)}.

Example: Bayesian Inference for a 1D Gaussian. The posterior is again Normal: p(μ | d) = N(μ | μ_N, σ_N²), where
μ_N = (σ² / (N σ_0² + σ²)) μ_0 + (N σ_0² / (N σ_0² + σ²)) μ_ML, with μ_ML = (1/N) Σ_{i=1}^N x_i, and
1/σ_N² = 1/σ_0² + N/σ².
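A short numerical sketch of this update (all values here, prior, true mean, and N, are illustrative): compute μ_N and σ_N² from a prior and synthetic data, and note that the posterior mean lies between μ_0 and μ_ML.

import numpy as np

rng = np.random.default_rng(3)
mu0, sigma0 = 0.0, 1.0              # prior belief N(mu0, sigma0^2)
sigma = 2.0                         # known data standard deviation
data = rng.normal(1.5, sigma, size=20)

N = len(data)
mu_ml = data.mean()

# Closed-form posterior parameters from the slide above
mu_N = (sigma**2 / (N * sigma0**2 + sigma**2)) * mu0 \
     + (N * sigma0**2 / (N * sigma0**2 + sigma**2)) * mu_ml
var_N = 1.0 / (1.0 / sigma0**2 + N / sigma**2)

print(mu_ml, mu_N, var_N)           # posterior mean sits between mu0 and mu_ml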

Example: Bayesian Inference for a 1D Gaussian. [Figure sequence: the prior belief N(μ_0, σ_0²); the prior together with the maximum likelihood estimate μ_ML; and the posterior distribution N(μ_N, σ_N²), which is narrower than the prior and lies between μ_0 and μ_ML.]

Conjugate Priors. Notice in the Gaussian parameter estimation example that the functional form of the posterior was that of the prior (Gaussian). Priors that lead to that form are called conjugate priors. For any member of the exponential family there exists a conjugate prior that can be written like
p(η | χ, ν) = f(χ, ν) g(η)^ν exp{ν η^T χ}.
Multiply by the likelihood to obtain a posterior, up to normalization, of the form
p(η | d, χ, ν) ∝ g(η)^{ν+N} exp{η^T (Σ_{n=1}^N u(x_n) + ν χ)}.
Notice the addition to the sufficient statistic; ν is the effective number of pseudo-observations.

Conjugate Priors: Examples. Beta for Bernoulli/Binomial; Dirichlet for Categorical/Multinomial; Normal for Normal.
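A minimal Beta-Bernoulli sketch (the prior pseudo-counts, true bias, and number of flips are illustrative), mirroring the pseudo-observation view above: a Beta(a, b) prior on the coin bias μ, updated with h heads and t tails, gives a Beta(a + h, b + t) posterior.

import numpy as np

a, b = 2.0, 2.0                      # prior pseudo-counts (illustrative)
rng = np.random.default_rng(4)
flips = rng.binomial(1, 0.7, size=50)

h = flips.sum()
t = len(flips) - h
a_post, b_post = a + h, b + t        # conjugate update: just add counts

print(a_post, b_post)
print(a_post / (a_post + b_post))    # posterior mean of mu, near the true bias 0.7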