PROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception


PROBABILITY DISTRIBUTIONS

Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK

Parametric Distributions 3 Basic building blocks: parametric densities p(x | θ). Need to determine θ given a data set D = {x_1, ..., x_N}. Representation: a point estimate θ* or a posterior distribution p(θ | D)? Recall curve fitting.

4 Binary Variables Coin flipping: heads = 1, tails = 0. Bernoulli distribution: Bern(x | µ) = µ^x (1 − µ)^(1 − x), with E[x] = µ and var[x] = µ(1 − µ).

END OF LECTURE MON SEPT 20, 2010

Guidelines for Paper Presentations 6 Everyone should read the paper prior to the presentation and be prepared to discuss it. What is the objective? What tools from the course are being used? What did you not understand?

Guidelines for Paper Presentations 7 For the presenter: Your presentation should be around 10 minutes long, and no more than 15! (About 10 slides.) What is the objective? What tools from the course are being used, and how? What are the key ideas? What are the unsolved problems? Be prepared to answer questions from other students.

8 Binary Variables N coin flips: Binomial distribution Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^(N − m), the distribution of the number of heads m in N flips.

9 Binomial Distribution

Parameter Estimation 10 ML for Bernoulli Given: a data set D = {x_1, ..., x_N} with m heads. Maximizing the log likelihood ln p(D | µ) = Σ_n [x_n ln µ + (1 − x_n) ln(1 − µ)] gives µ_ML = m / N.

11 Parameter Estimation Example: a small data set of all heads (e.g. D = {1, 1, 1}) gives µ_ML = 1. Prediction: all future tosses will land heads up. Overfitting to D.

12 Beta Distribution Distribution over µ ∈ [0, 1]: Beta(µ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] µ^(a − 1) (1 − µ)^(b − 1), where Γ(x) = ∫_0^∞ u^(x − 1) e^(−u) du. Note that Γ(x + 1) = xΓ(x), Γ(1) = 1, and Γ(x + 1) = x! when x is an integer.

13 Bayesian Bernoulli The Beta distribution provides the conjugate prior for the Bernoulli distribution.
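As a concrete illustration of this conjugacy, here is a minimal sketch (not from the slides; the prior hyperparameters a, b and the toy data are invented): a Beta(a, b) prior combined with m heads out of N tosses gives a Beta(a + m, b + N − m) posterior.

    # Beta-Bernoulli conjugacy: prior Beta(a, b), data with m heads out of N tosses
    # gives posterior Beta(a + m, b + N - m).
    from scipy.stats import beta

    a, b = 2.0, 2.0          # hypothetical prior hyperparameters
    data = [1, 1, 1]         # three heads in a row (the overfitting example)
    m, N = sum(data), len(data)

    posterior = beta(a + m, b + N - m)
    print(posterior.mean())  # ~0.71, pulled toward the prior; the ML estimate would be 1.0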

14 Beta Distribution

15 Prior × Likelihood = Posterior

16 Properties of the Posterior As the size N of the data set increases, the posterior mean approaches the maximum likelihood estimate and the posterior variance shrinks toward zero: the posterior becomes increasingly sharply peaked around µ_ML.

17 Multinomial Variables 1-of-K coding scheme: x is a K-dimensional binary vector with exactly one element equal to 1, e.g. x = (0, 0, 1, 0, 0, 0)^T, and p(x | µ) = Π_k µ_k^(x_k), where µ_k ≥ 0 and Σ_k µ_k = 1.

18 ML Parameter estimation Given: a data set D of N observations with counts m_k = Σ_n x_nk, the log likelihood is Σ_k m_k ln µ_k. To ensure Σ_k µ_k = 1, use a Lagrange multiplier λ; the result is µ_k^ML = m_k / N. See Appendix E for a review of Lagrange multipliers.

19 The Multinomial Distribution Mult(m_1, m_2, ..., m_K | µ, N) = (N choose m_1 m_2 ... m_K) Π_k µ_k^(m_k), with m_k ≥ 0 and Σ_k m_k = N, where the multinomial coefficient is (N choose m_1 m_2 ... m_K) = N! / (m_1! m_2! ⋯ m_K!).

20 The Dirichlet Distribution Dir(µ | α) = [Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K))] Π_k µ_k^(α_k − 1), where α_0 = Σ_k α_k. Since Σ_{k=1}^K µ_k = 1, the distribution is confined to a simplex. Conjugate prior for the multinomial distribution.

21 Bayesian Multinomial

22 Bayesian Multinomial
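A small sketch of the resulting Dirichlet-multinomial update (the symmetric prior and the counts below are invented for illustration): a Dir(α) prior combined with counts m gives a Dir(α + m) posterior.

    # Dirichlet-multinomial conjugacy: prior Dir(alpha), counts m_k give posterior Dir(alpha + m).
    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0])     # hypothetical symmetric prior over K = 3 outcomes
    m = np.array([12, 3, 5])              # hypothetical observed counts
    alpha_post = alpha + m

    # Posterior mean of each mu_k, compared with the ML estimate m_k / N
    print(alpha_post / alpha_post.sum())  # Bayesian estimate
    print(m / m.sum())                    # maximum likelihood estimate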

23 The Gaussian Distribution N(x | µ, σ²) = (1 / (2πσ²)^(1/2)) exp{−(x − µ)² / (2σ²)}; in D dimensions, N(x | µ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp{−(1/2)(x − µ)^T Σ^(−1) (x − µ)}.

Central Limit Theorem 24 The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. Example: N uniform [0,1] random variables.
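A quick numerical illustration of this effect (sample sizes and counts are arbitrary): the mean of N uniform [0,1] variables concentrates around 0.5 with variance 1/(12N), and its histogram looks increasingly Gaussian.

    # Central limit theorem demo: means of N uniform [0,1] variables.
    import numpy as np

    rng = np.random.default_rng(0)
    for N in (1, 2, 10):
        means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
        # For uniform [0,1]: mean 0.5, variance 1/(12 N)
        print(N, means.mean(), means.var(), 1.0 / (12 * N))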

25 Geometry of the Multivariate Gaussian The density depends on x through the quadratic form Δ² = (x − µ)^T Σ^(−1) (x − µ), where Δ is the Mahalanobis distance from µ to x. Eigenvector equation: Σ u_i = λ_i u_i, where (u_i, λ_i) are the ith eigenvector and eigenvalue of Σ. Note that Σ real and symmetric ⇒ the λ_i are real. Proof? See Appendix C for a review of matrices and eigenvectors.

26 Geometry of the Multivariate Gaussian In the eigenvector basis, Δ² = Σ_i y_i² / λ_i, where y_i = u_i^T (x − µ), or y = U(x − µ), and (u_i, λ_i) are the ith eigenvector and eigenvalue of Σ. The contours of constant density are ellipsoids with axes along the u_i and lengths proportional to √λ_i.
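A small numerical sketch of this geometry (the covariance matrix and point below are made up): the Mahalanobis distance computed in the rotated coordinates y matches the direct computation with Σ^(−1).

    # Mahalanobis distance and the rotated coordinates y = U (x - mu).
    # np.linalg.eigh returns eigenvectors as the columns of U, so U.T here plays
    # the role of the slide's U (whose rows are the eigenvectors u_i^T).
    import numpy as np

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])     # hypothetical covariance
    x = np.array([1.5, -0.5])

    lam, U = np.linalg.eigh(Sigma)     # Sigma u_i = lambda_i u_i
    y = U.T @ (x - mu)                 # coordinates along the principal axes
    delta2 = np.sum(y**2 / lam)        # squared Mahalanobis distance

    # Same value computed directly from Sigma^{-1}
    print(delta2, (x - mu) @ np.linalg.inv(Sigma) @ (x - mu))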

27 Moments of the Multivariate Gaussian First moment: E[x] = µ. Writing z = x − µ, the term linear in z in the integrand vanishes thanks to the anti-symmetry of z.

28 Moments of the Multivariate Gaussian

29 Partitioned Gaussian Distributions

30 Partitioned Conditionals and Marginals

31 Partitioned Conditionals and Marginals
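For reference, the standard results are: the marginal p(x_a) = N(x_a | µ_a, Σ_aa), and the conditional p(x_a | x_b) is Gaussian with mean µ_a + Σ_ab Σ_bb^(−1)(x_b − µ_b) and covariance Σ_aa − Σ_ab Σ_bb^(−1) Σ_ba (a Schur complement). A hedged numerical sketch (all numbers invented):

    # Conditional and marginal of a partitioned Gaussian.
    import numpy as np

    mu_a, mu_b = np.array([0.0]), np.array([1.0])
    Sigma_aa = np.array([[1.0]])
    Sigma_ab = np.array([[0.6]])
    Sigma_bb = np.array([[2.0]])
    x_b = np.array([2.0])

    K = Sigma_ab @ np.linalg.inv(Sigma_bb)
    mean_cond = mu_a + K @ (x_b - mu_b)       # conditional mean of x_a given x_b
    cov_cond = Sigma_aa - K @ Sigma_ab.T      # conditional covariance (Schur complement)
    print(mean_cond, cov_cond)                # the marginal p(x_a) is simply N(mu_a, Sigma_aa)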

Maximum Likelihood for the Gaussian 32 Given i.i.d. data X = {x_1, ..., x_N}, the log likelihood function is given by ln p(X | µ, Σ) = −(ND/2) ln 2π − (N/2) ln |Σ| − (1/2) Σ_n (x_n − µ)^T Σ^(−1) (x_n − µ). Sufficient statistics: Σ_n x_n and Σ_n x_n x_n^T.

Maximum Likelihood for the Gaussian 33 Set the derivative of the log likelihood function to zero, and solve to obtain µ_ML = (1/N) Σ_n x_n. Similarly, Σ_ML = (1/N) Σ_n (x_n − µ_ML)(x_n − µ_ML)^T. Recall: if x and a are vectors, then ∇_x (x^T a) = ∇_x (a^T x) = a.

34 Maximum Likelihood for the Gaussian Under the true distribution, E[µ_ML] = µ but E[Σ_ML] = ((N − 1)/N) Σ, so Σ_ML is biased. Hence define the unbiased estimator Σ̃ = (1/(N − 1)) Σ_n (x_n − µ_ML)(x_n − µ_ML)^T.
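A minimal numpy check of these estimators (synthetic data drawn from an invented Gaussian):

    # ML estimates of a Gaussian: sample mean and (biased) sample covariance,
    # plus the bias-corrected version with the 1/(N-1) factor.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], size=500)
    N = X.shape[0]

    mu_ml = X.mean(axis=0)
    Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / N    # biased: E[Sigma_ml] = (N-1)/N Sigma
    Sigma_unbiased = np.cov(X, rowvar=False)      # uses the 1/(N-1) factor by default
    print(mu_ml, Sigma_ml, Sigma_unbiased, sep="\n")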

Bayesian Inference for the Gaussian (Univariate Case) 35 Assume σ² is known. Given i.i.d. data X = {x_1, ..., x_N}, the likelihood function for µ is given by p(X | µ) = Π_n N(x_n | µ, σ²). This has a Gaussian shape as a function of µ (but it is not a distribution over µ).

Bayesian Inference for the Gaussian (Univariate Case) 36 Combined with a Gaussian prior over µ, p(µ) = N(µ | µ_0, σ_0²), this gives the posterior p(µ | X) ∝ p(X | µ) p(µ). Completing the square over µ, we see that p(µ | X) = N(µ | µ_N, σ_N²).

Bayesian Inference for the Gaussian 37 where µ_N = (σ² µ_0 + N σ_0² µ_ML) / (N σ_0² + σ²) and 1/σ_N² = 1/σ_0² + N/σ². Note: the posterior mean interpolates between the prior mean and the ML estimate, and the posterior precision is the sum of the prior precision and N data precisions. Shortcut: get Δ² in the form aµ² − 2bµ + c = a(µ − b/a)² + const and identify µ_N = b/a, 1/σ_N² = a.

38 Bayesian Inference for the Gaussian Example: the posterior p(µ | X) for N = 0, 1, 2 and 10 observations.

39 Bayesian Inference for the Gaussian Sequential Estimation The posterior obtained after observing N − 1 data points becomes the prior when we observe the Nth data point.
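A small sketch of this sequential update for the univariate case (the prior and data values below are invented):

    # Sequential Bayesian update of a Gaussian mean with known variance sigma2:
    # after each point, the posterior N(mu_N, sigma_N^2) becomes the new prior.
    sigma2 = 1.0                      # assumed known noise variance
    mu_N, sigma2_N = 0.0, 10.0        # hypothetical broad prior on mu
    data = [0.8, 1.1, 0.9, 1.3]

    for x in data:
        precision = 1.0 / sigma2_N + 1.0 / sigma2
        mu_N = (mu_N / sigma2_N + x / sigma2) / precision
        sigma2_N = 1.0 / precision
        print(mu_N, sigma2_N)         # the posterior narrows around the sample mean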

40 Bayesian Inference for the Gaussian Now assume µ is known. The likelihood function for the precision λ = 1/σ² is given by p(X | λ) = Π_n N(x_n | µ, λ^(−1)) ∝ λ^(N/2) exp{−(λ/2) Σ_n (x_n − µ)²}. This has a Gamma shape as a function of λ.

41 Bayesian Inference for the Gaussian The Gamma distribution: Gam(λ | a, b) = (1/Γ(a)) b^a λ^(a − 1) exp(−bλ), with E[λ] = a/b and var[λ] = a/b².

42 Bayesian Inference for the Gaussian Now we combine a Gamma prior Gam(λ | a_0, b_0) with the likelihood function for λ to obtain p(λ | X) ∝ λ^(a_0 − 1 + N/2) exp{−b_0 λ − (λ/2) Σ_n (x_n − µ)²}, which we recognize as Gam(λ | a_N, b_N) with a_N = a_0 + N/2 and b_N = b_0 + (1/2) Σ_n (x_n − µ)².
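As a sketch in code (prior hyperparameters and data are invented; note that scipy parameterizes the gamma by shape a and scale 1/b, i.e. the rate b must be inverted):

    # Gamma posterior for the precision lambda with known mean mu:
    #   a_N = a_0 + N/2,   b_N = b_0 + 0.5 * sum((x_n - mu)^2)
    import numpy as np
    from scipy.stats import gamma

    mu = 0.0                          # assumed known mean
    a0, b0 = 1.0, 1.0                 # hypothetical prior Gam(lambda | a0, b0)
    x = np.array([0.5, -1.2, 0.3, 0.9, -0.4])

    a_N = a0 + len(x) / 2.0
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)
    posterior = gamma(a=a_N, scale=1.0 / b_N)   # rate b_N -> scale 1/b_N
    print(posterior.mean())           # posterior mean of the precision, a_N / b_N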

43 Bayesian Inference for the Gaussian If both µ and λ are unknown, the joint likelihood function is given by p(X | µ, λ) = Π_n (λ/2π)^(1/2) exp{−(λ/2)(x_n − µ)²}. We need a prior with the same functional dependence on µ and λ.

44 Bayesian Inference for the Gaussian The Gaussian-gamma distribution: p(µ, λ) = N(µ | µ_0, (βλ)^(−1)) Gam(λ | a, b).

Bayesian Inference for the Gaussian 45 The Gaussian-gamma distribution

Bayesian Inference for the Gaussian 46 Multivariate conjugate priors: µ unknown, Λ known: p(µ) Gaussian. Λ unknown, µ known: p(Λ) Wishart. µ and Λ unknown: p(µ, Λ) Gaussian-Wishart.

47 Student's t-distribution St(x | µ, λ, ν) = ∫_0^∞ N(x | µ, (ηλ)^(−1)) Gam(η | ν/2, ν/2) dη, where ν is the number of degrees of freedom: an infinite mixture of Gaussians with common mean but differing precisions.

48 Student's t-distribution

49 Student's t-distribution Robustness to outliers: Gaussian vs. t-distribution.
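A quick sketch of this robustness on synthetic data with one gross outlier (scipy's norm.fit and t.fit give ML estimates under the respective models; the data set is invented):

    # Robustness to outliers: ML fits under a Gaussian vs a Student's t model.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    data = np.concatenate([rng.normal(0.0, 1.0, 100), [15.0]])   # one gross outlier

    mu_gauss, sigma_gauss = stats.norm.fit(data)
    nu, mu_t, sigma_t = stats.t.fit(data)
    print(mu_gauss, mu_t)   # the Gaussian mean is dragged toward the outlier; the t location much less so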

50 Student's t-distribution The D-variate case: St(x | µ, Λ, ν) = [Γ(ν/2 + D/2) / Γ(ν/2)] (|Λ|^(1/2) / (πν)^(D/2)) [1 + Δ²/ν]^(−ν/2 − D/2), where Δ² = (x − µ)^T Λ (x − µ). Properties: E[x] = µ for ν > 1, cov[x] = (ν/(ν − 2)) Λ^(−1) for ν > 2, mode[x] = µ.

Periodic variables 51 Examples: time of day, direction, ... We require p(θ) ≥ 0, ∫_0^{2π} p(θ) dθ = 1, and p(θ + 2π) = p(θ).

52 von Mises Distribution This requirement is satisfied by p(θ | µ_0, m) = exp{m cos(θ − µ_0)} / (2π I_0(m)), where I_0(m) is the 0th-order modified Bessel function of the 1st kind. (The von Mises distribution is the intersection of an isotropic bivariate Gaussian with the unit circle.)

53 von Mises Distribution

54 Maximum Likelihood for von Mises Given a data set D = {θ_1, ..., θ_N}, the log likelihood function is given by ln p(D | µ_0, m) = −N ln(2π) − N ln I_0(m) + m Σ_n cos(θ_n − µ_0). Maximizing with respect to µ_0 we directly obtain µ_0^ML = tan^(−1)(Σ_n sin θ_n / Σ_n cos θ_n). Similarly, maximizing with respect to m we get A(m_ML) = I_1(m_ML)/I_0(m_ML) = (1/N) Σ_n cos(θ_n − µ_0^ML), which can be solved numerically for m_ML.
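A numerical sketch of these two steps (synthetic angles; the concentration equation I_1(m)/I_0(m) = r̄ is solved with a root finder, using exponentially scaled Bessel functions to avoid overflow):

    # ML for the von Mises distribution: mean direction in closed form,
    # concentration m by solving I_1(m)/I_0(m) = rbar numerically.
    import numpy as np
    from scipy.special import i0e, i1e
    from scipy.optimize import brentq

    rng = np.random.default_rng(3)
    theta = rng.vonmises(mu=1.0, kappa=4.0, size=1000)   # synthetic angular data

    S, C = np.sin(theta).sum(), np.cos(theta).sum()
    mu0_ml = np.arctan2(S, C)                            # ML mean direction
    rbar = np.hypot(S, C) / len(theta)                   # mean resultant length
    m_ml = brentq(lambda m: i1e(m) / i0e(m) - rbar, 1e-6, 1e3)
    print(mu0_ml, m_ml)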

55 Mixtures of Gaussians Old Faithful data set (duration of last eruption vs. time to next eruption, in minutes): a single Gaussian vs. a mixture of two Gaussians.

56 Mixtures of Gaussians Combine simple models into a complex model: p(x) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k), where N(x | µ_k, Σ_k) is the kth component and π_k is its mixing coefficient. Example: K = 3.

57 Mixtures of Gaussians

58 Mixtures of Gaussians Determining the parameters µ, Σ and π using maximum likelihood: the log likelihood is ln p(X | π, µ, Σ) = Σ_n ln Σ_k π_k N(x_n | µ_k, Σ_k), a log of a sum, so there is no closed-form maximum. Solution: use standard, iterative, numeric optimization methods or the expectation maximization algorithm (Chapter 9).
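A minimal sketch of fitting such a mixture with an off-the-shelf EM implementation (scikit-learn's GaussianMixture; the synthetic two-cluster data below merely stands in for the Old Faithful set):

    # Fit a mixture of two Gaussians by EM using scikit-learn.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
                   rng.normal([4.3, 80.0], [0.4, 6.0], size=(150, 2))])

    gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
    print(gmm.weights_)      # mixing coefficients pi_k
    print(gmm.means_)        # component means mu_k
    print(gmm.lower_bound_)  # per-sample lower bound on the log likelihood at convergence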