Lecture 11: Probability Distributions and Parameter Estimation

Intelligent Data Analysis and Probabilistic Inference
Lecture 11: Probability Distributions and Parameter Estimation
Recommended reading: Bishop, Chapters 1.2, 2.1-2.3.4, Appendix B
Duncan Gillies and Marc Deisenroth
Department of Computing, Imperial College London
February 10, 2016

Key Concepts in Probability Theory

Two fundamental rules:
Sum rule / marginalization property: $p(x) = \int p(x, y)\, dy$
Product rule: $p(x, y) = p(y \mid x)\, p(x)$

Bayes' Theorem (probabilistic inverse), with $x$: hypothesis, $y$: measurement:
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$
$p(x \mid y)$: posterior belief, $p(x)$: prior belief, $p(y \mid x)$: likelihood (measurement model), $p(y)$: marginal likelihood (normalization constant)
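To make the two rules and Bayes' theorem concrete, here is a minimal numerical sketch (not part of the original slides) for a discrete binary hypothesis x and measurement y; the joint-table values are made up purely for illustration.

```python
import numpy as np

# Joint distribution p(x, y) over a binary hypothesis x and measurement y
# (rows: x = 0, 1; columns: y = 0, 1). Values are illustrative only.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

# Sum rule / marginalization: p(x) = sum_y p(x, y), p(y) = sum_x p(x, y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(x, y) = p(y | x) p(x)
p_y_given_x = p_xy / p_x[:, None]

# Bayes' theorem (probabilistic inverse): p(x | y) = p(y | x) p(x) / p(y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

print("p(x) =", p_x)                      # prior belief
print("p(x | y=1) =", p_x_given_y[:, 1])  # posterior belief after observing y = 1
```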

Mean and (Co)Variance

Mean and covariance are often useful to describe properties of probability distributions (expected values and spread).

Definition:
$$\mathbb{E}_x[x] = \int x\, p(x)\, dx =: \mu$$
$$\mathbb{V}_x[x] = \mathbb{E}_x[(x - \mu)(x - \mu)^\top] = \mathbb{E}_x[x x^\top] - \mathbb{E}_x[x]\,\mathbb{E}_x[x]^\top =: \Sigma$$
$$\mathrm{Cov}[x, y] = \mathbb{E}_{x,y}[x y^\top] - \mathbb{E}_x[x]\,\mathbb{E}_y[y]^\top$$

Linear/affine transformations $y = Ax + b$, where $\mathbb{E}_x[x] = \mu$, $\mathbb{V}_x[x] = \Sigma$:
$$\mathbb{E}[y] = \mathbb{E}_x[Ax + b] = A\,\mathbb{E}_x[x] + b = A\mu + b$$
$$\mathbb{V}[y] = \mathbb{V}_x[Ax + b] = \mathbb{V}_x[Ax] = A\,\mathbb{V}_x[x]\,A^\top = A\Sigma A^\top$$

If $x, y$ are independent: $\mathbb{V}_{x,y}[x + y] = \mathbb{V}_x[x] + \mathbb{V}_y[y]$
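As an illustration of the affine-transformation rules above, a small Monte Carlo check is sketched below (assuming NumPy is available; all parameter values are made up).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mean and covariance of x
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Affine map y = A x + b (made-up A and b)
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

# Closed-form results from the slide: E[y] = A mu + b, V[y] = A Sigma A^T
mean_y = A @ mu + b
cov_y = A @ Sigma @ A.T

# Monte Carlo check with samples of x
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b
print(np.allclose(y.mean(axis=0), mean_y, atol=0.05))
print(np.allclose(np.cov(y, rowvar=False), cov_y, atol=0.1))
```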

Basic Probability Distributions

The Gaussian Distribution
$$p(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)$$
Mean vector $\mu$: average of the data
Covariance matrix $\Sigma$: spread of the data

[Figures: a 1D Gaussian density p(x) with data, mean, and 95% confidence interval, and a 2D Gaussian with data, mean, and 95% confidence bound]
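A short sketch evaluating the Gaussian density formula above and cross-checking it against scipy.stats.multivariate_normal (the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density using the formula on this slide."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)              # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])                    # illustrative parameters
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x = np.array([0.5, -0.5])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))       # should agree
```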

Conditional
Joint Gaussian:
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix},\; \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$$
Conditional:
$$p(x \mid y) = \mathcal{N}(\mu_{x \mid y}, \Sigma_{x \mid y})$$
$$\mu_{x \mid y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$$
$$\Sigma_{x \mid y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$
The conditional $p(x \mid y)$ is also Gaussian, which is computationally convenient.

[Figure: joint density p(x, y), an observation of y, and the resulting conditional p(x | y)]
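A minimal sketch of the conditioning formulas above; the joint-Gaussian parameters and the observed y are made-up values.

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
    """Return mean and covariance of p(x | y = y_obs) for a joint Gaussian,
    using the formulas on this slide."""
    S_yx = S_xy.T
    K = S_xy @ np.linalg.inv(S_yy)            # Sigma_xy Sigma_yy^{-1}
    mu_cond = mu_x + K @ (y_obs - mu_y)
    Sigma_cond = S_xx - K @ S_yx
    return mu_cond, Sigma_cond

# Illustrative 1D/1D joint (values are made up)
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx = np.array([[1.0]])
S_xy = np.array([[0.8]])
S_yy = np.array([[2.0]])

mu_c, Sigma_c = condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, np.array([2.5]))
print(mu_c, Sigma_c)   # the conditional mean shifts toward the observation; the variance shrinks
```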

Marginal
For the same joint Gaussian $p(x, y)$ as above, the marginal distribution is
$$p(x) = \int p(x, y)\, dy = \mathcal{N}(\mu_x, \Sigma_{xx})$$
The marginal of a joint Gaussian distribution is Gaussian.
Intuitively: ignore (integrate out) everything you are not interested in.

[Figure: joint density p(x, y) and the marginal p(x)]

Bernoulli Distribution
Distribution for a single binary variable $x \in \{0, 1\}$
Governed by a single continuous parameter $\mu \in [0, 1]$ that represents the probability of $x = 1$:
$$p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$$
$$\mathbb{E}[x] = \mu, \qquad \mathbb{V}[x] = \mu(1 - \mu)$$
Example: the outcome of flipping a coin.

Beta Distribution
Distribution over a continuous variable $\mu \in [0, 1]$, often used to represent the probability of some binary event (see Bernoulli distribution)
Governed by two parameters $\alpha > 0$, $\beta > 0$:
$$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \mu^{\alpha - 1}(1 - \mu)^{\beta - 1}$$
$$\mathbb{E}[\mu] = \frac{\alpha}{\alpha + \beta}, \qquad \mathbb{V}[\mu] = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$

Binomial Distribution
Generalization of the Bernoulli distribution to a distribution over integer counts
Probability of observing $m$ occurrences of $x = 1$ in a set of $N$ samples from a Bernoulli distribution, where $p(x = 1) = \mu \in [0, 1]$:
$$p(m \mid N, \mu) = \binom{N}{m}\mu^m (1 - \mu)^{N - m}$$
$$\mathbb{E}[m] = N\mu, \qquad \mathbb{V}[m] = N\mu(1 - \mu)$$
Example: What is the probability of observing $m$ heads in $N$ coin flips if the probability of heads in a single flip is $\mu$? (See the sketch below.)

[Figure: binomial probability mass function]
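A quick numerical illustration of the coin-flip example using scipy.stats.binom; N, mu, and m are made-up values.

```python
from scipy.stats import binom

N, mu = 10, 0.3   # 10 flips, probability of heads 0.3 (illustrative)
m = 4

print(binom.pmf(m, N, mu))                  # probability of exactly m = 4 heads
print(binom.mean(N, mu), binom.var(N, mu))  # N*mu and N*mu*(1 - mu), as on the slide
```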

Gamma Distribution
Distribution over positive real numbers $\tau > 0$
Governed by parameters $a > 0$ (shape) and $b > 0$ (rate, i.e., inverse scale):
$$p(\tau \mid a, b) = \frac{1}{\Gamma(a)}\, b^a \tau^{a - 1} \exp(-b\tau)$$
$$\mathbb{E}[\tau] = \frac{a}{b}, \qquad \mathbb{V}[\tau] = \frac{a}{b^2}$$

Conjugate Priors
Posterior $\propto$ prior $\times$ likelihood
Specification of the prior can be tricky
Some priors are (computationally) convenient
If the posterior is of the same family as the prior (e.g., Beta), the prior is called conjugate; conjugacy is always with respect to a particular likelihood

Examples:

Conjugate prior          Likelihood    Posterior
Beta                     Bernoulli     Beta
Gaussian-inverse-Gamma   Gaussian      Gaussian-inverse-Gamma
Dirichlet                Multinomial   Dirichlet

Example
Consider a Binomial random variable $x \sim \mathrm{Bin}(m \mid N, \mu)$ where
$$p(x \mid \mu) = \binom{N}{m}\mu^m (1 - \mu)^{N - m} \propto \mu^a (1 - \mu)^b$$
for some constants $a, b$.
We place a Beta prior on the parameter $\mu$:
$$\mathrm{Beta}(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \mu^{\alpha - 1}(1 - \mu)^{\beta - 1} \propto \mu^{\alpha - 1}(1 - \mu)^{\beta - 1}$$
If we now observe outcomes $x = (x_1, \ldots, x_N)$ of a repeated coin-flip experiment with $h$ heads and $t$ tails, we compute the posterior distribution on $\mu$:
$$p(\mu \mid x = h) \propto p(x \mid \mu)\, p(\mu \mid \alpha, \beta) = \mu^h (1 - \mu)^t\, \mu^{\alpha - 1}(1 - \mu)^{\beta - 1} = \mu^{h + \alpha - 1}(1 - \mu)^{t + \beta - 1} \propto \mathrm{Beta}(h + \alpha, t + \beta)$$
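A minimal sketch of this conjugate update with scipy.stats.beta; the prior parameters and the sequence of coin flips are made up for illustration.

```python
import numpy as np
from scipy.stats import beta

# Prior Beta(alpha, beta) on mu and observed coin flips (illustrative values)
alpha0, beta0 = 2.0, 2.0
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 1 = heads, 0 = tails
h = flips.sum()
t = len(flips) - h

# Conjugate update from this slide: posterior is Beta(h + alpha, t + beta)
alpha_post, beta_post = alpha0 + h, beta0 + t

posterior = beta(alpha_post, beta_post)
print("posterior mean of mu:", posterior.mean())           # (h + alpha) / (h + t + alpha + beta)
print("95% credible interval:", posterior.interval(0.95))
```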

Parameter Estimation

Parameter Estimation

[Figure: scatter plot of a data set]

Given a data set, we want to obtain good estimates of the parameters of the model that may have generated the data. This is the parameter estimation problem.
Example: $x_1, \ldots, x_N \in \mathbb{R}^D$ are i.i.d. samples from a Gaussian. Find the mean and covariance of $p(x)$.

Maximum Likelihood Parameter Estimation (1)
Maximum likelihood estimation finds a point estimate of the parameters that maximizes the likelihood of the parameters.
In our Gaussian example, we seek $\mu, \Sigma$:
$$\max_{\mu, \Sigma} p(x_1, \ldots, x_N \mid \mu, \Sigma) \overset{\text{i.i.d.}}{=} \max_{\mu, \Sigma} \prod_{i=1}^N p(x_i \mid \mu, \Sigma)$$
Equivalently (the log is monotonic), maximize the log-likelihood:
$$\max_{\mu, \Sigma} \sum_{i=1}^N \log p(x_i \mid \mu, \Sigma) = \max_{\mu, \Sigma}\; -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu)$$

Maximum Likelihood Parameter Estimation (2)
$$\mu_{\mathrm{ML}} = \arg\max_{\mu}\; -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu) = \frac{1}{N}\sum_{i=1}^N x_i$$
$$\Sigma_{\mathrm{ML}} = \arg\max_{\Sigma}\; -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu) = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^\top$$
The ML estimate $\Sigma_{\mathrm{ML}}$ is biased, but we can get an unbiased estimate as
$$\tilde{\Sigma} = \frac{1}{N - 1}\sum_{i=1}^N (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^\top$$
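A short NumPy sketch of these ML estimators on synthetic Gaussian data; the true mean and covariance are made-up values used only to generate samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: N i.i.d. samples from a D-dimensional Gaussian
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.0, 0.4], [0.4, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # shape (N, D)
N = X.shape[0]

# ML estimates from this slide
mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N              # biased ML estimate (factor 1/N)
Sigma_unbiased = diff.T @ diff / (N - 1)  # unbiased estimate (factor 1/(N-1))

print(mu_ml)
print(Sigma_ml)
print(Sigma_unbiased)                     # same as np.cov(X, rowvar=False)
```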

MLE: Properties
Asymptotic consistency: the MLE converges to the true value in the limit of infinitely many observations, plus a random error that is approximately normal.
The sample size necessary to achieve these properties can be quite large.
The error's variance decays as $1/N$, where $N$ is the number of data points (see the simulation sketch below).
Especially in the small-data regime, MLE can lead to overfitting.
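A small simulation sketch of the 1/N decay of the estimator's error variance, for the mean of a 1D Gaussian; the data variance is a made-up value.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0   # true variance of the data (illustrative)

# Empirical variance of the ML mean estimate over many repetitions, for increasing N
for N in [10, 100, 1000]:
    estimates = rng.normal(0.0, np.sqrt(sigma2), size=(5000, N)).mean(axis=1)
    print(N, estimates.var(), sigma2 / N)   # empirical variance vs. the 1/N prediction
```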

Example: MLE in the Small-Data Regime

[Figure: true Gaussian density with the true mean and the MLE mean estimated from a few samples]

Maximum A Posteriori Estimation
Instead of maximizing the likelihood, we can seek parameters that maximize the posterior distribution of the parameters:
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta}\; \log p(\theta) + \log p(x \mid \theta)$$
This is MLE with an additional regularizer that comes from the prior (the MAP estimator).
Example: Estimate the mean $\mu$ of a 1D Gaussian with known variance $\sigma^2$ after having observed $N$ data points $x_i$. A Gaussian prior $p(\mu) = \mathcal{N}(\mu \mid m, s^2)$ on the mean yields
$$\mu_{\mathrm{MAP}} = \frac{N s^2}{N s^2 + \sigma^2}\, \mu_{\mathrm{ML}} + \frac{\sigma^2}{N s^2 + \sigma^2}\, m$$
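A minimal sketch of this MAP estimate for the 1D Gaussian mean; the prior parameters, data variance, true mean, and sample size are made-up values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative setup: 1D Gaussian data with known variance sigma^2,
# and a Gaussian prior N(mu | m, s^2) on the unknown mean
sigma2 = 2.0         # known data variance
m, s2 = 0.0, 1.0     # prior mean and prior variance
x = rng.normal(3.0, np.sqrt(sigma2), size=5)   # only N = 5 observations

N = len(x)
mu_ml = x.mean()
# MAP estimate from this slide: weighted combination of the ML estimate and the prior mean
mu_map = (N * s2 * mu_ml + sigma2 * m) / (N * s2 + sigma2)

print(mu_ml, mu_map)  # with little data, the MAP estimate is pulled toward the prior mean m
```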

Interpreting the Result
$$\mu_{\mathrm{MAP}} = \frac{N s^2}{N s^2 + \sigma^2}\, \mu_{\mathrm{ML}} + \frac{\sigma^2}{N s^2 + \sigma^2}\, m$$
Linear interpolation between the prior mean and the sample mean (ML estimate), weighted by their respective variances
The more data we have seen ($N$ large), the more we believe the sample mean
The higher the variance $s^2$ of the prior, the more we believe the sample mean; MAP $\to$ MLE as $s^2 \to \infty$
The higher the variance $\sigma^2$ of the data, the less we believe the sample mean

Example

[Figure: true Gaussian, prior on the mean, true mean, MLE mean, and MAP mean; the MAP estimate lies between the MLE mean and the prior mean]

Bayesian Inference (Marginalization)
An even better idea than MAP estimation: instead of estimating a parameter, integrate it out (according to the posterior) when making predictions:
$$p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
where $p(\theta \mid \mathcal{D})$ is the parameter posterior.
This integral is often tricky to solve: choose appropriate distributions (e.g., conjugate distributions) or solve approximately (e.g., sampling or variational inference)
Works well (even in the small-data regime) and is robust to overfitting
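A minimal sketch of this predictive integral for the conjugate Beta-Bernoulli case, where it is tractable in closed form; the posterior parameters are made up, and the Monte Carlo average over posterior samples approximates the same integral.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(4)

# Beta-Bernoulli example (conjugate, so the integral is tractable); values are illustrative
alpha_post, beta_post = 7.0, 5.0   # parameter posterior p(mu | D) = Beta(7, 5)

# Closed form: p(x = 1 | D) = E[mu | D] = alpha / (alpha + beta)
print(alpha_post / (alpha_post + beta_post))

# Monte Carlo approximation of the same integral:
# p(x = 1 | D) = integral of p(x = 1 | mu) p(mu | D) dmu  ~  average over posterior samples
mu_samples = beta(alpha_post, beta_post).rvs(100_000, random_state=rng)
print(mu_samples.mean())
```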

Example: Linear Regression

[Figure, adapted from PRML (Bishop, 2006): Bayesian linear regression fits as more data arrive. Blue: data; black: true function (unknown); red: posterior mean (MAP estimate); red-shaded: 95% confidence area of the prediction]

References
[1] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, 2006.