Lecture 11: Probability Distributions and Parameter Estimation

Intelligent Data Analysis and Probabilistic Inference
Lecture 11: Probability Distributions and Parameter Estimation
Recommended reading: Bishop, Chapters 1.2, 2.1-2.3.4, Appendix B
Duncan Gillies and Marc Deisenroth
Department of Computing, Imperial College London
February 10, 2016

Key Concepts in Probability Theory

Two fundamental rules:
Sum rule / marginalization property: $p(x) = \int p(x, y)\, dy$
Product rule: $p(x, y) = p(y \mid x)\, p(x)$

Bayes' Theorem (probabilistic inverse), with $x$: hypothesis, $y$: measurement:
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$
$p(x \mid y)$: posterior belief, $p(x)$: prior belief, $p(y \mid x)$: likelihood (measurement model), $p(y)$: marginal likelihood (normalization constant)
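To make the two rules and Bayes' theorem concrete, here is a minimal numerical sketch (not part of the original slides) for a discrete binary hypothesis x and measurement y; the joint-table values are made up purely for illustration.

```python
import numpy as np

# Joint distribution p(x, y) over a binary hypothesis x and measurement y
# (rows: x = 0, 1; columns: y = 0, 1). Values are illustrative only.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

# Sum rule / marginalization: p(x) = sum_y p(x, y), p(y) = sum_x p(x, y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(x, y) = p(y | x) p(x)
p_y_given_x = p_xy / p_x[:, None]

# Bayes' theorem (probabilistic inverse): p(x | y) = p(y | x) p(x) / p(y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

print("p(x) =", p_x)                      # prior belief
print("p(x | y=1) =", p_x_given_y[:, 1])  # posterior belief after observing y = 1
```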

Mean and (Co)Variance

Mean and covariance are often useful to describe properties of probability distributions (expected values and spread).

Definition:
$$\mathbb{E}_x[x] = \int x\, p(x)\, dx =: \mu$$
$$\mathbb{V}_x[x] = \mathbb{E}_x[(x - \mu)(x - \mu)^\top] = \mathbb{E}_x[x x^\top] - \mathbb{E}_x[x]\,\mathbb{E}_x[x]^\top =: \Sigma$$
$$\mathrm{Cov}[x, y] = \mathbb{E}_{x,y}[x y^\top] - \mathbb{E}_x[x]\,\mathbb{E}_y[y]^\top$$

Linear/affine transformations $y = Ax + b$, where $\mathbb{E}_x[x] = \mu$, $\mathbb{V}_x[x] = \Sigma$:
$$\mathbb{E}[y] = \mathbb{E}_x[Ax + b] = A\,\mathbb{E}_x[x] + b = A\mu + b$$
$$\mathbb{V}[y] = \mathbb{V}_x[Ax + b] = \mathbb{V}_x[Ax] = A\,\mathbb{V}_x[x]\,A^\top = A\Sigma A^\top$$

If $x, y$ are independent: $\mathbb{V}_{x,y}[x + y] = \mathbb{V}_x[x] + \mathbb{V}_y[y]$
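As an illustration of the affine-transformation rules above, a small Monte Carlo check is sketched below (assuming NumPy is available; all parameter values are made up).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mean and covariance of x
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Affine map y = A x + b (made-up A and b)
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

# Closed-form results from the slide: E[y] = A mu + b, V[y] = A Sigma A^T
mean_y = A @ mu + b
cov_y = A @ Sigma @ A.T

# Monte Carlo check with samples of x
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b
print(np.allclose(y.mean(axis=0), mean_y, atol=0.05))
print(np.allclose(np.cov(y, rowvar=False), cov_y, atol=0.1))
```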

Basic Probability Distributions

The Gaussian Distribution
$$p(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right)$$
Mean vector $\mu$: average of the data
Covariance matrix $\Sigma$: spread of the data

[Figures: a 1D Gaussian density p(x) with data, mean, and 95% confidence interval, and a 2D Gaussian with data, mean, and 95% confidence bound]
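A short sketch evaluating the Gaussian density formula above and cross-checking it against scipy.stats.multivariate_normal (the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density using the formula on this slide."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)              # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])                    # illustrative parameters
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x = np.array([0.5, -0.5])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))       # should agree
```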

Conditional
Joint Gaussian:
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix},\; \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$$
Conditional:
$$p(x \mid y) = \mathcal{N}(\mu_{x \mid y}, \Sigma_{x \mid y})$$
$$\mu_{x \mid y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$$
$$\Sigma_{x \mid y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$
The conditional $p(x \mid y)$ is also Gaussian, which is computationally convenient.

[Figure: joint density p(x, y), an observation of y, and the resulting conditional p(x | y)]
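A minimal sketch of the conditioning formulas above; the joint-Gaussian parameters and the observed y are made-up values.

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
    """Return mean and covariance of p(x | y = y_obs) for a joint Gaussian,
    using the formulas on this slide."""
    S_yx = S_xy.T
    K = S_xy @ np.linalg.inv(S_yy)            # Sigma_xy Sigma_yy^{-1}
    mu_cond = mu_x + K @ (y_obs - mu_y)
    Sigma_cond = S_xx - K @ S_yx
    return mu_cond, Sigma_cond

# Illustrative 1D/1D joint (values are made up)
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx = np.array([[1.0]])
S_xy = np.array([[0.8]])
S_yy = np.array([[2.0]])

mu_c, Sigma_c = condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, np.array([2.5]))
print(mu_c, Sigma_c)   # the conditional mean shifts toward the observation; the variance shrinks
```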

Marginal
For the same joint Gaussian $p(x, y)$ as above, the marginal distribution is
$$p(x) = \int p(x, y)\, dy = \mathcal{N}(\mu_x, \Sigma_{xx})$$
The marginal of a joint Gaussian distribution is Gaussian.
Intuitively: ignore (integrate out) everything you are not interested in.

[Figure: joint density p(x, y) and the marginal p(x)]

Bernoulli Distribution
Distribution for a single binary variable $x \in \{0, 1\}$
Governed by a single continuous parameter $\mu \in [0, 1]$ that represents the probability of $x = 1$:
$$p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$$
$$\mathbb{E}[x] = \mu, \qquad \mathbb{V}[x] = \mu(1 - \mu)$$
Example: the outcome of flipping a coin.

Beta Distribution
Distribution over a continuous variable $\mu \in [0, 1]$, often used to represent the probability of some binary event (see Bernoulli distribution)
Governed by two parameters $\alpha > 0$, $\beta > 0$:
$$p(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \mu^{\alpha - 1}(1 - \mu)^{\beta - 1}$$
$$\mathbb{E}[\mu] = \frac{\alpha}{\alpha + \beta}, \qquad \mathbb{V}[\mu] = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$$

Binomial Distribution
Generalization of the Bernoulli distribution to a distribution over integer counts
Probability of observing $m$ occurrences of $x = 1$ in a set of $N$ samples from a Bernoulli distribution, where $p(x = 1) = \mu \in [0, 1]$:
$$p(m \mid N, \mu) = \binom{N}{m}\mu^m (1 - \mu)^{N - m}$$
$$\mathbb{E}[m] = N\mu, \qquad \mathbb{V}[m] = N\mu(1 - \mu)$$
Example: What is the probability of observing $m$ heads in $N$ coin flips if the probability of heads in a single flip is $\mu$? (See the sketch below.)

[Figure: binomial probability mass function]
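A quick numerical illustration of the coin-flip example using scipy.stats.binom; N, mu, and m are made-up values.

```python
from scipy.stats import binom

N, mu = 10, 0.3   # 10 flips, probability of heads 0.3 (illustrative)
m = 4

print(binom.pmf(m, N, mu))                  # probability of exactly m = 4 heads
print(binom.mean(N, mu), binom.var(N, mu))  # N*mu and N*mu*(1 - mu), as on the slide
```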

Gamma Distribution
Distribution over positive real numbers $\tau > 0$
Governed by parameters $a > 0$ (shape) and $b > 0$ (rate, i.e., inverse scale):
$$p(\tau \mid a, b) = \frac{1}{\Gamma(a)}\, b^a \tau^{a - 1} \exp(-b\tau)$$
$$\mathbb{E}[\tau] = \frac{a}{b}, \qquad \mathbb{V}[\tau] = \frac{a}{b^2}$$

Conjugate Priors
Posterior $\propto$ prior $\times$ likelihood
Specification of the prior can be tricky
Some priors are (computationally) convenient
If the posterior is of the same family as the prior (e.g., Beta), the prior is called conjugate; conjugacy is always with respect to a particular likelihood

Examples:

Conjugate prior          Likelihood    Posterior
Beta                     Bernoulli     Beta
Gaussian-inverse-Gamma   Gaussian      Gaussian-inverse-Gamma
Dirichlet                Multinomial   Dirichlet

Example
Consider a Binomial random variable $x \sim \mathrm{Bin}(m \mid N, \mu)$ where
$$p(x \mid \mu) = \binom{N}{m}\mu^m (1 - \mu)^{N - m} \propto \mu^a (1 - \mu)^b$$
for some constants $a, b$.
We place a Beta prior on the parameter $\mu$:
$$\mathrm{Beta}(\mu \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \mu^{\alpha - 1}(1 - \mu)^{\beta - 1} \propto \mu^{\alpha - 1}(1 - \mu)^{\beta - 1}$$
If we now observe outcomes $x = (x_1, \ldots, x_N)$ of a repeated coin-flip experiment with $h$ heads and $t$ tails, we compute the posterior distribution on $\mu$:
$$p(\mu \mid x = h) \propto p(x \mid \mu)\, p(\mu \mid \alpha, \beta) = \mu^h (1 - \mu)^t\, \mu^{\alpha - 1}(1 - \mu)^{\beta - 1} = \mu^{h + \alpha - 1}(1 - \mu)^{t + \beta - 1} \propto \mathrm{Beta}(h + \alpha, t + \beta)$$
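A minimal sketch of this conjugate update with scipy.stats.beta; the prior parameters and the sequence of coin flips are made up for illustration.

```python
import numpy as np
from scipy.stats import beta

# Prior Beta(alpha, beta) on mu and observed coin flips (illustrative values)
alpha0, beta0 = 2.0, 2.0
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 1 = heads, 0 = tails
h = flips.sum()
t = len(flips) - h

# Conjugate update from this slide: posterior is Beta(h + alpha, t + beta)
alpha_post, beta_post = alpha0 + h, beta0 + t

posterior = beta(alpha_post, beta_post)
print("posterior mean of mu:", posterior.mean())           # (h + alpha) / (h + t + alpha + beta)
print("95% credible interval:", posterior.interval(0.95))
```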

Parameter Estimation

Parameter Estimation

[Figure: scatter plot of a data set]

Given a data set, we want to obtain good estimates of the parameters of the model that may have generated the data. This is the parameter estimation problem.
Example: $x_1, \ldots, x_N \in \mathbb{R}^D$ are i.i.d. samples from a Gaussian. Find the mean and covariance of $p(x)$.

Maximum Likelihood Parameter Estimation (1)
Maximum likelihood estimation finds a point estimate of the parameters that maximizes the likelihood of the parameters.
In our Gaussian example, we seek $\mu, \Sigma$:
$$\max_{\mu, \Sigma} p(x_1, \ldots, x_N \mid \mu, \Sigma) \overset{\text{i.i.d.}}{=} \max_{\mu, \Sigma} \prod_{i=1}^N p(x_i \mid \mu, \Sigma)$$
Equivalently (the log is monotonic), maximize the log-likelihood:
$$\max_{\mu, \Sigma} \sum_{i=1}^N \log p(x_i \mid \mu, \Sigma) = \max_{\mu, \Sigma}\; -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu)$$

Maximum Likelihood Parameter Estimation (2)
$$\mu_{\mathrm{ML}} = \arg\max_{\mu}\; -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu) = \frac{1}{N}\sum_{i=1}^N x_i$$
$$\Sigma_{\mathrm{ML}} = \arg\max_{\Sigma}\; -\frac{ND}{2}\log(2\pi) - \frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^N (x_i - \mu)^\top \Sigma^{-1}(x_i - \mu) = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^\top$$
The ML estimate $\Sigma_{\mathrm{ML}}$ is biased, but we can get an unbiased estimate as
$$\tilde{\Sigma} = \frac{1}{N - 1}\sum_{i=1}^N (x_i - \mu_{\mathrm{ML}})(x_i - \mu_{\mathrm{ML}})^\top$$
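A short NumPy sketch of these ML estimators on synthetic Gaussian data; the true mean and covariance are made-up values used only to generate samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: N i.i.d. samples from a D-dimensional Gaussian
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.0, 0.4], [0.4, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # shape (N, D)
N = X.shape[0]

# ML estimates from this slide
mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N              # biased ML estimate (factor 1/N)
Sigma_unbiased = diff.T @ diff / (N - 1)  # unbiased estimate (factor 1/(N-1))

print(mu_ml)
print(Sigma_ml)
print(Sigma_unbiased)                     # same as np.cov(X, rowvar=False)
```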

MLE: Properties
Asymptotic consistency: the MLE converges to the true value in the limit of infinitely many observations, plus a random error that is approximately normal.
The sample size necessary to achieve these properties can be quite large.
The error's variance decays as $1/N$, where $N$ is the number of data points (see the simulation sketch below).
Especially in the small-data regime, MLE can lead to overfitting.
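A small simulation sketch of the 1/N decay of the estimator's error variance, for the mean of a 1D Gaussian; the data variance is a made-up value.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0   # true variance of the data (illustrative)

# Empirical variance of the ML mean estimate over many repetitions, for increasing N
for N in [10, 100, 1000]:
    estimates = rng.normal(0.0, np.sqrt(sigma2), size=(5000, N)).mean(axis=1)
    print(N, estimates.var(), sigma2 / N)   # empirical variance vs. the 1/N prediction
```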

Example: MLE in the Small-Data Regime

[Figure: true Gaussian density with the true mean and the MLE mean estimated from a few samples]

Maximum A Posteriori Estimation
Instead of maximizing the likelihood, we can seek parameters that maximize the posterior distribution of the parameters:
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta}\; \log p(\theta) + \log p(x \mid \theta)$$
This is MLE with an additional regularizer that comes from the prior (the MAP estimator).
Example: Estimate the mean $\mu$ of a 1D Gaussian with known variance $\sigma^2$ after having observed $N$ data points $x_i$. A Gaussian prior $p(\mu) = \mathcal{N}(\mu \mid m, s^2)$ on the mean yields
$$\mu_{\mathrm{MAP}} = \frac{N s^2}{N s^2 + \sigma^2}\, \mu_{\mathrm{ML}} + \frac{\sigma^2}{N s^2 + \sigma^2}\, m$$
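A minimal sketch of this MAP estimate for the 1D Gaussian mean; the prior parameters, data variance, true mean, and sample size are made-up values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative setup: 1D Gaussian data with known variance sigma^2,
# and a Gaussian prior N(mu | m, s^2) on the unknown mean
sigma2 = 2.0         # known data variance
m, s2 = 0.0, 1.0     # prior mean and prior variance
x = rng.normal(3.0, np.sqrt(sigma2), size=5)   # only N = 5 observations

N = len(x)
mu_ml = x.mean()
# MAP estimate from this slide: weighted combination of the ML estimate and the prior mean
mu_map = (N * s2 * mu_ml + sigma2 * m) / (N * s2 + sigma2)

print(mu_ml, mu_map)  # with little data, the MAP estimate is pulled toward the prior mean m
```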

Interpreting the Result
$$\mu_{\mathrm{MAP}} = \frac{N s^2}{N s^2 + \sigma^2}\, \mu_{\mathrm{ML}} + \frac{\sigma^2}{N s^2 + \sigma^2}\, m$$
Linear interpolation between the prior mean and the sample mean (ML estimate), weighted by their respective variances
The more data we have seen ($N$ large), the more we believe the sample mean
The higher the variance $s^2$ of the prior, the more we believe the sample mean; MAP $\to$ MLE as $s^2 \to \infty$
The higher the variance $\sigma^2$ of the data, the less we believe the sample mean

Example

[Figure: true Gaussian, prior on the mean, true mean, MLE mean, and MAP mean; the MAP estimate lies between the MLE mean and the prior mean]

Bayesian Inference (Marginalization)
An even better idea than MAP estimation: instead of estimating a parameter, integrate it out (according to the posterior) when making predictions:
$$p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$
where $p(\theta \mid \mathcal{D})$ is the parameter posterior.
This integral is often tricky to solve: choose appropriate distributions (e.g., conjugate distributions) or solve approximately (e.g., sampling or variational inference)
Works well (even in the small-data regime) and is robust to overfitting
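A minimal sketch of this predictive integral for the conjugate Beta-Bernoulli case, where it is tractable in closed form; the posterior parameters are made up, and the Monte Carlo average over posterior samples approximates the same integral.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(4)

# Beta-Bernoulli example (conjugate, so the integral is tractable); values are illustrative
alpha_post, beta_post = 7.0, 5.0   # parameter posterior p(mu | D) = Beta(7, 5)

# Closed form: p(x = 1 | D) = E[mu | D] = alpha / (alpha + beta)
print(alpha_post / (alpha_post + beta_post))

# Monte Carlo approximation of the same integral:
# p(x = 1 | D) = integral of p(x = 1 | mu) p(mu | D) dmu  ~  average over posterior samples
mu_samples = beta(alpha_post, beta_post).rvs(100_000, random_state=rng)
print(mu_samples.mean())
```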

Example: Linear Regression

[Figure, adapted from PRML (Bishop, 2006): Bayesian linear regression fits as more data arrive. Blue: data; black: true function (unknown); red: posterior mean (MAP estimate); red-shaded: 95% confidence area of the prediction]

References
[1] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, 2006.