Probability and Estimation. Alan Moses


Random variables and probability
A random variable is like a variable in algebra (e.g., y = e^x), but where at least part of the variability is taken to be stochastic. Random variables describe events (e.g., a coin toss comes up heads). In practice, all observed data can be thought of as events (e.g., the expression level of p53 is 10748.42). Probability theory is a mathematical way of understanding (and predicting) this stochastic variability.

Why did probability theory originate?

Laws of probability
Say X is a random variable and P(X = A) is the probability of event A (often written simply as P(X)). Then Σ_A P(X = A) = 1, or Σ_X P(X) = 1.
Say Y is another random variable; P(X = A, Y = 32.7) is the joint probability of events A and 32.7 (often written simply as P(X, Y)). Then Σ_X Σ_Y P(X, Y) = 1.
If X and Y are independent, the joint distribution can be factored: P(X = A, Y = 32.7) = P(X = A) P(Y = 32.7), or P(X, Y) = P(X) P(Y).

More laws of probability
More generally, if X_1 … X_N are a series of independent random variables, P(X_1 … X_N) = Π_{i=1}^{N} P(X_i).
P(X = A | Y = 32.7) is the conditional probability of event A given that event 32.7 already happened: P(X = A, Y = 32.7) = P(X = A | Y = 32.7) P(Y = 32.7), or P(X, Y) = P(X | Y) P(Y).
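
To make the sum, product, and independence rules concrete, here is a minimal sketch (not from the slides; the distributions are arbitrary) that builds a small joint distribution as a NumPy array and checks normalization, marginalization, conditioning, and independence numerically.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over 2 values of X and 3 values of Y.
# Built as an outer product of two marginals, so X and Y are independent by construction.
p_x = np.array([0.4, 0.6])          # P(X)
p_y = np.array([0.2, 0.3, 0.5])     # P(Y)
joint = np.outer(p_x, p_y)          # P(X, Y) = P(X) P(Y)

print(joint.sum())                  # sum rule: Σ_X Σ_Y P(X, Y) = 1
print(joint.sum(axis=1))            # marginal P(X), recovers p_x
print(joint.sum(axis=0))            # marginal P(Y), recovers p_y

# Conditional P(X | Y) from the product rule P(X, Y) = P(X | Y) P(Y)
p_x_given_y = joint / joint.sum(axis=0, keepdims=True)
print(p_x_given_y)                  # each column sums to 1

# Independence check: P(X, Y) == P(X) P(Y) (true here by construction)
print(np.allclose(joint, np.outer(joint.sum(axis=1), joint.sum(axis=0))))
```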

Exercises
Prove Bayes' theorem: P(X | Y) = P(Y | X) P(X) / P(Y).
Solve: Σ_X P(X | Y) = ??? and Σ_Y P(X | Y) = ???
Show that the Poisson is correctly normalized, i.e., if P(X | λ) is the Poisson pdf, Σ_{X=0}^{∞} P(X | λ) = 1.
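
As a numerical sanity check (not a proof, and not from the slides), the sketch below sums the Poisson probabilities P(X | λ) = e^{−λ} λ^X / X! over the first couple of hundred values of X and confirms the total is very close to 1; the value of λ is arbitrary.

```python
from math import exp, log, lgamma

lam = 3.7  # arbitrary rate parameter

def poisson_pmf(x, lam):
    """P(X = x | lambda) = e^{-lambda} lambda^x / x!, computed in log space for stability."""
    return exp(-lam + x * log(lam) - lgamma(x + 1))

# The infinite sum is truncated; terms far beyond lambda are negligible.
total = sum(poisson_pmf(x, lam) for x in range(200))
print(total)  # ~1.0, consistent with the normalization exercise
```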

Probabilistic models
In biology, the Truth is usually unknown and very complicated. We only consider the data we have and our ability to make a model of it. We don't think about whether our model is True or Correct, but only how well it fits the data. We accept that a better model will always be possible. Realistic (= complicated) models need more data than simple ones, so we usually try to choose a simple one.

Probabilistic models
Probability distributions are mathematical objects that include several functions:
- pdf (probability distribution function)
- cdf (cumulative distribution function)
- probability generating function
- moment generating function
Distributions can be characterized by their moments (the 1st and 2nd moments are the mean and variance).
What distribution describes my data? Continuous vs. discrete vs. ordinal distributions.

Bernoulli, binomial, multinomial
A major family of distributions for discrete events.
The Bernoulli describes binary outcomes (heads/tails) in a single trial, based on a single parameter, say f: P(X | f), with P(X = 1) = f and P(X = 0) = 1 − f.
The Binomial describes the number of positive outcomes in a series of N Bernoulli trials, say Y = Σ_{i=1}^{N} X_i. Y is another random variable, whose distribution is
P(Y | f, N) = (N choose Y) f^Y (1 − f)^(N − Y)
Let's derive the Binomial pdf.
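
A minimal sketch (illustrative, not from the slides) that evaluates the Binomial pdf with math.comb and checks it against a simulation of N Bernoulli trials; the values of f and N are arbitrary choices.

```python
import math
import random

f, N = 0.3, 10  # arbitrary Bernoulli parameter and number of trials

def binomial_pmf(y, n, f):
    """P(Y = y | f, n) = C(n, y) f^y (1 - f)^(n - y)"""
    return math.comb(n, y) * f**y * (1 - f) ** (n - y)

# Simulate: Y is the sum of N Bernoulli(f) outcomes, repeated many times.
random.seed(0)
trials = 100_000
counts = [0] * (N + 1)
for _ in range(trials):
    y = sum(1 for _ in range(N) if random.random() < f)
    counts[y] += 1

for y in range(N + 1):
    print(y, round(binomial_pmf(y, N, f), 4), round(counts[y] / trials, 4))
```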

Moments
The 1st moment is the mean or expectation: E[X] = Σ_A P(X = A) A, or E[X] = Σ_X P(X) X.
The 2nd moment is the variance: V[X] = Σ_X P(X) (X − E[X])² = E[(X − E[X])²].
What are the mean and variance of the Bernoulli?
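
The sketch below applies these definitions directly to a fair six-sided die (an arbitrary example, not from the slides) to compute its mean and variance from the pmf.

```python
# Fair die: P(X = x) = 1/6 for x in 1..6
values = range(1, 7)
p = 1 / 6

mean = sum(p * x for x in values)                 # E[X] = Σ_X P(X) X
var = sum(p * (x - mean) ** 2 for x in values)    # V[X] = Σ_X P(X) (X - E[X])^2

print(mean, var)  # 3.5 and ~2.9167
```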

Multivariate generalization
What if there are more than 2 possibilities? The Multinomial is the generalization of the Binomial to the case of more than 2 possibilities, e.g., for DNA. The sequence ACGT might be written as:
X_1 = (1, 0, 0, 0)
X_2 = (0, 1, 0, 0)
X_3 = (0, 0, 1, 0)
X_4 = (0, 0, 0, 1)
A distribution P(X | f) now needs f = (f_A, f_C, f_G, f_T). The dimensions are not independent; this can be quantified by correlation or covariance.
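
A small sketch of this one-hot representation (the sequence and variable names are illustrative, not from the slides): each base becomes a 4-dimensional indicator vector, and the frequencies f = (f_A, f_C, f_G, f_T) can be estimated from the column sums of the encoded matrix.

```python
import numpy as np

alphabet = "ACGT"
seq = "ACGTACGGTA"  # arbitrary example sequence

# One-hot encode: one row per position, one column per base.
one_hot = np.array([[1 if base == b else 0 for b in alphabet] for base in seq])
print(one_hot[:4])          # first four rows encode A, C, G, T

# Maximum-likelihood-style frequency estimates are just normalized counts.
f = one_hot.sum(axis=0) / len(seq)
print(dict(zip(alphabet, f.tolist())))  # {'A': 0.3, 'C': 0.2, 'G': 0.3, 'T': 0.2}
```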

Gaussian/Normal
The major distribution for continuous events. The Gaussian describes the probability of observing real numbers between −∞ and ∞, in terms of two parameters, µ and σ:
P(X | µ, σ) = (1 / √(2πσ²)) e^{−(X − µ)² / (2σ²)}
∫ P(X | µ, σ) dX = 1, because of the very special Gaussian integral. May be due to Laplace in 1782.
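
A quick numerical sketch (not from the slides) that evaluates this pdf on a grid and checks that it integrates to ~1 and has mean µ and variance σ²; the values of µ and σ are arbitrary.

```python
import numpy as np

mu, sigma = 2.0, 1.5  # arbitrary parameters

x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(np.trapz(pdf, x))                    # ~1.0 (normalization)
print(np.trapz(pdf * x, x))                # ~mu (1st moment)
print(np.trapz(pdf * (x - mu) ** 2, x))    # ~sigma^2 (2nd central moment)
```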

Why the Gaussian?
What's so normal about it? Why do so many measurements in the real world follow such an obscure mathematical formula? E.g., the Binomial converges to a Gaussian as N becomes large (figure: Binomial with N = 20, f = 0.5). Why are random errors usually Gaussian?
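
The sketch below (not from the slides) illustrates this convergence by comparing the Binomial pmf for N = 20, f = 0.5 with the Gaussian that has the matching mean Nf and variance Nf(1 − f).

```python
import math

N, f = 20, 0.5
mu, var = N * f, N * f * (1 - f)   # matching Gaussian mean and variance

for y in range(N + 1):
    binom = math.comb(N, y) * f**y * (1 - f) ** (N - y)
    gauss = math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    print(f"{y:2d}  binomial={binom:.4f}  gaussian={gauss:.4f}")
```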

Gaussian distribution has very special moments
The 1st moment is the mean or expectation: E[X] = ∫ P(X | µ, σ) X dX = µ.
The 2nd moment is the variance: V[X] = ∫ P(X | µ, σ) (X − µ)² dX = σ².
For this reason, the parameters of the Gaussian are named the mean and standard deviation.

Moments of continuous distributions
In general, the mean and variance are functions of the parameters of the distribution, but not necessarily simple ones. E.g., the Gumbel or Extreme Value Distribution (used for BLAST statistics) has pdf
P(X | µ, σ) = (1/σ) e^{(µ − X)/σ} e^{−e^{(µ − X)/σ}}
E[X] = µ + σγ, where γ ≈ 0.577 (Euler's constant)
V[X] = π²σ²/6
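
A simulation-based sketch (not from the slides) checking these Gumbel moment formulas against samples drawn with numpy.random; the values of µ and σ are arbitrary.

```python
import numpy as np

mu, sigma = 1.0, 2.0   # arbitrary location and scale
gamma = 0.5772156649   # Euler's constant

rng = np.random.default_rng(0)
samples = rng.gumbel(loc=mu, scale=sigma, size=1_000_000)

print(samples.mean(), mu + sigma * gamma)        # ~ mu + sigma * gamma
print(samples.var(), np.pi**2 * sigma**2 / 6)    # ~ pi^2 sigma^2 / 6
```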

Multivariate Gaussian
Now each observation is a vector, say X_1 = (1.3, 4.6). The mean is a vector, µ = (µ_1, µ_2). The variance is a matrix:
Σ = [ V_11  V_12 ]
    [ V_12  V_22 ]
The diagonal elements are the variances in each dimension. The off-diagonal elements are the covariances, which summarize the dependence between the dimensions.

Multivariate Gaussian (figure slides)
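
A minimal sketch of a 2-D Gaussian (not from the slides; the mean vector and covariance matrix are assumed for illustration): draw samples and confirm that the empirical mean and covariance recover the diagonal variances and the off-diagonal covariance.

```python
import numpy as np

mu = np.array([1.0, -2.0])                 # assumed mean vector
Sigma = np.array([[2.0, 0.8],              # V11, V12
                  [0.8, 1.0]])             # V12, V22

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=200_000)

print(X.mean(axis=0))            # ~ mu
print(np.cov(X, rowvar=False))   # ~ Sigma: diagonals are variances, off-diagonals covariances
```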

Parameter estimation
Given some data and some probabilistic model, how do I infer the parameters? The technical name for this is estimation. Several methods exist, each yielding estimators:
- least squares methods (NWLS, MMSE)
- maximum likelihood (ML)
- maximum a posteriori probability (MAP)
How are estimators evaluated? Consistency, bias, efficiency.

Likelihood and MLEs
The likelihood is the probability of the data (say X) given certain parameters (say θ): L = P(X | θ).
Maximum likelihood estimation says: choose θ so that the data is most probable, i.e., ∂L/∂θ = 0.
In practice there are many ways to maximize the likelihood.

Example of ML estimation
Data:
  X_i                      5.2           9.1           8.2           7.3           7.8
  P(X_i | µ=6.5, σ=1.5)    0.182737304   0.059227322   0.13996368    0.230761096   0.182737304
L = P(X | θ) = P(X_1 … X_N | θ) = Π_{i=1}^{5} P(X_i | µ=6.5, σ=1.5) = 6.39 × 10⁻⁵
(figure: the likelihood L plotted as a function of the mean µ)
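
The sketch below (not from the slides) reproduces these numbers: it evaluates the Gaussian pdf at each data point for µ = 6.5, σ = 1.5 and multiplies the values to get the likelihood ≈ 6.39 × 10⁻⁵.

```python
import math

data = [5.2, 9.1, 8.2, 7.3, 7.8]
mu, sigma = 6.5, 1.5

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

per_point = [gaussian_pdf(x, mu, sigma) for x in data]
likelihood = math.prod(per_point)

print(per_point)                 # matches the table above
print(likelihood)                # ~6.39e-05
print(math.log(likelihood))      # the log likelihood, ~ -9.66
```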

Example of ML estimation
In practice, we almost always use the log likelihood, which becomes a very large negative number when there is a lot of data.
(figure: log(L) plotted as a function of the mean µ)


ML Estimation
In general, the likelihood is a function of multiple variables, so the derivatives with respect to all of these should be zero at a maximum. In the example of the Gaussian, we have two parameters, so that ∂L/∂µ = 0 and ∂L/∂σ = 0.
In general, finding MLEs means solving a set of coupled equations, which usually have to be solved numerically for complex models.

MLEs for the Gaussian
µ_ML = (1/N) Σ_X X        V_ML = (1/N) Σ_X (X − µ_ML)²
The Gaussian is the symmetric continuous distribution that has as its centre a parameter given by what we consider the average.
The MLE for the variance of the Gaussian is like the squared error from the mean, but is actually a biased (but still consistent!?) estimator.
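
A short sketch (not from the slides) computing these MLEs for the five data points from the earlier example and comparing them with NumPy's built-in mean and population (1/N) variance.

```python
import numpy as np

data = np.array([5.2, 9.1, 8.2, 7.3, 7.8])
N = len(data)

mu_ml = data.sum() / N                          # (1/N) Σ X
var_ml = ((data - mu_ml) ** 2).sum() / N        # (1/N) Σ (X - mu_ML)^2, the biased estimator

print(mu_ml, var_ml)
print(data.mean(), data.var(ddof=0))            # same values; ddof=0 is the 1/N (ML) form
```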

Let's derive the MLEs for the Binomial distribution.
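
For reference, here is one way the derivation could go (a sketch added here, not the in-class derivation): set the derivative of the Binomial log likelihood with respect to f to zero.

```latex
\log L = \log P(Y \mid f, N)
       = \log \binom{N}{Y} + Y \log f + (N - Y)\log(1 - f)

\frac{\partial \log L}{\partial f} = \frac{Y}{f} - \frac{N - Y}{1 - f} = 0
\quad\Longrightarrow\quad
Y(1 - f) = (N - Y)\, f
\quad\Longrightarrow\quad
\hat{f}_{\mathrm{ML}} = \frac{Y}{N}
```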

Properties of MLEs
MLEs are asymptotically normal. The mean of the MLE is the parameter you are trying to estimate, i.e., E[θ_ML] = θ. The variance of the MLE is given by:
V[θ_ML] = ( E[ −∂² log L / ∂θ² ] evaluated at θ_ML )⁻¹
Often written as V[θ_ML] = I⁻¹, where the E[·] term is called the Fisher information.
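
For the mean of a Gaussian with known σ, the Fisher information is I = N/σ², so V[µ_ML] = σ²/N. The simulation sketch below (an added illustration, not from the slides; the parameters are arbitrary) checks this by repeatedly estimating µ_ML from fresh samples.

```python
import numpy as np

mu, sigma, N = 0.0, 2.0, 50   # true parameters and sample size (arbitrary)
rng = np.random.default_rng(0)

# Repeat the experiment many times and look at the spread of the MLE of the mean.
mu_ml = np.array([rng.normal(mu, sigma, N).mean() for _ in range(20_000)])

print(mu_ml.mean())           # ~ mu (the MLE is centred on the true parameter)
print(mu_ml.var())            # ~ sigma^2 / N = 0.08 (the inverse Fisher information)
```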

Likelihood and MLEs
In general, the likelihood function might be too complicated to calculate the MLEs analytically. MLEs can still be obtained by maximizing the likelihood using numerical methods:
- greedy: gradient ascent, Newton's method
- sampling methods: Gibbs, Metropolis
The likelihood could have multiple maxima that make optimization difficult. The more parameters the likelihood has, the more difficult the optimization problem.
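
A minimal numerical-maximization sketch (not from the slides): minimize the negative log likelihood of the Gaussian for the five example data points with scipy.optimize.minimize, and check that the result matches the closed-form MLEs.

```python
import numpy as np
from scipy.optimize import minimize

data = np.array([5.2, 9.1, 8.2, 7.3, 7.8])

def neg_log_likelihood(params):
    mu, log_sigma = params                 # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu) ** 2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(mu_hat, sigma_hat**2)                # ~ closed-form MLEs: mean and 1/N variance
print(data.mean(), data.var(ddof=0))
```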

MAP estimation
We can write the posterior probability of the model given some data, and compute it using Bayes' theorem:
posterior = P(θ | X) = P(X | θ) P(θ) / P(X) = L · P(θ) / P(X)
where P(θ) is a prior distribution and P(X) = Σ_θ P(X | θ) P(θ).
The MAP estimate says: maximize the posterior, i.e., weight the likelihood by prior beliefs. We don't usually need to integrate over all the parameters, because P(X) does not depend on θ.
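
A small grid-based sketch (not from the slides; the prior and data are arbitrary) showing MAP estimation for a Bernoulli parameter f: the posterior is proportional to likelihood × prior, and the normalizer P(X) is not needed to find the maximum.

```python
import numpy as np

# Data: 7 heads out of 10 Bernoulli trials (arbitrary example).
heads, n = 7, 10

f_grid = np.linspace(0.001, 0.999, 999)
likelihood = f_grid**heads * (1 - f_grid) ** (n - heads)

# Assumed prior favouring values of f near 0.5 (a Beta(5, 5) shape, unnormalized).
prior = f_grid**4 * (1 - f_grid) ** 4

unnormalized_posterior = likelihood * prior   # P(X) is a constant and can be ignored

print(f_grid[np.argmax(likelihood)])              # ML estimate ~ 0.7
print(f_grid[np.argmax(unnormalized_posterior)])  # MAP estimate pulled toward 0.5 (~0.611)
```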

Bayesian methods
Do we really need estimators for all parameters? (Nuisance parameters.) Instead, use the whole posterior distribution: use its moments, quantiles, etc., or integrate the function (e.g., squared error) that you care about over the posterior directly. E.g., the mean of the posterior distribution:
E[θ] = Σ_θ P(θ | X) θ
Now we do need to integrate over the prior distribution.

Conjugate priors
Special, mathematically convenient prior distributions that make the integrations possible/easier. Priors are distributions on the parameters, and often use exotic distributions rarely needed for real data. The parameters of these distributions are called hyperparameters and also have to be estimated or integrated away.
E.g., for a Bernoulli, the parameter is a continuous value between 0 and 1 (the conjugate prior is a Beta).
E.g., for a Gaussian, the prior on the mean is a Gaussian, but the prior on the standard deviation (which is always > 0) is exotic (inverse Gamma?).
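
A closing sketch (not from the slides) of the Beta–Bernoulli conjugacy: with a Beta(a, b) prior and k successes in n Bernoulli trials, the posterior is Beta(a + k, b + n − k), so the update is just parameter bookkeeping. The hyperparameters and data below are arbitrary.

```python
from scipy import stats

a, b = 2.0, 2.0        # assumed Beta prior hyperparameters
k, n = 7, 10           # observed successes and trials (same toy data as above)

# Conjugate update: Beta prior + Bernoulli/Binomial likelihood -> Beta posterior.
post_a, post_b = a + k, b + n - k
posterior = stats.beta(post_a, post_b)

print(posterior.mean())                        # posterior mean (a+k)/(a+b+n) ~ 0.643
print((post_a - 1) / (post_a + post_b - 2))    # posterior mode = MAP estimate ~ 0.667
print(posterior.interval(0.95))                # a 95% credible interval for f
```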