Estimation Theory

Estimation theory deals with finding numerical values of interesting parameters from a given set of data. We start by formulating a family of models that could describe how the data were generated. This model will usually belong to a family of models indexed by the parameters of interest, i.e. each model within this family is uniquely identified by the numerical values of its parameters. We are interested in fitting a model from such a family to the data, i.e. finding the parameter values that define the model describing the data best.

Denote the $T$-dimensional data vector as $x_T = (x(1), x(2), \ldots, x(T))^T$ and the vector of parameters as $\Theta = (\Theta_1, \Theta_2, \ldots, \Theta_m)^T$. An estimator $\hat{\Theta}$ of the parameter vector $\Theta$ is a function of the data that allows us to estimate the parameters. A numerical value of the estimator $\hat{\Theta}$ is called an estimate of $\Theta$.

Well-known examples of estimators are the sample mean and variance of the random variable $x$:
$$\hat{\mu} = \frac{1}{T}\sum_{i=1}^{T} x_i, \qquad \hat{\sigma}^2 = \frac{1}{T-1}\sum_{i=1}^{T}(x_i - \hat{\mu})^2.$$
- Classical estimation: the parameters are regarded as some unknown constants;
- Bayesian estimation: the parameters are treated as random variables with some a priori pdf $p_\Theta(\Theta)$.
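As a quick numerical illustration of these two estimators, a minimal NumPy sketch (the data and parameter values are purely illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data, illustrative only

mu_hat = x.mean()              # sample mean: (1/T) * sum of x_i
sigma2_hat = x.var(ddof=1)     # sample variance with 1/(T-1) normalization

print(mu_hat, sigma2_hat)      # close to 2.0 and 1.5**2 = 2.25
```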

Classical estimation

Method of moments: Assume that there are $T$ independent measurements $x(1), \ldots, x(T)$ from a common probability distribution $p(x \mid \Theta)$ characterized by the parameters $\Theta = (\Theta_1, \ldots, \Theta_m)^T$. We want to estimate the values of the parameters $\Theta_i$ from the data. Recall that the $j$th moment of a probability distribution is defined as
$$\alpha_j = E[x^j \mid \Theta] = \int x^j\, p(x \mid \Theta)\, dx.$$
The method proceeds as follows:
- calculate the expressions of $m$ moments $\alpha_j$ as functions of the parameters from their definitions;
- calculate the sample moments from the data: $d_j = \frac{1}{T}\sum_{i=1}^{T} [x(i)]^j$;

- form $m$ equations by equating the expressions for the moments with their estimates, $\alpha_j(\Theta) = d_j$;
- solve the resulting system of $m$ equations for the $m$ unknown parameters.

Usually the first $m$ moments are sufficient. The solutions to the equations $\alpha_j(\Theta) = d_j$ are called moment estimators.

Example: Let the data sample $x(1), \ldots, x(T)$ be drawn independently from the pdf
$$p(x \mid \Theta) = \frac{1}{\Theta_2}\exp\!\left(-\frac{x - \Theta_1}{\Theta_2}\right)$$
with $\Theta_1 < x < \infty$ and $\Theta_2 > 0$. Calculate the first two moments:
$$\alpha_1 = E[x \mid \Theta] = \int_{\Theta_1}^{\infty} \frac{x}{\Theta_2}\exp\!\left(-\frac{x - \Theta_1}{\Theta_2}\right) dx = \ldots = \Theta_1 + \Theta_2$$

$$\alpha_2 = E[x^2 \mid \Theta] = \int_{\Theta_1}^{\infty} \frac{x^2}{\Theta_2}\exp\!\left(-\frac{x - \Theta_1}{\Theta_2}\right) dx = \ldots = (\Theta_1 + \Theta_2)^2 + \Theta_2^2$$
Estimating the sample moments $d_1, d_2$ from the data, we can form the system of two equations
$$\Theta_1 + \Theta_2 = d_1, \qquad (\Theta_1 + \Theta_2)^2 + \Theta_2^2 = d_2,$$
which gives the moment estimates
$$\hat{\Theta}_1 = d_1 - (d_2 - d_1^2)^{1/2}, \qquad \hat{\Theta}_2 = (d_2 - d_1^2)^{1/2}.$$
This method can also be used with the central moments $\mu_j = E[(x - \alpha_1)^j \mid \Theta]$ and their sample estimates $s_j = \frac{1}{T-1}\sum_{i=1}^{T}(x(i) - d_1)^j$. The method of moments is simple, but it does not have statistically desirable properties, and hence is used only when no better alternatives are available.
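A minimal sketch of the moment estimators derived above, assuming NumPy and synthetic data drawn from the shifted exponential with illustrative values $\Theta_1 = 2.0$, $\Theta_2 = 0.5$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from the shifted exponential p(x | Theta) with illustrative parameters.
theta1_true, theta2_true = 2.0, 0.5
x = theta1_true + rng.exponential(scale=theta2_true, size=10_000)

# Sample moments d_j = (1/T) * sum of x(i)^j
d1 = np.mean(x)
d2 = np.mean(x**2)

# Moment estimates from the two-equation system above.
theta2_hat = np.sqrt(d2 - d1**2)
theta1_hat = d1 - theta2_hat

print(theta1_hat, theta2_hat)   # close to 2.0 and 0.5
```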

Least-squares estimation: Consider the model
$$y_T = H\Theta + v_T$$
where $\Theta$ is the vector of parameters, $v_T$ is the vector of measurement errors, $H$ is an observation matrix and $y_T$ is the observation vector. The number of observations is greater than the number of parameters. The least-squares criterion
$$e_{LS} = \tfrac{1}{2}\|v_T\|^2 = \tfrac{1}{2}(y_T - H\Theta)^T(y_T - H\Theta)$$
allows us to minimize the effect of the measurement errors. The solution minimizing $e_{LS}$ is
$$\hat{\Theta}_{LS} = H^{+} y_T, \qquad H^{+} = (H^T H)^{-1} H^T,$$
where $H^{+}$ is the pseudoinverse of $H$.
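A minimal NumPy sketch of this least-squares solution, assuming a synthetic linear model with illustrative dimensions and parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative model y = H @ theta + v with more observations (T) than parameters (m).
T, m = 100, 3
H = rng.normal(size=(T, m))                      # observation matrix
theta_true = np.array([1.0, -2.0, 0.5])          # assumed "true" parameters
y = H @ theta_true + 0.1 * rng.normal(size=T)    # noisy observations

# Least-squares solution via the pseudoinverse (H^T H)^{-1} H^T.
theta_ls = np.linalg.pinv(H) @ y
# Equivalent and numerically preferable:
theta_ls_alt, *_ = np.linalg.lstsq(H, y, rcond=None)

print(theta_ls, theta_ls_alt)
```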

Example: Many nonlinear curve-fitting problems can be turned into a linear regression by applying an appropriate transform to the data. For instance, fitting the exponential relationship
$$y(t) = e^{at + b}$$
can be achieved by applying the log transform first and then performing linear regression:
$$\log y(t) = at + b.$$

Example: More generally, least squares can also be applied if the model is linear in the parameters but nonlinear in the observations,
$$y(t) = \sum_{i=1}^{m} a_i \varphi_i(t) + v(t),$$
where the basis functions $\varphi_i$ are nonlinear in $t$.
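A small sketch of the log-transform trick from the first example, assuming synthetic data generated with illustrative values $a = -0.5$, $b = 1.0$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative exponential relationship y(t) = exp(a*t + b) with multiplicative noise.
a_true, b_true = -0.5, 1.0
t = np.linspace(0.0, 5.0, 50)
y = np.exp(a_true * t + b_true) * np.exp(0.05 * rng.normal(size=t.size))

# Log transform turns the problem into linear regression: log y = a*t + b.
H = np.column_stack([t, np.ones_like(t)])            # design matrix for (a, b)
a_hat, b_hat = np.linalg.lstsq(H, np.log(y), rcond=None)[0]

print(a_hat, b_hat)   # close to -0.5 and 1.0
```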

Maximum likelihood (ML) estimation:
- assumes that the unknown parameters are constant;
- no prior information is available;
- has good statistical properties;
- works well when there is a lot of data.

The ML solution $\hat{\Theta}$ chooses the parameters defining the model under which the data are most likely. The likelihood function
$$p(x_T \mid \Theta) = p(x(1), \ldots, x(T) \mid \Theta)$$
has the same form as the joint density of the measurements. Often the log-likelihood $\ln p(x_T \mid \Theta)$ is used.

The likelihood equation
$$\frac{\partial}{\partial \Theta} \ln p(x_T \mid \Theta) = 0$$
enables us to find the maximum of the likelihood function. Assuming that the measurements are independent, the likelihood factorizes as
$$p(x_T \mid \Theta) = \prod_{j=1}^{T} p(x_j \mid \Theta),$$
where $p(x_j \mid \Theta)$ is the conditional pdf of a measurement $x_j$. In this case the log-likelihood consists of a sum of logs of conditional pdfs. The vector likelihood equation consists of $m$ scalar equations
$$\frac{\partial}{\partial \Theta_i} \ln p(x_T \mid \Theta) = 0, \qquad i = 1, \ldots, m.$$
Usually these are nonlinear and coupled, so numerical solution methods are needed.
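When the likelihood equations are nonlinear and coupled, they are typically solved numerically. A minimal sketch, assuming SciPy and a Gamma model with illustrative parameters (not from the lecture), maximizes the log-likelihood by minimizing its negative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(4)

# Gamma data with shape k = 2.0 and scale s = 1.5; the ML equations for (k, s)
# are coupled and have no simple closed-form solution.
x = rng.gamma(shape=2.0, scale=1.5, size=5000)

def neg_log_likelihood(params):
    k, s = params
    if k <= 0 or s <= 0:                       # keep the optimizer in the valid region
        return np.inf
    return -np.sum(gamma.logpdf(x, a=k, scale=s))

# Numerical maximization of the log-likelihood.
result = minimize(neg_log_likelihood, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
print(result.x)   # approximately (2.0, 1.5)
```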

Example: Assume $T$ independent observations of a r.v. $x$ whose pdf we assume to be $N(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma^2$. We can use the maximum likelihood method to determine the parameter values that make the data most likely. The likelihood function is
$$p(x_T \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-T/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{T}(x(j) - \mu)^2\right),$$
the log-likelihood function is
$$\ln p(x_T \mid \mu, \sigma^2) = -\frac{T}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^{T}(x(j) - \mu)^2,$$
and the likelihood equations
$$\frac{\partial}{\partial \mu} \ln p(x_T \mid \mu, \sigma^2) = 0, \qquad \frac{\partial}{\partial \sigma^2} \ln p(x_T \mid \mu, \sigma^2) = 0$$

produce
$$\frac{1}{\sigma^2}\sum_{j=1}^{T}(x(j) - \mu) = 0, \qquad -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{j=1}^{T}(x(j) - \mu)^2 = 0.$$
Thus the ML estimates of the unknown parameters are
$$\hat{\mu}_{ML} = \frac{1}{T}\sum_{j=1}^{T} x(j), \qquad \hat{\sigma}^2_{ML} = \frac{1}{T}\sum_{j=1}^{T}(x(j) - \hat{\mu}_{ML})^2.$$
Hence the ML estimates of the Gaussian mean and variance amount to the sample mean and the sample variance.

Note: If, in the least-squares estimation, the measurement errors are i.i.d. random variables with pdf $N(0, \sigma^2)$, then least-squares estimation yields the maximum likelihood estimates.
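A short sketch checking these closed-form Gaussian ML estimates on synthetic data (illustrative values $\mu = 3$, $\sigma^2 = 4$):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)   # illustrative N(3, 4) data
T = x.size

# Closed-form ML estimates derived above.
mu_ml = x.sum() / T
sigma2_ml = np.sum((x - mu_ml) ** 2) / T          # note the 1/T normalization, not 1/(T-1)

print(mu_ml, sigma2_ml)   # approximately 3.0 and 4.0
```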

Bayesian estimation

Classical estimation methods assumed that the parameters were unknown deterministic constants. In Bayesian estimation:
- the parameters are random;
- the a priori parameter pdf $p_\Theta(\Theta)$ is assumed to be known;
- the posterior density $p_{\Theta \mid x}(\Theta \mid x_T)$ contains all the information about the parameters $\Theta$.

The maximum a posteriori (MAP) estimator is a popular method of parameter estimation in the Bayesian framework.

The MAP estimator:
- is similar to classical maximum likelihood;
- maximizes the posterior pdf $p_{\Theta \mid x}(\Theta \mid x_T)$ of $\Theta$ given the data $x_T$;
- is the most probable value of the parameter as evidenced by the data.

From Bayes' rule,
$$p_{\Theta \mid x}(\Theta \mid x_T) = \frac{p_{x \mid \Theta}(x_T \mid \Theta)\, p_\Theta(\Theta)}{p_x(x_T)}.$$
The denominator is the marginal density of the data, independent of $\Theta$; hence finding the MAP estimator amounts to maximizing the numerator, the joint pdf of parameters and data, $p_{\Theta,x}(\Theta, x_T)$.

The MAP estimator can be found from the log-likelihood equation
$$\frac{\partial}{\partial \Theta} \ln p(\Theta, x_T) = \frac{\partial}{\partial \Theta} \ln p(x_T \mid \Theta) + \frac{\partial}{\partial \Theta} \ln p(\Theta) = 0.$$
In the above, the first term in the sum is the same as in maximum likelihood; the second term carries the prior information about the parameters.

Note: If the prior pdf of $\Theta$ is uniform over the parameter values for which $p(x_T \mid \Theta) > 0$ (i.e. the second term above equals zero), then the MAP and ML estimators are equal. Thus MAP and ML coincide when no prior information about the parameters is available.

Example: Assume that $x(1), \ldots, x(T)$ are independent observations drawn from

$N(\mu_x, \sigma_x^2)$. The mean $\mu_x$ is itself a random variable with pdf $N(0, \sigma_\mu^2)$. Assuming that $\sigma_x^2$ and $\sigma_\mu^2$ are known, find the estimator $\hat{\mu}_{MAP}$. Using the log-likelihood equation yields
$$\hat{\mu}_{MAP} = \frac{\sigma_\mu^2}{\sigma_x^2 + T\sigma_\mu^2}\sum_{j=1}^{T} x(j).$$
If we have no a priori knowledge about $\mu$ (i.e. $\sigma_\mu^2 \to \infty$):
$$\hat{\mu}_{MAP} \to \frac{1}{T}\sum_{j=1}^{T} x(j).$$
Similarly, as $T \to \infty$,
$$\hat{\mu}_{MAP} \to \frac{1}{T}\sum_{j=1}^{T} x(j),$$
i.e. as the number of samples increases, the prior information about the parameters becomes less important; the MAP estimator combines the prior information and the samples according to their reliability (i.e. $\sigma_\mu^2$, $\sigma_x^2$).
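A minimal sketch of this Gaussian-mean MAP estimator, assuming NumPy and illustrative values of $\sigma_x^2$, $\sigma_\mu^2$ and $T$; it also compares the MAP estimate with the ML estimate (the sample mean):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative setup: mu_x drawn from the prior N(0, sigma_mu^2), then x(j) ~ N(mu_x, sigma_x^2).
sigma_x2, sigma_mu2 = 4.0, 1.0
mu_x = rng.normal(loc=0.0, scale=np.sqrt(sigma_mu2))
T = 20
x = rng.normal(loc=mu_x, scale=np.sqrt(sigma_x2), size=T)

# MAP estimate from the formula above: sigma_mu^2 / (sigma_x^2 + T*sigma_mu^2) * sum x(j)
mu_map = sigma_mu2 / (sigma_x2 + T * sigma_mu2) * x.sum()
mu_ml = x.mean()   # ML estimate, which ignores the prior

print(mu_x, mu_map, mu_ml)   # the MAP estimate is shrunk toward the prior mean 0
```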

Maximum Entropy Principle

Information about the pdf $p_x$ of a scalar r.v. $x$, e.g. estimates of the expectations of a number of functions of $x$,
$$E[F_i(x)] = c_i, \qquad i = 1, \ldots, m,$$
may not be enough for a unique (principled) choice of pdf: a finite number of data points may not tell us exactly what the underlying probability distribution is. The maximum entropy principle is a (regularization) criterion stating that, out of all probability distribution functions compatible with the constraints above, one should choose the one that has the least structure.

Least structure = maximum entropy (where entropy is extended from random variables to pdfs). Entropy measures randomness, so maximizing entropy means choosing the pdf that is compatible with the constraints while making the fewest assumptions about the data. Under some conditions, a maximum entropy pdf satisfying the constraints above is
$$p_0(\xi) = A \exp\!\left(\sum_i a_i F_i(\xi)\right),$$
where the constants $A$ and $a_i$ are determined from the constraints.

The Gaussian pdf has the largest entropy among all pdfs with zero mean and unit variance,
$$p_0(\xi) = A \exp(a_1 \xi^2 + a_2 \xi).$$
In many dimensions, the Gaussian pdf has maximum entropy among all pdfs with the same covariance matrix. Thus the Gaussian is the least structured of all distributions, and entropy can be used as a measure of non-Gaussianity.
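As a small numerical illustration of this claim, the sketch below compares the standard closed-form differential entropies (in nats) of three zero-mean, unit-variance densities; the Gaussian comes out largest. The choice of the Laplace and uniform densities as comparisons is an assumption for illustration only.

```python
import numpy as np

# Closed-form differential entropies of three zero-mean, unit-variance densities.
h_gaussian = 0.5 * np.log(2 * np.pi * np.e)   # N(0, 1): about 1.419 nats
b = 1 / np.sqrt(2)                             # Laplace scale giving variance 2*b^2 = 1
h_laplace = 1 + np.log(2 * b)                  # about 1.347 nats
w = np.sqrt(12.0)                              # uniform width giving variance w^2 / 12 = 1
h_uniform = np.log(w)                          # about 1.242 nats

print(h_gaussian, h_laplace, h_uniform)   # the Gaussian entropy is the largest
```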