CHAPTER 3
Probability, Statistics, and Information Theory

Randomness and uncertainty play an important role in science and engineering. Most spoken language processing problems can be characterized in a probabilistic framework. Probability theory and statistics provide the mathematical language to describe and analyze such systems.

The criteria and methods used to estimate unknown probabilities and probability densities form the basis for estimation theory, which in turn forms the basis for parameter learning in pattern recognition. In this chapter, three widely used estimation methods are discussed: minimum mean squared error estimation (MMSE), maximum likelihood estimation (MLE), and maximum a posteriori probability estimation (MAP).

Significance testing is also important in statistics. It deals with the confidence of statistical inference, such as knowing whether the estimate of some parameter can be accepted with confidence. In pattern recognition, significance testing is extremely important for determining whether the observed difference between two different classifiers is real. In our coverage of significance testing, we describe various methods that are used in the pattern recognition discussed in Chapter 4.

Information theory was originally developed for efficient and reliable communication systems. It has evolved into a mathematical theory concerned with the very essence of the communication process. It provides a framework for the study of fundamental issues, such as the efficiency of information representation and the limitations of reliable transmission of information over a communication channel. Many of these problems are fundamental to spoken language processing.

3.1. PROBABILITY THEORY

Probability theory deals with the averages of mass phenomena occurring sequentially or simultaneously. We often use probabilistic expressions in our day-to-day lives, such as when saying, "It is very likely that the Dow (Dow Jones Industrial Index) will hit ... points next month," or "The chance of scattered showers in Seattle this weekend is high." Each of these expressions is based upon the concept of the probability, or the likelihood, with which some specific event will occur. Probability can be used to represent the degree of confidence in the outcome of some actions (observations) that are not definite.

In probability theory, the term sample space, S, is used to refer to the collection (set) of all possible outcomes. An event refers to a subset of the sample space, or a collection of outcomes. The probability of event A, denoted P(A), can be interpreted as the relative frequency with which the event A would occur if the process were repeated a large number of times under similar conditions. Based on this interpretation, P(A) can be computed simply by counting the total number, $N_S$, of all observations and the number of observations $N_A$ whose outcome belongs to the event A. That is,

$$P(A) = \frac{N_A}{N_S} \quad (3.1)$$

P(A) is bounded between zero and one, i.e.,

$$0 \le P(A) \le 1 \quad \text{for all } A \quad (3.2)$$

The lower bound of probability P(A) is zero when the event set A is an empty set. On the other hand, the upper bound of probability P(A) is one when the event set A happens to be S. If there are n events $A_1, A_2, \ldots, A_n$ in S such that $A_1, A_2, \ldots, A_n$ are disjoint and $\bigcup_{i=1}^{n} A_i = S$, the events $A_1, A_2, \ldots, A_n$ are said to form a partition of S. The following obvious equation forms a fundamental axiom of probability theory:

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i) = 1 \quad (3.3)$$
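The relative-frequency interpretation in Eq. (3.1) translates directly into a counting simulation. The following sketch is an illustration only, not part of the original text; the function name `estimate_probability` and the die-rolling experiment are invented for the example.

```python
import random

def estimate_probability(event, experiment, n, seed=0):
    """Estimate P(A) as N_A / N_S by repeating an experiment n times
    and counting the outcomes that belong to event A (Eq. 3.1)."""
    rng = random.Random(seed)
    n_a = sum(1 for _ in range(n) if event(experiment(rng)))
    return n_a / n

# Experiment: roll a fair six-sided die; event A = "outcome is even".
p_even = estimate_probability(lambda x: x % 2 == 0,
                              lambda rng: rng.randint(1, 6),
                              100_000)
```

With a large number of trials, `p_even` settles near the true value 1/2, in line with the relative-frequency interpretation.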

Based on the definition in Eq. (3.1), the joint probability of event A and event B occurring concurrently is denoted P(AB) and can be calculated as:

$$P(AB) = \frac{N_{AB}}{N_S} \quad (3.4)$$

3.1.1. Conditional Probability and Bayes' Rule

It is useful to study the way in which the probability of an event A changes after it has been learned that some other event B has occurred. This new probability, denoted P(A|B), is called the conditional probability of event A given that event B has occurred. Since the set of those outcomes in B that also result in the occurrence of A is exactly the set AB, as illustrated in Figure 3.1, it is natural to define the conditional probability as the proportion of the total probability P(B) that is represented by the joint probability P(AB). This leads to the following definition:

$$P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{N_{AB}/N_S}{N_B/N_S} \quad (3.5)$$

Figure 3.1 The intersection AB represents where the joint event A and B occurs concurrently.

Based on the definition of conditional probability, the following expressions can easily be derived:

$$P(AB) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \quad (3.6)$$

Equation (3.6) is the simple version of the chain rule. The chain rule, which can specify a joint probability in terms of a product of several cascaded conditional probabilities, is often used to decompose a complicated joint probabilistic problem into a sequence of stepwise conditional probabilistic problems. Eq. (3.6) can be generalized to the following chain:

$$P(A_1 A_2 \cdots A_n) = P(A_n \mid A_1 \cdots A_{n-1}) \cdots P(A_2 \mid A_1)\,P(A_1) \quad (3.7)$$

When two events, A and B, are independent of each other, in the sense that the occurrence or nonoccurrence of either of them has no relation to, and no influence on, the occurrence of the other, it is obvious that the conditional probability P(B|A) equals the unconditional probability P(B). It follows that the joint probability P(AB) is simply the product of P(A) and P(B) if A and B are independent.

If the events $A_1, A_2, \ldots, A_n$ form a partition of S and B is any event in S, as illustrated in Figure 3.2, the events $A_1B, A_2B, \ldots, A_nB$ form a partition of B. Thus, we can rewrite:

$$B = A_1B \cup A_2B \cup \cdots \cup A_nB \quad (3.8)$$

Since $A_1B, A_2B, \ldots, A_nB$ are disjoint,

$$P(B) = \sum_{k=1}^{n} P(A_k B) \quad (3.9)$$

Figure 3.2 The intersections of B with partition events $A_1, A_2, \ldots, A_n$.

Equation (3.9) is called the marginal probability of event B, where the probability of event B is computed from the sum of joint probabilities. According to the chain rule, Eq. (3.6), $P(A_kB) = P(A_k)P(B \mid A_k)$, it follows that

$$P(B) = \sum_{k=1}^{n} P(A_k)\,P(B \mid A_k) \quad (3.10)$$

Combining Eqs. (3.5) and (3.10), we get the well-known Bayes' rule:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B)} = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{k=1}^{n} P(B \mid A_k)\,P(A_k)} \quad (3.11)$$

Bayes' rule is the basis for the pattern recognition that is described in Chapter 4.
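As a small worked example of Eqs. (3.10) and (3.11), the following sketch computes the marginal P(B) and the posteriors P(A_i|B) for a three-event partition. The priors and likelihoods are made up purely for illustration.

```python
# Partition A1, A2, A3 of S with priors P(A_i), plus likelihoods P(B | A_i).
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.8]

# Marginal probability of B via Eq. (3.10).
p_b = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior P(A_i | B) via Bayes' rule, Eq. (3.11).
posteriors = [p * l / p_b for p, l in zip(priors, likelihoods)]
```

Note how the event A_3, although least probable a priori, becomes the most probable a posteriori because B is most likely under it; this is exactly the reweighting that Bayes' rule performs in classification.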

3.1.2. Random Variables

Elements in a sample space may be numbered and referred to by the numbers given. A variable X that specifies the numerical quantity in a sample space is called a random variable. Therefore, a random variable X is a function that maps each possible outcome s in the sample space S onto a real number X(s). Since each event is a subset of the sample space, an event is represented as a set $\{s\}$ which satisfies $\{s : X(s) = x\}$. We use capital letters to denote random variables and lower-case letters to denote fixed values of the random variable. Thus, the probability that X = x is denoted as:

$$P(X = x) = P\left(s : X(s) = x\right) \quad (3.12)$$

A random variable X is a discrete random variable, or X has a discrete distribution, if X can take only a finite number of different values $x_1, \ldots, x_n$, or at most an infinite sequence of different values $x_1, x_2, \ldots$. If the random variable X is a discrete random variable, the probability function (p.f.), or probability mass function (p.m.f.), of X is defined to be the function p such that for any real number x,

$$p_X(x) = P(X = x) \quad (3.13)$$

For the cases in which there is no confusion, we drop the subscript X in $p_X(x)$. The sum of the probability mass over all values of the random variable is equal to unity:

$$\sum_{k=1}^{\infty} p(x_k) = \sum_{k=1}^{\infty} P(X = x_k) = 1 \quad (3.14)$$

The marginal probability, chain rule, and Bayes' rule can also be rewritten with respect to random variables:

$$p_X(x) = P(X = x) = \sum_{k=1}^{m} P(X = x, Y = y_k) = \sum_{k=1}^{m} P(X = x \mid Y = y_k)\,P(Y = y_k) \quad (3.15)$$

$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_n = x_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}) \cdots P(X_2 = x_2 \mid X_1 = x_1)\,P(X_1 = x_1) \quad (3.16)$$

$$P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)\,P(X = x)}{P(Y = y)} = \frac{P(Y = y \mid X = x)\,P(X = x)}{\sum_k P(Y = y \mid X = x_k)\,P(X = x_k)} \quad (3.17)$$

In a similar manner, if the random variables X and Y are statistically independent, they can be represented as:

$$P(X = x_i, Y = y_j) = P(X = x_i)\,P(Y = y_j) = p_X(x_i)\,p_Y(y_j) \quad \text{for all } i \text{ and } j \quad (3.18)$$

A random variable X is a continuous random variable, or X has a continuous distribution, if there exists a nonnegative function f, defined on the real line, such that for any interval A,

$$P(X \in A) = \int_A f_X(x)\,dx \quad (3.19)$$

The function $f_X$ is called the probability density function (abbreviated p.d.f.) of X. We drop the subscript X in $f_X$ if there is no ambiguity. As illustrated in Figure 3.3, the area of the shaded region is equal to the value of $P(a \le X \le b)$.

Figure 3.3 An example of a p.d.f. The area of the shaded region is equal to the value of $P(a \le X \le b)$.

Every p.d.f. must satisfy the following two requirements:

$$f(x) \ge 0 \; \text{ for } -\infty < x < \infty \quad \text{and} \quad \int_{-\infty}^{\infty} f(x)\,dx = 1 \quad (3.20)$$

The marginal probability, chain rule, and Bayes' rule can also be rewritten with respect to continuous random variables:

$$f_X(x) = \int f_{X,Y}(x, y)\,dy = \int f_{X \mid Y}(x \mid y)\,f_Y(y)\,dy \quad (3.21)$$

$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = f_{X_n \mid X_1,\ldots,X_{n-1}}(x_n \mid x_1,\ldots,x_{n-1}) \cdots f_{X_2 \mid X_1}(x_2 \mid x_1)\,f_{X_1}(x_1) \quad (3.22)$$

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{f_{Y \mid X}(y \mid x)\,f_X(x)}{\int f_{Y \mid X}(y \mid x)\,f_X(x)\,dx} \quad (3.23)$$

The distribution function, or cumulative distribution function, F of a discrete or continuous random variable X is a function defined for every real number x as follows:

$$F(x) = P(X \le x) \quad \text{for } -\infty < x < \infty \quad (3.24)$$

For continuous random variables, it follows that:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt \quad (3.25)$$

$$f(x) = \frac{dF(x)}{dx} \quad (3.26)$$

3.1.3. Mean and Variance

Suppose that a discrete random variable X has a p.f. f(x); the expectation or mean of X is defined as follows:

$$E(X) = \sum_x x\,f(x) \quad (3.27)$$

Similarly, if a continuous random variable X has a p.d.f. f, the expectation or mean of X is defined as follows:

$$E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx \quad (3.28)$$

In physics, the mean is regarded as the center of mass of the probability distribution. The expectation can also be defined for any function of the random variable X. If X is a continuous random variable with p.d.f. f, then the expectation of any function g(X) can be defined as follows:

$$E\left[g(X)\right] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx \quad (3.29)$$

The expectation of a random variable is a linear operator. That is, it satisfies both the additivity and homogeneity properties:

$$E(a_1X_1 + \cdots + a_nX_n + b) = a_1E(X_1) + \cdots + a_nE(X_n) + b \quad (3.30)$$

where $a_1, \ldots, a_n, b$ are constants. Equation (3.30) is valid regardless of whether or not the random variables $X_1, \ldots, X_n$ are independent.

Suppose that X is a random variable with mean $\mu = E(X)$. The variance of X, denoted Var(X), is defined as follows:

$$\mathrm{Var}(X) = \sigma^2 = E\left[(X - \mu)^2\right] \quad (3.31)$$

where $\sigma$, the nonnegative square root of the variance, is known as the standard deviation of the random variable X. Therefore, the variance is also often denoted $\sigma^2$.

The variance of a distribution provides a measure of the spread or dispersion of the distribution around its mean $\mu$. A small value of the variance indicates that the probability distribution is tightly concentrated around $\mu$, and a large value of the variance typically indicates the probability distribution has a wide spread around $\mu$. Figure 3.4 illustrates three Gaussian distributions with the same mean but different variances. (We describe Gaussian distributions in Section 3.1.7.)

The variance of a random variable X can be computed in the following way:

$$\mathrm{Var}(X) = E(X^2) - \left[E(X)\right]^2 \quad (3.32)$$

In physics, the expectation $E(X^k)$ is called the kth moment of X for any random variable X and any positive integer k. Therefore, the variance is simply the difference between the second moment and the square of the first moment.

The variance satisfies the following additivity property if the random variables X and Y are independent:

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \quad (3.33)$$

However, it does not satisfy the homogeneity property. Instead, for a constant a,

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X) \quad (3.34)$$

Since it is clear that Var(b) = 0 for any constant b, we have an equation similar to Eq. (3.30) if the random variables $X_1, \ldots, X_n$ are independent:

$$\mathrm{Var}(a_1X_1 + \cdots + a_nX_n + b) = a_1^2\,\mathrm{Var}(X_1) + \cdots + a_n^2\,\mathrm{Var}(X_n) \quad (3.35)$$

Conditional expectation can also be defined in a similar way. Suppose that X and Y are discrete random variables and let $f(y \mid x)$ denote the conditional p.f. of Y given X = x; then the conditional expectation E(Y|X) is defined to be the function of X whose value when X = x is

$$E_{Y \mid X}(Y \mid X = x) = \sum_y y\,f_{Y \mid X}(y \mid x) \quad (3.36)$$

For continuous random variables X and Y with $f_{Y \mid X}(y \mid x)$ as the conditional p.d.f. of Y given X = x, the conditional expectation E(Y|X) is defined to be the function of X whose value when X = x is

$$E_{Y \mid X}(Y \mid X = x) = \int y\,f_{Y \mid X}(y \mid x)\,dy \quad (3.37)$$

Figure 3.4 Three Gaussian distributions with the same mean $\mu$ but different variances. The distribution with a large value of the variance has a wide spread around the mean $\mu$.

Since E(Y|X) is a function of the random variable X, it is itself a random variable whose probability distribution can be derived from the distribution of X. It can be shown that

$$E_X\left[E_{Y \mid X}(Y \mid X)\right] = E_{X,Y}(Y) \quad (3.38)$$

More generally, suppose that X and Y have a continuous joint distribution and that g(X, Y) is any arbitrary function of X and Y. The conditional expectation $E\left[g(X,Y) \mid X\right]$ is defined to be the function of X whose value when X = x is

$$E\left[g(X,Y) \mid X = x\right] = \int g(x, y)\,f_{Y \mid X}(y \mid x)\,dy \quad (3.39)$$

Equation (3.38) can also be generalized to the following equation:

$$E_X\left\{E_{Y \mid X}\left[g(X,Y) \mid X\right]\right\} = E_{X,Y}\left[g(X,Y)\right] \quad (3.40)$$

Finally, it is worthwhile to introduce the median and the mode. A median of the distribution of X is defined to be a point m such that $P(X \le m) \ge 1/2$ and $P(X \ge m) \ge 1/2$. Thus, the median m divides the total probability into two equal parts, i.e., the probability to the left of m and the probability to the right of m are exactly 1/2.
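The moment identity of Eq. (3.32) is easy to verify numerically. The sketch below, an illustration using an arbitrary four-point discrete distribution, computes the variance both from the definition in Eq. (3.31) and from the second moment:

```python
# A small discrete distribution: values and their probabilities (sum to 1).
xs = [0, 1, 2, 3]
ps = [0.1, 0.2, 0.3, 0.4]

mean = sum(x * p for x, p in zip(xs, ps))                    # Eq. (3.27)
var_def = sum((x - mean) ** 2 * p for x, p in zip(xs, ps))   # Eq. (3.31)
second_moment = sum(x * x * p for x, p in zip(xs, ps))
var_moment = second_moment - mean ** 2                       # Eq. (3.32)
```

Both routes give the same variance, as the identity requires.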

Suppose a random variable X has either a discrete distribution with p.f. p(x) or a continuous p.d.f. f(x); a point $\varpi$ is called a mode of the distribution if p(x) or f(x) attains its maximum value at the point $\varpi$. A distribution can have more than one mode.

3.1.3.1. The Law of Large Numbers

The concepts of sample mean and sample variance are important in statistics because most statistical experiments involve sampling. Suppose that the random variables $X_1, \ldots, X_n$ form a random sample of size n from some distribution for which the mean is $\mu$ and the variance is $\sigma^2$. In other words, the random variables $X_1, \ldots, X_n$ are independent and identically distributed (often abbreviated i.i.d.), and each has mean $\mu$ and variance $\sigma^2$. If we denote $\bar{X}_n$ as the arithmetic average of the n observations in the sample, then

$$\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n) \quad (3.41)$$

$\bar{X}_n$ is a random variable and is referred to as the sample mean. The mean and variance of $\bar{X}_n$ can easily be derived from the definition:

$$E(\bar{X}_n) = \mu \quad \text{and} \quad \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n} \quad (3.42)$$

Equation (3.42) states that the mean of the sample mean is equal to the mean of the distribution, while the variance of the sample mean is only 1/n times the variance of the distribution. In other words, the distribution of $\bar{X}_n$ is more concentrated around the mean $\mu$ than was the original distribution. Thus, the sample mean is closer to $\mu$ than is the value of just a single observation $X_i$ from the given distribution.

The law of large numbers is one of the most important theorems in probability theory. Formally, it states that the sample mean $\bar{X}_n$ converges to the mean $\mu$ in probability, that is,

$$\lim_{n \to \infty} P\left(\left|\bar{X}_n - \mu\right| < \varepsilon\right) = 1 \quad \text{for any given } \varepsilon > 0 \quad (3.43)$$

The law of large numbers basically implies that the sample mean is an excellent estimate of the unknown mean of the distribution when the sample size n is large.
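The convergence in Eq. (3.43) can be observed empirically. The following sketch is illustrative only: it averages uniform draws on (0, 1), whose true mean is 0.5, for increasing sample sizes.

```python
import random

def sample_mean(n, seed=1):
    """Average of n uniform(0, 1) draws; the distribution mean is 0.5."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n

# Absolute error of the sample mean for growing n.
errors = [abs(sample_mean(n) - 0.5) for n in (10, 1000, 100_000)]
```

By Eq. (3.42) the standard deviation of the sample mean shrinks like $1/\sqrt{n}$, so the error at n = 100,000 is typically on the order of a thousandth.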

3.1.4. Covariance and Correlation

Let X and Y be random variables having a specific joint distribution, and let $E(X) = \mu_X$, $E(Y) = \mu_Y$, $\mathrm{Var}(X) = \sigma_X^2$, and $\mathrm{Var}(Y) = \sigma_Y^2$. The covariance of X and Y, denoted Cov(X, Y), is defined as follows:

$$\mathrm{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right] = \mathrm{Cov}(Y, X) \quad (3.44)$$

In addition, the correlation coefficient of X and Y, denoted $\rho_{XY}$, is defined as follows:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y} \quad (3.45)$$

It can be shown that $\rho_{XY}$ is bounded within $[-1, 1]$, that is,

$$-1 \le \rho_{XY} \le 1 \quad (3.46)$$

X and Y are said to be positively correlated if $\rho_{XY} > 0$, negatively correlated if $\rho_{XY} < 0$, and uncorrelated if $\rho_{XY} = 0$. It can also be shown that Cov(X, Y) and $\rho_{XY}$ must have the same sign; that is, both are positive, negative, or zero at the same time. When E(XY) = 0, the two random variables are called orthogonal.

There are several theorems pertaining to the basic properties of covariance and correlation. We list here the most important ones:

Theorem 1 For any random variables X and Y,

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)\,E(Y) \quad (3.47)$$

Theorem 2 If X and Y are independent random variables, then $\mathrm{Cov}(X, Y) = \rho_{XY} = 0$.

Theorem 3 Suppose X is a random variable and Y is a linear function of X of the form Y = aX + b for some constants a and b, where $a \ne 0$. If a > 0, then $\rho_{XY} = 1$. If a < 0, then $\rho_{XY} = -1$. Sometimes, $\rho_{XY}$ is referred to as the amount of linear dependency between the random variables X and Y.

Theorem 4 For any random variables X and Y,

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) \quad (3.48)$$

Theorem 5 If $X_1, \ldots, X_n$ are random variables, then

$$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2\sum_{i=1}^{n}\sum_{j > i} \mathrm{Cov}(X_i, X_j) \quad (3.49)$$

3.1.5. Random Vectors and Multivariate Distributions

When a random variable is a vector rather than a scalar, it is called a random vector, and we often use a boldface variable like $\mathbf{X} = (X_1, \ldots, X_n)$ to indicate that it is a random vector. It is said that the random variables $X_1, \ldots, X_n$ have a discrete joint distribution if the random vector $\mathbf{X} = (X_1, \ldots, X_n)$ can have only a finite number or an infinite sequence of different values $(x_1, \ldots, x_n)$ in $R^n$. The joint p.f. of $X_1, \ldots, X_n$ is defined to be the function $f_{\mathbf{X}}$ such that for any point $(x_1, \ldots, x_n) \in R^n$,

$$f_{\mathbf{X}}(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n) \quad (3.50)$$

Similarly, it is said that the random variables $X_1, \ldots, X_n$ have a continuous joint distribution if there is a nonnegative function f defined on $R^n$ such that for any subset $A \subset R^n$,

$$P\left[(X_1, \ldots, X_n) \in A\right] = \int_A f_{\mathbf{X}}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n \quad (3.51)$$

The joint distribution function can also be defined similarly for the random variables $X_1, \ldots, X_n$ as follows:

$$F_{\mathbf{X}}(x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n) \quad (3.52)$$

The concepts of mean and variance for a random vector can be generalized to the mean vector and covariance matrix. Suppose that $\mathbf{X}$ is an n-dimensional random vector with components $X_1, \ldots, X_n$; under matrix representation, we have

$$\mathbf{X} = \left(X_1, \ldots, X_n\right)^t \quad (3.53)$$

The expectation (mean) vector $E(\mathbf{X})$ of the random vector $\mathbf{X}$ is an n-dimensional vector whose components are the expectations of the individual components of $\mathbf{X}$, that is,

$$E(\mathbf{X}) = \left(E(X_1), \ldots, E(X_n)\right)^t \quad (3.54)$$
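The covariance and correlation definitions of the previous section (Eqs. 3.44 and 3.45), and in particular Theorem 3, can be checked on a small dataset. A minimal sketch with hypothetical helper names, using exactly linearly related data:

```python
def covariance(xs, ys):
    """Sample analogue of Eq. (3.44): average product of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Correlation coefficient of Eq. (3.45)."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

xs = [1.0, 2.0, 3.0, 4.0]
rho_pos = correlation(xs, [2.0 * x + 1.0 for x in xs])   # Y = aX + b, a > 0
rho_neg = correlation(xs, [5.0 - x for x in xs])         # Y = aX + b, a < 0
```

As Theorem 3 predicts, the exact linear relation gives $\rho_{XY} = 1$ for a positive slope and $\rho_{XY} = -1$ for a negative one.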

The covariance matrix $\mathrm{Cov}(\mathbf{X})$ of the random vector $\mathbf{X}$ is defined to be the $n \times n$ matrix such that the element in the ith row and jth column is $\mathrm{Cov}(X_i, X_j)$, that is,

$$\mathrm{Cov}(\mathbf{X}) = E\left\{\left[\mathbf{X} - E(\mathbf{X})\right]\left[\mathbf{X} - E(\mathbf{X})\right]^t\right\} = \begin{pmatrix} \mathrm{Cov}(X_1, X_1) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \cdots & \mathrm{Cov}(X_n, X_n) \end{pmatrix} \quad (3.55)$$

It should be emphasized that the diagonal elements of the covariance matrix $\mathrm{Cov}(\mathbf{X})$ are actually the variances of $X_1, \ldots, X_n$. Furthermore, since covariance is symmetric, i.e., $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$, the covariance matrix $\mathrm{Cov}(\mathbf{X})$ must be a symmetric matrix.

There is an important theorem regarding the mean vector and covariance matrix for a linear transformation of the random vector $\mathbf{X}$. Suppose $\mathbf{X}$ is an n-dimensional random vector as specified by Eq. (3.53), with mean vector $E(\mathbf{X})$ and covariance matrix $\mathrm{Cov}(\mathbf{X})$. Now, assume $\mathbf{Y}$ is an m-dimensional random vector which is a linear transform of the random vector $\mathbf{X}$ by the relation $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{B}$, where $\mathbf{A}$ is an $m \times n$ transformation matrix whose elements are constants, and $\mathbf{B}$ is an m-dimensional constant vector. Then we have the following two equations:

$$E(\mathbf{Y}) = \mathbf{A}\,E(\mathbf{X}) + \mathbf{B} \quad (3.56)$$

$$\mathrm{Cov}(\mathbf{Y}) = \mathbf{A}\,\mathrm{Cov}(\mathbf{X})\,\mathbf{A}^t \quad (3.57)$$
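Eq. (3.56) can be checked by Monte Carlo simulation. The following sketch is illustrative only: the 2x2 transform A, the offset B, and the component distributions are all invented for the example.

```python
import random

# Empirically check E(Y) = A E(X) + B for Y = A X + B (Eq. 3.56),
# using a 2-D random vector X with independent components.
rng = random.Random(42)
A = [[2.0, 0.0], [1.0, 3.0]]
B = [1.0, -1.0]

def draw_x():
    # X1 ~ uniform(0, 2) with mean 1; X2 ~ uniform(0, 4) with mean 2.
    return [rng.uniform(0, 2), rng.uniform(0, 4)]

def transform(x):
    return [A[i][0] * x[0] + A[i][1] * x[1] + B[i] for i in range(2)]

n = 200_000
mean_y = [0.0, 0.0]
for _ in range(n):
    y = transform(draw_x())
    mean_y = [m + v / n for m, v in zip(mean_y, y)]

# The theorem's prediction: A applied to E(X) = (1, 2)^t, plus B.
expected = [A[i][0] * 1.0 + A[i][1] * 2.0 + B[i] for i in range(2)]
```

The empirical mean of Y approaches the predicted vector; Eq. (3.57) could be checked the same way by accumulating outer products of deviations.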

3.1.6. Some Useful Distributions

In the following two sections, we introduce several useful distributions that are widely used in applications of probability and statistics, particularly in spoken language systems.

3.1.6.1. Uniform Distributions

The simplest distribution is the uniform distribution, where the p.f. or p.d.f. is a constant function. For a uniform discrete random variable X, which takes only n possible values from $\{x_1, \ldots, x_n\}$, the p.f. for X is

$$P(X = x_i) = \frac{1}{n} \quad (3.58)$$

For a uniform continuous random variable X, which takes possible values only from the real interval [a, b], the p.d.f. for X is

$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \quad (3.59)$$

Figure 3.5 A uniform distribution for the p.d.f. in Eq. (3.59).

3.1.6.2. Binomial Distributions

The binomial distribution is used to describe binary-decision events. For example, suppose that a single coin toss produces heads with probability p and tails with probability 1 - p. Now, if we toss the same coin n times and let X denote the number of heads observed, then the random variable X has the following binomial p.f.:

$$P(X = x) = f(x \mid p, n) = \binom{n}{x} p^x (1 - p)^{n - x} \quad (3.60)$$

Figure 3.6 Three binomial distributions with p = 0.2, 0.3, and 0.4.

It can be shown that the mean and variance of a binomial distribution are:

$$E(X) = np \quad (3.61)$$

$$\mathrm{Var}(X) = np(1 - p) \quad (3.62)$$

Figure 3.6 illustrates three binomial distributions with p = 0.2, 0.3, and 0.4.

3.1.6.3. Geometric Distributions

The geometric distribution is related to the binomial distribution. As in the independent coin-toss example, heads-up has probability p and tails-up has probability 1 - p. The geometric distribution models the time until a tails-up appears. Let the random variable X be the time (the number of tosses) until the first tails-up is shown. The p.f. of X has the following form:

$$P(X = n) = f(n \mid p) = p^{n-1}(1 - p) \quad n = 1, 2, \ldots \text{ and } 0 < p < 1 \quad (3.63)$$

The mean and variance of a geometric distribution are given by:

$$E(X) = \frac{1}{1 - p} \quad (3.64)$$

$$\mathrm{Var}(X) = \frac{p}{(1 - p)^2} \quad (3.65)$$

One example of the geometric distribution is the distribution of the state duration for a hidden Markov model, as described in Chapter 8. Figure 3.7 illustrates three geometric distributions with different values of p.

Figure 3.7 Three geometric distributions with different parameter p.
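The coin-toss description above translates directly into a simulation. The sketch below (the helper name `first_tail` and the choice p = 0.4 are invented for illustration) checks the mean formula in Eq. (3.64):

```python
import random

def first_tail(p, rng):
    """Number of tosses until the first tails-up, with P(heads) = p."""
    n = 1
    while rng.random() < p:   # heads with probability p; keep tossing
        n += 1
    return n

p = 0.4
rng = random.Random(7)
n_samples = 100_000
avg = sum(first_tail(p, rng) for _ in range(n_samples)) / n_samples
expected_mean = 1.0 / (1.0 - p)   # Eq. (3.64)
```

With p = 0.4 the expected waiting time is 1/(1 - 0.4), about 1.67 tosses, and the simulated average lands close to it.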

3.1.6.4. Multinomial Distributions

Suppose that a bag contains balls of k different colors, where the proportion of balls of color i is $p_i$. Thus, $p_i > 0$ for $i = 1, \ldots, k$ and $\sum_{i=1}^{k} p_i = 1$. Now suppose that n balls are randomly selected from the bag and there are enough balls of each color. Let $X_i$ denote the number of selected balls that are of color i. The random vector $\mathbf{X} = (X_1, \ldots, X_k)$ is said to have a multinomial distribution with parameters n and $\mathbf{p} = (p_1, \ldots, p_k)$. For a vector $\mathbf{x} = (x_1, \ldots, x_k)$, the p.f. of $\mathbf{X}$ has the following form:

$$P(\mathbf{X} = \mathbf{x}) = f(\mathbf{x} \mid n, \mathbf{p}) = \begin{cases} \dfrac{n!}{x_1! \cdots x_k!}\,p_1^{x_1} \cdots p_k^{x_k} & x_i \ge 0 \text{ for } i = 1, \ldots, k \text{ and } x_1 + \cdots + x_k = n \\ 0 & \text{otherwise} \end{cases} \quad (3.66)$$

Figure 3.8 A multinomial distribution with $p_2 = 0.3$.

It can be shown that the mean, variance, and covariance of the multinomial distribution are:

$$E(X_i) = np_i \quad \text{and} \quad \mathrm{Var}(X_i) = np_i(1 - p_i), \quad i = 1, \ldots, k \quad (3.67)$$

$$\mathrm{Cov}(X_i, X_j) = -np_ip_j \quad (3.68)$$

Figure 3.8 shows a multinomial distribution with $p_2 = 0.3$. Since there are only two free parameters, $x_1$ and $x_2$, the graph is illustrated using only $x_1$ and $x_2$ as axes. Multinomial distributions are typically used with the $\chi^2$ test, one of the most widely used goodness-of-fit hypothesis testing procedures, described in Section 3.3.3.

3.1.6.5. Poisson Distributions

Another popular discrete distribution is the Poisson distribution. The random variable X has a Poisson distribution with mean $\lambda$ ($\lambda > 0$) if the p.f. of X has the following form:

$$P(X = x) = f(x \mid \lambda) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} \quad (3.69)$$

The mean and variance of a Poisson distribution are the same and equal $\lambda$:

$$E(X) = \mathrm{Var}(X) = \lambda \quad (3.70)$$

Figure 3.9 Three Poisson distributions with $\lambda$ = 1, 2, and 4.

Figure 3.9 illustrates three Poisson distributions with $\lambda$ = 1, 2, and 4. The Poisson distribution is typically used in queuing theory, where x is the total number of occurrences of some phenomenon during a fixed period of time or within a fixed region of space. Examples include the number of telephone calls received at a switchboard during a fixed period of

time. In speech recognition, the Poisson distribution is used to model the duration of a phoneme.

3.1.6.6. Gamma Distributions

A continuous random variable X is said to have a gamma distribution with parameters $\alpha$ and $\beta$ ($\alpha > 0$ and $\beta > 0$) if X has a continuous p.d.f. of the following form:

$$f(x \mid \alpha, \beta) = \begin{cases} \dfrac{\beta^\alpha x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)} & x > 0 \\ 0 & \text{otherwise} \end{cases} \quad (3.71)$$

where

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\,dx \quad (3.72)$$

It can be shown that the function $\Gamma$ is a factorial function when $\alpha$ is a positive integer n:

$$\Gamma(n) = \begin{cases} (n - 1)! & n = 2, 3, \ldots \\ 1 & n = 1 \end{cases} \quad (3.73)$$

The mean and variance of a gamma distribution are:

$$E(X) = \frac{\alpha}{\beta} \quad \text{and} \quad \mathrm{Var}(X) = \frac{\alpha}{\beta^2} \quad (3.74)$$

Figure 3.10 Three gamma distributions with the same $\beta$ and $\alpha$ = 2.0, 3.0, and 4.0.

Figure 3.10 illustrates three gamma distributions with the same $\beta$ and different values of $\alpha$. There is an interesting theorem associated with gamma distributions. If the random variables $X_1, \ldots, X_k$ are independent and each random variable $X_i$ has a gamma distribution with parameters $\alpha_i$ and $\beta$, then the sum $X_1 + \cdots + X_k$ also has a gamma distribution with parameters $\alpha_1 + \cdots + \alpha_k$ and $\beta$.

A special case of the gamma distribution is called the exponential distribution. A continuous random variable X is said to have an exponential distribution with parameter $\beta$ ($\beta > 0$) if X has a continuous p.d.f. of the following form:

$$f(x \mid \beta) = \begin{cases} \beta e^{-\beta x} & x > 0 \\ 0 & \text{otherwise} \end{cases} \quad (3.75)$$

It is clear that the exponential distribution is a gamma distribution with $\alpha = 1$. The mean and variance of the exponential distribution are:

$$E(X) = \frac{1}{\beta} \quad \text{and} \quad \mathrm{Var}(X) = \frac{1}{\beta^2} \quad (3.76)$$

Figure 3.11 Three exponential distributions with $\beta$ = 1.0, 0.6, and 0.3.

Figure 3.11 shows three exponential distributions with $\beta$ = 1.0, 0.6, and 0.3. The exponential distribution is often used in queuing theory for the distributions of the duration of a service or the inter-arrival time of customers. It is also used to approximate the distribution of the life of a mechanical component.
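Exponential variates are easy to generate by inverse-transform sampling, since the c.d.f. $F(x) = 1 - e^{-\beta x}$ inverts in closed form to $x = -\ln(1 - u)/\beta$. A minimal sketch (the value $\beta$ = 0.5 is chosen arbitrarily) that checks the mean in Eq. (3.76):

```python
import math
import random

def exponential_sample(beta, rng):
    """Draw from f(x | beta) = beta * exp(-beta * x) by inverting the c.d.f."""
    u = rng.random()
    return -math.log(1.0 - u) / beta

beta = 0.5
rng = random.Random(11)
n = 100_000
avg = sum(exponential_sample(beta, rng) for _ in range(n)) / n
expected_mean = 1.0 / beta   # Eq. (3.76)
```

The sample average converges to $1/\beta$, here 2.0, as the law of large numbers predicts.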

3.1.7. Gaussian Distributions

The Gaussian distribution is by far the most important probability distribution, mainly because many scientists have observed that the random variables studied in various physical experiments (including speech signals) often have distributions that are approximately Gaussian. The Gaussian distribution is also referred to as the normal distribution. A continuous random variable X is said to have a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ ($\sigma > 0$) if X has a continuous p.d.f. of the following form:

$$f(x \mid \mu, \sigma^2) = N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] \quad (3.77)$$

It can be shown that $\mu$ and $\sigma^2$ are indeed the mean and the variance of the Gaussian distribution. Some examples of Gaussians can be found in Figure 3.4. The use of Gaussian distributions is justified by the central limit theorem, which states that observable events considered to be a consequence of many unrelated causes, with no single cause predominating over the others, tend to follow the Gaussian distribution [6].

It can be shown from Eq. (3.77) that the Gaussian $f(x \mid \mu, \sigma^2)$ is symmetric with respect to $x = \mu$. Therefore, $\mu$ is both the mean and the median of the distribution. Moreover, $\mu$ is also the mode of the distribution, i.e., the p.d.f. $f(x \mid \mu, \sigma^2)$ attains its maximum at the mean point $x = \mu$. Several Gaussian p.d.f.'s with the same mean $\mu$ but different variances are illustrated in Figure 3.4. Readers can see that the curve has a bell shape. The Gaussian p.d.f. with a small variance has a high peak and is very concentrated around the mean $\mu$, whereas the Gaussian p.d.f. with a large variance is relatively flat and is spread out more widely over the x-axis.

If the random variable X has a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, then any linear function of X also has a Gaussian distribution. That is, if Y = aX + b, where a and b are constants and $a \ne 0$, then Y has a Gaussian distribution with mean $a\mu + b$ and variance $a^2\sigma^2$. Similarly, the sum $X_1 + \cdots + X_k$ of independent random variables $X_1, \ldots, X_k$, where each random variable $X_i$ has a Gaussian distribution, is also a Gaussian distribution.

3.1.7.1. Standard Gaussian Distributions

The Gaussian distribution with mean 0 and variance 1, denoted N(0, 1), is called the standard Gaussian distribution or unit Gaussian distribution.
Since the linear transformation of a Gaussian distribution is still a Gaussian distribution, the behavior of a Gaussian distribution can be described solely using the standard Gaussian distribution. If the random variable

X has a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, that is, $X \sim N(\mu, \sigma^2)$, it can be shown that

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \quad (3.78)$$

Based on Eq. (3.78), the following property can be shown:

$$P\left(|X - \mu| \le k\sigma\right) = P\left(|Z| \le k\right) \quad (3.79)$$

Equation (3.79) demonstrates that every Gaussian distribution contains the same total amount of probability within any fixed number of standard deviations of its mean.

3.1.7.2. The Central Limit Theorem

If the random variables $X_1, \ldots, X_n$ are i.i.d. according to a common distribution function with mean $\mu$ and variance $\sigma^2$, then as the random sample size n approaches $\infty$, the following random variable has a distribution converging to the standard Gaussian distribution:

$$Y_n = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \sim N(0, 1) \quad (3.80)$$

where $\bar{X}_n$ is the sample mean of the random variables $X_1, \ldots, X_n$ as defined in Eq. (3.41). Based on Eq. (3.80), the sample mean random variable $\bar{X}_n$ can be approximated by a Gaussian distribution with mean $\mu$ and variance $\sigma^2/n$.

The central limit theorem above applies to i.i.d. random variables $X_1, \ldots, X_n$. A. Liapounov derived another central limit theorem for independent but not necessarily identically distributed random variables $X_1, \ldots, X_n$. Suppose $X_1, \ldots, X_n$ are independent random variables and $E\left(|X_i - \mu_i|^3\right) < \infty$ for each i; then the following random variable converges to the standard Gaussian distribution as $n \to \infty$:

$$Y_n = \frac{\sum_{i=1}^{n}(X_i - \mu_i)}{\left(\sum_{i=1}^{n}\sigma_i^2\right)^{1/2}} \quad (3.81)$$

In other words, the sum of the random variables $X_1, \ldots, X_n$ can be approximated by a Gaussian distribution with mean $\sum_{i=1}^{n}\mu_i$ and variance $\sum_{i=1}^{n}\sigma_i^2$. Both central limit theorems essentially state that, regardless of their original individual distributions, the sum of many independent random variables (effects) tends to be distributed like a Gaussian distribution as the number of random variables (effects) becomes large.
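The convergence in Eq. (3.80) can be illustrated by simulation: standardized sample means of uniform variates should put about 68.3% of their mass within one standard deviation, just as a standard Gaussian does. The sketch below is illustrative only; the sample size of 50 and the number of repetitions are arbitrary choices.

```python
import math
import random

rng = random.Random(3)

def standardized_mean(n):
    """Y_n of Eq. (3.80) for n uniform(0, 1) draws (mu = 0.5, var = 1/12)."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    xbar = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

samples = [standardized_mean(50) for _ in range(20_000)]
frac_within_1sd = sum(1 for y in samples if abs(y) <= 1.0) / len(samples)
```

For N(0, 1) the probability of falling within one standard deviation is roughly 0.683, and the empirical fraction comes out close to that even though the underlying draws are uniform, not Gaussian.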

3.1.7.3. Multivariate Mixture Gaussian Distributions

When \mathbf{X} = (X_1, \ldots, X_n) is an n-dimensional continuous random vector, the multivariate Gaussian p.d.f. has the following form:

f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]   (3.82)

where \boldsymbol{\mu} is the n-dimensional mean vector, \boldsymbol{\Sigma} is the covariance matrix, and |\boldsymbol{\Sigma}| is the determinant of the covariance matrix \boldsymbol{\Sigma}:

\boldsymbol{\mu} = E(\mathbf{X})   (3.83)

\boldsymbol{\Sigma} = E\left[ (\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^t \right]   (3.84)

More specifically, the ij-th element \sigma_{ij} of the covariance matrix \boldsymbol{\Sigma} can be specified as follows:

\sigma_{ij} = E\left[ (X_i - \mu_i)(X_j - \mu_j) \right]   (3.85)

Figure 3.12 A two-dimensional multivariate Gaussian distribution with independent random variables x_1 and x_2 that have the same variance.

If X_1, \ldots, X_n are independent random variables, the covariance matrix \boldsymbol{\Sigma} reduces to a diagonal covariance matrix, in which all the off-diagonal entries are zero. The distribution can then be regarded as n independent scalar Gaussian distributions, and the joint p.d.f. is the product of all the individual scalar Gaussian p.d.f.s. Figure 3.12 shows a two-dimensional multivariate Gaussian distribution with independent random variables x_1 and x_2 with the same variance. Figure 3.13 shows another two-dimensional multivariate Gaussian distribution with independent random variables x_1 and x_2 that have different variances.

Although Gaussian distributions are unimodal,1 more complex distributions with multiple local maxima can be approximated by Gaussian mixtures:

f(x) = \sum_{k=1}^{K} c_k N_k(x; \mu_k, \Sigma_k)   (3.86)

where c_k, the mixture weight associated with the k-th Gaussian component, is subject to the following constraint:

\sum_{k=1}^{K} c_k = 1, \quad c_k \ge 0

Gaussian mixtures with enough mixture components can approximate any distribution. Throughout this book, most continuous probability density functions are modeled with Gaussian mixtures.

Figure 3.13 Another two-dimensional multivariate Gaussian distribution with independent random variables x_1 and x_2, which have different variances.

1 A unimodal distribution has a single maximum (bump). For the Gaussian distribution, the maximum occurs at the mean.
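A one-dimensional Gaussian mixture along these lines can be sketched in a few lines. The two components below (weights, means, and variances) are arbitrary illustrative choices, not values from the text; the sketch evaluates the mixture density, draws a sample by first picking a component by its weight, and checks that the density integrates to roughly 1.

```python
import math
import random

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Two-component mixture; the weights must be nonnegative and sum to 1
weights = [0.3, 0.7]
params = [(-2.0, 1.0), (1.5, 0.25)]  # (mu_k, sigma2_k), illustrative values

def mixture_pdf(x):
    return sum(c * gaussian_pdf(x, m, s2) for c, (m, s2) in zip(weights, params))

# Sampling: pick a component by its weight, then sample that Gaussian
random.seed(1)
def sample():
    k = 0 if random.random() < weights[0] else 1
    m, s2 = params[k]
    return random.gauss(m, math.sqrt(s2))

# The mixture is a valid density: crude Riemann sum over [-10, 10] is ~1
total = sum(mixture_pdf(-10.0 + 0.01 * i) * 0.01 for i in range(2001))
print(total)
```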

3.1.7.4. χ² Distributions

A gamma distribution with parameters \alpha and \beta was defined earlier in the chapter. For any given positive integer n, the gamma distribution for which \alpha = n/2 and \beta = 1/2 is called the \chi^2 distribution with n degrees of freedom. It follows that the p.d.f. for the \chi^2 distribution is

f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}, \quad x > 0   (3.87)

\chi^2 distributions are important in statistics because they are closely related to random samples from a Gaussian distribution. They are widely applied in many important problems of statistical inference and hypothesis testing. Specifically, if the random variables X_1, \ldots, X_n are independent and identically distributed, and each of these variables has a standard Gaussian distribution, then the sum of squares X_1^2 + \cdots + X_n^2 can be proved to have a \chi^2 distribution with n degrees of freedom. Figure 3.14 illustrates three \chi^2 distributions, with n = 2, 3, and 4.

Figure 3.14 Three \chi^2 distributions, with n = 2, 3, and 4.

The mean and variance of the \chi^2 distribution are

E(X) = n \quad \text{and} \quad \mathrm{Var}(X) = 2n   (3.88)

Following the additivity property of the gamma distribution, the \chi^2 distribution also has an additivity property. That is, if the random variables X_1, \ldots, X_n are independent and X_i has a \chi^2 distribution with k_i degrees of freedom, then the sum X_1 + \cdots + X_n has a \chi^2 distribution with k_1 + \cdots + k_n degrees of freedom.

3.1.7.5. Log-Normal Distribution

Let x be a Gaussian random variable with mean \mu and standard deviation \sigma; then

y = e^x   (3.89)

follows a log-normal distribution

f(y \mid \mu, \sigma) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left[ -\frac{(\ln y - \mu)^2}{2\sigma^2} \right]   (3.90)

shown in Figure 3.15, whose mean is given by

\mu_y = E\{y\} = E\{e^x\} = \int e^x \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] dx = \exp\{\mu + \sigma^2/2\} \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{\left( x - (\mu + \sigma^2) \right)^2}{2\sigma^2} \right] dx = \exp\{\mu + \sigma^2/2\}   (3.91)

where we have rearranged the quadratic form of x and made use of the fact that the total probability mass of a Gaussian is 1. Similarly, the second-order moment of y is given by

E\{y^2\} = \int e^{2x} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] dx = \exp\{2\mu + 2\sigma^2\} \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{\left( x - (\mu + 2\sigma^2) \right)^2}{2\sigma^2} \right] dx = \exp\{2\mu + 2\sigma^2\}   (3.92)

and thus the variance of y is given by

\sigma_y^2 = E\{y^2\} - \left( E\{y\} \right)^2 = \mu_y^2 \left[ \exp(\sigma^2) - 1 \right]   (3.93)
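Both the \chi^2 moments above and the log-normal mean formula can be checked by simulation. In the sketch below (the degrees of freedom, trial count, and seed are illustrative), \chi^2 samples are built as sums of squared standard Gaussians, exactly the construction described in the text, and the empirical log-normal mean is compared with \exp(\mu + \sigma^2/2).

```python
import math
import random
import statistics

random.seed(2)

# chi-squared with n degrees of freedom: a sum of n squared standard Gaussians
n, trials = 4, 20000
chi2 = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(trials)]
print(statistics.mean(chi2), statistics.variance(chi2))  # near n and 2n

# log-normal mean: E{e^x} = exp(mu + sigma^2 / 2)
mu, sigma = 0.0, 0.5
ys = [math.exp(random.gauss(mu, sigma)) for _ in range(trials)]
print(statistics.mean(ys), math.exp(mu + sigma ** 2 / 2.0))
```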

Figure 3.15 Log-normal distributions for \mu = 0 and \sigma = 3, 1, and 0.5, according to Eq. (3.90).

Similarly, if \mathbf{x} is a Gaussian random vector with mean \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}, then the random vector \mathbf{y} = e^{\mathbf{x}} is log-normal, with mean and covariance matrix given by [8]

\mu_y[i] = \exp\left( \mu[i] + \Sigma[i,i]/2 \right)
\Sigma_y[i,j] = \mu_y[i]\,\mu_y[j]\left( \exp(\Sigma[i,j]) - 1 \right)   (3.94)

using a derivation similar to that of Eqs. (3.91) to (3.93).

3.2. ESTIMATION THEORY

Estimation theory and significance testing are the two most important theories and methods of statistical inference. In this section we describe estimation theory; significance testing is covered in the next section. A problem of statistical inference is one in which data generated in accordance with some unknown probability distribution must be analyzed, and some type of inference about the unknown distribution must be made. In a problem of statistical inference, any characteristic of the distribution generating the experimental data, such as the mean \mu and variance \sigma^2 of a Gaussian distribution, is called a parameter of the distribution. The set \Omega of all possible values of a parameter \Phi, or of a group of parameters \Phi_1, \Phi_2, \ldots, \Phi_k, is called the parameter space. In this section we focus on how to estimate the parameter \Phi from sample data.

Before we describe various estimation methods, we introduce the concept and nature of the estimation problem. Suppose that a set of random variables X = \{X_1, X_2, \ldots, X_n\} is

Estmato Theory 99..d. accordg to a p.d.f. pφ ( ) where the value of the parameter Φ s ukow. Now, suppose also that the value of Φ must be estmated from the observed values the sample. A estmator of the parameter Φ, based o the radom varables X, X,, X, s a realvalued fucto θ ( X, X,, X ) that specfes the estmated value of Φ for each possble set of values of X, X,, X. That s, f the sample values of X, X,, X tur out to be,,,, the the estmated value of Φ wll be θ (,,, ). We eed to dstgush betwee estmator, estmate, ad estmato. Aestmator θ ( X, X,, X ) s a fucto of the radom varables, whose probablty dstrbuto ca be derved from the jot dstrbuto of X, X,, X. O the other had, a estmate s a specfc value θ (,,, ) of the estmator that s determed by usg some specfc sample values,,,. Estmato s usually used to dcate the process of obtag such a estmator for the set of radom varables or a estmate for the set of specfc sample values. If we use the otato X = { X, X,, X } to represet the vector of radom varables ad = {,,, } to represet the vector of sample values, a estmator ca be deoted as θ ( X) ad a estmate θ ( ). Sometmes we abbrevate a estmator θ ( X) by just the symbol θ. I the followg four sectos we descrbe ad compare three dfferet estmators (estmato methods). They are mmum mea square estmator, mamum lkelhood estmator, adbayes estmator. The frst oe s ofte used to estmate the radom varable tself, whle the latter two are used to estmate the parameters of the dstrbuto of the radom varables. 3... Mmum/Least Mea Squared Error Estmato Mmum mea squared error (MMSE) estmato ad least squared error (LSE) estmato are mportat methods for radom varable sce the goal (mmze the squared error) s a tutve oe. I geeral, two radom varables X ad Y are..d. accordg to some p.d.f. fxy, (, y ). Suppose that we perform a seres of epermets ad observe the value of X. We wat to fd a trasformato Y ˆ = g( X) such that we ca predct the value of the radom varable Y. 
The following quantity can measure the goodness of such a transformation:

E\left[ (Y - \hat{Y})^2 \right] = E\left[ (Y - g(X))^2 \right]   (3.95)

This quantity is called the mean squared error (MSE) because it is the mean of the squared error of the predictor g(X). The criterion of minimizing the mean squared error is a good one for picking the predictor g(X). Of course, we usually specify the class of functions G from which g(X) may be selected. In general, there is a parameter vector \Phi associated with the function g(X), so the function can be expressed as g(X, \Phi). The process to

find the parameter vector \hat{\Phi}_{MMSE} that minimizes the mean of the squared error is called minimum mean squared error estimation, and \hat{\Phi}_{MMSE} is called the minimum mean squared error estimator. That is,

\hat{\Phi}_{MMSE} = \arg\min_{\Phi} E\left[ \left( Y - g(X, \Phi) \right)^2 \right]   (3.96)

Sometimes the joint distribution of the random variables X and Y is not known. Instead, n samples of (x, y) pairs may be observable. In this case, the following criterion can be used instead:

\hat{\Phi}_{LSE} = \arg\min_{\Phi} \sum_{i=1}^{n} \left[ y_i - g(x_i, \Phi) \right]^2   (3.97)

The argument of the minimization in Eq. (3.97) is called the sum of squared errors (SSE), and the process of finding the parameter vector \hat{\Phi}_{LSE} that satisfies the criterion is called least squared error estimation or minimum squared error estimation. LSE is a powerful mechanism for curve fitting, where the function g(x, \Phi) describes the observation pairs (x_i, y_i). In general, there are more points (n) than the number of free parameters in the function g(x, \Phi), so the fitting is over-determined. Therefore, no exact solution exists, and LSE fitting becomes necessary.

It should be emphasized that MMSE and LSE are actually very similar and share similar properties. The quantity in Eq. (3.97) is n times the sample mean of the squared error. Based on the law of large numbers, when the joint probability f_{X,Y}(x, y) is uniform or the number of samples approaches infinity, MMSE and LSE are equivalent.

For the class of functions G, we consider the following three cases:

Constant functions, i.e., G_c = \{ g(x) = c, \; c \in R \}   (3.98)

Linear functions, i.e., G_l = \{ g(x) = ax + b, \; a, b \in R \}   (3.99)

Other nonlinear functions, G_{nl}

3.2.1.1. MMSE/LSE for Constant Functions

When \hat{Y} = g(x) = c, Eq. (3.95) becomes

E\left[ (Y - \hat{Y})^2 \right] = E\left[ (Y - c)^2 \right]   (3.100)
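For the constant-predictor case, the claim that the sample mean minimizes the sum of squared errors is easy to verify numerically. A minimal sketch with a small illustrative data set and a grid of candidate constants:

```python
ys = [1.0, 2.0, 2.5, 4.0, 5.5]        # illustrative observations
c_lse = sum(ys) / len(ys)             # the LSE constant: the sample mean

def sse(c):
    """Sum of squared errors for the constant predictor c."""
    return sum((y - c) ** 2 for y in ys)

# No candidate constant on a grid around the mean does better
candidates = [c_lse + d / 10.0 for d in range(-20, 21)]
best = min(candidates, key=sse)
print(c_lse, best, sse(best))
```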

To find the MMSE estimate of c, we take the derivative of Eq. (3.100) with respect to c and equate it to 0. The MMSE estimate c_{MMSE} is given as

c_{MMSE} = E(Y)   (3.101)

and the minimum mean squared error is exactly the variance of Y, \mathrm{Var}(Y).

For the LSE estimate of c, the quantity in Eq. (3.97) becomes

\min_{c} \sum_{i=1}^{n} (y_i - c)^2   (3.102)

Similarly, the LSE estimate c_{LSE} can be obtained as follows:

c_{LSE} = \frac{1}{n} \sum_{i=1}^{n} y_i   (3.103)

The quantity in Eq. (3.103) is the sample mean.

3.2.1.2. MMSE and LSE for Linear Functions

When \hat{Y} = g(x) = ax + b, Eq. (3.95) becomes

e(a, b) = E\left[ (Y - \hat{Y})^2 \right] = E\left[ (Y - aX - b)^2 \right]   (3.104)

To find the MMSE estimates of a and b, we can first set

\frac{\partial e}{\partial a} = 0 \quad \text{and} \quad \frac{\partial e}{\partial b} = 0   (3.105)

and solve the two linear equations. Thus, we obtain

a = \frac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)} = \rho_{XY} \frac{\sigma_Y}{\sigma_X}   (3.106)

b = E(Y) - \rho_{XY} \frac{\sigma_Y}{\sigma_X} E(X)   (3.107)

For LSE estimation, we assume for generality that each sample \mathbf{x}_i is a d-dimensional vector. Assuming we have n sample vectors (\mathbf{x}_i, y_i) = (x_{i1}, \ldots, x_{id}, y_i), i = 1, \ldots, n, a linear function can be represented as

\hat{\mathbf{Y}} = \mathbf{X}\mathbf{A}, \quad \text{i.e.,} \quad
\begin{bmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_n \end{bmatrix} =
\begin{bmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix}
\begin{bmatrix} a_1 \\ \vdots \\ a_d \end{bmatrix}   (3.108)

The sum of squared errors can then be represented as

e(\mathbf{A}) = \sum_{i=1}^{n} \left( \mathbf{x}_i^t \mathbf{A} - y_i \right)^2 = (\mathbf{X}\mathbf{A} - \mathbf{Y})^t (\mathbf{X}\mathbf{A} - \mathbf{Y})   (3.109)

A closed-form solution of the LSE estimate of \mathbf{A} can be obtained by taking the gradient of e(\mathbf{A}),

\nabla e(\mathbf{A}) = \sum_{i=1}^{n} 2 \left( \mathbf{x}_i^t \mathbf{A} - y_i \right) \mathbf{x}_i = 2\,\mathbf{X}^t (\mathbf{X}\mathbf{A} - \mathbf{Y})   (3.110)

and equating it to zero. This yields the following equation:

\mathbf{X}^t \mathbf{X}\,\mathbf{A} = \mathbf{X}^t \mathbf{Y}   (3.111)

Thus the LSE estimate \mathbf{A}_{LSE} is of the following form:

\mathbf{A}_{LSE} = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}   (3.112)

The matrix (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t in Eq. (3.112) is also referred to as the pseudo-inverse of \mathbf{X} and is sometimes denoted as \mathbf{X}^{\dagger}. When \mathbf{X}^t \mathbf{X} is singular, or some boundary conditions cause the LSE estimate in Eq. (3.112) to be unattainable, numeric methods can be used to find an approximate solution. Instead of minimizing the quantity in Eq. (3.109), one can minimize the following quantity:

e(\mathbf{A}) = \| \mathbf{X}\mathbf{A} - \mathbf{Y} \|^2 + \alpha \| \mathbf{A} \|^2   (3.113)

Following a similar procedure, one can obtain the LSE estimate minimizing the quantity above in the following form:

\mathbf{A}^{*}_{LSE} = (\mathbf{X}^t \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^t \mathbf{Y}   (3.114)

The LSE solution in Eq. (3.112) can be used for polynomial functions too. In the problem of polynomial curve fitting using the least squares criterion, we aim to find the coefficients \mathbf{A} = (a_0, a_1, a_2, \ldots, a_d)^t that minimize the following quantity:

\min_{a_0, a_1, \ldots, a_d} E\left[ (Y - \hat{Y})^2 \right]   (3.115)

where

\hat{Y} = a_0 + a_1 x + a_2 x^2 + \cdots + a_d x^d

To obtain the LSE estimate of the coefficients \mathbf{A} = (a_0, a_1, a_2, \ldots, a_d)^t, simply change the formation of the matrix \mathbf{X} in Eq. (3.108) to the following:

\mathbf{X} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & & & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}   (3.116)

Note that x_{ij} in Eq. (3.108) means the j-th dimension of sample \mathbf{x}_i, while x_i^j in Eq. (3.116) means the j-th power of the value x_i. Therefore, the LSE estimate \mathbf{A}_{LSE} of the polynomial coefficients has the same form as Eq. (3.112).

3.2.1.3. MMSE/LSE for Nonlinear Functions

In the most general case, consider solving the following minimization problem:

\min_{g(\cdot) \in G_{nl}} E\left[ \left( Y - g(X) \right)^2 \right]   (3.117)

Since we need to deal with all possible nonlinear functions, taking a derivative does not work here. Instead, we use the property of conditional expectation to solve this minimization problem. By applying Eq. (3.38) to Eq. (3.117), we get

E_{X,Y}\left[ (Y - g(X))^2 \right] = E_X \left\{ E_{Y|X}\left[ (Y - g(X))^2 \mid X \right] \right\} = \int_{-\infty}^{\infty} E_{Y|X}\left[ (Y - g(X))^2 \mid X = x \right] f_X(x)\, dx = \int_{-\infty}^{\infty} E_{Y|X}\left[ (Y - g(x))^2 \mid X = x \right] f_X(x)\, dx   (3.118)

Since the integrand in Eq. (3.118) is nonnegative, the quantity in Eq. (3.117) is minimized at the same time the following quantity is minimized for each x:

\min_{g(x) \in R} E_{Y|X}\left[ \left( Y - g(x) \right)^2 \mid X = x \right]   (3.119)

Since g(x) is a constant in the calculation of the conditional expectation above, the MMSE estimate can be obtained in the same way as for the constant functions in Section 3.2.1.1. Thus, the MMSE estimate takes the following form:

\hat{Y} = g_{MMSE}(X) = E_{Y|X}(Y \mid X)   (3.120)
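The normal-equation solution and the polynomial design matrix described above can be sketched in a few lines of plain Python. The data set and the small Gaussian-elimination solver below are illustrative additions, not from the text; fitting a noiseless quadratic recovers its coefficients exactly (up to rounding).

```python
def solve(M, v):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(M)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

# Quadratic fit: polynomial design matrix with rows [1, x, x^2]
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 + 0.5 * x - 1.0 * x * x for x in xs]   # exact quadratic, no noise
X = [[1.0, x, x * x] for x in xs]

# Normal equations: (X^t X) A = X^t Y
d = 3
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(xs))) for c in range(d)] for r in range(d)]
XtY = [sum(X[i][r] * ys[i] for i in range(len(xs))) for r in range(d)]
A = solve(XtX, XtY)
print(A)  # recovers the coefficients [2.0, 0.5, -1.0] up to rounding
```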

If the value X = x is observed and the value E(Y \mid X = x) is used to predict Y, the mean squared error (MSE) is minimized and is given by

E_{Y|X}\left[ \left( Y - E_{Y|X}(Y \mid X = x) \right)^2 \mid X = x \right] = \mathrm{Var}_{Y|X}(Y \mid X = x)   (3.121)

The overall MSE, averaged over all the possible values of X, is

E_X \left\{ E_{Y|X}\left[ \left( Y - E_{Y|X}(Y \mid X) \right)^2 \mid X \right] \right\} = E_X \left[ \mathrm{Var}_{Y|X}(Y \mid X) \right]   (3.122)

It is important to distinguish between the overall MSE, E_X[\mathrm{Var}_{Y|X}(Y \mid X)], and the MSE of the particular estimate when X = x, which is \mathrm{Var}_{Y|X}(Y \mid X = x). Before the value of X is observed, the expected MSE of the process of observing X and predicting Y is E_X[\mathrm{Var}_{Y|X}(Y \mid X)]. On the other hand, after a particular value x of X has been observed and the prediction E_{Y|X}(Y \mid X = x) has been made, the appropriate measure of the MSE of the prediction is \mathrm{Var}_{Y|X}(Y \mid X = x).

In general, the form of the MMSE estimator for nonlinear functions depends on the form of the joint distribution of X and Y; there is no general closed-form solution. To get the conditional expectation in Eq. (3.120), we have to perform the following integral:

\hat{Y}(x) = \int y\, f(y \mid X = x)\, dy   (3.123)

This integral is difficult to evaluate. First, different observed values x determine different conditional p.d.f.s for the integral, and exact information about the p.d.f. is often impossible to obtain. Second, there may be no analytic solution for the integral. These difficulties reduce the interest in MMSE estimation of nonlinear functions to theoretical aspects only. The same difficulties also exist for LSE estimation with nonlinear functions. In practice, certain classes of well-behaved nonlinear functions are typically assumed for LSE problems, and numeric methods are used to obtain LSE estimates from sample data.

3.2.2. Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is the most widely used parametric estimation method, largely because of its efficiency. Suppose that a set of random samples X = \{X_1, X_2, \ldots, X_n\} is to be drawn independently according to a discrete or continuous distribution with the p.f. or p.d.f. p(x \mid \Phi), where the parameter vector \Phi belongs to some parameter space \Omega.

Given an observed vector \mathbf{x} = (x_1, \ldots, x_n), the likelihood of the set of sample data vectors with respect to \Phi is defined as the joint p.f. or joint p.d.f. p(\mathbf{x} \mid \Phi); p(\mathbf{x} \mid \Phi) is also referred to as the likelihood function.

MLE assumes that the parameters of the p.d.f. are fixed but unknown, and aims to find the set of parameters that maximizes the likelihood of generating the observed data. For example, if the p.d.f. p(x \mid \Phi) is assumed to be a Gaussian distribution N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), the components of \Phi will then include exactly the components of the mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}. Since X_1, X_2, \ldots, X_n are independent random variables, the likelihood can be rewritten as follows:

p(\mathbf{x} \mid \Phi) = \prod_{k=1}^{n} p(x_k \mid \Phi)   (3.124)

The likelihood p(\mathbf{x} \mid \Phi) can be viewed as the probability of generating the sample data set \mathbf{x} based on the parameter set \Phi. The maximum likelihood estimator of \Phi is denoted as \Phi_{MLE}, the value that maximizes the likelihood p(\mathbf{x} \mid \Phi). That is,

\Phi_{MLE} = \arg\max_{\Phi} p(\mathbf{x} \mid \Phi)   (3.125)

This estimation method is called the maximum likelihood estimation method and is often abbreviated as MLE. Since the logarithm is a monotonically increasing function, the parameter set \Phi_{MLE} that maximizes the log-likelihood also maximizes the likelihood. If p(\mathbf{x} \mid \Phi) is a differentiable function of \Phi, \Phi_{MLE} can be attained by taking the partial derivative with respect to \Phi and setting it to zero. Specifically, let \Phi be a k-component parameter vector \Phi = (\Phi_1, \Phi_2, \ldots, \Phi_k)^t and \nabla_{\Phi} be the gradient operator:

\nabla_{\Phi} = \left( \frac{\partial}{\partial \Phi_1}, \ldots, \frac{\partial}{\partial \Phi_k} \right)^t   (3.126)

The log-likelihood becomes

l(\Phi) = \log p(\mathbf{x} \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi)   (3.127)

and its partial derivative is

\nabla_{\Phi}\, l(\Phi) = \sum_{k=1}^{n} \nabla_{\Phi} \log p(x_k \mid \Phi)   (3.128)

Thus, the maximum likelihood estimate of \Phi can be obtained by solving the following set of k equations:

\nabla_{\Phi}\, l(\Phi) = 0   (3.129)
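Because the log-likelihood of a Gaussian sample is concave in the mean, the stationary point found by setting the gradient to zero is indeed the maximizer. The sketch below (illustrative data, variance treated as known) checks this by comparing the sample mean against a brute-force grid search over the log-likelihood:

```python
import math

xs = [1.2, 0.7, 1.9, 1.4, 0.8, 1.0]   # illustrative sample
sigma2 = 1.0                           # treat the variance as known here

def log_likelihood(mu):
    """Log-likelihood of the sample under N(mu, sigma2)."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma2)
               - (x - mu) ** 2 / (2.0 * sigma2) for x in xs)

mu_mle = sum(xs) / len(xs)  # the stationary point: the sample mean

# Brute-force check: no grid point beats the sample mean
grid = [mu_mle + d / 100.0 for d in range(-100, 101)]
best = max(grid, key=log_likelihood)
print(mu_mle, best)
```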

Example

Let's take a look at the maximum likelihood estimator for a univariate Gaussian p.d.f., given by the following equation:

p(x \mid \Phi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]   (3.130)

where \mu and \sigma^2 are the mean and the variance respectively. The parameter vector \Phi denotes (\mu, \sigma^2). The log-likelihood is

\log p(\mathbf{x} \mid \Phi) = \sum_{k=1}^{n} \log p(x_k \mid \Phi) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (x_k - \mu)^2   (3.131)

and the partial derivatives of the above expression are

\frac{\partial}{\partial \mu} \log p(\mathbf{x} \mid \Phi) = \frac{1}{\sigma^2} \sum_{k=1}^{n} (x_k - \mu)

\frac{\partial}{\partial \sigma^2} \log p(\mathbf{x} \mid \Phi) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{k=1}^{n} (x_k - \mu)^2   (3.132)

We set the two partial derivatives to zero:

\sum_{k=1}^{n} (x_k - \mu) = 0, \qquad -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{k=1}^{n} (x_k - \mu)^2 = 0   (3.133)

The maximum likelihood estimates for \mu and \sigma^2 are obtained by solving the above equations:

\mu_{MLE} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \sigma^2_{MLE} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu_{MLE})^2   (3.134)

Equation (3.134) indicates that the maximum likelihood estimates for the mean and the variance are just the sample mean and sample variance.
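The closed-form estimates above are one-liners in code. The sketch below (illustrative data) computes them directly; note that the MLE variance divides by n, which is what `statistics.pvariance` computes, whereas `statistics.variance` divides by n - 1 (the unbiased estimator).

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative sample
n = len(xs)

mu_mle = sum(xs) / n                              # sample mean
var_mle = sum((x - mu_mle) ** 2 for x in xs) / n  # sample variance (divides by n)

print(mu_mle, var_mle)
print(statistics.pvariance(xs))  # same 1/n ("population") estimator
```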

Example

For the multivariate Gaussian p.d.f.

p(\mathbf{x} \mid \Phi) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]   (3.135)

the maximum likelihood estimates of \boldsymbol{\mu} and \boldsymbol{\Sigma} can be obtained by a similar procedure:

\hat{\boldsymbol{\mu}}_{MLE} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k, \qquad \hat{\boldsymbol{\Sigma}}_{MLE} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE})(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_{MLE})^t   (3.136)

Once again, the maximum likelihood estimates for the mean vector and covariance matrix are the sample mean vector and sample covariance matrix.

In some situations, a maximum likelihood estimate of \Phi may not exist, or the maximum likelihood estimator may not be uniquely defined, i.e., there may be more than one MLE of \Phi for a specific set of sample values. Fortunately, according to Fisher's theorem, for most practical problems with a well-behaved family of distributions, the MLE exists and is uniquely defined [4, 5, 6]. In fact, the maximum likelihood estimator can be proved to be sound under certain conditions. As mentioned before, the estimator \theta(\mathbf{X}) is a function of the vector of random variables \mathbf{X} that represents the sample data; \theta(\mathbf{X}) itself is also a random variable, with a distribution determined by the joint distribution of \mathbf{X}. Let \Phi be the parameter vector of the true distribution p(x \mid \Phi) from which the samples are drawn. If the following three conditions hold:

1. The sample is a draw from the assumed family of distributions,
2. The family of distributions is well behaved,
3. The sample is large enough,

then the maximum likelihood estimator \Phi_{MLE} has a Gaussian distribution with mean \Phi and a variance of the form 1/(nB) [6], where n is the size of the sample and B is the Fisher information, which is determined solely by \Phi.

An estimator is said to be consistent if and only if the estimate converges to the true parameter when there is an infinite number of training samples:

\lim_{n \to \infty} \Phi_{MLE} = \Phi   (3.137)

\Phi_{MLE} is a consistent estimator based on the analysis above. In addition, it can be shown that no consistent estimator has a lower variance than \Phi_{MLE}. In other words, no estimator provides a closer estimate of the true parameters than the maximum likelihood estimator.

3.2.3. Bayesian Estimation and MAP Estimation

Bayesian estimation has a different philosophy than maximum likelihood estimation. While MLE assumes that the parameter \Phi3 is fixed but unknown, Bayesian estimation assumes that the parameter \Phi itself is a random variable with a prior distribution p(\Phi). Suppose we observe a sequence of random samples \mathbf{x} = \{x_1, x_2, \ldots, x_n\}, which are i.i.d. with a p.d.f. p(x \mid \Phi). According to Bayes' rule, we have the posterior distribution of \Phi:

p(\Phi \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \Phi)\, p(\Phi)}{p(\mathbf{x})} \propto p(\mathbf{x} \mid \Phi)\, p(\Phi)   (3.138)

In Eq. (3.138), we drop the denominator p(\mathbf{x}) because it is independent of the parameter \Phi. The distribution in Eq. (3.138) is called the posterior distribution of \Phi because it is the distribution of \Phi after we have observed the values of the random variables X_1, X_2, \ldots, X_n.

3.2.3.1. Prior and Posterior Distributions

For mathematical tractability, conjugate priors are often used in Bayesian estimation. Suppose a random sample is taken from a known distribution with p.d.f. p(x \mid \Phi). A conjugate prior for the random variable (or vector) is defined as a prior distribution for the parameters of the probability density function of the random variable (or vector), such that the class-conditional p.d.f. p(x \mid \Phi), the posterior distribution p(\Phi \mid \mathbf{x}), and the prior distribution p(\Phi) belong to the same distribution family. For example, it is well known that the conjugate prior for the mean of a Gaussian p.d.f. is also a Gaussian p.d.f. [4]. Now, let's derive such a posterior distribution p(\Phi \mid \mathbf{x}) from the widely used Gaussian conjugate prior.

Example

Suppose x_1, x_2, \ldots, x_n are drawn from a Gaussian distribution for which the mean \Phi is a random variable and the variance \sigma^2 is known. The likelihood function p(\mathbf{x} \mid \Phi) can be written as:

3 For simplicity, we assume the parameter \Phi is a scalar instead of a vector here. However, the extension to a parameter vector \Phi can be derived according to a similar procedure.

p(\mathbf{x} \mid \Phi) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \Phi)^2 \right] \propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \Phi)^2 \right]   (3.139)

To further simplify Eq. (3.139), we can use Eq. (3.140):

\sum_{i=1}^{n} (x_i - \Phi)^2 = n(\Phi - \bar{x})^2 + \sum_{i=1}^{n} (x_i - \bar{x})^2   (3.140)

where \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i is the sample mean of \mathbf{x} = \{x_1, x_2, \ldots, x_n\}. Using Eq. (3.140), we can rewrite p(\mathbf{x} \mid \Phi) in Eq. (3.139) as Eq. (3.141):

p(\mathbf{x} \mid \Phi) \propto \exp\left[ -\frac{n}{2\sigma^2} (\Phi - \bar{x})^2 \right] \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]   (3.141)

Now suppose the prior distribution of \Phi is also a Gaussian distribution, with mean \mu and variance \nu^2; i.e., the prior distribution p(\Phi) is given as follows:

p(\Phi) = \frac{1}{(2\pi)^{1/2}\,\nu} \exp\left[ -\frac{(\Phi - \mu)^2}{2\nu^2} \right] \propto \exp\left[ -\frac{(\Phi - \mu)^2}{2\nu^2} \right]   (3.142)

By combining Eqs. (3.141) and (3.142) while dropping the second term in Eq. (3.141), we can attain the posterior p.d.f. p(\Phi \mid \mathbf{x}) in the following form:

p(\Phi \mid \mathbf{x}) \propto \exp\left[ -\frac{1}{2} \left( \frac{n}{\sigma^2} (\Phi - \bar{x})^2 + \frac{(\Phi - \mu)^2}{\nu^2} \right) \right]   (3.143)

Now, if we define \rho and \tau^2 as follows:

\rho = \frac{\sigma^2 \mu + n\nu^2 \bar{x}}{\sigma^2 + n\nu^2}   (3.144)

\tau^2 = \frac{\sigma^2 \nu^2}{\sigma^2 + n\nu^2}   (3.145)

then Eq. (3.143) can be rewritten as

p(\Phi \mid \mathbf{x}) \propto \exp\left[ -\frac{(\Phi - \rho)^2}{2\tau^2} - \frac{n(\bar{x} - \mu)^2}{2(\sigma^2 + n\nu^2)} \right]   (3.146)

Since the second term in Eq. (3.146) does not depend on \Phi, it can be absorbed into the constant factor. Finally, we have the posterior p.d.f. in the following form:

p(\Phi \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left[ -\frac{(\Phi - \rho)^2}{2\tau^2} \right]   (3.147)

Equation (3.147) shows that the posterior p.d.f. p(\Phi \mid \mathbf{x}) is a Gaussian distribution with mean \rho and variance \tau^2 as defined in Eqs. (3.144) and (3.145). The Gaussian prior distribution defined in Eq. (3.142) is indeed a conjugate prior.

3.2.3.2. General Bayesian Estimation

The foremost requirement of a good estimator \theta is that it yield an estimate of \Phi, namely \theta(\mathbf{X}), that is close to the real value \Phi. In other words, a good estimator is one for which it is highly probable that the error \theta(\mathbf{X}) - \Phi is close to 0. In general, we can define a loss function4 R(\Phi, \hat{\Phi}). It measures the loss or cost associated with the fact that the true value of the parameter is \Phi while the estimate is \hat{\Phi}. When only the prior distribution p(\Phi) is available and no sample data have been observed, if we choose one particular estimate \hat{\Phi}, the expected loss is:

E\left[ R(\Phi, \hat{\Phi}) \right] = \int R(\Phi, \hat{\Phi})\, p(\Phi)\, d\Phi   (3.148)

The fact that we can derive the posterior distribution from the likelihood function and the prior distribution [as shown in the derivation of Eq. (3.147)] is very important here, because it allows us to compute the expected posterior loss after the sample vector \mathbf{x} is observed. The expected posterior loss associated with the estimate \hat{\Phi} is:

E\left[ R(\Phi, \hat{\Phi}) \mid \mathbf{x} \right] = \int R(\Phi, \hat{\Phi})\, p(\Phi \mid \mathbf{x})\, d\Phi   (3.149)

The Bayesian estimator of \Phi is defined as the estimator that attains the minimum Bayes risk, that is, minimizes the expected posterior loss function in Eq. (3.149). Formally, the Bayesian estimator is chosen according to:

\theta_{Bayes}(\mathbf{x}) = \arg\min_{\theta} E\left[ R(\Phi, \theta(\mathbf{x})) \mid \mathbf{x} \right]   (3.150)

The Bayesian estimator of \Phi is the estimator \theta_{Bayes} for which Eq. (3.150) is satisfied for every possible value \mathbf{x} of the random vector \mathbf{X}. Therefore, the form of the Bayesian estimator depends only on the loss function and the prior distribution, not on the particular sample values.

4 Bayesian estimation and the loss function are based on Bayes' decision theory, which is described in detail in Chapter 4.
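That the posterior mean minimizes the expected posterior squared-error loss can be checked numerically with a discrete stand-in for the posterior. The sketch below (the support points and weights are illustrative choices) compares the posterior mean against a grid search over candidate estimates:

```python
# Discrete stand-in for a posterior p(Phi | x): support points and weights
phis = [0.0, 0.5, 1.0, 1.5, 2.0]
w = [0.1, 0.2, 0.4, 0.2, 0.1]  # nonnegative, sums to 1

post_mean = sum(p * wi for p, wi in zip(phis, w))

def expected_loss(theta):
    """Expected posterior loss under the squared-error loss function."""
    return sum(wi * (p - theta) ** 2 for p, wi in zip(phis, w))

# Grid search over candidate estimates: the posterior mean wins
candidates = [i / 100.0 for i in range(-100, 301)]
best = min(candidates, key=expected_loss)
print(post_mean, best)
```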

One of the most common loss functions used in statistical estimation is the mean squared error function. The mean squared error loss function for Bayesian estimation has the following form:

R(\Phi, \theta(\mathbf{x})) = \left( \Phi - \theta(\mathbf{x}) \right)^2   (3.151)

In order to find the Bayesian estimator, we seek the \theta_{Bayes} that minimizes the expected posterior loss function:

E\left[ R(\Phi, \theta(\mathbf{x})) \mid \mathbf{x} \right] = E\left[ \left( \Phi - \theta(\mathbf{x}) \right)^2 \mid \mathbf{x} \right] = E(\Phi^2 \mid \mathbf{x}) - 2\,\theta(\mathbf{x})\, E(\Phi \mid \mathbf{x}) + \theta(\mathbf{x})^2   (3.152)

The minimum value of this function can be obtained by taking the partial derivative of Eq. (3.152) with respect to \theta(\mathbf{x}). Since the above equation is simply a quadratic function of \theta(\mathbf{x}), it can be shown that the minimum loss is achieved when \theta_{Bayes} is chosen based on the following equation:

\theta_{Bayes}(\mathbf{x}) = E(\Phi \mid \mathbf{x})   (3.153)

Equation (3.153) translates to the fact that the Bayesian estimate of the parameter \Phi under the mean squared error loss function is equal to the mean of the posterior distribution of \Phi. In the following section, we discuss another popular loss function (MAP estimation) that also generates the same estimate for certain distribution functions.

3.2.3.3. MAP Estimation

One intuitive interpretation of Eq. (3.138) is that the prior p.d.f. p(\Phi) represents the relative likelihood before the values of X_1, X_2, \ldots, X_n have been observed, while the posterior p.d.f. p(\Phi \mid \mathbf{x}) represents the relative likelihood after the values of X_1, X_2, \ldots, X_n have been observed. Therefore, choosing an estimate \hat{\Phi} that maximizes the posterior probability is consistent with our intuition. This estimator is in fact the maximum posterior probability (MAP) estimator and is the most popular Bayesian estimator. The loss function associated with the MAP estimator is the so-called uniform loss function:

R(\Phi, \theta(\mathbf{x})) = \begin{cases} 0, & \text{if } |\theta(\mathbf{x}) - \Phi| \le \Delta \\ 1, & \text{if } |\theta(\mathbf{x}) - \Phi| > \Delta \end{cases} \quad \text{where } \Delta > 0   (3.154)

Now let's see how this uniform loss function results in MAP estimation. Based on the loss function defined above, the expected posterior loss function is:

E\left[ R(\Phi, \theta(\mathbf{x})) \mid \mathbf{x} \right] = P\left( |\theta(\mathbf{x}) - \Phi| > \Delta \mid \mathbf{x} \right) = 1 - P\left( |\theta(\mathbf{x}) - \Phi| \le \Delta \mid \mathbf{x} \right) = 1 - \int_{\theta(\mathbf{x}) - \Delta}^{\theta(\mathbf{x}) + \Delta} p(\Phi \mid \mathbf{x})\, d\Phi   (3.155)
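For the conjugate Gaussian example above, the posterior is itself Gaussian, so the MAP estimate coincides with the posterior mean \rho. The sketch below (illustrative data, prior mean, and variances) computes \rho and \tau^2 from their defining formulas and confirms the posterior mode by a grid search:

```python
import math

# Data and prior for the conjugate Gaussian example (illustrative values)
xs = [1.0, 1.4, 0.6, 1.2, 0.8]
sigma2 = 1.0         # known data variance
mu0, nu2 = 0.0, 2.0  # prior mean and prior variance

n = len(xs)
xbar = sum(xs) / n

rho = (sigma2 * mu0 + n * nu2 * xbar) / (sigma2 + n * nu2)  # posterior mean
tau2 = (sigma2 * nu2) / (sigma2 + n * nu2)                   # posterior variance

def posterior(phi):
    """Gaussian posterior density with mean rho and variance tau2."""
    return math.exp(-(phi - rho) ** 2 / (2.0 * tau2)) / math.sqrt(2.0 * math.pi * tau2)

# The MAP estimate: grid search for the posterior mode lands on rho
grid = [i / 1000.0 for i in range(-2000, 3001)]
phi_map = max(grid, key=posterior)
print(rho, tau2, phi_map)
```

Note how the posterior mean \rho shrinks the sample mean toward the prior mean, and the posterior variance \tau^2 is smaller than the prior variance: observing data always sharpens the Gaussian posterior.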