Probability Density Function Estimation by Different Methods


ENEE 739Q Spring 2002 Course Assignment Report

Vikas Chandrakant Raykar

Abstract — The aim of this assignment was to estimate the probability density function (PDF) of an arbitrary distribution from a set of training samples. PDF estimation was done using parametric (maximum likelihood estimation of a Gaussian model), non-parametric (histogram, kernel based and K-nearest neighbor) and semi-parametric methods (EM algorithm and gradient based optimization). The application of the EM algorithm to binary sequence estimation is also discussed.

(This report was written for ENEE 739Q, Spring 2002, as a part of the course project. The author is a graduate student at the Department of Electrical Engineering, University of Maryland, College Park, MD 20742 USA.)

I. INTRODUCTION

A Bayesian approach to pattern classification consists of feature extraction and classification. Feature extraction produces a lower dimensional feature vector from the pattern; once the feature vector is extracted, the pattern can be classified using the Bayes decision rule. Consider a C-class problem and let x be the feature vector extracted from the given input pattern. The decision rule can be stated as

    Decide C_i if p(C_i | x) > p(C_j | x) for all j ≠ i.   (1)

The posterior probability can be calculated using Bayes' theorem:

    p(C_i | x) = p(x | C_i) p(C_i) / p(x).   (2)

So the important part is the evaluation of the class-conditional density p(x | C_i) for each of the C classes. This is the training phase: for each class we have a set of feature vectors, called training samples, χ = {x_1, x_2, ..., x_N}, belonging to class C_i, and we estimate p(x | C_i) from these samples. To ease notation, p(x | C_i) is written simply as p(x); the rest of the discussion is with respect to one class only.

The different methods for PDF estimation can be classified as parametric, non-parametric and semi-parametric. In parametric methods the PDF is assumed to be of a standard form (generally Gaussian, Rayleigh or uniform), and the parameters of the assumed PDF are estimated using either ML estimation or Bayesian estimation. The non-parametric methods include histogram based methods, kernel based methods and the K-nearest neighbor method. In semi-parametric methods the given density is modeled as a combination of known densities, whose parameters are estimated either by gradient descent or by the Expectation Maximization (EM) algorithm.

Section II describes the example used to compare the various PDF estimation techniques and the performance measure used. Sections III, IV and V discuss the parametric, non-parametric and semi-parametric techniques respectively. Section VI concludes. Section VII discusses the application of the EM algorithm to binary sequence estimation.

II. PROGRAM DETAILS

A 2-dimensional feature vector was used in the program. Figure 1 shows the original density function used; the brightness of a pixel corresponds to the density value at that point. In our case the density function is uniform over the white region. We would like to estimate this density from a set of training samples drawn from it. Candidate samples were drawn from a uniform distribution over the entire range of the image, and a sample was retained if it belonged to the white region and discarded otherwise. In this way N random training samples were drawn.

Figure 1. Plot of the original PDF used.

A GUI was written in MATLAB 6.1 to estimate the PDF from these samples using the different methods; Figure 2 shows a snapshot of the GUI. Once a PDF was estimated, the method was evaluated using the Kullback-Leibler distance, computed as follows. First we draw M test samples from the image, called x_test, and evaluate the estimated PDF p_eval(x_i) at each of the M points. Let p(x_i) be the original PDF. The Kullback-Leibler distance D is then defined as

    D = Σ_{x_i ∈ x_test} p(x_i) ln( p(x_i) / p_eval(x_i) ).   (3)
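As a concrete illustration of this performance measure, here is a minimal NumPy sketch of eq. (3), including the small-value floor used whenever the estimate is zero (as described below); the function and variable names are illustrative, not taken from the report's MATLAB GUI:

    import numpy as np

    def kl_distance(p_true, p_est, eps=1e-12):
        """Kullback-Leibler distance of eq. (3), evaluated at the M test points.

        p_true -- original PDF values p(x_i) at the test points
        p_est  -- estimated PDF values p_eval(x_i) at the same points
        eps    -- floor used whenever the estimate is zero
        """
        p_true = np.asarray(p_true, dtype=float)
        p_est = np.maximum(np.asarray(p_est, dtype=float), eps)
        return float(np.sum(p_true * np.log(p_true / p_est)))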

Although D does not satisfy the triangle inequality and is therefore not a true metric, it satisfies many important mathematical properties: for example, it is a convex function of p_eval(x_i), it is always nonnegative, and it equals zero only if p_eval(x_i) = p(x_i). For the iterative algorithms, D was plotted as a function of the iteration number. Whenever p_eval(x_i) was zero, D was evaluated by setting p_eval(x_i) to a very small value. Note that it does not quite make sense to use this measure to compare different methods, because the test points are chosen only where the original PDF is not zero. In ML estimation, for instance, we may get a good estimate in the region where the original PDF is not zero and a very bad estimate where it is zero, and the measure would never see the latter. The measure is, however, useful for studying the effect of changing the parameters of a given method.

Figure 2. A snapshot of the GUI.

III. PARAMETRIC ESTIMATION

In parametric estimation the PDF is assumed to have a known distributional form; in our case a bivariate Gaussian was used. The multivariate Gaussian has the form

    p(x) = 1 / ( (2π)^(d/2) |Σ|^(1/2) ) exp( -(1/2) (x - µ)^T Σ^(-1) (x - µ) ).   (4)

For the bivariate case d = 2, x is a 2-dimensional vector, µ is the 2 x 1 mean vector and Σ is the 2 x 2 covariance matrix. The parameters µ and Σ can be estimated using either Bayesian estimation or maximum likelihood (ML) estimation. Using the randomly drawn training samples χ = {x_1, x_2, ..., x_N}, the ML estimates of the mean and the covariance matrix are

    µ̂ = (1/N) Σ_{i=1}^{N} x_i,
    Σ̂ = (1/(N-1)) Σ_{i=1}^{N} (x_i - µ̂)(x_i - µ̂)^T,   (5)

where µ̂ and Σ̂ are the estimated mean vector and covariance matrix respectively. µ̂ is an unbiased, consistent estimate of the mean vector. Σ̂ is divided by N - 1 rather than N in order to make the covariance estimate unbiased; it is also consistent. Figure 3 shows the plot of the original and the estimated PDF for N = 500. The estimated PDF matches the original only in the mean and covariance sense: our basic assumption of modeling the distribution as a single bivariate Gaussian is not sufficient.

Figure 3. Original and estimated PDF using ML estimation.

Figure 4 shows the Kullback-Leibler distance as a function of N for 500 test points (i.e. M = 500). Increasing N beyond about 300 does not help much, since the model itself is essentially flawed.
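The ML estimates in eq. (5) and the density evaluation in eq. (4) translate directly into code; a minimal sketch, assuming the training samples are the rows of an N x 2 NumPy array (names are illustrative):

    import numpy as np

    def fit_gaussian_ml(X):
        """ML estimates of eq. (5): sample mean and unbiased covariance."""
        mu = X.mean(axis=0)
        Sigma = (X - mu).T @ (X - mu) / (len(X) - 1)  # divided by N-1, not N
        return mu, Sigma

    def gaussian_pdf(x, mu, Sigma):
        """Evaluate the multivariate Gaussian of eq. (4) at a point x."""
        d = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm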

Figure 4. Plot of D vs. N for ML estimation, M = 500.

IV. NON-PARAMETRIC METHODS

1. Histogram

In this approach the entire image is divided into a number of bins, and using the training samples χ = {x_1, x_2, ..., x_N} the PDF is computed as a histogram. This is a very direct and simple approach, and once the PDF is estimated the training data can be discarded. The disadvantages are that some information may be lost and that the method becomes computationally expensive in higher dimensions. The number of bins has to be chosen optimally: too many bins give a spiky PDF, while too few cause a significant loss of structure.

Figure 5 shows D as a function of the bin size for different N. Using this curve to decide the bin size does not really make sense, since we evaluate D only where the original PDF is not zero: because our original distribution is uniform, D keeps decreasing as the bin size increases, which need not be the case for a general PDF. The only safe conclusion is that the estimates improve as N increases. Figure 6 shows a representative histogram estimate. (A code sketch of this estimator, and of the kernel estimator described next, follows below.)

Figure 5. Bin size vs. Kullback-Leibler distance for different N, for the histogram based PDF estimator (500 test points).

Figure 6. Histogram estimate of the PDF.

2. A more principled approach

A more principled version of the histogram can be formulated as follows. Given the training samples χ = {x_1, x_2, ..., x_N}, let K samples lie inside a region R of volume V. Then the PDF at any point inside R is approximately

    p(x) ≈ K / (N V).

Kernel based methods fix V and find K; the K-nearest neighbor method fixes K and finds V. The advantage is that these methods avoid the histogram's explosion in the number of bins in higher dimensions; the price is that all the data must be kept in order to evaluate the PDF.

3. Kernel based methods

Here we fix the volume V of the region and count the samples it captures. The estimated PDF at a point x is

    p̂(x) = (1/N) Σ_{n=1}^{N} (1/h^d) H( (x - x_n) / h ),   (6)

where H(·) is the kernel function. In the simplest case H corresponds to a hypercube of side h: H((x - x_n)/h) = 1 if x falls inside the hypercube of side h centered at x_n, and 0 otherwise. The hypercube is basically a discontinuous kernel; instead of it we can choose a Gaussian kernel. The side h of the hypercube, or the standard deviation s of the Gaussian, is the smoothing parameter, and it has to be chosen optimally. If the smoothing parameter is too small the PDF is very patchy and N has to be very large to get a good estimate; if it is too large the PDF spreads out.

Figure 7 shows the Kullback-Leibler distance for the square kernel as a function of h for different N, with M = 500 test points. As can be seen from the plot, D initially decreases, reaches a minimum, and then increases again. The region where D decreases (i.e. the estimate improves) is where the squares become wide enough to overlap. From the plot, for N = 600 the optimal value of h is around 4 to 6. Figure 8 shows the estimated and the original PDF for N = 600 and h = 6 with the rectangular kernel.

Figure 9 shows the Kullback-Leibler distance for the Gaussian kernel as a function of the standard deviation s for different N, with 800 test points. Again D decreases as the smoothing parameter increases up to a certain point and then increases, and the optimal value of s can be read off the plot. Also, as N increases the curves shift downwards, which simply reflects that more training samples give a better estimate of the PDF. Figure 10 shows the estimated PDF for N = 500 with the Gaussian kernel.
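First the histogram: a minimal sketch, assuming the image domain is normalized to the unit square (the bin count, the range and the names are illustrative assumptions):

    import numpy as np

    def histogram_pdf(X, n_bins=20):
        """2-D histogram PDF estimate on [0,1]^2.

        density=True normalizes the counts so the histogram integrates to 1.
        Returns the bin densities and the bin edges.
        """
        H, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=n_bins,
                                           range=[[0, 1], [0, 1]], density=True)
        return H, xedges, yedges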

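And a minimal sketch of the kernel estimator of eq. (6), with both the hypercube window and a Gaussian kernel, evaluated by a direct pass over the training points (names are illustrative):

    import numpy as np

    def parzen_pdf(x, X, h, kernel="gauss"):
        """Kernel density estimate of eq. (6) at a single point x.

        X      -- N x d array of training samples
        h      -- smoothing parameter (cube side, or Gaussian std dev)
        kernel -- "cube" for the hypercube window, "gauss" for a Gaussian
        """
        N, d = X.shape
        u = (x - X) / h                              # scaled offsets (x - x_n)/h
        if kernel == "cube":
            k = np.all(np.abs(u) <= 0.5, axis=1).astype(float)  # 1 inside cube
        else:
            k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
        return k.sum() / (N * h**d)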
Figure 7. Plot of h vs. Kullback-Leibler distance for different N, for the rectangular kernel PDF estimator (500 test points).

Figure 8. Estimated PDF using the rectangular kernel method for N = 600 and h = 6.

Figure 9. Plot of s vs. Kullback-Leibler distance for different N, for the Gaussian kernel PDF estimator (800 test points).

Figure 10. Estimated PDF using the Gaussian kernel method for N = 500.

4. K nearest neighbor

In this method K is fixed and V is varied: around each query point we essentially search for the K nearest training samples and take the volume they occupy. K has to be chosen optimally for a good estimate of the PDF. Figure 11 shows the Kullback-Leibler distance as a function of K for different N. Figure 12 shows the estimated PDF for N = 300.

Figure 11. Plot of K vs. Kullback-Leibler distance for different N, for the K-nearest-neighbor PDF estimator (500 test points).

Figure 12. Estimated PDF using the K-nearest-neighbor method for N = 300.
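A minimal sketch of this estimate, taking V to be the area of the 2-D disk whose radius is the distance to the K-th nearest training sample (one common instantiation, assumed here; brute-force distances, illustrative names):

    import numpy as np

    def knn_pdf(x, X, K):
        """K-nearest-neighbor density estimate p(x) = K/(N V) at a 2-D point x."""
        dists = np.sort(np.linalg.norm(X - x, axis=1))
        r = dists[K - 1]              # radius enclosing the K nearest samples
        V = np.pi * r**2              # volume (area) of the 2-D disk
        return K / (len(X) * V)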

V. SEMI-PARAMETRIC METHODS

These methods combine the flexibility of the non-parametric methods with the efficiency in evaluation of the parametric methods. Here the PDF is modeled as a mixture of parametric PDFs, whose parameters are estimated either by an optimization technique such as gradient descent or by the Expectation Maximization (EM) algorithm.

1. EM Algorithm

The convergence of the EM algorithm was studied as a function of the number of iterations. Figure 13 shows the Kullback-Leibler distance for N = 500 training samples and 500 test points, for different numbers of mixture components M, as a function of the iteration number. The EM algorithm converges in three to six iterations, and D decreases as M increases.

Figure 13. Kullback-Leibler distance for N = 500 and 500 test points, for different M (number of mixture components), as a function of the iteration number.

Figure 14 shows the Kullback-Leibler distance for M = 10 component densities and 500 test points, for different N, as a function of the iteration number. Increasing N has no effect on the speed of convergence, but the Kullback-Leibler distance decreases as N increases.

Figure 14. Kullback-Leibler distance for M = 10 and 500 test points, for different N, as a function of the iteration number.

Figure 15 shows the log likelihood for N = 500 and 500 test points as a function of the iteration number, for different M. The log likelihood increases steadily until it converges.

Figure 15. Log likelihood for N = 500 and 500 test points as a function of the iteration number, for different M.

The EM algorithm also depends on the initialization strategy, which in turn affects the number of iterations required to converge: if the initial component means lie within the uniform region of the PDF, the EM algorithm converges very fast. In most runs, irrespective of the initialization strategy, the EM algorithm converged in 5 to 10 steps. Figure 16 shows the initial and the final positions of the Gaussians.
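The report uses the standard EM updates for a Gaussian mixture; for reference, here is a compact sketch of that textbook iteration for isotropic components (a sketch under those assumptions, not necessarily the author's exact MATLAB implementation). The estimated density is then p(x) = Σ_k w_k N(x; µ_k, var_k I):

    import numpy as np

    def em_gmm(X, M, n_iter=20, seed=0):
        """Standard EM updates for a mixture of M isotropic Gaussians."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        mu = X[rng.choice(N, M, replace=False)]   # means start at random samples
        var = np.full(M, X.var())                 # one variance per component
        w = np.full(M, 1.0 / M)                   # mixing weights
        for _ in range(n_iter):
            # E step: responsibilities gamma[i, k] = P(component k | x_i)
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)        # N x M
            logp = np.log(w) - 0.5 * d * np.log(2 * np.pi * var) - d2 / (2 * var)
            logp -= logp.max(axis=1, keepdims=True)                     # stability
            gamma = np.exp(logp)
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M step: re-estimate weights, means and variances
            Nk = gamma.sum(axis=0)
            w = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
            var = (gamma * d2).sum(axis=0) / (d * Nk)
        return w, mu, var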

Figure 16. Initial Gaussians, their final positions after 5 iterations, and the estimated PDF, for M = 10 components and N = 500.

2. Gradient Descent Optimization

The negative log likelihood function can also be minimized by the gradient descent method. The minimization is with respect to the parameters: the means and variances of the Gaussians and the mixing parameters. The descent rate for each of the parameters was obtained by a trial and error approach; in this case the step size used for the means was 0.9, for the sigmas 0.3, and a smaller value for the mixing parameters.

Figure 17 shows the Kullback-Leibler distance for N = 500 and 500 test points, for different M (number of mixture components), as a function of the iteration number. As can be seen from the plot, the PDF converges only after considerably more iterations; the convergence is very slow compared to the EM algorithm. Convergence is also very sensitive to the descent rates, which were chosen for the means, variances and mixing parameters by trial and error; by choosing them more carefully one could presumably obtain faster convergence.

Figure 17. Kullback-Leibler distance for N = 500 and 500 test points, for different M (number of mixture components), as a function of the iteration number.

Figure 18 shows the Kullback-Leibler distance for M = 10 and 500 test points, for different N, as a function of the iteration number.

Figure 18. Kullback-Leibler distance for M = 10 and 500 test points, for different N, as a function of the iteration number.
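For comparison, one way to set up this gradient descent: descent on the negative log likelihood of an isotropic mixture, using the report's step sizes of 0.9 for the means and 0.3 for the sigmas, with an assumed 0.1 for the mixing weights. The softmax parametrization keeping the weights on the simplex is also an assumption; the gradients follow from the standard responsibility form:

    import numpy as np

    def gd_gmm(X, M, n_iter=500, lr_mu=0.9, lr_s=0.3, lr_w=0.1, seed=0):
        """Gradient descent on the negative log likelihood of an isotropic GMM."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        mu = X[rng.choice(N, M, replace=False)]
        s = np.full(M, X.std())            # per-component standard deviation
        a = np.zeros(M)                    # weight logits, w = softmax(a)
        for _ in range(n_iter):
            w = np.exp(a) / np.exp(a).sum()
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)       # N x M
            logN = -0.5 * d * np.log(2 * np.pi * s**2) - d2 / (2 * s**2)
            logN -= logN.max(axis=1, keepdims=True)                    # stability
            p = w * np.exp(logN)
            gamma = p / p.sum(axis=1, keepdims=True)                   # responsibilities
            # gradients of the negative log likelihood w.r.t. mu, s and logits
            g_mu = -(gamma[:, :, None] * (X[:, None, :] - mu)).sum(0) / s[:, None]**2
            g_s = -(gamma * (d2 / s**3 - d / s)).sum(0)
            g_a = N * w - gamma.sum(0)
            mu -= lr_mu * g_mu / N         # 1/N scaling tames the step sizes
            s = np.maximum(s - lr_s * g_s / N, 1e-3)
            a -= lr_w * g_a / N
        return np.exp(a) / np.exp(a).sum(), mu, s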

Figure 19 shows the log likelihood for N = 500 and 500 test points as a function of the iteration number, for different M. Again the log likelihood increases steadily until it converges.

Figure 19. Log likelihood for N = 500 and 500 test points as a function of the iteration number, for different M.

Figure 20. Estimated PDF obtained by gradient descent for M = 10 components and N = 500.

Note that the EM algorithm gives a better PDF than gradient descent for the same number of components.

VI. CONCLUSION

Use a kernel based method to minimize the computational requirements (though the training data must then be kept), and use the EM algorithm to minimize both the memory and the computational requirements.

VII. EXTRA CREDIT II

The following section discusses the application of the EM algorithm to binary sequence estimation [3]. Consider the system shown in Figure 21. B is a binary sequence of length N, B = [b_1, b_2, ..., b_N], where each b_i can be a one or a zero; a typical realization for N = 5 could be [1 0 0 1 0]. The binary sequence is scaled by a fixed, unknown, non-zero scalar c, and is then corrupted by additive white Gaussian noise:

    y_i = c b_i + z_i,   i.e.   Y = c B + Z,

where Y = [y_1, ..., y_N], Z = [z_1, ..., z_N], and the z_i are i.i.d. zero-mean Gaussian random variables with variance σ², i.e. z_i ~ N(0, σ²).

Figure 21. A simple channel: B is scaled by c and the noise Z is added to give Y.

The problem is to obtain an ML estimate of B; note that c is unknown. The ML estimation problem can be formulated as follows:

    p(y_i | b_i = 0) = N(y_i; 0, σ²),
    p(y_i | b_i = 1) = N(y_i; c, σ²),

so the ML estimator sets b_i = 1 if p(y_i | b_i = 1) > p(y_i | b_i = 0), and b_i = 0 otherwise. Simplifying (for c > 0):

    b_i = 1 if y_i > c/2, else b_i = 0.

Here the value of c is unknown, even though it is a fixed quantity. We can use the EM algorithm by defining the new complete data X = (Y, C); the E step gives an estimate of c, which can then be used in the M step.

E STEP: Let D be the current estimate of B. Then

    Q(B | D) = E[ log p(Y, C | B) | Y, D ].

Since c is not a random variable, p(Y, C | B) reduces to p(Y | c, B) = Π_i N(y_i; c b_i, σ²). Simplifying, and dropping terms that do not depend on B,

    Q(B | D) = -Σ_i (y_i - a b_i)²,   where a = E[c | Y, D].

The entries of D which are 0 provide no information about c. Let D_1 be the subset of entries of D which are 1, and let Y_1 be the corresponding subset of Y.

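Combining the E-step estimate a = E[c | Y, D] with the threshold detector derived above gives the complete iteration, which is summarized formally below. A minimal NumPy sketch, stopping when the estimate of c no longer changes, as in the simulation described at the end of this section (names are illustrative):

    import numpy as np

    def em_binary_sequence(y, n_iter=10):
        """EM estimation of a binary sequence from y = c*b + noise, c unknown."""
        b = np.ones(y.shape, dtype=bool)   # step 1: initialize B = [1 1 ... 1]
        a = 0.0
        for _ in range(n_iter):
            if not b.any():                # degenerate case: no bits currently 1
                break
            a_new = y[b].mean()            # E step: a = E[c | Y, D] = mean of Y_1
            if np.isclose(a_new, a):       # stop when the estimate of c settles
                break
            a = a_new
            b = (y > a / 2) if a > 0 else (y < a / 2)   # M step / ML detector
        return b.astype(int), a

    # Example run with illustrative values: N = 1000 bits, c = 3, sigma = 1
    rng = np.random.default_rng(0)
    b_true = rng.integers(0, 2, 1000)
    y = 3.0 * b_true + rng.normal(0.0, 1.0, size=1000)
    b_hat, c_hat = em_binary_sequence(y)   # c_hat should approach 3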
The E step can thus be summarized as

    a = E[c | Y, D] = E[Y_1],

where Y_1 is the subset of Y values associated with the current estimates of B which are 1s, and

    Q(B | D) = -Σ_i (y_i - a b_i)².

M STEP: Find B to maximize Q(B | D). Since the terms are independent, we can maximize each term individually. Considering a typical term -(y_i - a b_i)², the maximizing choice is: set b_i(new) = 1 if (a > 0 and y_i > a/2) or (a < 0 and y_i < a/2); otherwise set b_i(new) = 0.

ALGORITHM:

1. Initialize B_old = [1 1 ... 1].
2. E step: a = E[Y_1], where Y_1 is the subset of Y associated with the current estimates of B which are 1s.
3. M step: for each b_i, y_i belonging to B_old and Y, set b_i(new) = 1 if (a > 0 and y_i > a/2) or (a < 0 and y_i < a/2); otherwise set b_i(new) = 0.
4. Iterate till convergence.

SIMULATION: The simulation was done for N = 1000 and c = 3. The algorithm converged in about 2 to 3 iterations; convergence was declared when there was no further improvement in the value of the estimated c. Figure 22 shows the error in the estimation of B as a function of the iteration number for different σ; it can be seen that the algorithm converges in about 2 to 3 iterations. Figure 23 shows the estimated value of c as a function of the iteration number.

Figure 22. Error in the estimation of B as a function of the iteration number, for different σ.

Figure 23. Estimated value of c as a function of the iteration number.

REFERENCES

[1] Z. Ghahramani and M. I. Jordan, "Supervised learning from incomplete data via an EM approach," in J. D. Cowan, G. Tesauro and J. Alspector, eds., Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann, 1994, pp. 120-127. (http://citeseer.nj.nec.com/ghahramani94supervised.html)
[2] J. Bilmes, "A Gentle Tutorial of the EM Algorithm," U.C. Berkeley, TR-97-021, 1998. (http://citeseer.nj.nec.com/bilmes98gentle.html)
[3] C. N. Georghiades and J. C. Han, "Sequence Estimation in the Presence of Random Parameters Via the EM Algorithm," IEEE Transactions on Communications, vol. 45, pp. 300-308, March 1997.

Vikas Chandrakant Raykar is a graduate student at the University of Maryland, College Park, MD 20742 USA (e-mail: vikas@umiacs.umd.edu).