Image Classification Using EM and JE Algorithms


Xiaojin Shi
Department of Computer Engineering, University of California, Santa Cruz, CA 95064
jennifer@soe.ucsc.edu

CS242: Machine Learning, Fall 2002. Dec 4, 2002

Abstract

Parameter estimation is one of the most important problems in image classification research. Since the parameters that fit the data best can rarely be obtained directly, they are usually found iteratively. The EM algorithm is the typical iterative method and has led this field for several decades. Recently, Yoram Singer and Manfred K. Warmuth proposed another method for this problem, the JE algorithm, which has two versions: a batch case and an online case. The purpose of this project is to compare the performance of the three algorithms, EM, batch JE, and online JE, on the image classification problem.

Keywords: image classification, EM algorithm, JE algorithm, online algorithm, Gaussian mixture.

1. Introduction

Image object classification is an important research topic in pattern recognition. The goal of this work is, given an image, to let the computer extract the different objects automatically. That is, we want the computer to have a kind of intelligence, so that an image is no longer just a set of numbers to it, but a picture that contains several different objects, and the computer can understand the difference between those objects. This work surely has many tempting applications, but how to achieve the goal is a big question.

Generally speaking, the steps of an image classification system are shown in figure 1. As the figure makes clear, the quality of the classifiers is the crucial part of the whole problem. A good classifier should have two properties. First, it must fit the training data well: the test data will be similar to the training data, so if the classifier fits the training data well, it is reasonable to expect it to fit the test data well too. Second, the classifier must be robust: although the training data is similar to the test data, the two are not identical, and the classifier should tolerate the small differences between the datasets. Only a classifier with both qualities can perform well in the image classification problem.

[Figure 1: the image classification pipeline. Training: training dataset -> feature extraction -> training / parameter estimation -> classifiers 1 ... n. Testing: test dataset -> feature extraction -> classification.]

Because of the nature of probability, classifiers based on probability theory should be robust to noise. This is why parameter estimation methods are very popular in supervised learning.

The basic idea of parameter estimation methods is the following. We assume each class has an underlying distribution that generates its data, and we model that distribution with a known or well-studied family (for example, a Gaussian mixture); the only unknown factor is then the set of parameters that defines the distribution (for example, the mean and covariance matrix of a Gaussian). Training thus turns into a parameter estimation problem. Since it is generally not possible to obtain the best parameters directly, some iterative algorithm is used to find a result. If we model the class distribution as a Gaussian mixture, the EM algorithm is the typical method, and people have used it for several decades. Recently, Yoram Singer and Manfred K. Warmuth proposed another parameter estimation method for Gaussian mixtures, the JE algorithm. The purpose of this project is to compare the performance of the EM and JE algorithms.

In this project I use two kinds of datasets. The first kind is synthetic: I generate several datasets from known Gaussian mixture distributions and run the two algorithms over them. Since the intended result of the estimation is known exactly, these datasets allow a direct comparison of the two approaches. Two aspects of performance are compared. The first is accuracy: the error between the estimate and the original distribution shows whether the result is accurate enough. The second is the number of iterations needed to converge. The convergence rate matters because all these algorithms are iterative and must run many times to reach the final result; the more iterations, the more computation time, and faster estimation would enable much broader applications.

The second kind of dataset is real image data, the kind actually encountered in image classification work, provided by professor Roberto Manduchi. The datasets are labeled (so this is supervised training, with training data available for each class). For each class, a Gaussian mixture models the underlying distribution. Running Monte Carlo cross validation over the labeled data of each class gives the number of clusters (modes) of that class, so the number of Gaussians in each mixture is known in advance; the only remaining problem is to estimate the parameters.
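To make the setting concrete, here is a minimal sketch of such a per-class mixture classifier. It uses scikit-learn's GaussianMixture purely as a stand-in for the estimators compared in this report, and all names (train_classifiers, classify, the dictionaries) are mine, not the report's:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_classifiers(features_by_class, n_modes_by_class, seed=0):
        """Fit one Gaussian mixture per class; the number of modes per class
        comes from a model-selection step such as the Monte Carlo cross
        validation mentioned above."""
        models = {}
        for label, X in features_by_class.items():
            gmm = GaussianMixture(n_components=n_modes_by_class[label],
                                  covariance_type="full", random_state=seed)
            models[label] = gmm.fit(X)
        return models

    def classify(models, X):
        """Assign each feature vector to the class whose mixture density is
        highest (a uniform class prior is assumed here)."""
        labels = list(models)
        log_dens = np.stack([models[l].score_samples(X) for l in labels], axis=1)
        return np.asarray(labels)[np.argmax(log_dens, axis=1)]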

Although for the real data we do not know the true values of the parameters, the final classification results give a general idea of whether the estimation is good enough: if the classification results are very good, we can reasonably conclude that the parameter estimates are good. Since these are the real data we encounter in practice, the results are meaningful; they show the robustness of the two algorithms and whether the algorithms are useful in real applications.

The remaining sections are organized as follows. Section 2 describes the three algorithms used in this project and compares them. Section 3 describes the experimental datasets and results. Section 4 discusses the results given in Section 3. Section 5 raises several open problems and outlines future work. Section 6 gives the conclusion, and Sections 7 and 8 contain the acknowledgment and references.

2. Algorithm description

Three algorithms are used in this project: the EM (Expectation Maximization) algorithm; the JE (Joint Entropy update) algorithm in the batch case, provided by Yoram Singer and Manfred K. Warmuth; and the JE algorithm in the online case. A brief description of each follows.

EM algorithm

For the parameter estimation problem of Gaussian mixtures, the EM algorithm is almost a standard. Although there are many different ways to derive its update formulas, here I use the method introduced in the machine learning class.

Let H denote the hidden variables and V the visible variables. Our purpose is to find the set of parameters \tilde{\Theta} that minimizes the negative log likelihood of the visible data, i.e.

    \tilde{\Theta} = \arg\min_{\Theta} \left( -\ln P(V \mid \Theta) \right).

Since minimizing this quantity directly is very hard, we minimize instead

    \sum_H P(H \mid V, \Theta^{old}) \ln \frac{P(H \mid V, \Theta^{old})}{P(H \mid V, \Theta^{new})} + \eta \left( -\ln P(V \mid \Theta^{new}) \right),

that is, the relative entropy between the old and new posterior over the hidden variables plus \eta times the loss. Setting \eta = 1 and setting the derivatives with respect to the parameters to zero gives the EM update rules, where p(i \mid x_n, \Theta^{old}) is the posterior probability that point x_n came from component i:

    w_i^{new} = \frac{1}{N} \sum_{n=1}^{N} p(i \mid x_n, \Theta^{old})

    \mu_i^{new} = \frac{\sum_{n=1}^{N} p(i \mid x_n, \Theta^{old}) \, x_n}{\sum_{n=1}^{N} p(i \mid x_n, \Theta^{old})}

    \Sigma_i^{new} = \frac{\sum_{n=1}^{N} p(i \mid x_n, \Theta^{old}) (x_n - \mu_i^{new})(x_n - \mu_i^{new})^T}{\sum_{n=1}^{N} p(i \mid x_n, \Theta^{old})}
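The three update rules transcribe directly into numpy. The sketch below is mine; the variable names and the use of scipy for the component densities are assumptions, not part of the report:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, w, mu, Sigma):
        """One EM iteration for a Gaussian mixture.
        X: (N, d) data; w: (m,) weights; mu: (m, d) means; Sigma: (m, d, d)."""
        N = len(X)
        m = len(w)
        # E-step: responsibilities r[n, i] = p(i | x_n, theta_old)
        r = np.stack([w[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                      for i in range(m)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances
        Nk = r.sum(axis=0)                    # effective count per component
        w_new = Nk / N
        mu_new = (r.T @ X) / Nk[:, None]
        Sigma_new = np.empty_like(Sigma)
        for i in range(m):
            D = X - mu_new[i]
            Sigma_new[i] = (r[:, i, None] * D).T @ D / Nk[i]
        return w_new, mu_new, Sigma_new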

JE algorithm

The JE algorithm is an alternative method for estimating the parameters of a Gaussian mixture, provided by Yoram Singer and Manfred K. Warmuth. The motivation is probably this: the quantity minimized in the EM algorithm can be regarded as a distance between the new and old parameters plus a loss term, and using a different kind of distance yields a different set of update formulas. The distance used by the JE algorithm is the relative entropy between the old and new joint density of the observed and hidden variables (note that in the EM algorithm, the distance is the relative entropy between the old and new density of the hidden variables only). Skipping the derivations (please see [1] for details), the update rules for the JE algorithm in the batch case are:

    \beta_i(x_n) = \frac{w_i^{old} P(x_n \mid \Theta_i^{old})}{\sum_{j=1}^{m} w_j^{old} P(x_n \mid \Theta_j^{old})}

    w_i^{new} = \frac{w_i^{old} \exp\left( \eta \sum_n \beta_i(x_n) \right)}{\sum_{j=1}^{m} w_j^{old} \exp\left( \eta \sum_n \beta_j(x_n) \right)}

    \mu_i^{new} = \mu_i^{old} + \eta \sum_n \beta_i(x_n) (x_n - \mu_i^{old})

    (\Sigma_i^{new})^{-1} = (\Sigma_i^{old})^{-1} + \eta \sum_n \beta_i(x_n) \left[ (\Sigma_i^{old})^{-1} - (\Sigma_i^{old})^{-1} (x_n - \mu_i^{old})(x_n - \mu_i^{old})^T (\Sigma_i^{old})^{-1} \right]

JE online algorithm

Another merit of the JE algorithm is that it is easily modified into an online algorithm, which processes one data point x_t at a time with a time-varying learning rate \eta_t. The only requirement for the algorithm to converge is

    \sum_{t=1}^{\infty} \eta_t^2 < \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t = \infty.

The online updates are:

    \beta_i(x_t) = \frac{w_i^{old} P(x_t \mid \Theta_i^{old})}{\sum_{j=1}^{m} w_j^{old} P(x_t \mid \Theta_j^{old})}

    w_i^{new} = \frac{w_i^{old} \exp\left( \eta_t \beta_i(x_t) \right)}{\sum_{j=1}^{m} w_j^{old} \exp\left( \eta_t \beta_j(x_t) \right)}

    \mu_i^{new} = \mu_i^{old} + \eta_t \beta_i(x_t) (x_t - \mu_i^{old})

    (\Sigma_i^{new})^{-1} = (\Sigma_i^{old})^{-1} + \eta_t \beta_i(x_t) \left[ (\Sigma_i^{old})^{-1} - (\Sigma_i^{old})^{-1} (x_t - \mu_i^{old})(x_t - \mu_i^{old})^T (\Sigma_i^{old})^{-1} \right]
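A minimal numpy sketch of one online step, under my reading of the update rules as reconstructed above (the authoritative form is in [1]); note that the precision matrix, i.e. the inverse covariance, is the quantity carried between steps:

    import numpy as np

    def gauss_pdf_from_precision(x, mu, P):
        """Gaussian density written in terms of the precision matrix P
        (the inverse covariance), so no inversion is needed per step."""
        d = x - mu
        _, logdet = np.linalg.slogdet(P)
        return np.exp(0.5 * logdet - 0.5 * len(x) * np.log(2.0 * np.pi)
                      - 0.5 * d @ P @ d)

    def je_online_step(x, w, mu, P, eta_t):
        """One online JE update on a single point x.
        w: (m,) weights; mu: (m, d) means; P: (m, d, d) precision matrices."""
        m = len(w)
        dens = np.array([gauss_pdf_from_precision(x, mu[i], P[i])
                         for i in range(m)])
        beta = w * dens / np.dot(w, dens)     # beta_i(x_t), sums to 1
        w_new = w * np.exp(eta_t * beta)      # exponentiated-gradient step
        w_new /= w_new.sum()
        mu_new = np.empty_like(mu)
        P_new = np.empty_like(P)
        for i in range(m):
            d = x - mu[i]
            mu_new[i] = mu[i] + eta_t * beta[i] * d
            Pd = P[i] @ d
            # additive update of the precision (inverse covariance)
            P_new[i] = P[i] + eta_t * beta[i] * (P[i] - np.outer(Pd, Pd))
        return w_new, mu_new, P_new

In a full run, x would range over the (interleaved) training points and eta_t would follow a decreasing schedule such as the one sketched in Section 4.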

3. Experiment results

In order to compare the performance of the three algorithms, I set up four sets of experiments, each based on several datasets. Experiments 1 and 2 compare the EM algorithm against the batch JE algorithm; experiments 3 and 4 compare the batch JE algorithm against the online JE algorithm. In all experiments, a K-means algorithm provides the initial values.

Experiment 1

The purpose of experiment 1 is to compare the parameter estimation performance of EM and JE on synthetic data. Since the distribution of the synthetic data is known exactly, both the accuracy and the convergence speed of the two algorithms can be compared. I generated several different datasets, but for lack of space I describe only three of them.

The first dataset is a 3-mode 1D Gaussian mixture of 65536 points. To show the accuracy of the estimates, I plot the original distribution, the distribution estimated by EM, and the distribution estimated by JE in the same figure; see figure 1 for the result. Both results are very good, but the convergence speeds differ. Figure 2 plots the convergence curves of the algorithms and shows that the learning rate \eta has a very big effect on the convergence speed; a detailed discussion of this effect appears in Section 4.
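The synthetic sets can be produced along the following lines (a sketch; the particular weights, means, and standard deviations below are illustrative, not the ones used in the report):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_gmm_1d(n, w, mus, sigmas):
        """Draw n points from a 1D Gaussian mixture: pick a component for
        each point according to the weights w, then sample from it."""
        comp = rng.choice(len(w), size=n, p=w)
        return rng.normal(np.take(mus, comp), np.take(sigmas, comp))

    # e.g. a 3-mode 1D mixture with 65536 samples, as in experiment 1
    X = sample_gmm_1d(65536, w=[0.2, 0.5, 0.3],
                      mus=[-5.0, 0.0, 6.0], sigmas=[1.0, 2.0, 1.5])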

[Figure 1: original vs. estimated densities for the 1D dataset. Figure 2: convergence curves.]

The second dataset is a 2-mode 2D Gaussian mixture, again of size 65536. Figure 3 shows the original distribution, and figure 4 shows the result estimated by JE. The third dataset is a 3-mode 3D Gaussian mixture, also of size 65536. Table 1 compares the original distribution with the experimental results. Both algorithms achieve good accuracy, but JE (in the best case) needs fewer iterations to converge, i.e., to reach the same accuracy as EM.

[Figure 3: the original 2D mixture. Figure 4: the JE estimate.]

Experiment 2

This experiment uses the real image datasets. Because real images do not follow any particular distribution exactly, the estimated parameters are only an approximation to the real image data, so some estimation error is unavoidable. The purpose of this experiment is to test the performance of the two algorithms in a real-world application and to see their robustness. Figures 5 and 6 show example images from the two real image datasets. The datasets are labeled, so this is a supervised learning problem; examples of the labeled data are shown in figures 7 and 8.

[Table 1: the original mixture weights w, means \mu, and covariance matrices \Sigma of the 3-mode 3D dataset, alongside the EM and JE estimates and the number of iterations n each algorithm needed. Both algorithms recover the parameters closely; the original weights w = (0.2, 0.5, 0.3), for example, are matched by both estimates (JE lists the components in a different order).]

[Figure 5 and Figure 6: example images from the two real image datasets. Figure 7 and Figure 8: examples of the labeled data.]

In this experiment I first use all the labeled image data to train a classifier, then use the trained classifier to classify all the images and compute the classification error rate over the training data. In this way only the parameter estimation problem is involved (no prediction), and the error rate partly reflects the quality of the parameter estimates produced by the two algorithms.
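The evaluation just described amounts to the following loop (a sketch; predict stands for any trained classifier, e.g. the hypothetical classify(models, X) from the Section 1 sketch):

    import numpy as np

    def training_error_rate(predict, features_by_class):
        """Classify the training data itself and report the fraction
        misclassified, as in experiment 2 (fit quality only, no prediction)."""
        wrong = total = 0
        for label, X in features_by_class.items():
            pred = np.asarray(predict(X))
            wrong += int(np.sum(pred != label))
            total += len(X)
        return wrong / total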

Tables 2 and 3 show, for each class, the number of iterations needed to train the classifier, the total number of iterations, and the error rate.

[Table 2 and Table 3: per-class iteration counts, total iterations (normal case), and training error rates for EM and JE on the two datasets (four classes in the first, six in the second). EM diverged on some classes where JE converged; where both converged, the error rates were comparable, e.g. 24% for EM versus 23% for JE on the second dataset.]

Experiment 3

The purpose of this experiment is to compare the parameter estimation performance of the batch JE algorithm and the online JE algorithm. I tested the two algorithms on both synthetic data and real image data; for lack of space, only the result for one image dataset is shown. Figures 9 and 10 show the result on the real image dataset. The experiment is carried out in the same way as described in experiment 2.

[Figure 9 and Figure 10: batch vs. online JE on the real image dataset.]

Experiment 4

As mentioned before, a good classifier should be good not only at estimating parameters (i.e., at fitting the data well) but also at predicting images. The purpose of this experiment is to test the prediction ability of the classifiers generated by the two algorithms. The experiments are run over both real image datasets in a leave-one-out way: each time, one image of the dataset is left out as the test data and all the other images are used as the training dataset. This is done for every image, and the average error rate is taken as the final result.

[Table 4: average leave-one-out error rates of JE batch vs. JE online on the road and rock datasets; the online algorithm achieves the lower error on both.]

From Table 4 we can see that, when we try to predict images using the trained classifiers, the online JE algorithm achieves a much better result than the batch JE algorithm.
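The leave-one-out protocol described above might look like this (a sketch; fit and classify_image are hypothetical callables standing for training and per-image classification):

    import numpy as np

    def leave_one_out_error(images, fit, classify_image):
        """images: list of (features, labels) pairs, one per image.
        Each round holds one image out, trains on the rest, and scores
        the held-out image; the mean error over rounds is returned."""
        errors = []
        for i, (X_test, y_test) in enumerate(images):
            train = images[:i] + images[i + 1:]     # hold image i out
            model = fit(train)
            pred = classify_image(model, X_test)
            errors.append(np.mean(np.asarray(pred) != np.asarray(y_test)))
        return float(np.mean(errors))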

4. Discussion

Experiments 1 and 2: why does the batch JE algorithm outperform EM?

1. From the update rules of EM and batch JE we can see that the updates of the covariance matrix differ. The EM algorithm always updates the covariance matrix itself, but the batch JE algorithm updates the inverse of the covariance matrix instead. This means that JE only needs to compute a matrix inverse once, while EM must compute an inverse in every iteration. Because the inverse operation is numerically unstable, and noisy data easily makes the matrix singular, the EM algorithm sometimes gets stuck at a certain point; the batch JE algorithm avoids this problem.

2. The difference between the EM algorithm and the JE algorithm lies in their definitions of distance. In the EM algorithm the distance is taken over the hidden variables only, while in the JE algorithm it is taken over both the hidden and the visible variables. This seems a more reasonable definition.

Experiments 1 and 2: what is the effect of the learning rate on the JE algorithm?

The effect of the learning rate on the JE algorithm can be seen in figures 11 and 12. The learning rate has a big effect on the performance of the JE algorithm, especially on the convergence rate. With a good learning rate, convergence is much faster than EM; on the contrary, with a bad learning rate, the performance is much worse than the EM algorithm.
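For the online case, schedules of the following shape at least satisfy the convergence condition quoted in Section 2; the constants are illustrative and, as discussed here, still have to be tuned by experiment:

    def eta(t, a=1.0, b=10.0):
        """Learning-rate schedule for t = 1, 2, ...: the sum of eta(t)
        diverges while the sum of eta(t)**2 converges, the condition
        required for the online JE algorithm to converge. The constants
        a and b control the initial rate and the speed of decay."""
        return a / (b + t)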

[Figure 11 and Figure 12: the effect of the learning rate on JE convergence.]

Experiment 3: why is the performance of the online JE algorithm a little worse than the batch JE algorithm when only the parameter estimation problem is considered?

I think the main problem is that we lack a good way to tune the learning rate. The advantage of an online algorithm is that it can adapt the classifier to the incoming data, but when all the data is available at once this advantage cannot show up. The disadvantage of the online JE algorithm is that the learning rate must be tuned at every iteration, while in the batch case it needs to be tuned only once. So far, the learning rate is tuned purely by experiment, so mistakes are quite possible, and the more often it must be tuned, the more likely a bad learning rate is used at some point. In the pure parameter estimation problem only this disadvantage of online JE shows up, so it works a little worse. Still, the experiments show that online JE works pretty well; the difference between the batch and online JE algorithms is very small.

Experiment 4: why does the online JE algorithm outperform the batch JE algorithm in the real classification problem (including both parameter estimation over the training dataset and prediction over the test data)?

When we predict over the test dataset, we meet data that is similar but not identical to the training dataset, which means the classifier should be changed slightly according to that data. This is where the online JE algorithm shows its merit.

Although the online JE algorithm does not estimate parameters over the training dataset as well as the batch JE algorithm, when it receives the test data it can adapt itself to that data and fit it better, so it achieves the better result.

Possible improvements to the online JE algorithm

When implementing the online JE algorithm on real-world image data, several problems should be noticed. The first is how to deal with noise and outliers in the training dataset. Because the online algorithm processes the input data point by point, every point has some effect on the classifier. Noise or an outlier is generally quite different from the real data, so applying the plain update rule to it causes a big change to the classifier, and this change is not what we want. In my implementation, I estimate the difference between the new and old parameters at each iteration; if the difference is too big, I treat the point as noise and simply discard it. Since in image classification we always have enough data, discarding a few points has no big influence on the final result, and the performance of the online algorithm improves.

The second problem is how to arrange the sequence of the data. To make the algorithm converge, the learning rate must decrease over time, which means earlier data has more effect on the final result and later data less. This causes no trouble for the means and covariances, but it does cause trouble for the prior of each mode. Generally, image data is processed in the order of the images: first the first image, then the second, and so on. But data within the same image is very likely to be similar, which means it comes from the same mode of the mixture, say mode 1. If all the data belonging to mode 1 is processed first, the prior of mode 1 will be over-emphasized, leading to a wrong result in the end. A more reasonable way is to input the data according to a sequence of modes: first a point close to mode 1, then a point close to mode 2, and so on. This interleaving avoids the over-emphasis problem (a sketch is given below). The project was finished in this way and achieved very good results.
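One way to build such an interleaved ordering (a sketch; it assumes each training point has already been assigned to a mode, e.g. by the K-means initialization mentioned in Section 3):

    from itertools import chain, zip_longest
    import numpy as np

    def interleave_by_mode(X, mode_of):
        """Reorder the rows of X so consecutive samples come from different
        modes. mode_of assigns each row to a mixture mode; the groups are
        visited round-robin, with leftovers from larger groups at the end."""
        groups = [list(np.flatnonzero(mode_of == m)) for m in np.unique(mode_of)]
        order = [i for i in chain.from_iterable(zip_longest(*groups))
                 if i is not None]
        return X[order]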

5. Future work

After this project, there are still many things to be solved. First of all, how to tune the learning rate to the right value is a big problem. From the experiments we can see that the learning rate is crucial to the final result: if we find a good learning rate, we get much better performance than EM, but with a bad one the result can be even worse than EM, and sometimes the algorithm cannot converge at all. So far, my solution is to run the JE algorithm over a set of possible learning rates and to terminate a run whenever the log likelihood of the training data starts to decrease. But in this way the JE algorithm loses part of its merit, the fast convergence rate, so this is only a temporary solution. How to tune the learning rate is an open problem.

Second, for the online version of the JE algorithm, the learning rate decreases over time, which means earlier data has more effect on the final result and later data less, so how to arrange the best input data sequence becomes a problem. If we input the data image by image, it is very likely that the data corresponding to a certain mode will be contiguous, and if that data occurs earlier than the rest, the mode will be over-emphasized in the final result (since all of its points see a bigger learning rate). A more reasonable solution is to interleave data from different modes. So far this work is done manually; how to do it automatically is an open problem.

6. Conclusion

This project compares the performance of three parameter estimation algorithms: EM, batch JE, and online JE. From the experimental results we can see that, given a good learning rate, the batch JE algorithm outperforms the EM algorithm by a large margin. The online JE algorithm is a little worse than the batch JE algorithm at the pure parameter estimation problem, but it performs much better during prediction, so it achieves a much better result in the real image classification problem.

7. Acknowledgment

Many thanks to professor Manfred Warmuth, who gave me a lot of helpful advice during our discussions of this project, and to professor Roberto Manduchi, who provided the test datasets and also helped me a lot in understanding the problem and how to proceed.

8. References

[1] Yoram Singer and Manfred K. Warmuth. A New Parameter Estimation Method for Gaussian Mixtures.
[2] Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction.
[3] Jeff A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.
[4] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions.
[5] Jerome H. Friedman. On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality.
[6] Machine learning lecture notes 4, Fall 2002.