Model-based mixture discriminant analysis: an experimental study

Zohar Halbe and Mayer Aladjem
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev,
P.O. Box 653, Beer-Sheva 84105, Israel

Abstract

The subject of this paper is an experimental study of a discriminant analysis (DA) based on Gaussian mixture estimation of the class-conditional densities. Five parameterizations of the covariance matrices of the Gaussian components are studied. A recommendation for selecting a suitable parameterization of the covariance matrices is given.

Keywords: Discriminant analysis, Gaussian mixture model, Density estimation, Model selection.

1. Introduction

Discriminant analysis (DA) is a powerful technique for classifying observations into known, pre-existing classes. In the Bayesian decision framework [1] a common assumption is that the observed d-dimensional patterns x (x in R^d) are characterized by the class-conditional density f_c(x) for each class c = 1, 2, ..., C. Let P_c denote the prior probability of class c. According to Bayes' theorem, the posterior probability that an arbitrary observation x belongs to class c is

    P(Class_c | x) = P_c f_c(x) / Σ_{j=1}^{C} P_j f_j(x).

The classification rule that allocates x to the class having the highest posterior probability P(Class_c | x) minimizes the expected misclassification rate [1]. The latter rule is named the Bayes classification rule.

In this paper we assume that f_c(x) is a mixture of M_c normal (Gaussian) densities,

    f_c(x) = Σ_{j=1}^{M_c} π_cj N(x | µ_cj, Σ_cj).    (1)

Here the π_cj are mixing coefficients, which are non-negative and sum to one, and N(x | µ_cj, Σ_cj) denotes the multivariate normal density with mean vector µ_cj and covariance matrix Σ_cj. Usually the covariance matrices Σ_cj are taken to be full (unrestricted), diagonal or spherical. The fitting of the parameters π_cj, µ_cj and Σ_cj is carried out by maximizing the likelihood of the parameters given the training data {x_1, x_2, ..., x_n_c} of observations from class c.
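Once the class-conditional mixtures are available, applying the Bayes classification rule is mechanical. The following minimal Python sketch illustrates it, assuming NumPy and SciPy; the function names are ours, and the mixture parameters would come from the EM fitting described in Section 2:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_density(x, weights, means, covs):
        # f_c(x) = sum_j pi_cj N(x | mu_cj, Sigma_cj), eq. (1)
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
                   for w, m, S in zip(weights, means, covs))

    def bayes_classify(x, priors, class_mixtures):
        # allocate x to the class maximizing P_c f_c(x); the denominator
        # of the posterior is common to all classes and drops out of the argmax
        scores = [P * mixture_density(x, *mix)
                  for P, mix in zip(priors, class_mixtures)]
        return int(np.argmax(scores))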

The subject of this paper is an experimental study of a DA based on an extended parameterization of Σ_cj, proposed in [3] and named model-based DA [2]. In addition, we modify the model selection procedure proposed in [2], which overcomes the setting of a maximal number of components for the mixture (1). To the best of our knowledge this is the first extensive experimental study of the model-based DA.

The paper is organized as follows: in Section 2 we describe the model-based DA, experimental results are provided in Section 3, and conclusions are drawn in Section 4.

2. Model-based DA

Following [2, 3], the parameterizations of the covariance matrices Σ_cj are taken to be:

Σ_cj = λ_cj I, where I denotes the identity matrix and the λ_cj are adjustable parameters;
Σ_cj = λ_cj B_c and Σ_cj = λ_cj B_cj, where B_c and B_cj are adjustable diagonal matrices having positive diagonal elements;
Σ_cj = λ_cj D_c and Σ_cj = D_cj, where D_c and D_cj are adjustable positive definite symmetric matrices; D_c is restricted to have unit determinant (|D_c| = 1).

Note that Σ_cj = λ_cj I, λ_cj B_cj and D_cj are the spherical, diagonal and full covariance matrices, respectively, used in conventional mixture models (1). The other parameterizations, Σ_cj = λ_cj B_c and λ_cj D_c, define f_c(x) (1) having the same matrices B_c and D_c for all mixture components of f_c(x). The goal is to reduce the number ν_c of adjustable parameters of f_c(x) while ensuring flexibility of the mixture model by adjusting the parameters λ_cj for each component. Table 1 gives the expressions for ν_c for the parameterizations of Σ_cj used in this paper. (W_cj and n_cj are defined in the M-step below.)

Table 1: Expressions for the Σ_cj parameterizations and the number ν_c of adjustable parameters of f_c(x).

Σ_cj        λ_cj                       B_c (or B_cj)                                            D_c (or D_cj)                                  ν_c
λ_cj I      tr(W_cj)/(d n_cj)          -                                                        -                                              d(M_c+1) + M_c-1
λ_cj B_c    tr(W_cj B_c^-1)/(d n_cj)   diag(Σ_j λ_cj^-1 W_cj) / |diag(Σ_j λ_cj^-1 W_cj)|^(1/d)  -                                              d(M_c+1) + 2(M_c-1)
λ_cj B_cj   tr(W_cj B_cj^-1)/(d n_cj)  diag(W_cj) / |diag(W_cj)|^(1/d)                          -                                              2 M_c d + M_c-1
λ_cj D_c    tr(W_cj D_c^-1)/(d n_cj)   -                                                        (Σ_j λ_cj^-1 W_cj) / |Σ_j λ_cj^-1 W_cj|^(1/d)  2(M_c-1) + M_c d + d(d+1)/2
D_cj        -                          -                                                        W_cj / n_cj                                    M_c d + M_c-1 + M_c d(d+1)/2
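For concreteness, the parameter counts in the last column of Table 1 can be written as a small helper function. This is a sketch of our reading of the table; the model labels are illustrative and are not the paper's notation:

    def num_parameters(model, M, d):
        # nu_c for one class with M components in d dimensions (Table 1)
        if model == "lambda_I":          # spherical: Sigma_cj = lambda_cj I
            return d * (M + 1) + M - 1
        if model == "lambda_B_common":   # shared diagonal shape B_c
            return d * (M + 1) + 2 * (M - 1)
        if model == "lambda_B":          # per-component diagonal B_cj
            return 2 * M * d + M - 1
        if model == "lambda_D_common":   # shared full shape D_c, |D_c| = 1
            return 2 * (M - 1) + M * d + d * (d + 1) // 2
        if model == "D":                 # unrestricted full covariances D_cj
            return M * d + M - 1 + M * d * (d + 1) // 2
        raise ValueError("unknown model: %s" % model)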

For a predefined number of components M_c and parameterization of Σ_cj (described above), the parameters µ_cj, Σ_cj and π_cj are determined by an expectation-maximization (EM) algorithm proposed in [3]. The EM algorithm is initialized by the k-means clustering algorithm [1], which partitions the training data {x_1, x_2, ..., x_n_c} into M_c groups G_cj, j = 1, 2, ..., M_c. An indicator vector z_i = (z_i1, z_i2, ..., z_iM_c) is associated with each observation x_i from class c: z_ij = 1 for x_i in G_cj and z_ij = 0 otherwise, for i = 1, ..., n_c and j = 1, ..., M_c. The EM algorithm alternates between a maximization step (M-step) and an expectation step (E-step) until a convergence criterion is satisfied. Using the initial z_ij and the initial λ_cj = 1, for j = 1, ..., M_c, the following M- and E-steps are cycled:

M-step:

    n_cj = Σ_{i=1}^{n_c} z_ij,    π_cj = n_cj / n_c,    µ_cj = (1/n_cj) Σ_{i=1}^{n_c} z_ij x_i,

    W_cj = Σ_{i=1}^{n_c} z_ij (x_i - µ_cj)(x_i - µ_cj)^T,

and then calculate Σ_cj using the expressions in Table 1.

E-step:

    z_ij = π_cj N(x_i | µ_cj, Σ_cj) / Σ_{j'=1}^{M_c} π_cj' N(x_i | µ_cj', Σ_cj').

In the model-based DA [2], each combination of the number M_c of components of f_c(x) (1) and the parameterization of Σ_cj corresponds to a certain Gaussian mixture model. In [2] the Bayesian information criterion (BIC) is used for model selection. The BIC for f_c(x) (1) having M_c components is

    BIC(M_c) = -2 Σ_{i=1}^{n_c} log f_c(x_i) + ν_c log(n_c),    (2)

where f_c(x_i) is the mixture model (1) fitted to the training data by the EM algorithm. Using BIC(M_c) (2), the number M_c and the parameterization of Σ_cj are set by the following procedure. For each parameterization of Σ_cj:

(1) Set M_c = 0 and BIC(0) = ∞.
(2) Update M_c = M_c + 1. Apply the EM algorithm to fit f_c(x). Then compute BIC(M_c) (2).
(3) If BIC(M_c) > BIC(M_c - 1), set M_c = M_c - 1; otherwise repeat steps (2)-(3).

Finally, we select the model (the parameterization of Σ_cj and the number M_c of components) corresponding to the minimal value of BIC. This procedure is a modification of the original proposal in [2]. It overcomes the problem of setting the maximal number M_max of components required in [2] and reduces the computational complexity of the model selection.
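To make the fitting and selection procedure concrete, here is a compact Python sketch for the simplest parameterization, Σ_cj = λ_cj I, for which the M-step is closed-form, combined with the BIC search of steps (1)-(3). It assumes NumPy, SciPy and scikit-learn (KMeans for the initialization) and is an illustration of our reading of the procedure, not the authors' code:

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.cluster import KMeans

    def fit_spherical_mixture(X, M, n_iter=100, tol=1e-6):
        # EM for one class with Sigma_cj = lambda_cj * I, k-means initialization
        n, d = X.shape
        z = np.eye(M)[KMeans(n_clusters=M, n_init=10).fit_predict(X)]
        prev = -np.inf
        for _ in range(n_iter):
            # M-step: n_cj, pi_cj, mu_cj and lambda_cj = tr(W_cj)/(d n_cj)
            n_j = z.sum(axis=0) + 1e-10
            pi = n_j / n
            mu = (z.T @ X) / n_j[:, None]
            lam = np.array([np.sum(z[:, j] * ((X - mu[j]) ** 2).sum(axis=1))
                            for j in range(M)]) / (d * n_j)
            lam = np.maximum(lam, 1e-10)           # numerical floor
            # E-step: responsibilities z_ij
            dens = np.column_stack([
                pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=lam[j] * np.eye(d))
                for j in range(M)])
            log_lik = np.sum(np.log(dens.sum(axis=1) + 1e-300))
            z = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
            if log_lik - prev < tol:
                break
            prev = log_lik
        return (pi, mu, lam), log_lik

    def select_M_by_bic(X):
        # increase M_c until BIC(M_c) (eq. 2) stops decreasing, steps (1)-(3)
        n, d = X.shape
        best, prev_bic, M = None, np.inf, 0
        while True:
            M += 1
            params, log_lik = fit_spherical_mixture(X, M)
            nu = d * (M + 1) + M - 1               # Table 1, lambda_cj I row
            bic = -2.0 * log_lik + nu * np.log(n)
            if bic > prev_bic:
                return best, M - 1
            best, prev_bic = params, bic

The same loop would be run once per parameterization of Σ_cj, keeping the combination with the smallest BIC.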

3. Experiments

In this section we investigate the performance of the model-based DA. For comparison we ran the probabilistic principal component analysis (PPCA) mixture density estimation [4], which is a popular latent variable method in the pattern recognition literature. We carried out experiments on various data sets, artificial and real-world, from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html and from Gunnar Ratsch at http://www.first.gmd.de/~raetsch/data/banana.txt.

In addition to those benchmark data sets we composed a data set named MODIFIED LETTER, which merges the classes (26 letters) of the benchmark data set LETTER into two groups. Using cluster analysis of the letter means, we set the letters O, U, P, S, X, Z, E, B, F, T, W, A, Q to be Group 1 and the letters H, D, N, C, R, G, K, Y, V, M, I, J, L to be Group 2. The goal was to obtain complicated (highly overlapping) groups. In our experiments with MODIFIED LETTER we set a training sample size of 300 observations, selected randomly from the original LETTER data.

Table 2 shows the characteristics of the data sets: C is the number of classes, S is the average number of observations per class (S = n/C), p is the original dimension of the observations and d is the reduced dimension after preprocessing by principal component analysis [1] (a sketch of this preprocessing follows Table 2). We indicate by * the data sets having small sample sizes (S/d ≤ 10).

Table 2: Data set characteristics.

DATA               C    p    d    S/d
SONAR*             2    60   29   2.86
GLASS*             6    9    6    5.83
IONOSPHERE*        2    34   30   5.83
VOWEL*             11   13   9    9.88
MODIFIED LETTER*   2    16   15   10
IRIS               3    4    3    16.66
WINE               3    13   12   19.66
LIVER DISORDERS    2    7    5    27.6
GAUSSIAN           2    8    7    35.7
WAVEFORM-NOISE     3    40   40   41.65
LETTER             26   16   15   51.26
WAVEFORM           3    21   21   79.33
BANANA             2    2    2    250
BANANA-NOISE       2    2    2    250
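The dimension reduction from p to d reported in Table 2 is plain principal component analysis. A minimal sketch, with scikit-learn's PCA standing in for the textbook procedure of [1] (the function name and the train/test handling are ours):

    from sklearn.decomposition import PCA

    def reduce_dimension(X_train, X_test, d):
        # project onto the first d principal components; the directions
        # are estimated from the training part only
        pca = PCA(n_components=d).fit(X_train)
        return pca.transform(X_train), pca.transform(X_test)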

Each experiment consisted of 10 runs of the following procedure. We randomly drew a data set with replacement from the original data and split it into five roughly equal-sized parts. We fitted the parameters of f_c(x) (1) and selected the best model (M_c and the parameterization of Σ_cj) using four parts of the data, and calculated the correct classification rate by allocating the observations of the left-out part with the Bayes classification rule (Section 1). We carried out five replications, using a different part for classification each time. Finally, we averaged the classification rate over the 10 × 5 = 50 random runs; a sketch of this protocol follows Table 3. (When a covariance matrix Σ_cj happened to be singular, its zero eigenvalues were replaced by a small number just large enough to permit numerically stable inversion.)

In Table 3 we report the obtained average percentage of correct classification along with the standard errors. In parentheses we report the average number ν̄ of model parameters. For each data set, the highest correct classification rate (the row maximum) indicates the best model.

Table 3: Averaged percentage of correct classification rates.

                    DA with f_c(x) (1) having different parameterizations of Σ_cj
DATA                λ_cj I            λ_cj B_c          λ_cj B_cj         λ_cj D_c          D_cj               PPCA
SONAR*              71.9±0.89 (35)    74.8±0.85 (97)    74.7±0.92 (232)   81.3±0.64 (996)   80.9±0.62 (928)    79.8±0.73 (456)
GLASS*              57.9±0.98 (47)    62.2±0.71 (75)    61.5±0.82 (94)    61.0±0.82 (279)   61.0±0.82 (279)    49.0±0.92 (90)
IONOSPHERE*         91.4±0.42 (375)   92.5±0.40 (404)   92.5±0.39 (47)    91.9±0.40 (1016)  83.1±0.69 (496)    81.7±0.52 (26)
VOWEL*              85.3±0.40 (760)   86.2±0.40 (808)   87.2±0.40 (1084)  89.9±0.28 (1083)  87.7±0.35 (868)    85.4±0.28 (680)
MODIFIED LETTER*    79.4±0.73 (203)   81.6±0.52 (229)   79.8±0.52 (242)   82.0±0.56 (330)   81.7±0.50 (273)    81.0±0.61 (258)
IRIS                96.1±0.46 (40)    95.5±0.53 (36)    94.8±0.50 (35)    96.9±0.41 (27)    96.9±0.41 (27)     97.4±0.35 (28)
WINE                96.0±0.36 (37)    96.8±0.36 (37)    96.6±0.41 (105)   98.3±0.30 (279)   98.7±0.28 (270)    98.4±0.28 (37)
LIVER DISORDERS     66.1±0.50 (7)     65.1±0.49 (63)    64.1±0.63 (70)    67.4±0.65 (66)    67.9±0.69 (95)     60.7±0.71 (56)
GAUSSIAN            69.8±0.57 (100)   71.6±0.64 (84)    69.8±0.62 (86)    81.0±0.48 (73)    84.8±0.35 (105)    73.4±0.57 (103)
WAVEFORM-NOISE      86.7±0.14 (446)   86.5±0.13 (550)   86.5±0.13 (807)   84.0±0.14 (2580)  84.0±0.14 (2580)   86.0±0.14 (249)
LETTER              82.7±0.12 (5450)  89.0±0.10 (564)   90.0±0.07 (833)   91.7±0.07 (6987)  94.73±0.07 (3384)  84.0±0.07 (3380)
WAVEFORM            86.5±0.10 (297)   86.4±0.09 (355)   86.3±0.11 (427)   85.1±0.13 (787)   85.1±0.12 (756)    86.0±0.12 (35)
BANANA              92.9±0.12 (65)    94.2±0.10 (74)    94.2±0.10 (72)    94.3±0.10 (75)    95.1±0.06 (6)      95.1±0.07 (7)
BANANA-NOISE        75.3±0.14 (37)    75.1±0.15 (36)    75.4±0.15 (36)    75.4±0.14 (46)    75.8±0.16 (37)     75.8±0.16 (45)
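The sketch below illustrates this evaluation protocol: 10 random replications, each a draw with replacement followed by a 5-fold split, with the classification rate averaged over the 10 × 5 = 50 runs. Here `fit_and_select` is assumed to return a classifier implementing the Bayes rule of Section 1 on the selected per-class mixtures, and `repair_singular` shows the eigenvalue fix mentioned in the parenthetical note; all names are ours:

    import numpy as np

    def evaluate(X, y, fit_and_select, n_reps=10, n_folds=5, seed=0):
        rng = np.random.default_rng(seed)
        rates = []
        for _ in range(n_reps):
            idx = rng.integers(0, len(X), len(X))        # draw with replacement
            Xb, yb = X[idx], y[idx]
            folds = np.array_split(rng.permutation(len(Xb)), n_folds)
            for k in range(n_folds):                     # five replications
                test = folds[k]
                train = np.concatenate([f for m, f in enumerate(folds) if m != k])
                classify = fit_and_select(Xb[train], yb[train])   # EM + BIC
                pred = np.array([classify(x) for x in Xb[test]])
                rates.append(np.mean(pred == yb[test]))
        return np.mean(rates), np.std(rates) / np.sqrt(len(rates))

    def repair_singular(cov, eps=1e-8):
        # replace (near-)zero eigenvalues by a small positive number so
        # that the covariance matrix can be inverted in a stable way
        w, V = np.linalg.eigh(cov)
        return (V * np.maximum(w, eps)) @ V.T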

For the MODIFIED LETTER data set we ran an additional experiment. We trained the class-conditional densities (1) using the selected 300 training observations and calculated the correct classification rate by the Bayes classification rule on the remaining 19700 LETTER observations. In Fig. 1 we show the averaged classification rates versus the number ν_1 + ν_2 of parameters of the model. We constructed the graphs using the same number of components for each class (M_1 = M_2 = M), varying M = 1, 2, ..., 20.

Figure 1: Correct classification rate versus the number of components for the MODIFIED LETTER data set.

Observing the results in Table 3 and Fig. 1, we conclude that the model λ_cj D_c outperforms the other models for the data sets SONAR, VOWEL and MODIFIED LETTER, which have small sample sizes (S/d ≤ 10), and that the model λ_cj B_c outperforms the model λ_cj B_cj for most of the data sets. Consequently, it seems that the models Σ_cj = λ_cj I, λ_cj B_c and λ_cj D_c cover a wide range of data structures and could be considered the basic set of models for the model-based DA.

Looking at Table 3 for the MODIFIED LETTER data set, we observe that using BIC we selected the models λ_cj B_c and λ_cj D_c having ν̄_1 + ν̄_2 = 229 and 330, respectively. Observing the results in Fig. 1, we conclude that for those models (λ_cj B_c, λ_cj D_c) the best generalization performance (the highest classification rate on the 19700 test observations) is obtained for ν_1 + ν_2 > 800. Consequently, model selection using BIC underestimates the complexity of f_c(x) used for Bayes classification. An improvement of the model selection is needed; this is a subject of our future research.

4. Conclusions

In this paper we have investigated the performance of the model-based DA [2, 3] on various artificial and real-world data sets. The results of the experimental study show that Σ_cj = λ_cj I, λ_cj B_c and λ_cj D_c seem to be a universal set of models that can be recommended for practical applications to small-sample-size data sets. In addition, we proposed a modification of the model selection procedure of [2], which overcomes the setting of the maximal number of components in the mixture (1) and reduces the computational complexity of the model selection.

References

[1] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, Canada, 2001.
[2] C. Fraley, A.E. Raftery, Model-Based Clustering, Discriminant Analysis, and Density Estimation, Journal of the American Statistical Association, Vol. 97, pp. 611-631, 2002.
[3] G. Celeux, G. Govaert, Gaussian Parsimonious Clustering Models, Pattern Recognition, Vol. 28, No. 5, pp. 781-793, 1995.
[4] M.E. Tipping, C.M. Bishop, Mixtures of Probabilistic Principal Component Analyzers, Neural Computation, Vol. 11, pp. 443-482, 1999.
