
Journal of Multivariate Analysis 101 (2010)

Contents lists available at ScienceDirect: Journal of Multivariate Analysis. Journal homepage: www.elsevier.com/locate/jmva

The efficiency of logistic regression compared to normal discriminant analysis under class-conditional classification noise

Yingtao Bi, Daniel R. Jeske
Department of Statistics, University of California, Riverside, CA, United States

Article history: Received 20 April 2009. Available online 18 March 2010.

AMS subject classifications: 62H30, 62F12, 62E20

Keywords: Class noise; Misclassification rate; Misspecified model; Asymptotic distribution

Abstract. In many real world classification problems, class-conditional classification noise (CCC-Noise) frequently deteriorates the performance of a classifier that is naively built by ignoring it. In this paper we investigate the impact of CCC-Noise on the quality of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We consider the problem of two multivariate normal populations having a common covariance matrix. We compare the asymptotic distribution of the misclassification error rate of these two classifiers under CCC-Noise. We show that when the noise level is low, the asymptotic error rates of both procedures are only slightly affected. We also show that LR is less deteriorated by CCC-Noise compared to NDA. Under CCC-Noise contexts, the Mahalanobis distance between the populations plays a vital role in determining the relative performance of these two procedures. In particular, when this distance is small, LR tends to be more tolerant of CCC-Noise compared to NDA. (c) 2010 Elsevier Inc. All rights reserved.

1. Introduction

In many real world classification problems, class label noise is inherent in the training dataset, and CCC-Noise is an important type of class label noise. Examples where CCC-Noise arises include remote sensing and traditional medical diagnosis (see [1-3]) and image classification (see [4,5]). Recent medical diagnosis technologies (e.g., microarray analyses and protein mass-spectrometry profiling) have also sparked interest in classifiers whose training data are prone to CCC-Noise (see, for example, [6,7] or [8]).

In many applications most analysts acknowledge that noise is present but naively ignore it. Alternatively, data preprocessing approaches have been proposed that attempt to remove mislabeled training observations; see [9] or [10]. These approaches, however, face the risk of removing useful data, which consequently would reduce the accuracy of the classifier. An alternative solution for handling noisy data is to construct noise-tolerant classifiers directly. Li et al. [4] incorporated the class noise into the model through a probabilistic noise model. Norton and Hirsh [11] presented a Bayesian approach to learning from noisy data in which prior knowledge of the noise process is used to compute posterior class probabilities. Yasui et al. [6] and Magder and Hughes [12] used the EM algorithm to handle label noise.

The effects of CCC-Noise on the estimation of the association between two random variables have been discussed widely in epidemiology; see [13]. One important conclusion is that the noise attenuates the estimated association. Neuhaus [14] gave a comprehensive discussion of the bias and efficiency loss due to CCC-Noise in the context of generalized linear models and proposed an approximation for the expected values of the estimators. The effects of CCC-Noise have also been studied in the area of pattern recognition. Krishnan [15] studied the efficiency
loss using Efron's Asymptotic Relative Efficiency criterion [16], and Michalek and Tripathi [3] bounded the asymptotic efficiency of logistic regression relative to normal discrimination.

Corresponding author. E-mail addresses: ybi00@ucr.edu (Y. Bi), daniel.jeske@ucr.edu (D.R. Jeske).

(c) 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jmva.2010.03.001

Zhu and Wu [17] investigated the impact of both class and attribute noise through simulation experiments and concluded that classification accuracy declines almost linearly as the noise level increases. Until now, limited research has been conducted to theoretically quantify the impact of CCC-Noise on misclassification rates.

Generally speaking, classifiers can be characterized as either generative or discriminative according to whether or not the distribution of the explanatory variables is modeled. The comparison of generative and discriminative classifiers has received considerable attention in the literature, and there is a widely held perception that, in the absence of CCC-Noise, the discriminative approach is more robust than the generative approach. For example, Ng and Jordan [18] showed that if the assumed conditional distributions for the explanatory variables are correct, then generative and discriminative classifiers have the same asymptotic error rates, although the generative classifiers approach their asymptotic values faster. However, if the conditional distributions are not correct, then discriminative classifiers have lower asymptotic error rates, provided the link function is modeled correctly. Efron [16] computed the relative efficiency, based on the ratio of expected regrets, of one popular generative classifier, normal discriminant analysis (NDA), to the corresponding discriminative classifier, logistic regression (LR), and concluded that LR is between one half and two thirds as efficient as NDA under normality. In this paper we aim to compare theoretically the misclassification error rates of NDA and LR under CCC-Noise.

The rest of this paper is organized as follows. In Section 2 we briefly review the NDA and LR classifiers and then discuss these two procedures under CCC-Noise. In Section 3 the asymptotic distributions of the misclassification error rates of NDA and LR under CCC-Noise are obtained. In Section 4 we compare the expected values of the misclassification error rates of these two procedures and also their relative efficiency. Section 5 concludes with a summary.

2. NDA and LR classifiers under CCC-Noise

We consider a binary classification task. Let X denote the vector of explanatory variables and suppose that X comes from one of two p-dimensional normal populations differing in mean but not in covariance:

    population 0: X ~ MVN_p(mu_0, Sigma) with probability pi_0,
    population 1: X ~ MVN_p(mu_1, Sigma) with probability pi_1,                                            (1)

where pi_0 + pi_1 = 1. If all parameters are known, a new observation may be classified on the basis of X as belonging to population 1 or 0 according as

    L_(alpha,beta)(X) = alpha + beta'X > or < 0,

where

    alpha = lambda - (mu_1' Sigma^{-1} mu_1 - mu_0' Sigma^{-1} mu_0)/2,    beta = Sigma^{-1}(mu_1 - mu_0),   (2)

with lambda = log(pi_1/pi_0). This linear classifier minimizes the total misclassification rate, as is easily shown by applying Bayes' theorem. However, the assumption that the parameters (lambda, mu_0, mu_1, Sigma) are known is rarely satisfied in practice. Therefore the parameters are usually estimated from a random sample of n observations {(x_1, y_1), ..., (x_n, y_n)}, where y_k is the binary class label of the kth sample. Parameter estimates are typically obtained by maximizing the log-likelihood function

    sum_{k=1}^n (1 - y_k){ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_0)' Sigma^{-1} (x_k - mu_0) + log pi_0 }
    + sum_{k=1}^n y_k{ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_1)' Sigma^{-1} (x_k - mu_1) + log pi_1 }.

Using the maximum likelihood estimators (MLEs) in place of the unknown parameters results in the so-called NDA procedure.
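To make the optimal rule in Eq. (2) concrete, the following short sketch (our own illustration, not part of the original paper) computes (alpha, beta) from known parameters and applies the resulting linear boundary; numpy is the only dependency and the p = 2 example values are arbitrary.

    # A minimal sketch (not from the paper) of the optimal linear rule in Eq. (2),
    # assuming all parameters are known.
    import numpy as np

    def optimal_rule(mu0, mu1, Sigma, pi1):
        """Return (alpha, beta) of L_(alpha,beta)(x) = alpha + beta'x from Eq. (2)."""
        Sigma_inv = np.linalg.inv(Sigma)
        lam = np.log(pi1 / (1.0 - pi1))                      # lambda = log(pi1/pi0)
        beta = Sigma_inv @ (mu1 - mu0)                       # beta = Sigma^{-1}(mu1 - mu0)
        alpha = lam - 0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)
        return alpha, beta

    def classify(x, alpha, beta):
        """Assign to population 1 when alpha + beta'x > 0, else population 0."""
        return int(alpha + beta @ x > 0)

    # Example: p = 2, standard situation with Delta = 2 and pi0 = pi1 = 0.5.
    mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
    alpha, beta = optimal_rule(mu0, mu1, np.eye(2), pi1=0.5)
    print(alpha, beta, classify(np.array([0.3, -0.2]), alpha, beta))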
The corresponding discriminative method to NDA is LR. LR is a popular discriminative method that does not assume any distribution for X; it simply assumes a parametric form for the conditional probability P(Y = 1 | X = x). The parameters (alpha, beta) are then estimated directly by maximizing the conditional log-likelihood

    sum_{k=1}^n y_k(alpha + beta'x_k) - sum_{k=1}^n log{1 + exp(alpha + beta'x_k)}.

In the absence of label noise, both NDA and LR give consistent estimates under (1). We now introduce CCC-Noise into these two procedures. For a binary classification problem, CCC-Noise means that the label Y (0 or 1) of any observation is independently and randomly flipped to 1 - Y with probability 1 - theta_Y, where theta_Y denotes the correct labeling rate.
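The next sketch (ours, with illustrative parameter values) shows the two naive procedures studied in this paper being fit to noisy labels: data are generated from (1) in the standard situation, labels are flipped according to the CCC-Noise mechanism just described, and the plug-in NDA rule and an approximately unpenalized scikit-learn logistic regression, standing in for the conditional likelihood in (4), are fit while ignoring the noise.

    # Hedged illustration (ours): generate data, corrupt labels with CCC-Noise,
    # and fit the naive NDA and LR classifiers that ignore the noise.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def simulate(n, p, delta, pi1, theta0, theta1):
        y = rng.binomial(1, pi1, size=n)                       # true labels
        mu = np.zeros(p); mu[0] = delta / 2.0
        x = rng.standard_normal((n, p)) + np.where(y[:, None] == 1, mu, -mu)
        flip = rng.uniform(size=n) > np.where(y == 1, theta1, theta0)
        y_tilde = np.where(flip, 1 - y, y)                     # noisy labels
        return x, y, y_tilde

    def nda_fit(x, y_tilde):
        """Plug-in (misspecified) MLEs maximizing (3): group means, pooled covariance."""
        x0, x1 = x[y_tilde == 0], x[y_tilde == 1]
        mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
        S = ((x0 - mu0).T @ (x0 - mu0) + (x1 - mu1).T @ (x1 - mu1)) / len(x)
        S_inv = np.linalg.inv(S)
        lam = np.log(len(x1) / len(x0))
        beta = S_inv @ (mu1 - mu0)
        alpha = lam - 0.5 * (mu1 @ S_inv @ mu1 - mu0 @ S_inv @ mu0)
        return alpha, beta

    x, y, y_tilde = simulate(n=500, p=2, delta=2.0, pi1=0.5, theta0=0.9, theta1=0.8)
    alpha_hat, beta_hat = nda_fit(x, y_tilde)
    lr = LogisticRegression(C=1e6).fit(x, y_tilde)             # large C ~ no penalty, Eq. (4)
    print(alpha_hat, beta_hat, lr.intercept_, lr.coef_)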

Here Y denotes the true class label, but instead of observing Y we observe the noisy class label Y~. The CCC-Noise can be specified as

    P(Y~ = 1 | Y = 0) = 1 - theta_0,    P(Y~ = 0 | Y = 1) = 1 - theta_1,

and the observed training data are {(x_1, y~_1), ..., (x_n, y~_n)}. For the NDA procedure the misspecified log-likelihood is

    sum_{k=1}^n (1 - y~_k){ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_0)' Sigma^{-1} (x_k - mu_0) + log pi_0 }
    + sum_{k=1}^n y~_k{ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_1)' Sigma^{-1} (x_k - mu_1) + log pi_1 }.     (3)

The MLEs obtained by maximizing (3) are termed misspecified MLEs. Similarly, the misspecified MLEs for the LR procedure are obtained by maximizing

    sum_{k=1}^n y~_k(alpha + beta'x_k) - sum_{k=1}^n log{1 + exp(alpha + beta'x_k)}.                                     (4)

3. Asymptotic distribution of the error rate

Let (alpha^, beta^) denote an arbitrary estimate of (alpha, beta) based upon the training data set. For a new observation (X, Y), the probability of a misclassification error (calculated with respect to the joint distribution of the training data and the new observation) is defined to be

    ER(alpha^, beta^) = P(L_(alpha^,beta^)(X) > 0, Y = 0) + P(L_(alpha^,beta^)(X) < 0, Y = 1).

If (alpha^, beta^) is randomly determined (as in NDA or LR), ER(alpha^, beta^) is a random variable. Efron [16] observed that the distribution of ER(alpha^, beta^) is invariant under a linear transformation that allows reduction of assumption (1) to the following canonical form:

    population 0: X ~ MVN_p(-(Delta/2) e_1, I) with probability pi_0,
    population 1: X ~ MVN_p((Delta/2) e_1, I) with probability pi_1,                                                     (5)

where Delta^2 = (mu_1 - mu_0)' Sigma^{-1} (mu_1 - mu_0), so that Delta is the square root of the Mahalanobis distance, I is the p x p identity matrix and e_1 = (1, 0, ..., 0)' is a p x 1 vector. For either LR or NDA, the probability of misclassification has the same distribution under (1) as it does under the so-called standard situation (5). A formal proof of this statement is presented in [19], where it is also shown that the result continues to hold in the presence of CCC-Noise. Henceforth we work with the standard situation.

In the standard situation, the conditional (given the training data) probability of misclassification of an arbitrary estimated classification boundary L_(alpha^,beta^)(x) = alpha^ + beta^'x = 0 is

    ER(alpha^, beta^) = pi_0 Phi( (-(Delta/2) beta^_1 + alpha^) / (beta^'beta^)^{1/2} )
                      + pi_1 Phi( (-(Delta/2) beta^_1 - alpha^) / (beta^'beta^)^{1/2} ),                                 (6)

where beta^_1 denotes the first component of beta^ and Phi(.) is the cumulative distribution function of the standard normal distribution.
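Eq. (6) is easy to evaluate numerically. The sketch below (ours) does so with scipy's normal CDF playing the role of Phi; the check in the final line assumes pi_0 = 0.5, in which case lambda = 0 and the minimum error rate is Phi(-Delta/2).

    # Sketch (ours) of the conditional error rate in Eq. (6) for the standard situation (5).
    import numpy as np
    from scipy.stats import norm

    def error_rate(alpha, beta, delta, pi0):
        """Eq. (6): misclassification probability of the boundary alpha + beta'x = 0."""
        beta = np.asarray(beta, dtype=float)
        norm_beta = np.sqrt(beta @ beta)
        b1 = beta[0]
        return (pi0 * norm.cdf((-0.5 * delta * b1 + alpha) / norm_beta)
                + (1 - pi0) * norm.cdf((-0.5 * delta * b1 - alpha) / norm_beta))

    # With the optimal rule (alpha, beta) = (lambda, Delta e_1) and pi0 = 0.5,
    # this reduces to Phi(-Delta/2), about 0.1587 for Delta = 2.
    print(error_rate(alpha=0.0, beta=[2.0, 0.0], delta=2.0, pi0=0.5))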

Now suppose the arbitrary estimate (alpha^, beta^) has a limiting normal distribution

    sqrt(n){ (alpha^, beta^')' - (alpha*, beta*')' }  ->_L  MVN_{p+1}( 0, [ z_00  z_01' ; z_01  Z_11 ] ),                 (7)

where z_00 is a scalar, z_01 is a column p-vector and Z_11 is a p x p matrix. Differentiating (6) gives

    dER(alpha^, beta^)/dalpha^ = f_0(alpha^, beta^)
        = pi_0 (beta^'beta^)^{-1/2} phi( (-(Delta/2)beta^_1 + alpha^)/(beta^'beta^)^{1/2} )
        - pi_1 (beta^'beta^)^{-1/2} phi( (-(Delta/2)beta^_1 - alpha^)/(beta^'beta^)^{1/2} ),

    dER(alpha^, beta^)/dbeta^ = f_1(alpha^, beta^)
        = pi_0 [ -(Delta/2)(beta^'beta^)^{-1/2} e_1 - (-(Delta/2)beta^_1 + alpha^)(beta^'beta^)^{-3/2} beta^ ] phi( (-(Delta/2)beta^_1 + alpha^)/(beta^'beta^)^{1/2} )
        + pi_1 [ -(Delta/2)(beta^'beta^)^{-1/2} e_1 - (-(Delta/2)beta^_1 - alpha^)(beta^'beta^)^{-3/2} beta^ ] phi( (-(Delta/2)beta^_1 - alpha^)/(beta^'beta^)^{1/2} ),

where phi(.) is the standard normal density function. Defining

    omega^2(alpha^, beta^) = f_0(alpha^, beta^)^2 z_00 + f_1(alpha^, beta^)' Z_11 f_1(alpha^, beta^) + 2 f_0(alpha^, beta^) z_01' f_1(alpha^, beta^),

the following lemma follows from an application of the delta method.

Lemma 1. In the standard situation, if (alpha^, beta^) has the limiting distribution given in (7) with (alpha*, beta*) != (alpha, beta), then

    sqrt(n)[ ER(alpha^, beta^) - ER(alpha*, beta*) ]  ->_L  N(0, omega^2(alpha*, beta*)).

Note that if (alpha*, beta*) were the true values of (alpha, beta), then f_0(alpha*, beta*) and f_1(alpha*, beta*) would both vanish, since ER(alpha^, beta^) is minimized at that point. In that situation Lemma 1 would not be valid, because the first-order Taylor expansion is not enough; Efron [16] computed the efficiency of LR relative to NDA without CCC-Noise from a second-order Taylor expansion. In the presence of CCC-Noise, Lemma 1 is valid since (alpha*, beta*) is not equal to the true value.

In order to use Lemma 1 to obtain the asymptotic distribution of ER(alpha^, beta^), we first need the asymptotic distribution of (alpha^, beta^). To this end we utilize White's theorem [20].

White's Theorem. Suppose iid data x_1, x_2, ..., x_n come from a true distribution p(x), but we assume a family of distributions p(x | theta), where theta is an indexing parameter. Then, under suitable regularity conditions, the maximum likelihood estimator of theta converges almost surely to the value theta* that minimizes int p(x) log[ p(x)/p(x | theta) ] dx, the Kullback-Leibler distance of p(x | theta) from p(x). White also shows that the sequence of MLEs theta^_n is asymptotically multivariate normal in the sense that sqrt(n)(theta^_n - theta*) -> MVN(0, C(theta*)), where the covariance matrix C(theta*) equals A(theta*)^{-1} B(theta*) A(theta*)^{-1}, with A(theta) and B(theta) matrices whose (i, j)th elements are

    A_ij(theta) = E_{p(x)}[ d^2 log p(x | theta)/dtheta_i dtheta_j ],
    B_ij(theta) = E_{p(x)}[ (d log p(x | theta)/dtheta_i)(d log p(x | theta)/dtheta_j) ].
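A generic numerical sketch of White's sandwich covariance (ours, not the analytic evaluation carried out in this paper) may help fix ideas: for a logistic working model, the per-observation scores give B, a finite-difference derivative of the mean score gives A, and A^{-1} B A^{-1}/n approximates the covariance of the misspecified MLE whether or not the working model is correct.

    # Sketch (ours) of the sandwich covariance A^{-1} B A^{-1} for a logistic working model,
    # estimated empirically with finite differences for the derivatives.
    import numpy as np

    def score(theta, x, y):
        """Per-observation score of the logistic working model log p(y | x, theta)."""
        alpha, beta = theta[0], theta[1:]
        p = 1.0 / (1.0 + np.exp(-(alpha + x @ beta)))
        resid = (y - p)[:, None]
        return np.hstack([resid, resid * x])                 # n x (p+1) matrix of scores

    def sandwich(theta, x, y, eps=1e-5):
        theta = np.asarray(theta, dtype=float)
        n, k = len(y), len(theta)
        s = score(theta, x, y)
        B = s.T @ s / n                                      # E[score score']
        A = np.zeros((k, k))                                 # E[-d score / d theta]
        for j in range(k):
            t_plus = theta.copy(); t_plus[j] += eps
            t_minus = theta.copy(); t_minus[j] -= eps
            A[:, j] = -(score(t_plus, x, y) - score(t_minus, x, y)).mean(axis=0) / (2 * eps)
        A_inv = np.linalg.inv(A)
        return A_inv @ B @ A_inv / n                         # approximate Cov(theta_hat)

    # Toy usage with synthetic data; the evaluation point is illustrative only.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2000, 2))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x[:, 0])))
    print(sandwich(np.array([0.0, 1.0, 0.0]), x, y))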

3.1. Asymptotic distribution of the error rate of NDA

First we consider the misspecified NDA procedure. Let (alpha^, beta^) and (lambda^, mu^_0, mu^_1, Sigma^) denote, respectively, the estimates of (alpha, beta) and (lambda, mu_0, mu_1, Sigma) based on the misspecified NDA procedure. Write the distinct elements of Sigma^ as a p(p+1)/2 x 1 vector (sigma^_11, sigma^_12, ..., sigma^_1p, sigma^_22, sigma^_23, ..., sigma^_pp)' and partition this vector as (sigma^_(1)', sigma^_(2)')', where sigma^_(1) = (sigma^_11, sigma^_12, ..., sigma^_1p)' and sigma^_(2) = (sigma^_22, sigma^_23, ..., sigma^_pp)'. For convenience we first introduce the following expressions:

    a = pi_0 theta_0,    b = pi_1 (1 - theta_1),    c = pi_0 (1 - theta_0),    d = pi_1 theta_1,
    h = 1 + Delta^2 [ ab/(a + b) + cd/(c + d) ].

We also use the following notation: E_ij is the p x p matrix with a one in the (i, j)th position and zeros elsewhere, O denotes a zero matrix of the indicated dimensions, and delta_ij = 1 or 0 according as i = j or i != j. As described earlier, the asymptotic distribution of the error rate is the same under (1) as it is under (5), so we consider the asymptotic distribution of the misspecified MLEs in the standard situation. It will also be seen that we do not need the limiting distribution of sigma^_(2) to derive the asymptotic distribution of the NDA based estimates (alpha^, beta^).

Lemma 2. In the standard situation under CCC-Noise, the NDA produces misspecified estimates (lambda^, mu^_0, mu^_1, sigma^_(1)), obtained by maximizing (3), which satisfy

    sqrt(n)[ (lambda^ - lambda*, (mu^_0 - mu_0*)', (mu^_1 - mu_1*)', (sigma^_(1) - sigma_(1)*)')' ]  ->_L  MVN_{3p+1}(0, Sigma_1),

where

    lambda* = log[(c + d)/(a + b)],
    mu_0*   = (Delta/2) [(b - a)/(b + a)] e_1,
    mu_1*   = (Delta/2) [(d - c)/(d + c)] e_1,
    sigma_(1)* = h e_1,

and

    Sigma_1 = [ omega_00       O                                       O                                       omega_03 e_1'
                O              omega_11 E_11 + (a+b)^{-1}(I - E_11)    O                                       omega_13 E_11
                O              O                                       omega_22 E_11 + (c+d)^{-1}(I - E_11)    omega_23 E_11
                omega_03 e_1   omega_13 E_11                           omega_23 E_11                           omega_33 E_11 + h(I - E_11) ],

with

    omega_00 = [1 + exp(lambda*)]^2 / exp(lambda*),
    omega_11 = 1/(a + b) + Delta^2 ab/(a + b)^3,
    omega_22 = 1/(c + d) + Delta^2 cd/(c + d)^3,
    omega_33 = 2h^2 + Delta^4 [ (a^4 b + a b^4)/(a + b)^4 + (c^4 d + c d^4)/(c + d)^4 ] - 3 Delta^4 [ ab/(a + b) + cd/(c + d) ]^2,
    omega_03 = Delta^2 [ cd/(c + d)^2 - ab/(a + b)^2 ],
    omega_13 = Delta^3 ab(a - b)/(a + b)^3,
    omega_23 = Delta^3 cd(c - d)/(c + d)^3.

Proof: see Appendix A.

Lemma 2 gives the asymptotic distribution of the misspecified MLEs lambda^, mu^_0, mu^_1 and sigma^_(1). Following Efron's derivations [16], we use the multivariate delta method to obtain Lemma 3, which uses the following definitions:

    m_01 = (Delta/2) (b - a)/[(b + a) h],      m_02 = (Delta/2) (c - d)/[(d + c) h],
    m_03 = (Delta^2/(8h^2)) [ ((d - c)/(d + c))^2 - ((b - a)/(b + a))^2 ],
    m_13 = (Delta/(2h^2)) [ (b - a)/(b + a) - (d - c)/(d + c) ].

Lemma 3. In the standard situation under CCC-Noise, the NDA produces misspecified estimates (alpha^, beta^) satisfying

    sqrt(n){ (alpha^, beta^')' - (alpha*, beta*')' }  ->_L  MVN_{p+1}(0, Sigma_2),

where

    alpha* = log[(c + d)/(a + b)] - (Delta^2/(8h)) [ ((d - c)/(d + c))^2 - ((b - a)/(b + a))^2 ],
    beta*  = Sigma*^{-1}(mu_1* - mu_0*) = [ Delta(ad - bc) / (h(b + a)(d + c)) ] e_1,

and

    Sigma_2 = [ kappa_00       kappa_01 e_1'
                kappa_01 e_1   kappa_11 E_11 + ( [(a + b)(c + d)]^{-1} + m_13^2 h^3 )(I - E_11) ],

with

    kappa_00 = omega_00 + m_01^2 omega_11 + m_02^2 omega_22 + m_03^2 omega_33 + 2 m_03 omega_03 + 2 m_01 m_03 omega_13 + 2 m_02 m_03 omega_23,
    kappa_01 = m_13 omega_03 - (m_01/h) omega_11 + (m_02/h) omega_22 + m_03 m_13 omega_33 + (m_01 m_13 - m_03/h) omega_13 + (m_02 m_13 + m_03/h) omega_23,
    kappa_11 = (omega_11 + omega_22)/h^2 + m_13^2 omega_33 - (2 m_13/h) omega_13 + (2 m_13/h) omega_23.
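Before turning to the proof, the limits in Lemma 2 can be checked quickly by simulation (a sketch of ours, with illustrative parameter values): the group means and the pooled first-coordinate variance of a large noisy-label sample should be close to mu_0*, mu_1* and h. All symbols follow the a, b, c, d notation above.

    # Monte Carlo check (ours) of the Lemma 2 limits in the standard situation.
    import numpy as np

    rng = np.random.default_rng(0)
    pi0, theta0, theta1, delta, n = 0.5, 0.9, 0.8, 2.0, 1_000_000

    y = rng.binomial(1, 1 - pi0, size=n)
    x1 = rng.standard_normal(n) + np.where(y == 1, delta / 2, -delta / 2)
    flip = rng.uniform(size=n) > np.where(y == 1, theta1, theta0)
    yt = np.where(flip, 1 - y, y)

    a, b = pi0 * theta0, (1 - pi0) * (1 - theta1)         # P(Y=0, Y~=0), P(Y=1, Y~=0)
    c, d = pi0 * (1 - theta0), (1 - pi0) * theta1         # P(Y=0, Y~=1), P(Y=1, Y~=1)
    mu0_star = 0.5 * delta * (b - a) / (b + a)
    mu1_star = 0.5 * delta * (d - c) / (d + c)
    h = 1 + delta**2 * (a * b / (a + b) + c * d / (c + d))

    m0_hat, m1_hat = x1[yt == 0].mean(), x1[yt == 1].mean()
    s11_hat = (np.sum((x1[yt == 0] - m0_hat) ** 2) + np.sum((x1[yt == 1] - m1_hat) ** 2)) / n
    print(m0_hat, mu0_star, m1_hat, mu1_star, s11_hat, h)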

Proof. Differentiating (2) gives

    dalpha/dlambda = 1,    dbeta/dlambda = 0,
    dalpha/dmu_0 = Sigma^{-1} mu_0,    dbeta/dmu_0 = -Sigma^{-1},
    dalpha/dmu_1 = -Sigma^{-1} mu_1,    dbeta/dmu_1 = Sigma^{-1},
    dalpha/dsigma_ij = (mu~_1i mu~_1j - mu~_0i mu~_0j)/(1 + delta_ij),
    dbeta/dsigma_ij = -Sigma^{-1}[(E_ij + E_ji)/(1 + delta_ij)] Sigma^{-1}(mu_1 - mu_0),

where mu~_0 = Sigma^{-1} mu_0 and mu~_1 = Sigma^{-1} mu_1, with mu~_0i denoting the ith component of mu~_0 and likewise for mu~_1i. In the standard situation we therefore have the following first-order relationship, obtained by expanding (alpha^, beta^) around (alpha*, beta*):

    (alpha^ - alpha*, (beta^ - beta*)')'
        ~ [ 1    m_01 e_1'      m_02 e_1'    m_03 e_1'                            0'
            0    -Sigma*^{-1}   Sigma*^{-1}  m_13 E_11 + h m_13 (I - E_11)        O ]
          x (lambda^ - lambda*, (mu^_0 - mu_0*)', (mu^_1 - mu_1*)', (sigma^_(1) - sigma_(1)*)', (sigma^_(2) - sigma_(2)*)')'.    (8)

Let M be the matrix on the right side of (8) ignoring the last column of blocks, whose entries vanish at the limit. The multivariate delta method then gives the covariance matrix of (alpha^, beta^) as M Sigma_1 M', and evaluation of M Sigma_1 M' gives Sigma_2.

The asymptotic distribution of ER(alpha^, beta^) now follows.

Theorem 1. sqrt(n)[ ER(alpha^, beta^) - ER(alpha*, beta*_1 e_1) ]  ->_L  N(0, omega^2_NDA), where

    ER(alpha*, beta*_1 e_1) = pi_0 Phi(-Delta/2 + alpha*/beta*_1) + pi_1 Phi(-Delta/2 - alpha*/beta*_1),

    omega^2_NDA = f_0(alpha*, beta*_1 e_1)^2 kappa_00 + kappa_11 [f_1(alpha*, beta*_1 e_1)'e_1]^2
                  + 2 f_0(alpha*, beta*_1 e_1) kappa_01 f_1(alpha*, beta*_1 e_1)'e_1,

with

    f_0(alpha*, beta*_1 e_1) = (pi_0/beta*_1) phi(-Delta/2 + alpha*/beta*_1) - (pi_1/beta*_1) phi(-Delta/2 - alpha*/beta*_1),
    f_1(alpha*, beta*_1 e_1)'e_1 = (alpha*/beta*_1^2)[ pi_1 phi(-Delta/2 - alpha*/beta*_1) - pi_0 phi(-Delta/2 + alpha*/beta*_1) ],

and

    alpha*/beta*_1 = [ h(b + a)(d + c) / (Delta(ad - bc)) ] log[(c + d)/(a + b)] - (Delta/4)[ (d - c)/(d + c) + (b - a)/(b + a) ].

Proof. The result follows immediately from Lemmas 1 and 3.

The minimal value of the error rate is ER(lambda, Delta e_1) = pi_0 Phi(-Delta/2 + lambda/Delta) + pi_1 Phi(-Delta/2 - lambda/Delta). Hence the error rate of the misspecified NDA procedure converges to ER(lambda, Delta e_1) if the following condition is satisfied:

    [ h(b + a)(d + c) / (ad - bc) ] log[(c + d)/(a + b)] - (Delta^2/4)[ (d - c)/(d + c) + (b - a)/(b + a) ] = lambda.     (9)

When pi_0 = pi_1, it is easily seen that this equation holds if theta_0 = theta_1.
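As a numerical companion to Theorem 1 and condition (9) (ours, using the plug-in limits of Lemmas 2 and 3 as written above), the sketch below evaluates the asymptotic error rate of the misspecified NDA rule and confirms that it equals Phi(-Delta/2) when pi_0 = pi_1 and theta_0 = theta_1, and exceeds it otherwise.

    # Sketch (ours): asymptotic NDA error rate under CCC-Noise via the plug-in limits.
    import numpy as np
    from scipy.stats import norm

    def nda_asymptotic_er(pi0, theta0, theta1, delta):
        pi1 = 1 - pi0
        a, b = pi0 * theta0, pi1 * (1 - theta1)
        c, d = pi0 * (1 - theta0), pi1 * theta1
        lam = np.log((c + d) / (a + b))
        m0 = 0.5 * delta * (b - a) / (b + a)              # first coordinate of mu_0*
        m1 = 0.5 * delta * (d - c) / (d + c)              # first coordinate of mu_1*
        h = 1 + delta**2 * (a * b / (a + b) + c * d / (c + d))
        beta1 = (m1 - m0) / h                             # beta* = beta1* e_1; > 0 for theta's > 0.5
        alpha = lam - (m1**2 - m0**2) / (2 * h)
        ratio = alpha / beta1
        return pi0 * norm.cdf(-delta / 2 + ratio) + pi1 * norm.cdf(-delta / 2 - ratio)

    print(nda_asymptotic_er(0.5, 0.8, 0.8, delta=2.0))    # equals Phi(-1) ~ 0.1587
    print(nda_asymptotic_er(0.5, 0.9, 0.7, delta=2.0))    # strictly larger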

3.2. Asymptotic distribution of the error rate of LR

According to White's theorem, the misspecified MLEs obtained by maximizing (4), say (s^, t^), are asymptotically normal and converge almost surely to the value (s*, t*) that minimizes

    E_(X,Y~)[ -Y~(alpha + beta'X) + log{1 + exp(alpha + beta'X)} - log phi_X(X) ],                                       (10)

where phi_X(x) is the density function of X under (5). Taking the first partial derivatives with respect to alpha and beta and setting them to zero gives

    E_(X,Y~)[Y~] - E_X[ exp(s + t'X)/(1 + exp(s + t'X)) ] = 0,                                                           (11)
    E_(X,Y~)[Y~ X] - E_X[ X exp(s + t'X)/(1 + exp(s + t'X)) ] = 0.                                                       (12)

Neuhaus [14] showed that ignoring CCC-Noise in LR can be viewed as a kind of link function violation, and Li and Duan [21] showed that the estimated slope t^ from LR with a misspecified link function consistently estimates the true parameter beta up to a scale factor. In the standard situation beta = Delta e_1, so t* = u* e_1 with u* a scalar. Equations (11) and (12) then simplify to

    g_1(pi_0, theta_0, theta_1) = E_X[ exp(s + u X_1)/(1 + exp(s + u X_1)) ],                                            (13)
    g_2(pi_0, theta_0, theta_1) = E_X[ X_1 exp(s + u X_1)/(1 + exp(s + u X_1)) ],                                        (14)

with

    g_1(pi_0, theta_0, theta_1) = E_(X,Y~)[Y~]
        = theta_1 E_X[ pi_1 exp(Delta X_1)/(pi_0 + pi_1 exp(Delta X_1)) ] + (1 - theta_0) E_X[ pi_0/(pi_0 + pi_1 exp(Delta X_1)) ],
    g_2(pi_0, theta_0, theta_1) = E_(X,Y~)[Y~ X_1]
        = theta_1 E_X[ X_1 pi_1 exp(Delta X_1)/(pi_0 + pi_1 exp(Delta X_1)) ] + (1 - theta_0) E_X[ X_1 pi_0/(pi_0 + pi_1 exp(Delta X_1)) ],

and X_1 denoting the first component of X. Eqs. (13) and (14) are solved by combining Monte-Carlo integration with the Newton-Raphson algorithm. The two unknowns (s*, u*) depend only on the values of (pi_0, theta_0, theta_1, Delta); instead of writing s*(pi_0, theta_0, theta_1, Delta) and u*(pi_0, theta_0, theta_1, Delta) we simply write (s*, u*) for convenience.

Lemma 4. In the standard situation under CCC-Noise, the LR produces misspecified estimates (s^, t^) satisfying

    sqrt(n){ (s^, t^')' - (s*, u* e_1')' }  ->_L  MVN_{p+1}(0, Sigma_3),

where (s*, u*) is the solution of (13) and (14),

    Sigma_3 = [ psi_00       psi_01 e_1'
                psi_01 e_1   psi_11 E_11 + (v_0/h_0^2)(I - E_11) ],

with

    psi_00 = [ h_2^2 v_0 - 2 h_1 h_2 v_1 + h_1^2 v_2 ] / (h_0 h_2 - h_1^2)^2,
    psi_01 = [ -h_1 h_2 v_0 + (h_0 h_2 + h_1^2) v_1 - h_0 h_1 v_2 ] / (h_0 h_2 - h_1^2)^2,
    psi_11 = [ h_1^2 v_0 - 2 h_0 h_1 v_1 + h_0^2 v_2 ] / (h_0 h_2 - h_1^2)^2,

and

    h_i(pi_0, theta_0, theta_1) = int [ exp(s + u x_1)/(1 + exp(s + u x_1))^2 ] x_1^i phi_X(x) dx,    i = 0, 1, 2,
    v_i(pi_0, theta_0, theta_1) = int { [pi_1 theta_1 exp(Delta x_1) + pi_0(1 - theta_0)] / [pi_0 + pi_1 exp(Delta x_1)]
                                        x [ 1 - 2 exp(s + u x_1)/(1 + exp(s + u x_1)) ]
                                        + [ exp(s + u x_1)/(1 + exp(s + u x_1)) ]^2 } x_1^i phi_X(x) dx,    i = 0, 1, 2.

Proof: see Appendix B.

Combining Lemmas 1 and 4 proves the following theorem on the asymptotic distribution of ER(s^, t^).

Theorem 2. sqrt(n)[ ER(s^, t^) - ER(s*, u* e_1) ]  ->_L  N(0, omega^2_LR), where

    ER(s*, u* e_1) = pi_0 Phi(-Delta/2 + s*/u*) + pi_1 Phi(-Delta/2 - s*/u*),

    omega^2_LR = f_0(s*, u* e_1)^2 psi_00 + psi_11 [f_1(s*, u* e_1)'e_1]^2 + 2 f_0(s*, u* e_1) psi_01 f_1(s*, u* e_1)'e_1,

with

    f_0(s*, u* e_1) = (pi_0/u*) phi(-Delta/2 + s*/u*) - (pi_1/u*) phi(-Delta/2 - s*/u*),
    f_1(s*, u* e_1)'e_1 = (s*/u*^2)[ pi_1 phi(-Delta/2 - s*/u*) - pi_0 phi(-Delta/2 + s*/u*) ].
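The numerical step described above (Monte Carlo integration combined with a root finder) can be sketched as follows. This is our own illustration, with scipy.optimize.fsolve standing in for the Newton-Raphson iteration mentioned in the text and with common random draws used for both sides of (13) and (14); parameter values are illustrative.

    # Sketch (ours): solve the simplified equations (13)-(14) for (s*, u*).
    import numpy as np
    from scipy.optimize import fsolve

    def solve_s_u(pi0, theta0, theta1, delta, m=200_000, seed=0):
        rng = np.random.default_rng(seed)
        pi1 = 1.0 - pi0
        # Draws of X_1 from the marginal mixture under the standard situation (5).
        y = rng.binomial(1, pi1, size=m)
        x1 = rng.standard_normal(m) + np.where(y == 1, delta / 2, -delta / 2)
        # P(Ytilde = 1 | X) = theta1*P(Y=1|X) + (1-theta0)*P(Y=0|X).
        p1 = pi1 * np.exp(delta * x1) / (pi0 + pi1 * np.exp(delta * x1))
        q = theta1 * p1 + (1 - theta0) * (1 - p1)
        g1, g2 = q.mean(), (q * x1).mean()                 # left-hand sides of (13)-(14)

        def equations(params):
            s, u = params
            p = 1.0 / (1.0 + np.exp(-(s + u * x1)))
            return [p.mean() - g1, (p * x1).mean() - g2]

        return fsolve(equations, x0=[0.0, delta])          # start from the no-noise values

    print(solve_s_u(pi0=0.5, theta0=0.9, theta1=0.8, delta=2.0))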

Fig. 1a. The asymptotic error rate bias with theta_0 = 1, Delta = 1 and pi_0 = 0.5.
Fig. 1b. The asymptotic error rate bias with theta_0 = 1, Delta = 2 and pi_0 = 0.5.

4. Comparison and discussion

Define the bias of the asymptotic error rate of a classification rule by

    AERB(alpha^, beta^) = lim_{n->inf} E[ER(alpha^, beta^)] - ER(lambda, Delta e_1) = ER(alpha*, beta*_1 e_1) - ER(lambda, Delta e_1),
    AERB(s^, t^) = lim_{n->inf} E[ER(s^, t^)] - ER(lambda, Delta e_1) = ER(s*, u* e_1) - ER(lambda, Delta e_1).

We assume that both theta_0 and theta_1 are greater than 0.5, since values of theta_0 and theta_1 less than 0.5 would mean that the labeling process performs worse than flipping a coin. Figs. 1-3 illustrate how CCC-Noise affects the bias of the asymptotic error rate of NDA and LR, respectively. If the CCC-Noise is ignored, both the LR and NDA procedures are positively biased. However, when the noise level is low, which is usually the case in practice, the asymptotic error rates of both procedures are only slightly affected. When Delta is small, LR and NDA have almost the same asymptotic error rate; as Delta increases, the difference between the asymptotic error rates increases, with that of LR always being larger. When theta_0 = theta_1, the AERB is zero for both NDA and LR, implying that the corresponding asymptotic error rates are both equal to the optimal value ER(lambda, Delta e_1).

We define the relative efficiency of LR to NDA as

    RE(pi_0, theta_0, theta_1, Delta, n) = E[ER(alpha^, beta^) - ER(lambda, Delta e_1)]^2 / E[ER(s^, t^) - ER(lambda, Delta e_1)]^2
                                         ~ [ AERB^2(alpha^, beta^) + omega^2_NDA/n ] / [ AERB^2(s^, t^) + omega^2_LR/n ].     (15)

The quantities in (15) are graphed in Figs. 4-6 for different sample sizes.
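Eq. (15) can also be approximated by direct simulation. The sketch below (ours; the sample size, Delta and theta values are illustrative, and scikit-learn's logistic regression stands in for (4)) repeatedly draws noisy training sets, fits both naive rules, evaluates their error rates with Eq. (6), and forms the ratio of mean squared excess error rates, with the NDA term in the numerator so that values above one favor LR.

    # Monte Carlo sketch (ours) of the comparison behind Eq. (15).
    import numpy as np
    from scipy.stats import norm
    from sklearn.linear_model import LogisticRegression

    def er(alpha, beta, delta, pi0):                       # Eq. (6)
        nb = np.sqrt(beta @ beta)
        return (pi0 * norm.cdf((-delta * beta[0] / 2 + alpha) / nb)
                + (1 - pi0) * norm.cdf((-delta * beta[0] / 2 - alpha) / nb))

    def relative_efficiency(n, p, delta, pi0, theta0, theta1, reps=500, seed=1):
        rng = np.random.default_rng(seed)
        pi1 = 1 - pi0
        er_opt = er(np.log(pi1 / pi0), delta * np.eye(p)[0], delta, pi0)
        sq_nda, sq_lr = [], []
        for _ in range(reps):
            y = rng.binomial(1, pi1, size=n)
            x = rng.standard_normal((n, p))
            x[:, 0] += np.where(y == 1, delta / 2, -delta / 2)
            flip = rng.uniform(size=n) > np.where(y == 1, theta1, theta0)
            yt = np.where(flip, 1 - y, y)
            if yt.min() == yt.max():                       # degenerate sample, skip
                continue
            x0, x1 = x[yt == 0], x[yt == 1]
            m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
            S = ((x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)) / n
            Si = np.linalg.inv(S)
            beta = Si @ (m1 - m0)
            alpha = np.log(len(x1) / len(x0)) - 0.5 * (m1 @ Si @ m1 - m0 @ Si @ m0)
            sq_nda.append((er(alpha, beta, delta, pi0) - er_opt) ** 2)
            fit = LogisticRegression(C=1e6).fit(x, yt)     # large C ~ no penalty
            sq_lr.append((er(fit.intercept_[0], fit.coef_[0], delta, pi0) - er_opt) ** 2)
        return np.mean(sq_nda) / np.mean(sq_lr)            # RE of LR to NDA as in (15)

    print(relative_efficiency(n=50, p=2, delta=1.0, pi0=0.5, theta0=1.0, theta1=0.8))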

These three graphs show that the relative efficiency increases monotonically as the noise level increases, which indicates that LR is less deteriorated by CCC-Noise than NDA. As the noise level goes to zero, the relative efficiencies converge to their minimal values, which differ from the numbers shown in the corresponding table of Efron's paper [16] because the relative efficiency employed there is defined as the ratio of the expected regrets of the two procedures rather than the ratio of the mean squared errors used here. It is also important to note that as Delta gets smaller, although NDA is still better, it is better only by a small margin. Furthermore, for sample size equal to 50 and the smallest value of Delta considered, the performance of LR can even become better than that of NDA as the misclassification probability increases. Though only Fig. 4 shows this latter aspect, for lower values of theta_0 some of the other curves would also rise above 1.

Fig. 1c. The asymptotic error rate bias with theta_0 = 1, Delta = 3 and pi_0 = 0.5.
Fig. 2a. The asymptotic error rate bias with theta_0 = 0.9, Delta = 1 and pi_0 = 0.5.
Fig. 2b. The asymptotic error rate bias with theta_0 = 0.9, Delta = 2 and pi_0 = 0.5.

Fig. 2c. The asymptotic error rate bias with theta_0 = 0.9, Delta = 3 and pi_0 = 0.5.
Fig. 3a. The asymptotic error rate bias with theta_0 = 0.8, Delta = 1 and pi_0 = 0.5.
Fig. 3b. The asymptotic error rate bias with theta_0 = 0.8, Delta = 2 and pi_0 = 0.5.

5. Summary

We have investigated the impact of CCC-Noise on the performance of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We compared the asymptotic distributions of the misclassification error rates of these two classifiers under CCC-Noise when the underlying distributions are multivariate normal. Typically, the AERB of both procedures is only slightly affected when the noise level is low. LR has a larger AERB than NDA when the two populations are more separated.

Fig. 3c. The asymptotic error rate bias with theta_0 = 0.8, Delta = 3 and pi_0 = 0.5.
Fig. 4. The relative efficiency of LR to NDA with theta_0 = 1, n = 50 and pi_0 = 0.5.
Fig. 5. The relative efficiency of LR to NDA with theta_0 = 1, n = 100 and pi_0 = 0.5.

With respect to the relative efficiency of LR to NDA, we showed that LR is less deteriorated by CCC-Noise than NDA. One important conclusion is that under CCC-Noise the Mahalanobis distance plays a vital role in determining the relative performance of the two procedures: when Delta is small, LR tends to be more tolerant of CCC-Noise than NDA. Our analyses provide insight into, and a more in-depth understanding of, the effect of noisily labeled observations on the accuracy of frequently used classification models, and conceivably our results can be used to guide interested researchers in the design of noise handling mechanisms.

Fig. 6. The relative efficiency of LR to NDA with theta_0 = 1, n = 1000 and pi_0 = 0.5.

In future work we will extend this efficiency comparison to a non-Gaussian assumption for the distribution of X.

Appendix A. Proof of Lemma 2

According to White's theorem, the misspecified MLEs (lambda^, mu^_0, mu^_1, Sigma^) of NDA converge to the values (lambda*, mu_0*, mu_1*, Sigma*) that minimize

    E_(X,Y~)[ (1 - Y~){ (p/2)log(2 pi) + (1/2)log|Sigma| + (1/2)(X - mu_0)' Sigma^{-1} (X - mu_0) - log pi_0 }
              + Y~{ (p/2)log(2 pi) + (1/2)log|Sigma| + (1/2)(X - mu_1)' Sigma^{-1} (X - mu_1) - log pi_1 } ],

where the expectation is with respect to the joint distribution of (X, Y~). Let L represent the inner term of the expectation. Taking the first partial derivatives with respect to lambda, mu_0, mu_1 and Sigma, and setting them to zero, gives

    E_(X,Y~)[ exp(lambda)/(1 + exp(lambda)) - Y~ ] = 0,                                                                  (16)
    E_(X,Y~)[ (1 - Y~) Sigma^{-1}(X - mu_0) ] = 0,                                                                       (17)
    E_(X,Y~)[ Y~ Sigma^{-1}(X - mu_1) ] = 0,                                                                             (18)
    E_(X,Y~)[ 2 Sigma - diag(Sigma) - (1 - Y~){ 2(X - mu_0)(X - mu_0)' - diag((X - mu_0)(X - mu_0)') }
              - Y~{ 2(X - mu_1)(X - mu_1)' - diag((X - mu_1)(X - mu_1)') } ] = 0,                                        (19)

the last equation being equivalent to Sigma = E_(X,Y~)[ (1 - Y~)(X - mu_0)(X - mu_0)' + Y~(X - mu_1)(X - mu_1)' ]. It is also easily shown that in the standard situation

    E_(X,Y~)[X] = (Delta/2)(pi_1 - pi_0) e_1,
    E_(X,Y~)[Y~] = pi_0(1 - theta_0) + pi_1 theta_1,
    E_(X,Y~)[Y~ X] = (Delta/2)[ theta_1 pi_1 - (1 - theta_0) pi_0 ] e_1,
    E_(X,Y~)[ (1 - Y~)(X - mu_0)(X - mu_0)' ]
        = pi_0 theta_0 [ I + ((Delta/2)e_1 + mu_0)((Delta/2)e_1 + mu_0)' ]
          + pi_1(1 - theta_1)[ I + ((Delta/2)e_1 - mu_0)((Delta/2)e_1 - mu_0)' ],                                        (20)
    E_(X,Y~)[ Y~(X - mu_1)(X - mu_1)' ]
        = pi_0(1 - theta_0)[ I + ((Delta/2)e_1 + mu_1)((Delta/2)e_1 + mu_1)' ]
          + pi_1 theta_1 [ I + ((Delta/2)e_1 - mu_1)((Delta/2)e_1 - mu_1)' ].                                            (21)

Substituting the expressions in (20) and (21) into (16)-(19), simplifying, and solving yields (lambda*, mu_0*, mu_1*, Sigma*). Here

    Sigma* = I + Delta^2 [ ab/(a + b) + cd/(c + d) ] E_11,

which indicates that sigma_(1)* = h e_1. The asymptotic variance-covariance matrix of (lambda^, mu^_0, mu^_1, sigma^_(1)) is Sigma_1 = A^{-1} B A^{-1}, where A and B are partitioned conformably with (lambda, mu_0, mu_1, sigma_(1)) into blocks A_00, A_01, ..., A_33 and B_00, B_01, ..., B_33; A_00 and B_00 are scalars, A_01, A_02, A_03 (and the corresponding B blocks) are 1 x p vectors, and the remaining blocks are p x p matrices.

First consider the elements of A. According to White's theorem,

    A_00 = E_(X,Y~)[ d^2 L/dlambda^2 ] evaluated at (lambda*, mu_0*, mu_1*, Sigma*) = exp(lambda*)/[1 + exp(lambda*)]^2,
    A_11 = E_(X,Y~)[ (1 - Y~) Sigma^{-1} ] at the limit = (a + b){ I + Delta^2 [ ab/(a + b) + cd/(c + d) ] E_11 }^{-1},
    A_22 = E_(X,Y~)[ Y~ Sigma^{-1} ] at the limit = (c + d){ I + Delta^2 [ ab/(a + b) + cd/(c + d) ] E_11 }^{-1},

and clearly A_01 = A_02 = A_03 = O and A_12 = A_21 = O. With respect to A_13 and A_23, for each element of sigma_(1) we have

    E_(X,Y~)[ d^2 L/dmu_0 dsigma_1j ] at the limit = -E_(X,Y~)[ (1 - Y~) Sigma^{-1}[(E_1j + E_j1)/(1 + delta_1j)] Sigma^{-1}(X - mu_0) ] = O,
    E_(X,Y~)[ d^2 L/dmu_1 dsigma_1j ] at the limit = -E_(X,Y~)[ Y~ Sigma^{-1}[(E_1j + E_j1)/(1 + delta_1j)] Sigma^{-1}(X - mu_1) ] = O,

which indicates that A_13 = A_23 = O. Now consider (19). Taking its first partial derivative with respect to sigma_ij and evaluating at (lambda*, mu_0*, mu_1*, Sigma*) shows that A_33 is a diagonal matrix, and evaluation gives

    A_33 = [1/(2h^2)] E_11 + (1/h)(I - E_11).

Hence A is the diagonal matrix

    A = diag( exp(lambda*)/[1 + exp(lambda*)]^2, (a + b)/h, (a + b), ..., (a + b), (c + d)/h, (c + d), ..., (c + d), 1/(2h^2), 1/h, ..., 1/h ).

Now consider the matrix B. According to White's theorem,

    B_00 = E_(X,Y~)[ (dL/dlambda)^2 ] at the limit = E_(X,Y~)[ ( exp(lambda*)/(1 + exp(lambda*)) - Y~ )^2 ] = exp(lambda*)/[1 + exp(lambda*)]^2,
    B_11 = E_(X,Y~)[ (dL/dmu_0)(dL/dmu_0)' ] at the limit = Sigma*^{-1} E_(X,Y~)[ (1 - Y~)(X - mu_0*)(X - mu_0*)' ] Sigma*^{-1}
         = Sigma*^{-1} [ (a + b) I + Delta^2 (ab/(a + b)) E_11 ] Sigma*^{-1},
    B_22 = E_(X,Y~)[ (dL/dmu_1)(dL/dmu_1)' ] at the limit = Sigma*^{-1} [ (c + d) I + Delta^2 (cd/(c + d)) E_11 ] Sigma*^{-1}.

It is also easily shown that B_01 = B_02 = O and B_12 = B_21 = O, the latter because (1 - Y~)Y~ is identically zero.

Next consider B_03. Only its first component is nonzero, and it equals

    E_(X,Y~)[ (dL/dlambda)(dL/dsigma_11) ] at the limit = (Delta^2/(2h^2)) (a + b)(c + d) [ cd/(c + d)^2 - ab/(a + b)^2 ].

Now consider B_33. Its off-diagonal elements vanish, and for the diagonal elements evaluation gives

    E_(X,Y~)[ (dL/dsigma_1j)^2 ] at the limit = 1/h    for j >= 2,
    E_(X,Y~)[ (dL/dsigma_11)^2 ] at the limit = (1/(4h^4)) Var[ (1 - Y~)(X_1 - mu*_{0,1})^2 + Y~(X_1 - mu*_{1,1})^2 ] = omega_33/(4h^4).

Finally consider B_13 and B_23. Only their (1, 1) components are nonzero, and evaluation gives

    B_13 = [ Delta^3 ab(a - b) / (2h^3(a + b)^2) ] E_11,    B_23 = [ Delta^3 cd(c - d) / (2h^3(c + d)^2) ] E_11.

Evaluation of A^{-1} B A^{-1} then gives the matrix Sigma_1.

Appendix B. Proof of Lemma 4

The asymptotic variance-covariance matrix of (s^, t^) is Sigma_3 = H^{-1} V H^{-1}, where

    H = [ H_00  H_01' ; H_01  H_11 ],    V = [ V_00  V_01' ; V_01  V_11 ],

H_00 and V_00 being scalars, H_01 and V_01 p x 1 vectors, and H_11 and V_11 p x p matrices. First consider the elements of H. Let l denote the inner term of the expectation in (10). Then

    H_00 = E_(X,Y~)[ d^2 l/dalpha^2 ] at (s*, u* e_1) = E_X[ exp(s + u X_1)/(1 + exp(s + u X_1))^2 ] = h_0.

For the ith component of H_01,

    E_(X,Y~)[ d^2 l/dalpha dbeta_i ] at (s*, u* e_1) = E_X[ X_i exp(s + u X_1)/(1 + exp(s + u X_1))^2 ],

so only the first component is nonzero, being equal to h_1. For the (i, j) component of H_11,

    E_(X,Y~)[ d^2 l/dbeta_i dbeta_j ] at (s*, u* e_1) = E_X[ X_i X_j exp(s + u X_1)/(1 + exp(s + u X_1))^2 ],

so only the diagonal components are nonzero; for i >= 2 they equal h_0 and for i = 1 they equal h_2. Thus

    H_11 = h_2 E_11 + h_0 (I - E_11).

Now consider the matrix V. We have

    V_00 = E_(X,Y~)[ (dl/dalpha)^2 ] at (s*, u* e_1) = E_(X,Y~)[ ( Y~ - exp(s + u X_1)/(1 + exp(s + u X_1)) )^2 ]
         = E_X[ ( [pi_1 theta_1 exp(Delta X_1) + pi_0(1 - theta_0)] / [pi_0 + pi_1 exp(Delta X_1)] )
                 ( 1 - 2 exp(s + u X_1)/(1 + exp(s + u X_1)) )
                 + ( exp(s + u X_1)/(1 + exp(s + u X_1)) )^2 ] = v_0.

For V_01, only the first component is nonzero and it equals v_1, and, similarly to H_11,

    V_11 = v_2 E_11 + v_0 (I - E_11).

Finally, evaluation of H^{-1} V H^{-1} gives the matrix Sigma_3.

References

[1] T. Krishnan, S.C. Nandy, Efficiency of discriminant analysis when initial samples are classified stochastically, Pattern Recognition 23 (1990).
[2] U.A. Katre, T. Krishnan, Pattern recognition with imperfect supervision, Pattern Recognition 22 (1989).
[3] J.E. Michalek, R.C. Tripathi, The effect of errors in diagnosis and measurement on the estimation of the probability of an event, Journal of the American Statistical Association 75 (1980).
[4] Y. Li, L. Wessels, D. De Ridder, M. Reinders, Classification in the presence of class noise using a probabilistic kernel Fisher method, Pattern Recognition 40 (2007).
[5] N.D. Lawrence, B. Scholkopf, Estimating a kernel Fisher discriminant in the presence of label noise, in: Proc. of the 18th ICML, 2001.
[6] Y. Yasui, M. Pepe, L. Hsu, B.L. Adam, Z. Feng, Partially supervised learning using an EM-boosting algorithm, Biometrics 60 (2004).
[7] W. Zhang, R. Rekaya, K. Bertrand, A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer, Bioinformatics 21 (2005).
[8] K. Robbins, S. Joseph, W. Zhang, R. Rekaya, J.K. Bertrand, Classification of incipient Alzheimer patients using gene expression data: dealing with potential misdiagnosis, Online Journal of Bioinformatics 7.
[9] J. Maletic, A. Marcus, Data cleansing: beyond integrity analysis, in: Proceedings of the Conference on Information Quality, 2000.
[10] S. Weisberg, Applied Linear Regression, John Wiley and Sons, Inc., 1980.
[11] S.W. Norton, H. Hirsh, Classifier learning from noisy data as probabilistic evidence combination, in: Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park, CA, 1992.
[12] L.S. Magder, J.P. Hughes, Logistic regression when the outcome is measured with uncertainty, American Journal of Epidemiology 146 (1997).
[13] M. Hofler, The effect of misclassification on the estimation of association: a review, International Journal of Methods in Psychiatric Research 14 (2005).
[14] J.M. Neuhaus, Bias and efficiency loss due to misclassified responses in binary regression, Biometrika 86 (1999).
[15] T. Krishnan, Efficiency of learning with imperfect supervision, Pattern Recognition 21 (1988).
[16] B. Efron, The efficiency of logistic regression compared to normal discriminant analysis, Journal of the American Statistical Association 70 (1975).
[17] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study of their impacts, Artificial Intelligence Review 22 (2004).
[18] A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in: NIPS, 2001.
[19] Y. Bi, Theoretical analysis of classification under CCC-Noise and its application to semi-supervised text classification, Ph.D. Thesis, University of California, Riverside.
[20] H. White, Maximum likelihood estimation of misspecified models, Econometrica 50 (1982).
[21] K.C. Li, N. Duan, Regression analysis under link violation, Annals of Statistics 17 (1989).


More information

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Linear Classification

Linear Classification Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative

More information

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

Feature Extraction with Weighted Samples Based on Independent Component Analysis

Feature Extraction with Weighted Samples Based on Independent Component Analysis Feature Extraction with Weighted Samples Based on Independent Component Analysis Nojun Kwak Samsung Electronics, Suwon P.O. Box 105, Suwon-Si, Gyeonggi-Do, KOREA 442-742, nojunk@ieee.org, WWW home page:

More information

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30 Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Chapter 17: Undirected Graphical Models

Chapter 17: Undirected Graphical Models Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)

More information

STA 414/2104, Spring 2014, Practice Problem Set #1

STA 414/2104, Spring 2014, Practice Problem Set #1 STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information