
Journal of Multivariate Analysis 101 (2010)

Contents lists available at ScienceDirect: Journal of Multivariate Analysis. Journal homepage: www.elsevier.com/locate/jmva

The efficiency of logistic regression compared to normal discriminant analysis under class-conditional classification noise

Yingtao Bi, Daniel R. Jeske
Department of Statistics, University of California, Riverside, CA, United States

Article history: Received 20 April 2009. Available online 18 March 2010.

AMS subject classifications: 62H30, 62F12, 62E20

Keywords: Class noise; Misclassification rate; Misspecified model; Asymptotic distribution

Abstract. In many real world classification problems, class-conditional classification noise (CCC-Noise) frequently deteriorates the performance of a classifier that is naively built by ignoring it. In this paper we investigate the impact of CCC-Noise on the quality of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We consider the problem of two multivariate normal populations having a common covariance matrix. We compare the asymptotic distribution of the misclassification error rate of these two classifiers under CCC-Noise. We show that when the noise level is low, the asymptotic error rates of both procedures are only slightly affected. We also show that LR is less deteriorated by CCC-Noise compared to NDA. Under CCC-Noise contexts, the Mahalanobis distance between the populations plays a vital role in determining the relative performance of these two procedures. In particular, when this distance is small, LR tends to be more tolerant of CCC-Noise compared to NDA. (c) 2010 Elsevier Inc. All rights reserved.

1. Introduction

In many real world classification problems, class label noise is inherent in the training dataset, and CCC-Noise is an important type of class label noise. Examples where CCC-Noise arises include remote sensing and traditional medical diagnosis (see [1-3]) and image classification (see [4,5]). Recent medical diagnosis technologies (e.g., microarray analyses and protein mass-spectrometry profiling) have also sparked interest in classifiers whose training data are prone to CCC-Noise (see, for example, [6,7] or [8]).

In many applications most analysts acknowledge that noise is present but naively ignore it. Alternatively, data preprocessing approaches have been proposed that attempt to remove mislabeled training observations; see [9] or [10]. These approaches, however, face the risk of removing useful data, which consequently would reduce the accuracy of the classifier. An alternative solution for handling noisy data is to construct noise-tolerant classifiers directly. Li et al. [4] incorporated the class noise into the model through a probabilistic noise model. Norton and Hirsh [11] presented a Bayesian approach to learning from noisy data in which prior knowledge of the noise process is used to compute posterior class probabilities. Yasui et al. [6] and Magder and Hughes [12] used the EM algorithm to handle label noise.

The effects of CCC-Noise on the estimation of the association between two random variables have been discussed widely in epidemiology; see [13]. One important conclusion is that the noise attenuates the estimated association. Neuhaus [14] gave a comprehensive discussion of the bias and efficiency loss due to CCC-Noise in the context of generalized linear models and proposed an approximation for the expected values of the estimators. The effects of CCC-Noise have also been studied in the area of pattern recognition. Krishnan [15] studied the efficiency
loss using Efron's Asymptotic Relative Efficiency criterion [16], and Michalek and Tripathi [3] bounded the asymptotic efficiency of logistic regression relative to normal discrimination.

Corresponding author. E-mail addresses: ybi00@ucr.edu (Y. Bi), daniel.jeske@ucr.edu (D.R. Jeske).

(c) 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jmva.2010.03.001

Zhu and Wu [17] investigated the impact of both class and attribute noise through simulation experiments and concluded that classification accuracy declines almost linearly as the noise level increases. Until now, limited research has been conducted to theoretically quantify the impact of CCC-Noise on misclassification rates.

Generally speaking, classifiers can be characterized as either generative or discriminative according to whether or not the distribution of the explanatory variables is modeled. The comparison of generative and discriminative classifiers has received considerable attention in the literature, and there is a widely held perception that, in the absence of CCC-Noise, the discriminative approach is more robust than the generative approach. For example, Ng and Jordan [18] showed that if the assumed conditional distributions for the explanatory variables are correct, then generative and discriminative classifiers have the same asymptotic error rates, although the generative classifiers approach their asymptotic values faster. However, if the conditional distributions are not correct, then discriminative classifiers have lower asymptotic error rates, provided the link function is modeled correctly. Efron [16] computed the relative efficiency, based on the ratio of expected regrets, of one popular generative classifier, normal discriminant analysis (NDA), to the corresponding discriminative classifier, logistic regression (LR), and concluded that LR is between one half and two thirds as efficient as NDA under normality. In this paper we aim to compare theoretically the misclassification error rates of NDA and LR under CCC-Noise.

The rest of this paper is organized as follows. In Section 2 we briefly review the NDA and LR classifiers and then discuss these two procedures under CCC-Noise. In Section 3 the asymptotic distributions of the misclassification error rates of NDA and LR under CCC-Noise are obtained. In Section 4 we compare the expected values of the misclassification error rates of these two procedures and also their relative efficiency. Section 5 concludes with a summary.

2. NDA and LR classifiers under CCC-Noise

We consider a binary classification task. Let X denote the vector of explanatory variables and suppose that X comes from one of two p-dimensional normal populations differing in mean but not in covariance:

    population 0: X ~ MVN_p(mu_0, Sigma) with probability pi_0,
    population 1: X ~ MVN_p(mu_1, Sigma) with probability pi_1,                                            (1)

where pi_0 + pi_1 = 1. If all parameters are known, a new observation may be classified on the basis of X as belonging to population 1 or 0 according as

    L_(alpha,beta)(X) = alpha + beta'X > or < 0,

where

    alpha = lambda - (mu_1' Sigma^{-1} mu_1 - mu_0' Sigma^{-1} mu_0)/2,    beta = Sigma^{-1}(mu_1 - mu_0),   (2)

with lambda = log(pi_1/pi_0). This linear classifier minimizes the total misclassification rate, as is easily shown by applying Bayes' theorem. However, the assumption that the parameters (lambda, mu_0, mu_1, Sigma) are known is rarely satisfied in practice. Therefore the parameters are usually estimated from a random sample of n observations {(x_1, y_1), ..., (x_n, y_n)}, where y_k is the binary class label of the kth sample. Parameter estimates are typically obtained by maximizing the log-likelihood function

    sum_{k=1}^n (1 - y_k){ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_0)' Sigma^{-1} (x_k - mu_0) + log pi_0 }
    + sum_{k=1}^n y_k{ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_1)' Sigma^{-1} (x_k - mu_1) + log pi_1 }.

Using the maximum likelihood estimators (MLEs) in place of the unknown parameters results in the so-called NDA procedure.
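To make the optimal rule in Eq. (2) concrete, the following short sketch (our own illustration, not part of the original paper) computes (alpha, beta) from known parameters and applies the resulting linear boundary; numpy is the only dependency and the p = 2 example values are arbitrary.

    # A minimal sketch (not from the paper) of the optimal linear rule in Eq. (2),
    # assuming all parameters are known.
    import numpy as np

    def optimal_rule(mu0, mu1, Sigma, pi1):
        """Return (alpha, beta) of L_(alpha,beta)(x) = alpha + beta'x from Eq. (2)."""
        Sigma_inv = np.linalg.inv(Sigma)
        lam = np.log(pi1 / (1.0 - pi1))                      # lambda = log(pi1/pi0)
        beta = Sigma_inv @ (mu1 - mu0)                       # beta = Sigma^{-1}(mu1 - mu0)
        alpha = lam - 0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)
        return alpha, beta

    def classify(x, alpha, beta):
        """Assign to population 1 when alpha + beta'x > 0, else population 0."""
        return int(alpha + beta @ x > 0)

    # Example: p = 2, standard situation with Delta = 2 and pi0 = pi1 = 0.5.
    mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
    alpha, beta = optimal_rule(mu0, mu1, np.eye(2), pi1=0.5)
    print(alpha, beta, classify(np.array([0.3, -0.2]), alpha, beta))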
The corresponding discriminative method to NDA is LR. LR is a popular discriminative method that does not assume any distribution for X; it simply assumes a parametric form for the conditional probability P(Y = 1 | X = x). The parameters (alpha, beta) are then estimated directly by maximizing the conditional log-likelihood

    sum_{k=1}^n y_k(alpha + beta'x_k) - sum_{k=1}^n log{1 + exp(alpha + beta'x_k)}.

In the absence of label noise, both NDA and LR give consistent estimates under (1). We now introduce CCC-Noise into these two procedures. For a binary classification problem, CCC-Noise means that the label Y (0 or 1) of any observation is independently and randomly flipped to 1 - Y with probability 1 - theta_Y, where theta_Y denotes the correct labeling rate.
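The next sketch (ours, with illustrative parameter values) shows the two naive procedures studied in this paper being fit to noisy labels: data are generated from (1) in the standard situation, labels are flipped according to the CCC-Noise mechanism just described, and the plug-in NDA rule and an approximately unpenalized scikit-learn logistic regression, standing in for the conditional likelihood in (4), are fit while ignoring the noise.

    # Hedged illustration (ours): generate data, corrupt labels with CCC-Noise,
    # and fit the naive NDA and LR classifiers that ignore the noise.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def simulate(n, p, delta, pi1, theta0, theta1):
        y = rng.binomial(1, pi1, size=n)                       # true labels
        mu = np.zeros(p); mu[0] = delta / 2.0
        x = rng.standard_normal((n, p)) + np.where(y[:, None] == 1, mu, -mu)
        flip = rng.uniform(size=n) > np.where(y == 1, theta1, theta0)
        y_tilde = np.where(flip, 1 - y, y)                     # noisy labels
        return x, y, y_tilde

    def nda_fit(x, y_tilde):
        """Plug-in (misspecified) MLEs maximizing (3): group means, pooled covariance."""
        x0, x1 = x[y_tilde == 0], x[y_tilde == 1]
        mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
        S = ((x0 - mu0).T @ (x0 - mu0) + (x1 - mu1).T @ (x1 - mu1)) / len(x)
        S_inv = np.linalg.inv(S)
        lam = np.log(len(x1) / len(x0))
        beta = S_inv @ (mu1 - mu0)
        alpha = lam - 0.5 * (mu1 @ S_inv @ mu1 - mu0 @ S_inv @ mu0)
        return alpha, beta

    x, y, y_tilde = simulate(n=500, p=2, delta=2.0, pi1=0.5, theta0=0.9, theta1=0.8)
    alpha_hat, beta_hat = nda_fit(x, y_tilde)
    lr = LogisticRegression(C=1e6).fit(x, y_tilde)             # large C ~ no penalty, Eq. (4)
    print(alpha_hat, beta_hat, lr.intercept_, lr.coef_)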

Here Y denotes the true class label, but instead of observing Y we observe the noisy class label Y~. The CCC-Noise can be specified as

    P(Y~ = 1 | Y = 0) = 1 - theta_0,    P(Y~ = 0 | Y = 1) = 1 - theta_1,

and the observed training data are {(x_1, y~_1), ..., (x_n, y~_n)}. For the NDA procedure the misspecified log-likelihood is

    sum_{k=1}^n (1 - y~_k){ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_0)' Sigma^{-1} (x_k - mu_0) + log pi_0 }
    + sum_{k=1}^n y~_k{ -(p/2)log(2 pi) - (1/2)log|Sigma| - (1/2)(x_k - mu_1)' Sigma^{-1} (x_k - mu_1) + log pi_1 }.     (3)

The MLEs obtained by maximizing (3) are termed misspecified MLEs. Similarly, the misspecified MLEs for the LR procedure are obtained by maximizing

    sum_{k=1}^n y~_k(alpha + beta'x_k) - sum_{k=1}^n log{1 + exp(alpha + beta'x_k)}.                                     (4)

3. Asymptotic distribution of the error rate

Let (alpha^, beta^) denote an arbitrary estimate of (alpha, beta) based upon the training data set. For a new observation (X, Y), the probability of a misclassification error (calculated with respect to the joint distribution of the training data and the new observation) is defined to be

    ER(alpha^, beta^) = P(L_(alpha^,beta^)(X) > 0, Y = 0) + P(L_(alpha^,beta^)(X) < 0, Y = 1).

If (alpha^, beta^) is randomly determined (as in NDA or LR), ER(alpha^, beta^) is a random variable. Efron [16] observed that the distribution of ER(alpha^, beta^) is invariant under a linear transformation that allows reduction of assumption (1) to the following canonical form:

    population 0: X ~ MVN_p(-(Delta/2) e_1, I) with probability pi_0,
    population 1: X ~ MVN_p((Delta/2) e_1, I) with probability pi_1,                                                     (5)

where Delta^2 = (mu_1 - mu_0)' Sigma^{-1} (mu_1 - mu_0), so that Delta is the square root of the Mahalanobis distance, I is the p x p identity matrix and e_1 = (1, 0, ..., 0)' is a p x 1 vector. For either LR or NDA, the probability of misclassification has the same distribution under (1) as it does under the so-called standard situation (5). A formal proof of this statement is presented in [19], where it is also shown that the result continues to hold in the presence of CCC-Noise. Henceforth we work with the standard situation.

In the standard situation, the conditional (given the training data) probability of misclassification of an arbitrary estimated classification boundary L_(alpha^,beta^)(x) = alpha^ + beta^'x = 0 is

    ER(alpha^, beta^) = pi_0 Phi( (-(Delta/2) beta^_1 + alpha^) / (beta^'beta^)^{1/2} )
                      + pi_1 Phi( (-(Delta/2) beta^_1 - alpha^) / (beta^'beta^)^{1/2} ),                                 (6)

where beta^_1 denotes the first component of beta^ and Phi(.) is the cumulative distribution function of the standard normal distribution.
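Eq. (6) is easy to evaluate numerically. The sketch below (ours) does so with scipy's normal CDF playing the role of Phi; the check in the final line assumes pi_0 = 0.5, in which case lambda = 0 and the minimum error rate is Phi(-Delta/2).

    # Sketch (ours) of the conditional error rate in Eq. (6) for the standard situation (5).
    import numpy as np
    from scipy.stats import norm

    def error_rate(alpha, beta, delta, pi0):
        """Eq. (6): misclassification probability of the boundary alpha + beta'x = 0."""
        beta = np.asarray(beta, dtype=float)
        norm_beta = np.sqrt(beta @ beta)
        b1 = beta[0]
        return (pi0 * norm.cdf((-0.5 * delta * b1 + alpha) / norm_beta)
                + (1 - pi0) * norm.cdf((-0.5 * delta * b1 - alpha) / norm_beta))

    # With the optimal rule (alpha, beta) = (lambda, Delta e_1) and pi0 = 0.5,
    # this reduces to Phi(-Delta/2), about 0.1587 for Delta = 2.
    print(error_rate(alpha=0.0, beta=[2.0, 0.0], delta=2.0, pi0=0.5))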

Now suppose the arbitrary estimate (alpha^, beta^) has a limiting normal distribution

    sqrt(n){ (alpha^, beta^')' - (alpha*, beta*')' }  ->_L  MVN_{p+1}( 0, [ z_00  z_01' ; z_01  Z_11 ] ),                 (7)

where z_00 is a scalar, z_01 is a column p-vector and Z_11 is a p x p matrix. Differentiating (6) gives

    dER(alpha^, beta^)/dalpha^ = f_0(alpha^, beta^)
        = pi_0 (beta^'beta^)^{-1/2} phi( (-(Delta/2)beta^_1 + alpha^)/(beta^'beta^)^{1/2} )
        - pi_1 (beta^'beta^)^{-1/2} phi( (-(Delta/2)beta^_1 - alpha^)/(beta^'beta^)^{1/2} ),

    dER(alpha^, beta^)/dbeta^ = f_1(alpha^, beta^)
        = pi_0 [ -(Delta/2)(beta^'beta^)^{-1/2} e_1 - (-(Delta/2)beta^_1 + alpha^)(beta^'beta^)^{-3/2} beta^ ] phi( (-(Delta/2)beta^_1 + alpha^)/(beta^'beta^)^{1/2} )
        + pi_1 [ -(Delta/2)(beta^'beta^)^{-1/2} e_1 - (-(Delta/2)beta^_1 - alpha^)(beta^'beta^)^{-3/2} beta^ ] phi( (-(Delta/2)beta^_1 - alpha^)/(beta^'beta^)^{1/2} ),

where phi(.) is the standard normal density function. Defining

    omega^2(alpha^, beta^) = f_0(alpha^, beta^)^2 z_00 + f_1(alpha^, beta^)' Z_11 f_1(alpha^, beta^) + 2 f_0(alpha^, beta^) z_01' f_1(alpha^, beta^),

the following lemma follows from an application of the delta method.

Lemma 1. In the standard situation, if (alpha^, beta^) has the limiting distribution given in (7) with (alpha*, beta*) != (alpha, beta), then

    sqrt(n)[ ER(alpha^, beta^) - ER(alpha*, beta*) ]  ->_L  N(0, omega^2(alpha*, beta*)).

Note that if (alpha*, beta*) were the true values of (alpha, beta), then f_0(alpha*, beta*) and f_1(alpha*, beta*) would both vanish, since ER(alpha^, beta^) is minimized at that point. In that situation Lemma 1 would not be valid, because the first-order Taylor expansion is not enough; Efron [16] computed the efficiency of LR relative to NDA without CCC-Noise from a second-order Taylor expansion. In the presence of CCC-Noise, Lemma 1 is valid since (alpha*, beta*) is not equal to the true value.

In order to use Lemma 1 to obtain the asymptotic distribution of ER(alpha^, beta^), we first need the asymptotic distribution of (alpha^, beta^). To this end we utilize White's theorem [20].

White's Theorem. Suppose iid data x_1, x_2, ..., x_n come from a true distribution p(x), but we assume a family of distributions p(x | theta), where theta is an indexing parameter. Then, under suitable regularity conditions, the maximum likelihood estimator of theta converges almost surely to the value theta* that minimizes int p(x) log[ p(x)/p(x | theta) ] dx, the Kullback-Leibler distance of p(x | theta) from p(x). White also shows that the sequence of MLEs theta^_n is asymptotically multivariate normal in the sense that sqrt(n)(theta^_n - theta*) -> MVN(0, C(theta*)), where the covariance matrix C(theta*) equals A(theta*)^{-1} B(theta*) A(theta*)^{-1}, with A(theta) and B(theta) matrices whose (i, j)th elements are

    A_ij(theta) = E_{p(x)}[ d^2 log p(x | theta)/dtheta_i dtheta_j ],
    B_ij(theta) = E_{p(x)}[ (d log p(x | theta)/dtheta_i)(d log p(x | theta)/dtheta_j) ].
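A generic numerical sketch of White's sandwich covariance (ours, not the analytic evaluation carried out in this paper) may help fix ideas: for a logistic working model, the per-observation scores give B, a finite-difference derivative of the mean score gives A, and A^{-1} B A^{-1}/n approximates the covariance of the misspecified MLE whether or not the working model is correct.

    # Sketch (ours) of the sandwich covariance A^{-1} B A^{-1} for a logistic working model,
    # estimated empirically with finite differences for the derivatives.
    import numpy as np

    def score(theta, x, y):
        """Per-observation score of the logistic working model log p(y | x, theta)."""
        alpha, beta = theta[0], theta[1:]
        p = 1.0 / (1.0 + np.exp(-(alpha + x @ beta)))
        resid = (y - p)[:, None]
        return np.hstack([resid, resid * x])                 # n x (p+1) matrix of scores

    def sandwich(theta, x, y, eps=1e-5):
        theta = np.asarray(theta, dtype=float)
        n, k = len(y), len(theta)
        s = score(theta, x, y)
        B = s.T @ s / n                                      # E[score score']
        A = np.zeros((k, k))                                 # E[-d score / d theta]
        for j in range(k):
            t_plus = theta.copy(); t_plus[j] += eps
            t_minus = theta.copy(); t_minus[j] -= eps
            A[:, j] = -(score(t_plus, x, y) - score(t_minus, x, y)).mean(axis=0) / (2 * eps)
        A_inv = np.linalg.inv(A)
        return A_inv @ B @ A_inv / n                         # approximate Cov(theta_hat)

    # Toy usage with synthetic data; the evaluation point is illustrative only.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2000, 2))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x[:, 0])))
    print(sandwich(np.array([0.0, 1.0, 0.0]), x, y))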

3.1. Asymptotic distribution of the error rate of NDA

First we consider the misspecified NDA procedure. Let (alpha^, beta^) and (lambda^, mu^_0, mu^_1, Sigma^) denote, respectively, the estimates of (alpha, beta) and (lambda, mu_0, mu_1, Sigma) based on the misspecified NDA procedure. Write the distinct elements of Sigma^ as a p(p+1)/2 x 1 vector (sigma^_11, sigma^_12, ..., sigma^_1p, sigma^_22, sigma^_23, ..., sigma^_pp)' and partition this vector as (sigma^_(1)', sigma^_(2)')', where sigma^_(1) = (sigma^_11, sigma^_12, ..., sigma^_1p)' and sigma^_(2) = (sigma^_22, sigma^_23, ..., sigma^_pp)'. For convenience we first introduce the following expressions:

    a = pi_0 theta_0,    b = pi_1 (1 - theta_1),    c = pi_0 (1 - theta_0),    d = pi_1 theta_1,
    h = 1 + Delta^2 [ ab/(a + b) + cd/(c + d) ].

We also use the following notation: E_ij is the p x p matrix with a one in the (i, j)th position and zeros elsewhere, O denotes a zero matrix of the indicated dimensions, and delta_ij = 1 or 0 according as i = j or i != j. As described earlier, the asymptotic distribution of the error rate is the same under (1) as it is under (5), so we consider the asymptotic distribution of the misspecified MLEs in the standard situation. It will also be seen that we do not need the limiting distribution of sigma^_(2) to derive the asymptotic distribution of the NDA based estimates (alpha^, beta^).

Lemma 2. In the standard situation under CCC-Noise, the NDA produces misspecified estimates (lambda^, mu^_0, mu^_1, sigma^_(1)), obtained by maximizing (3), which satisfy

    sqrt(n)[ (lambda^ - lambda*, (mu^_0 - mu_0*)', (mu^_1 - mu_1*)', (sigma^_(1) - sigma_(1)*)')' ]  ->_L  MVN_{3p+1}(0, Sigma_1),

where

    lambda* = log[(c + d)/(a + b)],
    mu_0*   = (Delta/2) [(b - a)/(b + a)] e_1,
    mu_1*   = (Delta/2) [(d - c)/(d + c)] e_1,
    sigma_(1)* = h e_1,

and

    Sigma_1 = [ omega_00       O                                       O                                       omega_03 e_1'
                O              omega_11 E_11 + (a+b)^{-1}(I - E_11)    O                                       omega_13 E_11
                O              O                                       omega_22 E_11 + (c+d)^{-1}(I - E_11)    omega_23 E_11
                omega_03 e_1   omega_13 E_11                           omega_23 E_11                           omega_33 E_11 + h(I - E_11) ],

with

    omega_00 = [1 + exp(lambda*)]^2 / exp(lambda*),
    omega_11 = 1/(a + b) + Delta^2 ab/(a + b)^3,
    omega_22 = 1/(c + d) + Delta^2 cd/(c + d)^3,
    omega_33 = 2h^2 + Delta^4 [ (a^4 b + a b^4)/(a + b)^4 + (c^4 d + c d^4)/(c + d)^4 ] - 3 Delta^4 [ ab/(a + b) + cd/(c + d) ]^2,
    omega_03 = Delta^2 [ cd/(c + d)^2 - ab/(a + b)^2 ],
    omega_13 = Delta^3 ab(a - b)/(a + b)^3,
    omega_23 = Delta^3 cd(c - d)/(c + d)^3.

Proof: see Appendix A.

Lemma 2 gives the asymptotic distribution of the misspecified MLEs lambda^, mu^_0, mu^_1 and sigma^_(1). Following Efron's derivations [16], we use the multivariate delta method to obtain Lemma 3, which uses the following definitions:

    m_01 = (Delta/2) (b - a)/[(b + a) h],      m_02 = (Delta/2) (c - d)/[(d + c) h],
    m_03 = (Delta^2/(8h^2)) [ ((d - c)/(d + c))^2 - ((b - a)/(b + a))^2 ],
    m_13 = (Delta/(2h^2)) [ (b - a)/(b + a) - (d - c)/(d + c) ].

Lemma 3. In the standard situation under CCC-Noise, the NDA produces misspecified estimates (alpha^, beta^) satisfying

    sqrt(n){ (alpha^, beta^')' - (alpha*, beta*')' }  ->_L  MVN_{p+1}(0, Sigma_2),

where

    alpha* = log[(c + d)/(a + b)] - (Delta^2/(8h)) [ ((d - c)/(d + c))^2 - ((b - a)/(b + a))^2 ],
    beta*  = Sigma*^{-1}(mu_1* - mu_0*) = [ Delta(ad - bc) / (h(b + a)(d + c)) ] e_1,

and

    Sigma_2 = [ kappa_00       kappa_01 e_1'
                kappa_01 e_1   kappa_11 E_11 + ( [(a + b)(c + d)]^{-1} + m_13^2 h^3 )(I - E_11) ],

with

    kappa_00 = omega_00 + m_01^2 omega_11 + m_02^2 omega_22 + m_03^2 omega_33 + 2 m_03 omega_03 + 2 m_01 m_03 omega_13 + 2 m_02 m_03 omega_23,
    kappa_01 = m_13 omega_03 - (m_01/h) omega_11 + (m_02/h) omega_22 + m_03 m_13 omega_33 + (m_01 m_13 - m_03/h) omega_13 + (m_02 m_13 + m_03/h) omega_23,
    kappa_11 = (omega_11 + omega_22)/h^2 + m_13^2 omega_33 - (2 m_13/h) omega_13 + (2 m_13/h) omega_23.
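Before turning to the proof, the limits in Lemma 2 can be checked quickly by simulation (a sketch of ours, with illustrative parameter values): the group means and the pooled first-coordinate variance of a large noisy-label sample should be close to mu_0*, mu_1* and h. All symbols follow the a, b, c, d notation above.

    # Monte Carlo check (ours) of the Lemma 2 limits in the standard situation.
    import numpy as np

    rng = np.random.default_rng(0)
    pi0, theta0, theta1, delta, n = 0.5, 0.9, 0.8, 2.0, 1_000_000

    y = rng.binomial(1, 1 - pi0, size=n)
    x1 = rng.standard_normal(n) + np.where(y == 1, delta / 2, -delta / 2)
    flip = rng.uniform(size=n) > np.where(y == 1, theta1, theta0)
    yt = np.where(flip, 1 - y, y)

    a, b = pi0 * theta0, (1 - pi0) * (1 - theta1)         # P(Y=0, Y~=0), P(Y=1, Y~=0)
    c, d = pi0 * (1 - theta0), (1 - pi0) * theta1         # P(Y=0, Y~=1), P(Y=1, Y~=1)
    mu0_star = 0.5 * delta * (b - a) / (b + a)
    mu1_star = 0.5 * delta * (d - c) / (d + c)
    h = 1 + delta**2 * (a * b / (a + b) + c * d / (c + d))

    m0_hat, m1_hat = x1[yt == 0].mean(), x1[yt == 1].mean()
    s11_hat = (np.sum((x1[yt == 0] - m0_hat) ** 2) + np.sum((x1[yt == 1] - m1_hat) ** 2)) / n
    print(m0_hat, mu0_star, m1_hat, mu1_star, s11_hat, h)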

Proof. Differentiating (2) gives

    dalpha/dlambda = 1,    dbeta/dlambda = 0,
    dalpha/dmu_0 = Sigma^{-1} mu_0,    dbeta/dmu_0 = -Sigma^{-1},
    dalpha/dmu_1 = -Sigma^{-1} mu_1,    dbeta/dmu_1 = Sigma^{-1},
    dalpha/dsigma_ij = (mu~_1i mu~_1j - mu~_0i mu~_0j)/(1 + delta_ij),
    dbeta/dsigma_ij = -Sigma^{-1}[(E_ij + E_ji)/(1 + delta_ij)] Sigma^{-1}(mu_1 - mu_0),

where mu~_0 = Sigma^{-1} mu_0 and mu~_1 = Sigma^{-1} mu_1, with mu~_0i denoting the ith component of mu~_0 and likewise for mu~_1i. In the standard situation we therefore have the following first-order relationship, obtained by expanding (alpha^, beta^) around (alpha*, beta*):

    (alpha^ - alpha*, (beta^ - beta*)')'
        ~ [ 1    m_01 e_1'      m_02 e_1'    m_03 e_1'                            0'
            0    -Sigma*^{-1}   Sigma*^{-1}  m_13 E_11 + h m_13 (I - E_11)        O ]
          x (lambda^ - lambda*, (mu^_0 - mu_0*)', (mu^_1 - mu_1*)', (sigma^_(1) - sigma_(1)*)', (sigma^_(2) - sigma_(2)*)')'.    (8)

Let M be the matrix on the right side of (8) ignoring the last column of blocks, whose entries vanish at the limit. The multivariate delta method then gives the covariance matrix of (alpha^, beta^) as M Sigma_1 M', and evaluation of M Sigma_1 M' gives Sigma_2.

The asymptotic distribution of ER(alpha^, beta^) now follows.

Theorem 1. sqrt(n)[ ER(alpha^, beta^) - ER(alpha*, beta*_1 e_1) ]  ->_L  N(0, omega^2_NDA), where

    ER(alpha*, beta*_1 e_1) = pi_0 Phi(-Delta/2 + alpha*/beta*_1) + pi_1 Phi(-Delta/2 - alpha*/beta*_1),

    omega^2_NDA = f_0(alpha*, beta*_1 e_1)^2 kappa_00 + kappa_11 [f_1(alpha*, beta*_1 e_1)'e_1]^2
                  + 2 f_0(alpha*, beta*_1 e_1) kappa_01 f_1(alpha*, beta*_1 e_1)'e_1,

with

    f_0(alpha*, beta*_1 e_1) = (pi_0/beta*_1) phi(-Delta/2 + alpha*/beta*_1) - (pi_1/beta*_1) phi(-Delta/2 - alpha*/beta*_1),
    f_1(alpha*, beta*_1 e_1)'e_1 = (alpha*/beta*_1^2)[ pi_1 phi(-Delta/2 - alpha*/beta*_1) - pi_0 phi(-Delta/2 + alpha*/beta*_1) ],

and

    alpha*/beta*_1 = [ h(b + a)(d + c) / (Delta(ad - bc)) ] log[(c + d)/(a + b)] - (Delta/4)[ (d - c)/(d + c) + (b - a)/(b + a) ].

Proof. The result follows immediately from Lemmas 1 and 3.

The minimal value of the error rate is ER(lambda, Delta e_1) = pi_0 Phi(-Delta/2 + lambda/Delta) + pi_1 Phi(-Delta/2 - lambda/Delta). Hence the error rate of the misspecified NDA procedure converges to ER(lambda, Delta e_1) if the following condition is satisfied:

    [ h(b + a)(d + c) / (ad - bc) ] log[(c + d)/(a + b)] - (Delta^2/4)[ (d - c)/(d + c) + (b - a)/(b + a) ] = lambda.     (9)

When pi_0 = pi_1, it is easily seen that this equation holds if theta_0 = theta_1.
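As a numerical companion to Theorem 1 and condition (9) (ours, using the plug-in limits of Lemmas 2 and 3 as written above), the sketch below evaluates the asymptotic error rate of the misspecified NDA rule and confirms that it equals Phi(-Delta/2) when pi_0 = pi_1 and theta_0 = theta_1, and exceeds it otherwise.

    # Sketch (ours): asymptotic NDA error rate under CCC-Noise via the plug-in limits.
    import numpy as np
    from scipy.stats import norm

    def nda_asymptotic_er(pi0, theta0, theta1, delta):
        pi1 = 1 - pi0
        a, b = pi0 * theta0, pi1 * (1 - theta1)
        c, d = pi0 * (1 - theta0), pi1 * theta1
        lam = np.log((c + d) / (a + b))
        m0 = 0.5 * delta * (b - a) / (b + a)              # first coordinate of mu_0*
        m1 = 0.5 * delta * (d - c) / (d + c)              # first coordinate of mu_1*
        h = 1 + delta**2 * (a * b / (a + b) + c * d / (c + d))
        beta1 = (m1 - m0) / h                             # beta* = beta1* e_1; > 0 for theta's > 0.5
        alpha = lam - (m1**2 - m0**2) / (2 * h)
        ratio = alpha / beta1
        return pi0 * norm.cdf(-delta / 2 + ratio) + pi1 * norm.cdf(-delta / 2 - ratio)

    print(nda_asymptotic_er(0.5, 0.8, 0.8, delta=2.0))    # equals Phi(-1) ~ 0.1587
    print(nda_asymptotic_er(0.5, 0.9, 0.7, delta=2.0))    # strictly larger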

3.2. Asymptotic distribution of the error rate of LR

According to White's theorem, the misspecified MLEs obtained by maximizing (4), say (s^, t^), are asymptotically normal and converge almost surely to the value (s*, t*) that minimizes

    E_(X,Y~)[ -Y~(alpha + beta'X) + log{1 + exp(alpha + beta'X)} - log phi_X(X) ],                                       (10)

where phi_X(x) is the density function of X under (5). Taking the first partial derivatives with respect to alpha and beta and setting them to zero gives

    E_(X,Y~)[Y~] - E_X[ exp(s + t'X)/(1 + exp(s + t'X)) ] = 0,                                                           (11)
    E_(X,Y~)[Y~ X] - E_X[ X exp(s + t'X)/(1 + exp(s + t'X)) ] = 0.                                                       (12)

Neuhaus [14] showed that ignoring CCC-Noise in LR can be viewed as a kind of link function violation, and Li and Duan [21] showed that the estimated slope t^ from LR with a misspecified link function consistently estimates the true parameter beta up to a scale factor. In the standard situation beta = Delta e_1, so t* = u* e_1 with u* a scalar. Equations (11) and (12) then simplify to

    g_1(pi_0, theta_0, theta_1) = E_X[ exp(s + u X_1)/(1 + exp(s + u X_1)) ],                                            (13)
    g_2(pi_0, theta_0, theta_1) = E_X[ X_1 exp(s + u X_1)/(1 + exp(s + u X_1)) ],                                        (14)

with

    g_1(pi_0, theta_0, theta_1) = E_(X,Y~)[Y~]
        = theta_1 E_X[ pi_1 exp(Delta X_1)/(pi_0 + pi_1 exp(Delta X_1)) ] + (1 - theta_0) E_X[ pi_0/(pi_0 + pi_1 exp(Delta X_1)) ],
    g_2(pi_0, theta_0, theta_1) = E_(X,Y~)[Y~ X_1]
        = theta_1 E_X[ X_1 pi_1 exp(Delta X_1)/(pi_0 + pi_1 exp(Delta X_1)) ] + (1 - theta_0) E_X[ X_1 pi_0/(pi_0 + pi_1 exp(Delta X_1)) ],

and X_1 denoting the first component of X. Eqs. (13) and (14) are solved by combining Monte-Carlo integration with the Newton-Raphson algorithm. The two unknowns (s*, u*) depend only on the values of (pi_0, theta_0, theta_1, Delta); instead of writing s*(pi_0, theta_0, theta_1, Delta) and u*(pi_0, theta_0, theta_1, Delta) we simply write (s*, u*) for convenience.

Lemma 4. In the standard situation under CCC-Noise, the LR produces misspecified estimates (s^, t^) satisfying

    sqrt(n){ (s^, t^')' - (s*, u* e_1')' }  ->_L  MVN_{p+1}(0, Sigma_3),

where (s*, u*) is the solution of (13) and (14),

    Sigma_3 = [ psi_00       psi_01 e_1'
                psi_01 e_1   psi_11 E_11 + (v_0/h_0^2)(I - E_11) ],

with

    psi_00 = [ h_2^2 v_0 - 2 h_1 h_2 v_1 + h_1^2 v_2 ] / (h_0 h_2 - h_1^2)^2,
    psi_01 = [ -h_1 h_2 v_0 + (h_0 h_2 + h_1^2) v_1 - h_0 h_1 v_2 ] / (h_0 h_2 - h_1^2)^2,
    psi_11 = [ h_1^2 v_0 - 2 h_0 h_1 v_1 + h_0^2 v_2 ] / (h_0 h_2 - h_1^2)^2,

and

    h_i(pi_0, theta_0, theta_1) = int [ exp(s + u x_1)/(1 + exp(s + u x_1))^2 ] x_1^i phi_X(x) dx,    i = 0, 1, 2,
    v_i(pi_0, theta_0, theta_1) = int { [pi_1 theta_1 exp(Delta x_1) + pi_0(1 - theta_0)] / [pi_0 + pi_1 exp(Delta x_1)]
                                        x [ 1 - 2 exp(s + u x_1)/(1 + exp(s + u x_1)) ]
                                        + [ exp(s + u x_1)/(1 + exp(s + u x_1)) ]^2 } x_1^i phi_X(x) dx,    i = 0, 1, 2.

Proof: see Appendix B.

Combining Lemmas 1 and 4 proves the following theorem on the asymptotic distribution of ER(s^, t^).

Theorem 2. sqrt(n)[ ER(s^, t^) - ER(s*, u* e_1) ]  ->_L  N(0, omega^2_LR), where

    ER(s*, u* e_1) = pi_0 Phi(-Delta/2 + s*/u*) + pi_1 Phi(-Delta/2 - s*/u*),

    omega^2_LR = f_0(s*, u* e_1)^2 psi_00 + psi_11 [f_1(s*, u* e_1)'e_1]^2 + 2 f_0(s*, u* e_1) psi_01 f_1(s*, u* e_1)'e_1,

with

    f_0(s*, u* e_1) = (pi_0/u*) phi(-Delta/2 + s*/u*) - (pi_1/u*) phi(-Delta/2 - s*/u*),
    f_1(s*, u* e_1)'e_1 = (s*/u*^2)[ pi_1 phi(-Delta/2 - s*/u*) - pi_0 phi(-Delta/2 + s*/u*) ].
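The numerical step described above (Monte Carlo integration combined with a root finder) can be sketched as follows. This is our own illustration, with scipy.optimize.fsolve standing in for the Newton-Raphson iteration mentioned in the text and with common random draws used for both sides of (13) and (14); parameter values are illustrative.

    # Sketch (ours): solve the simplified equations (13)-(14) for (s*, u*).
    import numpy as np
    from scipy.optimize import fsolve

    def solve_s_u(pi0, theta0, theta1, delta, m=200_000, seed=0):
        rng = np.random.default_rng(seed)
        pi1 = 1.0 - pi0
        # Draws of X_1 from the marginal mixture under the standard situation (5).
        y = rng.binomial(1, pi1, size=m)
        x1 = rng.standard_normal(m) + np.where(y == 1, delta / 2, -delta / 2)
        # P(Ytilde = 1 | X) = theta1*P(Y=1|X) + (1-theta0)*P(Y=0|X).
        p1 = pi1 * np.exp(delta * x1) / (pi0 + pi1 * np.exp(delta * x1))
        q = theta1 * p1 + (1 - theta0) * (1 - p1)
        g1, g2 = q.mean(), (q * x1).mean()                 # left-hand sides of (13)-(14)

        def equations(params):
            s, u = params
            p = 1.0 / (1.0 + np.exp(-(s + u * x1)))
            return [p.mean() - g1, (p * x1).mean() - g2]

        return fsolve(equations, x0=[0.0, delta])          # start from the no-noise values

    print(solve_s_u(pi0=0.5, theta0=0.9, theta1=0.8, delta=2.0))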

Fig. 1a. The asymptotic error rate bias with theta_0 = 1, Delta = 1 and pi_0 = 0.5.
Fig. 1b. The asymptotic error rate bias with theta_0 = 1, Delta = 2 and pi_0 = 0.5.

4. Comparison and discussion

Define the bias of the asymptotic error rate of a classification rule by

    AERB(alpha^, beta^) = lim_{n->inf} E[ER(alpha^, beta^)] - ER(lambda, Delta e_1) = ER(alpha*, beta*_1 e_1) - ER(lambda, Delta e_1),
    AERB(s^, t^) = lim_{n->inf} E[ER(s^, t^)] - ER(lambda, Delta e_1) = ER(s*, u* e_1) - ER(lambda, Delta e_1).

We assume that both theta_0 and theta_1 are greater than 0.5, since values of theta_0 and theta_1 less than 0.5 would mean that the labeling process performs worse than flipping a coin. Figs. 1-3 illustrate how CCC-Noise affects the bias of the asymptotic error rate of NDA and LR, respectively. If the CCC-Noise is ignored, both the LR and NDA procedures are positively biased. However, when the noise level is low, which is usually the case in practice, the asymptotic error rates of both procedures are only slightly affected. When Delta is small, LR and NDA have almost the same asymptotic error rate; as Delta increases, the difference between the asymptotic error rates increases, with that of LR always being larger. When theta_0 = theta_1, the AERB is zero for both NDA and LR, implying that the corresponding asymptotic error rates are both equal to the optimal value ER(lambda, Delta e_1).

We define the relative efficiency of LR to NDA as

    RE(pi_0, theta_0, theta_1, Delta, n) = E[ER(alpha^, beta^) - ER(lambda, Delta e_1)]^2 / E[ER(s^, t^) - ER(lambda, Delta e_1)]^2
                                         ~ [ AERB^2(alpha^, beta^) + omega^2_NDA/n ] / [ AERB^2(s^, t^) + omega^2_LR/n ].     (15)

The quantities in (15) are graphed in Figs. 4-6 for different sample sizes.
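Eq. (15) can also be approximated by direct simulation. The sketch below (ours; the sample size, Delta and theta values are illustrative, and scikit-learn's logistic regression stands in for (4)) repeatedly draws noisy training sets, fits both naive rules, evaluates their error rates with Eq. (6), and forms the ratio of mean squared excess error rates, with the NDA term in the numerator so that values above one favor LR.

    # Monte Carlo sketch (ours) of the comparison behind Eq. (15).
    import numpy as np
    from scipy.stats import norm
    from sklearn.linear_model import LogisticRegression

    def er(alpha, beta, delta, pi0):                       # Eq. (6)
        nb = np.sqrt(beta @ beta)
        return (pi0 * norm.cdf((-delta * beta[0] / 2 + alpha) / nb)
                + (1 - pi0) * norm.cdf((-delta * beta[0] / 2 - alpha) / nb))

    def relative_efficiency(n, p, delta, pi0, theta0, theta1, reps=500, seed=1):
        rng = np.random.default_rng(seed)
        pi1 = 1 - pi0
        er_opt = er(np.log(pi1 / pi0), delta * np.eye(p)[0], delta, pi0)
        sq_nda, sq_lr = [], []
        for _ in range(reps):
            y = rng.binomial(1, pi1, size=n)
            x = rng.standard_normal((n, p))
            x[:, 0] += np.where(y == 1, delta / 2, -delta / 2)
            flip = rng.uniform(size=n) > np.where(y == 1, theta1, theta0)
            yt = np.where(flip, 1 - y, y)
            if yt.min() == yt.max():                       # degenerate sample, skip
                continue
            x0, x1 = x[yt == 0], x[yt == 1]
            m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
            S = ((x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)) / n
            Si = np.linalg.inv(S)
            beta = Si @ (m1 - m0)
            alpha = np.log(len(x1) / len(x0)) - 0.5 * (m1 @ Si @ m1 - m0 @ Si @ m0)
            sq_nda.append((er(alpha, beta, delta, pi0) - er_opt) ** 2)
            fit = LogisticRegression(C=1e6).fit(x, yt)     # large C ~ no penalty
            sq_lr.append((er(fit.intercept_[0], fit.coef_[0], delta, pi0) - er_opt) ** 2)
        return np.mean(sq_nda) / np.mean(sq_lr)            # RE of LR to NDA as in (15)

    print(relative_efficiency(n=50, p=2, delta=1.0, pi0=0.5, theta0=1.0, theta1=0.8))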

These three graphs show that the relative efficiency increases monotonically as the noise level increases, which indicates that LR is less deteriorated by CCC-Noise than NDA. As the noise level goes to zero, the relative efficiencies converge to their minimal values, which differ from the numbers shown in the corresponding table of Efron's paper [16] because the relative efficiency employed there is defined as the ratio of the expected regrets of the two procedures rather than the ratio of the mean squared errors used here. It is also important to note that as Delta gets smaller, although NDA is still better, it is better only by a small margin. Furthermore, for sample size equal to 50 and the smallest value of Delta considered, the performance of LR can even become better than that of NDA as the misclassification probability increases. Though only Fig. 4 shows this latter aspect, for lower values of theta_0 some of the other curves would also rise above 1.

Fig. 1c. The asymptotic error rate bias with theta_0 = 1, Delta = 3 and pi_0 = 0.5.
Fig. 2a. The asymptotic error rate bias with theta_0 = 0.9, Delta = 1 and pi_0 = 0.5.
Fig. 2b. The asymptotic error rate bias with theta_0 = 0.9, Delta = 2 and pi_0 = 0.5.

Fig. 2c. The asymptotic error rate bias with theta_0 = 0.9, Delta = 3 and pi_0 = 0.5.
Fig. 3a. The asymptotic error rate bias with theta_0 = 0.8, Delta = 1 and pi_0 = 0.5.
Fig. 3b. The asymptotic error rate bias with theta_0 = 0.8, Delta = 2 and pi_0 = 0.5.

5. Summary

We have investigated the impact of CCC-Noise on the performance of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We compared the asymptotic distributions of the misclassification error rates of these two classifiers under CCC-Noise when the underlying distributions are multivariate normal. Typically, the AERB of both procedures is only slightly affected when the noise level is low. LR has a larger AERB than NDA when the two populations are more separated.

Fig. 3c. The asymptotic error rate bias with theta_0 = 0.8, Delta = 3 and pi_0 = 0.5.
Fig. 4. The relative efficiency of LR to NDA with theta_0 = 1, n = 50 and pi_0 = 0.5.
Fig. 5. The relative efficiency of LR to NDA with theta_0 = 1, n = 100 and pi_0 = 0.5.

With respect to the relative efficiency of LR to NDA, we showed that LR is less deteriorated by CCC-Noise than NDA. One important conclusion is that under CCC-Noise the Mahalanobis distance plays a vital role in determining the relative performance of the two procedures: when Delta is small, LR tends to be more tolerant of CCC-Noise than NDA. Our analyses provide insight into, and a more in-depth understanding of, the effect of noisily labeled observations on the accuracy of frequently used classification models, and conceivably our results can be used to guide interested researchers in the design of noise handling mechanisms.

Fig. 6. The relative efficiency of LR to NDA with theta_0 = 1, n = 1000 and pi_0 = 0.5.

In future work we will extend this efficiency comparison to a non-Gaussian assumption for the distribution of X.

Appendix A. Proof of Lemma 2

According to White's theorem, the misspecified MLEs (lambda^, mu^_0, mu^_1, Sigma^) of NDA converge to the values (lambda*, mu_0*, mu_1*, Sigma*) that minimize

    E_(X,Y~)[ (1 - Y~){ (p/2)log(2 pi) + (1/2)log|Sigma| + (1/2)(X - mu_0)' Sigma^{-1} (X - mu_0) - log pi_0 }
              + Y~{ (p/2)log(2 pi) + (1/2)log|Sigma| + (1/2)(X - mu_1)' Sigma^{-1} (X - mu_1) - log pi_1 } ],

where the expectation is with respect to the joint distribution of (X, Y~). Let L represent the inner term of the expectation. Taking the first partial derivatives with respect to lambda, mu_0, mu_1 and Sigma, and setting them to zero, gives

    E_(X,Y~)[ exp(lambda)/(1 + exp(lambda)) - Y~ ] = 0,                                                                  (16)
    E_(X,Y~)[ (1 - Y~) Sigma^{-1}(X - mu_0) ] = 0,                                                                       (17)
    E_(X,Y~)[ Y~ Sigma^{-1}(X - mu_1) ] = 0,                                                                             (18)
    E_(X,Y~)[ 2 Sigma - diag(Sigma) - (1 - Y~){ 2(X - mu_0)(X - mu_0)' - diag((X - mu_0)(X - mu_0)') }
              - Y~{ 2(X - mu_1)(X - mu_1)' - diag((X - mu_1)(X - mu_1)') } ] = 0,                                        (19)

the last equation being equivalent to Sigma = E_(X,Y~)[ (1 - Y~)(X - mu_0)(X - mu_0)' + Y~(X - mu_1)(X - mu_1)' ]. It is also easily shown that in the standard situation

    E_(X,Y~)[X] = (Delta/2)(pi_1 - pi_0) e_1,
    E_(X,Y~)[Y~] = pi_0(1 - theta_0) + pi_1 theta_1,
    E_(X,Y~)[Y~ X] = (Delta/2)[ theta_1 pi_1 - (1 - theta_0) pi_0 ] e_1,
    E_(X,Y~)[ (1 - Y~)(X - mu_0)(X - mu_0)' ]
        = pi_0 theta_0 [ I + ((Delta/2)e_1 + mu_0)((Delta/2)e_1 + mu_0)' ]
          + pi_1(1 - theta_1)[ I + ((Delta/2)e_1 - mu_0)((Delta/2)e_1 - mu_0)' ],                                        (20)
    E_(X,Y~)[ Y~(X - mu_1)(X - mu_1)' ]
        = pi_0(1 - theta_0)[ I + ((Delta/2)e_1 + mu_1)((Delta/2)e_1 + mu_1)' ]
          + pi_1 theta_1 [ I + ((Delta/2)e_1 - mu_1)((Delta/2)e_1 - mu_1)' ].                                            (21)

Substituting the expressions in (20) and (21) into (16)-(19), simplifying, and solving yields (lambda*, mu_0*, mu_1*, Sigma*). Here

    Sigma* = I + Delta^2 [ ab/(a + b) + cd/(c + d) ] E_11,

which indicates that sigma_(1)* = h e_1. The asymptotic variance-covariance matrix of (lambda^, mu^_0, mu^_1, sigma^_(1)) is Sigma_1 = A^{-1} B A^{-1}, where A and B are partitioned conformably with (lambda, mu_0, mu_1, sigma_(1)) into blocks A_00, A_01, ..., A_33 and B_00, B_01, ..., B_33; A_00 and B_00 are scalars, A_01, A_02, A_03 (and the corresponding B blocks) are 1 x p vectors, and the remaining blocks are p x p matrices.

First consider the elements of A. According to White's theorem,

    A_00 = E_(X,Y~)[ d^2 L/dlambda^2 ] evaluated at (lambda*, mu_0*, mu_1*, Sigma*) = exp(lambda*)/[1 + exp(lambda*)]^2,
    A_11 = E_(X,Y~)[ (1 - Y~) Sigma^{-1} ] at the limit = (a + b){ I + Delta^2 [ ab/(a + b) + cd/(c + d) ] E_11 }^{-1},
    A_22 = E_(X,Y~)[ Y~ Sigma^{-1} ] at the limit = (c + d){ I + Delta^2 [ ab/(a + b) + cd/(c + d) ] E_11 }^{-1},

and clearly A_01 = A_02 = A_03 = O and A_12 = A_21 = O. With respect to A_13 and A_23, for each element of sigma_(1) we have

    E_(X,Y~)[ d^2 L/dmu_0 dsigma_1j ] at the limit = -E_(X,Y~)[ (1 - Y~) Sigma^{-1}[(E_1j + E_j1)/(1 + delta_1j)] Sigma^{-1}(X - mu_0) ] = O,
    E_(X,Y~)[ d^2 L/dmu_1 dsigma_1j ] at the limit = -E_(X,Y~)[ Y~ Sigma^{-1}[(E_1j + E_j1)/(1 + delta_1j)] Sigma^{-1}(X - mu_1) ] = O,

which indicates that A_13 = A_23 = O. Now consider (19). Taking its first partial derivative with respect to sigma_ij and evaluating at (lambda*, mu_0*, mu_1*, Sigma*) shows that A_33 is a diagonal matrix, and evaluation gives

    A_33 = [1/(2h^2)] E_11 + (1/h)(I - E_11).

Hence A is the diagonal matrix

    A = diag( exp(lambda*)/[1 + exp(lambda*)]^2, (a + b)/h, (a + b), ..., (a + b), (c + d)/h, (c + d), ..., (c + d), 1/(2h^2), 1/h, ..., 1/h ).

Now consider the matrix B. According to White's theorem,

    B_00 = E_(X,Y~)[ (dL/dlambda)^2 ] at the limit = E_(X,Y~)[ ( exp(lambda*)/(1 + exp(lambda*)) - Y~ )^2 ] = exp(lambda*)/[1 + exp(lambda*)]^2,
    B_11 = E_(X,Y~)[ (dL/dmu_0)(dL/dmu_0)' ] at the limit = Sigma*^{-1} E_(X,Y~)[ (1 - Y~)(X - mu_0*)(X - mu_0*)' ] Sigma*^{-1}
         = Sigma*^{-1} [ (a + b) I + Delta^2 (ab/(a + b)) E_11 ] Sigma*^{-1},
    B_22 = E_(X,Y~)[ (dL/dmu_1)(dL/dmu_1)' ] at the limit = Sigma*^{-1} [ (c + d) I + Delta^2 (cd/(c + d)) E_11 ] Sigma*^{-1}.

It is also easily shown that B_01 = B_02 = O and B_12 = B_21 = O, the latter because (1 - Y~)Y~ is identically zero.

Next consider B_03. Only its first component is nonzero, and it equals

    E_(X,Y~)[ (dL/dlambda)(dL/dsigma_11) ] at the limit = (Delta^2/(2h^2)) (a + b)(c + d) [ cd/(c + d)^2 - ab/(a + b)^2 ].

Now consider B_33. Its off-diagonal elements vanish, and for the diagonal elements evaluation gives

    E_(X,Y~)[ (dL/dsigma_1j)^2 ] at the limit = 1/h    for j >= 2,
    E_(X,Y~)[ (dL/dsigma_11)^2 ] at the limit = (1/(4h^4)) Var[ (1 - Y~)(X_1 - mu*_{0,1})^2 + Y~(X_1 - mu*_{1,1})^2 ] = omega_33/(4h^4).

Finally consider B_13 and B_23. Only their (1, 1) components are nonzero, and evaluation gives

    B_13 = [ Delta^3 ab(a - b) / (2h^3(a + b)^2) ] E_11,    B_23 = [ Delta^3 cd(c - d) / (2h^3(c + d)^2) ] E_11.

Evaluation of A^{-1} B A^{-1} then gives the matrix Sigma_1.

Appendix B. Proof of Lemma 4

The asymptotic variance-covariance matrix of (s^, t^) is Sigma_3 = H^{-1} V H^{-1}, where

    H = [ H_00  H_01' ; H_01  H_11 ],    V = [ V_00  V_01' ; V_01  V_11 ],

H_00 and V_00 being scalars, H_01 and V_01 p x 1 vectors, and H_11 and V_11 p x p matrices. First consider the elements of H. Let l denote the inner term of the expectation in (10). Then

    H_00 = E_(X,Y~)[ d^2 l/dalpha^2 ] at (s*, u* e_1) = E_X[ exp(s + u X_1)/(1 + exp(s + u X_1))^2 ] = h_0.

For the ith component of H_01,

    E_(X,Y~)[ d^2 l/dalpha dbeta_i ] at (s*, u* e_1) = E_X[ X_i exp(s + u X_1)/(1 + exp(s + u X_1))^2 ],

so only the first component is nonzero, being equal to h_1. For the (i, j) component of H_11,

    E_(X,Y~)[ d^2 l/dbeta_i dbeta_j ] at (s*, u* e_1) = E_X[ X_i X_j exp(s + u X_1)/(1 + exp(s + u X_1))^2 ],

so only the diagonal components are nonzero; for i >= 2 they equal h_0 and for i = 1 they equal h_2. Thus

    H_11 = h_2 E_11 + h_0 (I - E_11).

Now consider the matrix V. We have

    V_00 = E_(X,Y~)[ (dl/dalpha)^2 ] at (s*, u* e_1) = E_(X,Y~)[ ( Y~ - exp(s + u X_1)/(1 + exp(s + u X_1)) )^2 ]
         = E_X[ ( [pi_1 theta_1 exp(Delta X_1) + pi_0(1 - theta_0)] / [pi_0 + pi_1 exp(Delta X_1)] )
                 ( 1 - 2 exp(s + u X_1)/(1 + exp(s + u X_1)) )
                 + ( exp(s + u X_1)/(1 + exp(s + u X_1)) )^2 ] = v_0.

For V_01, only the first component is nonzero and it equals v_1, and, similarly to H_11,

    V_11 = v_2 E_11 + v_0 (I - E_11).

Finally, evaluation of H^{-1} V H^{-1} gives the matrix Sigma_3.

References

[1] T. Krishnan, S.C. Nandy, Efficiency of discriminant analysis when initial samples are classified stochastically, Pattern Recognition 23 (1990).
[2] U.A. Katre, T. Krishnan, Pattern recognition with imperfect supervision, Pattern Recognition 22 (1989).
[3] J.E. Michalek, R.C. Tripathi, The effect of errors in diagnosis and measurement on the estimation of the probability of an event, Journal of the American Statistical Association 75 (1980).
[4] Y. Li, L. Wessels, D. De Ridder, M. Reinders, Classification in the presence of class noise using a probabilistic kernel Fisher method, Pattern Recognition 40 (2007).
[5] N.D. Lawrence, B. Scholkopf, Estimating a kernel Fisher discriminant in the presence of label noise, in: Proc. of the 18th ICML, 2001.
[6] Y. Yasui, M. Pepe, L. Hsu, B.L. Adam, Z. Feng, Partially supervised learning using an EM-boosting algorithm, Biometrics 60 (2004).
[7] W. Zhang, R. Rekaya, K. Bertrand, A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer, Bioinformatics 21 (2005).
[8] K. Robbins, S. Joseph, W. Zhang, R. Rekaya, J.K. Bertrand, Classification of incipient Alzheimer patients using gene expression data: dealing with potential misdiagnosis, Online Journal of Bioinformatics 7.
[9] J. Maletic, A. Marcus, Data cleansing: beyond integrity analysis, in: Proceedings of the Conference on Information Quality, 2000.
[10] S. Weisberg, Applied Linear Regression, John Wiley and Sons, Inc., 1980.
[11] S.W. Norton, H. Hirsh, Classifier learning from noisy data as probabilistic evidence combination, in: Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park, CA, 1992.
[12] L.S. Magder, J.P. Hughes, Logistic regression when the outcome is measured with uncertainty, American Journal of Epidemiology 146 (1997).
[13] M. Hofler, The effect of misclassification on the estimation of association: a review, International Journal of Methods in Psychiatric Research 14 (2005).
[14] J.M. Neuhaus, Bias and efficiency loss due to misclassified responses in binary regression, Biometrika 86 (1999).
[15] T. Krishnan, Efficiency of learning with imperfect supervision, Pattern Recognition 21 (1988).
[16] B. Efron, The efficiency of logistic regression compared to normal discriminant analysis, Journal of the American Statistical Association 70 (1975).
[17] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study of their impacts, Artificial Intelligence Review 22 (2004).
[18] A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in: NIPS, 2001.
[19] Y. Bi, Theoretical analysis of classification under CCC-Noise and its application to semi-supervised text classification, Ph.D. Thesis, University of California, Riverside.
[20] H. White, Maximum likelihood estimation of misspecified models, Econometrica 50 (1982).
[21] K.C. Li, N. Duan, Regression analysis under link violation, Annals of Statistics 17 (1989).


More information

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

Linear Classification

Linear Classification Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative

More information

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

Feature Extraction with Weighted Samples Based on Independent Component Analysis

Feature Extraction with Weighted Samples Based on Independent Component Analysis Feature Extraction with Weighted Samples Based on Independent Component Analysis Nojun Kwak Samsung Electronics, Suwon P.O. Box 105, Suwon-Si, Gyeonggi-Do, KOREA 442-742, nojunk@ieee.org, WWW home page:

More information

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30 Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Chapter 17: Undirected Graphical Models

Chapter 17: Undirected Graphical Models Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)

More information

STA 414/2104, Spring 2014, Practice Problem Set #1

STA 414/2104, Spring 2014, Practice Problem Set #1 STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information