A Robust Approach to Regularized Discriminant Analysis

1 A Robust Approach to Regularized Discriminant Analysis
Moritz Gschwandtner
Department of Statistics and Probability Theory, Vienna University of Technology, Austria
Österreichische Statistiktage, Graz, Austria, September 08, 2011

2 Joint work with
Peter Filzmoser, Vienna University of Technology, Austria
Christophe Croux, ORSTAT and University Center of Statistics, K. U. Leuven, Belgium
Gentiane Haesbroeck, University of Liege, Liege, Belgium

3 Contents
1. Overview of discriminant analysis
2. Introduction of the proposed method
3. Choice of parameters
4. Real and simulated data examples

4 Discriminant Analysis (DA): Example
Haemophilia data: 30 normal persons and 22 obligatory carriers of haemophilia A.
[Figure: scatter plot of the variables AHFactivity and AHFantigen; groups: normal, carrier]

5 Notation
Given n observations of training data, measured on p variables.
Observations originate from k different populations $G_1, \dots, G_k$,
- according to prior probabilities $\pi_1, \dots, \pi_k$, where $\sum_{j=1}^{k} \pi_j = 1$,
- with sample sizes $n_1, \dots, n_k$, where $\sum_{j=1}^{k} n_j = n$.
Usual assumption: observations are distributed according to a normal distribution $N(\mu_j, \Sigma_j)$, with mean $\mu_j$ and covariance matrix $\Sigma_j$, $j = 1, \dots, k$.

6 DA: Classification and Prediction
Find a classification function f, based on the training data, that assigns a new, unlabelled observation to one (and only one) of the k groups:
$$f : \Omega^p \to \{1, \dots, k\}$$
Bayes rule: given an observation x, the posterior probability for group $G_j$ equals
$$P(G_j \mid x) = \frac{p(x \mid G_j)\, \pi_j}{\sum_{i=1}^{k} p(x \mid G_i)\, \pi_i}$$
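A minimal R sketch of the Bayes rule above, assuming multivariate normal group densities. The helper names `dmvn` and `posterior` and the toy parameters are illustrative assumptions, not part of the slides.

```r
# Multivariate normal density at x (illustrative helper, base R only)
dmvn <- function(x, mu, Sigma) {
  p  <- length(mu)
  d2 <- mahalanobis(x, center = mu, cov = Sigma)  # squared Mahalanobis distance
  exp(-0.5 * d2) / sqrt((2 * pi)^p * det(Sigma))
}

# Posterior probabilities P(G_j | x) by Bayes' rule
posterior <- function(x, mus, Sigmas, priors) {
  dens <- mapply(function(mu, S) dmvn(x, mu, S), mus, Sigmas)
  dens * priors / sum(dens * priors)
}

# Toy two-group example (hypothetical parameters)
mus    <- list(c(0, 0), c(2, 2))
Sigmas <- list(diag(2), diag(2))
priors <- c(0.5, 0.5)
post   <- posterior(c(1, 1), mus, Sigmas, priors)
```

The point (1, 1) lies halfway between the two hypothetical centers, so with equal priors and equal covariances both posteriors coincide.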

7 Bayesian DA Rule
A test set observation x is assigned to the population $G_j$ for which $\ln P(G_j \mid x)$ is a maximum over all groups $j = 1, \dots, k$:
$$f(x) = \arg\max_j \ln P(G_j \mid x) = \arg\max_j \ln\bigl(p(x \mid G_j)\, \pi_j\bigr)$$
Quadratic Discriminant Analysis:
$$f_{QDA}(x) = \arg\max_j \left( -\tfrac{1}{2} \ln(\det \Sigma_j) - \tfrac{1}{2} (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) + \ln \pi_j \right)$$
Linear Discriminant Analysis: assume $\Sigma_1 = \dots = \Sigma_k = \Sigma$, and use
$$f_{LDA}(x) = \arg\max_j \left( \mu_j^\top \Sigma^{-1} x - \tfrac{1}{2} \mu_j^\top \Sigma^{-1} \mu_j + \ln \pi_j \right)$$
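The LDA rule above can be sketched in a few lines of R. The function names `lda_scores` and `lda_classify` and the toy parameters are assumptions made for illustration; `Theta` stands for $\Sigma^{-1}$.

```r
# Linear discriminant scores: mu_j' Theta x - 0.5 * mu_j' Theta mu_j + log(pi_j)
lda_scores <- function(x, mus, Theta, priors) {
  sapply(seq_along(mus), function(j) {
    mu <- mus[[j]]
    drop(t(mu) %*% Theta %*% x - 0.5 * t(mu) %*% Theta %*% mu) + log(priors[j])
  })
}

# Assign x to the group with the largest score
lda_classify <- function(x, mus, Theta, priors) {
  which.max(lda_scores(x, mus, Theta, priors))
}

mus    <- list(c(0, 0), c(2, 2))   # hypothetical group centers
Theta  <- solve(diag(2))           # inverse of the common covariance matrix
priors <- c(0.5, 0.5)
g_near_second <- lda_classify(c(1.8, 1.5), mus, Theta, priors)
g_near_first  <- lda_classify(c(0.2, -0.1), mus, Theta, priors)
```

Only the group centers and one common concentration matrix enter the rule, which is why robust or sparse estimates of these quantities can be plugged in directly.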

8 Parameter Estimation
The essential elements of the LDA rule are the group centers and the common group covariance matrix.
Estimate group centers and the common group covariance matrix by
- the sample means and pooled sample covariance matrix,
- robust estimators of location and covariance, like the MCD estimators,
- regularized (sparse) estimators of location and covariance.
Robust estimators lead to robust DA rules!

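9 Maximum Likelihood
Given a data sample X, the log-likelihood function of joint location $\mu$ and inverse scatter $\Theta := \Sigma^{-1}$ is given by
$$L(\mu, \Theta) = \log(\det(\Theta)) - \frac{1}{n} \sum_{x \in X} (x - \mu)^\top \Theta (x - \mu)$$
Maximization leads to the classical estimators:
$$\hat\mu = \frac{1}{n} \sum_{x \in X} x, \qquad \hat\Theta = \hat\Sigma^{-1} = \left( \frac{1}{n} \sum_{x \in X} (x - \hat\mu)(x - \hat\mu)^\top \right)^{-1}$$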
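As a small illustration of the estimators above, the following R sketch computes them on simulated data and evaluates the log-likelihood as written on the slide. The simulated sample and the function name `loglik` are assumptions for illustration.

```r
set.seed(1)
n <- 100; p <- 3
x <- matrix(rnorm(n * p), nrow = n, ncol = p)   # simulated N(0, I_3) sample

mu_hat    <- colMeans(x)                        # ML estimate of location
xc        <- sweep(x, 2, mu_hat)                # centered data
Sigma_hat <- crossprod(xc) / n                  # ML scatter (divisor n, not n - 1)
Theta_hat <- solve(Sigma_hat)                   # ML estimate of the concentration matrix

# Log-likelihood in the form used on the slide
loglik <- function(mu, Theta, x) {
  xc <- sweep(x, 2, mu)
  log(det(Theta)) - mean(rowSums((xc %*% Theta) * xc))
}
ll_hat <- loglik(mu_hat, Theta_hat, x)
```

At the maximizer the average quadratic form equals p, so the value reduces to log det of the estimated concentration matrix minus p.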

10 Regularization
Problem: if n < p, $\hat\Sigma$ is singular and the maximum likelihood estimator for $\Theta$ does not exist!
Solution: penalization of the log-likelihood function, based on a penalty parameter $\lambda > 0$ and the L1 norm $\|\cdot\|_1$:
$$L(\mu, \Theta) = \log(\det(\Theta)) - \frac{1}{n} \sum_{x \in X} (x - \mu)^\top \Theta (x - \mu) - \lambda \|\Theta\|_1, \qquad \|\Theta\|_1 = \sum_{l,m} |\theta_{lm}|$$
The maximization problem can be solved by an algorithm called graphical lasso.
$\lambda$ governs the sparseness of $\hat\Sigma$ and $\hat\Theta$!
R package: glasso (Friedman, Hastie, Tibshirani, 2007).
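The following R sketch illustrates the two points above: the scatter matrix is singular when n < p, and the penalized objective can be evaluated for any candidate $\Theta$. The function name `pen_loglik` and the simulated data are assumptions; the glasso call is run only if the package is installed.

```r
# Penalized log-likelihood: log det(Theta) - tr(S Theta) - lambda * ||Theta||_1.
# With S the ML scatter matrix around mu, tr(S Theta) equals the average
# quadratic form (1/n) * sum (x - mu)' Theta (x - mu).
pen_loglik <- function(Theta, S, lambda) {
  log(det(Theta)) - sum(diag(S %*% Theta)) - lambda * sum(abs(Theta))
}

set.seed(1)
n <- 10; p <- 20
x <- matrix(rnorm(n * p), n, p)
S <- cov(x) * (n - 1) / n
rank_S <- qr(S)$rank            # at most n - 1 < p, so S cannot be inverted

# Sparse concentration estimate via the graphical lasso, if glasso is available
if (requireNamespace("glasso", quietly = TRUE)) {
  Theta_hat <- glasso::glasso(S, rho = 0.2)$wi
  share_zero <- mean(Theta_hat == 0)   # proportion of exact zeros
}
```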

11 Example
Simulated three-dimensional data, $X \sim N(0, I_3)$, n = 100:
> solve(cov(x))
[3 x 3 matrix; numeric values not preserved]
> glasso(cov(x), rho=0.2)$wi
[3 x 3 matrix; numeric values not preserved]
λ = 0.2 leads to a sparse estimate of the concentration matrix!

12 Example
Problem: glasso is not robust! Adding 10 outliers distributed according to $N(10, I_3)$ leads to
> glasso(cov(x), rho=0.2)$wi
[3 x 3 matrix; numeric values not preserved]
Idea: combine the regularization of glasso with robust techniques!

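13 Regularized MCD estimator
Croux and Haesbroeck (2010):
Improvement: adapt the MCD idea and integrate it into the log-likelihood function:
$$L(H, (\mu, \Theta)) = \log(\det(\Theta)) - \frac{1}{h} \sum_{i \in H} (x_i - \mu)^\top \Theta (x_i - \mu) - \lambda \|\Theta\|_1$$
with $H \subset \{1, \dots, n\}$, $|H| = h < n$.
Maximization of $L(H, (\mu, \Theta))$ means finding an index subset $H_{opt}$ for which
$$\max_{(\mu,\Theta)} L(H_{opt}, (\mu, \Theta)) \ge \max_{(\mu,\Theta)} L(H, (\mu, \Theta)) \quad \forall H \subset \{1, \dots, n\} : |H| = h$$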

14 Regularized MCD estimator
Problem: $\binom{n}{h}$ subsets to check. Not applicable to large n.
Improvement: C-step algorithm:
Let $H_k$ be the subset derived at iteration k and $(\hat\mu_{H_k}, \hat\Theta_{H_k})$ be the corresponding estimates maximizing $L(H_k, (\mu, \Theta))$.
Compute Mahalanobis distances with respect to $(\hat\mu_{H_k}, \hat\Theta_{H_k})$:
$$d_i^{(k)}(x_i, \hat\mu_{H_k}, \hat\Theta_{H_k}) = (x_i - \hat\mu_{H_k})^\top \hat\Theta_{H_k} (x_i - \hat\mu_{H_k})$$
Define the next subset $H_{k+1}$ as
$$H_{k+1} = \left\{ i \in \{1, \dots, n\} : d_i^{(k)} \in \{ d_{(1)}^{(k)}, \dots, d_{(h)}^{(k)} \} \right\}$$
where $d_{(j)}^{(k)}$ are the ordered distances. Then
$$L(H_k, (\hat\mu_{H_k}, \hat\Theta_{H_k})) \le L(H_{k+1}, (\hat\mu_{H_{k+1}}, \hat\Theta_{H_{k+1}}))$$

15 Algorithm
The regularized MCD estimator is computed using the following algorithm:
1. Draw an initial subset $H_0$.
2. Maximize the penalized likelihood function (glasso) to obtain $(\hat\mu_{H_0}, \hat\Theta_{H_0})$.
3. Compute ordered Mahalanobis distances w.r.t. $(\hat\mu_{H_0}, \hat\Theta_{H_0})$.
4. Choose the next subset containing the h observations with smallest distances.
5. Repeat steps 2-4 until convergence to obtain $(\hat\mu, \hat\Theta)$.
A local maximum of the likelihood value is reached. The algorithm can be repeated several times with different initial subsets.
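The five steps above can be sketched in R as follows. This is not the authors' implementation: the names `reg_mcd` and `ridge_precision` are hypothetical, and step 2 uses a simple ridge-type shrinkage as a stand-in so that the sketch runs with base R only. With that stand-in the monotonicity of the C-step is not guaranteed, hence the iteration cap.

```r
# Stand-in regularized estimator (assumption): ridge-type shrinkage of the
# scatter matrix. With the glasso package installed, one could instead pass
#   precision = function(S, lambda) glasso::glasso(S, rho = lambda)$wi
ridge_precision <- function(S, lambda) solve(S + lambda * diag(nrow(S)))

reg_mcd <- function(x, h, lambda, precision = ridge_precision, max_iter = 100) {
  H <- sort(sample(nrow(x), h))                      # 1. draw initial subset H_0
  for (iter in seq_len(max_iter)) {
    xH    <- x[H, , drop = FALSE]
    mu    <- colMeans(xH)                            # 2. location and regularized
    Theta <- precision(crossprod(sweep(xH, 2, mu)) / h, lambda)  # concentration on H
    xc    <- sweep(x, 2, mu)
    d     <- rowSums((xc %*% Theta) * xc)            # 3. distances of all observations
    H_new <- sort(order(d)[seq_len(h)])              # 4. h smallest distances
    if (identical(H_new, H)) break                   # 5. subset stable: converged
    H <- H_new                                       # (if max_iter is hit, H is one
  }                                                  #  step ahead of mu and Theta)
  list(mu = mu, Theta = Theta, H = H)
}

# Usage: 90 clean observations plus 10 shifted outliers, as in the earlier example
set.seed(42)
clean <- matrix(rnorm(90 * 3), 90, 3)
out   <- matrix(rnorm(10 * 3, mean = 10), 10, 3)
x     <- rbind(clean, out)
fit   <- reg_mcd(x, h = 75, lambda = 0.1)
```

Repeating the call with different seeds corresponds to restarting from different initial subsets; one would keep the fit with the largest objective value.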

16 LDA and RegMCD
How to apply LDA with the regularized MCD estimator in the multi-group setting, $X = \{x_{ij} : i = 1, \dots, n_j;\ j = 1, \dots, k\}$:
- Compute robust location estimates $t_j$ for $j = 1, \dots, k$.
- Compute centered observations $Z = \{z_{ij}\}$ with $z_{ij} = x_{ij} - t_j$.
- Apply the regularized MCD algorithm to Z to obtain common estimates $(\hat\mu, \hat\Theta)$.
- Correct the location estimates: $\hat\mu_j = t_j + \hat\mu$.
- Apply LDA using the parameters $\hat\mu_1, \dots, \hat\mu_k, \hat\Theta$.
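A hedged R sketch of the centering and correction steps above. Two stand-ins are assumptions made to keep it self-contained: the coordinate-wise median plays the role of the robust location $t_j$, and a classical mean and covariance fit replaces the regularized MCD step. The names `pooled_fit` and `classical_fit` are hypothetical.

```r
# fit_common: any function returning list(mu, Theta) for a centered data matrix.
# The classical fit below is only a placeholder for the regularized MCD estimator.
classical_fit <- function(z) list(mu = colMeans(z), Theta = solve(cov(z)))

pooled_fit <- function(x, groups, fit_common = classical_fit) {
  labels <- sort(unique(groups))
  # Robust location per group: coordinate-wise median (stand-in for t_j)
  t_j <- lapply(labels, function(g) apply(x[groups == g, , drop = FALSE], 2, median))
  z <- x
  for (j in seq_along(labels)) {
    idx <- groups == labels[j]
    z[idx, ] <- sweep(x[idx, , drop = FALSE], 2, t_j[[j]])   # z_ij = x_ij - t_j
  }
  common <- fit_common(z)                                     # common (mu, Theta) from Z
  mus <- lapply(t_j, function(tj) tj + common$mu)             # corrected group centers
  list(mus = mus, Theta = common$Theta, labels = labels)
}

# Usage on two simulated groups with centers near (0, 0) and (3, 3)
set.seed(7)
x <- rbind(matrix(rnorm(100), 50, 2), matrix(rnorm(100, mean = 3), 50, 2))
groups <- rep(1:2, each = 50)
fit <- pooled_fit(x, groups)
```

Centering first lets a single scatter estimate be computed from all groups at once, which matches the common-covariance assumption of LDA.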

17 Centering
[Figure: two scatter plots with axes x and y]

18 The penalty parameter λ
How to choose the penalty parameter λ:
- Based on test error rates: cross-validation
- Based on a model selection criterion: AIC, BIC
BIC criterion:
$$BIC(\Gamma) = -2 \log L(\Gamma) + \kappa(\Gamma) \log n$$
$L(\Gamma)$: likelihood function of the model
$\kappa(\Gamma)$: number of parameters in the model

19 The penalty parameter λ
BIC(λ) is small if
- the value of the likelihood function $L(H_{opt}, \hat\theta)$ is high,
- the number of parameters in the model is small.
Choose λ according to
$$\lambda_{opt} = \arg\min_{\lambda} BIC(\lambda)$$
Best compromise between likelihood and sparseness!
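A sketch of grid-based selection of λ by BIC, written in R. The slides use an adapted BIC based on the trimmed likelihood; the version below is a plain Gaussian BIC under stated assumptions (log-likelihood up to constants, and κ counted as the nonzero entries in the upper triangle of the concentration matrix, diagonal included). The names `bic_theta` and `select_lambda` are hypothetical.

```r
# BIC for a Gaussian model with concentration matrix Theta and scatter S
bic_theta <- function(Theta, S, n) {
  loglik <- (n / 2) * (log(det(Theta)) - sum(diag(S %*% Theta)))
  kappa  <- sum(Theta[upper.tri(Theta, diag = TRUE)] != 0)
  -2 * loglik + kappa * log(n)
}

# Choose lambda on a grid; estimator(S, lambda) must return a concentration matrix
select_lambda <- function(S, n, lambdas, estimator) {
  bic <- sapply(lambdas, function(l) bic_theta(estimator(S, l), S, n))
  list(lambda = lambdas[which.min(bic)], bic = bic)
}

# Usage with the graphical lasso, if the glasso package is installed
if (requireNamespace("glasso", quietly = TRUE)) {
  set.seed(1)
  x   <- matrix(rnorm(100 * 3), 100, 3)
  sel <- select_lambda(cov(x), n = 100, lambdas = c(0.01, 0.05, 0.1, 0.2, 0.5),
                       estimator = function(S, l) glasso::glasso(S, rho = l)$wi)
}
```

A larger λ sets more entries to zero and lowers κ, while a smaller λ raises the likelihood; the minimizer of BIC balances the two.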

20 Example: Fruit Data
- Three different sorts of the same fruit (Cucumis melo)
- 256 different spectra measured
- Outliers due to different illumination systems
- Partition of data into 60% training and 40% test set
- Test errors measured for each group separately

21 Example: Fruit Data, BIC
BIC and AER suggest a small λ value!
[Figure: two panels, BIC against Lambda and AER against Lambda]

22 Example: Fruit Data, Results
Outliers in the third group lead to poor results for LDA. RRLDA remains stable!
[Table: test errors Err_Test1, Err_Test2, Err_Test3 for RRLDA (λ = 0.001), GLASSO (λ = 0.001) and LDA; numeric values not preserved]

23 Example: Golub Data, Results
38 training samples and 34 test samples from two cancer classes.
Absolute test errors were measured for various variable subsets.
Variable selection was done according to the nearest shrunken centroids method.
[Table: absolute test errors by number of variables p for LDA, GLASSO and RRLDA; numeric values not preserved]

24 Simulated Example
Two groups (k = 2), both consisting of 100 observations and p variables, with $p \in \{30, 100, 300, 500, 1000\}$.
Discrimination occurs in variables 1 and 2.
Variables 3 to p are uncorrelated noise according to standard normal distributions.
$\mu_1$, $\mu_2$, $\Sigma$: [numeric values not preserved]

25 Simulated Example
[Figure: scatter plot of x[, 1] and x[, 2]]

26 Simulated Example, BIC, p=30
[Figure: BIC against Lambda (optimal value = 0.19)]

27 Simulated Example, Outliers
Simulate contamination by adding 10% shift outliers to the data.
- Variables 3 to p are distributed like the non-outliers.
- The mean of variable 1 is shifted.
- The means of variable 2 are swapped.
$\mu_1$, $\mu_2$: [numeric values not preserved]

28 Simulated Example, Outliers
[Figure: scatter plot of x[, 1] and x[, 2]]

29 Results without contamination
[Figure: error rates (AER, TER) against p for LDA, GLASSO and RRLDA]

30 Results with contamination
[Figure: error rates (AER, TER) against p for LDA, GLASSO and RRLDA]

31 Computation Times
[Figure: computation time in seconds against p for RegMCD and GLASSO]

32 Conclusions
- RRLDA is a combination of regularization and robust methods.
- RRLDA is a good choice if the data contain outliers, many noisy variables, or both.
- The penalty parameter λ is chosen according to an adapted BIC criterion.

33 Some references
C. Croux and G. Haesbroeck. Robust scatter regularization. Compstat, Book of Abstracts, Paris: Conservatoire National des Arts et Métiers (CNAM) and the French National Institute for Research in Computer Science and Control (INRIA).
J.H. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9.
J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84.
P. Filzmoser, R. Maronna, and M. Werner. Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52, 2008.
