Dimensional reduction of clustered data sets


Guido Sanguinetti

5th February 2007

Abstract

We present a novel probabilistic latent variable model to perform linear dimensional reduction on data sets which contain clusters. The model simultaneously performs the clustering and the dimensional reduction in an unsupervised manner, providing an optimal visualisation of the clustering. We prove that the resulting dimensional reduction yields Fisher's linear discriminant analysis as a limiting case. Testing the model on both real and artificial data shows marked improvements in the quality of the visualisation and clustering.

1 Introduction

Dimensional reduction techniques form an important chapter in statistical machine learning. Linear methods such as principal component analysis (PCA) and multidimensional scaling are classical statistical tools. While much current research focuses on non-linear dimensional reduction techniques, linear methods are still widely used in a number of applications, from computer vision to bioinformatics. In particular, PCA is still the preferred tool for dimensional reduction in applications such as microarray analysis, where the noisy nature of the data and the lack of knowledge of the underlying processes mean that linear models are generally more trusted.

A major advance in the understanding of dimensional reduction was provided by Tipping and Bishop's probabilistic interpretation of PCA [Tipping and Bishop, 1999b, PPCA]. The starting point for PPCA is a factor-analysis style latent variable model

y = W x + µ + ε.    (1)

Here y is a D-dimensional vector, W is a tall, thin matrix with D rows and q columns (with q < D), x is a q-dimensional latent variable, µ is a constant (mean) vector and ε ~ N(0, σ^2 I) is an error term. Placing a spherical unit normal prior on the latent variable x, Tipping and Bishop proved that, having observed i.i.d. data y_i, at maximum likelihood the columns of W span the principal subspace of the data. Therefore, the generative model (1) can be said to provide a probabilistic equivalent of PCA.
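For concreteness, the maximum likelihood solution of PPCA has the well-known closed form W_ML = U_q (Λ_q - σ^2 I_q)^{1/2} R, where U_q and Λ_q collect the leading eigenvectors and eigenvalues of the sample covariance and R is an arbitrary rotation [Tipping and Bishop, 1999b]. The following minimal NumPy sketch (our own illustration, not part of the paper; the toy data and variable names are arbitrary) recovers the principal subspace this way:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data from the generative model (1), y = W x + mu + eps, with mu = 0
    D, q, N, sigma2 = 10, 2, 500, 0.1
    W_true = rng.normal(size=(D, q))
    X = rng.normal(size=(N, q))
    Y = X @ W_true.T + np.sqrt(sigma2) * rng.normal(size=(N, D))

    # Maximum likelihood PPCA: W_ML = U_q (L_q - sigma2 I)^{1/2} R, with the ML
    # noise variance equal to the mean of the discarded eigenvalues of S
    S = np.cov(Y, rowvar=False, bias=True)
    eigval, eigvec = np.linalg.eigh(S)                 # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    sigma2_ml = eigval[q:].mean()
    W_ml = eigvec[:, :q] @ np.diag(np.sqrt(eigval[:q] - sigma2_ml))
    # The columns of W_ml span the principal subspace of the observed data.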

This result opened the way for a number of applications and generalisations: it allowed a principled treatment of missing data and of mixtures of PCA models [Tipping and Bishop, 1999a], noise propagation through the model [Sanguinetti et al., 2005], and, through different choices of priors, it allowed the linear dimensional reduction framework to be extended to cases where the data cannot be modelled as Gaussian [Girolami and Breitling, 2004].

One major limitation of PCA, which is shared by its extensions, is that it is based on matching the covariance of the data. As such, it is not obvious how to incorporate information on the structure of the data. For example, if the data set has a cluster structure, the dimensional reduction obtained through PCA is not guaranteed to respect this structure. A specific application where this problem is particularly acute is bioinformatics: many microarray studies contain samples from different conditions (e.g. different types of tumour, or control versus diseased), so the data can be expected to be clustered according to conditions [e.g. Golub et al., 1999, Pomeroy et al., 2002]. Dimensional reduction using PCA does not lead to a good visualisation, and further information needs to be included (such as restricting the data set to include only genes that are significantly differentially expressed in the different conditions).

The problem is even more significant if we view dimensional reduction as a feature extraction technique. It is generally assumed that the genes capturing most of the variation in the data are the most important for the physiological process under study [Alter et al., 2000]. However, when the data set contains clusters, the most important features are clearly the ones that discriminate between clusters, which, in general, do not coincide with the directions of greatest variation in the whole data set.

While unsupervised techniques for dimensional reduction of data sets with clusters are relatively understudied, the supervised case leads to linear discriminant analysis (LDA). It was Fisher [1936] who first provided the solution to the problem of identifying the optimal projection to separate two classes. Under the assumption that the class-conditional probabilities are Gaussian with identical covariance, he proved that the optimal one-dimensional projection to separate the data maximises the Rayleigh coefficient

J(e) = d^2 / (σ_1^2 + σ_2^2).    (2)

Here e is the unit vector that defines the projection, d is the projected distance between the (empirical) class centres and σ_i^2 is the projected empirical variance of class i. This criterion can be generalised to the general-covariance, multi-class case [De la Torre and Kanade, 2005].

In this note we propose a latent variable model to perform dimensional reduction of clustered data sets in an unsupervised way. The maximum likelihood solution for the model fulfils an optimality criterion which is a generalisation of Fisher's criterion, and reduces to maximising Rayleigh's coefficient under particular limiting conditions.

The rest of the paper is organised as follows: in the next section we describe the latent variable model and provide an expectation-maximisation (EM) algorithm for finding the maximum likelihood solution. We then prove that, under certain assumptions, the likelihood of the model is a monotonic function of Rayleigh's coefficient, so that LDA can be retrieved as a limiting case of our model. Finally, we discuss the advantages of our model, as well as possible generalisations and relationships with existing models.

2 Latent Variable Model

Let our data set be composed of N D-dimensional vectors y_i, i = 1, ..., N. We have prior knowledge that the data set contains K clusters, and we wish to obtain a q < D dimensional projection of the data which is optimal with respect to the clustered structure of the data. The latent variable model formulation we present is closely related to, and inspired by, the extreme component analysis (XCA) model of Welling et al. [2003]. We explain the data using two sets of latent variables which are forced to map to orthogonal subspaces in the data space; the model then takes the form

y = Ex + µ + Wz.    (3)

In this equation, E is a D × q matrix which defines the projection, so that E^T E = I_q, the identity matrix in q dimensions, µ is a D-dimensional mean vector, and W is a D × (D - q) matrix whose columns span the orthogonal complement of the subspace spanned by the columns of E, i.e. W^T E = 0. Notice that a key difference from the PPCA model of equation (1) is the absence of a noise term; this is common to our model and the XCA model.

Choosing spherical Gaussian prior distributions for the latent variables x and z recovers the XCA model of Welling et al. [2003]. Instead, in order to incorporate the clustered structure of the data, we select the following prior distributions

z ~ N(0, I_{D-q}),
x ~ Σ_{k=1}^K π_k N(m_k, σ^2 I_q),    Σ_{k=1}^K π_k = 1.    (4)

Thus, we have a mixture-of-Gaussians generative distribution in the directions determined by the columns of E, and a general Gaussian covariance structure in the orthogonal directions. Intuitively, the mixture-of-Gaussians part will tend to optimise the fit to the clusters, which (given the spherical covariances) will be obtained when the subspace spanned by E interpolates the centres of the clusters in the optimal (least squares) way. Meanwhile, the general Gaussian covariance in the orthogonal directions will seek to maximise the variance of each cluster in the directions orthogonal to E, hence minimising the sum of the projected variances. Therefore, the projection defined by E will seek an optimal compromise between separating the cluster centres and obtaining tight clusters.
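To make the generative process of equations (3)-(4) concrete, the short NumPy sketch below (an illustration of ours, not code from the paper; parameter values and names are arbitrary) draws a sample data set from the model, constructing an orthonormal E and an orthogonal complement W from a QR decomposition:

    import numpy as np

    rng = np.random.default_rng(1)
    D, q, K, N, sigma = 5, 2, 3, 300, 0.3

    # Orthonormal basis: the first q columns give E, the remaining D-q give W
    # (in the model W need only span the orthogonal complement of E; an
    # orthonormal W is the simplest choice for a toy sample).
    Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
    E, W = Q[:, :q], Q[:, q:]

    # Mixture prior on x, equation (4); the centres m_k sum to zero
    m = rng.normal(size=(K, q))
    m -= m.mean(axis=0)
    pi = np.full(K, 1.0 / K)

    labels = rng.choice(K, size=N, p=pi)
    x = m[labels] + sigma * rng.normal(size=(N, q))   # x ~ sum_k pi_k N(m_k, sigma^2 I_q)
    z = rng.normal(size=(N, D - q))                   # z ~ N(0, I_{D-q})
    Y = x @ E.T + z @ W.T                             # y = E x + W z   (mu = 0)

Projecting such a sample onto the columns of E recovers the mixture structure, while the projection onto the columns of W carries only the unstructured Gaussian component.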

From the definitions of the model and the priors, it is clear that we can set the mean vector µ to zero and consider centred data sets without loss of generality. We can therefore assume that the centres of the mixture components in the prior distribution for x sum to zero,

Σ_{k=1}^K m_k = 0.

2.1 Likelihood

Given the model (3) and the prior distributions on the latent variables (4), it is straightforward to marginalise the latent variables and obtain a likelihood for the model. Using the orthogonality of E and W, we readily obtain the log-likelihood

L(E, W, {m_k}, π, σ) = -(Nq/2) log σ^2 + Σ_{j=1}^N log [ Σ_{k=1}^K π_k exp( -||E^T y_j - m_k||^2 / (2σ^2) ) ] - (N/2) log|C| - (N/2) tr(C^{-1} S).    (5)

Here S = (1/N) Σ_{j=1}^N y_j y_j^T is the empirical covariance of the (centred) data, C = W W^T, and we have omitted a number of constants. With a slight notational abuse, we denote C^{-1} = W (W^T W)^{-2} W^T (using the pseudoinverse of W) and |C| the product of the non-zero eigenvalues of C.

We notice immediately that, as W is unconstrained (except for being orthogonal to E and of full column rank), at maximum likelihood W W^T will match exactly the covariance of the data in the directions perpendicular to E. Therefore, the last term in equation (5) will just be an additive constant proportional to D - q, irrespective of the other parameters, and can be dropped from the likelihood. Also, we can use the orthogonality between E and W to rewrite |C| in terms of E alone. This can be done by observing that, if P and P⊥ represent two mutually orthogonal projections that sum to the identity, then the determinant of a symmetric matrix A can be expressed as |A| = |P A P^T| |P⊥ A P⊥^T|. But since C matches exactly the covariance in the directions orthogonal to E, we obtain that |C| = |S| / |E^T S E|. Substituting these results into the log-likelihood (5), and omitting terms which do not depend on the parameters, we obtain

L(E, {m_k}, π, σ) = -(Nq/2) log σ^2 + Σ_{j=1}^N log [ Σ_{k=1}^K π_k exp( -||E^T y_j - m_k||^2 / (2σ^2) ) ] + (N/2) log|E^T S E|.    (6)

This expression is simply a mixture-of-Gaussians likelihood in the projected space spanned by the columns of E, plus a term that depends only on the projection E. This suggests a simple iterative strategy for maximising the likelihood.
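For reference, the profiled log-likelihood (6) is cheap to evaluate numerically. The sketch below (our own, with hypothetical function and variable names) computes it, up to additive constants, for a centred data matrix Y and given parameters, using a log-sum-exp for the mixture term:

    import numpy as np
    from scipy.special import logsumexp

    def log_likelihood(Y, E, m, sigma2, pi):
        # Equation (6), up to additive constants.
        # Y: (N, D) centred data; E: (D, q) with E^T E = I; m: (K, q); pi: (K,).
        N, q = Y.shape[0], E.shape[1]
        Yt = Y @ E                                             # E^T y_j for all j
        d2 = ((Yt[:, None, :] - m[None, :, :]) ** 2).sum(-1)   # ||E^T y_j - m_k||^2
        mix = logsumexp(np.log(pi)[None, :] - d2 / (2.0 * sigma2), axis=1).sum()
        S = Y.T @ Y / N
        _, logdet = np.linalg.slogdet(E.T @ S @ E)             # log |E^T S E|
        return -0.5 * N * q * np.log(sigma2) + mix + 0.5 * N * logdet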

2.2 E-M algorithm

We can optimise the likelihood (6) using an E-M algorithm [Dempster et al., 1977]. This is an iterative procedure that is guaranteed to converge to a (possibly local) maximum of the likelihood. We introduce unobserved binary class-membership vectors c ∈ R^K, and denote by ỹ_j = E^T y_j the projected data. Introducing the responsibilities

γ_jk = π_k exp( -||ỹ_j - m_k||^2 / (2σ^2) ) / Σ_{k'=1}^K π_{k'} exp( -||ỹ_j - m_{k'}||^2 / (2σ^2) ),    (7)

we obtain a lower bound on the log-likelihood of the form

Q(E, {m_k}, σ, π) = Σ_{j=1}^N Σ_{k=1}^K γ_jk log [ π_k p(ỹ_j | c_k, m_k, σ) ] + (N/2) log|E^T S E|,    (8)

where the class-conditional probabilities p(ỹ | c_k, Θ) are the Gaussian mixture components and Θ denotes collectively all the model parameters. Optimising the bound (8) with respect to the parameters leads to fixed-point equations. For the latent means these are

m_k = Σ_{j=1}^N γ_jk ỹ_j / Σ_{j=1}^N γ_jk.    (9)

This simply tells us that the latent means are the means of the projected data weighted by the responsibilities. Similarly, we obtain a fixed-point equation for σ^2,

σ^2 = (1/(Nq)) Σ_{j=1}^N Σ_{k=1}^K γ_jk ||ỹ_j - m_k||^2 = (1/(Nq)) Σ_{k=1}^K σ_k^2,    (10)

where we have introduced the projected variances σ_k^2 = Σ_{j=1}^N γ_jk ||ỹ_j - m_k||^2. Again, we can easily obtain an update for the mixing coefficients,

π_k = N_k / N = (1/N) Σ_{j=1}^N γ_jk.    (11)

Substituting equation (9) into equation (10), and equation (10) into equation (8), we obtain, ignoring constants,

Q(E) = log [ |E^T S E| / |E^T S_K E| ],    (12)

where

S_K = (1/N) Σ_{k=1}^K Σ_{j=1}^N γ_jk (y_j - ȳ_k)(y_j - ȳ_k)^T,    ȳ_k = Σ_{j=1}^N γ_jk y_j / Σ_{j=1}^N γ_jk.

Optimisation of the likelihood with respect to the projection directions E is then a matter of simple linear algebra, and reduces to a generalised eigenvalue problem.
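The resulting procedure alternates the E-step (7) with the M-step updates (9)-(11) and a re-estimation of E from the generalised eigenvalue problem implied by (12). The following NumPy/SciPy sketch of a single iteration is our own illustration (function and variable names are hypothetical; it assumes a centred data matrix Y and the determinant-ratio form of (12) given above):

    import numpy as np
    from scipy.linalg import eigh

    def em_step(Y, E, m, sigma2, pi):
        # One EM iteration for the clustered projection model, equations (7)-(12).
        # Y: (N, D) centred data; E: (D, q) orthonormal projection; m: (K, q)
        # latent means; sigma2: scalar variance; pi: (K,) mixing proportions.
        N, D = Y.shape
        q, K = E.shape[1], m.shape[0]
        S = Y.T @ Y / N                                    # empirical covariance

        # E-step: responsibilities, equation (7)
        Yt = Y @ E                                         # projected data E^T y_j
        d2 = ((Yt[:, None, :] - m[None, :, :]) ** 2).sum(-1)
        logw = np.log(pi)[None, :] - d2 / (2.0 * sigma2)
        logw -= logw.max(axis=1, keepdims=True)            # stabilised normalisation
        gamma = np.exp(logw)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: equations (9), (10) and (11)
        Nk = gamma.sum(axis=0)
        m_new = (gamma.T @ Yt) / Nk[:, None]
        d2_new = ((Yt[:, None, :] - m_new[None, :, :]) ** 2).sum(-1)
        sigma2_new = (gamma * d2_new).sum() / (N * q)
        pi_new = Nk / N

        # Soft within-cluster scatter S_K and the update of E from (12):
        # maximise |E^T S E| / |E^T S_K E|, i.e. take the top-q generalised
        # eigenvectors of (S, S_K) and orthonormalise (the objective depends
        # only on the spanned subspace).
        ybar = (gamma.T @ Y) / Nk[:, None]
        R = Y[:, None, :] - ybar[None, :, :]
        S_K = np.einsum('nk,nkd,nke->de', gamma, R, R) / N + 1e-9 * np.eye(D)
        _, V = eigh(S, S_K)                                # eigenvalues ascending
        E_new, _ = np.linalg.qr(V[:, -q:])
        return E_new, m_new, sigma2_new, pi_new, gamma

In a complete implementation this step would be iterated to convergence, re-projecting the data and the latent means after each update of E; the small jitter added to S_K keeps the generalised eigenproblem well posed when clusters are very tight.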

Equation (12) is noteworthy. The denominator is a soft-assignment, multi-dimensional generalisation of the sum of intra-cluster variances which appeared in the denominator of equation (2). Therefore, optimising the likelihood with respect to the projection subspace E will tend to make the projected clusters tighter. Furthermore, the matrix S can be decomposed as S_K + S_T [e.g. Bishop, 2006], where S_T is the generalisation of the inter-cluster variance appearing in the numerator of (2). Therefore, the likelihood bound is a monotonically increasing function of a generalisation of the Rayleigh coefficient, and can be viewed as a probabilistic version of LDA. In the next section we make this correspondence precise in the original Fisher's discriminant case, and show under which conditions the two techniques are equivalent.

3 Relationship with linear discriminant analysis

We now make a number of simplifying assumptions: we assume our data set to be composed of two clusters, and we choose to project it to one dimension, so that q = 1 and the projection is defined by a single vector e rather than a matrix E. If the clusters are well separated, assuming they are balanced in the data, the likelihood of equation (6) can be simplified by neglecting the probability that points in one cluster were generated by the other cluster, obtaining

L = -(Nq/2) log σ^2 - Σ_{k=1}^{2} Σ_{j=1}^{N_k} (1/(2σ^2)) [ (y_j - e m_k)^T e ]^2 + (N/2) log(e^T S e) + const,    (13)

where we have replaced π_i = 1/2 and N_i is the number of points assigned to cluster i. Using the fact that µ = (1/N) Σ_{j=1}^N y_j = 0, we write

y_j = µ_i + v_j,    (14)

where µ_i = (1/N_i) Σ_{j=1}^{N_i} y_j. The intra-cluster covariances are then S_i = (1/N_i) Σ_j v_j v_j^T. Since there are two clusters, we also have µ_1 = -µ_2 = µ and m_1 = -m_2 = m. Using equation (14), we obtain that S = S_1 + S_2 + 6µµ^T. Defining σ_i^2 = e^T S_i e, the projected variance of cluster i, and d = 2µ^T e, the projected distance separating the clusters, it is easy to compute the maximum likelihood estimates of m and σ^2, given respectively by m = d and σ^2 = α(σ_1^2 + σ_2^2), with α a constant. Notice that these can also be obtained from equations (9)-(10) by taking the limit in which the responsibilities are 0 or 1 (hard assignments).

Therefore, we obtain that the likelihood (13) can be rewritten as

L = log[ (σ_1^2 + σ_2^2 + 6d^2) / (σ_1^2 + σ_2^2) ] + const = log[ 1 + 6 J(e) ] + const,

where J(e) is the Rayleigh coefficient of equation (2). Therefore, the maximum likelihood solution is obtained at the maximum of the Rayleigh coefficient, showing that the model reduces to Fisher's discriminant in these conditions.
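This equivalence is easy to verify numerically. The sketch below (illustrative only; the toy data and names are ours) generates two balanced, well-separated clusters, computes Fisher's closed-form direction e ∝ S_W^{-1}(µ_1 - µ_2), and compares it with the direction selected by the model criterion, namely the leading generalised eigenvector of S with respect to S_K under hard assignments; the two directions coincide up to sign:

    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(2)

    # Two balanced, well-separated clusters in D dimensions
    D, N = 5, 400
    mu = rng.normal(size=D)
    mu *= 3.0 / np.linalg.norm(mu)
    Y = np.concatenate([ mu + rng.normal(scale=0.5, size=(N // 2, D)),
                        -mu + rng.normal(scale=0.5, size=(N // 2, D))])
    labels = np.repeat([0, 1], N // 2)
    Y -= Y.mean(axis=0)

    # Fisher's closed-form direction: e proportional to S_W^{-1} (mu_1 - mu_2)
    mu1, mu2 = Y[labels == 0].mean(axis=0), Y[labels == 1].mean(axis=0)
    S_W = np.cov(Y[labels == 0], rowvar=False) + np.cov(Y[labels == 1], rowvar=False)
    e_fisher = np.linalg.solve(S_W, mu1 - mu2)
    e_fisher /= np.linalg.norm(e_fisher)

    # Model criterion with hard assignments: leading generalised eigenvector of (S, S_K)
    S = Y.T @ Y / N
    R = Y - np.where(labels[:, None] == 0, mu1, mu2)   # within-cluster deviations
    S_K = R.T @ R / N
    _, V = eigh(S, S_K)                                # eigenvalues ascending
    e_model = V[:, -1] / np.linalg.norm(V[:, -1])

    print(abs(e_fisher @ e_model))                     # close to 1: the directions coincide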

4 Discussion

In this paper we have addressed the question of selecting an optimal linear projection of a high-dimensional data set, given prior knowledge of the presence of cluster structure in the data. The proposed solution is a generative model made up of two orthogonal latent variable models: one is an unconstrained, PPCA-like model [Tipping and Bishop, 1999b]; the other is a projection (i.e. the loading vectors have unit length) with a mixture-of-Gaussians prior on the latent variables. The result of this rather artificial choice of priors is that the PPCA-like part will tend to maximise the variance of the data in the directions orthogonal to the projection, hence forcing the projected clusters to maximise the inter-cluster spread while minimising the intra-cluster variance. Indeed, we prove that the maximum likelihood estimator for the projection plane in our model satisfies an unsupervised, soft-assignment generalisation of the multidimensional version of the criterion for LDA. We also prove that when the projection is one dimensional and the clusters are well separated, the likelihood of the model reduces to a monotonic function of the Rayleigh coefficient used in Fisher's discriminant analysis.

Our model is related to several previously proposed models. It can be viewed as an adaptation of the XCA model of Welling et al. [2003] to the case where the data set has a clustered structure; it is remarkable, though, that the generalised LDA criterion is obtained without imposing that the variance in the directions orthogonal to the projection be maximal. Bishop and Tipping [1998] dealt with the problem of visualising clustered data sets by proposing a hierarchical model based on a mixture of PPCAs [Tipping and Bishop, 1999a]. While this approach successfully addresses the problem of analysing the detailed structure of each cluster, it does not provide a single optimal projection for visualising all the clusters simultaneously. Perhaps the contribution closest to ours is Girolami [2001]. This paper considered the clustering problem using an independent component analysis (ICA) model with one latent binary variable corrupted by Gaussian noise. The author then proved that the maximum likelihood solution returned an unsupervised analogue of Fisher's discriminant. This approach is somewhat complementary to ours in that it uses discrete rather than continuous latent variables, but the end result is remarkably close. In fact, in the one-dimensional, two-class case the objective function of the ICA model is the same as the likelihood (6). The major difference comes when considering latent spaces of more than one dimension, which is generally the most important case. Generalising the ICA approach would enforce an independence constraint between the directions of the projection which is not plausible; this is not a problem in our approach. Also, the treatment of cases where the intra-cluster spread differs between clusters is much more straightforward in our model: in the approach of Girolami [2001] one has to introduce a class-dependent noise level, while in our approach it just amounts to a straightforward modification of the E-M algorithm.

There are several interesting directions in which to generalise the present work. One would be to extend it to non-linear dimensional reduction techniques such as those of Bishop et al. [1998] and Lawrence [2005]. While this is in principle possible, owing to the shared probabilistic nature of these models, in practice it will require considerable further work. Another important direction would be to automate model selection (such as the choice of the number of clusters or of latent dimensions) using Bayesian techniques such as Bayesian PCA and non-parametric Dirichlet process priors [Bishop and Tipping, 1998, Neal, 2000]. While this is feasible, careful consideration should be given to issues of numerical efficiency.

Acknowledgments

The author wishes to thank Mark Girolami for useful discussions and suggestions on ICA models for clustering.

References

O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18):10101-10106, 2000.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, Singapore, 2006.

C. M. Bishop, M. Svensen, and C. K. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215-234, 1998.

C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualisation. IEEE Transactions PAMI, 20(3):281-293, 1998.

F. De la Torre and T. Kanade. Multimodal oriented discriminant analysis. In Proceedings of the International Conference on Machine Learning (ICML), 2005.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-38, 1977.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.

M. Girolami. Latent class and trait models for data classification and visualisation. In Independent Component Analysis: Principles and Practice. Cambridge University Press, Cambridge, 2001.

M. Girolami and R. Breitling. Biologically valid linear factor models of gene expression. Bioinformatics, 20(17):3021-3033, 2004.

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, 1999.

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.

R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.

S. M. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415:436-442, 2002.

G. Sanguinetti, M. Milo, M. Rattray, and N. D. Lawrence. Accounting for probe-level noise in principal component analysis of microarray data. Bioinformatics, 21(19):3748-3754, 2005.

M. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443-482, 1999a.

M. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999b.

M. Welling, F. Agakov, and C. K. Williams. Extreme component analysis. In Advances in Neural Information Processing Systems, 2003.