Unsupervised Learning of Ranking Functions for High-Dimensional Data

Sach Mukherjee and Stephen J. Roberts
Department of Engineering Science, University of Oxford, U.K.

Abstract

A growing number of problems in data analysis involve ranking variables from extremely high-dimensional datasets. However, when labelled data are unavailable and the underlying statistical models are poorly understood, it becomes difficult to assess the likely effectiveness of candidate ranking functions, and thus to make an appropriate choice of method. In this paper we present an unsupervised approach to learning ranking functions, based on a simple but powerful notion of consistency. We present results on real and simulated data which demonstrate the effectiveness of our learning-based method compared with widely-used statistical techniques.

1 Introduction

A number of important problems in data analysis involve ranking variables by some notion of relevance. In an increasing number of domains, the data upon which such rankings are based have properties which make their analysis highly non-trivial. The problems we are concerned with in this paper are characterized by: (i) extremely high dimensionality and relatively small sample-size (i.e. very few datapoints), (ii) the absence of labelled data, and (iii) a poor understanding of the statistical model underlying the data. Variable-ranking tasks based on data of this kind (we shall use the term "high-throughput data") are now commonplace in molecular biology, chemistry, pharmacology and many other areas, and have led to an explosion of interest in relevant statistical methods.

As a running example with which to illustrate the issues involved, consider the differential analysis of gene microarrays [1]. (Microarray analysis is a topical example, but we emphasize that the methods developed here apply to any task involving the ranking of variables from high-dimensional, low sample-size data.) Here the variables to be ranked represent genes, and the data are expression levels measured under two or more conditions, such as healthy and diseased. Biologically relevant genes are expected to be up- or down-regulated between conditions. Ranking is therefore done using a function which scores genes in terms of differential expression (a canonical example being the two-sample t-statistic). Each of the three characteristics mentioned above has a serious effect in microarray analysis. The mismatch between dimensionality (typically around $10^4$) and sample-size (around $10^1$) means that models rich enough to capture interactions between genes become too complex to estimate. At the same time, there is usually little prior knowledge about the underlying model, and with genes not being flagged as relevant/irrelevant in the dataset, the correctness of rankings cannot easily be checked against ground-truth. Thus, while many well-founded methods exist for such data, making a reasonable choice of method on either empirical or theoretical grounds is difficult. Yet the effectiveness of the ranking function used is critical: incorrect results lead to a waste of resources, with no real sanity-check until late in the investigative life-cycle.

The ability of a ranking function to distinguish relevant and irrelevant variables can be captured as a probability of success (defined formally in Section 2.1 as the expected proportion of true positives selected). Recent research [2] has shown that the probability of success for a given ranking function is jointly determined by statistical properties of the system under study and the form of the ranking function, and can be calculated explicitly under a fully-specified model for the data. However, as the microarray example illustrates, in practice the model is poorly characterized and the data unlabelled, so there is no obvious way to determine probability of success. This makes it difficult to choose a ranking function appropriate for given data, or to have confidence in rankings obtained from data.

In this paper we address the problem of variable ranking in a wholly unsupervised, high-dimensional setting. Our approach is based around a measure of stability in ranking called consistency, which can be computed without ground-truth knowledge, but which nonetheless allows us to infer the underlying probability of success of a ranking function from data. The notion of consistency is essentially used as a proxy for (unobservable) probability of success, and used to learn effective ranking functions. Our method actively exploits the presence of large numbers of irrelevant variables, in effect making something of a blessing out of the curse of dimensionality which plagues high-throughput data analysis.

The remainder of this paper is organized as follows. We first define consistency and probability of success formally, and examine the relationship between them. We then show how consistency can be used to infer underlying probability of success and to learn appropriate ranking functions from data, and finally present results on real and simulated data.

2 Consistency

2.1 Definitions

Consider two sets of data (collectively D) pertaining to the same scientific question (these data can be regarded as equivalent to two datasets drawn from a full generative model M for the underlying system). Each dataset has the same variables; a ranking function r produces two potentially distinct orderings of these variables from the two datasets. Let the n_s highest-ranked variables in each case be selected as result-sets S_a and S_b respectively. (The number n_s of variables selected in practice tends to depend on experimental objectives and follow-up plans [1]; we therefore treat n_s as known. However, the approach developed in this paper can easily be adapted to deal with algorithms which automatically determine n_s.) Note that S_a and S_b are sets of variable-indices. Sample consistency C is then defined as the number of elements in common between the two sets:

    $C(r, n_s, D) \overset{\text{def}}{=} |S_a \cap S_b|$    (1)

As a function of random data, sample consistency C is itself a random variable; its expected value is called expected consistency κ:

    $\kappa \overset{\text{def}}{=} E[C]$    (2)

Expected consistency κ is thus simply the average number of elements in common between pairs of result-sets.
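To make the definition concrete, here is a minimal sketch (in Python; not from the paper) of computing sample consistency as in Equation (1). The ranking function used here is an absolute two-sample t-statistic, chosen only because the Introduction mentions it as a canonical score; the function names and the small constant guarding against zero variance are illustrative choices.

```python
import numpy as np

def abs_tstat(xa, xb):
    """Absolute two-sample t-statistic per variable (rows are samples, columns are variables)."""
    ma, mb = xa.mean(axis=0), xb.mean(axis=0)
    va, vb = xa.var(axis=0, ddof=1), xb.var(axis=0, ddof=1)
    se = np.sqrt(va / xa.shape[0] + vb / xb.shape[0])
    return np.abs(ma - mb) / (se + 1e-12)      # small constant guards against zero variance

def result_set(scores, n_s):
    """Indices of the n_s highest-scoring variables."""
    return np.argsort(scores)[::-1][:n_s]

def sample_consistency(scores_a, scores_b, n_s):
    """Equation (1): C = |S_a intersect S_b| for the two result-sets."""
    return len(np.intersect1d(result_set(scores_a, n_s), result_set(scores_b, n_s)))

# Usage: given two datasets D_a = (xa1, xb1) and D_b = (xa2, xb2), each with two conditions:
# C = sample_consistency(abs_tstat(xa1, xb1), abs_tstat(xa2, xb2), n_s=50)
```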

We define ground-truth probability of success q as the expected proportion of true positives among the variables selected (this quantity is known in the statistical literature as the true discovery rate). Assuming n_s is non-zero:

    $q \overset{\text{def}}{=} E\!\left[\,|S \cap \mathrm{relevant}|\,/\,n_s\,\right]$    (3)

where S is a result-set and "relevant" the full set of relevant variables.

2.2 The relationship between consistency and probability of success

Our intention is to use consistency as a proxy for underlying probability of success: in this Section we examine the relationship between the two, and show that expected consistency is positively correlated with probability of success.

Let N_t^a and N_t^b be random variables representing the number of true positives obtained from two sets of data. If we think of the selection of variables as a series of Bernoulli trials, with some probability of selecting a relevant variable at each trial, then given probability of success q, and the total number n_t of relevant variables in the system, N_t^a and N_t^b are independent and identically distributed according to (shown only for N_t^a):

    $P(N_t^a = n_t^a \mid q, n_t) = B\!\left(n_t^a \mid q \max(n_s/n_t, 1),\ \min(n_s, n_t)\right)$    (4)

where lower-case n_t^a refers to a realization of the random variable N_t^a, and B(x | π, η) is a Binomial distribution with η Bernoulli trials and probability parameter π.

Recall that sample consistency C is the number of elements in common between two result-sets. Let C comprise C_t relevant variables and C_f irrelevant ones, such that C = C_t + C_f. How are C_t and C_f distributed, given the numbers N_t^a and N_t^b of true positives in the two result-sets? Consider taking one relevant variable at a time from result-set b and trying to find it in result-set a. This is in effect a series of Bernoulli trials, with the number of successes being the number of relevant variables C_t in common between the two sets. If we make the simplifying assumption that every relevant variable has the same chance of appearing in the top n_s places under the ranking function, the distribution over C_t can be approximated by the following Binomial:

    $P(C_t = c_t \mid N_t^a, N_t^b, n_t) \approx B\!\left(c_t \mid N_t^a/n_t,\ N_t^b\right)$    (5)

Note that we have assumed (without loss of generality) that N_t^a > N_t^b. We can make a similar argument for the irrelevant variables, such that if n_tot is the total number of variables (i.e. the dimensionality), the distribution over C_f can be approximated as:

    $P(C_f = c_f \mid N_t^a, N_t^b, n_t) \approx B\!\left(c_f \,\Big|\, \frac{n_s - N_t^b}{n_{tot} - n_t},\ n_s - N_t^a\right)$    (6)

We find empirically that these Binomial approximations are very accurate for the type of data in which we are interested, mainly because the combined effects of very high dimensionality and small sample-size mean that relatively few variables are selected deterministically from such data.

Now, expected consistency κ is the expectation of C:

    $\kappa = E[C] = E[C_t + C_f] = E[C_t] + E[C_f]$    (7)

From Equations 4, 5, 6 and 7, κ can be expressed in terms of the probability of success q:

    $\kappa = \frac{q^2 n_s^2}{n_t} + \frac{(1-q)^2 n_s^2}{n_{tot} - n_t}$    (8)
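As a sanity check on Equation (8), the following sketch (not from the paper; the parameter values are illustrative) draws true-positive counts from the Binomial model of Equation (4), draws the overlap counts from Equations (5) and (6), and compares the Monte Carlo mean of C with the closed form. Note that with n_s > n_t the probability of success cannot exceed n_t/n_s, so q is kept at or below 0.5 here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tot, n_t, n_s = 1000, 25, 50        # dimensionality, relevant variables, variables selected (illustrative)

def expected_consistency(q):
    """Equation (8)."""
    return q**2 * n_s**2 / n_t + (1 - q)**2 * n_s**2 / (n_tot - n_t)

def simulated_consistency(q, n_draws=20000):
    """Monte Carlo mean of C = C_t + C_f under the Binomial model of Equations (4)-(6)."""
    p = q * max(n_s / n_t, 1.0)                              # Binomial probability in Equation (4)
    n_a = rng.binomial(min(n_s, n_t), p, size=n_draws)       # true positives in result-set a
    n_b = rng.binomial(min(n_s, n_t), p, size=n_draws)       # true positives in result-set b
    hi, lo = np.maximum(n_a, n_b), np.minimum(n_a, n_b)
    c_t = rng.binomial(lo, hi / n_t)                             # Equation (5)
    c_f = rng.binomial(n_s - hi, (n_s - lo) / (n_tot - n_t))     # Equation (6)
    return (c_t + c_f).mean()

for q in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"q={q:.1f}  kappa (Eq. 8) = {expected_consistency(q):6.2f}  simulated = {simulated_consistency(q):6.2f}")
```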

We can now ask under what conditions probability of success q and expected consistency κ are positively correlated. Suppose the probabilities of success for ranking functions #1 and #2 are q_1 and q_2 respectively, and that function #1 is more effective than function #2, such that the difference Δq (= q_1 − q_2) is positive. Then, using Equation 8 and simplifying, the corresponding difference in expected consistencies Δκ is given by:

    $\Delta\kappa = n_s^2\, \Delta q \left[(q_1 + q_2)\!\left(\frac{1}{n_t} + \frac{1}{n_{tot} - n_t}\right) - \frac{2}{n_{tot} - n_t}\right]$    (9)

Examining this expression we can see that q and κ are positively correlated if and only if the term within square brackets is positive. It is easy to show that this is the case when the average probability of success for functions #1 and #2 exceeds what would be expected if variables were selected entirely at random; that is, sign(Δq) = sign(Δκ) if and only if (q_1 + q_2)/2 > n_t/n_tot. This condition is expected to hold for any plausible ranking function. It is also interesting to note that for given Δq, a large proportion of irrelevant variables helps produce a pronounced response in terms of consistency: the corresponding difference Δκ increases with (n_tot − n_t)/n_tot (subject to the sufficient condition n_t < n_tot/2, which holds for most systems of interest).

An example will provide an intuitive sense of the reason why consistency and probability of success are correlated when irrelevant variables significantly outnumber relevant ones. Consider a scenario where the total number of variables runs into the thousands, with only a few dozen being truly relevant. Suppose also that some proportion of the variables selected by an algorithm are false positives. Then, provided a good number of these false positives are chosen more-or-less at random from the large pool of irrelevant variables, the variability in their identities will tend to be high, compared with the corresponding variation among the relevant variables selected. Hence, the greater the proportion of relevant variables among those selected, the more agreement there will tend to be between result-sets. The correlation between q and κ can also be verified by simulation.

3 Inferring probability of success from data

The results given above show that expected consistency κ correlates with ground-truth probability of success. However, in order to be able to use the notion of consistency in practice, we must be able to infer probability of success from a single observed consistency. The required posterior density over probability of success q, given sample consistency C, can be obtained using Bayes' theorem:

    $p(q \mid C) = \frac{P(C \mid q)\, p(q)}{\int_0^1 P(C \mid q)\, p(q)\, dq}$    (10)

A beta distribution, symmetric about its mode, is used as a prior for q, with parameters chosen to assign relatively little probability mass to the extremes of 0 and 1 (this reflects our intuition that ranking functions are unlikely to be either perfect or entirely useless). The term P(C | q) can be expressed as follows by marginalizing over N_t^a, N_t^b and n_t:

    $P(C \mid q) = \sum_{N_t^a,\, N_t^b,\, n_t} P(C \mid N_t^a, N_t^b, n_t, q)\, P(N_t^a, N_t^b, n_t \mid q)$    (11)

Since N_t^a and N_t^b are independent, and the total number n_t of relevant variables does not depend on probability of success q, P(N_t^a, N_t^b, n_t | q) can be expressed as follows:

    $P(N_t^a, N_t^b, n_t \mid q) = P(N_t^a, N_t^b \mid n_t, q)\, P(n_t \mid q) = P(N_t^a \mid n_t, q)\, P(N_t^b \mid n_t, q)\, P(n_t)$    (12)
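The sign condition above can be checked numerically. The short sketch below (not from the paper; values are illustrative) sweeps pairs (q_1, q_2) and confirms that Δq and Δκ, computed from Equation (8), have the same sign exactly when (q_1 + q_2)/2 > n_t/n_tot.

```python
import numpy as np

n_tot, n_t, n_s = 1000, 25, 50        # illustrative values

def kappa(q):
    """Expected consistency, Equation (8)."""
    return q**2 * n_s**2 / n_t + (1 - q)**2 * n_s**2 / (n_tot - n_t)

qs = np.linspace(0.0, 0.5, 26)        # q cannot exceed n_t / n_s = 0.5 with these values
agree = True
for q1 in qs:
    for q2 in qs:
        if q1 == q2:
            continue
        same_sign = np.sign(q1 - q2) == np.sign(kappa(q1) - kappa(q2))
        condition = (q1 + q2) / 2 > n_t / n_tot
        agree &= (same_sign == condition)
print("sign(dq) == sign(dkappa) iff (q1+q2)/2 > n_t/n_tot :", agree)   # expected: True
```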

Now, recall that C = C_t + C_f, and that C_t and C_f in turn depend only on N_t^a, N_t^b and n_t (Equations 5 and 6). Then, substituting Equation 12 into Equation 11, we obtain P(C | q) explicitly in terms of the conditional distributions derived in Section 2.2:

    $P(C \mid q) = \sum_{n_t=0}^{n_{tot}} \sum_{N_t^a=0}^{n_t} \sum_{N_t^b=0}^{n_t} P(C \mid N_t^a, N_t^b, n_t)\, P(N_t^a \mid n_t, q)\, P(N_t^b \mid n_t, q)\, P(n_t)$    (13)

The term P(C | N_t^a, N_t^b, n_t) is given by Equations 5 and 6 together, and the distributions of N_t^a and N_t^b by Equation 4. A prior is required for n_t; any suitable discrete distribution may be chosen in accordance with background knowledge. At the very least, it is usually possible to impose bounds on n_t which are plausible in the context of the experiment.

In some situations, only a single dataset (rather than a pair of datasets) is available. In such cases, our approach is to obtain sample consistency by repeatedly and randomly partitioning the dataset into halves, computing consistency between results obtained from each partition, and averaging those values. We find that a stable estimate of consistency can usually be obtained after fewer than 50 iterations.

4 Learning ranking functions

We now have, in consistency, an easy-to-compute proxy for ground-truth probability of success, and a framework within which we can, if desired, infer probability of success. We are therefore in a position to address the question of learning effective ranking functions from data. Starting with data D, and a suitable parameterized family of ranking functions, our approach is simply to use consistency to automatically choose the member of the family most likely to have the highest probability of success. Suppose the family is defined by f(θ), with θ being a vector of parameters which, when instantiated, specifies a particular function. Then, the most appropriate member of the family, in terms of sample consistency, is f(θ̂):

    $\hat{\theta} = \arg\max_{\theta}\; C(f(\theta), n_s, D)$    (14)

Note that the computation of consistency for each proposed parameter vector θ involves only ranking and set-intersection and is therefore very rapid. This means that although the optimization in Equation 14 will not in general permit a closed-form solution, sampling methods can be used to rapidly explore search-spaces for most families of functions.

Depending on the precise nature of the task being addressed, the family f may potentially represent any kind of ranking function, but for a concrete example, consider again the differential analysis of microarray data. Recall that the aim is to score each variable in terms of how likely it is to have distinct means in the two conditions. The most common choices of ranking function thus tend to be correlation criteria of various kinds [3]. A suitable family of functions for differential analyses of this kind is therefore:

    $f(\theta_1, \theta_2, \theta_3) = \frac{d + \theta_1}{\theta_2\,\sigma + \theta_3}$    (15)

where d is the absolute difference of sample means between conditions, and σ the sample standard deviation for the variable to be scored. The θ's are scalar parameters analogous to the vector θ above. We use a straightforward Monte Carlo scheme to learn the θ's: the parameters are sampled uniformly in the range [−1, 1] and rejected in case of division by zero; sample consistency is calculated for the function corresponding to each set of parameters drawn, and the parameters with highest consistency are returned after 10^3 iterations.
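The following is a minimal sketch of the learning step just described: sampling θ uniformly in [−1, 1]^3, scoring each candidate with a random-halving estimate of sample consistency, and keeping the best (Equations 14 and 15). It is not the authors' code; in particular the use of a pooled per-variable standard deviation for σ, the number of half-splits, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def family_score(xa, xb, theta):
    """Ranking-function family of Equation (15): (d + theta1) / (theta2 * sigma + theta3).
    d is the absolute difference of sample means; sigma is taken here as a pooled
    per-variable standard deviation (an assumption; the paper does not specify the pooling)."""
    t1, t2, t3 = theta
    d = np.abs(xa.mean(axis=0) - xb.mean(axis=0))
    sigma = np.sqrt(0.5 * (xa.var(axis=0, ddof=1) + xb.var(axis=0, ddof=1)))
    denom = t2 * sigma + t3
    if np.any(denom == 0):
        raise ZeroDivisionError("rejected: division by zero")
    return (d + t1) / denom

def split_half_consistency(xa, xb, theta, n_s, n_splits=25):
    """Average sample consistency over random half-splits of the data (Section 3)."""
    vals = []
    for _ in range(n_splits):
        pa, pb = rng.permutation(len(xa)), rng.permutation(len(xb))
        ha, hb = len(xa) // 2, len(xb) // 2
        s1 = family_score(xa[pa[:ha]], xb[pb[:hb]], theta)
        s2 = family_score(xa[pa[ha:]], xb[pb[hb:]], theta)
        S1, S2 = np.argsort(s1)[::-1][:n_s], np.argsort(s2)[::-1][:n_s]
        vals.append(len(np.intersect1d(S1, S2)))           # Equation (1)
    return float(np.mean(vals))

def learn_ranking_function(xa, xb, n_s, n_iter=1000):
    """Equation (14): return the theta with highest sample consistency."""
    best_theta, best_c = None, -np.inf
    for _ in range(n_iter):
        theta = rng.uniform(-1.0, 1.0, size=3)
        try:
            c = split_half_consistency(xa, xb, theta, n_s)
        except ZeroDivisionError:
            continue
        if c > best_c:
            best_theta, best_c = theta, c
    return best_theta, best_c
```

Ranking then amounts to scoring all variables with the learned member of the family and selecting the n_s highest-scoring ones.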

[Figure 1: Results on simulated microarray data. Panel (a) shows probability of success plotted against the ratio of variances of relevant variables to irrelevant ones ("variance ratio") for the learning-ranking method (LR), the SAM statistic, the Mann-Whitney statistic (MW) and the Fisher score (FS). Panel (b) shows the lowest probability of success observed over the variance ratios considered.]

Learning an appropriate function and ranking using that function can be performed as successive steps, so that we can think of a combined learning-ranking algorithm which takes data as its input and produces a ranking of variables as output. The results presented and discussed in the next Section use a learning-ranking approach based on the family defined by Equation 15. They show that learning under the consistency framework outperforms widely-used statistical methods on both real and simulated data.

5 Results

Simulated data: Figure 1 shows results on simulated microarray data for our method and three widely-used ranking functions: the non-parametric Mann-Whitney statistic, the Fisher score, and SAM [4]. SAM is a regularized statistic developed specifically for microarray analysis, and is widely regarded as one of the best choices for differential ranking problems of this kind. The computational procedure was as follows. At each iteration, 15 datapoints under each of two conditions were sampled from a 1000-dimensional Gaussian model (sample-sizes as small as this make analysis difficult, but are quite typical in microarray analysis). Only 25 of the 1000 dimensions had distinct underlying means in the two conditions; these are the relevant variables or "differentially expressed genes". The ranking methods were applied to the sampled data, with n_s = 50 variables being selected (note that the aim here is to select the relevant variables rather than to classify datapoints). The identities of the relevant variables were used to check results and compute probabilities of success, but of course remained hidden from the algorithms. The ratio of variances of relevant to irrelevant variables ("variance ratio"), while in practice unobservable, is known to have a major impact on the performance of ranking functions of this kind [2]. To examine the robustness of the ranking functions we therefore considered a range of variance ratios, generating 500 datasets for each ratio to obtain accurate estimates of probability of success.
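For concreteness, here is a sketch of the simulated setup just described (15 datapoints per condition, 1000 dimensions, 25 relevant variables, n_s = 50) and of how probability of success is estimated against the known relevant set. The mean shift, variance ratio and baseline ranking function below are placeholders for illustration; the paper does not report these exact values.

```python
import numpy as np

rng = np.random.default_rng(2)

n_dim, n_rel, n_per_cond, n_s = 1000, 25, 15, 50
relevant = np.arange(n_rel)                  # the first 25 dimensions are the 'differentially expressed' ones

def draw_dataset(shift=1.0, var_ratio=1.0):
    """Two-condition Gaussian data; relevant variables get a mean shift and a scaled variance."""
    sd = np.ones(n_dim)
    sd[relevant] = np.sqrt(var_ratio)
    xa = rng.normal(0.0, sd, size=(n_per_cond, n_dim))
    xb = rng.normal(0.0, sd, size=(n_per_cond, n_dim))
    xb[:, relevant] += shift
    return xa, xb

def prob_success(scores):
    """Proportion of true positives among the n_s top-ranked variables (Equation 3, empirically)."""
    top = np.argsort(scores)[::-1][:n_s]
    return len(np.intersect1d(top, relevant)) / n_s

# Example: probability of success of a simple absolute-difference-of-means ranking, averaged over datasets
ps = []
for _ in range(100):
    xa, xb = draw_dataset()
    ps.append(prob_success(np.abs(xa.mean(axis=0) - xb.mean(axis=0))))
print("estimated probability of success:", np.mean(ps))
```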

The results demonstrate the robustness of our algorithm. The other methods do well in some regions of the curve but quite badly in others (which makes results from real data, where the true variance ratio is unknown, hard to trust). In contrast, our method does well across the range of ratios, outperforming SAM in every region of the curve. The fundamental reason for the success of our method is its ability to use a measure of probability of success for guidance. Here, it has effectively adapted the simple function defined by Equation 15 to the unobserved variance ratio, and done so without needing to consider that ratio directly.

Genomic data: We applied our method to a widely-studied microarray dataset pertaining to colon cancer [5], allowing it to learn an appropriate member of the family defined by Equation 15. Posterior distributions over probability of success for the learned function and the three functions mentioned above were inferred following Equation 10 (with a beta prior for q and a uniform prior for n_t, bounded between [10, 100]). Table 1 shows normalized consistencies (i.e. C/n_s) and MAP-estimated probabilities of success for the four methods; our method does noticeably better than the others. But how significant are the observed differences? A good way of answering this question is to ask how confident we can be that our method is more effective on this data than the other functions: that is, by considering the posterior probability that its probability of success is higher. (Suppose the posteriors over probability of success for algorithms #1 and #2 are given by the densities p_1 and p_2 respectively; if φ_2 is the cumulative distribution function corresponding to p_2, the probability of algorithm #1 having a higher underlying probability of success is just $P(q_2 < q_1) = \int_0^1 \phi_2(u)\, p_1(u)\, du$.) This measure of confidence can be computed directly from the inferred posteriors, and is shown, for each pair of ranking functions, in Table 1. (These confidence scores read from left to right, so that we can, for example, be 72% confident that our method is more effective than SAM, and 98% confident in comparison with the Fisher score.)

[Table 1: Results on colon cancer microarray data, reporting, for the learning-ranking method (LR), the SAM statistic (SAM), the Fisher score (FS) and the Mann-Whitney statistic (MW): normalized consistency (C/n_s), MAP probability of success, and pairwise confidence scores for each function against the others.]
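A short numerical sketch of the confidence score just described: given gridded posterior densities over probability of success for two ranking functions, P(q_2 < q_1) is computed as the integral of φ_2(u) p_1(u). The beta-shaped posteriors below are placeholders for illustration, not the posteriors inferred from the colon-cancer data.

```python
import numpy as np
from scipy.stats import beta

u = np.linspace(0.0, 1.0, 2001)              # grid over probability of success
du = u[1] - u[0]

# Placeholder posteriors over q for two ranking functions (illustrative shapes only)
p1 = beta.pdf(u, 30, 20)                     # function #1: posterior concentrated near 0.6
p2 = beta.pdf(u, 25, 25)                     # function #2: posterior concentrated near 0.5

phi2 = np.cumsum(p2) * du                    # cumulative distribution of q_2 on the grid
confidence = np.sum(phi2 * p1) * du          # P(q_2 < q_1) = integral of phi_2(u) p_1(u) du
print(f"confidence that function #1 has higher probability of success: {confidence:.2f}")
```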

6 Discussion and conclusions

This paper has sought to address unsupervised variable ranking in the kind of extremely high-dimensional setting now common in several important areas of science. Our approach was centred around a measure of stability called consistency. The notion that stability is a good thing occurs widely in the literature (e.g. [6] in the context of variable selection). However, to the best of our knowledge, our explicitly probabilistic view of stability in ranking, and the subsequent inference of underlying probability of success, are novel (and indeed made possible by the very properties of high-throughput data which are otherwise a "curse").

There are also two interesting similarities between our approach and the Rankprop [7] algorithm (which otherwise addresses a quite different, supervised problem). Firstly, quality of ranking is emphasized as a learning objective, as it is here. Secondly, Rankprop avoids learning a difficult target function by instead learning a simpler function positively correlated with the target. This is similar in spirit to our use of consistency as an easy-to-compute and positively correlated proxy for probability of success.

An interesting feature of the inference scheme discussed in Section 3 was that consistency depended only implicitly on the statistical model underlying the data. Figure 2 illustrates the relevant high-level dependencies as a graphical model: sample consistency C is conditionally independent of the model M, given probability of success q. Since we were interested only in inferring P(q | C), we were able to place priors over q and the total number n_t of relevant variables and avoid having to deal explicitly with the model altogether. This is an appealing property of our approach, given that little is known about underlying models in many applications, especially in molecular biology.

[Figure 2: A graphical model for consistency, relating the generative model M, probability of success q, the number of relevant variables n_t, and sample consistency C.]

Roth and Lange [8] note that while supervised feature selection has been widely addressed in the literature [3, 9], the unsupervised case has received comparatively little attention. Unsupervised learning is generally a hard problem, but the high dimensionality and relatively small sample-size typical of the kind of problems addressed here in many ways make a hard problem harder. The mismatch between dimensionality and sample-size represents what is in effect a statistical bottleneck: models rich enough to adequately describe the underlying system become too complex to estimate from the samples available. Our notion of consistency can be thought of as an attempt to side-step this problem by looking at ranking functions from a purely combinatorial perspective. The abstraction away from model-based statistics to a combinatorial view makes our approach conceptually distinct from existing methods, and also inherently flexible. Applications to practical problems in molecular biology and information retrieval are a focus of current research.

Acknowledgements: SNM thanks Nick Hughes, Peter Sykacek and Andrew Zisserman.

References

[1] I. Lönnstedt and T. P. Speed. Replicated microarray data. Statistica Sinica, 12:31–46, 2002.
[2] S. Mukherjee and S. J. Roberts. A theoretical analysis of gene selection. In Proceedings of the IEEE Computer Society Bioinformatics Conference. IEEE Press. To appear.
[3] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 2003. Special Issue on Variable and Feature Selection.
[4] V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98(9):5116–5121, 2001.
[5] U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96(12):6745–6750, 1999.
[6] J. Bi et al. Dimensionality reduction via sparse support vector machines. Journal of Machine Learning Research, 3(Mar), 2003. Special Issue on Variable and Feature Selection.
[7] R. Caruana, S. Baluja, and T. Mitchell. Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8, 1996.
[8] V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[9] J. Weston et al. Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.
