Unsupervised Learning of Ranking Functions for High-Dimensional Data

Sach Mukherjee and Stephen J. Roberts
Department of Engineering Science, University of Oxford, U.K.

Abstract

A growing number of problems in data analysis involve ranking variables from extremely high-dimensional datasets. However, when labelled data are unavailable and the underlying statistical models are poorly understood, it becomes difficult to assess the likely effectiveness of candidate ranking functions, and thus to make an appropriate choice of method. In this paper we present an unsupervised approach to learning ranking functions, based on a simple but powerful notion of consistency. We present results on real and simulated data which demonstrate the effectiveness of our learning-based method compared with widely-used statistical techniques.

1 Introduction

A number of important problems in data analysis involve ranking variables by some notion of relevance. In an increasing number of domains, the data upon which such rankings are based have properties which make their analysis highly non-trivial. The problems we are concerned with in this paper are characterized by: (i) extremely high dimensionality and relatively small sample-size (i.e. very few datapoints), (ii) the absence of labelled data, and (iii) a poor understanding of the statistical model underlying the data. Variable-ranking tasks based on data of this kind (we shall use the term "high-throughput data") are now commonplace in molecular biology, chemistry, pharmacology and many other areas, and have led to an explosion of interest in relevant statistical methods.

As a running example with which to illustrate the issues involved, consider the differential analysis of gene microarrays [1]. (Microarray analysis is a topical example, but we emphasize that the methods developed here apply to any task involving the ranking of variables from high-dimensional, low sample-size data.) Here the variables to be ranked represent genes, and the data are expression levels measured under two or more conditions, such as healthy and diseased. Biologically relevant genes are expected to be up- or down-regulated between conditions. Ranking is therefore done using a function which scores genes in terms of differential expression (a canonical example being the two-sample t-statistic). Each of the three characteristics mentioned above has a serious effect in microarray analysis. The mismatch between dimensionality (typically around $10^4$) and sample-size (around $10^1$) means that models rich enough to capture interactions between genes become too complex to estimate. At the same time, there is usually little prior knowledge about the underlying model, and with genes not being flagged as relevant/irrelevant in the dataset, the correctness of rankings cannot easily be checked against ground-truth. Thus, while many well-founded methods exist for such data, making a reasonable choice of method on either empirical or theoretical grounds is difficult. Yet the effectiveness of the ranking function used is critical: incorrect results lead to a waste of resources, with no real sanity-check until late in the investigative life-cycle.

The ability of a ranking function to distinguish relevant and irrelevant variables can be captured as a probability of success (defined formally in Section 2.1 as the expected proportion of true positives selected). Recent research [2] has shown that the probability of success for a given ranking function is jointly determined by statistical properties of the system under study and the form of the ranking function, and can be calculated explicitly under a fully-specified model for the data. However, as the microarray example illustrates, in practice the model is poorly characterized and the data unlabelled, so there is no obvious way to determine probability of success. This makes it difficult to choose a ranking function appropriate for given data, or to have confidence in rankings obtained from data.

In this paper we address the problem of variable ranking in a wholly unsupervised, high-dimensional setting. Our approach is based around a measure of stability in ranking called consistency, which can be computed without ground-truth knowledge, but which nonetheless allows us to infer the underlying probability of success of a ranking function from data. The notion of consistency is essentially used as a proxy for (unobservable) probability of success, and used to learn effective ranking functions. Our method actively exploits the presence of large numbers of irrelevant variables, in effect making something of a blessing out of the curse of dimensionality which plagues high-throughput data analysis.

The remainder of this paper is organized as follows. We first define consistency and probability of success formally, and examine the relationship between them. We then show how consistency can be used to infer underlying probability of success and to learn appropriate ranking functions from data, and finally present results on real and simulated data.

2 Consistency

2.1 Definitions

Consider two sets of data (collectively D) pertaining to the same scientific question (these data can be regarded as equivalent to two datasets drawn from a full generative model M for the underlying system). Each dataset has the same variables; a ranking function r produces two potentially distinct orderings of these variables from the two datasets. Let the n_s highest-ranked variables in each case be selected as result-sets S_a and S_b respectively. (The number n_s of variables selected in practice tends to depend on experimental objectives and follow-up plans [1]; we therefore treat n_s as known. However, the approach developed in this paper can easily be adapted to deal with algorithms which automatically determine n_s.) Note that S_a and S_b are sets of variable-indices. Sample consistency C is then defined as the number of elements in common between the two sets:

    $C(r, n_s, D) \overset{\text{def}}{=} |S_a \cap S_b|$    (1)

As a function of random data, sample consistency C is itself a random variable; its expected value is called expected consistency κ:

    $\kappa \overset{\text{def}}{=} E[C]$    (2)

Expected consistency κ is thus simply the average number of elements in common between pairs of result-sets.
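To make the definition concrete, here is a minimal sketch (in Python; not from the paper) of computing sample consistency as in Equation (1). The ranking function used here is an absolute two-sample t-statistic, chosen only because the Introduction mentions it as a canonical score; the function names and the small constant guarding against zero variance are illustrative choices.

```python
import numpy as np

def abs_tstat(xa, xb):
    """Absolute two-sample t-statistic per variable (rows are samples, columns are variables)."""
    ma, mb = xa.mean(axis=0), xb.mean(axis=0)
    va, vb = xa.var(axis=0, ddof=1), xb.var(axis=0, ddof=1)
    se = np.sqrt(va / xa.shape[0] + vb / xb.shape[0])
    return np.abs(ma - mb) / (se + 1e-12)      # small constant guards against zero variance

def result_set(scores, n_s):
    """Indices of the n_s highest-scoring variables."""
    return np.argsort(scores)[::-1][:n_s]

def sample_consistency(scores_a, scores_b, n_s):
    """Equation (1): C = |S_a intersect S_b| for the two result-sets."""
    return len(np.intersect1d(result_set(scores_a, n_s), result_set(scores_b, n_s)))

# Usage: given two datasets D_a = (xa1, xb1) and D_b = (xa2, xb2), each with two conditions:
# C = sample_consistency(abs_tstat(xa1, xb1), abs_tstat(xa2, xb2), n_s=50)
```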

We define ground-truth probability of success q as the expected proportion of true positives among the variables selected (this quantity is known in the statistical literature as the true discovery rate). Assuming n_s is non-zero:

    $q \overset{\text{def}}{=} E\!\left[\,|S \cap \mathrm{relevant}|\,/\,n_s\,\right]$    (3)

where S is a result-set and "relevant" the full set of relevant variables.

2.2 The relationship between consistency and probability of success

Our intention is to use consistency as a proxy for underlying probability of success: in this Section we examine the relationship between the two, and show that expected consistency is positively correlated with probability of success.

Let N_t^a and N_t^b be random variables representing the number of true positives obtained from two sets of data. If we think of the selection of variables as a series of Bernoulli trials, with some probability of selecting a relevant variable at each trial, then given probability of success q, and the total number n_t of relevant variables in the system, N_t^a and N_t^b are independent and identically distributed according to (shown only for N_t^a):

    $P(N_t^a = n_t^a \mid q, n_t) = B\!\left(n_t^a \mid q \max(n_s/n_t, 1),\ \min(n_s, n_t)\right)$    (4)

where lower-case n_t^a refers to a realization of the random variable N_t^a, and B(x | π, η) is a Binomial distribution with η Bernoulli trials and probability parameter π.

Recall that sample consistency C is the number of elements in common between two result-sets. Let C comprise C_t relevant variables and C_f irrelevant ones, such that C = C_t + C_f. How are C_t and C_f distributed, given the numbers N_t^a and N_t^b of true positives in the two result-sets? Consider taking one relevant variable at a time from result-set b and trying to find it in result-set a. This is in effect a series of Bernoulli trials, with the number of successes being the number of relevant variables C_t in common between the two sets. If we make the simplifying assumption that every relevant variable has the same chance of appearing in the top n_s places under the ranking function, the distribution over C_t can be approximated by the following Binomial:

    $P(C_t = c_t \mid N_t^a, N_t^b, n_t) \approx B\!\left(c_t \mid N_t^a/n_t,\ N_t^b\right)$    (5)

Note that we have assumed (without loss of generality) that N_t^a > N_t^b. We can make a similar argument for the irrelevant variables, such that if n_tot is the total number of variables (i.e. the dimensionality), the distribution over C_f can be approximated as:

    $P(C_f = c_f \mid N_t^a, N_t^b, n_t) \approx B\!\left(c_f \,\Big|\, \frac{n_s - N_t^b}{n_{tot} - n_t},\ n_s - N_t^a\right)$    (6)

We find empirically that these Binomial approximations are very accurate for the type of data in which we are interested, mainly because the combined effects of very high dimensionality and small sample-size mean that relatively few variables are selected deterministically from such data.

Now, expected consistency κ is the expectation of C:

    $\kappa = E[C] = E[C_t + C_f] = E[C_t] + E[C_f]$    (7)

From Equations 4, 5, 6 and 7, κ can be expressed in terms of the probability of success q:

    $\kappa = \frac{q^2 n_s^2}{n_t} + \frac{(1-q)^2 n_s^2}{n_{tot} - n_t}$    (8)
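As a sanity check on Equation (8), the following sketch (not from the paper; the parameter values are illustrative) draws true-positive counts from the Binomial model of Equation (4), draws the overlap counts from Equations (5) and (6), and compares the Monte Carlo mean of C with the closed form. Note that with n_s > n_t the probability of success cannot exceed n_t/n_s, so q is kept at or below 0.5 here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tot, n_t, n_s = 1000, 25, 50        # dimensionality, relevant variables, variables selected (illustrative)

def expected_consistency(q):
    """Equation (8)."""
    return q**2 * n_s**2 / n_t + (1 - q)**2 * n_s**2 / (n_tot - n_t)

def simulated_consistency(q, n_draws=20000):
    """Monte Carlo mean of C = C_t + C_f under the Binomial model of Equations (4)-(6)."""
    p = q * max(n_s / n_t, 1.0)                              # Binomial probability in Equation (4)
    n_a = rng.binomial(min(n_s, n_t), p, size=n_draws)       # true positives in result-set a
    n_b = rng.binomial(min(n_s, n_t), p, size=n_draws)       # true positives in result-set b
    hi, lo = np.maximum(n_a, n_b), np.minimum(n_a, n_b)
    c_t = rng.binomial(lo, hi / n_t)                             # Equation (5)
    c_f = rng.binomial(n_s - hi, (n_s - lo) / (n_tot - n_t))     # Equation (6)
    return (c_t + c_f).mean()

for q in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"q={q:.1f}  kappa (Eq. 8) = {expected_consistency(q):6.2f}  simulated = {simulated_consistency(q):6.2f}")
```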

We can now ask under what conditions probability of success q and expected consistency κ are positively correlated. Suppose the probabilities of success for ranking functions #1 and #2 are q_1 and q_2 respectively, and that function #1 is more effective than function #2, such that the difference Δq (= q_1 − q_2) is positive. Then, using Equation 8 and simplifying, the corresponding difference in expected consistencies Δκ is given by:

    $\Delta\kappa = n_s^2\, \Delta q \left[(q_1 + q_2)\!\left(\frac{1}{n_t} + \frac{1}{n_{tot} - n_t}\right) - \frac{2}{n_{tot} - n_t}\right]$    (9)

Examining this expression we can see that q and κ are positively correlated if and only if the term within square brackets is positive. It is easy to show that this is the case when the average probability of success for functions #1 and #2 exceeds what would be expected if variables were selected entirely at random; that is, sign(Δq) = sign(Δκ) if and only if (q_1 + q_2)/2 > n_t/n_tot. This condition is expected to hold for any plausible ranking function. It is also interesting to note that for given Δq, a large proportion of irrelevant variables helps produce a pronounced response in terms of consistency: the corresponding difference Δκ increases with (n_tot − n_t)/n_tot (subject to the sufficient condition n_t < n_tot/2, which holds for most systems of interest).

An example will provide an intuitive sense of the reason why consistency and probability of success are correlated when irrelevant variables significantly outnumber relevant ones. Consider a scenario where the total number of variables runs into the thousands, with only a few dozen being truly relevant. Suppose also that some proportion of the variables selected by an algorithm are false positives. Then, provided a good number of these false positives are chosen more-or-less at random from the large pool of irrelevant variables, the variability in their identities will tend to be high, compared with the corresponding variation among the relevant variables selected. Hence, the greater the proportion of relevant variables among those selected, the more agreement there will tend to be between result-sets. The correlation between q and κ can also be verified by simulation.

3 Inferring probability of success from data

The results given above show that expected consistency κ correlates with ground-truth probability of success. However, in order to be able to use the notion of consistency in practice, we must be able to infer probability of success from a single observed consistency. The required posterior density over probability of success q, given sample consistency C, can be obtained using Bayes' theorem:

    $p(q \mid C) = \frac{P(C \mid q)\, p(q)}{\int_0^1 P(C \mid q)\, p(q)\, dq}$    (10)

A beta distribution, symmetric about its mode, is used as a prior for q, with parameters chosen to assign relatively little probability mass to the extremes of 0 and 1 (this reflects our intuition that ranking functions are unlikely to be either perfect or entirely useless). The term P(C | q) can be expressed as follows by marginalizing over N_t^a, N_t^b and n_t:

    $P(C \mid q) = \sum_{N_t^a,\, N_t^b,\, n_t} P(C \mid N_t^a, N_t^b, n_t, q)\, P(N_t^a, N_t^b, n_t \mid q)$    (11)

Since N_t^a and N_t^b are independent, and the total number n_t of relevant variables does not depend on probability of success q, P(N_t^a, N_t^b, n_t | q) can be expressed as follows:

    $P(N_t^a, N_t^b, n_t \mid q) = P(N_t^a, N_t^b \mid n_t, q)\, P(n_t \mid q) = P(N_t^a \mid n_t, q)\, P(N_t^b \mid n_t, q)\, P(n_t)$    (12)
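The sign condition above can be checked numerically. The short sketch below (not from the paper; values are illustrative) sweeps pairs (q_1, q_2) and confirms that Δq and Δκ, computed from Equation (8), have the same sign exactly when (q_1 + q_2)/2 > n_t/n_tot.

```python
import numpy as np

n_tot, n_t, n_s = 1000, 25, 50        # illustrative values

def kappa(q):
    """Expected consistency, Equation (8)."""
    return q**2 * n_s**2 / n_t + (1 - q)**2 * n_s**2 / (n_tot - n_t)

qs = np.linspace(0.0, 0.5, 26)        # q cannot exceed n_t / n_s = 0.5 with these values
agree = True
for q1 in qs:
    for q2 in qs:
        if q1 == q2:
            continue
        same_sign = np.sign(q1 - q2) == np.sign(kappa(q1) - kappa(q2))
        condition = (q1 + q2) / 2 > n_t / n_tot
        agree &= (same_sign == condition)
print("sign(dq) == sign(dkappa) iff (q1+q2)/2 > n_t/n_tot :", agree)   # expected: True
```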

Now, recall that C = C_t + C_f, and that C_t and C_f in turn depend only on N_t^a, N_t^b and n_t (Equations 5 and 6). Then, substituting Equation 12 into Equation 11, we obtain P(C | q) explicitly in terms of the conditional distributions derived in Section 2.2:

    $P(C \mid q) = \sum_{n_t=0}^{n_{tot}} \sum_{N_t^a=0}^{n_t} \sum_{N_t^b=0}^{n_t} P(C \mid N_t^a, N_t^b, n_t)\, P(N_t^a \mid n_t, q)\, P(N_t^b \mid n_t, q)\, P(n_t)$    (13)

The term P(C | N_t^a, N_t^b, n_t) is given by Equations 5 and 6 together, and the distributions of N_t^a and N_t^b by Equation 4. A prior is required for n_t; any suitable discrete distribution may be chosen in accordance with background knowledge. At the very least, it is usually possible to impose bounds on n_t which are plausible in the context of the experiment.

In some situations, only a single dataset (rather than a pair of datasets) is available. In such cases, our approach is to obtain sample consistency by repeatedly and randomly partitioning the dataset into halves, computing consistency between results obtained from each partition, and averaging those values. We find that a stable estimate of consistency can usually be obtained after fewer than 50 iterations.

4 Learning ranking functions

We now have, in consistency, an easy-to-compute proxy for ground-truth probability of success, and a framework within which we can, if desired, infer probability of success. We are therefore in a position to address the question of learning effective ranking functions from data. Starting with data D, and a suitable parameterized family of ranking functions, our approach is simply to use consistency to automatically choose the member of the family most likely to have the highest probability of success. Suppose the family is defined by f(θ), with θ being a vector of parameters which, when instantiated, specifies a particular function. Then, the most appropriate member of the family, in terms of sample consistency, is f(θ̂):

    $\hat{\theta} = \arg\max_{\theta}\; C(f(\theta), n_s, D)$    (14)

Note that the computation of consistency for each proposed parameter vector θ involves only ranking and set-intersection and is therefore very rapid. This means that although the optimization in Equation 14 will not in general permit a closed-form solution, sampling methods can be used to rapidly explore search-spaces for most families of functions.

Depending on the precise nature of the task being addressed, the family f may potentially represent any kind of ranking function, but for a concrete example, consider again the differential analysis of microarray data. Recall that the aim is to score each variable in terms of how likely it is to have distinct means in the two conditions. The most common choices of ranking function thus tend to be correlation criteria of various kinds [3]. A suitable family of functions for differential analyses of this kind is therefore:

    $f(\theta_1, \theta_2, \theta_3) = \frac{d + \theta_1}{\theta_2\,\sigma + \theta_3}$    (15)

where d is the absolute difference of sample means between conditions, and σ the sample standard deviation for the variable to be scored. The θ's are scalar parameters analogous to the vector θ above. We use a straightforward Monte Carlo scheme to learn the θ's: the parameters are sampled uniformly in the range [−1, 1] and rejected in case of division by zero; sample consistency is calculated for the function corresponding to each set of parameters drawn, and the parameters with highest consistency are returned after 10^3 iterations.
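The following is a minimal sketch of the learning step just described: sampling θ uniformly in [−1, 1]^3, scoring each candidate with a random-halving estimate of sample consistency, and keeping the best (Equations 14 and 15). It is not the authors' code; in particular the use of a pooled per-variable standard deviation for σ, the number of half-splits, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def family_score(xa, xb, theta):
    """Ranking-function family of Equation (15): (d + theta1) / (theta2 * sigma + theta3).
    d is the absolute difference of sample means; sigma is taken here as a pooled
    per-variable standard deviation (an assumption; the paper does not specify the pooling)."""
    t1, t2, t3 = theta
    d = np.abs(xa.mean(axis=0) - xb.mean(axis=0))
    sigma = np.sqrt(0.5 * (xa.var(axis=0, ddof=1) + xb.var(axis=0, ddof=1)))
    denom = t2 * sigma + t3
    if np.any(denom == 0):
        raise ZeroDivisionError("rejected: division by zero")
    return (d + t1) / denom

def split_half_consistency(xa, xb, theta, n_s, n_splits=25):
    """Average sample consistency over random half-splits of the data (Section 3)."""
    vals = []
    for _ in range(n_splits):
        pa, pb = rng.permutation(len(xa)), rng.permutation(len(xb))
        ha, hb = len(xa) // 2, len(xb) // 2
        s1 = family_score(xa[pa[:ha]], xb[pb[:hb]], theta)
        s2 = family_score(xa[pa[ha:]], xb[pb[hb:]], theta)
        S1, S2 = np.argsort(s1)[::-1][:n_s], np.argsort(s2)[::-1][:n_s]
        vals.append(len(np.intersect1d(S1, S2)))           # Equation (1)
    return float(np.mean(vals))

def learn_ranking_function(xa, xb, n_s, n_iter=1000):
    """Equation (14): return the theta with highest sample consistency."""
    best_theta, best_c = None, -np.inf
    for _ in range(n_iter):
        theta = rng.uniform(-1.0, 1.0, size=3)
        try:
            c = split_half_consistency(xa, xb, theta, n_s)
        except ZeroDivisionError:
            continue
        if c > best_c:
            best_theta, best_c = theta, c
    return best_theta, best_c
```

Ranking then amounts to scoring all variables with the learned member of the family and selecting the n_s highest-scoring ones.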

[Figure 1: Results on simulated microarray data. Panel (a) shows probability of success plotted against the ratio of variances of relevant variables to irrelevant ones ("variance ratio") for the learning-ranking method (LR), the SAM statistic, the Mann-Whitney statistic (MW) and the Fisher score (FS). Panel (b) shows the lowest probability of success observed over the variance ratios considered.]

Learning an appropriate function and ranking using that function can be performed as successive steps, so that we can think of a combined learning-ranking algorithm which takes data as its input and produces a ranking of variables as output. The results presented and discussed in the next Section use a learning-ranking approach based on the family defined by Equation 15. They show that learning under the consistency framework outperforms widely-used statistical methods on both real and simulated data.

5 Results

Simulated data: Figure 1 shows results on simulated microarray data for our method and three widely-used ranking functions: the non-parametric Mann-Whitney statistic, the Fisher score, and SAM [4]. SAM is a regularized statistic developed specifically for microarray analysis, and is widely regarded as one of the best choices for differential ranking problems of this kind. The computational procedure was as follows. At each iteration, 15 datapoints under each of two conditions were sampled from a 1000-dimensional Gaussian model (sample-sizes as small as this make analysis difficult, but are quite typical in microarray analysis). Only 25 of the 1000 dimensions had distinct underlying means in the two conditions; these are the relevant variables or "differentially expressed genes". The ranking methods were applied to the sampled data, with n_s = 50 variables being selected (note that the aim here is to select the relevant variables rather than to classify datapoints). The identities of the relevant variables were used to check results and compute probabilities of success, but of course remained hidden from the algorithms. The ratio of variances of relevant to irrelevant variables ("variance ratio"), while in practice unobservable, is known to have a major impact on the performance of ranking functions of this kind [2]. To examine the robustness of the ranking functions we therefore considered a range of variance ratios, generating 500 datasets for each ratio to obtain accurate estimates of probability of success.
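For concreteness, here is a sketch of the simulated setup just described (15 datapoints per condition, 1000 dimensions, 25 relevant variables, n_s = 50) and of how probability of success is estimated against the known relevant set. The mean shift, variance ratio and baseline ranking function below are placeholders for illustration; the paper does not report these exact values.

```python
import numpy as np

rng = np.random.default_rng(2)

n_dim, n_rel, n_per_cond, n_s = 1000, 25, 15, 50
relevant = np.arange(n_rel)                  # the first 25 dimensions are the 'differentially expressed' ones

def draw_dataset(shift=1.0, var_ratio=1.0):
    """Two-condition Gaussian data; relevant variables get a mean shift and a scaled variance."""
    sd = np.ones(n_dim)
    sd[relevant] = np.sqrt(var_ratio)
    xa = rng.normal(0.0, sd, size=(n_per_cond, n_dim))
    xb = rng.normal(0.0, sd, size=(n_per_cond, n_dim))
    xb[:, relevant] += shift
    return xa, xb

def prob_success(scores):
    """Proportion of true positives among the n_s top-ranked variables (Equation 3, empirically)."""
    top = np.argsort(scores)[::-1][:n_s]
    return len(np.intersect1d(top, relevant)) / n_s

# Example: probability of success of a simple absolute-difference-of-means ranking, averaged over datasets
ps = []
for _ in range(100):
    xa, xb = draw_dataset()
    ps.append(prob_success(np.abs(xa.mean(axis=0) - xb.mean(axis=0))))
print("estimated probability of success:", np.mean(ps))
```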

The results demonstrate the robustness of our algorithm. The other methods do well in some regions of the curve but quite badly in others (which makes results from real data, where the true variance ratio is unknown, hard to trust). In contrast, our method does well across the range of ratios, outperforming SAM in every region of the curve. The fundamental reason for the success of our method is its ability to use a measure of probability of success for guidance. Here, it has effectively adapted the simple function defined by Equation 15 to the unobserved variance ratio, and done so without needing to consider that ratio directly.

Genomic data: We applied our method to a widely-studied microarray dataset pertaining to colon cancer [5], allowing it to learn an appropriate member of the family defined by Equation 15. Posterior distributions over probability of success for the learned function and the three functions mentioned above were inferred following Equation 10 (with a beta prior for q and a uniform prior for n_t, bounded between [10, 100]). Table 1 shows normalized consistencies (i.e. C/n_s) and MAP-estimated probabilities of success for the four methods; our method does noticeably better than the others. But how significant are the observed differences? A good way of answering this question is to ask how confident we can be that our method is more effective on this data than the other functions: that is, by considering the posterior probability that its probability of success is higher. (Suppose the posteriors over probability of success for algorithms #1 and #2 are given by the densities p_1 and p_2 respectively; if φ_2 is the cumulative distribution function corresponding to p_2, the probability of algorithm #1 having a higher underlying probability of success is just $P(q_2 < q_1) = \int_0^1 \phi_2(u)\, p_1(u)\, du$.) This measure of confidence can be computed directly from the inferred posteriors, and is shown, for each pair of ranking functions, in Table 1. (These confidence scores read from left to right, so that we can, for example, be 72% confident that our method is more effective than SAM, and 98% confident in comparison with the Fisher score.)

[Table 1: Results on colon cancer microarray data, reporting, for the learning-ranking method (LR), the SAM statistic (SAM), the Fisher score (FS) and the Mann-Whitney statistic (MW): normalized consistency (C/n_s), MAP probability of success, and pairwise confidence scores for each function against the others.]
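A short numerical sketch of the confidence score just described: given gridded posterior densities over probability of success for two ranking functions, P(q_2 < q_1) is computed as the integral of φ_2(u) p_1(u). The beta-shaped posteriors below are placeholders for illustration, not the posteriors inferred from the colon-cancer data.

```python
import numpy as np
from scipy.stats import beta

u = np.linspace(0.0, 1.0, 2001)              # grid over probability of success
du = u[1] - u[0]

# Placeholder posteriors over q for two ranking functions (illustrative shapes only)
p1 = beta.pdf(u, 30, 20)                     # function #1: posterior concentrated near 0.6
p2 = beta.pdf(u, 25, 25)                     # function #2: posterior concentrated near 0.5

phi2 = np.cumsum(p2) * du                    # cumulative distribution of q_2 on the grid
confidence = np.sum(phi2 * p1) * du          # P(q_2 < q_1) = integral of phi_2(u) p_1(u) du
print(f"confidence that function #1 has higher probability of success: {confidence:.2f}")
```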

6 Discussion and conclusions

This paper has sought to address unsupervised variable ranking in the kind of extremely high-dimensional setting now common in several important areas of science. Our approach was centred around a measure of stability called consistency. The notion that stability is a good thing occurs widely in the literature (e.g. [6] in the context of variable selection). However, to the best of our knowledge, our explicitly probabilistic view of stability in ranking, and the subsequent inference of underlying probability of success, are novel (and indeed made possible by the very properties of high-throughput data which are otherwise a "curse").

There are also two interesting similarities between our approach and the Rankprop [7] algorithm (which otherwise addresses a quite different, supervised problem). Firstly, quality of ranking is emphasized as a learning objective, as it is here. Secondly, Rankprop avoids learning a difficult target function by instead learning a simpler function positively correlated with the target. This is similar in spirit to our use of consistency as an easy-to-compute and positively correlated proxy for probability of success.

An interesting feature of the inference scheme discussed in Section 3 was that consistency depended only implicitly on the statistical model underlying the data. Figure 2 illustrates the relevant high-level dependencies as a graphical model: sample consistency C is conditionally independent of the model M, given probability of success q. Since we were interested only in inferring P(q | C), we were able to place priors over q and the total number n_t of relevant variables and avoid having to deal explicitly with the model altogether. This is an appealing property of our approach, given that little is known about underlying models in many applications, especially in molecular biology.

[Figure 2: A graphical model for consistency, relating the generative model M, probability of success q, the number of relevant variables n_t, and sample consistency C.]

Roth and Lange [8] note that while supervised feature selection has been widely addressed in the literature [3, 9], the unsupervised case has received comparatively little attention. Unsupervised learning is generally a hard problem, but the high dimensionality and relatively small sample-size typical of the kind of problems addressed here in many ways make a hard problem harder. The mismatch between dimensionality and sample-size represents what is in effect a statistical bottleneck: models rich enough to adequately describe the underlying system become too complex to estimate from the samples available. Our notion of consistency can be thought of as an attempt to side-step this problem by looking at ranking functions from a purely combinatorial perspective. The abstraction away from model-based statistics to a combinatorial view makes our approach conceptually distinct from existing methods, and also inherently flexible. Applications to practical problems in molecular biology and information retrieval are a focus of current research.

Acknowledgements: SNM thanks Nick Hughes, Peter Sykacek and Andrew Zisserman.

References

[1] I. Lönnstedt and T. P. Speed. Replicated microarray data. Statistica Sinica, 12:31–46, 2002.
[2] S. Mukherjee and S. J. Roberts. A theoretical analysis of gene selection. In Proceedings of the IEEE Computer Society Bioinformatics Conference. IEEE Press. To appear.
[3] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), 2003. Special Issue on Variable and Feature Selection.
[4] V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98(9):5116–5121, 2001.
[5] U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96(12):6745–6750, 1999.
[6] J. Bi et al. Dimensionality reduction via sparse support vector machines. Journal of Machine Learning Research, 3(Mar), 2003. Special Issue on Variable and Feature Selection.
[7] R. Caruana, S. Baluja, and T. Mitchell. Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8, 1996.
[8] V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[9] J. Weston et al. Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13. MIT Press, 2001.
