A Bayesian Criterion for Clustering Stability
1 A Bayesian Criterion for Clustering Stability. B. Clarke, Dept. of Medicine, CCS, DEPH, University of Miami. Joint with H. Koepke, Dept. of Statistics, University of Washington. 26 June 2012, ISBA, Kyoto.
2 Outline 1 Assessing Stability
3 Stability of Clustering, not Clustering per se. Imagine n points, say $x_i = (x_{1,i}, \dots, x_{D,i})$ for $i = 1, \dots, n$. Our goal is to put the points that belong together in the same set and to make sure points that don't belong together are in different sets. Such a set is called a cluster; a set of clusters is called a clustering (of the points). There are many ways to form clusters... Given a proposed clustering of almost any sort, we propose a technique for checking whether it is reasonable: validation. There are many ways to do this too. Most are based on some notion of clustering stability... so is ours.
4 Key Definitions. Suppose we have $\hat{C} = \{\hat{C}_1, \dots, \hat{C}_K\}$ where the $\hat{C}_k$'s are disjoint and $\bigcup_k \hat{C}_k = \{x_1, \dots, x_n\}$. We evaluate the stability of a fixed cluster $\hat{C}_k$ that has $x_i$ as a member using sets of the form $\hat{S}_{ik} = \{(\lambda_1, \dots, \lambda_K) : \lambda_k\, d(x_i, \hat{\mu}_k) \le \min_{l \ne k} \lambda_l\, d(x_i, \hat{\mu}_l)\}$, where the $\lambda_k$'s are parameters, $d$ is a distance, and $\hat{\mu}_k$ is the centroid of $\hat{C}_k$. The larger $\hat{S}_{ik}$ is, the more stable $\hat{C}$ is, so choose a prior distribution $F$ and calculate $F(\hat{S}_{ik})$.
5 The Main Stability Assessment. Let $I$ be an indicator function... then $[\phi_{ik}]_{i=1,\dots,n;\ k=1,\dots,K} = \int I_{\hat{S}_{ik}}\, dF(\lambda^K)$. This is the averaged assignment matrix. If $F$ puts all its mass on $\lambda^K = (1, \dots, 1)$ then we get $\phi_{ik} = 0$ or $1$ depending on whether $x_i \in \hat{C}_k$. More generally, $F$ spreads the membership of $x_i$ across the $K$ clusters. The pointwise stability of $x_i$ is $PW(x_i) = PW_i = \phi_{i,h_i}$, where $h_i = k$ for the $k$ with $x_i \in \hat{C}_k$.
6 Behavior of PW stability. Suppose we fix the centroids $\hat{\mu}_k$ of the $\hat{C}_k$'s and use PW for a general point $x$. Thus $PW(x) = F(\{\lambda^K : \lambda_k\, d(x, \hat{\mu}_k) \le \lambda_l\, d(x, \hat{\mu}_l)\ \forall\, l \ne k\})$. Then we can visualize (in 2 dimensions) the effect of $F$, for $K = 5$ and $F$ IID shifted exponential, i.e., $f(\lambda \mid \beta) = \beta e^{-\beta(\lambda - 1)} \mathbf{1}_{\lambda \ge 1}$, with $\beta = 0.1, 0.75, 2, 10$. As the hyperparameter $\beta$ increases, the regions of stability increase because it becomes harder and harder to regard a point as belonging to a cluster it wasn't assigned to.
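A minimal Monte Carlo sketch of these quantities is below, assuming the IID shifted-exponential prior and Euclidean distance; the function name `stability_matrix`, the number of draws, and the other details are illustrative choices, not the authors' implementation.

```python
import numpy as np

def stability_matrix(X, labels, centroids, beta=2.0, n_draws=2000, seed=0):
    """phi[i, k] approximates F(lambda_k * d(x_i, mu_k) <= min_{l != k} lambda_l * d(x_i, mu_l))."""
    rng = np.random.default_rng(seed)
    n, K = X.shape[0], centroids.shape[0]
    # Euclidean point-to-centroid distances, shape (n, K); the choice of d is an assumption.
    D = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    phi = np.zeros((n, K))
    for _ in range(n_draws):
        # One draw of (lambda_1, ..., lambda_K): IID 1 + Exponential(rate beta).
        lam = 1.0 + rng.exponential(scale=1.0 / beta, size=K)
        winners = np.argmin(lam * D, axis=1)   # cluster each point joins under this perturbation
        phi[np.arange(n), winners] += 1.0
    phi /= n_draws                             # Monte Carlo estimate of the averaged assignment matrix
    pw = phi[np.arange(n), labels]             # pointwise stabilities PW_i = phi_{i, h_i}
    return phi, pw, float(pw.mean())           # last value is the average pointwise stability (APW)
```

With a K-means fit `km`, one would call `stability_matrix(X, km.labels_, km.cluster_centers_, beta)` to get the averaged assignment matrix, the pointwise stabilities, and their average (the APW used below).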
7 [Figure: pointwise stability regions for β = 0.1]
8 [Figure: pointwise stability regions for β = 0.75]
9 [Figure: pointwise stability regions for β = 2]
10 [Figure: pointwise stability regions for β = 10]
11 [Figure: pointwise stability with data and adaptive β; five clusters labeled A-E]
12 A Visualization not Limited to 2 Dimensions. Recall $i$ indexes data points and $k$ indexes clusters. Sort the elements $\phi_{ik}$ by cluster; then, for $i \in \hat{C}_{k_0}$, put the $\phi_{i,k_0}$'s in decreasing order. The result is a heatmap...
13 Heatmap for the Data. [Figure: stability heatmap with clusters A-E.] Dark lines at the bottom of diagonal blocks show instability.
14 But Is This Bayes? No: the prior $F$ is not multiplying a likelihood; there is no posterior; there is no credible set, point estimator, hypothesis test... Yes: the prior $F$ represents a distinct source of information and we are making inferences from a combination of the data and the prior. Editorializing: I think this is extended Bayes in the sense that we are combining the data and pre-experimental information. In this case, the prior represents our pre-experimental views about how much perturbation is reasonable.
15 And what does this have to do with prediction? This is intended for prediction with classification data: a sanity check that the classes are reasonable. An example below will give good reason for such skepticism. So, you want to find the classes from the clustering and then use the clusters to make predictions. Natural predictor: assign $x_{new}$ to class $k$ when $k = \arg\min_u d(x_{new}, \hat{\mu}_u)$. At this point we have only developed the stability method... we have not tested how well it gives predictions. We have only (so far) shown that clustering stability may give more reasonable classes than the classes in the data.
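A sketch of this natural nearest-centroid predictor, assuming Euclidean distance and centroids stored as rows of an array (names are hypothetical):

```python
import numpy as np

def predict_cluster(x_new, centroids):
    """Assign x_new to arg min_u d(x_new, mu_u); Euclidean d is an assumption."""
    return int(np.argmin(np.linalg.norm(centroids - x_new, axis=1)))
```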
16 Let's look at the $\phi_{ik}$'s in more detail. We actually want the Average Pointwise Stability for $\hat{C}$: $APW = (1/n) \sum_{i=1}^{n} \phi_{i,h_i}$. To understand the APW, let's imagine there is a true clustering $C = (C_1, \dots, C_{K_T})$ with centroids $\mu_k$ and $S_{ik} = \{(\lambda_1, \dots, \lambda_K) : \lambda_k\, d(x_i, \mu_k) \le \min_{l \ne k} \lambda_l\, d(x_i, \mu_l)\}$.
17 Relation Between Empirical and Population Quantities. Let's also assume convexity: $x \in \hat{C}_k \iff d(x, \hat{\mu}_k) \le d(x, \hat{\mu}_l)\ \forall\, l \ne k$, and $x \in C_k \iff d(x, \mu_k) \le d(x, \mu_l)\ \forall\, l \ne k$. This lets us go back and forth between treating $\hat{C}$ as a partition of the data and as a partition of the sample space. Suppose $P(\hat{C}_k \,\triangle\, C_k) \to 0$, i.e., the empirical clusters go to the population clusters as $n \to \infty$. As long as the $\hat{\mu}_k$'s converge to their corresponding $\mu_k$'s, we get convergence of the clusters too. This is basically a law of large numbers.
18 Empirical and Population Criteria. We approximate the APW by $Q_n(K) = \sum_{j=1}^{K} \frac{1}{n} \sum_{i=1}^{n} I_{\{x_i \in \hat{C}_{Kj}\}} \int I_{\hat{S}_{ij}(\lambda^K)}(X_i)\, dF(\lambda^K)$, where $\hat{S}_{ij}(\lambda^K)$ is viewed as the region of sample space assigned to cluster $j$ at perturbation $\lambda^K$. We approximate the limit of the APW by $Q_\infty(K) = \sum_{j=1}^{K} E\!\left[ I_{\{X_1 \in C_{Kj}\}} \int I_{S_{1j}(\lambda^K)}(X_1)\, dF(\lambda^K) \right]$. Theorem: $Q_n(K) \to Q_\infty(K)$.
19 Choosing K. Regard $Q_n(K)$ as a data-dependent objective function for choosing $K$. Let $[K_1, K_2]$ be the (compact) set of integers strictly between $K_1 - 1$ and $K_2 + 1$. Write $\hat{K} = \arg\max_{K \in [K_1, K_2]} Q_n(K)$. (1) By Theorem 1, for any bounded interval $[K_1, K_2]$ we have that $Q_n(K) \to Q_\infty(K)$ uniformly in probability on $[K_1, K_2]$. Let $K_{opt} = \arg\max_{K \in [K_1, K_2]} Q_\infty(K)$.
20 Consistency Theorem. Theorem: Suppose that $K_{opt}$ is the unique maximizer of $Q_\infty(K)$ over $K$ and that $K_{opt} \in [K_1, K_2]$. Then $\arg\max_{K \in [K_1, K_2]} Q_n(K) \to \arg\max_{K \in [K_1, K_2]} Q_\infty(K)$, (2) i.e., $\hat{K} \to K_{opt}$ in probability as $n \to \infty$. Proof: Convergence follows from a simple modification of Theorem 2.1 in Newey and McFadden (1994).
21 Three Steps to Make this Work. Step 1: Get the input. Assume that for each $K$ we have an optimal clustering, e.g., from $K$-means. Step 2: Rescale. For each $K$, generate $B$ bootstrap samples and compute the values $S_{K,b} = \log(APW / APW_b)$. Using this bootstrap distribution for $S_K$, choose $\hat{K}$ to be the smallest plausible maximizer of $(1/B) \sum_b S_{K,b}$. This is like using a permutation distribution to get at relative stability.
22 Prior Selection. Step 3: After some testing we found that the shifted exponential prior with hyperparameter $\beta$ worked well. So... we estimated the hyperparameter essentially by maximizing the empirical form of $\frac{1}{K_2 - K_1} \sum_{K=K_1}^{K_2} E(S_K)$. For one-dimensional hyperparameters this can be done by a simple bisection algorithm.
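The sketch below strings Steps 2 and 3 together under stated assumptions: K-means stands in for "an optimal clustering for each K", the `stability_matrix` helper sketched earlier supplies the APW, and a coarse grid over β replaces the bisection search; none of this is the authors' code. The "smallest plausible maximum" rule for picking $\hat{K}$ from the resulting $S_K$ sequence is left to the analyst.

```python
import numpy as np
from sklearn.cluster import KMeans

def relative_stability(X, K, beta, B=50, seed=0):
    """Mean of S_{K,b} = log(APW) - log(APW_b) over B bootstrap reclusterings."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    _, _, apw = stability_matrix(X, km.labels_, km.cluster_centers_, beta)
    s_vals = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))           # bootstrap resample of the data
        Xb = X[idx]
        kb = KMeans(n_clusters=K, n_init=10).fit(Xb)
        _, _, apw_b = stability_matrix(Xb, kb.labels_, kb.cluster_centers_, beta)
        s_vals.append(np.log(apw) - np.log(apw_b))
    return float(np.mean(s_vals))

def choose_beta(X, k_values, betas=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """Grid-search stand-in for the one-dimensional bisection over beta."""
    scores = [np.mean([relative_stability(X, K, b) for K in k_values]) for b in betas]
    return betas[int(np.argmax(scores))]
```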
23 Synthetic Data and Five Stability Methods. We used three classes of normal mixtures in 2, 5, 10, and 20 dimensions, with 100, 150, 200, and 300 data points each, respectively. These classes were: T1: four mixture components, with strong non-linear transformations to give the components more difficult shapes. T2: five mixture components, but easier non-linear operations, so the clusters are better modeled by centroid-based methods. T3: samples from a single unimodal normal density, to test the ability of the methods to detect when there is no actual clustering, i.e., K = 1.
24 Five Stability Methods for Choosing K. We used two versions of our stability measure: one with d set to squared error, the other with d based on average linkage (take the average of point-to-centroid distances). We also used three conceptually different assessments of stability, more or less in their standard form: subsampling, silhouette distance, and the Gap statistic.
25 T1, $K_T = 4$: 10d, n = 200 and 20d, n = 400. [Table: distribution of $\hat{K}$ for the Gap statistic, Subsampling, Silhouette, Perturbation, and Perturbation with average linkage; the numerical entries were lost in transcription.]
26 T2, $K_T = 5$: 10d, n = 200 and 20d, n = 400. [Table: distribution of $\hat{K}$ for the Gap statistic, Subsampling, Silhouette, Perturbation, and Perturbation with average linkage; the numerical entries were lost in transcription.]
27 T3 data. In this case the correct number of clusters was 1. In all the cases we examined, both forms of the perturbation statistic were best... except for one. With 2 dimensions the Gap statistic gave K = 1 correctly about 14% of the time and all the other methods did worse. In this one setting no method did well except for our perturbation method, and average linkage did better than Euclidean.
28 General Comparisons. The Gap statistic performs poorly: it captures the first statistically significant clustering, not the most significant clustering. It can't tell two separated groups of clusters from 2 clusters, and it tends to break long thin clusters into several clusters. Silhouette: our method outperforms it in every case we looked at; its intuition is similar to ours, but ours is more thorough. Subsampling methods are the only real competitor... but they can't detect the $K_T = 1$ case and they have the same problems as the Gap statistic. On the other hand, in high dimensions they may be competitive.
29 Yeast data. This is classification data with 10 classes and n = , but we threw out the class indicators and clustered on the 8 explanatory variables. First written up in Nakai et al. (1991), Osaka; available from the UC Irvine repository. Clustering on this dataset is quite difficult as the classes do not separate easily. We sphered the data and then found the K-means clusterings for a range of K. The stability sequence shows K = 8 is most stable, but values from five through nine were not bad.
30 Stability Plot. The stability sequence for K-means clusterings with $K = 2, \dots, 12$ for the yeast data set. The most stable choice is K = 8, not the K = 10 that is physically correct.
31 What we did... We sphered the data, rather than studentizing it or using it as is, because sphering gave the most reasonable stability sequence with K-means clustering. Also, the heatmap of the most stable clustering showed clear distinctions between the classes. The stability sequences for the raw or studentized data had a maximum at two and then declined, and with 2 clusters the heatmaps did not show distinct components. It remains unclear when to sphere, studentize, or use the data as is. Here, we show the most stable version. We plotted stability heatmaps for K = 5, ..., 10.
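For concreteness, here is one plausible reading of the two preprocessing options: studentizing (unit variance per variable) and sphering (whitening so the sample covariance is the identity). The exact transformations used in the talk are not specified, so treat the details as assumptions.

```python
import numpy as np

def studentize(X):
    """Center each variable and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def sphere(X, eps=1e-10):
    """ZCA-style whitening: rotate/scale so the sample covariance is (nearly) the identity."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)                         # symmetric eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T   # whitening matrix
    return Xc @ W
```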
32 [Figure: the stability heatmaps for K = 6 and K = 7.]
33 [Figure: the stability heatmaps for K = 8 and K = 9.]
34 What the Heatmaps Show: When K = 6, the blocks on the main diagonal are not very light and on each row there is not a lot of difference between the cluster from the main diagonal and its competing clusters. For K = 7, the blocks off the main diagonal are a little darker than those on the main diagonal. The contrast is stronger for K = 8 and K = 9. However, there is little (if any) improvement in the contrast between the blocks on and off the main diagonal in moving from K = 8 to K = 9. The heatmaps for K = 5, 10 show worse stability.
35 The Class Label Bar on the Right: The class label bars indicate how well the clustering tends to reproduce the class labels. These classes do not separate well, but there are distinct groups forming each cluster, particularly at K = 8. The bars show that the most stable blocks tend to be from the same class, but the tendency is not strong. A practitioner can note which clusters are well separated and which are separated so little that merging them should be considered. (The reverse happens too!) Our analysis finds that 7 or 8 classes are more stable than the ten apparent classes, suggesting that some of the classes overlap significantly.
36 Summary I. We have formulated a stability criterion based on how easily inequalities between point-to-cluster-centroid distances are preserved. Its absolute form approximates the average pointwise stability and gives consistent selection of K. Large values of our stability criterion indicate stability. It accurately encapsulates our usual notion of stability in several key cases; in particular, it responds to how heavily populated the boundary regions are.
37 Summary II. For usage we need a sequence of clusterings, one optimal for each K. We convert our criterion to a relative form. We estimate the hyperparameter by using the relative form. We transform the data (studentize, sphere, use as is, etc.) to find the $\hat{K}$ of the most stable clustering. We analyze the most stable clusterings by heatmaps and class label bars. Sometimes stable clusterings suggest more or fewer classes than the classification data has; this reveals structure that might otherwise be overlooked.
38 Special Case: Concentration at the $\mu_k$'s. If the distribution of X concentrates at $\mu_1$ and $\mu_2$ then $Q_\infty(2)$ goes to 1. This generalizes to K clusters. For K = 2, let $\mu_j = E(X \mid C_j)$ and $D_j = d(X, \mu_j)$. Let $\Lambda_1 = \lambda_2/\lambda_1$, $\Lambda_2 = \lambda_1/\lambda_2$, and let $G_{\Lambda_u}$ be the survival function of $\Lambda_u$. One can show $Q_\infty(2) = E[I_{\{D_1/D_2 \le 1\}}\, G_{\Lambda_1}(D_1/D_2)] + E[I_{\{D_2/D_1 \le 1\}}\, G_{\Lambda_2}(D_2/D_1)]$. So, if $D_1/D_2$ is small on $C_1$ then the first term is near $P(C_1)$, so $C_1$ is stable. If $D_2/D_1$ is small on $C_2$ then the second term is near $P(C_2)$. This means $Q_\infty(2)$ is near its maximum $P(C_1) + P(C_2) = 1$, so $\hat{Q}_n(2)$ should be close to its maximum 1, too.
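To see where the survival functions enter, here is a short derivation consistent with the definitions on the earlier slides; it is our reconstruction of the implied step, not a verbatim part of the talk.

```latex
% For K = 2 and a point with D_1 \le D_2 (so it is assigned to C_1), the
% stability set from the earlier slides is
%   S_1 = \{\lambda : \lambda_1 D_1 \le \lambda_2 D_2\}
%       = \{\lambda : \Lambda_1 = \lambda_2/\lambda_1 \ge D_1/D_2\},
% whose prior mass is the survival function of \Lambda_1 evaluated at D_1/D_2:
F(S_1) \;=\; P\!\left(\Lambda_1 \ge \tfrac{D_1}{D_2}\right) \;=\; G_{\Lambda_1}\!\left(\tfrac{D_1}{D_2}\right).
% Treating points assigned to C_2 symmetrically and averaging over X gives
Q_\infty(2) \;=\; E\!\left[I_{\{D_1/D_2 \le 1\}}\, G_{\Lambda_1}\!\left(\tfrac{D_1}{D_2}\right)\right]
            \;+\; E\!\left[I_{\{D_2/D_1 \le 1\}}\, G_{\Lambda_2}\!\left(\tfrac{D_2}{D_1}\right)\right].
```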
39 Special Case: Mass of P Near the Boundary. If the mass of P accumulates near the boundary between $C_1$ and $C_2$ then $Q_\infty(2)$ goes to 1/2. For K clusters, $Q_\infty(K)$ can decrease to 1/K if P concentrates at a point. For K = 2, if $D_1/D_2$ is large on $C_1$, i.e., $D_1/D_2 \approx 1$, we expect many points in $C_1$ to be close to the boundary between $C_1$ and $C_2$; similarly if $D_2/D_1$ is close to 1 on $C_2$. In these cases, $Q_\infty(2) \approx P(C_1)\, G_{\Lambda_1}(1) + P(C_2)\, G_{\Lambda_2}(1) = 1/2$, and one can prove $Q_\infty(2) \ge 1/2$. Since $1/2 \le Q_\infty(2) \le 1$, it seems reasonable to regard $Q_\infty(2)$ as indicating stability.
40 Special Case: Moving the Centroids. If P has 2 modes and they separate, and the regions between them have probability going to zero, then $Q_\infty(2)$ tends to increase. This generalizes to arbitrary K. If P has 2 modes and they get closer, then $Q_\infty(2)$ tends to decrease. This also generalizes to arbitrary K. In general, $1/K \le Q_\infty(K) \le 1$. Overall, $Q_\infty(K)$ seems to assess stability and therefore we think $Q_n(K)$ does too. Informal statement of a formal theorem: as P concentrates at its modes/centroids, $Q_\infty(K)$ is largest for the correct number K of clusters.
41 Subsampling. Take a subsample of the data, possibly perturb it with noise, then recluster and measure the difference between the original clustering and the clustering of the perturbed data. Take the average over several runs. Adjusted Rand and variation-of-information distances are the most popular.
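A hedged sketch of the subsampling idea, using K-means, an 80% subsample rate, and the adjusted Rand index as illustrative choices (the slides do not prescribe these):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def subsampling_stability(X, K, n_runs=20, frac=0.8, seed=0):
    """Average adjusted Rand agreement between the full-data clustering and subsample reclusterings."""
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=K, n_init=10).fit_predict(X[idx])
        scores.append(adjusted_rand_score(base[idx], sub))   # compare labels on the shared points
    return float(np.mean(scores))
```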
42 Silhouette Distance. Define the point-to-cluster distance for any fixed point and any fixed cluster as the average of the distances from that point to all the other points in the given cluster. The silhouette score for a point is then the difference between its point-to-cluster distances for the cluster it was assigned to and the next best cluster, scaled by the maximum of these two point-to-cluster distances. Thus each point is scored by its proximity to a boundary. The silhouette score for a clustering is the average of the silhouette scores over the points. K is chosen from the clustering with the smallest average silhouette distance.
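A short sketch using scikit-learn's silhouette implementation; note that `silhouette_score` uses the (b - a)/max(a, b) sign convention, so one maximizes it, which mirrors the "smallest average silhouette distance" phrasing above under the opposite sign.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k_silhouette(X, k_range=range(2, 11)):
    """Return the K whose K-means clustering has the best average silhouette."""
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)
```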
43 Gap Statistic. The gap statistic of Tibshirani et al. (2001) uses the difference in total cluster spread, defined in terms of the sum of all pairwise distances within a cluster, between the actual dataset and the clusterings of reference distributions with no true clusters. In Euclidean space with the squared-error distance, the spread is the total empirical variance of all the clusters. The reference distributions adjust for the dependence of the measure on K and guard against spurious clusters. Tibshirani et al. (2001) propose using the uniform distribution within the bounding box of the original dataset, possibly aligned with its principal components.
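A minimal sketch of the gap statistic with the simpler bounding-box reference (no principal-component alignment); the number of reference draws and the use of K-means inertia for the within-cluster spread are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def log_dispersion(X, K):
    """Log of the total within-cluster sum of squares for a K-means clustering."""
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, K, B=10, seed=0):
    """Gap(K) = mean reference log-dispersion minus observed log-dispersion."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [log_dispersion(rng.uniform(lo, hi, size=X.shape), K) for _ in range(B)]
    return float(np.mean(ref) - log_dispersion(X, K))
```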