A Bayesian Criterion for Clustering Stability


1 A Bayesian Criterion for Clustering Stability. B. Clarke, Dept. of Medicine, CCS, DEPH, University of Miami. Joint with H. Koepke, Dept. of Statistics, University of Washington. 26 June 2012, ISBA Kyoto.

2 Outline: 1. Assessing Stability.

3 Stability of Clustering, not Clustering per se. Imagine n points, say x_i = (x_{1,i}, ..., x_{D,i}) for i = 1, ..., n. Our goal is to put the points that belong together in the same set and to make sure points that don't belong together are in different sets. Such a set is called a cluster; a set of clusters is called a clustering (of the points). There are many ways to form clusters... Given a proposed clustering of almost any sort, we propose a technique for assessing whether it is reasonable, i.e., validation. There are also many ways to do this. Most are based on some notion of clustering stability... and so is ours.

4 Key Definitions. Suppose we have Ĉ = {Ĉ_1, ..., Ĉ_K}, where the Ĉ_k's are disjoint and ∪_k Ĉ_k = {x_1, ..., x_n}. We evaluate the stability of a fixed cluster Ĉ_k that has x_i as a member using sets of the form

Ŝ_{ik} = {(λ_1, ..., λ_K) : λ_k d(x_i, μ̂_k) ≤ min_{l ≠ k} λ_l d(x_i, μ̂_l)},

where the λ_k's are parameters, d is a distance, and μ̂_k is the centroid of Ĉ_k. The larger Ŝ_{ik} is, the more stable Ĉ is, so choose a prior distribution F and calculate F(Ŝ_{ik}).

5 The Main Stability Assessment. Let I be an indicator function... Then

[φ_{ik}]_{i=1,...,n; k=1,...,K} = ∫ I_{Ŝ_{ik}}(λ^K) dF(λ^K).

This is the averaged assignment matrix. If F puts all its mass on λ^K = (1, ..., 1), then φ_{ik} = 0 or 1 depending on whether x_i ∈ Ĉ_k. More generally, F spreads the membership of x_i across the K clusters. The pointwise stability of x_i is

PW(x_i) = PW_i = φ_{i,h_i},

where h_i = k such that x_i ∈ Ĉ_k.
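
As a concrete illustration, here is a minimal Monte Carlo sketch of the averaged assignment matrix. The helper names (estimate_phi, sample_prior) and the use of Euclidean distance are assumptions for the example, not the authors' implementation; the integral over F is replaced by prior draws.

```python
# Minimal Monte Carlo sketch of the averaged assignment matrix [phi_ik].
import numpy as np

def estimate_phi(X, centroids, sample_prior, n_draws=2000, rng=None):
    """phi[i, k] estimates F(S_ik): the prior probability that
    lambda_k * d(x_i, mu_k) <= lambda_l * d(x_i, mu_l) for all l != k."""
    rng = np.random.default_rng(rng)
    # Euclidean point-to-centroid distances, shape (n, K)
    D = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    n, K = D.shape
    phi = np.zeros((n, K))
    for _ in range(n_draws):
        lam = sample_prior(K, rng)            # one draw of (lambda_1, ..., lambda_K)
        winners = (lam[None, :] * D).argmin(axis=1)   # cluster each point would join
        phi[np.arange(n), winners] += 1.0
    return phi / n_draws                      # each row sums to 1
```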

6 Behavior of PW Stability. Suppose we fix the centroids μ̂_k of the Ĉ_k's and use PW for a general point x. Thus

PW(x) = F({λ^K : λ_k d(x, μ̂_k) ≤ λ_l d(x, μ̂_l) for all l ≠ k}),

where k is the cluster x is assigned to. Then we can visualize (in 2 dimensions) the effect of F. Take K = 5 and F IID shifted exponential, i.e., f(λ | β) = β e^{−β(λ−1)} 1_{λ ≥ 1}, with β = .1, .75, 2, 10. As the hyperparameter β increases, the regions of stability grow because it becomes harder and harder to regard a point as belonging to a cluster it wasn't assigned to.
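
A sketch of PW(x) for a general point under the IID shifted-exponential prior; the function names are hypothetical and the prior integral is again approximated by Monte Carlo.

```python
# Sketch: pointwise stability PW(x) under the IID shifted-exponential prior.
import numpy as np

def sample_shifted_exponential(K, beta, rng):
    # density f(lambda | beta) = beta * exp(-beta * (lambda - 1)) for lambda >= 1
    return 1.0 + rng.exponential(scale=1.0 / beta, size=K)

def pointwise_stability(x, centroids, beta, n_draws=5000, rng=None):
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(centroids - x, axis=1)   # d(x, mu_k) for each k
    k_hat = d.argmin()                          # cluster x is assigned to
    hits = 0
    for _ in range(n_draws):
        lam = sample_shifted_exponential(len(d), beta, rng)
        if (lam * d).argmin() == k_hat:         # is lambda_k d(x, mu_k) still smallest?
            hits += 1
    return hits / n_draws                       # estimate of PW(x)
```

Evaluating this function over a grid of x values reproduces the kind of stability maps shown on the next slides.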

7 Figure: pointwise stability for β = .1.

8 Figure: pointwise stability for β = .75.

9 Figure: pointwise stability for β = 2.

10 Figure: pointwise stability for β = 10.

11 Figure: pointwise stability with data and adaptive β; the clusters are labeled A–E.

12 A Visualization not Limited to 2 Dimensions. Recall i indexes data points and k indexes clusters. Sort the elements φ_{ik} by cluster; then, for i ∈ Ĉ_{k_0}, put the φ_{i,k_0}'s in decreasing order. The result is a heatmap...
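
A sketch of this ordering, assuming the φ matrix from the earlier sketch; the function name is illustrative.

```python
# Sketch: arrange the rows of phi into the heatmap order described above.
import numpy as np

def stability_heatmap_matrix(phi, labels):
    """phi: (n, K) averaged assignment matrix; labels[i] = cluster x_i was assigned to.
    Rows are grouped by assigned cluster and, within each block, sorted so the most
    stable points (largest phi_{i,k0}) come first."""
    K = phi.shape[1]
    ordered_rows = []
    for k0 in range(K):
        block = np.where(labels == k0)[0]
        block = block[np.argsort(-phi[block, k0])]   # decreasing phi_{i, k0}
        ordered_rows.append(block)
    order = np.concatenate(ordered_rows)
    return phi[order], order   # display with, e.g., matplotlib's imshow
```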

13 Heatmap for the data (clusters A–E). Dark lines at the bottom of the diagonal blocks show instability.

14 But Is This Bayes? No: the prior F is not multiplying a likelihood. There is no posterior. There is no credible set, point estimator, hypothesis test... Yes: the prior F represents a distinct source of information, and we are making inferences from a combination of the data and the prior. Editorializing: I think this is extended Bayes in the sense that we are combining the data with pre-experimental information. In this case, the prior represents our pre-experimental views about how much perturbation is reasonable.

15 And what does this have to do with prediction? This is intended for prediction with classification data: a sanity check that the classes are reasonable. An example below gives good reason for such skepticism. So, you want to find the classes from the clustering and then use the clusters to make predictions. Natural predictor: assign x_new to class k* where k* = arg min_u d(x_new, μ̂_u). At this point we have only developed the stability method... we have not tested how well it predicts. We have only (so far) shown that clustering stability may give more reasonable classes than the classes in the data.
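
The natural predictor is simple enough to state as a short sketch (Euclidean distance assumed for illustration).

```python
# Sketch of the natural nearest-centroid predictor: assign x_new to the
# cluster whose centroid minimizes d(x_new, mu_u).
import numpy as np

def predict(x_new, centroids):
    return int(np.linalg.norm(centroids - x_new, axis=1).argmin())
```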

16 Let's look at the φ_{ik}'s in more detail. We actually want the average pointwise stability for Ĉ:

APW = (1/n) Σ_{i=1}^{n} φ_{i,h_i}.

To understand the APW, imagine there is a true clustering C = (C_1, ..., C_{K_T}) with centroids μ_k and

S_{ik} = {(λ_1, ..., λ_K) : λ_k d(x_i, μ_k) ≤ min_{l ≠ k} λ_l d(x_i, μ_l)}.
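
Given the φ matrix and the assignments h_i, the APW is a one-liner; this sketch reuses the notation of the earlier examples.

```python
# Sketch: average pointwise stability APW = (1/n) * sum_i phi_{i, h_i}.
import numpy as np

def average_pointwise_stability(phi, labels):
    n = phi.shape[0]
    return float(phi[np.arange(n), labels].mean())
```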

17 Relation Between Empirical and Population Quantities. Let's also assume "convexity":

x ∈ Ĉ_k ⟺ d(x, μ̂_k) ≤ d(x, μ̂_l) for all l ≠ k, and x ∈ C_k ⟺ d(x, μ_k) ≤ d(x, μ_l) for all l ≠ k.

This lets us go back and forth between treating Ĉ as a partition of the data and as a partition of the sample space. Suppose the empirical clusters Ĉ_k go to the population clusters C_k as n → ∞. As long as the μ̂_k's converge to their corresponding μ_k's, we get convergence of the clusters too. This is basically a law of large numbers.

18 Empirical and Population Criteria. We approximate the APW by

Q_n(K) = Σ_{j=1}^{K} (1/n) Σ_{i=1}^{n} I{x_i ∈ Ĉ_{Kj}} ∫ I_{Ŝ_{ij}}(λ^K) dF(λ^K).

We approximate the limit of the APW by

Q(K) = Σ_{j=1}^{K} E[ I{X_1 ∈ C_{Kj}} ∫ I_{S_{1j}}(λ^K) dF(λ^K) ].

Theorem: Q_n(K) → Q(K).

19 Choosing K. Regard Q_n(K) as a data-dependent objective function for choosing K. Let [K_1, K_2] be the (compact) set of integers strictly between K_1 − 1 and K_2 + 1. Write

K̂ = arg max_{K ∈ [K_1, K_2]} Q_n(K).   (1)

By Theorem 1, for any bounded interval [K_1, K_2] we have that Q_n(K) → Q(K) uniformly in probability on [K_1, K_2]. Let

K_opt = arg max_{K ∈ [K_1, K_2]} Q(K).
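
A hedged sketch of (1): it reuses the helpers sketched earlier (estimate_phi, average_pointwise_stability, sample_shifted_exponential), approximates Q_n(K) by the APW of each fit, and takes the argmax; scikit-learn's KMeans stands in for whichever base clusterer is used.

```python
# Sketch: choose K_hat = argmax_K Q_n(K), with Q_n(K) approximated by the APW.
import numpy as np
from sklearn.cluster import KMeans

def choose_K(X, K1, K2, beta=2.0, rng=0):
    prior = lambda K, r: sample_shifted_exponential(K, beta, r)
    scores = {}
    for K in range(K1, K2 + 1):
        km = KMeans(n_clusters=K, n_init=10, random_state=rng).fit(X)
        phi = estimate_phi(X, km.cluster_centers_, prior, rng=rng)
        scores[K] = average_pointwise_stability(phi, km.labels_)
    K_hat = max(scores, key=scores.get)
    return K_hat, scores
```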

20 Consistency Theorem. Theorem: Suppose that K_opt is the unique maximum of Q(K) over K and that K_opt ∈ [K_1, K_2]. Then

arg max_{K ∈ [K_1, K_2]} Q_n(K) → arg max_{K ∈ [K_1, K_2]} Q(K),   (2)

i.e., K̂ → K_opt in probability as n → ∞. Proof: Convergence follows from a simple modification of Theorem 2.1 in Newey and McFadden (1994).

21 Three Steps to Make this Work. Step 1: Get the input. Assume that for each K we have an optimal clustering, e.g., from K-means. Step 2: Rescale. For each K, generate B bootstrap samples and compute the values S_{K,b} = log(APW / APW_b). Using this bootstrap distribution for S_K, choose K̂ to be the smallest plausible maximum of (1/B) Σ_b S_{K,b}. This is like using a permutation distribution to get at relative stability.
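
One plausible reading of Step 2, written as a sketch; the exact resampling scheme (here, reusing the fitted centroids on each resample) is an assumption, not the authors' code, and it relies on the helpers sketched earlier.

```python
# Sketch of the rescaling step: compare the APW of the fitted clustering with
# APWs computed on bootstrap resamples, S_{K,b} = log(APW / APW_b).
import numpy as np

def relative_stability(X, centroids, prior, B=100, rng=None):
    rng = np.random.default_rng(rng)
    phi = estimate_phi(X, centroids, prior, rng=rng)
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    apw = average_pointwise_stability(phi, labels)
    S = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
        phi_b = estimate_phi(X[idx], centroids, prior, rng=rng)
        apw_b = average_pointwise_stability(phi_b, labels[idx])
        S.append(np.log(apw / apw_b))
    return float(np.mean(S)), np.array(S)
```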

22 Prior Selection. Step 3: After some testing we found that the shifted exponential prior with hyperparameter β worked well. So... we estimated the hyperparameter essentially by maximizing the empirical form of

(1/(K_2 − K_1)) Σ_{K=K_1}^{K_2} E(S_K).

For one-dimensional hyperparameters this can be done by a simple bisection algorithm.
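
A sketch of the one-dimensional search: a golden-section search stands in for the bisection mentioned on the slide, and objective(β) is assumed to be the empirical average above (evaluated, e.g., via the relative_stability sketch over K = K_1, ..., K_2). The objective is assumed unimodal in β.

```python
# Sketch: maximize objective(beta) over a bracket by golden-section search.
import math

def maximize_beta(objective, lo=0.1, hi=10.0, tol=0.05):
    gr = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - gr * (b - a), a + gr * (b - a)
        if objective(c) >= objective(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
    return (a + b) / 2
```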

23 Synthetic Data and Five Stability Methods. We used three classes of normal mixtures in 2, 5, 10, and 20 dimensions, with 100, 150, 200, and 300 data points each, respectively. These classes were: T1: four mixture components, with strong non-linear transformations to give the components more difficult shapes. T2: five mixture components, but easier non-linear operations, so the clusters are better modeled by centroid-based methods. T3: samples from a single unimodal normal density, to test the ability of the methods to detect when there is no actual clustering, i.e., K = 1.

24 Five Stability Methods for Choosing K. We used two versions of our stability measure: one with d set to squared error, the other with d based on average linkage (take the average of the point-to-centroid distances). We also used three conceptually different assessments of stability, more or less in their standard form: subsampling, silhouette distance, and the gap statistic.

25 T1, K_T = 4: 10d, n = 200 and 20d, n = 400. Tables: distribution of K̂ under the gap statistic, subsampling, silhouette, perturbation, and average-linkage perturbation methods (table values omitted).

26 T2, K_T = 5: 10d, n = 200 and 20d, n = 400. Tables: distribution of K̂ under the gap statistic, subsampling, silhouette, perturbation, and average-linkage perturbation methods (table values omitted).

27 T3 Data. In this case the correct number of clusters was 1. In all cases we examined, both forms of the perturbation statistic were best... except for one. In 2 dimensions, the gap statistic chose K = 1 correctly only about 14% of the time, and all the other methods did worse; in that one setting no method did well. Otherwise our perturbation method was best, and the average-linkage form did better than the Euclidean one.

28 General Comparisons. The gap statistic performs poorly: it captures the first statistically significant clustering, not the most significant clustering. It can't tell two separated groups of clusters from 2 clusters, and it tends to break long thin clusters into several clusters. Silhouette: our method outperforms it in every case we looked at; its intuition is similar to ours, but ours is more thorough. Subsampling methods are the only real competitor... but they can't detect the K_T = 1 case and they share the gap statistic's problems. On the other hand, in high dimensions they may be competitive.

29 Yeast Data. This is classification data with 10 classes; we threw out the class indicators and clustered on the 8 explanatory variables. First written up in Nakai et al. (1991), Osaka; available from the UC Irvine repository. Clustering this dataset is quite difficult, as the classes do not separate easily. We sphered the data and then found the K-means clusterings for a range of K. The stability sequence shows K = 8 is most stable, but values five through nine were not bad.

30 Stability Plot. Figure: the stability sequence for the K-means clusterings, K = 2, ..., 12, for the yeast data set. The most stable choice is K = 8, not the K = 10 that is physically correct.

31 What We Did... We sphered the data, rather than studentizing it or using it as is, because sphering gave the most reasonable stability sequence with K-means clustering. Also, the heatmap of the most stable clustering showed clear distinctions between the classes. The stability sequences for the raw or studentized data had a maximum at two and then declined, and with 2 clusters the heatmaps did not show distinct components. It remains unclear when to sphere, studentize, or use the data as is; here we show the most stable version. We plotted stability heatmaps for K = 5, ..., 10.

32 Figure: stability heatmaps for K = 6 and K = 7.

33 Figure: stability heatmaps for K = 8 and K = 9.

34 What the Heatmaps Show. When K = 6, the blocks on the main diagonal are not very light, and on each row there is not much difference between the main-diagonal cluster and its competing clusters. For K = 7, the blocks off the main diagonal are a little darker than those on the main diagonal. The contrast is stronger for K = 8 and K = 9. However, there is little (if any) improvement in the contrast between the on- and off-diagonal blocks in moving from K = 8 to K = 9. The heatmaps for K = 5 and K = 10 show worse stability.

35 The Class Label Bar on the Right. The class label bars indicate how well the clustering reproduces the class labels. These classes do not separate well, but there are distinct groups forming each cluster, particularly at K = 8. The bars show that the most stable blocks tend to come from the same class, though the tendency is not strong. A practitioner can note which clusters are well separated and which are separated so little that merging them should be considered. (The reverse happens too!) Our analysis finds that 7 or 8 classes are more stable than the ten apparent classes, suggesting that some of the classes overlap significantly.

36 Summary I. We have formulated a stability criterion based on how easily inequalities between point-to-cluster-centroid distances are preserved. Its absolute form approximates the average pointwise stability and gives consistent selection of K. Large values of our stability criterion indicate stability. It accurately encapsulates our usual notion of stability in several key cases; in particular, it responds to how heavily populated the boundary regions are.

37 Summary II. For usage we need a sequence of clusterings, one optimal for each K. We convert our criterion to a relative form. We estimate the hyperparameter using the relative form. We transform the data (studentize, sphere, leave as is, etc.) to find the K̂ of the most stable clustering. We analyze the most stable clusterings by heatmaps and class label bars. Sometimes stable clusterings suggest more or fewer classes than the classification data contain; this reveals structure that might otherwise be overlooked.

38 Special Case: Concentration at the μ_k's. If the distribution of X concentrates at μ_1 and μ_2, then Q(2) goes to 1; this generalizes to K clusters. For K = 2, let μ_j = E(X | C_j) and D_j = d(X, μ_j). Let Λ_1 = λ_2/λ_1, Λ_2 = λ_1/λ_2, and let G_{Λ_u} be the survival function of Λ_u. One can show

Q(2) = E[ I{D_1/D_2 ≤ 1} G_{Λ_1}(D_1/D_2) ] + E[ I{D_2/D_1 ≤ 1} G_{Λ_2}(D_2/D_1) ].

So, if D_1/D_2 is small on C_1, the first term is near P(C_1), so C_1 is stable. If D_2/D_1 is small on C_2, the second term is near P(C_2). This means Q(2) is near its maximum P(C_1) + P(C_2) = 1, so Q̂_n(2) should be close to its maximum 1, too.

39 Special Case: Mass of P Near the Boundary. If the mass of P accumulates near the boundary between C_1 and C_2, then Q(2) goes to 1/2. For K clusters, Q(K) can decrease to 1/K if P concentrates at a point. For K = 2, if D_1/D_2 is large on C_1, i.e., D_1/D_2 ≈ 1, we expect many points in C_1 to be close to the boundary between C_1 and C_2; similarly if D_2/D_1 is close to 1 on C_2. In these cases,

Q(2) ≈ P(C_1) G_{Λ_1}(1) + P(C_2) G_{Λ_2}(1) = 1/2,

and one can prove Q(2) → 1/2. Since 1/2 ≤ Q(2) ≤ 1, it seems reasonable to regard Q(2) as indicating stability.
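
A numerical sanity check of these two limiting cases (a sketch, not from the paper): estimate Q(2) = E[PW(X)] by Monte Carlo, once for mass concentrated at the centroids and once for mass hugging the boundary, reusing the pointwise_stability and sample_shifted_exponential sketches above. The centroids, β, and samplers here are illustrative choices.

```python
# Sketch: Q(2) should be near 1 for tight clusters at the centroids and near
# 1/2 for mass piled up along the boundary between them.
import numpy as np

rng = np.random.default_rng(0)
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])
beta = 2.0

def estimate_Q2(sampler, n_points=300):
    X = sampler(n_points)
    return float(np.mean([pointwise_stability(x, centroids, beta, n_draws=500, rng=rng)
                          for x in X]))

# (a) tight mass at the centroids: Q(2) close to 1
tight = lambda n: centroids[rng.integers(0, 2, n)] + 0.01 * rng.standard_normal((n, 2))
# (b) points hugging the boundary x1 = 0: Q(2) close to 1/2
boundary = lambda n: np.column_stack([0.01 * rng.standard_normal(n), rng.uniform(-1, 1, n)])

print(estimate_Q2(tight), estimate_Q2(boundary))
```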

40 Special Case: Moving the Centroids. If P has 2 modes, and they separate while the region between them has probability going to zero, then Q(2) tends to increase; this generalizes to arbitrary K. If P has 2 modes and they get closer, then Q(2) tends to decrease; this also generalizes to arbitrary K. In general, 1/K ≤ φ(K) ≤ 1. Overall, φ(K) seems to assess stability, and therefore we think Q_n(K) does too. Informal statement of a formal theorem: as P concentrates at its modes/centroids, Q(K) is largest at the correct number K of clusters.

41 Subsampling. Take a subsample of the data, possibly perturb it with noise, then recluster and measure the difference between the original clustering and the clustering of the perturbed data; take the average over several runs. The adjusted Rand and variation-of-information distances are the most popular.
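
For concreteness, a generic sketch of a subsampling assessment using the adjusted Rand index; this is one common variant, not necessarily the exact one used in the comparison.

```python
# Sketch: subsampling stability for a given K, scored by adjusted Rand
# agreement between the full-data clustering and subsample clusterings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def subsampling_stability(X, K, n_runs=20, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=K, n_init=10).fit(X[idx])
        # compare labels on the shared points; ARI is permutation-invariant
        scores.append(adjusted_rand_score(base.labels_[idx], sub.labels_))
    return float(np.mean(scores))
```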

42 Silhouette Distance. Define the point-to-cluster distance for any fixed point and any fixed cluster as the average of the distances from that point to all the other points in that cluster. The silhouette score for a point is then the difference between its point-to-cluster distances for the cluster it was assigned to and the next best cluster, scaled by the maximum of these two point-to-cluster distances. Thus each point is scored by its proximity to a boundary. The silhouette score for a clustering is the average of the silhouette scores of its points. K is chosen from the clustering with the smallest average silhouette distance.

43 Gap Statistic. The gap statistic of Tibshirani et al. (2001) uses the difference in total cluster spread, defined in terms of the sum of all pairwise distances within a cluster, between the actual dataset and the clusterings of reference distributions with no true clusters. In Euclidean space with the squared-error distance, the spread is the total empirical variance of all the clusters. The reference distributions adjust for the dependence of the measure on K and guard against spurious clusters. Tibshirani et al. (2001) propose using the uniform distribution within the bounding box of the original dataset, in principal-component coordinates.
