Multivariate Analysis, Clustering, and Classification

1 Multivariate Analysis, Clustering, and Classification Jessi Cisewski Yale University Astrostatistics Summer School

2 Multivariate Analysis Statistical analysis of data containing observations each with > 1 variable measured. Examples: 1 Measurements on a star: luminosity, color, environment, metallicity, number of exoplanets 2 Functions such as light curves and spectra 3 Images 2

3 Common goals 1 Describe the p-dimensional distribution Multivariate means, variances, and covariances Multivariate probability distributions 2 Reduce the number of variables without losing significant information Linear functions of variables (principal components) 3 Investigate dependence between variables 4 Statistical inference Confidence regions, multivariate regression, hypothesis testing 5 Clustering and Classification 3

4 Organizing the data
p = number of variables, n = number of observations, $x_{ij}$ = $i$th observation of the $j$th variable.
The data are arranged with the n observations as rows and the p variables as columns:

            Variables
Obs.      1        2      ...    p
1       x_11     x_12     ...   x_1p
2       x_21     x_22     ...   x_2p
...
n       x_n1     x_n2     ...   x_np

5 Data matrix:
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
This can also be written as n rows or as p columns:
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} = \left[\, x_1,\ x_2,\ \ldots,\ x_p \,\right]$$

6 Descriptive statistics
Sample mean of the $j$th variable: $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$
Sample mean vector: $\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)^T$

7 Sample covariance of variables $i$ and $j$:
$$s_{ij} = s_{ji} = \frac{1}{n-1}\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$$
The sample variance of the $j$th variable is $s_{jj}$.
Sample covariance matrix:
$$S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}$$

8 Sample correlation coefficient of variables $i$ and $j$:
$$r_{ij} = \frac{s_{ij}}{\sqrt{s_{ii}\, s_{jj}}}$$
with $r_{ii} = 1$ and $r_{ij} = r_{ji}$.
Sample correlation matrix:
$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}$$
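These summaries are one-liners in R. A minimal sketch, not from the slides; the matrix X below is a made-up placeholder for any n x p data matrix:
X <- matrix(rnorm(100 * 3), nrow = 100, ncol = 3)   # placeholder data: n = 100, p = 3
xbar <- colMeans(X)    # sample mean vector
S <- cov(X)            # sample covariance matrix (divides by n - 1)
R <- cor(X)            # sample correlation matrix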

9 Caution Correlations measure the strength of the linear relationships between variables if such relationships are valid. 9

10 Multivariate Probability Distributions
Goal: make probability statements about random vectors.
A p-dimensional random vector is $X = (X_1, X_2, \ldots, X_p)^T$ where $X_1, \ldots, X_p$ are random variables. $X$ is a continuous random vector if $X_1, \ldots, X_p$ are all continuous random variables. We'll focus on continuous random vectors.

11 Three important properties of X's probability density function, f:
1 $f(x) \ge 0$ for all $x \in \mathbb{R}^p$ (or wherever the x's take values)
2 The total area below the function f is 1: $\int_{\mathbb{R}^p} f(x)\,dx = 1$
3 For all $t_1, t_2, \ldots, t_p$, $P(X_1 \le t_1, X_2 \le t_2, \ldots, X_p \le t_p) = \int_{-\infty}^{t_1}\int_{-\infty}^{t_2}\cdots\int_{-\infty}^{t_p} f(x)\,dx$

12 Recall: an expected value is an average over the entire population.
Mean vector: $\mu = (\mu_1, \mu_2, \ldots, \mu_p)^T$ where
$$\mu_i = E(X_i) = \int_{\mathbb{R}^p} x_i\, f(x)\,dx$$
is the mean of the $i$th component of $X$.

13 Covariance between $X_i$ and $X_j$:
$$\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)] = E(X_i X_j) - \mu_i \mu_j$$
And the variance of each $X_i$ is:
$$\sigma_{ii} = E[(X_i - \mu_i)^2] = E(X_i^2) - \mu_i^2$$
Covariance matrix of X:
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} = E[(X - \mu)(X - \mu)^T] = E(XX^T) - \mu\mu^T$$
Going forward, let's assume $\Sigma$ is nonsingular.

14 Multivariate Normal Distribution
Consider the following random vector whose possible values range over all of $\mathbb{R}^p$: $X = (X_1, X_2, \ldots, X_p)^T$.
X has a multivariate normal distribution if it has a pdf of the form
$$f(X) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right]$$
Notation: $X \sim N_p(\mu, \Sigma)$

16 Special case: diagonal $\Sigma$
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_p^2 \end{pmatrix}$$
In this case, $X_1, \ldots, X_p$ are mutually independent and normally distributed with density
$$f(x) = \prod_{j=1}^{p} \frac{1}{(2\pi\sigma_j^2)^{1/2}} \, e^{-\frac{1}{2\sigma_j^2}(x_j - \mu_j)^2}$$

17 $\chi^2$ Distribution
If $X \sim N_p(\mu, \Sigma)$ with pdf
$$f(X) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \right]$$
then
$$(X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_p$$
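This fact can be checked by simulation. A minimal sketch, not from the slides, assuming the mvtnorm package is available for drawing multivariate normal samples:
library(mvtnorm)                     # rmvnorm() for multivariate normal draws
set.seed(1)
p <- 3
mu <- rep(0, p)
Sigma <- diag(p)                     # identity covariance for illustration
x <- rmvnorm(10000, mean = mu, sigma = Sigma)
d2 <- mahalanobis(x, center = mu, cov = Sigma)   # (x - mu)^T Sigma^{-1} (x - mu)
qqplot(qchisq(ppoints(10000), df = p), d2,
       xlab = "Chi-squared quantiles", ylab = "Mahalanobis distances")
abline(0, 1)                         # points should fall near this line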

18 Tests for multivariate normality If the data contain a substantial number of outliers then it goes against the hypothesis of multivariate normality If one variable is not normally distributed, then the full set of variables does not have a multivariate normal distribution A possible resolution is to transform the original variables to produce new variables which are normally distributed Example: Box-Cox transformations When datasets arise from a multivariate normal distribution, we can perform accurate inference on its mean vector and covariance matrix 18

19 Variables (random vector): $X \sim N_p(\mu, \Sigma)$; the parameters $\mu$ and $\Sigma$ are unknown.
Data (measurements): $x_1, x_2, \ldots, x_n$. Goal: estimate $\mu$ and $\Sigma$.
$\bar{x}$ is an unbiased estimator of $\mu$ (and the MLE of $\mu$).
The MLE of $\Sigma$ is $\frac{n-1}{n} S$; this is not unbiased (i.e. it is biased).
The sample covariance matrix, $S$, is an unbiased estimator of $\Sigma$.

20 Principal Components Analysis (PCA)
Data: X = p-dimensional random vector with covariance matrix $\Sigma$.
PCA is an unsupervised approach to learning about X: principal components find directions of variability in X. Can be used for visualization, dimension reduction, regression, etc.
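In R, principal components can be computed with prcomp(). A minimal sketch, not from the slides; X below is a made-up placeholder for any numeric data matrix:
X <- matrix(rnorm(100 * 4), nrow = 100)   # placeholder data matrix
pc <- prcomp(X, scale. = TRUE)            # PCA on standardized variables
summary(pc)                               # proportion of variance per component
pc$rotation                               # loadings (directions of variability)
head(pc$x)                                # scores: data projected onto the PCs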

21 Statistical Learning Learning from data 1 Unsupervised learning 2 Supervised learning

22 One schematic for addressing problems in machine learning... Image: Andy's Computer Vision and Machine Learning Blog 22

23 Clustering: find subtypes or groups that are not defined a priori, based on measurements. Unsupervised learning, or "learning without labels."
Classification: use a priori group labels in the analysis to assign new observations to a particular group or class. Supervised learning, or "learning with labels."
Some content and notation used throughout is derived from notes by Rebecca Nugent (CMU), Ryan Tibshirani (CMU), and the textbooks Hastie et al. (2009) and James et al. (2013).

24 Clustering and Classification 24

25 Sample Data: data1
ng <- 50
set.seed(321)  # ensures same dataset
g1 <- cbind(rnorm(ng, -3, .1), rnorm(ng, 1, 1))
g2 <- cbind(rnorm(ng, 3, .1), rnorm(ng, 1, 1))
gtemp <- cbind(rnorm(ng*2, 0, .75), rnorm(ng*2, 0, .75))
rad1 <- sqrt(gtemp[,1]^2 + gtemp[,2]^2)
g3 <- gtemp[order(rad1)[1:ng], ]
g4 <- gtemp[order(rad1)[(ng+1):(2*ng)], ]
g5.1 <- seq(-2.75, 2.75, length.out = ng)
g5.2 <- (g5.1/2)^2 - 4
g5 <- cbind(g5.1, g5.2 + rnorm(ng, 0, .5))
data1 <- rbind(g1, g2, g3, g4, g5)
labels1 <- c(rep(1,ng), rep(2,ng), rep(3,ng), rep(4,ng), rep(5,ng))
This dataset will be used to illustrate clustering and classification methodologies throughout the lecture.

26 Good references An Introduction to Statistical Learning (James et al. 2013) good introduction to get started using methods/not technical The Elements of Statistical Learning (Hastie et al. 2009) very thorough and technical coverage of statistical learning All of Statistics (Wasserman 2004) great overview of statistics 26

27 CLUSTERING Grouping of similar objects (unsupervised learning) members of the same cluster are close in some sense Pattern recognition Data segmentation 27

28 Clustering in Astronomy
Use properties of GRBs (e.g. location in the sky, arrival time, duration, fluence, spectral hardness) to find subtypes/classes of events. Image: Mukherjee et al. (1998)
Regions with an excess of gamma rays that correspond to locations of dwarf galaxies could be evidence of particle dark matter.

29 Clustering set-up
Notation: given vectors $X = \{X_1, X_2, \ldots, X_n\} \subset \mathbb{R}^p$: n observations in p-dimensional space.
Variables/features/attributes are indexed by $j = 1, \ldots, p$: the $j$th variable is $X_j$. Observations are indexed by $i = 1, \ldots, n$: the $i$th observation is $X_i$.
Want to learn properties of the joint distribution P(X) of these vectors: organize, summarize, categorize, explain.
No direct measure of success (e.g. no notion of a misclassification rate); successful if the true structure is captured...

30 General goals of clustering Partition observations such that 1 Observations within a cluster are similar (Compactness Property) 2 Observations in different clusters are dissimilar (Closeness Property) Typically want compact clusters that are well-separated 30

31 Dissimilarity Measure
Characterizes degree of closeness.
Dissimilarity matrix $D = \{D_{ii'}\}$ such that $D_{ii} = 0$ and $d^j_{ii'} = d(x_{ij}, x_{i'j})$.
Some examples of $d^j_{ii'}$ are $(x_{ij} - x_{i'j})^2$ or $|x_{ij} - x_{i'j}|$.
$$D_{ii'} = D(X_i, X_{i'}) = \sum_{j=1}^{p} w_j \, d^j_{ii'} \quad \text{where } \sum_{j=1}^{p} w_j = 1$$

32 Dissimilarity Measure: within-cluster variation
$$\text{Total cluster variability} = \frac{1}{2} \sum_{i=1}^{n} \sum_{i'=1}^{n} D_{ii'} = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \left( \sum_{C(i')=k} D_{ii'} + \sum_{C(i') \neq k} D_{ii'} \right)$$
where $C(i) = k$ is the assignment of observation $i$ to cluster $k$.
Total within-cluster variability: $\frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} D_{ii'}$
Total between-cluster variability: $\frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i') \neq k} D_{ii'}$

33 Clustering methods 1 Combinatorial algorithms 1 K - means clustering 2 Hierarchical clustering 2 Mixture modeling/statistical clustering (parametric) 3 Mode seeking/bump Hunting/Statistical clustering (nonparametric) 33

34 K-means Clustering

35 K-means clustering Main idea: partition the observations into K separate clusters that do not overlap 35

36 K-means clustering - procedure
Goal: minimize the total within-cluster scatter using $D_{ii'} = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|X_i - X_{i'}\|^2$. Then the within-cluster scatter is written as
$$\frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \|X_i - X_{i'}\|^2 = \sum_{k=1}^{K} |C_k| \sum_{C(i)=k} \|X_i - \bar{X}_k\|^2$$
$|C_k|$ = number of observations in cluster $C_k$, and $\bar{X}_k = (\bar{X}_{k1}, \ldots, \bar{X}_{kp})$ is the mean of the observations in cluster $k$.

37 K-means clustering - recipe Pick K (number of clusters) Select K centers Alternate between the following: 1 Assign each observation to closest center 2 Recalculate centers R: kmeans(data, K, nstart = 20) 37

38 Simple code: K-means
set.seed(321)
g1 <- cbind(rnorm(50, 0, 1), rnorm(50, -1, .1))
g2 <- cbind(rnorm(50, 0, 1), rnorm(50, 1, .1))
g3 <- cbind(rnorm(50, 0, 1), rnorm(50, 3, .1))
data.points <- rbind(g1, g2, g3)
km.out <- kmeans(data.points, 3, nstart = 20)
plot(data.points, pch = 16, xlab = "X1", ylab = "X2", col = km.out$cluster)
points(km.out$centers[,1], km.out$centers[,2], pch = 4, lwd = 4, col = "blue", cex = 2)
Try moving the clusters closer together (in terms of X2) and see what happens with the K-means clusters.

39 K-means clustering - determining K
Choose the K at which the last significant reduction in the within-groups sum of squares occurs (i.e. find the "elbow"):
set.seed(123)
wss <- c()
number.k <- c(1:15)
for(ii in number.k){
  km0 <- kmeans(data1, ii, nstart = 20)
  wss[ii] <- km0$tot.withinss
}
plot(number.k, wss, xlab = "k", ylab = "wss", type = "b", pch = 19)

40 K-means clustering - tips
Can be unstable; the solution depends on the starting set of centers. It finds local optima, but we want global optima, so start with different centers: kmeans(x, K, nstart = 20).
Cluster assignments are strict: no notion of degree or strength of cluster membership.
Possible lack of interpretability of centers, since centers are averages:
- OK for clustering things like apartment prices based on price, square footage, distance to nearest grocery store
- But what if observations are images of faces?

41 K-means clustering - tips, continued.
Influenced by outliers: use medoids instead. A medoid is the observation in the data set whose average distance to all other observations is minimal. Medoids are more computationally intensive, but the centers are actual observations, leading to better interpretability.
R: pam(data, K) ("partitioning around medoids") in the cluster package
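A minimal sketch of partitioning around medoids on the sample dataset, not from the slides; it assumes the cluster package is installed and data1/labels1 have been built as above:
library(cluster)
pam.out <- pam(data1, 5)                 # K = 5 medoid-based clusters
pam.out$medoids                          # medoids are actual observations from data1
table(pam.out$clustering, labels1)       # compare clusters to the true group labels
plot(data1, col = pam.out$clustering, pch = 16, xlab = "X1", ylab = "X2")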

42 Hierarchical Clustering

43 Hierarchical vs. Flat Partitions
1 Flat partitioning (e.g. K-means clustering): partitions the data into K clusters; K is determined by the user; no sense of the relationships among the clusters.
2 Hierarchical partitioning: generates a hierarchy of partitions $P_1$ = 1 cluster, ..., $P_n$ = n clusters (agglomerative clustering), and the user selects a partition. Partition $P_i$ is the union of one or more clusters from partition $P_{i+1}$.

44 Hierarchical clustering - recipe
Define a dissimilarity $d_{kk'} = d(C_k, C_{k'})$ between clusters $C_k$ and $C_{k'}$ as a function of a distance between points in the clusters.
1 Start with every observation in its own cluster
2 Find $\min d(C_k, C_{k'})$ and merge $C_k$ and $C_{k'}$ (the minimum is across all pairs of clusters)
3 Repeat until only one cluster remains
R: hclust(distances)

45 Hierarchical clustering - distances
1 Friends-of-friends = Single-linkage clustering: intergroup distance is the smallest possible distance
$$d(C_k, C_{k'}) = \min_{x \in C_k,\, y \in C_{k'}} d(x, y)$$
2 Complete-linkage clustering: intergroup distance is the largest possible distance
$$d(C_k, C_{k'}) = \max_{x \in C_k,\, y \in C_{k'}} d(x, y)$$
3 Average-linkage clustering: average intergroup distance
$$d(C_k, C_{k'}) = \text{Ave}_{x \in C_k,\, y \in C_{k'}}\, d(x, y)$$
4 Ward's clustering
$$d(C_k, C_{k'}) = \frac{2\, |C_k|\, |C_{k'}|}{|C_k| + |C_{k'}|}\, \|\bar{X}_{C_k} - \bar{X}_{C_{k'}}\|^2$$
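A minimal sketch of agglomerative clustering on the sample data, not from the slides; it assumes data1 and labels1 have been built as above:
d <- dist(data1)                          # pairwise Euclidean distances
hc <- hclust(d, method = "single")        # single linkage; try "complete", "average", "ward.D2"
plot(hc, labels = FALSE)                  # dendrogram: the hierarchy of partitions
hc.labels <- cutree(hc, k = 5)            # cut the tree to obtain a flat 5-cluster partition
table(hc.labels, labels1)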

46 Single-linkage clustering 46

47 data1 47

48 data1 (K = 4 ) 47

49 Clustering Recap 1 Algorithmic clustering (no statistical assumptions) 1 K-means 2 Hierarchical linkage 2 Statistical clustering 1 Parametric - associates a specific model with the density (e.g. Gaussian, Poisson) parameters associated with each cluster 2 Nonparametric - looks at contours of the density to find cluster information (e.g. kernel density estimate) 48

50 Mixture Modeling/ Parametric Statistical Clustering Image: Li and Henning (2011) 49

51 Model-based clustering (parametric)
Assumes that each subgroup/cluster/component in the population has its own density:
$$p(X) = \sum_{k=1}^{K} \pi_k\, p_k(X; \theta_k) \quad \text{where } \sum_{k=1}^{K} \pi_k = 1 \text{ and } 0 \le \pi_k \le 1.$$
The entire dataset is modeled by a mixture of these distributions. For example,
$p(x) = 0.5\,\phi(x; 4, 1) + 0.5\,\phi(x; 0, 1)$ (middle plot) and $p(x) = 0.75\,\phi(x; 4, 1) + 0.25\,\phi(x; 0, 1)$ (right plot).

52 Model-based clustering: Gaussian example
Suppose there are K clusters; each cluster is modeled by a particular distribution (e.g. a Gaussian distribution with parameters $\mu_k$, $\Sigma_k$). The density of each cluster $k$ is
$$p_k(X) = \phi(X \mid \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^p |\Sigma_k|}} \exp\left( -\frac{(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)}{2} \right)$$
Letting $\pi_k$ be the weight of cluster $k$, the mixture density is
$$p(X) = \sum_{k=1}^{K} \pi_k\, p_k(X) = \sum_{k=1}^{K} \pi_k\, \phi(X \mid \mu_k, \Sigma_k)$$

53 Model-based clustering: advantages 1 Well-studied statistical inference techniques available 2 Flexibility in choosing the component distributions 3 Obtain a density estimate for each cluster 4 Soft classification is available 52

54 Model-based clustering: fitting the model Suppose there are K clusters - each cluster is modeled by a particular distribution (e.g. a Gaussian distribution with parameters µ k, Σ k ) Expectation - Maximization (EM) algorithm Finds maximum likelihood estimates in incomplete data (e.g. missing cluster labels). Alternates between expectation step and maximization steps: E-step: compute conditional expectation of the cluster labels M-step: maximize the likelihood and estimate parameters given the current labels; update parameter estimates 53

55 Model-based clustering: EM algorithm
Observations $\{X_1, X_2, \ldots, X_n\}$ are incomplete (i.e. no labels). Complete observations: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, where the $Y_i$ are the labels.
The collection of parameters is $\theta = \{(\pi_k, \mu_k, \Sigma_k),\ k = 1, \ldots, K\}$. The log-likelihood function is
$$\ell(X_1, \ldots, X_n \mid \theta) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k\, \phi(X_i \mid \mu_k, \Sigma_k) \right)$$
$\ell(X_1, \ldots, X_n \mid \theta)$ is the objective function of the EM algorithm; the numerical difficulty comes from the sum inside the log.
Goal: get MLEs for $(\pi_k, \mu_k, \Sigma_k)$, $k = 1, \ldots, K$.
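In practice the EM iterations are handled by existing software. A minimal sketch using the mclust package, not covered in the slides; it assumes data1 and labels1 from the earlier code:
library(mclust)                       # Mclust() fits Gaussian mixtures by EM
mc <- Mclust(data1, G = 5)            # 5 Gaussian components; BIC chooses the covariance model
summary(mc)                           # chosen model and cluster sizes
head(mc$z)                            # soft assignments: P(cluster k | X_i)
table(mc$classification, labels1)     # hard assignments vs. the true labels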

56 Nonparametric Clustering Image: Feigelson et al. (2011) 55

59 Nonparametric clustering Associate groups with high frequency areas groups in the population correspond to modes of the density p(x) Goal: find the modes of the density p(x), or ˆp(x). Assign observations to the domain of attraction of a mode. NP clustering is very dependent on the density estimate ˆp(x) 56

60 Final comment on selecting K Slide from Ryan Tibshirani's lecture, which was from George Casella's CMU seminar on 1/16/

62 Concluding Remarks: clustering Clustering - unsupervised/no labels find structure 1 K - means 2 Agglomerative hierarchical clustering 3 Parametric/Nonparametric 58

63 CLASSIFICATION
Build a model, classifier, etc. to separate data into known groups/classes (supervised learning). The response variable is not continuous; we want to predict labels.
1 Bayes classifiers
2 K Nearest Neighbors (KNN) classifiers
3 Logistic regression
4 Linear Discriminant Analysis (LDA)
5 Support Vector Machines (SVM)
6 Tree classifiers

64 Classification in Astronomy
Stars can be classified into spectral types OBAFGKMLTY.
Classification of active galactic nuclei (e.g. Starburst, Seyfert (I or II), Quasars, Blazars, BL Lac, OVV, Radio).
Galaxies can be classified into Hubble morphological types (E/S0/S/Irr), clustering environments, and by star formation activity.

65 Classification set-up
Notation: given vectors $X = \{X_1, X_2, \ldots, X_n\} \subset \mathbb{R}^p$ and class labels $Y = \{y_1, y_2, \ldots, y_n\}$; the $y_i$'s are qualitative. Let $\hat{y}_i$ be the predicted class label for observation $i$; the main interest is $P(Y \mid X)$.
The classification training error rate is often estimated using a training dataset as
$$\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$$
where $I(\cdot)$ is the indicator function. The classification test error rate is often estimated using a test dataset $(x_{\text{test}}, y_{\text{test}})$ as $E\left(I(y_{\text{test}} \neq \hat{y}_{\text{test}})\right)$; good classifiers have small test errors.

66 Bayes Classifiers
The test error is minimized by assigning observations with predictors $x$ to the class $j = 1, \ldots, J$ with the largest probability $P(Y = j \mid X = x)$.
If there are two classes ($J = 2$), the Bayes decision rule given predictors $x$ is
$$\hat{y}_i = \begin{cases} \text{class 1} & \text{if } P(Y = 1 \mid X = x) > 0.50 \\ \text{class 2} & \text{if } P(Y = 2 \mid X = x) > 0.50 \end{cases}$$
In general this is intractable because the distribution of $Y \mid X$ is unknown.

68 K Nearest Neighbors (KNN)
Main idea: an observation is classified based on the K observations in the training set that are nearest to it. A probability of each class can be estimated by
$$P(Y = j \mid X = x) = \frac{1}{K} \sum_{i \in N(x)} I(y_i = j)$$
where $j = 1, \ldots,$ #classes in the training set, $N(x)$ is the set of the K nearest neighbors of $x$, and $I$ = indicator function.
In the illustration, the K = 3 nearest neighbors to the point X are within the circle; the predicted class of X would be blue because there are more blue observations than green among the 3 NN.

69 KNN: data1 R: knn(training.set, test.set, training.set.labels, K) in class package 65
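A minimal sketch of KNN on data1, not from the slides; it assumes the class package and the data1/labels1 objects built earlier:
library(class)
set.seed(1)
train.id <- sample(1:nrow(data1), 150)        # hold out the remaining rows as a test set
knn.pred <- knn(train = data1[train.id, ],
                test  = data1[-train.id, ],
                cl    = factor(labels1[train.id]),
                k     = 3)
mean(knn.pred != labels1[-train.id])          # estimated test error rate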

70 KNN: data1 decision boundary Using all observations of data1, K = 1 66

71 Linear Classifiers
The decision boundary is linear.
Logistic regression
Linear Discriminant Analysis

72 Logistic Regression
Predicting two groups: binary labels
$$Y = \begin{cases} 1 & \text{with probability } \pi \\ 0 & \text{with probability } 1 - \pi \end{cases}$$
so $E[Y] = P(Y = 1) = \pi$. We assume the following model for logistic regression:
$$\text{logit}(\pi) = \log\left( \frac{\pi}{1 - \pi} \right) = \beta_0 + \beta^T X$$
where $\beta_0 \in \mathbb{R}$, $\beta \in \mathbb{R}^p$.

73 Logistic Regression, continued.
$$\text{logit}(\pi) = \log\left( \frac{\pi}{1 - \pi} \right) = \beta_0 + \beta^T X \;\Longrightarrow\; \pi = \frac{e^{\beta_0 + \beta^T X}}{1 + e^{\beta_0 + \beta^T X}}$$
Can fit $\beta_0, \ldots, \beta_p$ via MLE:
$$\ell(\beta_0, \beta) = \sum_{i=1}^{n} \log\left( P(Y = y_i \mid X = x_i) \right) = \sum_{i=1}^{n} \left[ y_i (\beta_0 + \beta^T x_i) - \log(1 + e^{\beta_0 + \beta^T x_i}) \right]$$
$$(\hat{\beta}_0, \hat{\beta}) = \underset{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p}{\text{argmax}}\; \ell(\beta_0, \beta)$$

74 Logistic Regression for data1
logit.labels <- matrix(0, nrow = length(labels1))
logit.labels[c(which(labels1 == 2), which(labels1 == 5))] <- 1
logit.fit <- glm(logit.labels ~ data1[,1] + data1[,2], family = binomial)
summary(logit.fit)  # <--- provides details of the fit model
logit.fit.probs <- predict(logit.fit, type = "response")
logit.class <- matrix(0, nrow = length(logit.fit.probs))
logit.class[logit.fit.probs >= .5] <- 1
par(mfrow = c(1,2))
plot(data1, xlim = c(-4,4), ylim = c(-5,4), xlab = "X1", ylab = "X2",
     col = logit.labels + 1, pch = logit.labels + 1, lwd = 3, main = "True classes")
plot(data1, xlim = c(-4,4), ylim = c(-5,4), xlab = "X1", ylab = "X2",
     col = logit.class + 1, pch = logit.class + 1, lwd = 3, main = "Predicted classes")

75 Logistic Regression for data1: fit model
$$\text{logit}(\hat{\pi}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$

76 Multiple classes What do we do if our response is not binary (i.e. J > 2)? There are extensions of logistic regression for multiple classes: Multiple - class logistic regression Linear Discriminant Analysis (LDA) is another option 72

77 Discriminant Analysis
Goal: estimate a decision boundary that gives a classification rule.
Basic idea: estimate the posterior probabilities of class membership. If an observation is in a particular location, what is the probability it belongs to a particular class?
Bayes Rule:
$$P(Y = j \mid x) = \frac{\pi_j\, p_j(x)}{\sum_{j'=1}^{J} \pi_{j'}\, p_{j'}(x)}$$
$\pi_j$ = prior probability of class $j$; $p_j(x) = P(X = x \mid Y = j)$.
Need a way to estimate $p_j(x)$ in order to do classification.

78 Example using Bayes Rule Consider X = 1.9, P(Y = 1) =

79 Linear Discriminant Analysis (LDA)
Multivariate Gaussian:
$$p_j(X) = \phi(X \mid \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^p |\Sigma_j|}} \exp\left( -\frac{(X - \mu_j)^T \Sigma_j^{-1} (X - \mu_j)}{2} \right)$$
LDA assumes all covariance matrices are equal ($\Sigma_1 = \cdots = \Sigma_J$).

80 LDA continued
A predictor $x$ is classified into the class $j = 1, \ldots, J$ according to the following:
$$\underset{j}{\text{argmax}}\, \{P(Y = j \mid X = x)\} = \underset{j}{\text{argmax}}\, \{\phi(x \mid \mu_j, \Sigma)\, \pi_j\} = \underset{j}{\text{argmax}}\, \{\log\left( \phi(x \mid \mu_j, \Sigma)\, \pi_j \right)\}$$
$$= \underset{j}{\text{argmax}}\, \left\{ -\frac{1}{2}(x - \mu_j)^T \Sigma^{-1} (x - \mu_j) + \log(\pi_j) \right\} = \underset{j}{\text{argmax}}\, \left\{ x^T \Sigma^{-1} \mu_j - \frac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \log(\pi_j) \right\}$$

81 LDA for data1 R: lda(y ~ x) in the MASS package 77

82 Quadratic Discriminant Analysis (QDA)
Assuming a common covariance matrix, $\Sigma$, is not always reasonable. Allowing for different covariance matrices, $\Sigma_j$, a predictor $x$ is classified into the class $j = 1, \ldots, J$ according to the following:
$$\underset{j}{\text{argmax}}\, \{P(Y = j \mid X = x)\} = \underset{j}{\text{argmax}}\, \{\phi(x \mid \mu_j, \Sigma_j)\, \pi_j\} = \underset{j}{\text{argmax}}\, \{\log\left( \phi(x \mid \mu_j, \Sigma_j)\, \pi_j \right)\}$$
$$= \underset{j}{\text{argmax}}\, \left\{ -\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) + \log(\pi_j) \right\}$$
$$= \underset{j}{\text{argmax}}\, \left\{ -\frac{1}{2}\log|\Sigma_j| - \frac{1}{2} x^T \Sigma_j^{-1} x + x^T \Sigma_j^{-1} \mu_j - \frac{1}{2} \mu_j^T \Sigma_j^{-1} \mu_j + \log(\pi_j) \right\}$$

83 QDA for data1 R: qda(y ~ x) in the MASS package 79
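A minimal sketch of LDA and QDA on data1, not from the slides; it assumes the MASS package and the data1/labels1 objects built earlier:
library(MASS)
df <- data.frame(x1 = data1[,1], x2 = data1[,2], y = factor(labels1))
lda.fit <- lda(y ~ x1 + x2, data = df)        # common covariance matrix across classes
qda.fit <- qda(y ~ x1 + x2, data = df)        # class-specific covariance matrices
mean(predict(lda.fit)$class != df$y)          # LDA training error rate
mean(predict(qda.fit)$class != df$y)          # QDA training error rate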

84 Support Vector Machines 80

85 Support Vector Machines (SVM)
Goal: find the hyperplane that best separates the two classes (i.e. maximize the margin between the classes).
Data: $\{x_i, y_i\}$, $i = 1, \ldots, n$, with $y_i \in \{-1, 1\}$.
A separating hyperplane does not always exist, so incorporate a parameter to penalize misclassifications.

86 SVM optimization:
$$\min_{w, b, \xi} \left( \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \right) \quad \text{subject to}\quad (X_i \cdot w + b)\, y_i \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, n$$
$\xi_i$ captures the degree to which observation $i$ is misclassified; $C$ is a misclassification penalty.

87 SVM kernel trick

88 SVM for data1 (linear) - A R: svm(x, y, kernel = "linear", class.weights) in the e1071 package 84

89 SVM for data1: fit model (linear) - A 85

90 SVM for data1 (linear) - B R: svm(x, y, kernel = "linear", class.weights) in the e1071 package 86

91 SVM for data1: fit model (linear) - B 87

92 SVM for data1 (radial basis) R: svm(x, y, kernel = "radial", class.weights) in the e1071 package 88
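A minimal sketch of fitting an SVM with a radial basis kernel to data1, not from the slides; it assumes the e1071 package and the data1/labels1 objects built earlier:
library(e1071)
df <- data.frame(x1 = data1[,1], x2 = data1[,2], y = factor(labels1))
svm.fit <- svm(y ~ x1 + x2, data = df, kernel = "radial", cost = 1)   # cost = C, the misclassification penalty
plot(svm.fit, df)                          # decision regions and support vectors
mean(predict(svm.fit, df) != df$y)         # training error rate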

93 SVM for data1: fit model (radial basis) 89

94 Classification Trees

95 Classification trees Goal: determine which variables are best at separating observations into the labeled groups 1 Predictor space is partitioned into hyper-rectangles 2 Any observations in the hyper-rectangle would be predicted to have the same label 3 Next split is chosen to maximize purity of hyper-rectangles Tree-based methods are not typically the best classification methods based on prediction accuracy, but they are often more easily interpreted (James et al. 2013) CART = Classification and Regression Trees 91

96 Classification Tree for data1 92

97 Classification Trees - recipe
Start at the top of the tree and use recursive binary splitting to grow the tree. At each split, determine the classification error rate for the new region (i.e. the number of observations that are not of the majority class in their region). The Gini index or cross-entropy are better measures of node purity.
R: tree(y ~ x) in the tree package.
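A minimal sketch of growing a classification tree on data1, not from the slides; it assumes the tree package and the data1/labels1 objects built earlier:
library(tree)
df <- data.frame(x1 = data1[,1], x2 = data1[,2], y = factor(labels1))
tree.fit <- tree(y ~ x1 + x2, data = df)              # recursive binary splitting on X1 and X2
plot(tree.fit); text(tree.fit)                        # draw the tree with split rules at each node
mean(predict(tree.fit, df, type = "class") != df$y)   # training error rate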

98 Classification Trees - remarks
Tree pruning - the classification tree may be overfit, or too complex; pruning removes portions of the tree that are not useful for the classification goals of the tree.
Bootstrap aggregation (aka "bagging") - there is high variance in classification trees, and bagging (averaging over many trees) provides a means of variance reduction. (Boosting is another approach, but it grows trees sequentially rather than using a bootstrapped sample.)
Random forest - similar idea to bagging, except it incorporates a step that helps to decorrelate the trees.

99 Concluding Remarks Clustering - unsupervised/no labels find structure 1 K - means/ K - medoids 2 Agglomerative hierarchical clustering 3 Parametric/Non-parametric Classification - supervised/labels predict classes 1 KNN 2 Logistic regression 3 LDA/QDA 4 Support Vector Machines 5 Tree classifiers Clustering and classification are useful tools, but need to be familiar with assumptions that go into the methods 95

100 Bibliography
Feigelson, E. D., Getman, K. V., Townsley, L. K., Broos, P. S., Povich, M. S., Garmire, G. P., King, R. R., Montmerle, T., Preibisch, T., Smith, N., et al. (2011), "X-ray Star Clusters in the Carina Complex," The Astrophysical Journal Supplement Series, 194, 9.
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning, 2nd ed., Springer.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning with Applications in R, Springer Texts in Statistics, Springer.
Li, H.-b. and Henning, T. (2011), "The alignment of molecular cloud magnetic fields with the spiral arms in M33," Nature, 479.
Mukherjee, S., Feigelson, E. D., Babu, G. J., Murtagh, F., Fraley, C., and Raftery, A. (1998), "Three types of gamma-ray bursts," The Astrophysical Journal, 508, 314.
Wasserman, L. (2004), All of Statistics: A Concise Course in Statistical Inference, Springer.
