Week 10
Based in part on slides from the textbook and slides of Susan Holmes
December 5, 2012

Part I: Linear regression & the lasso

We've talked mostly about classification, where the outcome is categorical. In regression, the outcome is continuous.
Given $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$, the least squares regression problem is
$$\hat\beta = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \tfrac{1}{2} \|Y - X\beta\|_2^2.$$

If $p \le n$ and $X^T X$ is invertible, this has a unique solution:
$$\hat\beta = (X^T X)^{-1} X^T Y.$$
If $p > n$, or $X^T X$ is otherwise not invertible, the solution is not unique.
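As a minimal sketch of the closed-form solution above, the snippet below solves the normal equations with NumPy on simulated data. The data, the seed, and the function name `least_squares` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def least_squares(X, Y):
    """Solve the normal equations (X^T X) beta = X^T Y.

    Assumes p <= n and X^T X invertible; np.linalg.solve raises
    LinAlgError when the matrix is singular.
    """
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Simulated example (hypothetical data): n = 50 cases, p = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_hat = least_squares(X, Y)
# At the minimizer the gradient X^T (Y - X beta) vanishes.
gradient = X.T @ (Y - X @ beta_hat)
```

In practice `np.linalg.lstsq` is usually preferred because it also returns a least squares solution when $X^T X$ is singular, the case where the solution is not unique.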
Often, due to some other knowledge, we might change the loss function to a Mahalanobis distance
$$\tfrac{1}{2} (Y - X\beta)^T \Sigma^{-1} (Y - X\beta).$$
Example: a diagonal $\Sigma$ is often used if $Y \mid X$ has variance dependent on $X$.
Example: if the cases are sampled with some structure, the errors might be correlated with covariance $\Sigma$.

Feature selection
As noted, if $p > n$ there is no unique least squares model. This leads to the variable selection problem.
We might try fixing a certain number of variables, $k$, and solving best subsets regression by
$$\hat\beta_k = \operatorname{argmin}_{J = \{j_1, \dots, j_k\}} \tfrac{1}{2} \|Y - X_J \hat\beta_J\|_2^2$$
where $\hat\beta_J$ is a/the least squares solution with features $J$.
This is like clustering: a combinatorial optimization problem.
We might be able to prove things about the minimizer, but we have no algorithm to find it for $p > 40$ or so, unless $X$ has very special structure.

The $\ell_0$ norm
Defined as
$$\|\beta\|_0 = \#\{j : \beta_j \neq 0\}$$
(which is not really a norm...)
The best subsets model is equivalent to
$$\hat\beta_k = \operatorname{argmin}_{\beta \in \mathbb{R}^p : \|\beta\|_0 \le k} \tfrac{1}{2} \|Y - X\beta\|_2^2.$$
A quasi-equivalent problem is its Lagrange version
$$\hat\beta_{\lambda,\ell_0} = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \tfrac{1}{2} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_0.$$
This is similar to cost-complexity pruning for the decision tree.

The $\ell_1$ norm
Recall the $\ell_1$ norm
$$\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$$
(which really is a norm...)
Since it's a norm it is convex.
It is also the best convex approximation to the $\ell_0$ norm (its convex envelope on the unit $\ell_\infty$ ball).
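To make the combinatorial nature of best subsets concrete, here is a small sketch that enumerates every size-$k$ subset $J$ and keeps the one with the smallest residual sum of squares. The function name `best_subset` and the simulated data are assumptions for illustration; the loop visits $\binom{p}{k}$ subsets, which is why this is only feasible for small $p$.

```python
from itertools import combinations

import numpy as np

def best_subset(X, Y, k):
    """Exhaustive best-subsets search over all size-k column subsets J."""
    best_J, best_beta, best_rss = None, None, np.inf
    for J in combinations(range(X.shape[1]), k):
        XJ = X[:, list(J)]
        # Least squares fit using only the features in J.
        beta_J = np.linalg.lstsq(XJ, Y, rcond=None)[0]
        rss = 0.5 * np.sum((Y - XJ @ beta_J) ** 2)
        if rss < best_rss:
            best_J, best_beta, best_rss = J, beta_J, rss
    return best_J, best_beta, best_rss

# Hypothetical data: only features 1 and 4 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
Y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.05 * rng.normal(size=40)

J, beta_J, rss = best_subset(X, Y, k=2)
```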
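The Lasso
The problem is
$$\hat\beta_\lambda = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \tfrac{1}{2} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$
In bound form
$$\hat\beta_c = \operatorname{argmin}_{\beta \in \mathbb{R}^p : \|\beta\|_1 \le c} \tfrac{1}{2} \|Y - X\beta\|_2^2.$$
This is a convex problem... lots of nice ways to solve it.
With $\ell_0$ we knew we would get feature selection. Does this happen with $\ell_1$ as well?

Why do we get sparse solutions with the lasso?
The Lasso: why does the $\ell_1$-penalty give sparse $\hat\beta$?
$$\text{minimize}_{\beta \in \mathbb{R}^p} \ \tfrac{1}{2} \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|_1 \le c$$
[Figure: the $(\beta_1, \beta_2)$ plane, showing $\hat\beta^{OLS}$ and $\hat\beta_\lambda$. Source: Jacob Bien, A Lasso for Hierarchical Interaction]

Lasso path
The parameter $\lambda$ controls the sparsity.
For $\lambda > \|X^T Y\|_\infty$, the sup-norm of $X^T Y \in \mathbb{R}^p$, the solution is $\hat\beta_\lambda = 0$.
Just below $\|X^T Y\|_\infty$, only the feature with maximal absolute correlation with $Y$ has a nonzero coefficient (taking the columns of $X$ to be standardized, so that $X_j^T Y$ is proportional to a correlation).
For $\lambda = 0$, any least squares solution is a solution.
It is possible, depending on $X$, to have multiple solutions at some fixed $\lambda$, but for generic $X$ it never happens.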
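The behaviour of the lasso path at large $\lambda$ can be checked numerically. The sketch below uses cyclic coordinate descent with soft-thresholding, which is one of many possible solvers and is not named in the slides; it assumes the objective $\tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$, and the function names and simulated data are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5 * ||Y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    resid = Y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Inner product of feature j with the partial residual
            # (the residual with feature j's contribution added back).
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Hypothetical data: only feature 0 is related to Y.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
Y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=60)

lam_max = np.max(np.abs(X.T @ Y))               # sup-norm of X^T Y
beta_zero = lasso_cd(X, Y, lam=1.01 * lam_max)  # above the threshold
beta_one = lasso_cd(X, Y, lam=0.95 * lam_max)   # just below it
```

Above `lam_max` every coordinate is thresholded to zero; just below it, only the feature with the largest $|X_j^T Y|$ enters.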
So what? (Consistency)
The lasso yields sparse solutions, but are they good sparse solutions?
Noting that the first variable to enter has the largest absolute correlation suggests the answer is yes.
Suppose
$$E(Y \mid X) = \sum_{j \in A} X_j \beta_j = X_A \beta_A.$$
We call $A \subset \{1, \dots, p\}$ the (true) active coefficients and $I = A^c$ the (true) inactive features.

So what? (Consistency)
Suppose $\#A$ is not too large; the matrix $X_I^T X_A$ is not very big; and $X_A^T X_A$ is not too close to singular.
Then, for certain values of $\lambda$, $\hat\beta_\lambda$ makes no false positives. That is,
$$\hat\beta_{\lambda,j} \neq 0 \implies j \in A.$$
If $(\beta_j)_{j \in A}$ are large enough, then it also makes no false negatives. That is,
$$\hat\beta_{\lambda,j} = 0 \implies j \in I.$$

Other problems
The sparsity property is a property of the $\ell_1$ norm, not the smooth loss function.
For many smooth objectives, adding this $\ell_1$ penalty yields sparse solutions.
Sparse Support Vector Machine
We can add the $\ell_1$ norm to the support vector machine loss:
$$\text{minimize}_{\beta,\alpha} \ \sum_{i=1}^n (1 - y_i(\alpha + x_i^T \beta))_+ + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1.$$
Yields a sparse coefficient vector for the SVM.

Sparse Logistic Regression
We can add the $\ell_1$ norm to the logistic regression loss:
$$\text{minimize}_{\beta,\alpha} \ \sum_{i=1}^n \mathrm{DEV}(\alpha + x_i^T \beta, y_i) + \lambda \|\beta\|_1.$$
Yields a sparse coefficient vector for the logistic regression model.

Fused / Generalized lasso
What if we want a solution with small
$$\#\{i : D_i \beta \neq 0\} = \|D\beta\|_0?$$
Example: change point detection, a form of outlier detection in a streaming data situation.
Fused / Generalized lasso
Might consider solving
$$\hat\beta_\lambda = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \tfrac{1}{2} \|Y - X\beta\|_2^2 + \lambda \|D\beta\|_1.$$
A lot of the theory from the lasso carries over to this case as well, but not quite all of it...

Group lasso
What if there are disjoint groups $(\beta_g)_{g \in G}$ of coefficients in our model and we want a solution with small
$$\#\{g : \beta_g \neq 0\} = \|\beta\|_{0,G}?$$
Example: the groups of variables might be indicator variables for different categorical variables.

Group lasso
Might consider solving
$$\hat\beta_\lambda = \operatorname{argmin}_{\beta \in \mathbb{R}^p} \tfrac{1}{2} \|Y - X\beta\|_2^2 + \lambda \sum_{g \in G} \|\beta_g\|_2.$$
Leads to a group version of all your favorites: logistic regression; support vector machine; etc.
There are tree-like group penalties...

Other problems
Why not add an $\ell_1$ penalty to the decision tree problem?
Because the original problem is non-convex, so the combination is generally still non-convex...
Similarly, we would not add an $\ell_1$ penalty to the K-means objective function...
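The fused and group penalties can be evaluated directly on a coefficient vector. In the sketch below, $D$ is taken to be the first-difference matrix, a common choice for change point detection but an assumption here, since the slides leave $D$ general; the helper names and the example vector are also illustrative.

```python
import numpy as np

def first_difference_matrix(p):
    """(p-1) x p matrix D with (D beta)_i = beta[i+1] - beta[i]."""
    D = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def fused_penalty(beta, D):
    """||D beta||_1, the fused / generalized lasso penalty."""
    return np.sum(np.abs(D @ beta))

def group_penalty(beta, groups):
    """Sum over groups g of the Euclidean norm ||beta_g||_2."""
    return sum(np.linalg.norm(beta[list(g)]) for g in groups)

# A piecewise-constant coefficient vector with a single change point.
beta = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0])
D = first_difference_matrix(len(beta))

n_jumps = np.count_nonzero(D @ beta)   # ||D beta||_0
tv = fused_penalty(beta, D)            # ||D beta||_1
groups = [(0, 1, 2), (3, 4, 5)]        # hypothetical disjoint groups
gp = group_penalty(beta, groups)       # only the second group is nonzero
```

The group penalty is not squared, so it behaves like an $\ell_1$ norm across groups: whole blocks $\beta_g$ are set to zero together.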
No free lunch
Of course, we must pay some price for all of this, but what price?
Well, the lasso produces biased estimates even when it finds the true active variables.
But, general theory says that we should be using some bias in the form of shrinkage...

Summary
A lot of interesting work in high-dimensional statistics / machine learning over the last few years involves studying problems of the form
$$\hat\beta_\lambda = \operatorname{argmin}_{\beta \in \mathbb{R}^p} L(\beta) + \lambda P(\beta)$$
where $L$ is a loss function like the support vector loss, logistic loss, squared error loss, etc., and $P$ is a convex penalty that imparts structure on the solutions.
Lots of interesting questions remain...
Try STATS315 for a more detailed introduction to the lasso...

Part II: Final review

Overview
Before Midterm
General goals of data mining.
Datatypes.
Preprocessing & dimension reduction.
Distances.
Multidimensional scaling.
Multidimensional arrays.
Decision trees.
Performance measures for classifiers.
Discriminant analysis.
Final review

Overview
After Midterm
More classifiers:
Rule-based Classifiers
Nearest-Neighbour Classifiers
Naive Bayes Classifiers
Neural Networks
Support Vector Machines
Random Forests
Boosting (AdaBoost / Gradient Boosting)
Clustering.
Outlier detection.

Rule based classifiers
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
(Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004)

Rule based classifiers
Concepts
coverage
accuracy
mutual exclusivity
exhaustivity
Laplace accuracy

Nearest neighbour classifier
Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: training records and a test record. Compute the distance from the test record to the training records, then choose the k nearest records.]
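The rules R1 to R5 above can be written down as data and applied to a record. The sketch below assumes an ordered rule list where the first matching rule wins, which is one common convention and not something the slide specifies; the three example animals and their attribute values are hypothetical records chosen to illustrate mutual exclusivity and exhaustivity.

```python
# Rules R1-R5 as (conditions, label) pairs, checked in order.
RULES = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),
    ({"Live in Water": "sometimes"}, "Amphibians"),
]

def matches(conditions, record):
    """True when the record satisfies every attribute test of a rule."""
    return all(record.get(attr) == value for attr, value in conditions.items())

def classify(record, rules=RULES, default=None):
    """Return the label of the first rule that covers the record."""
    for conditions, label in rules:
        if matches(conditions, record):
            return label
    return default

hawk = {"Give Birth": "no", "Can Fly": "yes",
        "Live in Water": "no", "Blood Type": "warm"}
turtle = {"Give Birth": "no", "Can Fly": "no",
          "Live in Water": "sometimes", "Blood Type": "cold"}
shark = {"Give Birth": "yes", "Can Fly": "no",
         "Live in Water": "yes", "Blood Type": "cold"}

# The turtle triggers both R4 and R5: the rules are not mutually exclusive.
turtle_rules = [label for cond, label in RULES if matches(cond, turtle)]
# The shark triggers no rule: the rule set is not exhaustive.
shark_label = classify(shark)
```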
Nearest neighbour classifier
[Figure: if k is too large, the neighborhood may include points from other classes.]

Naive Bayes classifiers
Model:
$$P(Y = c \mid X_1 = x_1, \dots, X_p = x_p) \propto \left( \prod_{l=1}^p P(X_l = x_l \mid Y = c) \right) P(Y = c)$$
For continuous features, typically a 1-dimensional QDA model is used (i.e. Gaussian within each class).
For discrete features: use the Laplace smoothed probabilities
$$P(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l, Y_i = c\} + \alpha}{\#\{Y_i = c\} + \alpha k},$$
where $k$ is the number of values that $X_j$ can take.

Neural networks: single layer
Artificial Neural Networks (ANN)
[Figure]

Neural networks: double layer
[Figure]
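The Laplace smoothed probability for a discrete feature is a one-line count ratio. The sketch below is a direct transcription of that formula; the function name, the toy feature column, and the default $\alpha = 1$ are assumptions for illustration.

```python
def laplace_smoothed_prob(xs, ys, level, c, n_levels, alpha=1.0):
    """Estimate P(X_j = level | Y = c) from one feature column and the labels.

    n_levels is k, the number of values the feature can take.
    """
    n_c = sum(1 for y in ys if y == c)
    n_lc = sum(1 for x, y in zip(xs, ys) if x == level and y == c)
    return (n_lc + alpha) / (n_c + alpha * n_levels)

# Toy column with three possible levels: "a", "b", "c".
xs = ["a", "a", "b", "a", "c"]
ys = [1, 1, 1, 0, 0]

p_a = laplace_smoothed_prob(xs, ys, "a", 1, n_levels=3)  # (2 + 1) / (3 + 3)
p_c = laplace_smoothed_prob(xs, ys, "c", 1, n_levels=3)  # (0 + 1) / (3 + 3)
total = sum(laplace_smoothed_prob(xs, ys, lev, 1, n_levels=3)
            for lev in ("a", "b", "c"))
```

Level "c" never occurs with class 1 in the toy data, yet its smoothed probability is positive, so a single unseen value cannot zero out the whole product in the naive Bayes model.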
Support vector machines
[Figure: find the hyperplane that maximizes the margin, so B1 is better than B2.]

Support vector machines
Solves the problem
$$\text{minimize}_{\beta,\alpha,\xi} \ \|\beta\|_2 \quad \text{subject to} \quad y_i(x_i^T \beta + \alpha) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \sum_{i=1}^n \xi_i \le C.$$

Non-separable problems
The $\xi_i$'s can be removed from this problem, yielding
$$\text{minimize}_{\beta,\alpha} \ \|\beta\|_2^2 + \gamma \sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+$$
where $(z)_+ = \max(z, 0)$ is the positive part function and $f_{\alpha,\beta}(x) = x^T \beta + \alpha$.
Or,
$$\text{minimize}_{\beta,\alpha} \ \sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+ + \lambda \|\beta\|_2^2.$$

Logistic vs. SVM
[Figure: the logistic and SVM losses plotted together; horizontal axis from -3 to 3, vertical axis from 0 to 4.]
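The comparison of the logistic and SVM losses can be sketched by evaluating both as functions of the margin $m = y f_{\alpha,\beta}(x)$. The hinge loss $(1 - m)_+$ is the one in the formulas above; the logistic loss is taken here as $\log(1 + e^{-m})$ in natural-log units, which is an assumption about the scaling used in the figure.

```python
import math

def hinge_loss(margin):
    """SVM loss (1 - m)_+ as a function of the margin m = y * f(x)."""
    return max(1.0 - margin, 0.0)

def logistic_loss(margin):
    """Logistic loss log(1 + exp(-m)) for labels coded as -1 / +1."""
    return math.log1p(math.exp(-margin))

margins = [-3, -2, -1, 0, 1, 2, 3]
table = [(m, hinge_loss(m), logistic_loss(m)) for m in margins]

for m, h, l in table:
    print(f"margin {m:+d}: hinge {h:.3f}, logistic {l:.3f}")
```

Both losses grow roughly linearly for badly misclassified points. The hinge loss is exactly zero once the margin exceeds 1, whereas the logistic loss is always positive.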
Ensemble methods
General idea
[Figure]

Bagging / Random Forests
In this method, one takes several bootstrap samples (samples with replacement) of the data.
For each bootstrap sample $S_b$, $1 \le b \le B$, fit a model, retaining the classifier $f^{*,b}$.
After all models have been fit, use majority vote:
$$f(x) = \text{majority vote of } (f^{*,b}(x))_{1 \le b \le B}.$$
We also defined the out-of-bag (OOB) estimate of error: each observation is predicted using only the models whose bootstrap sample did not contain it.

Illustrating AdaBoost
[Figure: data points for training, with initial weights for each data point.]

Illustrating AdaBoost
[Figure]
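The bagging recipe above (bootstrap, fit, majority vote, OOB error) can be sketched in a few lines. The base learner here is a hypothetical one-dimensional decision stump chosen only to keep the example self-contained; the slides do not prescribe a base model, and random forests would use randomized trees instead. The toy data and seed are likewise assumptions.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Hypothetical base learner: best threshold rule on one feature, labels 0/1."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(xs, ys, B=25, seed=0):
    """Fit B models, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models, in_bag = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
        in_bag.append(set(idx))
    return models, in_bag

def majority_vote(models, x):
    return Counter(m(x) for m in models).most_common(1)[0][0]

def oob_error(models, in_bag, xs, ys):
    """Error rate when each case is voted on only by models that never saw it."""
    wrong = counted = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        votes = [m(x) for m, bag in zip(models, in_bag) if i not in bag]
        if votes:
            counted += 1
            wrong += Counter(votes).most_common(1)[0][0] != y
    return wrong / counted

# Toy data: class 0 on the left, class 1 on the right.
xs = [0.1 * i for i in range(20)]
ys = [0] * 10 + [1] * 10
models, in_bag = bagging(xs, ys)
err = oob_error(models, in_bag, xs, ys)
```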
Ensemble methods
Boosting as gradient descent
It turns out that boosting can be thought of as something like gradient descent.
In some sense, the boosting algorithm is a steepest descent algorithm to find
$$\operatorname{argmin}_{f \in \mathcal{F}} \sum_{i=1}^n L(y_i, f(x_i)).$$

Cluster analysis
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.]

Clustering
Types of clustering
Partitional: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical: a set of nested clusters organized as a hierarchical tree. Each data object is in exactly one subset for any horizontal cut of the tree...

Cluster analysis
A partitional example
[Figure: scatter plot with axes $X_1$ and $X_2$.]
FIGURE 14.4. Simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.
K-medoid
Algorithm: same as K-means, except that the centroid is estimated not by the average, but by the observation having minimum total pairwise distance to the other cluster members.
Advantage: the centroid is one of the observations, which is useful, e.g., when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations.

K-means
Figure: Gap statistic
[Figure, two panels, each plotted against Number of Clusters: $\log W_K$ (left) and Gap (right).]
FIGURE 14.11. (Left panel): observed (green) and expected (blue) values of $\log W_K$ for the simulated data of Figure 14.4. Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of $\log W_K$. The Gap estimate $K^*$ is the smallest $K$ producing a gap within one standard deviation of the gap at $K + 1$; here $K^* = 2$.

Silhouette plot
[Figure]

Cluster analysis
This gives $K^* = 2$, which looks reasonable from Figure 14.4.

14.3.12 Hierarchical Clustering
The results of applying K-means or K-medoids clustering algorithms depend on the choice for the number of clusters to be searched and a starting configuration assignment. In contrast, hierarchical clustering methods do not require such specifications. Instead, they require the user to specify a measure of dissimilarity between (disjoint) groups of observations, based on the pairwise dissimilarities among the observations in the two groups. As the name suggests, they produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level, each cluster contains a single observation. At the highest level there is only one cluster containing all of the data.
Strategies for hierarchical clustering divide into two basic paradigms: agglomerative (bottom-up) and divisive (top-down).
Agglomerative strategies start at the bottom and at each level recursively merge a selected pair of clusters into a single cluster. This produces a grouping at the next higher level with one less cluster. The pair chosen for merging consists of the two groups with the smallest intergroup dissimilarity. Divisive methods start at the top and at each level recursively split one of the existing clusters at ...

A hierarchical example
[Figure: dendrogram whose leaves are labeled by tumor type, including LEUKEMIA, BREAST, OVARIAN, PROSTATE, CNS, COLON and UNKNOWN.]
FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data.
... hierarchical structure produced by the algorithm. Hierarchical methods impose ...
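The agglomerative strategy described in the excerpt (start with singletons, repeatedly merge the pair of clusters with the smallest intergroup dissimilarity) can be sketched naively as follows. Single and complete linkage are used as two standard definitions of intergroup dissimilarity; the function names and the four toy points are assumptions, and the brute-force search is only meant for small examples.

```python
def agglomerate(points, dist, linkage=min):
    """Merge the two closest clusters until one remains; return the merge history.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Intergroup dissimilarity from the pairwise dissimilarities.
                d = linkage(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

def abs_dist(x, y):
    return abs(x - y)

points = [0.0, 1.0, 5.0, 6.5]
single = agglomerate(points, abs_dist, linkage=min)
complete = agglomerate(points, abs_dist, linkage=max)
```

Each entry of the history records the two clusters merged and the height at which they merge, which is the information a dendrogram displays. With $n$ observations there are always $n - 1$ merges.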
Hierarchical clustering
Concepts
Top-down vs. bottom-up.
Different linkages:
single linkage (minimum distance)
complete linkage (maximum distance)

Mixture models
Similar to K-means, but assignment to clusters is soft.
Often applied with multivariate normal as the model within classes.
EM algorithm used to fit the model:
Estimate responsibilities.
Estimate within-class parameters, replacing labels (unobserved) with responsibilities.

Model-based clustering
Summary
1. Choose a type of mixture model (e.g. multivariate Normal) and a maximum number of clusters, K.
2. Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3. Use clusters from the previous step to initialize EM for the mixture model.
4. Use BIC to compare different mixture models and models with different numbers of clusters.

Outliers
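The two EM steps for a mixture model (estimate responsibilities, then re-estimate within-class parameters using them as soft labels) can be sketched for the simplest case. The example below assumes a one-dimensional mixture of two Gaussians rather than the multivariate normal mentioned above; the starting values, the small floor on the standard deviation, and the toy data are all illustrative choices.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, mu=(-1.0, 1.0), sigma=(1.0, 1.0), pi=(0.5, 0.5), n_iter=50):
    """EM for a two-component 1-D Gaussian mixture."""
    mu, sigma, pi = list(mu), list(sigma), list(pi)
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. soft cluster assignments.
        resp = []
        for x in xs:
            w = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # M-step: weighted estimates, with responsibilities replacing labels.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = math.sqrt(var) + 1e-6  # small floor to avoid collapse
            pi[k] = nk / len(xs)
    return mu, sigma, pi, resp

# Toy data: two well-separated groups centred near -2 and 3.
xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.8, 3.1, 2.9]
mu, sigma, pi, resp = em_two_gaussians(xs)
```

Replacing the responsibilities by hard 0/1 assignments to the nearest mean would turn these updates into K-means, which is the sense in which the mixture model is a soft version of it.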
Outliers
General steps
Build a profile of the "normal" behavior.
Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.
General types of schemes involve a statistical model of "normal", and "far" is measured in terms of likelihood.
Example: Grubbs' test chooses an outlier threshold to control the Type I error of any declared outliers if the data does actually follow the model...