Part I: Linear regression & LASSO
Week 10. Based in part on slides from textbook, slides of Susan Holmes. December 5, 2012.


Linear regression

We've talked mostly about classification, where the outcome is categorical. In regression, the outcome is continuous. Given $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$, the least squares regression problem is
$$\hat\beta = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|Y - X\beta\|_2^2.$$
If $p \le n$ and $X^TX$ is invertible, this has a unique solution:
$$\hat\beta = (X^TX)^{-1}X^TY.$$
If $p > n$, or if $X^TX$ is not invertible, the solution is not unique.
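As a quick illustration, here is a minimal NumPy sketch of the least squares solution (the data are simulated purely for illustration; `np.linalg.lstsq` returns a minimum-norm solution even when $X^TX$ is not invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form solution when X^T X is invertible (p <= n, full column rank)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# lstsq also works when X^T X is singular (it returns a minimum-norm solution)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```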

Often, due to some other knowledge, we might change the loss function to a Mahalanobis distance
$$\tfrac{1}{2}(Y - X\beta)^T\Sigma^{-1}(Y - X\beta).$$
Example: a diagonal $\Sigma$ is often used if $Y \mid X$ has variance dependent on $X$. Example: if the cases are sampled with some structure, the errors might be correlated, with covariance $\Sigma$.

Feature selection

As noted, if $p > n$ there is no unique least squares model. This leads to a variable selection problem. We might try fixing a certain number of variables, $k$, and solving the best subsets regression problem
$$\hat\beta_k = \mathop{\mathrm{argmin}}_{J = \{j_1, \ldots, j_k\}} \tfrac{1}{2}\|Y - X_J\hat\beta_J\|_2^2,$$
where $\hat\beta_J$ is a/the least squares solution using only the features in $J$. This is like clustering: a combinatorial optimization problem. We might be able to prove things about the minimizer, but we have no algorithm to find it for $p > 40$ or so, unless $X$ has very special structure.
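To make the combinatorial nature of best subsets concrete, here is a brute-force sketch (illustrative data; it enumerates all $\binom{p}{k}$ subsets, which is exactly why it is hopeless for large $p$):

```python
import numpy as np
from itertools import combinations

def best_subset(X, Y, k):
    """Brute-force best subsets of size k: exact, but exponential in p."""
    n, p = X.shape
    best_rss, best_J = np.inf, None
    for J in combinations(range(p), k):            # all (p choose k) subsets
        XJ = X[:, list(J)]
        beta_J, *_ = np.linalg.lstsq(XJ, Y, rcond=None)
        rss = np.sum((Y - XJ @ beta_J) ** 2)
        if rss < best_rss:
            best_rss, best_J = rss, J
    return best_J, best_rss

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
Y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=60)
print(best_subset(X, Y, k=2))   # should recover features (0, 3)
```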

The $\ell_0$ norm

Defined as $\|\beta\|_0 = \#\{j : \beta_j \ne 0\}$ (which is not really a norm...). The best subsets problem is equivalent to
$$\hat\beta_k = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p : \|\beta\|_0 \le k} \tfrac{1}{2}\|Y - X\beta\|_2^2.$$
A quasi-equivalent problem is its Lagrange version
$$\hat\beta_{\lambda,\ell_0} = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_0.$$
This is similar to cost-complexity pruning for decision trees.

The $\ell_1$ norm

Recall the $\ell_1$ norm
$$\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$$
(which really is a norm...). Since it is a norm, it is convex. It is also the best convex approximation to the $\ell_0$ norm.

The lasso

The problem is
$$\hat\beta_\lambda = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \lambda\|\beta\|_1.$$
In bound form,
$$\hat\beta_c = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p : \|\beta\|_1 \le c} \|Y - X\beta\|_2^2.$$
This is a convex problem... lots of nice ways to solve it. With $\ell_0$ we knew we would get feature selection; does this happen with $\ell_1$ as well?

[Figure (credit: Jacob Bien): why the $\ell_1$ penalty gives a sparse $\hat\beta$ — the constraint set $\{\beta : \|\beta\|_1 \le c\}$ has corners on the coordinate axes, so the constrained solution $\hat\beta_\lambda$ typically lands on a corner where some coordinates are exactly zero, unlike $\hat\beta_{\mathrm{OLS}}$.]

The lasso path

The parameter $\lambda$ controls the sparsity. For $\lambda > \|X^TY\|_\infty$ (the sup-norm of $X^TY \in \mathbb{R}^p$), the solution is $\hat\beta_\lambda = 0$. Just below $\|X^TY\|_\infty$, only the feature with maximal absolute correlation with $Y$ has a nonzero coefficient. For $\lambda = 0$, any least squares solution is a solution. It is possible, depending on $X$, to have multiple solutions at some fixed $\lambda$, but for generic $X$ this never happens.
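A brief sketch of these facts using scikit-learn's `Lasso` (assumed available). Note that scikit-learn minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so the threshold at which the solution becomes exactly zero is $\|X^TY\|_\infty / n$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 50, 200                       # p > n: least squares is not unique
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3, -2, 1.5, 1, -1]  # sparse truth
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

alpha_max = np.max(np.abs(X.T @ Y)) / n   # smallest alpha giving beta = 0
for alpha in [1.1 * alpha_max, 0.5 * alpha_max, 0.1 * alpha_max]:
    fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X, Y)
    print(f"alpha={alpha:.3f}  nonzero coefficients: {np.sum(fit.coef_ != 0)}")
```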

So what? (Consistency)

The lasso yields sparse solutions, but are they good sparse solutions? Noting that the first variable to enter has the largest absolute correlation suggests the answer is yes.

Suppose $E(Y \mid X) = X_A\beta_A = \sum_{j \in A} X_j\beta_j$. We call $A \subset \{1, \ldots, p\}$ the (true) active coefficients and $I = A^c$ the (true) inactive features.

Suppose $\#A$ is not too large, the matrix $X_I^TX_A$ is not very big, and $X_A^TX_A$ is not too close to singular. Then, for certain values of $\lambda$, $\hat\beta_\lambda$ makes no false positives. That is, $\hat\beta_{\lambda,j} \ne 0 \implies j \in A$. If the $(\beta_j)_{j \in A}$ are large enough, then it also makes no false negatives. That is, $\hat\beta_{\lambda,j} = 0 \implies j \in I$.

Other problems

The sparsity property is a property of the $\ell_1$ norm, not of the smooth loss function. For many smooth objectives, adding this $\ell_1$ penalty yields sparse solutions.

Sparse support vector machine

We can add the $\ell_1$ norm to the support vector machine loss:
$$\mathop{\mathrm{minimize}}_{\beta,\alpha} \sum_{i=1}^n (1 - y_i(\alpha + x_i^T\beta))_+ + \lambda_2\|\beta\|_2^2 + \lambda_1\|\beta\|_1.$$
This yields a sparse coefficient vector for the SVM.

Sparse logistic regression

We can add the $\ell_1$ norm to the logistic regression loss:
$$\mathop{\mathrm{minimize}}_{\beta,\alpha} \sum_{i=1}^n \mathrm{DEV}(\alpha + x_i^T\beta, y_i) + \lambda\|\beta\|_1.$$
This yields a sparse coefficient vector for the logistic regression model.
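A small sketch of sparse ($\ell_1$-penalized) logistic regression with scikit-learn (assumed available); scikit-learn parameterizes the penalty strength by $C \approx 1/\lambda$, so smaller $C$ gives sparser coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]                  # only 3 informative features
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(int)

# The l1 penalty needs a solver that supports it, e.g. liblinear or saga
for C in [0.05, 0.5, 5.0]:                        # C is roughly 1/lambda
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    print(f"C={C}: nonzero coefficients = {np.sum(clf.coef_ != 0)}")
```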

Fused / generalized lasso

What if we want a solution with small $\#\{i : D_i\beta \ne 0\} = \|D\beta\|_0$? Example: change point detection, a form of outlier detection in a streaming-data situation. Might consider solving
$$\hat\beta_\lambda = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \lambda\|D\beta\|_1.$$
A lot of the theory from the lasso carries over to this case as well, but not quite all of it...

Group lasso

What if there are disjoint groups $(\beta_g)_{g \in G}$ of coefficients in our model and we want a solution with small $\#\{g : \beta_g \ne 0\} = \|\beta\|_{0,G}$? Example: the groups of variables might be indicator variables for different categorical variables. Might consider solving
$$\hat\beta_\lambda = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \lambda\sum_{g \in G}\|\beta_g\|_2.$$
This leads to a group version of all your favorites: logistic regression, support vector machine, etc.
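The slides do not specify an algorithm, but one common approach to the group lasso is proximal gradient descent, whose proximal step is group-wise "block soft-thresholding". A rough sketch under that assumption (the data, groups, and iteration count are illustrative):

```python
import numpy as np

def block_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrink the whole block toward 0."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1 - t / norm) * v

def group_lasso(X, Y, groups, lam, n_iter=500):
    """Proximal gradient descent for 0.5*||Y - X beta||^2 + lam * sum_g ||beta_g||_2."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y)               # gradient of the smooth part
        z = beta - step * grad
        for g in groups:                          # the prox is separable across groups
            beta[g] = block_soft_threshold(z[g], step * lam)
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
Y = X[:, :2] @ np.array([2.0, -1.0]) + rng.normal(size=100)
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
print(group_lasso(X, Y, groups, lam=5.0))         # only the first group should be nonzero
```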

Other problems

Why not add an $\ell_1$ penalty to the decision tree problem? Because the original problem is non-convex, so their combination is generally still non-convex... There are tree-like group penalties... Similarly, we would not add an $\ell_1$ penalty to the K-means objective function...

No free lunch

Of course, we must pay some price for all of this, but what price? Well, the lasso produces biased estimates even when it finds the true active variables. But general theory says that we should be using some bias in the form of shrinkage...

Summary

A lot of interesting work in high-dimensional statistics / machine learning over the last few years involves studying problems of the form
$$\hat\beta_\lambda = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} L(\beta) + \lambda P(\beta),$$
where $L$ is a loss function (the support vector loss, logistic loss, squared error loss, etc.) and $P$ is a convex penalty that imparts structure on the solutions. Lots of interesting questions remain... Try STATS315 for a more detailed introduction to the lasso...

Part II: Final review

Overview: before the midterm

General goals of data mining. Datatypes. Preprocessing & dimension reduction. Distances. Multidimensional scaling. Multidimensional arrays. Decision trees. Performance measures for classifiers. Discriminant analysis.

Overview: after the midterm

More classifiers: rule-based classifiers, nearest-neighbour classifiers, naive Bayes classifiers, neural networks, support vector machines, random forests, boosting (AdaBoost / gradient boosting). Clustering. Outlier detection.

Rule-based classifiers

Example rule set:
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Concepts: coverage, accuracy, mutual exclusivity, exhaustivity, Laplace accuracy.

Nearest neighbour classifiers

Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck. Given a test record, compute its distance to the training records, choose the k nearest records, and classify by their labels.
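A minimal sketch of the "compute distances, take the k nearest, majority vote" recipe in NumPy (toy data and an illustrative choice of k):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training records."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest records
    return Counter(y_train[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=5))
```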

If k is too large, the neighbourhood may include points from other classes.

Naive Bayes classifiers

Model:
$$P(Y = c \mid X_1 = x_1, \ldots, X_p = x_p) \propto \left(\prod_{l=1}^{p} P(X_l = x_l \mid Y = c)\right) P(Y = c).$$
For continuous features, typically a 1-dimensional QDA model is used (i.e. Gaussian within each class). For discrete features, use the Laplace smoothed probabilities
$$P(X_j = l \mid Y = c) = \frac{\#\{i : X_{ij} = l, Y_i = c\} + \alpha}{\#\{i : Y_i = c\} + \alpha k}.$$
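A tiny sketch of the Laplace-smoothed estimate for a discrete feature, matching the formula above (the toy data and $\alpha$ are illustrative):

```python
import numpy as np

def laplace_smoothed(X_col, y, value, cls, alpha=1.0):
    """P(X_j = value | Y = cls) with Laplace smoothing; k = number of levels of X_j."""
    k = len(np.unique(X_col))
    num = np.sum((X_col == value) & (y == cls)) + alpha
    den = np.sum(y == cls) + alpha * k
    return num / den

# Toy discrete feature with 3 levels and binary labels
X_col = np.array([0, 1, 2, 1, 0, 2, 2, 1])
y     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(laplace_smoothed(X_col, y, value=2, cls=1, alpha=1.0))  # (2+1)/(4+3) = 3/7
```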

Neural networks

Artificial neural networks (ANN): single-layer and double-layer networks. [Figures of the network architectures not reproduced.]

Support vector machines

Find the hyperplane that maximizes the margin. The SVM solves the problem
$$\mathop{\mathrm{minimize}}_{\beta,\alpha,\xi} \|\beta\|_2 \quad \text{subject to } y_i(x_i^T\beta + \alpha) \ge 1 - \xi_i,\ \xi_i \ge 0,\ \sum_{i=1}^n \xi_i \le C,$$
where the slack variables $\xi_i$ handle non-separable problems. The $\xi_i$'s can be removed from this problem, yielding
$$\mathop{\mathrm{minimize}}_{\beta,\alpha} \|\beta\|_2^2 + \gamma\sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+,$$
where $(z)_+ = \max(z, 0)$ is the positive part function. Or,
$$\mathop{\mathrm{minimize}}_{\beta,\alpha} \sum_{i=1}^n (1 - y_i f_{\alpha,\beta}(x_i))_+ + \lambda\|\beta\|_2^2.$$
[Figure: hinge (SVM) loss vs. logistic loss as a function of the margin $y f(x)$.]
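A short sketch of the penalized hinge-loss formulation using scikit-learn's `LinearSVC` (assumed available), which with `loss="hinge"` minimizes $\tfrac12\|\beta\|_2^2 + C\sum_i(1 - y_i f(x_i))_+$, essentially the second form above with $\gamma$ playing the role of $C$:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
n = 200
X = np.vstack([rng.normal(-1.5, 1, (n // 2, 2)), rng.normal(1.5, 1, (n // 2, 2))])
y = np.array([-1] * (n // 2) + [1] * (n // 2))

# Larger C puts more weight on the hinge loss (less regularization)
clf = LinearSVC(C=1.0, loss="hinge", dual=True).fit(X, y)

# Hinge loss of the fitted classifier, (1 - y f(x))_+, computed by hand
margins = y * (X @ clf.coef_.ravel() + clf.intercept_[0])
print("avg hinge loss:", np.mean(np.maximum(0.0, 1.0 - margins)))
```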

Ensemble methods: bagging / random forests

General idea: take several bootstrap samples (samples with replacement) of the data. For each bootstrap sample $S_b$, $1 \le b \le B$, fit a model, retaining the classifier $\hat f_b$. After all models have been fit, classify by majority vote:
$$\hat f(x) = \text{majority vote of } (\hat f_b(x))_{1 \le b \le B}.$$
We also defined the OOB (out-of-bag) estimate of error.
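A bare-bones bagging sketch with decision trees as the base classifier (scikit-learn assumed available; B, the tree depth, and the data are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 300
X = rng.normal(size=(n, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)            # a non-linear toy problem

B = 25
models = []
for b in range(B):
    idx = rng.integers(0, n, size=n)               # bootstrap sample (with replacement)
    models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

def bagged_predict(x):
    votes = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    return np.bincount(votes).argmax()              # majority vote over the B trees

print(bagged_predict(np.array([1.0, 1.0])), bagged_predict(np.array([-1.0, 1.0])))
```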

Ensemble methods: AdaBoost

[Figure: illustrating AdaBoost — initial weights for each data point, the data points used for training, and the reweighted rounds that follow.]

Ensemble methods: boosting as gradient descent

It turns out that boosting can be thought of as something like gradient descent. In some sense, the boosting algorithm is a steepest descent algorithm to find
$$\mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \sum_{i=1}^n L(y_i, f(x_i)).$$
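To make the functional-gradient-descent view concrete, here is a rough L2-boosting sketch (squared-error loss with regression stumps; the learning rate and number of rounds are illustrative). Each round fits a weak learner to the negative gradient of the loss, which for squared error is just the current residual:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

f = np.zeros(300)                 # current fit f(x_i), start at 0
nu, n_rounds, stumps = 0.1, 100, []
for _ in range(n_rounds):
    residual = y - f              # negative gradient of 0.5*(y - f)^2 w.r.t. f
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    stumps.append(stump)
    f += nu * stump.predict(X)    # take a small step in function space

print("training MSE:", np.mean((y - f) ** 2))
```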

Cluster analysis

What is cluster analysis? Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized, inter-cluster distances are maximized.

Types of clustering:
Partitional: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical: a set of nested clusters organized as a hierarchical tree; each data object is in exactly one subset for any horizontal cut of the tree.
[ESL Figure 14.4: simulated data in the plane, clustered into three classes (orange, blue and green) by the K-means clustering algorithm — a partitional example.]

K-means and the gap statistic

[ESL Figure 14.11: (left panel) observed and expected values of $\log W_K$ for the simulated data of Figure 14.4, both curves translated to equal zero at one cluster; (right panel) the gap curve, equal to the difference between the observed and expected values of $\log W_K$. The gap estimate $K^*$ is the smallest $K$ producing a gap within one standard deviation of the gap at $K+1$; here $K^* = 2$.]

K-medoids

Algorithm: same as K-means, except that the centroid is estimated not by the average but by the observation having minimum total pairwise distance to the other cluster members. Advantage: the centroid is one of the observations, which is useful, e.g., when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations.

Silhouette plot

This gives K = 2, which looks reasonable from Figure 14.4.
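A rough K-medoids sketch that alternates assignment and medoid updates and needs only the pairwise distance matrix, as noted above (K, the data, and the iteration count are illustrative):

```python
import numpy as np

def k_medoids(D, K, n_iter=20, seed=0):
    """Simple alternating K-medoids on a precomputed distance matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=K, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)          # assign to nearest medoid
        for k in range(K):                                  # update each medoid
            members = np.where(labels == k)[0]
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                medoids[k] = members[np.argmin(within)]     # member minimizing total distance
    return medoids, labels

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise Euclidean distances
medoids, labels = k_medoids(D, K=2)
print(medoids, np.bincount(labels))
```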

Hierarchical clustering

From ESL §14.3.12: The results of applying K-means or K-medoids clustering algorithms depend on the choice of the number of clusters to be searched and a starting configuration assignment. In contrast, hierarchical clustering methods do not require such specifications. Instead, they require the user to specify a measure of dissimilarity between (disjoint) groups of observations, based on the pairwise dissimilarities among the observations in the two groups. As the name suggests, they produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level, each cluster contains a single observation; at the highest level there is only one cluster containing all of the data. Strategies for hierarchical clustering divide into two basic paradigms: agglomerative (bottom-up) and divisive (top-down). Agglomerative strategies start at the bottom and at each level recursively merge a selected pair of clusters into a single cluster, producing a grouping at the next higher level with one less cluster; the pair chosen for merging consists of the two groups with the smallest intergroup dissimilarity. Divisive methods start at the top and at each level recursively split one of the existing clusters into two.
[ESL Figure 14.12: dendrogram from agglomerative hierarchical clustering with average linkage applied to the human tumor microarray data — a hierarchical example.]

Concepts: top-down vs. bottom-up; different linkages: single linkage (minimum distance), complete linkage (maximum distance).

Mixture models

Similar to K-means, but the assignment to clusters is soft. Often applied with the multivariate normal as the model within classes. The EM algorithm is used to fit the model: estimate responsibilities; then estimate within-class parameters, replacing the (unobserved) labels with responsibilities.

Model-based clustering

1. Choose a type of mixture model (e.g. multivariate normal) and a maximum number of clusters, K.
2. Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3. Use the clusters from the previous step to initialize EM for the mixture model.
4. Use BIC to compare different mixture models and models with different numbers of clusters (a sketch of this step follows below).
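A brief sketch of the BIC-based model selection step using scikit-learn's `GaussianMixture` (assumed available); the responsibilities returned by `predict_proba` are the soft assignments estimated by EM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# Fit mixtures with K = 1..5 components and compare by BIC (lower is better)
fits = {K: GaussianMixture(n_components=K, random_state=0).fit(X) for K in range(1, 6)}
bics = {K: m.bic(X) for K, m in fits.items()}
best_K = min(bics, key=bics.get)
print("BIC by K:", {K: round(v, 1) for K, v in bics.items()}, "-> chose K =", best_K)

# Soft assignments (responsibilities) from the EM fit with the chosen K
resp = fits[best_K].predict_proba(X)
print(resp[:3].round(3))
```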

Outliers

General steps: build a profile of the normal behaviour; then use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile. General types of schemes involve a statistical model of "normal", with "far" measured in terms of likelihood. Example: Grubbs' test chooses an outlier threshold to control the Type I error of any declared outliers if the data actually does follow the model...
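A small sketch of the two-sided Grubbs' test under the usual normality assumption, using the standard t-based critical value (α and the data are illustrative):

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier, assuming x ~ Normal under H0."""
    x = np.asarray(x, dtype=float)
    N = x.size
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)         # test statistic
    t = stats.t.ppf(1 - alpha / (2 * N), N - 2)               # t critical value
    G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))
    return G, G_crit, G > G_crit                              # reject H0 (flag outlier) if G > G_crit

x = np.r_[np.random.default_rng(11).normal(size=30), 6.0]    # one injected outlier
print(grubbs_test(x))
```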