Lectures in AstroStatistics: Topics in Machine Learning for Astronomers


Lectures in AstroStatistics: Topics in Machine Learning for Astronomers
Jessi Cisewski, Yale University
American Astronomical Society Meeting, Wednesday, January 6, 2016

Statistical Learning: learning from data. We'll discuss some methods for classification and clustering today. Good references:

Co-Chairs: Shirley Ho (CMU, Cosmology) and Chad Schafer (CMU, Statistics). More info at http://www.scma6.org

Statistical and Applied Mathematical Sciences Institute (SAMSI) 2016-17 Program on Statistical, Mathematical and Computational Methods for Astronomy (ASTRO). Opening Workshop: August 22-26, 2016. Current list of proposed Working Groups:
1. Uncertainty Quantification and Reduced Order Modeling in Gravitation, Astrophysics, and Cosmology
2. Synoptic Time Domain Surveys
3. Time Series Analysis for Exoplanets & Gravitational Waves: Beyond Stationary Gaussian Processes
4. Population Modeling & Signal Separation for Exoplanets & Gravitational Waves
5. Statistics, Computation, and Modeling in Cosmology
More info at http://www.samsi.info/workshop/2016-17-astronomy-opening-workshop-august-22-26-2016

Classification: use a priori group labels in the analysis to assign new observations to a particular group or class. Also called supervised learning, or "learning with labels." Data: $X = \{X_1, X_2, \ldots, X_n\} \subset \mathbb{R}^p$, with labels $Y = \{y_1, y_2, \ldots, y_n\}$. Example: stars can be classified into spectral labels $Y = \{O, B, A, F, G, K, M, L, T, Y\}$ using features $X = \{\text{temperature}, \text{mass}, \text{hydrogen lines}, \ldots\}$.

Classification rules

Classification: evaluating performance. Training error rate: the fraction of misclassified observations in a sample of size $n$, $\frac{1}{n}\sum_{i=1}^{n} I(\hat{y}_i \neq y_i)$, where $\hat{y}_i$ is the predicted class for observation $i$ and $I$ is the indicator function. The test error rate is more important than the training error; it can be estimated using cross-validation. Class imbalance: a strong imbalance in the number of observations across the classes can result in misleading performance measures.
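A minimal sketch of these error estimates in Python with scikit-learn (the two-feature dataset and the choice of logistic regression as the classifier are hypothetical, purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # 200 observations, p = 2 features (simulated)
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # two classes

clf = LogisticRegression().fit(X, y)
train_error = np.mean(clf.predict(X) != y)               # training error rate, (1/n) sum I(yhat_i != y_i)
cv_error = 1 - cross_val_score(clf, X, y, cv=5).mean()   # 5-fold cross-validation estimate of the test error rate
print(train_error, cv_error)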

Bayes Classification Rule: the test error is minimized by assigning an observation with predictors $x$ to the class with the largest posterior probability: for classes $j = 1, \ldots, J$, choose $\arg\max_j P(Y = j \mid X = x)$. In general this is intractable because the distribution of $Y \mid X$ is unknown.

K Nearest Neighbors (KNN). Main idea: an observation is classified based on the K observations in the training set that are nearest to it. The probability of each class can be estimated by $\hat{P}(Y = j \mid X = x) = \frac{1}{K} \sum_{i \in N(x)} I(y_i = j)$, where $j = 1, \ldots, J$ indexes the classes in the training set, $N(x)$ is the set of the K nearest neighbors of $x$, and $I$ is the indicator function. In the figure, the K = 3 nearest neighbors of the point X lie within the circle; the predicted class of X would be blue because blue observations outnumber green among the 3 nearest neighbors.
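A minimal sketch of the KNN class-probability estimate in Python with scikit-learn (the training data and the query point are hypothetical):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] > 0).astype(int)   # two classes, say 0 = "green", 1 = "blue"

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, as in the figure
knn.fit(X_train, y_train)

x_new = np.array([[0.2, -0.1]])
print(knn.predict_proba(x_new))   # fraction of the 3 nearest neighbors in each class, (1/K) sum I(y_i = j)
print(knn.predict(x_new))         # predicted class = majority vote among the 3 nearest neighbors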

Linear Classifiers: the decision boundary is linear. If p = 2 the class boundary is a line (p = 3: a plane; p > 3: a hyperplane). Examples: logistic regression, Linear Discriminant Analysis (Quadratic Discriminant Analysis instead gives a quadratic boundary). Image: http://fouryears.eu/2009/02/
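A minimal sketch comparing two linear classifiers in Python with scikit-learn (the two-class, two-feature dataset is hypothetical); with p = 2, each fitted boundary is the line w0 + w1*x1 + w2*x2 = 0:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-1.0, size=(100, 2)),
               rng.normal(loc=+1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

logit = LogisticRegression().fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)

print(logit.intercept_, logit.coef_)   # coefficients of the logistic-regression boundary
print(lda.intercept_, lda.coef_)       # coefficients of the LDA boundary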


Support Vector Machines. Goal: find the hyperplane that best separates the two classes (i.e., maximizes the margin between the classes). If the data are not linearly separable, one can use the kernel trick, which implicitly transforms the data to a higher-dimensional feature space. Images: http://en.wikipedia.org, http://stackoverflow.com/questions/9480605/
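A minimal sketch of maximum-margin classification and the kernel trick in Python with scikit-learn (the datasets are simulated for illustration):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Linearly separable case: a linear SVM maximizes the margin between the classes.
rng = np.random.default_rng(3)
X_lin = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)), rng.normal(2.0, 1.0, size=(50, 2))])
y_lin = np.repeat([0, 1], 50)
svm_linear = SVC(kernel="linear").fit(X_lin, y_lin)

# Not linearly separable (concentric circles): an RBF kernel implicitly maps the data
# to a higher-dimensional feature space where a separating hyperplane exists.
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_circ, y_circ)
print(svm_linear.score(X_lin, y_lin), svm_rbf.score(X_circ, y_circ))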

Classification Trees (CART = Classification and Regression Trees):
1. The predictor space is partitioned into hyper-rectangles.
2. All observations in a given hyper-rectangle are predicted to have the same label.
3. Splits are chosen to maximize the purity of the hyper-rectangles.
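A minimal sketch of a CART-style classification tree in Python with scikit-learn (the labels are a hypothetical function of two simulated features):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)

# Splits are chosen to maximize purity (here, to minimize Gini impurity);
# each leaf is a hyper-rectangle whose observations all receive the same label.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree))   # text view of the splits that define the rectangles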

Classification Trees - remarks. Tree-based methods are typically not the best classifiers in terms of prediction accuracy, but they are often more easily interpreted (James et al. 2013). Tree pruning: the classification tree may be overfit, or too complex; pruning removes portions of the tree that are not useful for the classification goals. Bootstrap aggregation (aka "bagging"): classification trees have high variance, and bagging (averaging over many trees grown on bootstrap resamples) provides a means of variance reduction. Random forest: similar idea to bagging, except it incorporates a step that helps to decorrelate the trees.
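A minimal sketch contrasting a single depth-limited tree, bagging, and a random forest in Python with scikit-learn (the simulated dataset is hypothetical); bagging averages trees grown on bootstrap resamples, and the random forest additionally decorrelates the trees by restricting each split to a random subset of features:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

single_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=0)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

for name, clf in [("single tree", single_tree), ("bagging", bagged), ("random forest", forest)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())   # CV accuracy; averaging many trees usually helps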

Clustering: find subtypes or groups that are not defined a priori, based on the measurements. Also called unsupervised learning, or "learning without labels." Data: $X = \{X_1, X_2, \ldots, X_n\} \subset \mathbb{R}^p$. Examples: galaxy clustering; bump hunting (e.g., a statistically significant excess of gamma-ray emission compared to the background; Geringer-Sameth et al., 2015). Image: Li and Henning (2011)

K-means clustering. Main idea: partition the observations into K separate, non-overlapping clusters. Goal: minimize the total within-cluster scatter, $\sum_{k=1}^{K} |C_k| \sum_{C(i)=k} \| X_i - \bar{X}_k \|^2$, where $|C_k|$ is the number of observations in cluster $C_k$ and $\bar{X}_k = (\bar{X}_{k1}, \ldots, \bar{X}_{kp})$ is the mean of the observations in cluster $k$.
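A minimal sketch of K-means in Python with scikit-learn on simulated data (three hypothetical Gaussian blobs); Lloyd's algorithm alternates between assigning points to the nearest center and recomputing each center as the mean of its cluster, driving down the within-cluster sum of squares:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3.0, 0.0, 3.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])         # hard (strict) cluster assignments
print(km.cluster_centers_)     # the centers are cluster means
print(km.inertia_)             # total within-cluster sum of squares at the solution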


K-means clustering - comments. Cluster assignments are strict: there is no notion of degree or strength of cluster membership. The method is not robust to outliers. The centers may lack interpretability, since centers are averages: what if the observations are images of faces? Images: http://cdn1.thefamouspeople.com, http://www.notablebiographies.com, http://mrnussbaum.com, http://3.bp.blogspot.com

Hierarchical clustering generates a hierarchy of partitions, and the user selects a partition: $P_1$ = 1 cluster, ..., $P_n$ = n clusters (agglomerative clustering). Partition $P_i$ is the union of one or more clusters from partition $P_{i+1}$.

Single-linkage clustering

Hierarchical clustering - distances:
1. Single-linkage clustering: the intergroup distance is the smallest possible distance, $d(C_k, C_{k'}) = \min_{x \in C_k,\, y \in C_{k'}} d(x, y)$.
2. Complete-linkage clustering: the intergroup distance is the largest possible distance, $d(C_k, C_{k'}) = \max_{x \in C_k,\, y \in C_{k'}} d(x, y)$.
3. Average-linkage clustering: the average intergroup distance, $d(C_k, C_{k'}) = \mathrm{Ave}_{x \in C_k,\, y \in C_{k'}}\, d(x, y)$.
4. Ward's clustering: $d(C_k, C_{k'}) = \dfrac{2\,|C_k|\,|C_{k'}|}{|C_k| + |C_{k'}|}\, \| \bar{X}_{C_k} - \bar{X}_{C_{k'}} \|^2$.
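A minimal sketch of agglomerative clustering in Python with SciPy, comparing the linkage rules above (the one-dimensional, two-group dataset is simulated for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(6.0, 1.0, 50)]).reshape(-1, 1)

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                     # the full hierarchy of partitions P_n, ..., P_1
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the hierarchy at 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes under each linkage rule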


K = 4 clusters

Statistical clustering:
1. Parametric: associates a specific model with the density (e.g., Gaussian, Poisson); the dataset is modeled as a mixture of these distributions, with parameters associated with each cluster.
2. Nonparametric: looks at contours of the density (e.g., a kernel density estimate) to find cluster information.
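A minimal sketch of both flavors on simulated one-dimensional data (two hypothetical Gaussian components): a parametric Gaussian mixture fit with scikit-learn, and a nonparametric kernel density estimate with SciPy whose local modes suggest clusters:

import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 0.7, 100)])

# Parametric: a two-component Gaussian mixture, with parameters for each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))
print(gmm.means_.ravel(), gmm.weights_)

# Nonparametric: kernel density estimate; local maxima of the density indicate clusters.
kde = gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 500)
density = kde(grid)
modes = grid[1:-1][(density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])]
print(modes)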

How many clusters are there? JS Marron (UNC) uses the Hidalgo Stamps Data to illustrate why histograms should not be used: "The main points are illustrated by the Hidalgo Stamps Data, brought to the statistical literature by Izenman and Sommer (1988), Journal of the American Statistical Association, 83, 941-953. They are thicknesses of a type of postage stamp that was printed over a long period of time in Mexico during the 19th century. The thicknesses are quite variable, and the idea is to gain insights about the number of different factories that were producing the paper for this stamp over time, by finding clusters in the thicknesses." http://www.stat.unc.edu/faculty/marron/dataanalyses/sizer/sizer_basics.html

Changing the bin width dramatically alters the number of peaks. Images: JS Marron
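A minimal sketch of the bin-width effect in Python with NumPy: the same simulated bimodal sample (a rough stand-in for stamp-thickness data, not the real Hidalgo measurements) shows different numbers of histogram peaks as the number of bins changes:

import numpy as np

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(0.07, 0.003, 300), rng.normal(0.08, 0.003, 200)])

def n_peaks(sample, bins):
    counts, _ = np.histogram(sample, bins=bins)
    interior = counts[1:-1]
    return int(np.sum((interior > counts[:-2]) & (interior >= counts[2:])))   # local maxima of the bin counts

for bins in (5, 15, 50, 200):
    print(bins, "bins ->", n_peaks(x, bins), "peaks")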

These two histograms use the same bin width, but the second is slightly right-shifted. Are there seven modes (left) or two modes (right)? See a movie version of the shifting issue here: http://www.stat.unc.edu/faculty/marron/dataanalyses/sizer/stampshistloc.mpg. Images: JS Marron

Clustering - some final comments. SiZer (Significance of Zero Crossings of the Derivative) finds statistically significant peaks: http://www.unc.edu/~marron/dataanalyses/sizer/sizer_basics.html. Nonparametric inference for density modes (Genovese et al., 2015). Density ridges / filament finders (Chen et al., 2015b,a). Image: Yen-Chi Chen (http://www.stat.cmu.edu/~yenchic/research.html)

Concluding Remarks
Classification (supervised / with labels) - predict classes:
1. KNN
2. Logistic regression
3. LDA/QDA
4. Support Vector Machines
5. Tree classifiers
Clustering (unsupervised / no labels) - find structure:
1. K-means
2. Hierarchical clustering
3. Parametric/nonparametric statistical clustering
Clustering and classification are useful tools, but one should be familiar with the assumptions associated with the method selected.

Bibliography
Chen, Y.-C., Ho, S., Brinkmann, J., Freeman, P. E., Genovese, C. R., Schneider, D. P., and Wasserman, L. (2015a), "Cosmic Web Reconstruction through Density Ridges: Catalogue," arXiv preprint arXiv:1509.06443.
Chen, Y.-C., Ho, S., Freeman, P. E., Genovese, C. R., and Wasserman, L. (2015b), "Cosmic Web Reconstruction through Density Ridges: Method and Algorithm," arXiv preprint arXiv:1501.05303.
Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. (2015), "Non-parametric Inference for Density Modes," Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Geringer-Sameth, A., Walker, M. G., Koushiappas, S. M., Koposov, S. E., Belokurov, V., Torrealba, G., and Evans, N. W. (2015), "Indication of Gamma-ray Emission from the Newly Discovered Dwarf Galaxy Reticulum II," Physical Review Letters, 115, 081101.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning with Applications in R, vol. 1 of Springer Texts in Statistics, Springer.
Li, H.-b. and Henning, T. (2011), "The Alignment of Molecular Cloud Magnetic Fields with the Spiral Arms in M33," Nature, 479, 499-501.