Star-Galaxy Separation in the Era of Precision Cosmology


Michael Baumer, Noah Kurinsky, and Max Zimet
Stanford University, Department of Physics, Stanford, CA 94305
November 14, 2014

Introduction

To anyone who has seen the beautiful images of the intricate structure in nearby galaxies, distinguishing between stars and galaxies might seem like an easy problem. However, modern cosmological surveys such as the Dark Energy Survey (DES) [3] are primarily interested in observing as many distant galaxies as possible, as these provide the most useful data for constraining cosmology (the history of structure formation in the universe). At such vast distances, both stars and galaxies begin to look like low-resolution point sources, making it difficult to isolate a sample of galaxies (containing interesting cosmological information) from intervening dim stars in our own galaxy. This challenge, known as star-galaxy separation, is a crucial step in any cosmological survey. Performing this classification more accurately can greatly improve the precision of scientific insights extracted from massive galaxy surveys like DES.

The most widely used methods for star-galaxy separation include class_star, a legacy classifier with a limited feature set, and spread_model, a linear discriminant based on weighted object size [1]. The performance of both methods is insufficient for the demands of modern precision cosmology, which is why the application of machine learning techniques to this problem has attracted recent interest in the cosmology community [6]. Since data on the real-world performance of such methods has yet to be published in the literature, we use these two classifiers as benchmarks for assessing the performance of our models.

Figure 1: In the top false-color image from DES, it is easy to tell that the large object in the foreground, NGC 1398, is a galaxy, but what about all the point sources behind it? In the bottom images of our actual data, we see sky maps of stars (right) and galaxies (left). The voids seen in the map of galaxies are regions of the sky which are blocked by bright nearby stars.

Data and Feature Selection

We use a matched catalog of objects observed by both DES, a ground-based observatory, and the Hubble Space Telescope (HST) [4] over 1 square degree of the sky. The true classifications (star or galaxy) are binary labels computed from the more detailed HST data. The catalog contains 222,000 sources, of which 70% (≈150,000) are used for training and 30% (≈70,000) are randomly reserved for cross-validation. From the 226 variables available in the catalog, we selected 45 which appeared to have discriminating power based on plots like those shown in Figure 2. Features include the surface brightnesses, astronomical magnitudes, and sizes of all objects as observed through 5 different color filters (near-ultraviolet to near-infrared). We also use goodness-of-fit statistics for stellar and galactic luminosity profiles (which describe the fall-off of brightness from the centers of objects).

Figure 2: A selection of catalog variables (before preprocessing) selected as input features to our models. (a) log χ² for a star-like luminosity profile vs. a galaxy-like profile. (b) log(size) (full-width at half-maximum) vs. infrared magnitude. (c) Histograms of mean surface brightness.

As a preprocessing step, we transformed all variables to have a mean of 0 and a variance of 1. For certain variables, namely chi-squares and object sizes, we used their logarithm rather than their raw value, to increase their discriminating power and make each feature more Gaussian (see panels (a) and (b) in Figure 2).

In performing feature selection, it was important to us that features were included in physically motivated groups. After making the plot of learning success vs. included features shown in Figure 3, we implemented a backwards search, using a linear SVM, to determine the relative power of each variable. We were surprised to find that our less promising parameters (χ² and surface brightness) were often more powerful than magnitudes in certain color bands, and considered removing some magnitudes from our features. We ultimately decided against this, however, since we wanted to select features uniformly across all color bands to maintain a physically motivated feature set.

Methods and Results

To determine which method would obtain the best discrimination, we initially ran a handful of basic algorithms with default parameters and compared their results. Given that we had continuous inputs and a binary output, we employed logistic regression (LR), Gaussian discriminant analysis (GDA), linear- and Gaussian-kernel SVMs, and Gaussian Naive Bayes (GNB) as classifiers. We implemented our models in Python, using the sklearn module [5].

Figure 3: A plot of training success vs. features used for four of our models.
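The baseline comparison described above can be sketched as follows. This is a toy illustration on synthetic, imbalanced data rather than the DES/HST catalog, and the estimator choices mirror the text (default-parameter sklearn classifiers after standardization); it is not the paper's actual pipeline.

```python
# Sketch of the initial classifier comparison: standardize features,
# hold out 30% for validation, and score several default classifiers.
# Data here is synthetic and illustrative; the real catalog differs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC, SVC

# Imbalanced toy data: ~80% "galaxies", ~20% "stars".
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize to zero mean, unit variance (fit on training data only).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "LR": LogisticRegression(),
    "GDA": QuadraticDiscriminantAnalysis(),
    "GNB": GaussianNB(),
    "LinSVM": LinearSVC(),
    "GaussianSVM": SVC(kernel="rbf"),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

On imbalanced data, raw accuracy like this can be misleadingly high, which is exactly why the confusion matrices in Table 2 matter.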
    Method         1 − ε̂ (train)   1 − ε̂ (test)
    GDA + SMOTE        66.8%           91.4%
    GNB + SMOTE        70.5%           91.4%
    LR                 86.8%           86.4%
    LinSVM             89.4%           89.0%
    GaussianSVM        95.4%           94.2%

Table 1: Training and test success rates (1 − ε̂) for multiple supervised learning methods.

We opted to employ the l1-regularized linear SVM classifier LinearSVC (l2 regularization gave comparable results), the Naive Bayes algorithm GaussianNB, the logistic regression method SGDClassifier(loss="log"), and the GDA algorithm, using class weighting where implemented (namely, for SVM and LR) to compensate for the much larger number of galaxies than stars in our training sample. The results of these initial classifications can be seen in Table 1, and the confusion matrices from these runs can be seen in Table 2.

                 GDA + SMOTE    GNB + SMOTE        LR          LinSVM      GaussianSVM
                   G      S       G      S       G      S      G      S      G      S
    True G       95.5%   4.5%   95.0%   5.0%   87.3%  12.7%  90.2%   9.8%  95.2%   4.8%
    True S       59.8%  40.2%   52.3%  47.7%   19.7%  80.3%  19.8%  80.2%  17.8%  82.2%

Table 2: Confusion matrices produced on our test set for different supervised learning techniques, showing the true/false positive rates for galaxies in the top row and the true/false negative rates for stars in the bottom row of each entry.

The statistics in Table 1 show that we have sufficient training data for the LR, GDA, and SVM methods, since our test error is very similar to our training error. From here we proceeded with Naive Bayes/GDA and SVM, in order to continue our search for a successful generative model and to optimize our best discriminative algorithm.

Gaussian Naive Bayes and GDA with Bootstrapping and SMOTE

We first describe the generative algorithms we applied. Eq. (1) shows the probability distribution assumed by the Gaussian Naive Bayes classifier, where x_i is the value of the i-th feature and y is the class (star or galaxy) of the object under consideration:

    p(x | y) = ∏_{i=1}^{n} p(x_i | y),   with   x_i | y ~ N(μ_{i,y}, σ²_{i,y})    (1)

This makes stronger assumptions than the quadratic Gaussian discriminant analysis model, whose assumed class-conditional distribution is described in Eq. (2):

    x | y ~ N(μ_y, Σ_y).    (2)

In fact, this shows that the GNB model is equivalent to the quadratic GDA model if we require in the latter that the covariance matrices Σ_y be diagonal.

To correct for the effects of our imbalanced data set, we first applied a bootstrapping technique, whereby we sampled from our set of galaxies and trained on this sample plus the set of all of our stars; the intent was to take 50 such samples and average the results of the different classifiers. However, each individual classifier still performed poorly, misclassifying most stars, unless we heavily undersampled the galaxies (that is, chose far fewer galaxies than stars for the new data set), in which case the classifiers misclassified most galaxies.

We also performed synthetic oversampling of stars using the Synthetic Minority Over-sampling TechniquE (SMOTE) algorithm [2]. The idea of SMOTE is to create new training examples which look like the stars in our data set. Specifically, given a training example with features x that is a star, we choose, uniformly at random, one of the 5 nearest stars in the training set (where "nearest" is measured by the Euclidean l2 norm on the space of our normalized features). Denoting this chosen neighbor by x′, we then create a new training example at a random location along the line segment connecting x and x′.

Figure 4: Training success vs. features used for data with equal numbers of stars and galaxies, produced using the SMOTE algorithm.

After oversampling our population of stars with SMOTE to obtain equal numbers of stars and galaxies, the training errors for Gaussian Naive Bayes and GDA suffer, as they are no longer able to succeed by classifying almost everything as a galaxy. Given that these generative models fail to classify a balanced dataset well, we conclude that our data does not satisfy their respective assumptions sufficiently to warrant their use. This is not very surprising, as generative learning models make significant assumptions. As we will see next, the discriminative algorithms we used performed much better.
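The SMOTE step described above can be sketched as follows. This is a minimal, illustrative implementation of the neighbor-interpolation idea (random star, one of its 5 nearest star neighbors, a random point on the connecting segment), not the reference implementation of [2]; the toy arrays stand in for the normalized catalog features.

```python
# Minimal SMOTE-style oversampling sketch: synthesize new minority-class
# ("star") examples by interpolating between a star and one of its
# k nearest star neighbors in feature space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority-class examples."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_minority))              # a random star
        neighbor = X_minority[rng.choice(idx[j][1:])]  # one of its k NNs
        t = rng.random()                               # position on segment
        synthetic[i] = X_minority[j] + t * (neighbor - X_minority[j])
    return synthetic

# Oversample a toy star population of 50 up to 500 total examples.
rng = np.random.default_rng(1)
stars = rng.normal(0.0, 1.0, size=(50, 3))
new_stars = smote(stars, n_new=450)
balanced_stars = np.vstack([stars, new_stars])
```

Because each synthetic point is a convex combination of two real stars, the oversampled set stays inside the original star population's envelope rather than adding noise outside it.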

Figure 5: Results of two rounds of Gaussian SVM optimization, varying the parameters C and γ logarithmically. The leftmost plot shows C ∈ [0.1, 10.0] and γ ∈ [0, 10⁻³], while the middle and right plots show C ∈ [10.0, 1000.0] and γ ∈ [10⁻⁵, 0.01]. The left two plots show training versus test success percentages, while the rightmost plot compares success for stars and galaxies over the range of our better-performing grid search.

Grid-Search Optimized SVM with Gaussian Kernel, L1-Regularization

Given the good performance of the linear SVM and a preliminary run of the SVM with Gaussian kernel, we decided to optimize the l1-regularized SVM by performing a grid search across the parameter space C ∈ [10⁻³, 10³] and γ ∈ [10⁻⁵, 1], spaced logarithmically in powers of 10.¹ We employed the SVC method of sklearn, a Python wrapper for libsvm, which provides automatic class weighting to counteract our unbalanced training sample. We used the barley batch server to run these grid samples in parallel, and had SVC record the training error, test error, and confusion matrix for each model. The resulting scores from these optimizations can be seen in Figure 5.

Our model choice was based on the requirements of lowest generalization error, highest test success, and best separation for stars, since a classifier which labeled every point a galaxy would have good overall test performance but perform no source separation, and would thus perform poorly on stars. In Figure 5, this selection was performed by choosing, first, the models with the highest test success; then, among those, the models closest to their corresponding training success rates, to minimize overfitting.

¹ At the poster session, our TAs suggested importance sampling or random sampling as alternatives to a grid search. Given the average run time of 1 to 10 hours per optimization, and the fact that importance sampling is an iterative process, we opted for grid search to reduce computing time. It can be seen in our figures that our optimization is fairly convex, so the grid search appears to be sufficiently accurate. See also csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Figure 6: Histogram of the margin from the optimal separating plane (as determined by our Gaussian SVM) of each training example; a negative margin corresponds to a galaxy, positive to a star.

Then, among this subset of models, we chose the one which best separated stars without compromising galaxy separation performance. The sum of these considerations led us to select the model with C = 100.0, γ = 0.01, which has the confusion matrix shown in Table 2 and an overall test success rate of 94.2%. Out of all methods tested, this SVM had the highest generalized success rate and the best performance separating stars, at 82% success. The distance of each example from the separating hyperplane of the SVM is plotted as a histogram, separately for stars and galaxies, in Figure 6.
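The grid search described above can be sketched with scikit-learn's GridSearchCV, in place of the paper's manually batch-parallelized runs. The data and the (much smaller) grid below are illustrative; class_weight="balanced" stands in for SVC's automatic class weighting.

```python
# Sketch of a logarithmic grid search over the RBF-SVM parameters
# C and gamma, with class reweighting for the imbalanced classes.
# Toy data and a coarse grid; the paper's grid was far larger.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5,
                           weights=[0.85, 0.15], random_state=0)

param_grid = {
    "C": np.logspace(-1, 2, 4),      # 0.1 ... 100, powers of 10
    "gamma": np.logspace(-3, 0, 4),  # 0.001 ... 1, powers of 10
}
search = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```

GridSearchCV also exposes `cv_results_`, which holds the per-point scores one would plot to produce a figure like Figure 5.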

The factor which most limits the star separation performance seems to be the multimodal distribution of stars in our feature space, one mode of which overlaps heavily with the galaxies. It is interesting to note the highly Gaussian distribution of galaxy margins from the plane, indicating that the galaxies were better modeled and overall had more continuous properties.

Discussion

Since we determined that the generative models we attempted were not appropriate for our dataset, we are left with three successful discriminative models: logistic regression, linear SVM, and Gaussian SVM. We used a Receiver Operating Characteristic (ROC) curve to characterize the performance of our benchmark models, class_star and spread_model, since they both give continuous output scores. Since our three models all give binary outputs, they appear as points on the ROC plot, as shown in Figure 7. We see that all three of our models significantly outperform the two benchmarks!

As noted in the previous section, the multimodal distribution of stars was the limiting factor in our ability to separate these object classes. Regardless of the model, we rarely saw a true star rate greater than about 80%, suggesting that this confounding class of stars comprises on average 20% of observed objects and may be worth modeling. Given the high dimensionality of our features, it was not obvious which feature might set these stars apart, but it is clear that they are confounded with similar galaxies, while galaxies do not show similar substructure.

Conclusions and Future Work

We have shown that machine learning techniques are remarkably successful in addressing the challenges of star-galaxy separation for modern cosmology. Though the assumptions of our generative models, GDA and Naive Bayes, were not borne out in the data, causing them to perform poorly, we had success with logistic regression and SVMs; the largest challenge lay in feature standardization and SVM optimization.
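Because our classifiers emit hard labels, each one occupies a single point in ROC space rather than tracing a curve. A minimal sketch of computing that point (false-positive rate, true-positive rate) from predicted labels, using toy arrays for illustration:

```python
# Place a hard-label binary classifier as a single point in ROC space
# by computing its false- and true-positive rates from a confusion
# matrix. Labels below are toy values (1 = galaxy, 0 = star).
import numpy as np

def roc_point(y_true, y_pred):
    """Return (false_positive_rate, true_positive_rate)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    return fp / (fp + tn), tp / (tp + fn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
fpr, tpr = roc_point(y_true, y_pred)
```

A continuous-score benchmark such as spread_model sweeps a threshold to trace out a full curve of such points, which is how the comparison in Figure 7 is read.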
Our best model, the Gaussian SVM, achieved very good performance, classifying 95.2% of true galaxies correctly while achieving 82.2% accuracy in classifying true stars, surpassing both of our benchmark classifiers.

Figure 7: An ROC curve for our best three supervised learning methods compared to benchmark classifiers from the literature.

The distribution of stars near the SVM decision boundary indicates that there might be a class of stars that is not properly modeled by our current feature set. This motivates the future work of seeking out other astronomical catalogs with more or different features to enable better modeling of our stellar population. In addition, though we chose to focus on algorithms discussed in this course, deep learning also has great potential for improving star-galaxy separation. Such algorithms are the focus of some bleeding-edge cosmology research [6], though their performance on current-generation survey data has yet to be published.

References

[1] Bertin & Arnouts (1996). SExtractor: Software for Source Extraction. A&AS: 117, 393.
[2] Chawla et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. JAIR: 16, pp. 321-357.
[3] DES Collaboration (2005). The Dark Energy Survey. arXiv:astro-ph/0510346.
[4] Koekemoer et al. (2007). The COSMOS Survey: Hubble Space Telescope Advanced Camera for Surveys Observations and Data Processing. ApJS: 172, 196.
[5] Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. JMLR: 12, pp. 2825-2830.
[6] Soumagnac et al. (2013). Star/galaxy separation at faint magnitudes: Application to a simulated Dark Energy Survey. arXiv:1306.5236.