MACHINE LEARNING FOR CLUSTER-GALAXY CLASSIFICATION

Silvia de Castro García
Supervisors: Dr. Ricardo Pérez Martínez, Dr. Ana María Pérez García
16/03/2018

INTRODUCTION: Context
Galaxy clusters are giant cosmic laboratories harboring thousands of objects with different origins and characteristics. It is commonly accepted that the evolution of galaxies within clusters differs from that in the field, although the main processes are still poorly understood. Key to a full characterization of these objects in such high-density environments is a comprehensive study of a coherent set of clusters, using a wide variety of photometric data from different space observatories and optical surveys from ground-based telescopes.
Figure: galaxy cluster SDSS J1044+4112

INTRODUCTION: Problem
However, the current, limited classification techniques do not scale appropriately with the vast volume of data and data formats available.

INTRODUCTION: Solution
Apply machine learning techniques (both supervised and unsupervised learning) to multi-wavelength datasets in order to classify cluster galaxies efficiently.

INTRODUCTION: Science Case Objective
Cluster membership determination: develop a fast photo-z estimator able to establish memberships with accuracy comparable to spectroscopic redshifts.

BACKGROUND
Machine learning techniques are starting to be widely used in astronomy. Several works address photometric redshift estimation in different domains:
- "Cooperative photometric redshift estimation", S. Cavuoti+ 2017
- "Metaphor: a ML based method for the probability density estimation of photometric redshifts", S. Cavuoti+ 2017
- "Mapping the galaxy color-redshift relation: optimal photometric redshift calibration strategies for cosmology surveys", D. Masters+ 2015
- "Photometric redshifts for quasars in multi-band surveys", M. Brescia+ 2013

THE DATA
Multi-wavelength photometric catalogue of cluster ZwCl0024+1652 produced by Pérez Martínez et al. (2016), combining data from 7 different catalogues:
- XMM-Newton and Chandra catalogues for X-ray data;
- GALEX for ultraviolet data;
- Moran et al. (2005) catalogue of optical/NIR information, including HST and ground-based broad-band data (from the CFHT and Hale 200-inch telescopes);
- IRAC and MIPS data from Spitzer;
- PACS and SPIRE data from Herschel.
19670 sources
1262 cluster members
32 photometric points

THE TOOL
WEKA (Waikato Environment for Knowledge Analysis)
https://www.cs.waikato.ac.nz/ml/weka/witten_et_al_2016_appendix.pdf
WEKA is a data mining framework providing state-of-the-art techniques in machine learning.
Figure: Weka GUI Explorer and visualization
Advantages:
- Easy to use: GUI available
- Highly portable: written in Java
- Wide set of ML techniques, including data preprocessing, classification, regression, clustering, association rules and visualization capabilities
- Open source: GNU General Public License
Drawbacks:
- Dedicated file format (*.arff); not FITS compatible
- Not widely used in astronomy, so few use cases are available
- Models cannot be trained on large data sets from the Weka Explorer GUI, although the owners claim it should be possible with the CLI (further work for Big Data shall be explored)
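Because WEKA reads ARFF rather than FITS, a conversion step is needed before any catalogue can be loaded. The sketch below shows only the ARFF-writing side, using the basic ARFF layout (`@RELATION`, `@ATTRIBUTE ... NUMERIC`, `@DATA`, `?` for missing values). The function name `write_arff`, the relation name, the column names and the values are all hypothetical, and reading the FITS table itself is left out.

```python
import io


def write_arff(relation, attributes, rows, out):
    """Write numeric columns as a minimal ARFF document to a text stream."""
    out.write(f"@RELATION {relation}\n\n")
    for name in attributes:
        out.write(f"@ATTRIBUTE {name} NUMERIC\n")
    out.write("\n@DATA\n")
    for row in rows:
        # ARFF marks a missing value with '?'
        out.write(",".join("?" if v is None else repr(float(v)) for v in row) + "\n")


# Hypothetical columns and values, for illustration only
buf = io.StringIO()
write_arff(
    "cluster_photometry",
    ["mag_B", "mag_V", "z_spec"],
    [(22.1, 21.4, 0.395), (23.0, None, 0.41)],
    buf,
)
arff_text = buf.getvalue()
print(arff_text)
```

Sources lacking a measurement in some band are written with `?`, so they do not have to be dropped at conversion time.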

SCIENCE CASE 1: PHOTO-Z ESTIMATOR
1 DATA PRE-PROCESSING
- FITS2ARFF conversion
- Adding attributes (deriving colours from photometric points): 10 colours
- Removing redundant/irrelevant attributes
2 CLUSTERING
Objective: find clusters in the colour data of the training set (1262 galaxies with spectroscopic z)
ML technique: K-means algorithm with Euclidean distance
3 CLASSIFYING
Objective: classify the test set, using the clusters found in the previous step
ML technique: K-nearest neighbours
4 PHOTO-Z DETERMINATION
Objective: estimate photo-z
ML technique: computing the median photo-z of the sources of the cluster
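The four steps can be sketched end to end in plain Python. This is a toy illustration independent of WEKA, not the configuration used in this work. It makes these assumptions:
- Colours are taken as differences of adjacent-band magnitudes.
- k-means starts from evenly spaced training points, so the run is reproducible.
- The per-cluster redshift is the median spectroscopic z of the training members.
- All magnitudes and redshifts are invented.

```python
import math
from collections import Counter
from statistics import median


def colours(mags):
    """Colours as differences of adjacent-band magnitudes (an assumption)."""
    return [a - b for a, b in zip(mags, mags[1:])]


def nearest(p, centroids):
    """Index of the centroid closest to p in Euclidean distance."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))


def kmeans(points, k, iters=100):
    """Lloyd's k-means with a deterministic start; returns (centroids, labels)."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # keep the previous centroid if a cluster becomes empty
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def knn_label(x, train_points, train_labels, k=3):
    """Majority cluster label among the k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(x, train_points[i]))
    return Counter(train_labels[i] for i in order[:k]).most_common(1)[0][0]


# Hypothetical training set: magnitudes in three bands plus spectroscopic z
train_mags = [[22.0, 21.0, 20.5], [22.3, 21.2, 20.6], [21.8, 20.9, 20.5],
              [23.0, 22.8, 22.7], [23.2, 22.9, 22.7], [22.9, 22.8, 22.8]]
train_z = [0.39, 0.40, 0.41, 0.10, 0.12, 0.11]

train_colours = [colours(m) for m in train_mags]          # step 1
_, train_labels = kmeans(train_colours, k=2)              # step 2

# median spectroscopic z of the training members of each cluster
z_by_cluster = {lab: median(z for z, l in zip(train_z, train_labels) if l == lab)
                for lab in set(train_labels)}

# Hypothetical test sources without spectroscopic z
test_mags = [[22.1, 21.1, 20.6], [23.0, 22.8, 22.6]]
photo_z = [z_by_cluster[knn_label(colours(m), train_colours, train_labels)]  # steps 3-4
           for m in test_mags]
print(photo_z)
```

Every test source inherits one of only k redshift values, so the resolution of the estimate is tied directly to the number of clusters chosen.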

IN PROGRESS
Pre-processing: selecting the most significant colours;
Clustering: improving k selection for k-means (Elbow method); Manhattan distance vs Euclidean distance;
Classifying: testing different options of K-NN.
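A minimal sketch of the two clustering items follows. The elbow method runs k-means for several values of k and looks for the k after which the within-cluster sum of squares (WCSS) stops dropping sharply. Here the elbow is picked as the largest second difference of the WCSS curve, which is one simple heuristic and an assumption, as are the toy colour values and the deterministic initialisation. The Manhattan distance is shown only as a function. If it replaced the Euclidean distance inside k-means, the mean would no longer be the exact minimiser for each cluster (the component-wise median would be).

```python
import math


def manhattan(p, q):
    """Manhattan (L1) distance, for comparison with math.dist (L2)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kmeans_wcss(points, k, iters=100):
    """Run Lloyd's k-means (Euclidean) and return the within-cluster sum of squares."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]

    def assign(p):
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(iters):
        labels = [assign(p) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # keep the previous centroid if a cluster becomes empty
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return sum(math.dist(p, centroids[assign(p)]) ** 2 for p in points)


def elbow_k(wcss):
    """Pick the k with the largest second difference of the WCSS curve."""
    ks = sorted(wcss)
    inner = ks[1:-1]
    return max(inner, key=lambda k: wcss[k - 1] - 2 * wcss[k] + wcss[k + 1])


# Hypothetical two-colour data forming two well-separated groups
points = [[1.0, 0.5], [1.1, 0.6], [0.9, 0.4],
          [0.2, 0.1], [0.3, 0.2], [0.1, 0.0]]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 5)}
best_k = elbow_k(wcss)
print(wcss, best_k)

# The same pair of points under the two metrics
d_l2 = math.dist(points[0], points[3])
d_l1 = manhattan(points[0], points[3])
print(d_l2, d_l1)
```

On real colour data the WCSS curve is rarely this clean, so the automatic pick is best checked against a plot of WCSS versus k.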

NEXT STEPS
- Keep on tuning clustering and classifying methods to improve results;
- Explore other ML techniques for the photo-z estimator (e.g. Self-Organised Maps or Expectation Maximization for clustering; Random Forest, SVM and Deep Learning for classification);
- Explore the semi-supervised approach;
- Extend the methodology to different cluster data;
- Compare results and extract conclusions.
Technology:
- Test WEKA CLI performance with larger datasets;
- Explore WEKA for Big Data;
- Check suitability of WEKA vs. other tools (Python SciPy / Keras).

QUESTIONS?
THANK YOU