Machine Learning Methods for Radio Host Cross-Identification with Crowdsourced Labels

Similar documents
The Radio Galaxy Zoo Data Release 1: classifications for 75,589 sources

arxiv: v1 [astro-ph.ga] 8 Mar 2016

Machine Learning and Deep Learning! Vincent Lepetit!

Machine Learning in Large Radio Astronomy Surveys

Machine Learning Basics

ECE521 Lectures 9 Fully Connected Neural Networks

Logistic Regression. COMP 527 Danushka Bollegala

THE LAST SURVEY OF THE OLD WSRT: TOOLS AND RESULTS FOR THE FUTURE HI ABSORPTION SURVEYS

Observing Dark Worlds (Final Report)

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Introduction to Machine Learning Midterm Exam

Introduction. Chapter 1

Generalization, Overfitting, and Model Selection

Logic and machine learning review. CS 540 Yingyu Liang

Radio Galaxy Zoo: host galaxies and radio morphologies derived from visual inspection

Deep Learning Architecture for Univariate Time Series Forecasting

Geometric View of Machine Learning Nearest Neighbor Classification. Slides adapted from Prof. Carpuat

Neural networks and optimization

Essence of Machine Learning (and Deep Learning) Hoa M. Le Data Science Lab, HUST hoamle.github.io

VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Pattern Recognition and Machine Learning

Data Mining und Maschinelles Lernen

Lecture 4 Discriminant Analysis, k-nearest Neighbors

arxiv: v1 [astro-ph.im] 20 Jan 2017

Holdout and Cross-Validation Methods Overfitting Avoidance

BAYESIAN DECISION THEORY

Machine Learning (CSE 446): Multi-Class Classification; Kernel Methods

Black Holes and Active Galactic Nuclei

Morphological Classification of Galaxies based on Computer Vision features using CBR and Rule Based Systems

CS534 Machine Learning - Spring Final Exam

Machine Astronomers to Discover the Unexpected in Astronomical Surveys. Ray Norris, Western Sydney University & CSIRO Astronomy & Space Science,

Statistical Machine Learning from Data

A Statistical Framework for Analysing Big Data Global Conference on Big Data for Official Statistics October, 2015 by S Tam, Chief

Statistical Learning Reading Assignments

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Nearest Neighbors Methods for Support Vector Machines

PAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018

Linear and Logistic Regression. Dr. Xiaowei Huang

Day 3: Classification, logistic regression

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Learning with multiple models. Boosting.

Artificial Neural Networks

Evaluation of Two Level Classifier for Predicting Compressor Failures in Heavy Duty Vehicles. Slawomir Nowaczyk SAIS 2017 workshop, May

A Deep Interpretation of Classifier Chains

Time Series Classification

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Machine Learning Lecture 12

Nonlinear Classification

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Detecting the cold neutral gas in young radio galaxies. James Allison Triggering Mechanisms for AGN

Star-formation Across Cosmic Time: Initial Results from the e-merge Study of the μjy Radio Source Population. SPARCs VII The Precursors Awaken

GALAXIES. I. Morphologies and classification 2. Successes of Hubble scheme 3. Problems with Hubble scheme 4. Galaxies in other wavelengths

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Logistic Regression. Machine Learning Fall 2018

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 9 Sep. 26, 2018

Learning to Classify Galaxy Shapes using the EM Algorithm

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang

Machine Learning to Automatically Detect Human Development from Satellite Imagery

Adaptive Crowdsourcing via EM with Prior

CSC321 Lecture 5: Multilayer Perceptrons

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Online Learning and Sequential Decision Making

Day 5: Generative models, structured classification

How to evaluate credit scorecards - and why using the Gini coefficient has cost you money

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Machine Learning Practice Page 2 of 2 10/28/13

Logistic Regression & Neural Networks

Lecture 17: Neural Networks and Deep Learning

CSE 546 Final Exam, Autumn 2013

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Learning Time-Series Shapelets

Machine Learning Lecture 5

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

Supervised Learning. George Konidaris

Support Vector Machine. Industrial AI Lab.

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li.

Some Applications of Machine Learning to Astronomy. Eduardo Bezerra 20/fev/2018

Information Extraction from Text

Machine Learning (CS 567) Lecture 2

Generalization and Overfitting

CS7267 MACHINE LEARNING

Jakub Hajic Artificial Intelligence Seminar I

Neural Networks: Backpropagation

Logistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017

CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors. Furong Huang /

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Transcription:

Machine Learning Methods for Radio Host Cross-Identification with Crowdsourced Labels Matthew Alger (ANU), Julie Banfield (ANU/WSU), Cheng Soon Ong (Data61/ANU), Ivy Wong (ICRAR/UWA) Slides: http://www.mso.anu.edu.au/~alger/sparcs-vii

Host Galaxy Cross-Identification Problem: match radio emission to its host galaxy at other wavelengths Hard: Radio emission can be extended at scales of tens of arcminutes Often no clear relationship between radio emission and host galaxy FIRSTJ023838.0+023450 at 1.4 GHz. Image: FIRST FIRSTJ023838.0+023450 ininfrared. Image: WISE

Host Galaxy Cross-Identification Current approaches: Manual Crowdsourcing Nearest neighbours Bayesian methods Likelihood ratio Bayesian model fit to a radio triple. Image: ATLAS (radio), SWIRE (infrared), Fan et al. 2015

Host Galaxy Cross-Identification Our approach: Casts cross-identification as object localisation so we can use algorithms from computer vision Allows training cross-identification methods using existing cross-identification datasets

Radio Galaxy Zoo Crowdsourced, citizen science project Volunteers cross-identify radio emission from FIRST and ATLAS-CDFS with infrared host galaxies from WISE and SWIRE-CDFS See Ivy s talk later this session

ATLAS-CDFS ~2000 sources in ATLAS DR3 Radio Galaxy Zoo source identifications and SWIRE host cross-identifications ~500 sources cross-identified with SWIRE in ATLAS DR1 ATLAS observations of CDFS. Image: ATLAS, Franzen et al. 2015

Supervised Machine Learning Encompasses classification, regression, and other function approximation tasks Promising methods for handling very large datasets Training requires a large set of labelled data Application requires converting problem into a function approximation problem Binary classification best understood

Machine Learning for Cross-Identification Allows use of Radio Galaxy Zoo data for training Need to convert cross-identification into a machine learning task First pass from computer vision: Sliding window approach Given an image of radio emission, classify each square patch based on whether the AGN is located there Not terribly efficient Binary classification! Scanning to find the host galaxy. Image: FIRST

Machine Learning for Cross-Identification Second attempt: Assume host galaxies visible in infrared Given an image of radio emission, classify each candidate host galaxy in that image based on whether it is the host galaxy Much more efficient! Candidate host galaxies. Image: FIRST/WISE

Cross-Identification with Binary Classification ( (,, ) ) Representation of galaxy 1 0 Whether galaxy has an AGN

Cross-Identification with Binary Classification

Cross-Identification with Binary Classification

Experimental Method Three classifiers: Logistic regression Random forests Convolutional neural networks Labelled training data: Inputs are square image cutouts centred on candidate host galaxies Expert labels from ATLAS DR1 Crowdsourced labels from Radio Galaxy Zoo Split CDFS into resolved/compact sources Train on 75% of CDFS Test by comparing outputs to ATLAS DR1 on remaining 25%

Classification Accuracy on SWIRE-CDFS Expert RGZ Expert RGZ Expert RGZ

Cross-Identification Accuracy on SWIRE-CDFS Expert RGZ Expert RGZ

Key Assumptions Assumptions on search radius: Assumptions on candidate host galaxies: Host galaxies visible in infrared Assumptions on sliding window radius: One host galaxy in radius All radio emission from a source is contained in radius Information in sliding window sufficient to determine host galaxy We defer these problems for now

Failure Case Multiple Hosts Assumption: One host galaxy in search radius Search radius = 1 (as in Radio Galaxy Zoo) Assumption often broken

Failure Case Nearby Candidate Hosts Hard to distinguish between nearby candidate hosts A prior could help resolve this issue

Failure Case Misidentified Lobe Abundance of compact objects in training data bias the classifier toward bright radio lobes Larger datasets with more varied radio doubles would likely resolve this issue Larger window sizes can help (but too large provides the classifier with too many inputs)

Failure Case Search Radius Search radius of 1 too small to find all host galaxies...but making the search radius too large worsens the problem of multiple hosts

Future Work More data for convolutional neural network training Radio Galaxy Zoo-FIRST? Simulations? Dynamically choose window sizes and search radii Combine computer vision methods with radio source identification methods

Summary We developed a machine learning approach for host galaxy cross-identification We trained the method on both expert cross-identifications from ATLAS DR1 and volunteer cross-identifications from Radio Galaxy Zoo Crowdsourcing provides a promising source of supervised machine learning training data Better model selection and incorporating source identification would improve accuracy