
Doing Right By Massive Data: Using Probability Modeling To Advance The Analysis Of Huge Astronomical Datasets
Alexander W. Blocker, 17 April 2010

Abstract: The analysis of extremely large, complex datasets is becoming an increasingly important task in scientific research. This trend is especially prevalent in astronomy, as large-scale surveys such as SDSS, Pan-STARRS, and the LSST deliver (or promise to deliver) terabytes of data per night. While both the statistics and machine-learning communities have offered approaches to these problems, neither has produced a completely satisfactory one. Working in the context of event detection for the MACHO LMC data, I will present an approach that combines much of the power of Bayesian probability modeling with the efficiency and scalability typically associated with more ad hoc machine-learning approaches. This provides both rigorous assessments of uncertainty and improved statistical efficiency on a dataset containing approximately 20 million sources and 40 million individual time series. I will also discuss how this framework could be extended to related problems.

Outline
1. Challenges of Massive Data
2. Combining approaches
3. Application: Event Detection for Astronomical Data (Overview; Proposed method; Probability Model; Classification; Results)
4. Conclusion

What is massive data? In short, it's data where our favorite methods stop working. We have orders of magnitude more observations than we are used to dealing with, often combined with high dimensionality (e.g., 40 million time series with thousands of observations each). This scale of data is increasingly common in fields such as astronomy, computational biology, and ecology. There is an acute need for statistical methods that scale to these quantities of data. However, we are faced with a tradeoff between statistical rigor and computational efficiency.

Machine Learning methods: strengths & weaknesses, in broad strokes. Strengths: Such methods are typically very computationally efficient and scale well to large datasets. They are relatively generic in their applicability. Machine learning methods often "just work" (quite well) for tasks such as classification and prediction with clean data. Weaknesses: ML methods do not usually provide built-in assessments of uncertainty. A lack of application-specific modeling often means that data are not used as efficiently as possible. Machine learning methods are typically unprincipled from a statistical perspective.

Statistical methods / Probability models: strengths & weaknesses, in broader strokes. Strengths: These methods are built upon sound theoretical principles. We can build complex probability models appropriate to the particular application, incorporating detailed scientific knowledge. Statistical methods can provide rigorous, built-in assessments of uncertainty. Weaknesses: Computation often scales very poorly with the size of the dataset (O(n^2) or worse, especially for complex hierarchical models). While application-specific modeling can be a great strength of this approach, complex structure in the data can require an infeasibly large amount of case-specific modeling. Computation for these models often does not parallelize well (for example, MCMC methods are inherently sequential to a large extent).

How can we get the best of both worlds? Principled statistical methods are best for handling messy, complex data that we can effectively model, but scale poorly to massive datasets. Machine learning methods handle clean data well, but choke on issues we often confront (outliers, nonlinear trends, irregular sampling, unusual dependence structures, etc.). Idea: inject probability modeling into our analysis in the right places.

Overview. The Problem: We have a massive database of time series (approximately 40 million) from the MACHO project; these cover the LMC for several years. Our goal is to identify and classify time series containing events. How do we define an event? We are not interested in isolated outliers; this differentiates our problem from traditional anomaly detection and requires more refined approaches. We are looking for groups of observations that differ significantly from those nearby (i.e., "bumps" and "spikes"). We are also attempting to distinguish periodic and quasi-periodic time series from isolated events, as they have very different scientific interpretations.

Overview. Exemplar time series from the MACHO project (figures): a null time series; an isolated event (microlensing); a quasi-periodic time series (LPV); a variable time series (quasar); a variable time series (blue star).

Overview. Notable properties of this data: Fat-tailed measurement errors. These are common in astronomical data, especially from ground-based telescopes (atmospheric fluctuations are not kind to statisticians). Thus, we need more sophisticated models for the data than standard Gaussian approaches. Quasi-periodic and other variable sources. These change the problem from binary classification (null vs. event) to k-class. So, we need more complex test statistics and classification techniques. Non-linear, low-frequency trends confound our analysis further and make less sophisticated approaches (i.e., those without careful detrending) far less effective. Irregular sampling is the norm in this data; if handled incorrectly, it can create artificial events. Oh my!

Overview. Previous approaches to event detection: Scan statistics are a common approach (Liang et al., 2004; Preston & Protopapas, 2009). However, they often discard information by working with ranks, and they account for neither trends nor irregular sampling. Equivalent width methods (a scan statistic based upon local deviations) are common in astrophysics. However, these rely upon Gaussian assumptions and crude multiple-testing corrections. Numerous other approaches have been proposed in the literature, but virtually all rely upon Gaussian distributional assumptions, stationarity, and (usually) regular sampling.

Overview. Our approach: We use a Bayesian probability model both for initial detection and to reduce the dimensionality of our data (by retaining posterior summaries). Using these posterior summaries as features, we apply an ML classification technique to differentiate between events, variables, and null time series. Symbolically, let V be the set of all time series with variation at an interesting scale (i.e., the range of lengths for events), and let E be the set of events. For a given time series Y_i, we are interested in P(Y_i ∈ E). We will decompose this probability (conceptually) as P(Y_i ∈ E) = P(Y_i ∈ V) · P(Y_i ∈ E | Y_i ∈ V), using the above two steps.

Proposed method. Probability model: We assume a linear model for our observations: Y = X_l β_l + X_m β_m + u. We assume that the residuals u_t are iid t_ν(0, σ²) random variables to account for extreme residuals (we set ν = 3). X_l contains the low-frequency components of a wavelet basis, and X_m contains the mid-frequency components. We use a Symmlet-4 (aka Least Asymmetric Daubechies 4) wavelet basis; its profile matches the events of interest quite well. For a basis of length 2048, we build X_l to contain the first 8 coefficients; X_m contains the next 120. Idea: X_l will model structure due to trends, and X_m will model structure at the scales of interest for events.
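As a concrete sketch of this design, the low- and mid-frequency blocks can be built by inverting unit coefficient vectors of a sym4 decomposition (this uses PyWavelets with periodized boundaries; the talk's exact basis construction, and its stretching to irregular sampling, are not shown here, so treat the function and its defaults as illustrative):

```python
import numpy as np
import pywt

def wavelet_design(n=2048, wavelet="sym4", n_low=8, n_mid=120):
    """Build low- and mid-frequency wavelet design matrices X_l, X_m.

    Columns are basis functions obtained by inverting a unit coefficient
    vector, ordered coarsest-first: the first n_low columns capture trend
    structure, the next n_mid capture event-scale structure.
    """
    # Template coefficient structure for a length-n signal (coarsest first).
    tmpl = pywt.wavedec(np.zeros(n), wavelet, mode="periodization")
    cols, needed = [], n_low + n_mid
    for lev, arr in enumerate(tmpl):
        for j in range(len(arr)):
            coeffs = [np.zeros_like(a) for a in tmpl]
            coeffs[lev][j] = 1.0
            cols.append(pywt.waverec(coeffs, wavelet, mode="periodization"))
            if len(cols) == needed:
                X = np.column_stack(cols)
                return X[:, :n_low], X[:, n_low:]
    raise ValueError("series too short for requested number of columns")
```

For n = 2048, the coarsest approximation level has exactly 8 coefficients, and the next four detail levels have 8, 16, 32, and 64, so the "first 8 / next 120" split falls on clean level boundaries.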

Proposed method. Probability model (continued): Y = X_l β_l + X_m β_m + u. We explicitly account for irregular sampling in our time series by stretching our basis to the total observation time of the data. We place independent Gaussian priors on all coefficients except the intercept, to reflect prior knowledge and to regularize estimates in undersampled regions. We use the optimal data augmentation scheme of Meng & van Dyk (1997) with the EM algorithm to fit the model (average time for a full estimation procedure is 0.4 seconds including file I/O, using the speedglm package in R). We use a likelihood ratio statistic to test for the presence of variation at the scales of interest (testing β_m = 0), with a modified Benjamini-Hochberg FDR procedure to set the detection threshold.
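The t_ν error model can be fit by a simple EM scheme exploiting the scale-mixture representation of the t distribution (each observation gets a weight shrinking the influence of outliers). This is a generic NumPy sketch, not the speedglm-based implementation described above, and it omits the Gaussian priors on the coefficients:

```python
import numpy as np

def fit_t_linear_model(X, y, nu=3.0, n_iter=50, ridge=1e-6):
    """EM fit of y = X b + u with u_t ~ iid t_nu(0, sigma^2).

    E-step: w_t = (nu + 1) / (nu + r_t^2 / sigma^2)  (latent precision weights)
    M-step: weighted least squares for b, weighted mean square for sigma^2.
    """
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # OLS start
    sigma2 = np.mean((y - X @ b) ** 2)
    for _ in range(n_iter):
        r = y - X @ b
        w = (nu + 1.0) / (nu + r ** 2 / sigma2)       # E-step
        WX = X * w[:, None]
        b = np.linalg.solve(X.T @ WX + ridge * np.eye(p), WX.T @ y)  # M-step
        sigma2 = np.mean(w * (y - X @ b) ** 2)
    return b, sigma2
```

Heavy-tailed residuals receive small weights, so a few extreme photometric errors cannot drag the fitted trend toward themselves, which is exactly why the t model is preferred over a Gaussian here.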

Proposed method. Examples of model fit: The idea is that, if there is an event at the scale of interest, there will be a large discrepancy between the residuals using X_m and X_l (figures).

Proposed method. Example of model fit: For null time series, the discrepancy will be small (figure).

Proposed method. Example of model fit: And for quasi-periodic time series, the discrepancy will be huge (figure).

Proposed method. Results of likelihood ratio test via FDR: awaiting completion of computations.
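For reference, the standard Benjamini-Hochberg step-up procedure underlying this thresholding step can be sketched as follows (the talk uses a modified variant, so this is only a generic illustration):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """Standard BH step-up procedure at FDR level q.

    Returns a boolean mask of discoveries in the original order.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # BH critical values
    below = p[order] <= thresh
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True                        # reject the k smallest p-values
    return mask
```

With the p-values of the likelihood ratio tests as input and q = 0.10, this yields the 10% FDR threshold mentioned in the results.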

Proposed method. Distribution of likelihood ratio statistic: To assess how well this statistic performs, we simulated 50,000 events from a physics-based model and 50,000 null time series. We then added approximately 60,000 time series from known variable stars. It should be noted that there is an extremely long right tail in the distribution of log-likelihood ratios for variable sources (extending out to approximately 8,000) that is not shown here; it is why additional steps are needed.

Proposed method. A sidenote: why not use a Bayes factor? Given our use of Bayesian models, a Bayes factor would appear to be a natural approach for this testing problem. Unfortunately, Bayes factors do not work well with priors of convenience, such as our Gaussian prior on the wavelet coefficients. Because of these issues, the Bayes factor was extremely conservative in this problem for almost any reasonable prior.

Proposed method Distribution of Bayes factor

Proposed method. Classification: We use the estimated wavelet coefficients β̂_m (normalized by τ̂) as features for classification. These provide a rich, clean representation of each time series, following detrending and denoising (from our MAP estimation). To simplify classification and make our features invariant to the location of variation in each time series, we use as features the sorted absolute values of the normalized wavelet coefficients within each resolution level.
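A minimal sketch of this feature construction. The level sizes (8, 16, 32, 64) are an assumption consistent with the 120 mid-frequency coefficients described earlier, not a detail stated on this slide:

```python
import numpy as np

def levelwise_sorted_features(beta_m, level_sizes=(8, 16, 32, 64)):
    """Location-invariant features from normalized wavelet coefficients.

    Within each resolution level, take absolute values and sort them
    (descending), so shifting an event in time permutes coefficients
    within a level but leaves the feature vector unchanged.
    """
    feats, start = [], 0
    for size in level_sizes:
        block = np.abs(np.asarray(beta_m)[start:start + size])
        feats.append(np.sort(block)[::-1])
        start += size
    return np.concatenate(feats)
```

Sorting within (rather than across) levels preserves the scale information that distinguishes short spikes from broad bumps, while discarding only the event's position.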

Proposed method. Classification: We tested a wide variety of classifiers on our training data, including kNN, SVM, LDA, QDA, and others. In the end, regularized logistic regression appeared to be the best technique. We obtained excellent performance (AUC = 0.98) on previous training data for this separation (figure: ROC for the logistic regression classifier, 10-fold CV; average true positive rate vs. average false positive rate).
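A dependency-light sketch of an L2-regularized logistic regression fit by gradient descent, as a stand-in for the regularized classifier above (the talk's actual implementation and tuning are not specified, so the learning rate, penalty, and iteration count here are illustrative):

```python
import numpy as np

def fit_logistic_l2(X, y, lam=1.0, lr=0.1, n_iter=2000):
    """Ridge-penalized logistic regression by full-batch gradient descent.

    Minimizes mean log-loss + (lam / 2n) * ||w||^2 over weights w.
    y should be 0/1 labels; returns the fitted weight vector.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(y = 1)
        grad = X.T @ (prob - y) / n + lam * w / n  # gradient of penalized loss
        w -= lr * grad
    return w
```

The L2 penalty is what makes this workable with the high-dimensional, correlated wavelet features: it shrinks the weights on uninformative coefficients instead of letting them blow up on a finite training set.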

Proposed method. Classification: For the multiclass problem (null vs. event vs. variable), we are testing three approaches: partially ordered logistic regression, multinomial regression, and SVM. Results are currently awaiting further computation.

Results: Computation has yet to complete, but the empirical distribution of our likelihood ratio statistics (with the 10% FDR threshold) is given below (figure).

Putting everything in its place: a mental meta-algorithm

Understand what your full (computationally infeasible) statistical model is; this should guide the rest of your decisions.
Preprocess to remove the chaff, when possible.
  Be careful! Any prescreening must be extremely conservative to avoid significantly biasing your results.
Use approximations for the critical parts of your models (e.g. empirical Bayes as opposed to full hierarchical modeling) to maintain computational feasibility.
  Hyperparameters can be set based on scientific knowledge or for mild regularization if each observation is sufficiently rich or priors are sufficiently informative.
  Otherwise, a random subsample of the data can be used to obtain reasonable estimates.
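The subsample-based empirical Bayes step above can be sketched on the simplest hierarchical model. This is a toy normal-normal illustration with assumed known noise variance, not the actual model from the application:

```python
import numpy as np

# Normal-normal model: y_i = theta_i + noise, theta_i ~ N(0, tau^2),
# noise ~ N(0, sigma^2) with sigma known. Empirical Bayes estimates
# tau^2 from a random subsample instead of fitting the full hierarchy.
rng = np.random.default_rng(4)
sigma = 1.0
tau_true = 2.0
theta = rng.normal(0.0, tau_true, size=200_000)
y = theta + rng.normal(0.0, sigma, size=theta.size)

# Method-of-moments hyperparameter estimate from a small random subsample
sub = rng.choice(y, size=10_000, replace=False)
tau2_hat = max(sub.var() - sigma**2, 0.0)

# MAP / posterior-mean shrinkage applied to the full dataset
shrink = tau2_hat / (tau2_hat + sigma**2)
theta_map = shrink * y
```

Estimating the hyperparameter on a subsample costs a small fraction of a full hierarchical fit, while the resulting shrinkage estimates still beat the raw observations in mean squared error.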

Putting everything in its place: a mental meta-algorithm

Using estimates from your probability model as inputs, apply machine learning methods as needed (e.g. for large-scale classification or clustering). This maintains computational efficiency and provides these methods with the cleaner input they need to perform well.
Use scale to your advantage when evaluating uncertainty.
  With prescreening, use known nulls.
  Without prescreening, use pseudoreplications or simulated data.
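The known-nulls idea can be sketched as follows: compute the detection statistic on series known to contain no event, and use their empirical distribution to calibrate a threshold at a target false positive rate. The detection statistic and injected event below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def detection_stat(y):
    """Toy detection statistic: maximum absolute smoothed deviation."""
    kernel = np.ones(5) / 5.0
    return np.abs(np.convolve(y, kernel, mode="valid")).max()

# Known-null series (pure noise) calibrate the threshold empirically
null_stats = np.array(
    [detection_stat(rng.normal(size=200)) for _ in range(2000)]
)
threshold = np.quantile(null_stats, 0.99)  # target 1% false positive rate

# A series with an injected bump exceeds the null-calibrated threshold
event = rng.normal(size=200)
event[90:110] += 3.0
```

Because massive surveys supply nulls in abundance, this empirical calibration sidesteps distributional assumptions that a parametric threshold would require.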

Summary

Massive data presents a new set of challenges to statisticians that many of our standard tools are not well-suited to address.
Machine learning has some valuable ideas and methods to offer, but we should not discard the power of probability modeling.
Conversely, reasonably sophisticated probability models can be incorporated into the analysis of massive datasets without destroying computational efficiency if appropriate approximations are used.
It is tremendously important to put each tool in its proper place for these types of analyses.
Our work on event detection for astronomical data shows the power of this approach by combining both rigorous probability models and standard machine learning approaches.
There is a vast amount of future research to be done in this area.

Acknowledgements

Many thanks to both Pavlos Protopapas and Xiao-Li Meng for their data and guidance on this project. I would also like to thank Edo Airoldi for our discussions on this work and Dae-Won Kim for his incredible work in setting up the MACHO data.