The changing landscape of astroinformatics

Size: px
Start display at page:

Download "The changing landscape of astroinformatics"

Transcription

1 The changing landscape of astroinformatics astrostatistics and Eric Feigelson Center for Astrosta2s2cs Penn State University IAU Symposium #325 Astroinforma6cs Sorrento IT

2 An aphorism The scien)st collects data in order to understand natural phenomena The sta)s)cian helps the scien)st acquire understanding from the data The computer scien)st helps the scien)st perform the needed calcula)ons (*) The individual proficient at all these tasks is a data scien)st (*) needed only for Big Data or Big Theory

3 Outline 1 Astronomers & stat/comp methodology: o Role of statistics in astronomy o Education gap o Resurgence of astrostatistics 2 The R statistical software environment

4 Statistical questions for a scientist How reliable are my results? (P=0.05 vs. P=0.003) Have I used the most effective procedures for achieving my results? (KS vs. AD test) Do my results depend on uncertain models? (power law?) Do my results depend on my chosen analysis path? A suggested path: Ø Nonparametric data analysis (including hypothesis tests, local regression, Fourier or wavelet transforms, etc) Ø Clustering & classification (if dataset is inhomogeneous) Ø Parametric analysis: maximum likelihood with model selection Ø Bayesian analysis: weight likelihood with prior knowledge Ø Discuss: What have I learned from each step along the path?

5 Recommended steps in the statistical analysis of scientific data The application of statistics can reliably quantify information embedded in scientific data and help adjudicate the relevance of theoretical models. But this is not a straightforward, mechanical enterprise. It requires: Ø exploration of the data Ø careful statement of the scientific problem Ø model formulation in mathematical form Ø choice of statistical method(s) Ø calculation of statistical quantities Ø judicious scientific evaluation of the results Astronomers often do not adequately pursue each step

6 Astronomy involves many branches of modern statistics Cosmology Statistics Galaxy clustering Spatial point processes, clustering Galaxy morphology Galaxy luminosity fn Power law relationships Regression, mixture models Gamma distribution Pareto distribution Weak lensing morphology Geostatistics, density estimation Strong lensing morphology Shape statistics Strong lensing timing Time series with lag Faint source detection False Discovery Rate SN Ia & QSO lightcurves Nonstationary time series CMB spatial analysis Markov fields, ICA, etc ΛCDM parameters Bayesian inference & model selection Comparing data & simulation Uncertainty quantification

7 The education gap Statistics Astronomers usually take 0-1 courses in statistical methods during their university/graduate education, in contrast to extensive training in physics and associated mathematics. Students are thirsty for statistics training. Computation Survey of ~1100 astronomers worldwide (Momcheva & Tollerud 2015) shows that 90% write software but only 8% receive substantial training in software development and 43% receive no training. Preferred languages are Python (67%), IDL (44%), C/C++ (37%), Fortran (28%),, Matlab (8%), R (6%, 3% in USA), julia (1%). No correlation with age. Considering the wide-spread use of R in other scientific fields, its popularity among astronomers is strikingly low. The result of weak training is a widespread ignorance of the myriad advances in modern statistical & computational methodology.

8 Consequently The state of astrostatistics today is often poor! Many astronomical studies are confined to a narrow suite of familiar statistical methods: Fourier transform for temporal analysis (Fourier 1807) Least squares regression (Legendre 1805, Pearson 1901) Kolmogorov-Smirnov goodness-of-fit test (Kolmogorov, 1933) Principal components analysis for tables (Hotelling 1936) Even traditional methods are sometimes misused see Beware the Kolmogorov-Smirnov test! page on ASAIP

9 Under-utilized methodology: modeling (MLE, EM Algorithm, BIC, bootstrap) multivariate classification (LDA, SVM, CART, RFs) time series (autoregressive models, state space models) spatial point processes (Ripley s K, kriging) nondetections (survival analysis) image analysis (computer vision methods, False Detection Rate) statistical computing (R) Advertisement Modern Statistical Methods for Astronomy with R Applications E. D. Feigelson & G. J. Babu, Cambridge Univ Press, 2012! "#$$%&!'()'!*+,-.!/01&2!34&!! 5%67!/67&4$489!:!;4684<4=9!544>!!

10 Recent resurgence in astrostatistics Papers in astronomical literature doubled to ~500/yr in past decade ( Methods: statistical papers in NASA-Smithsonian Astrophysics Data System) Short training courses (Penn State, India, Brazil, Greece, China, Italy, France, Germany, Spain, CfA, Caltech, Sweden, IAU/AAS/CASCA/ meetings) Cross-disciplinary research collaborations (Harvard/ICHASC, Carnegie-Mellon, Penn State, NASA-Ames/Stanford, CEA-Saclay/Stanford, Cornell, UC-Berkeley, Michigan, Imperial College London, Swinburne, ) Cross-disciplinary conferences (Statistical Challenges in Modern Astronomy, Astronomical Data Analysis , PhysStat, SAMSI 2006/2012, Astroinformatics ) Scholarly society working groups: Int l Stat Institute s Int l Astrostatistical Assn, Int l Astro Union Commission, Amer Astro Soc Working Group, Amer Stat Assn Interest Group, LSST Science Collaboration, IEEE Astro Data Miner Task Force

11 Textbooks New resources in astrostatistics Bayesian Logical Data Analysis for the Physical Sciences: A Compara:ve Approach with Mathema:ca Support Gregory, 2005 Prac:cal Sta:s:cs for Astronomers, Wall & Jenkins, 2 nd ed, 2012 Modern Sta:s:cal Methods for Astronomy with R Applica:on, Feigelson & Babu, 2012 Sta:s:cs, Data Mining, and Machine Learning in Astronomy: A Prac:cal Python Guide for the Analysis of Survey Data,, Ivecic, Connolly, VanderPlas & Gray, 2014 hop://asaip.psu.edu Recent papers, meetings, jobs, blogs, courses, forums,

12 Statistical analysis needs software A brief history of statistical computing 1960s c2000: Sta6s6cal analysis developed by academic sta6s6cians, but implemented by private companies (SAS, SPSS, etc) 1980s: John Chambers (ATT, USA)) develops S sta6s6cal package in the new C language 1990s: Ross Ihaka & Robert Gentleman (Univ Auckland NZ) mimic S in an open source system, R. R Core Development Team expands, GNU GPL release. 2000s: Comprehensive R Analysis Network (CRAN) for user- provided specialized packages grows exponen6ally. Important packages incorporated into base- R.

13 Growth of CRAN contributed packages Oct : 9373 packages (~5/day) ~200,000 functions See The Popularity of Data Analysis Software, R. A. Muenchen,

14 The R statistical computing environment R integrates data manipula6on, graphics and extensive sta6s6cal analysis. Uniform documenta6on and coding standards. Fully programmable C- like language, similar to IDL. Specializes in vector/matrix inputs. Same compiler speed as IDL, Python, Matlab. Easy binary download for Windows, Mac or linux. On- the- fly installa6on of CRAN packages. Quick 2- way communica6on with C, Fortran, Python. Emulator of Matlab user- provided add- on CRAN packages including ~20 for astronomy (FITSio, astrolibr, )

15 R s growing importance in data science Kdnuggests 2014 poll data analytics/mining SPSS R Posts on software forums 2013 Job trends from Indeed.com

16 R is now the world s premier statistical computing package developed by statisticians with community packages from all of the sciences extensively used in genomics, biostatistics, econometrics, ecology, social sci, etc Data scientists recommend both Python and R Usage of both is growing rapidly (

17 Some functionalities of base R arithme6c & linear algebra bootstrap resampling empirical distribu6on tests exploratory data analysis generalized linear modeling graphics robust sta6s6cs linear programming local and ridge regression max likelihood es6ma6on multivariate analysis multivariate clustering neural networks smoothing spatial point processes statistical distributions statistical tests survival analysis time series analysis

18 Selected methods in Comprehensive R Archive Network (CRAN) Bayesian computation & MCMC, classification & regression trees, genetic algorithms, geostatistical modeling, hidden Markov models, irregular time series, kernel-based machine learning, least-angle & lasso regression, likelihood ratios, map projections, mixture models & modelbased clustering, nonlinear least squares, multidimensional analysis, multimodality test, multivariate time series, multivariate outlier detection, neural networks, non-linear time series analysis, nonparametric multiple comparisons, omnibus tests for normality, orientation data, parallel coordinates plots, partial least squares, periodic autoregression analysis, principal curve fits, projection pursuit, quantile regression, random fields, Random Forest classification, ridge regression, robust regression, Self-Organizing Maps, shape analysis, space-time ecological analysis, spatial analyisis & kriging, spline regressions, tessellations, three-dimensional visualization, wavelet toolbox

19 spatstat: A large CRAN package useful for astronomy ~1500 sta6s6cal func6ons ~800 book describing methods w/ recipes Spa2al Point Pa?erns: Methodology and Applica2ons with R Adrian Baddeley et al D/3D, 1M pts for ploing, 100K exploratory, 10K modeling (on laptop) Smoothing loca6ons and mark variables Global spa6al autocorrela6on func6ons (PCF, K, F, G, J, L*) Tests for inhomogeneity & anisotropy Clustering & hypothesis tests for mul6type data MLE and Bayesian modeling, model residuals & valida6on Poisson, Gibbs, user- supplied models Tessella6ons

20 CRAN Task Views ( CRAN Task Views provide brief overviews of CRAN packages by topic & func6onality. Maintained be expert volunteers. Bayesian ~110 (incl. rbugs, rjags, rstan, MCMCpack, LaplacesDemon, boa) Chem/Phys ~75 packages (incl. 20 for astronomy) Cluster/Mixture ~100 packages Graphics ~40 packages HighPerfComp ~75 packages (parallelia6on, GPUs, etc) Machine Learning ~70 packages Medical imaging ~20 packages Robust ~50 packages Spa6al ~135 packages (incl. spatstat) Survival ~200 packages TimeSeries ~170 packages

21 High Performance Compu2ng with R While originally designed for an individual exploring small datasets, R can be pipelined and can treat megadatsets. Several dozen CRAN packages are devoted to high- performance compu6ng. o foreach func6on released in R by Revolu6on Analy6cs. for loop distributed among local cores o Packages for parallel compu2ng communica2on treat Message Passing Interface, networkspaces, SOCK, PVM, Open MP, o Process management on mul2core computers & clusters: o batch: assists mul6ple parameter simula6ons, permuta6ons, etc o snowfall: wrapper around snow incl. error checks for MPI & PVM o Distributed compu6ng of mul2ple MCMC chains: bugsparallel

22 o Several packages for cloud & grid processing o cloudrmpi: processing using MPI on Amazon s EC2 cloud o GridR: executes R func6ons on remote hosts, clusters or grids o rmr and RHIPE: executes R func6ons via MapReduce on a Hadoop cluster o Several packages treat large memory & out- of- memory data o bigmemory: points to very large objects in memory o ff: fast access to large data on disk o HadoopStreaming: R opera6ons on streaming tabular or string data o Several packages provide GPU compu2ng o gputools: CUDA implementa6on of linear modeling, clustering, SVM, ICA, o magma: matrix algebra for mul6core CPU & GPU architectures o OpenCL: R interface to OpenCL for heterogeneous CPU/GPU systems See CRAN Task View on High Performance Compu)ng hop://cran.r- project.org/web/views/highperformancecompu6ng.html

23 # Support Vector Machine model, predic2on and valida2on Classifica2on in R install.packages('e1071') ; library(e1071) SDSS_svm_mod <- svm(sdss_train[,5] ~.,data=sdss_train[,1:4],cost=100, gamma=1) summary(sdss_svm_mod) ; str(sdss_svm_mod) SDSS_svm_test_pred <- predict(sdss_svm_mod, SDSS_test) plot(sdss_test[,1], SDSS_test[,2], xlim=c(- 0.7,3), col=round(sdss_svm_test_pred), ylim=c(- 0.7,1.8), pch=20, cex=0.5, main='support Vector Machine', xlab='u- g (mag) ) dev.copy2pdf(file='sdss_svm_class.pdf') Support Vector Machine u g (mag) g r (mag) Classification of g r (mag) r i (mag) SDSS Test Dataset r i (mag) i z (mag) www2.astro.psu.edu/users/edf/san6ago_2016

24 Closing remarks Astronomy graduate curriculum needs a year of statistical and computational methodology. Textbook in astroinformatics needed. Some astronomers should get M.S. in statistics or computer science. All astronomers should know methodology at Wikipedia level. Astrostatistics and astroinformatics can become a well-funded, cross-disciplinary research field involving a few percent of astronomers (cf. astrochemists) promulgating modern methodology and pushing its frontiers. Astronomers can benefit from regular use of many methods coded in R & CRAN. R propels implementation of modern procedures with little effort. Experts should investigate utility of R/CRAN in HPC environments.

Astrostatistics: Past, Present and Future

Astrostatistics: Past, Present and Future Astrostatistics: Past, Present and Future Eric Feigelson Penn State University Summer School in Statistics for Astronomy June 2017 What is astronomy? Astronomy is the observational study of matter beyond

More information

Astrostatistics: Past, Present & Future

Astrostatistics: Past, Present & Future Astrostatistics: Past, Present & Future Eric Feigelson Penn State University Summer School in Statistics for Astronomers XI June 2015 What is astronomy? Astronomy is the observational study of matter beyond

More information

Introduction to Astrostatistics and R

Introduction to Astrostatistics and R Introduction to Astrostatistics and R Eric Feigelson Penn State University Osservatorio Astrofisico di Arcetri April 2014 My credentials Professor of Astronomy & Astrophysics and of Statistics Assoc Director,

More information

Introduction to Astrostatistics and R

Introduction to Astrostatistics and R Introduction to Astrostatistics and R Eric Feigelson Penn State University Cosmology on the Beach, 2014 Lecture #1 My credentials Professor of Astronomy & Astrophysics and of Statistics Assoc Director,

More information

The Challenges of Heteroscedastic Measurement Error in Astronomical Data

The Challenges of Heteroscedastic Measurement Error in Astronomical Data The Challenges of Heteroscedastic Measurement Error in Astronomical Data Eric Feigelson Penn State Measurement Error Workshop TAMU April 2016 Outline 1. Measurement errors are ubiquitous in astronomy,

More information

Master of Science in Statistics A Proposal

Master of Science in Statistics A Proposal 1 Master of Science in Statistics A Proposal Rationale of the Program In order to cope up with the emerging complexity on the solutions of realistic problems involving several phenomena of nature it is

More information

Introduction to astrostatistics

Introduction to astrostatistics Introduction to astrostatistics Eric Feigelson (Astro & Astrophys) & G. Jogesh Babu (Stat) Center for Astrostatistics Penn State University Astronomy & statistics: A glorious history Hipparchus (4th c.

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Parallelizing Gaussian Process Calcula1ons in R

Parallelizing Gaussian Process Calcula1ons in R Parallelizing Gaussian Process Calcula1ons in R Christopher Paciorek UC Berkeley Sta1s1cs Joint work with: Benjamin Lipshitz Wei Zhuo Prabhat Cari Kaufman Rollin Thomas UC Berkeley EECS (formerly) IBM

More information

Computational Biology Course Descriptions 12-14

Computational Biology Course Descriptions 12-14 Computational Biology Course Descriptions 12-14 Course Number and Title INTRODUCTORY COURSES BIO 311C: Introductory Biology I BIO 311D: Introductory Biology II BIO 325: Genetics CH 301: Principles of Chemistry

More information

Statistics Toolbox 6. Apply statistical algorithms and probability models

Statistics Toolbox 6. Apply statistical algorithms and probability models Statistics Toolbox 6 Apply statistical algorithms and probability models Statistics Toolbox provides engineers, scientists, researchers, financial analysts, and statisticians with a comprehensive set of

More information

STATISTICAL COMPUTING USING R/S. John Fox McMaster University

STATISTICAL COMPUTING USING R/S. John Fox McMaster University STATISTICAL COMPUTING USING R/S John Fox McMaster University The S statistical programming language and computing environment has become the defacto standard among statisticians and has made substantial

More information

Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University

Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University kborne@gmu.edu, http://classweb.gmu.edu/kborne/ Outline Astroinformatics Example Application:

More information

Data Analysis Methods

Data Analysis Methods Data Analysis Methods Successes, Opportunities, and Challenges Chad M. Schafer Department of Statistics & Data Science Carnegie Mellon University cschafer@cmu.edu The Astro/Data Science Community LSST

More information

Data Exploration vis Local Two-Sample Testing

Data Exploration vis Local Two-Sample Testing Data Exploration vis Local Two-Sample Testing 0 20 40 60 80 100 40 20 0 20 40 Freeman, Kim, and Lee (2017) Astrostatistics at Carnegie Mellon CMU Astrostatistics Network Graph 2017 (not including collaborations

More information

Unsupervised clustering of COMBO-17 galaxy photometry

Unsupervised clustering of COMBO-17 galaxy photometry STScI Astrostatistics R tutorials Eric Feigelson (Penn State) November 2011 SESSION 2 Multivariate clustering and classification ***************** ***************** Unsupervised clustering of COMBO-17

More information

Statistical Searches in Astrophysics and Cosmology

Statistical Searches in Astrophysics and Cosmology Statistical Searches in Astrophysics and Cosmology Ofer Lahav University College London, UK Abstract We illustrate some statistical challenges in Astrophysics and Cosmology, in particular noting the application

More information

Classifica(on and predic(on omics style. Dr Nicola Armstrong Mathema(cs and Sta(s(cs Murdoch University

Classifica(on and predic(on omics style. Dr Nicola Armstrong Mathema(cs and Sta(s(cs Murdoch University Classifica(on and predic(on omics style Dr Nicola Armstrong Mathema(cs and Sta(s(cs Murdoch University Classifica(on Learning Set Data with known classes Prediction Classification rule Data with unknown

More information

Mul$variate clustering & classifica$on with R

Mul$variate clustering & classifica$on with R Mul$variate clustering & classifica$on with R Eric Feigelson Penn State Cosmology on the Beach, 2014 Astronomical context Astronomers have constructed classifica@ons of celes@al objects for centuries:

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Cosmological N-Body Simulations and Galaxy Surveys

Cosmological N-Body Simulations and Galaxy Surveys Cosmological N-Body Simulations and Galaxy Surveys Adrian Pope, High Energy Physics, Argonne Na3onal Laboratory, apope@anl.gov CScADS: Scien3fic Data and Analy3cs for Extreme- scale Compu3ng, 30 July 2012

More information

Astroinformatics in the data-driven Astronomy

Astroinformatics in the data-driven Astronomy Astroinformatics in the data-driven Astronomy Massimo Brescia 2017 ICT Workshop Astronomy vs Astroinformatics Most of the initial time has been spent to find a common language among communities How astronomers

More information

STATISTICS-STAT (STAT)

STATISTICS-STAT (STAT) Statistics-STAT (STAT) 1 STATISTICS-STAT (STAT) Courses STAT 158 Introduction to R Programming Credit: 1 (1-0-0) Programming using the R Project for the Statistical Computing. Data objects, for loops,

More information

DANIEL WILSON AND BEN CONKLIN. Integrating AI with Foundation Intelligence for Actionable Intelligence

DANIEL WILSON AND BEN CONKLIN. Integrating AI with Foundation Intelligence for Actionable Intelligence DANIEL WILSON AND BEN CONKLIN Integrating AI with Foundation Intelligence for Actionable Intelligence INTEGRATING AI WITH FOUNDATION INTELLIGENCE FOR ACTIONABLE INTELLIGENCE in an arms race for artificial

More information

Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland

Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland EnviroInfo 2004 (Geneva) Sh@ring EnviroInfo 2004 Advanced analysis and modelling tools for spatial environmental data. Case study: indoor radon data in Switzerland Mikhail Kanevski 1, Michel Maignan 1

More information

Statistical Analysis of fmrl Data

Statistical Analysis of fmrl Data Statistical Analysis of fmrl Data F. Gregory Ashby The MIT Press Cambridge, Massachusetts London, England Preface xi Acronyms xv 1 Introduction 1 What Is fmri? 2 The Scanning Session 4 Experimental Design

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16 Logis@cs CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Sign

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

arxiv: v1 [astro-ph.im] 20 Jan 2017

arxiv: v1 [astro-ph.im] 20 Jan 2017 IAU Symposium 325 on Astroinformatics Proceedings IAU Symposium No. xxx, xxx A.C. Editor, B.D. Editor & C.E. Editor, eds. c xxx International Astronomical Union DOI: 00.0000/X000000000000000X Deep learning

More information

SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA

SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA SPATIAL-TEMPORAL TECHNIQUES FOR PREDICTION AND COMPRESSION OF SOIL FERTILITY DATA D. Pokrajac Center for Information Science and Technology Temple University Philadelphia, Pennsylvania A. Lazarevic Computer

More information

Lectures in AstroStatistics: Topics in Machine Learning for Astronomers

Lectures in AstroStatistics: Topics in Machine Learning for Astronomers Lectures in AstroStatistics: Topics in Machine Learning for Astronomers Jessi Cisewski Yale University American Astronomical Society Meeting Wednesday, January 6, 2016 1 Statistical Learning - learning

More information

Challenges and Methods for Massive Astronomical Data

Challenges and Methods for Massive Astronomical Data Challenges and Methods for Massive Astronomical Data mini-workshop on Computational AstroStatistics The workshop is sponsored by CHASC/C-BAS, NSF grants DMS 09-07185 (HU) and DMS 09-07522 (UCI), and the

More information

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services

More information

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE Copyright 2013, SAS Institute Inc. All rights reserved. Agenda (Optional) History Lesson 2015 Buzzwords Machine Learning for X Citizen Data

More information

IE598 Big Data Optimization Introduction

IE598 Big Data Optimization Introduction IE598 Big Data Optimization Introduction Instructor: Niao He Jan 17, 2018 1 A little about me Assistant Professor, ISE & CSL UIUC, 2016 Ph.D. in Operations Research, M.S. in Computational Sci. & Eng. Georgia

More information

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R Statistical Clustering of Vesicle Patterns Mirko Birbaumer Rmetrics Workshop 3th July 2008 1 / 23 Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R Mirko

More information

Modeling radiocarbon in the Earth System. Radiocarbon summer school 2012

Modeling radiocarbon in the Earth System. Radiocarbon summer school 2012 Modeling radiocarbon in the Earth System Radiocarbon summer school 2012 Outline Brief introduc9on to mathema9cal modeling Single pool models Mul9ple pool models Model implementa9on Parameter es9ma9on What

More information

Course in Data Science

Course in Data Science Course in Data Science About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst. The course gives an

More information

Exploring the Patterns of Human Mobility Using Heterogeneous Traffic Trajectory Data

Exploring the Patterns of Human Mobility Using Heterogeneous Traffic Trajectory Data Exploring the Patterns of Human Mobility Using Heterogeneous Traffic Trajectory Data Jinzhong Wang April 13, 2016 The UBD Group Mobile and Social Computing Laboratory School of Software, Dalian University

More information

Sta$s$cs in astronomy and the SAMSI ASTRO Program

Sta$s$cs in astronomy and the SAMSI ASTRO Program Sta$s$cs in astronomy and the SAMSI ASTRO Program G. Jogesh Babu (with input from Eric Feigelson) Penn State Why astrosta$s$cs? Astronomers encounter a surprising variety of sta$s$cal problems in their

More information

ALL OF NONPARAMETRIC STATISTICS SOLUTIONS

ALL OF NONPARAMETRIC STATISTICS SOLUTIONS page 1 / 5 page 2 / 5 all of nonparametric statistics pdf Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions (common examples

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Mul$variate clustering. Astro 585 Spring 2013

Mul$variate clustering. Astro 585 Spring 2013 Mul$variate clustering Astro 585 Spring 2013 Astronomical context Astronomers have constructed classifica

More information

Outline Challenges of Massive Data Combining approaches Application: Event Detection for Astronomical Data Conclusion. Abstract

Outline Challenges of Massive Data Combining approaches Application: Event Detection for Astronomical Data Conclusion. Abstract Abstract The analysis of extremely large, complex datasets is becoming an increasingly important task in the analysis of scientific data. This trend is especially prevalent in astronomy, as large-scale

More information

Uncertainty quantification and visualization for functional random variables

Uncertainty quantification and visualization for functional random variables Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,

More information

Improved Astronomical Inferences via Nonparametric Density Estimation

Improved Astronomical Inferences via Nonparametric Density Estimation Improved Astronomical Inferences via Nonparametric Density Estimation Chad M. Schafer, InCA Group www.incagroup.org Department of Statistics Carnegie Mellon University Work Supported by NSF, NASA-AISR

More information

Exploiting Sparse Non-Linear Structure in Astronomical Data

Exploiting Sparse Non-Linear Structure in Astronomical Data Exploiting Sparse Non-Linear Structure in Astronomical Data Ann B. Lee Department of Statistics and Department of Machine Learning, Carnegie Mellon University Joint work with P. Freeman, C. Schafer, and

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

Statistical Applications in the Astronomy Literature II Jogesh Babu. Center for Astrostatistics PennState University, USA

Statistical Applications in the Astronomy Literature II Jogesh Babu. Center for Astrostatistics PennState University, USA Statistical Applications in the Astronomy Literature II Jogesh Babu Center for Astrostatistics PennState University, USA 1 The likelihood ratio test (LRT) and the related F-test Protassov et al. (2002,

More information

Mixture Models. Michael Kuhn

Mixture Models. Michael Kuhn Mixture Models Michael Kuhn 2017-8-26 Objec

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

ECE 3800 Probabilistic Methods of Signal and System Analysis

ECE 3800 Probabilistic Methods of Signal and System Analysis ECE 3800 Probabilistic Methods of Signal and System Analysis Dr. Bradley J. Bazuin Western Michigan University College of Engineering and Applied Sciences Department of Electrical and Computer Engineering

More information

Manual for a computer class in ML

Manual for a computer class in ML Manual for a computer class in ML November 3, 2015 Abstract This document describes a tour of Machine Learning (ML) techniques using tools in MATLAB. We point to the standard implementations, give example

More information

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM Lecture 9 SEM, Statistical Modeling, AI, and Data Mining I. Terminology of SEM Related Concepts: Causal Modeling Path Analysis Structural Equation Modeling Latent variables (Factors measurable, but thru

More information

Mining Digital Surveys for photo-z s and other things

Mining Digital Surveys for photo-z s and other things Mining Digital Surveys for photo-z s and other things Massimo Brescia & Stefano Cavuoti INAF Astr. Obs. of Capodimonte Napoli, Italy Giuseppe Longo Dept of Physics Univ. Federico II Napoli, Italy Results

More information

Astronomical time series (and more on R) Eric Feigelson. 3rd INPE Advanced School in Astrophysics: Astrostatistics 2009

Astronomical time series (and more on R) Eric Feigelson. 3rd INPE Advanced School in Astrophysics: Astrostatistics 2009 Astronomical time series (and more on R) Eric Feigelson 3rd INPE Advanced School in Astrophysics: Astrostatistics 2009 Outline 1 Time series in astronomy 2 Frequency domain methods 3 More about R 4 R in

More information

Latent Dirichlet Alloca/on

Latent Dirichlet Alloca/on Latent Dirichlet Alloca/on Blei, Ng and Jordan ( 2002 ) Presented by Deepak Santhanam What is Latent Dirichlet Alloca/on? Genera/ve Model for collec/ons of discrete data Data generated by parameters which

More information

Chart types and when to use them

Chart types and when to use them APPENDIX A Chart types and when to use them Pie chart Figure illustration of pie chart 2.3 % 4.5 % Browser Usage for April 2012 18.3 % 38.3 % Internet Explorer Firefox Chrome Safari Opera 35.8 % Pie chart

More information

Continuous Machine Learning

Continuous Machine Learning Continuous Machine Learning Kostiantyn Bokhan, PhD Project Lead at Samsung R&D Ukraine Kharkiv, October 2016 Agenda ML dev. workflows ML dev. issues ML dev. solutions Continuous machine learning (CML)

More information

Astronomy of the Next Decade: From Photons to Petabytes. R. Chris Smith AURA Observatory in Chile CTIO/Gemini/SOAR/LSST

Astronomy of the Next Decade: From Photons to Petabytes. R. Chris Smith AURA Observatory in Chile CTIO/Gemini/SOAR/LSST Astronomy of the Next Decade: From Photons to Petabytes R. Chris Smith AURA Observatory in Chile CTIO/Gemini/SOAR/LSST Classical Astronomy still dominates new facilities Even new large facilities (VLT,

More information

INTENSIVE COMPUTATION. Annalisa Massini

INTENSIVE COMPUTATION. Annalisa Massini INTENSIVE COMPUTATION Annalisa Massini 2015-2016 Course topics The course will cover topics that are in some sense related to intensive computation: Matlab (an introduction) GPU (an introduction) Sparse

More information

MULTIPLE EXPOSURES IN LARGE SURVEYS

MULTIPLE EXPOSURES IN LARGE SURVEYS MULTIPLE EXPOSURES IN LARGE SURVEYS / Johns Hopkins University Big Data? Noisy Skewed Artifacts Big Data? Noisy Skewed Artifacts Serious Issues Significant fraction of catalogs is junk GALEX 50% PS1 3PI

More information

A Course on Advanced Econometrics

A Course on Advanced Econometrics A Course on Advanced Econometrics Yongmiao Hong The Ernest S. Liu Professor of Economics & International Studies Cornell University Course Introduction: Modern economies are full of uncertainties and risk.

More information

Summary Report: MAA Program Study Group on Computing and Computational Science

Summary Report: MAA Program Study Group on Computing and Computational Science Summary Report: MAA Program Study Group on Computing and Computational Science Introduction and Themes Henry M. Walker, Grinnell College (Chair) Daniel Kaplan, Macalester College Douglas Baldwin, SUNY

More information

Modeling User s Cognitive Dynamics in Information Access and Retrieval using Quantum Probability (ESR-6)

Modeling User s Cognitive Dynamics in Information Access and Retrieval using Quantum Probability (ESR-6) Modeling User s Cognitive Dynamics in Information Access and Retrieval using Quantum Probability (ESR-6) SAGAR UPRETY THE OPEN UNIVERSITY, UK QUARTZ Mid-term Review Meeting, Padova, Italy, 07/12/2018 Self-Introduction

More information

Big Data Inference. Combining Hierarchical Bayes and Machine Learning to Improve Photometric Redshifts. Josh Speagle 1,

Big Data Inference. Combining Hierarchical Bayes and Machine Learning to Improve Photometric Redshifts. Josh Speagle 1, ICHASC, Apr. 25 2017 Big Data Inference Combining Hierarchical Bayes and Machine Learning to Improve Photometric Redshifts Josh Speagle 1, jspeagle@cfa.harvard.edu In collaboration with: Boris Leistedt

More information

An Introduc+on to Sta+s+cs and Machine Learning for Quan+ta+ve Biology. Anirvan Sengupta Dept. of Physics and Astronomy Rutgers University

An Introduc+on to Sta+s+cs and Machine Learning for Quan+ta+ve Biology. Anirvan Sengupta Dept. of Physics and Astronomy Rutgers University An Introduc+on to Sta+s+cs and Machine Learning for Quan+ta+ve Biology Anirvan Sengupta Dept. of Physics and Astronomy Rutgers University Why Do We Care? Necessity in today s labs Principled approach:

More information

Generative Model (Naïve Bayes, LDA)

Generative Model (Naïve Bayes, LDA) Generative Model (Naïve Bayes, LDA) IST557 Data Mining: Techniques and Applications Jessie Li, Penn State University Materials from Prof. Jia Li, sta3s3cal learning book (Has3e et al.), and machine learning

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

Astroinformatics: massive data research in Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University

Astroinformatics: massive data research in Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University Astroinformatics: massive data research in Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University kborne@gmu.edu, http://classweb.gmu.edu/kborne/ Ever since humans first gazed

More information

Big Computing in High Energy Physics. David Toback Department of Physics and Astronomy Mitchell Institute for Fundamental Physics and Astronomy

Big Computing in High Energy Physics. David Toback Department of Physics and Astronomy Mitchell Institute for Fundamental Physics and Astronomy Big Computing in High Energy Physics Big Data Workshop Department of Physics and Astronomy Mitchell Institute for Fundamental Physics and Astronomy October 2011 Outline Particle Physics and Big Computing

More information

2005 ICPSR SUMMER PROGRAM REGRESSION ANALYSIS III: ADVANCED METHODS

2005 ICPSR SUMMER PROGRAM REGRESSION ANALYSIS III: ADVANCED METHODS 2005 ICPSR SUMMER PROGRAM REGRESSION ANALYSIS III: ADVANCED METHODS William G. Jacoby Department of Political Science Michigan State University June 27 July 22, 2005 This course will take a modern, data-analytic

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Package BNN. February 2, 2018

Package BNN. February 2, 2018 Type Package Package BNN February 2, 2018 Title Bayesian Neural Network for High-Dimensional Nonlinear Variable Selection Version 1.0.2 Date 2018-02-02 Depends R (>= 3.0.2) Imports mvtnorm Perform Bayesian

More information

Models to carry out inference vs. Models to mimic (spatio-temporal) systems 5/5/15

Models to carry out inference vs. Models to mimic (spatio-temporal) systems 5/5/15 Models to carry out inference vs. Models to mimic (spatio-temporal) systems 5/5/15 Ring-Shaped Hotspot Detection: A Summary of Results, IEEE ICDM 2014 (w/ E. Eftelioglu et al.) Where is a crime source?

More information

A CUDA Solver for Helmholtz Equation

A CUDA Solver for Helmholtz Equation Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College

More information

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland

MapReduce in Spark. Krzysztof Dembczyński. Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland MapReduce in Spark Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, second semester

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Statistícal Methods for Spatial Data Analysis

Statistícal Methods for Spatial Data Analysis Texts in Statistícal Science Statistícal Methods for Spatial Data Analysis V- Oliver Schabenberger Carol A. Gotway PCT CHAPMAN & K Contents Preface xv 1 Introduction 1 1.1 The Need for Spatial Analysis

More information

Econometric Analysis of Network Data. Georgetown Center for Economic Practice

Econometric Analysis of Network Data. Georgetown Center for Economic Practice Econometric Analysis of Network Data Georgetown Center for Economic Practice May 8th and 9th, 2017 Course Description This course will provide an overview of econometric methods appropriate for the analysis

More information

Language GTC April 7, These slides at Data Science Applications of GPUs in the R.

Language GTC April 7, These slides at   Data Science Applications of GPUs in the R. Data Science April 7, 2016 These slides at http://heather.cs.ucdavis.edu/gtc.pdf Why R? Why R? The lingua franca for the data science community. (R-Python-Julia battle looming?) Statistically Correct:

More information

PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY

PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY Arto Klami Adapted from my talk in AIHelsinki seminar Dec 15, 2016 1 MOTIVATING INTRODUCTION Most of the artificial intelligence success stories

More information

Report and Opinion 2016;8(6) Analysis of bivariate correlated data under the Poisson-gamma model

Report and Opinion 2016;8(6)   Analysis of bivariate correlated data under the Poisson-gamma model Analysis of bivariate correlated data under the Poisson-gamma model Narges Ramooz, Farzad Eskandari 2. MSc of Statistics, Allameh Tabatabai University, Tehran, Iran 2. Associate professor of Statistics,

More information

regression modelling wih spatial pdf Beginner's Guide to Spatial, Temporal and Spatial-Temporal

regression modelling wih spatial pdf Beginner's Guide to Spatial, Temporal and Spatial-Temporal DOWNLOAD OR READ : REGRESSION MODELLING WIH SPATIAL AND SPATIAL TEMPORAL DATA A BAYESIAN APPROACHBAYESIAN STATISTICS AND MARKETINGAPPLIED BAYESIAN FORECASTING AND TIME SERIES ANALYSIS PDF EBOOK EPUB MOBI

More information

Advanced Computing Systems for Scientific Research

Advanced Computing Systems for Scientific Research Undergraduate Review Volume 10 Article 13 2014 Advanced Computing Systems for Scientific Research Jared Buckley Jason Covert Talia Martin Recommended Citation Buckley, Jared; Covert, Jason; and Martin,

More information

Tutorial on Probabilistic Programming with PyMC3

Tutorial on Probabilistic Programming with PyMC3 185.A83 Machine Learning for Health Informatics 2017S, VU, 2.0 h, 3.0 ECTS Tutorial 02-04.04.2017 Tutorial on Probabilistic Programming with PyMC3 florian.endel@tuwien.ac.at http://hci-kdd.org/machine-learning-for-health-informatics-course

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Data Intensive Computing meets High Performance Computing

Data Intensive Computing meets High Performance Computing Data Intensive Computing meets High Performance Computing Kathy Yelick Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory Professor of Electrical Engineering and

More information

ECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications

ECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications ECONOMICS 7200 MODERN TIME SERIES ANALYSIS Econometric Theory and Applications Yongmiao Hong Department of Economics & Department of Statistical Sciences Cornell University Spring 2019 Time and uncertainty

More information

Rick Ebert & Joseph Mazzarella For the NED Team. Big Data Task Force NASA, Ames Research Center 2016 September 28-30

Rick Ebert & Joseph Mazzarella For the NED Team. Big Data Task Force NASA, Ames Research Center 2016 September 28-30 NED Mission: Provide a comprehensive, reliable and easy-to-use synthesis of multi-wavelength data from NASA missions, published catalogs, and the refereed literature, to enhance and enable astrophysical

More information

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist

More information

Statistical Methods for Astronomy

Statistical Methods for Astronomy Statistical Methods for Astronomy If your experiment needs statistics, you ought to have done a better experiment. -Ernest Rutherford Lecture 1 Lecture 2 Why do we need statistics? Definitions Statistical

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week: Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week: Course general information About the course Course objectives Comparative methods: An overview R as language: uses and

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

ECE 3800 Probabilistic Methods of Signal and System Analysis

ECE 3800 Probabilistic Methods of Signal and System Analysis ECE 3800 Probabilistic Methods of Signal and System Analysis Dr. Bradley J. Bazuin Western Michigan University College of Engineering and Applied Sciences Department of Electrical and Computer Engineering

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

High-Performance Scientific Computing

High-Performance Scientific Computing High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org

More information