Data-Intensive Statistical Challenges in Astrophysics

Size: px
Start display at page:

Download "Data-Intensive Statistical Challenges in Astrophysics"

Transcription

1 Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Lots of contributions from Tamas Budavari (JHU)

2 Scientific Data Analysis Today Scientific data is doubling every year, reaching PBs Data is everywhere, never will be at a single location Architectures increasingly CPU-heavy, IO-poor Data-intensive scalable architectures needed Databases are a good starting point Scientists spend 80% of time on menial tasks Scientists need special features (arrays, GPUs) Most data analysis done on midsize BeoWulf clusters Universities hitting the power wall Soon we cannot even store the incoming data stream Not scalable, not maintainable

3 Non-Incremental Changes Science is moving increasingly from hypothesisdriven to data-driven discoveries Need new randomized, incremental algorithms Best result in 1 min, 1 hour, 1 day, 1 week New computational tools and strategies not just statistics, not just computer science, not just astronomy, not just genomics Need new data intensive scalable architectures Astronomy has always been data-driven. now becoming more generally accepted

4 Scalable Data-Intensive Analysis Large data sets => data resides on hard disks Analysis has to move to the data Hard disks are becoming sequential devices For a PB data set you cannot use a random access pattern Both analysis and visualization become streaming problems Same thing is true with searches Massively parallel sequential crawlers (MR, Hadoop, etc) Spatial indexing needs to be maximally sequential Space filling curves (Peano-Hilbert, Morton, )

5 Why Astronomy Data? Approach inherently and traditionally data-driven Important spatio-temporal features Very large density contrasts in populations Real errors and covariances Many signals very subtle, buried in systematics Data sets very large, pushing scalability Worthless! Jim Gray

6 The Age of Surveys CMB Surveys (pixels) 1990 COBE Boomerang 10, CBI 50, WMAP 1 Million 2008 Planck 10 Million Angular Galaxy Surveys (obj) 1970 Lick 1M 1990 APM 2M 2005 SDSS 200M 2011 PS1 1000M 2020 LSST 30000M Time Domain QUEST SDSS Extension survey Dark Energy Camera Pan-STARRS LSST Galaxy Redshift Surveys (obj) 1986 CfA LCRS dF SDSS BOSS LAMOST Petabytes/year

7 Sloan Digital Sky Survey The Cosmic Genome Project Two surveys in one Photometric survey in 5 bands Spectroscopic redshift survey Data is public 2.5 Terapixels of images => 5 Tpx 10 TB of raw data => 120TB processed 0.5 TB catalogs => 35TB in the end Started in 1992, finished in 2008 Extra data volume enabled by Moore s Law Kryder s Law

8 Astro-Statistical Challenges The crossmatch problem (multi-, time domain) Photometric redshifts (prediction/regression problem) Correlations (auto/cross, higher order) Outlier detection in many dimensions Statistical errors vs systematics Comparing observations to models comparing distributions, updating models The unknown unknown, when we have no models.. Exascale streams are coming (Square Km Array) Scalability!!!

9 Relevant Astro Challenges I. Analysis of spectra (multispectral data) Sparse signal in large dimensions Much noise, and very rare events 4Kx1M SVD problem, perfect for randomized algorithms Motivated our work on robust incremental PCA Photometric measurements (segmentation) Currently 500M rows x 500 attributes Noisy, missing data Spatial, temporal searches and associations Dominated by unknown systematic errors Small number of subtle outliers, buried in very large typical population Very skewed distributions

10 Galaxy Diversity from PCA PC 1st [Average Spectrum] 2nd [Stellar Continuum] 3rd [Finer Continuum Features + Age] 4th 5th [Age] Balmer series hydrogen lines [Metallicity] Mg b, Na D, Ca II Triplet

11 Streaming PCA Initialization Eigensystem of a small, random subset Truncate at p largest eigenvalues Incremental updates Mean and the low-rank A matrix SVD of A yields new eigensystem Randomized algorithm! T. Budavari, D. Mishin 2011

12 Robust PCA PCA minimizes σ RMS of the residuals r = y Py Quadratic formula: r 2 extremely sensitive to outliers We optimize a robust M-scale σ 2 (Maronna 2005) Implicitly given by Fits in with the iterative method! Outliers can be processed separately

13 Eigenvalues in Streaming PCA 13 Classic Robust

14 Examples with SDSS Built on top of the Incremental Robust PCA Principal Component Pursuit (I. Csabai) CX decomposition (C-W Yip)

15 Principal component pursuit Low rank approximation of data matrix: X Standard PCA: min X E subject to rank( E) k 2 works well if the noise distribution is Gaussian outliers can cause bias Principal component pursuit min subjectto sparse spiky noise/outliers: try to minimize the number of outliers while keeping the rank low NP-hard problem A 0 X N A, rank( N) k The L1 trick: min N A subjectto X N A * N, A 1 numerically feasible convex problem (Augmented Lagrange Multiplier) min N, A N A subjectto X ( N A) * 1 2 * E. Candes, et al. Robust Principal Component Analysis. preprint, Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection)

16 Testing on Galaxy Spectra Slowly varying continuum + absorption lines Highly variable sparse emission lines This is the simple version of PCP: the position of the lines are known but there are many of them, automatic detection can be useful spiky noise can bias standard PCA DATA: Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.) SDSS 1M galaxy spectra Morphological subclasses Robust averages + first few PCA directions

17 PCA PCA reconstruction Residual

18 Principal component pursuit Low rank Sparse Residual λ=0.6/sqrt(n), ε=0.03

19 Galaxy ID Galaxy ID Not Every Data Direction is Equal Wavelength Selected Wavelengths Wavelength A = C X Selected Wavelengths Procedure: 1. Perform SVD of A = U V T 2. Pick number of eigenvectors = K 3. Calculate Leverage Score = i V T ij 2 / K Mahoney and Drineas 2009

20 Wavelength Sampling Probability k = 2 c = 7 k = 4 c = 16 k = 6 c = 25 k = 8 c = 29

21 Ranking Astronomical Line Indices Subspace Analysis of Spectra Cutouts: -Othogonality -Divergence -Commonality (Worthey et al. 94; Trager et al. 98) (Yip et al in prep.)

22 Identifying New Line Indices, Objectively (Yip et al in prep.)

23 Relevant Astro Challenges II. Photometric redshifts High dimensional embedding of a noisy low-dimensional process: guess galaxy distance from its colors Subtle statistical problem, with huge computational needs => scalability challenge Training set => regression/estimation engine Models constrained by Bayesian priors during training (Budavari) Need to detect atypical objects that do not fit known patterns Large-Scale Fuzzy Spatial Cross-Matching Soon we need to do spatial matches 100Bx100B data sets Experimenting with different approaches (DB/Hadoop/GPU)

24 Photometric Redshifts Normally, distances from Hubble s Law H r Measure the Doppler shift of spectral lines distance! v 0 But spectroscopy is very expensive SDSS: 640 spectra in 45 minutes vs. 300K 5 color images Future big surveys will have no spectra Idea: Multicolor images are like a crude spectrograph Statistical estimation of the redshifts/distances

25 Photometric Redshifts Phenomenological (PolyFit, ANNz, knn, RF) Simple, quite accurate, fairly robust Little physical insight, difficult to extrapolate, Malmquist Template-based (KL, HyperZ ) Simple, physical model Calibrations, templates, issues with accuracy Hybrid ( base learner ) Physical basis, adaptive Complicated, compute intensive Important for next generation surveys! We must understand the errors!

26 Recent Developments Random Forests S. Carliles, C. Priebe, A. Szalay, T. Budavari, S. Heinis (2009) Slightly better than other estimators Estimated errors close to Gaussian, and accurate Physically motivated removal of various systematics Inclination Self Absorption in a galaxy (Yip et al 2011) Effect of emission lines Unified theory of photometric redshifts (Budavari 2010) Not a regression problem Kernel density estimators, constrained by model priors

27 RF is like Kernel Estimator For a given query point each tree picks a neighborhood The ensemble of these neighborhood points defines a kernel The smaller the sampling rate from the training set, the broader the kernel => effective degrees of freedom N eff Central limit theorem convergence is driven by N eff RF is computationally a very efficient way to regress

28 Carliles et al 2009 Zspec vs Zrf

29 RF on Cyberbricks 36-node Amdahl cluster using 1200W total Zotac Atom/ION motherboards 4GB of memory, N330 dual core Atom, 16 GPU cores Aggregate disk space 43.6TB 63 x 120GB SSD = 7.7 TB 27x 1TB Samsung F1 = 27.0 TB 18x.5TB Samsung M1= 9.0 TB Blazing I/O Performance: 18GB/s Amdahl number = 1 for under $30K Using the GPUs for data mining: 6.4B multidimensional regressions (photo-z) in 5 minutes over 1.2TB of data Running the Random Forest algorithm inside the DB

30 Correlation Functions 2-point, 3 point 2 nd order, higher order Spatial, angular Various estimators Direct estimators, edge corrections (Ripley, Landy-Szalay) Various algorithms 1 2 Tree codes (a. Gray) GPU-based codes (Budavari)

31 The Impact of GPUs We need to reconsider the N logn only approach Once we can run 100K threads, maybe running SIMD N 2 on smaller partitions is also acceptable Recent JHU effort on integrating CUDA with SQL Server, using SQL UDF Galaxy spatial correlations: 600 trillion real and random galaxy pairs using brute force N 2 Much faster than the tree codes! This is because high resolution was needed Tian, Budavari, Neyrinck, Szalay 2010

32 Visualizing Petabytes Visualizations are becoming IO limited Needs to be done where the data is It is easier to send a HD 3D video stream to the user than all the data It is possible to build individual servers with extreme data rates (5GBps per server see Data-Scope) Prototype on turbulence simulation already works: data streaming directly from SQL Server to GPU Interactive visualizations driven remotely

33 Streaming Visualization Kai Buerger (visited JHU from Munich) Implemented high performance streaming from DB into the GPU, exceeding 1GB/sec Limited by the current I/O subsystem Using the velocity field from our 30TB turbulence database

34

35

36 JHU Data-Scope Revised 1P 1S All P All S Full servers rack units capacity TB price $K power kw GPU* TF seq IO GBps IOPS kiops netwk bw Gbps * Without the GPU costs (it is about $1,600/ card)

37 Summary Real multi-pb solutions are needed NOW! We have to build it ourselves Multi-faceted challenges No single trivial breakthrough, will need multiple thrusts Various scalable algorithms, new statistical techniques Adoption of new approaches is a difficult process Need to train people with new skill sets Π shaped people Increasing diversification Also many common patterns across all disciplines A new, Fourth Paradigm of science is emerging but it is not incremental.

Data-Intensive Statistical Challenges in Astrophysics

Data-Intensive Statistical Challenges in Astrophysics Data-Intensive Statistical Challenges in Astrophysics Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary) Alex Szalay The Johns Hopkins University The Age of

More information

COMPUTATIONAL STATISTICS IN ASTRONOMY: NOW AND SOON

COMPUTATIONAL STATISTICS IN ASTRONOMY: NOW AND SOON COMPUTATIONAL STATISTICS IN ASTRONOMY: NOW AND SOON / The Johns Hopkins University Recording Observations Astronomers drew it Now kids do it on the SkyServer #1 by Haley Multicolor Universe Eventful Universe

More information

MULTIPLE EXPOSURES IN LARGE SURVEYS

MULTIPLE EXPOSURES IN LARGE SURVEYS MULTIPLE EXPOSURES IN LARGE SURVEYS / Johns Hopkins University Big Data? Noisy Skewed Artifacts Big Data? Noisy Skewed Artifacts Serious Issues Significant fraction of catalogs is junk GALEX 50% PS1 3PI

More information

BAYESIAN CROSS-IDENTIFICATION IN ASTRONOMY

BAYESIAN CROSS-IDENTIFICATION IN ASTRONOMY BAYESIAN CROSS-IDENTIFICATION IN ASTRONOMY / The Johns Hopkins University Sketch by William Parsons (1845) 2 Recording Observations Whirlpool Galaxy M51 Discovered by Charles Messier (1773) Multicolor

More information

Astroinformatics in the data-driven Astronomy

Astroinformatics in the data-driven Astronomy Astroinformatics in the data-driven Astronomy Massimo Brescia 2017 ICT Workshop Astronomy vs Astroinformatics Most of the initial time has been spent to find a common language among communities How astronomers

More information

PRACTICAL ANALYTICS 7/19/2012. Tamás Budavári / The Johns Hopkins University

PRACTICAL ANALYTICS 7/19/2012. Tamás Budavári / The Johns Hopkins University PRACTICAL ANALYTICS / The Johns Hopkins University Statistics Of numbers Of vectors Of functions Of trees Statistics Description, modeling, inference, machine learning Bayesian / Frequentist / Pragmatist?

More information

The SDSS is Two Surveys

The SDSS is Two Surveys The SDSS is Two Surveys The Fuzzy Blob Survey The Squiggly Line Survey The Site The telescope 2.5 m mirror Digital Cameras 1.3 MegaPixels $150 4.3 Megapixels $850 100 GigaPixels $10,000,000 CCDs CCDs:

More information

Data analysis of massive data sets a Planck example

Data analysis of massive data sets a Planck example Data analysis of massive data sets a Planck example Radek Stompor (APC) LOFAR workshop, Meudon, 29/03/06 Outline 1. Planck mission; 2. Planck data set; 3. Planck data analysis plan and challenges; 4. Planck

More information

Exploiting Sparse Non-Linear Structure in Astronomical Data

Exploiting Sparse Non-Linear Structure in Astronomical Data Exploiting Sparse Non-Linear Structure in Astronomical Data Ann B. Lee Department of Statistics and Department of Machine Learning, Carnegie Mellon University Joint work with P. Freeman, C. Schafer, and

More information

arxiv: v1 [astro-ph.im] 9 May 2013

arxiv: v1 [astro-ph.im] 9 May 2013 Astron. Nachr. / AN Volume, No. Issue, 1 4 (212) / DOI DOI Photo-Met: a non-parametric method for estimating stellar metallicity from photometric observations Gyöngyi Kerekes 1, István Csabai 1, László

More information

Design and implementation of the spectra reduction and analysis software for LAMOST telescope

Design and implementation of the spectra reduction and analysis software for LAMOST telescope Design and implementation of the spectra reduction and analysis software for LAMOST telescope A-Li Luo *a, Yan-Xia Zhang a and Yong-Heng Zhao a *a National Astronomical Observatories, Chinese Academy of

More information

Some Statistical Aspects of Photometric Redshift Estimation

Some Statistical Aspects of Photometric Redshift Estimation Some Statistical Aspects of Photometric Redshift Estimation James Long November 9, 2016 1 / 32 Photometric Redshift Estimation Background Non Negative Least Squares Non Negative Matrix Factorization EAZY

More information

Astronomy of the Next Decade: From Photons to Petabytes. R. Chris Smith AURA Observatory in Chile CTIO/Gemini/SOAR/LSST

Astronomy of the Next Decade: From Photons to Petabytes. R. Chris Smith AURA Observatory in Chile CTIO/Gemini/SOAR/LSST Astronomy of the Next Decade: From Photons to Petabytes R. Chris Smith AURA Observatory in Chile CTIO/Gemini/SOAR/LSST Classical Astronomy still dominates new facilities Even new large facilities (VLT,

More information

Probabilistic photometric redshifts in the era of Petascale Astronomy

Probabilistic photometric redshifts in the era of Petascale Astronomy Probabilistic photometric redshifts in the era of Petascale Astronomy Matías Carrasco Kind NCSA/Department of Astronomy University of Illinois at Urbana-Champaign Tools for Astronomical Big Data March

More information

Quantifying correlations between galaxy emission lines and stellar continua

Quantifying correlations between galaxy emission lines and stellar continua Quantifying correlations between galaxy emission lines and stellar continua R. Beck, L. Dobos, C.W. Yip, A.S. Szalay and I. Csabai 2016 astro-ph: 1601.0241 1 Introduction / Technique Data Emission line

More information

Astroinformatics: massive data research in Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University

Astroinformatics: massive data research in Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University Astroinformatics: massive data research in Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University kborne@gmu.edu, http://classweb.gmu.edu/kborne/ Ever since humans first gazed

More information

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE

CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE CONTEMPORARY ANALYTICAL ECOSYSTEM PATRICK HALL, SAS INSTITUTE Copyright 2013, SAS Institute Inc. All rights reserved. Agenda (Optional) History Lesson 2015 Buzzwords Machine Learning for X Citizen Data

More information

The SDSS Data. Processing the Data

The SDSS Data. Processing the Data The SDSS Data Processing the Data On a clear, dark night, light that has traveled through space for a billion years touches a mountaintop in southern New Mexico and enters the sophisticated instrumentation

More information

A SPEctra Clustering Tool for the exploration of large spectroscopic surveys. Philipp Schalldach (HU Berlin & TLS Tautenburg, Germany)

A SPEctra Clustering Tool for the exploration of large spectroscopic surveys. Philipp Schalldach (HU Berlin & TLS Tautenburg, Germany) A SPEctra Clustering Tool for the exploration of large spectroscopic surveys Philipp Schalldach (HU Berlin & TLS Tautenburg, Germany) Working Group Helmut Meusinger (Tautenburg, Germany) Philipp Schalldach

More information

HUBBLE SOURCE CATALOG. Steve Lubow Space Telescope Science Institute June 6, 2012

HUBBLE SOURCE CATALOG. Steve Lubow Space Telescope Science Institute June 6, 2012 HUBBLE SOURCE CATALOG Steve Lubow Space Telescope Science Institute June 6, 2012 HUBBLE SPACE TELESCOPE 2.4 m telescope launched in 1990 In orbit above Earth s atmosphere High resolution, faint sources

More information

Automatic Unsupervised Spectral Classification of all SDSS/DR7 Galaxies

Automatic Unsupervised Spectral Classification of all SDSS/DR7 Galaxies Automatic Unsupervised Spectral Classification of all SDSS/DR7 Galaxies J. Sánchez Almeida, J. A. L. Aguerri, C. Muñoz-Tuñón, A. de Vicente Instituto de Astrofisica de Canarias, Spain 2010, ApJ, 714, 487

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Real Astronomy from Virtual Observatories

Real Astronomy from Virtual Observatories THE US NATIONAL VIRTUAL OBSERVATORY Real Astronomy from Virtual Observatories Robert Hanisch Space Telescope Science Institute US National Virtual Observatory About this presentation What is a Virtual

More information

Data-driven models of stars

Data-driven models of stars Data-driven models of stars David W. Hogg Center for Cosmology and Particle Physics, New York University Center for Data Science, New York University Max-Planck-Insitut für Astronomie, Heidelberg 2015

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Classifying Galaxy Morphology using Machine Learning

Classifying Galaxy Morphology using Machine Learning Julian Kates-Harbeck, Introduction: Classifying Galaxy Morphology using Machine Learning The goal of this project is to classify galaxy morphologies. Generally, galaxy morphologies fall into one of two

More information

COMPRESSED AND PENALIZED LINEAR

COMPRESSED AND PENALIZED LINEAR COMPRESSED AND PENALIZED LINEAR REGRESSION Daniel J. McDonald Indiana University, Bloomington mypage.iu.edu/ dajmcdon 2 June 2017 1 OBLIGATORY DATA IS BIG SLIDE Modern statistical applications genomics,

More information

Data Release 5. Sky coverage of imaging data in the DR5

Data Release 5. Sky coverage of imaging data in the DR5 Data Release 5 The Sloan Digital Sky Survey has released its fifth Data Release (DR5). The spatial coverage of DR5 is about 20% larger than that of DR4. The photometric data in DR5 are based on five band

More information

Big Bang, Big Iron: CMB Data Analysis at the Petascale and Beyond

Big Bang, Big Iron: CMB Data Analysis at the Petascale and Beyond Big Bang, Big Iron: CMB Data Analysis at the Petascale and Beyond Julian Borrill Computational Cosmology Center, LBL & Space Sciences Laboratory, UCB with Christopher Cantalupo, Theodore Kisner, Radek

More information

Precise measurement of the radial BAO scale in galaxy redshift surveys

Precise measurement of the radial BAO scale in galaxy redshift surveys Precise measurement of the radial BAO scale in galaxy redshift surveys David Alonso (Universidad Autónoma de Madrid - IFT) Based on the work arxiv:1210.6446 In collaboration with: E. Sánchez, F. J. Sánchez,

More information

Cosmology of Photometrically- Classified Type Ia Supernovae

Cosmology of Photometrically- Classified Type Ia Supernovae Cosmology of Photometrically- Classified Type Ia Supernovae Campbell et al. 2013 arxiv:1211.4480 Heather Campbell Collaborators: Bob Nichol, Chris D'Andrea, Mat Smith, Masao Sako and all the SDSS-II SN

More information

Fast Hierarchical Clustering from the Baire Distance

Fast Hierarchical Clustering from the Baire Distance Fast Hierarchical Clustering from the Baire Distance Pedro Contreras 1 and Fionn Murtagh 1,2 1 Department of Computer Science. Royal Holloway, University of London. 57 Egham Hill. Egham TW20 OEX, England.

More information

Modern Image Processing Techniques in Astronomical Sky Surveys

Modern Image Processing Techniques in Astronomical Sky Surveys Modern Image Processing Techniques in Astronomical Sky Surveys Items of the PhD thesis József Varga Astronomy MSc Eötvös Loránd University, Faculty of Science PhD School of Physics, Programme of Particle

More information

Virtual Sensors and Large-Scale Gaussian Processes

Virtual Sensors and Large-Scale Gaussian Processes Virtual Sensors and Large-Scale Gaussian Processes Ashok N. Srivastava, Ph.D. Principal Investigator, IVHM Project Group Lead, Intelligent Data Understanding ashok.n.srivastava@nasa.gov Coauthors: Kamalika

More information

RLW paper titles:

RLW paper titles: RLW paper titles: http://www.wordle.net Astronomical Surveys and Data Archives Richard L. White Space Telescope Science Institute HiPACC Summer School, July 2012 Overview Surveys & catalogs: Fundamental

More information

Dimensionality Reduction and Principle Components Analysis

Dimensionality Reduction and Principle Components Analysis Dimensionality Reduction and Principle Components Analysis 1 Outline What is dimensionality reduction? Principle Components Analysis (PCA) Example (Bishop, ch 12) PCA vs linear regression PCA as a mixture

More information

Introduction to the Sloan Survey

Introduction to the Sloan Survey Introduction to the Sloan Survey Title Rita Sinha IUCAA SDSS The SDSS uses a dedicated, 2.5-meter telescope on Apache Point, NM, equipped with two powerful special-purpose instruments. The 120-megapixel

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

Statistics of the Universe: Exa-calculations and Cosmology's Data Deluge

Statistics of the Universe: Exa-calculations and Cosmology's Data Deluge Statistics of the Universe: Exa-calculations and Cosmology's Data Deluge Debbie Bard Matt Bellis Cosmology: the study of the nature and history of the Universe History of Universe driven by competing forces:

More information

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM SPATIAL DATA MINING Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM INTRODUCTION The main difference between data mining in relational DBS and in spatial DBS is that attributes of the neighbors

More information

9.1 Large Scale Structure: Basic Observations and Redshift Surveys

9.1 Large Scale Structure: Basic Observations and Redshift Surveys 9.1 Large Scale Structure: Basic Observations and Redshift Surveys Large-Scale Structure Density fluctuations evolve into structures we observe: galaxies, clusters, etc. On scales > galaxies, we talk about

More information

Dictionary Learning for photo-z estimation

Dictionary Learning for photo-z estimation Dictionary Learning for photo-z estimation Joana Frontera-Pons, Florent Sureau, Jérôme Bobin 5th September 2017 - Workshop Dictionary Learning on Manifolds MOTIVATION! Goal : Measure the radial positions

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

Nonlinear Data Transformation with Diffusion Map

Nonlinear Data Transformation with Diffusion Map Nonlinear Data Transformation with Diffusion Map Peter Freeman Ann Lee Joey Richards* Chad Schafer Department of Statistics Carnegie Mellon University www.incagroup.org * now at U.C. Berkeley t! " 3 3

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Studying galaxies with the Sloan Digital Sky Survey

Studying galaxies with the Sloan Digital Sky Survey Studying galaxies with the Sloan Digital Sky Survey Laboratory exercise, Physics of Galaxies, Spring 2017 (Uppsala Universitet) by Beatriz Villarroel *** The Sloan Digital Sky Survey (SDSS) is the largest

More information

Gaussian Process Modeling in Cosmology: The Coyote Universe. Katrin Heitmann, ISR-1, LANL

Gaussian Process Modeling in Cosmology: The Coyote Universe. Katrin Heitmann, ISR-1, LANL 1998 2015 Gaussian Process Modeling in Cosmology: The Coyote Universe Katrin Heitmann, ISR-1, LANL In collaboration with: Jim Ahrens, Salman Habib, David Higdon, Chung Hsu, Earl Lawrence, Charlie Nakhleh,

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Cosmology & Culture. Lecture 4 Wednesday April 22, 2009 The Composition of the Universe, & The Cosmic Spheres of Time.

Cosmology & Culture. Lecture 4 Wednesday April 22, 2009 The Composition of the Universe, & The Cosmic Spheres of Time. Cosmology & Culture Lecture 4 Wednesday April 22, 2009 The Composition of the Universe, & The Cosmic Spheres of Time UCSC Physics 80C Why do scientists take seriously the Double Dark cosmology, which says

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Dark Matter on Small Scales: Merging and Cosmogony. David W. Hogg New York University CCPP

Dark Matter on Small Scales: Merging and Cosmogony. David W. Hogg New York University CCPP Dark Matter on Small Scales: Merging and Cosmogony David W. Hogg New York University CCPP summary galaxy merger rates suggest growth of ~1 percent per Gyr galaxy evolution is over can we rule out CDM now

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Learning an astronomical catalog of the visible universe through scalable Bayesian inference

Learning an astronomical catalog of the visible universe through scalable Bayesian inference Learning an astronomical catalog of the visible universe through scalable Bayesian inference Jeffrey Regier with the DESI Collaboration Department of Electrical Engineering and Computer Science UC Berkeley

More information

Introductory Review on BAO

Introductory Review on BAO Introductory Review on BAO David Schlegel Lawrence Berkeley National Lab 1. What are BAO? How does it measure dark energy? 2. Current observations From 3-D maps From 2-D maps (photo-z) 3. Future experiments

More information

Caesar s Taxi Prediction Services

Caesar s Taxi Prediction Services 1 Caesar s Taxi Prediction Services Predicting NYC Taxi Fares, Trip Distance, and Activity Paul Jolly, Boxiao Pan, Varun Nambiar Abstract In this paper, we propose three models each predicting either taxi

More information

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics Chengjie Qin 1 and Florin Rusu 2 1 GraphSQL, Inc. 2 University of California Merced June 29, 2017 Machine Learning (ML) Is

More information

Photometric Redshifts with DAME

Photometric Redshifts with DAME Photometric Redshifts with DAME O. Laurino, R. D Abrusco M. Brescia, G. Longo & DAME Working Group VO-Day... in Tour Napoli, February 09-0, 200 The general astrophysical problem Due to new instruments

More information

The Sloan Digital Sky Survey. Sebastian Jester Experimental Astrophysics Group Fermilab

The Sloan Digital Sky Survey. Sebastian Jester Experimental Astrophysics Group Fermilab The Sloan Digital Sky Survey Sebastian Jester Experimental Astrophysics Group Fermilab SLOAN DIGITAL SKY SURVEY Sloan Digital Sky Survey Goals: 1. Image ¼ of sky in 5 bands 2. Measure parameters of objects

More information

COMP 551 Applied Machine Learning Lecture 13: Dimension reduction and feature selection

COMP 551 Applied Machine Learning Lecture 13: Dimension reduction and feature selection COMP 551 Applied Machine Learning Lecture 13: Dimension reduction and feature selection Instructor: Herke van Hoof (herke.vanhoof@cs.mcgill.ca) Based on slides by:, Jackie Chi Kit Cheung Class web page:

More information

A Unified Theory of Photometric Redshifts. Tamás Budavári The Johns Hopkins University

A Unified Theory of Photometric Redshifts. Tamás Budavári The Johns Hopkins University A Unified Theory of Photometric Redshifts Tamás Budavári The Johns Hopkins University Haunting Questions Two classes of methods: empirical and template fitting Why are they so different? Anything fundamental?

More information

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Joint work with: Ian Foster: Univ. of

More information

Recent BAO observations and plans for the future. David Parkinson University of Sussex, UK

Recent BAO observations and plans for the future. David Parkinson University of Sussex, UK Recent BAO observations and plans for the future David Parkinson University of Sussex, UK Baryon Acoustic Oscillations SDSS GALAXIES CMB Comparing BAO with the CMB CREDIT: WMAP & SDSS websites FLAT GEOMETRY

More information

Dark Energy. Cluster counts, weak lensing & Supernovae Ia all in one survey. Survey (DES)

Dark Energy. Cluster counts, weak lensing & Supernovae Ia all in one survey. Survey (DES) Dark Energy Cluster counts, weak lensing & Supernovae Ia all in one survey Survey (DES) What is it? The DES Collaboration will build and use a wide field optical imager (DECam) to perform a wide area,

More information

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University Lecture 4: Principal Component Analysis Aykut Erdem May 016 Hacettepe University This week Motivation PCA algorithms Applications PCA shortcomings Autoencoders Kernel PCA PCA Applications Data Visualization

More information

Astronomy 102: Stars and Galaxies Review Exam 3

Astronomy 102: Stars and Galaxies Review Exam 3 October 31, 2004 Name: Astronomy 102: Stars and Galaxies Review Exam 3 Instructions: Write your answers in the space provided; indicate clearly if you continue on the back of a page. No books, notes, or

More information

Statistics in Astronomy (II) applications. Shen, Shiyin

Statistics in Astronomy (II) applications. Shen, Shiyin Statistics in Astronomy (II) applications Shen, Shiyin Contents I. Linear fitting: scaling relations I.5 correlations between parameters II. Luminosity distribution function Vmax VS maximum likelihood

More information

Dimensionality Reduction

Dimensionality Reduction Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball

More information

Unveiling the misteries of our Galaxy with Intersystems Caché

Unveiling the misteries of our Galaxy with Intersystems Caché Unveiling the misteries of our Galaxy with Intersystems Caché Dr. Jordi Portell i de Mora on behalf of the Gaia team at the Institute for Space Studies of Catalonia (IEEC UB) and the Gaia Data Processing

More information

9. High-level processing (astronomical data analysis)

9. High-level processing (astronomical data analysis) Master ISTI / PARI / IV Introduction to Astronomical Image Processing 9. High-level processing (astronomical data analysis) André Jalobeanu LSIIT / MIV / PASEO group Jan. 2006 lsiit-miv.u-strasbg.fr/paseo

More information

Cosmology on the Beach: Experiment to Cosmology

Cosmology on the Beach: Experiment to Cosmology Image sky Select targets Design plug-plates Plug fibers Observe! Extract spectra Subtract sky spec. Cosmology on the Beach: Experiment to Cosmology Fit redshift Make 3-D map Test physics! David Schlegel!1

More information

A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011

A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 Lecture 1 Organization:» Syllabus (text, requirements, topics)» Course approach (goals, themes) Book: Gregory, Bayesian

More information

Today. Lookback time. ASTR 1020: Stars & Galaxies. Astronomy Picture of the day. April 2, 2008

Today. Lookback time. ASTR 1020: Stars & Galaxies. Astronomy Picture of the day. April 2, 2008 ASTR 1020: Stars & Galaxies April 2, 2008 Astronomy Picture of the day Reading: Chapter 21, sections 21.3. MasteringAstronomy Homework on Galaxies and Hubble s Law is due April 7 th. Weak Lensing Distorts

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Algorithms for Higher Order Spatial Statistics

Algorithms for Higher Order Spatial Statistics Algorithms for Higher Order Spatial Statistics István Szapudi Institute for Astronomy University of Hawaii Future of AstroComputing Conference, SDSC, Dec 16-17 Outline Introduction 1 Introduction 2 Random

More information

From quasars to dark energy Adventures with the clustering of luminous red galaxies

From quasars to dark energy Adventures with the clustering of luminous red galaxies From quasars to dark energy Adventures with the clustering of luminous red galaxies Nikhil Padmanabhan 1 1 Lawrence Berkeley Labs 04-15-2008 / OSU CCAPP seminar N. Padmanabhan (LBL) Cosmology with LRGs

More information

Power spectrum exercise

Power spectrum exercise Power spectrum exercise In this exercise, we will consider different power spectra and how they relate to observations. The intention is to give you some intuition so that when you look at a microwave

More information

Automated Classification of HETDEX Spectra. Ted von Hippel (U Texas, Siena, ERAU)

Automated Classification of HETDEX Spectra. Ted von Hippel (U Texas, Siena, ERAU) Automated Classification of HETDEX Spectra Ted von Hippel (U Texas, Siena, ERAU) Penn State HETDEX Meeting May 19-20, 2011 Outline HETDEX with stable instrumentation and lots of data ideally suited to

More information

From DES to LSST. Transient Processing Goes from Hours to Seconds. Eric Morganson, NCSA LSST Time Domain Meeting Tucson, AZ May 22, 2017

From DES to LSST. Transient Processing Goes from Hours to Seconds. Eric Morganson, NCSA LSST Time Domain Meeting Tucson, AZ May 22, 2017 From DES to LSST Transient Processing Goes from Hours to Seconds Eric Morganson, NCSA LSST Time Domain Meeting Tucson, AZ May 22, 2017 Hi, I m Eric Dr. Eric Morganson, Research Scientist, Nation Center

More information

Lecture 3a: The Origin of Variational Bayes

Lecture 3a: The Origin of Variational Bayes CSC535: 013 Advanced Machine Learning Lecture 3a: The Origin of Variational Bayes Geoffrey Hinton The origin of variational Bayes In variational Bayes, e approximate the true posterior across parameters

More information

Applied Machine Learning for Design Optimization in Cosmology, Neuroscience, and Drug Discovery

Applied Machine Learning for Design Optimization in Cosmology, Neuroscience, and Drug Discovery Applied Machine Learning for Design Optimization in Cosmology, Neuroscience, and Drug Discovery Barnabas Poczos Machine Learning Department Carnegie Mellon University Machine Learning Technologies and

More information

17 Random Projections and Orthogonal Matching Pursuit

17 Random Projections and Orthogonal Matching Pursuit 17 Random Projections and Orthogonal Matching Pursuit Again we will consider high-dimensional data P. Now we will consider the uses and effects of randomness. We will use it to simplify P (put it in a

More information

Detecting the Unexpected

Detecting the Unexpected Detecting the Unexpected Discovery in the Era of Astronomically Big Data Insights from Space Telescope Science Institute s first Big Data conference Josh Peek, Chair Sarah Kendrew, C0-Chair SOC: Erik Tollerud,

More information

Locally-biased analytics

Locally-biased analytics Locally-biased analytics You have BIG data and want to analyze a small part of it: Solution 1: Cut out small part and use traditional methods Challenge: cutting out may be difficult a priori Solution 2:

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Multistage Anslysis on Solar Spectral Analyses with Uncertainties in Atomic Physical Models

Multistage Anslysis on Solar Spectral Analyses with Uncertainties in Atomic Physical Models Multistage Anslysis on Solar Spectral Analyses with Uncertainties in Atomic Physical Models Xixi Yu Imperial College London xixi.yu16@imperial.ac.uk 23 October 2018 Xixi Yu (Imperial College London) Multistage

More information

DES Galaxy Clusters x Planck SZ Map. ASTR 448 Kuang Wei Nov 27

DES Galaxy Clusters x Planck SZ Map. ASTR 448 Kuang Wei Nov 27 DES Galaxy Clusters x Planck SZ Map ASTR 448 Kuang Wei Nov 27 Origin of Thermal Sunyaev-Zel'dovich (tsz) Effect Inverse Compton Scattering Figure Courtesy to J. Carlstrom Observables of tsz Effect Decrease

More information

AUTOMATIC MORPHOLOGICAL CLASSIFICATION OF GALAXIES. 1. Introduction

AUTOMATIC MORPHOLOGICAL CLASSIFICATION OF GALAXIES. 1. Introduction AUTOMATIC MORPHOLOGICAL CLASSIFICATION OF GALAXIES ZSOLT FREI Institute of Physics, Eötvös University, Budapest, Pázmány P. s. 1/A, H-1117, Hungary; E-mail: frei@alcyone.elte.hu Abstract. Sky-survey projects

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality

More information

Current Status of Chinese Virtual Observatory

Current Status of Chinese Virtual Observatory Current Status of Chinese Virtual Observatory Chenzhou Cui, Yongheng Zhao National Astronomical Observatories, Chinese Academy of Science, Beijing 100012, P. R. China Dec. 30, 2002 General Information

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

Distant galaxies: a future 25-m submm telescope

Distant galaxies: a future 25-m submm telescope Distant galaxies: a future 25-m submm telescope Andrew Blain Caltech 11 th October 2003 Cornell-Caltech Workshop Contents Galaxy populations being probed Modes of investigation Continuum surveys Line surveys

More information

Neuroscience Introduction

Neuroscience Introduction Neuroscience Introduction The brain As humans, we can identify galaxies light years away, we can study particles smaller than an atom. But we still haven t unlocked the mystery of the three pounds of matter

More information

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory Randomized Selection on the GPU Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory High Performance Graphics 2011 August 6, 2011 Top k Selection on GPU Output the top k keys

More information

Astronomical imagers. ASTR320 Monday February 18, 2019

Astronomical imagers. ASTR320 Monday February 18, 2019 Astronomical imagers ASTR320 Monday February 18, 2019 Astronomical imaging Telescopes gather light and focus onto a focal plane, but don t make perfect images Use a camera to improve quality of images

More information

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

More information

Galaxy Formation Now and Then

Galaxy Formation Now and Then Galaxy Formation Now and Then Matthias Steinmetz Astrophysikalisches Institut Potsdam 1 Overview The state of galaxy formation now The state of galaxy formation 10 years ago Extragalactic astronomy in

More information

Crurent and Future Massive Redshift Surveys

Crurent and Future Massive Redshift Surveys Crurent and Future Massive Redshift Surveys Jean-Paul Kneib LASTRO Light on the Dark Galaxy Power Spectrum 3D mapping of the position of galaxies non-gaussian initial fluctuations Size of the Horizon:

More information

Data Preprocessing Tasks

Data Preprocessing Tasks Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can

More information