A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring


Lecture 8
A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2015
http://www.astro.cornell.edu/~cordes/a6523

Applications:
- Bayesian inference: overview and examples
- Introduction to data mining in large-scale surveys

Reading: Gregory chapters 5, 3, 9.1-9.2

Lecture 10 (Thursday 26 Feb): Adam Brazier (Cornell Center for Advanced Computing) will talk about astronomy-survey workflows and the how-to of databases.

Topics for Lecture 10 next week Sensor data (e.g. telescope data) often requires further filtering and cross-comparisons of the global output. By storing output in a database we can query our data products efficiently and with a wide variety of qualifiers and filters. Databases, particularly relational databases, are used in many fields, including industry, to store information in a form that can be efficiently queried. We will introduce the relational database structure, how they can be queried, how they should be designed and how they can be incorporated into the scientific workflow.

Topics Plan
- Bayesian inference
- Detection problems
- Matched filtering and localization
- Modeling (linear, nonlinear)
  - Cost functions
  - Parameter estimation and errors
- Optimization methods
  - Hill climbing, annealing, genetic algorithms
  - MCMC variants (Gibbs, Hamiltonian)
- Generalized spectral analysis
  - Lomb-Scargle
  - Maximum entropy
  - High resolution method
  - Bayesian approaches
  - Wavelets
- Principal components
  - Cholesky decomposition
- Large scale surveys in astronomy
  - Time domain
  - Spectral line
  - Images and image cubes
- Detection & characterization of events, sources, objects
  - Known object types
  - Unknown object types
  - Current algorithms
- Data mining tools
  - Databases
  - Distributed processing

Gibbs sampling references:
- http://en.wikipedia.org/wiki/gibbs_sampling
- http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/csa/node28.html
- http://cs.brown.edu/research/ai/dynamics/tutorial/documents/gibbssampling.html
- http://csg.sph.umich.edu/abecasis/class/815.23.pdf

Bayesian Inference

Probability = a measure of our state of knowledge before/after acquiring data, not simply a frequency of occurrence.

Let D = a vector of data points and θ = a vector of parameters for some model. The parameters might be those for a straight line or for a more complex model (some have hundreds of parameters or more). The simplest form of Bayes' law for model fitting (parameter estimation) is

P(θ|D) = P(θ) P(D|θ) / P(D).

Before acquiring data, P(D|θ) = the sampling distribution: you can view the parameters as fixed and the data as variable. After getting data, the unknown parameter values are a function of the fixed data. We then rename P(D|θ) ≡ L(θ|D) = the likelihood function.

Note that this form of Bayes' theorem follows from conditional probabilities for a pair of propositions:

P(AB) = P(A|B) P(B) = P(B|A) P(A)  =>  P(A|B) = P(B|A) P(A) / P(B).

Let A → θ and B → D.

We infer the posterior probability (or PDF) of parameter values as

P(θ|D) = P(θ) L(θ|D) / P(D) = Prior × Likelihood / Normalization.

The normalization is simply the integral of the numerator if we want the posterior PDF to be normalized (which we often do). In the simplest case we have no prior information, so the posterior PDF is simply

P(θ|D) = L(θ|D) / ∫ dθ L(θ|D).

The normalization is sometimes referred to as the prior predictive probability or the global likelihood.
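As a concrete illustration (not from the lecture): with simulated Gaussian data of known σ and a flat prior, the posterior for the mean µ can be normalized numerically by dividing the likelihood by its integral, exactly as in the expression above. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0
data = rng.normal(loc=2.0, scale=sigma, size=20)   # simulated measurements

mu_grid = np.linspace(-2.0, 6.0, 2001)
# log likelihood of each candidate mu, summed over the data
loglike = (-0.5 * ((data[:, None] - mu_grid[None, :]) / sigma) ** 2).sum(axis=0)
like = np.exp(loglike - loglike.max())             # subtract max to avoid underflow

post = like / np.trapz(like, mu_grid)              # flat prior: divide by the integral
print(np.trapz(post, mu_grid))                     # ~1.0: properly normalized
print(mu_grid[np.argmax(post)])                    # posterior mode ~ sample mean
```

Working with log-likelihoods and subtracting the maximum before exponentiating is the standard trick for avoiding numerical underflow with many data points.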

A Form for More Detailed Inference (model comparisons, hypothesis testing)

Use 3-proposition probabilities written in two ways:

P(ABC) = P(A|BC) P(BC) = P(A|BC) P(B|C) P(C)

and

P(ABC) = P(B|AC) P(AC) = P(B|AC) P(A|C) P(C).

Equating, we get

P(A|BC) P(B|C) P(C) = P(B|AC) P(A|C) P(C),

which gives

P(A|BC) = P(A|C) P(B|AC) / P(B|C).    (1)

Now let

A → θ, the parameters of a model
B → D, the data
C → I, background information (laws of physics, empirical results, wild guesses, ...)

so that Eq. 1 becomes

P(θ|DI) = P(θ|I) P(D|θI) / P(D|I).

What do we do with posterior probabilities or PDFs? Answer: the usual stuff: we characterize the quantity of interest according to what our goals are.

Best value? Mean, mode, median.

How well do we know it? Variance, confidence or credible region. The credible region for a parameter is its range of values that covers X% of the PDF (e.g. 68%, 95%). These regions may or may not correspond to 1σ or 3σ regions, depending on how Gaussian-like the PDF is.

Is it consistent with being Gaussian distributed? Kurtosis, skewness.

If there are multiple parameters: are they correlated or independent? There may be underlying physics or phenomena of interest. Maybe only a subset of parameters is of interest. We then marginalize over the uninteresting or nuisance parameters: let θ = (φ, ψ) with ψ = the nuisance parameters. We integrate the total posterior PDF to get the PDF of the parameters of interest:

P(φ|DI) = ∫ dψ P(φ, ψ|DI).
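Marginalization is easy to do numerically when the posterior lives on a grid. A minimal sketch with a toy correlated-Gaussian posterior for (φ, ψ) (my choice of example, not the lecture's): integrating over the nuisance parameter ψ leaves the marginal PDF of φ.

```python
import numpy as np

phi = np.linspace(-5.0, 5.0, 401)
psi = np.linspace(-5.0, 5.0, 401)
P2, S2 = np.meshgrid(phi, psi, indexing="ij")
rho = 0.6
# toy correlated-Gaussian joint posterior P(phi, psi | D)
post2d = np.exp(-(P2**2 - 2 * rho * P2 * S2 + S2**2) / (2 * (1 - rho**2)))
post2d /= np.trapz(np.trapz(post2d, psi, axis=1), phi)   # normalize on the grid

post_phi = np.trapz(post2d, psi, axis=1)   # integrate out the nuisance parameter
print(np.trapz(post_phi, phi))             # ~1.0: a proper marginal PDF
```

Grid marginalization scales poorly with dimension, which is one reason MCMC methods (discussed later) are used for high-dimensional posteriors.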

Sequential Learning

Start with a prior P(θ|I).

Acquire the first data point or set D_1:  posterior_1 ∝ prior × L_1.
Acquire the second data point or set D_2: posterior_2 ∝ posterior_1 × L_2 ∝ prior × L_1 L_2.
...
After the nth data point or set D_n: posterior_n ∝ prior × ∏_{j=1}^{n} L_j.
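The product rule above implies that updating one datum at a time and updating with all the data at once give the same posterior. A minimal sketch using a Beta-Bernoulli conjugate model (my choice of example; the lecture states the general rule):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=50)     # Bernoulli outcomes

# sequential updating: Beta(a, b) prior, one observation at a time
a, b = 1.0, 1.0
for x in data:
    a += x
    b += 1 - x

# batch updating: same prior, all data at once
a_batch = 1.0 + data.sum()
b_batch = 1.0 + len(data) - data.sum()
print(a == a_batch and b == b_batch)   # True: same posterior either way
```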

Examples Poisson event rate (photon counting) Gaussian mean and standard deviation

Example

Data: {k_i}, i = 1, ..., n, i.i.d., drawn from a Poisson process.

Poisson PDF: P_k = λ^k e^{-λ} / k!

Want: an estimate of the mean λ of the process.

FREQUENTIST APPROACH: We need an estimator for the mean; consider the likelihood

f(λ) = ∏_{i=1}^{n} P(k_i) = [1 / ∏_{i=1}^{n} k_i!] λ^{Σ_i k_i} e^{-nλ}.

Maximizing,

df/dλ = 0 = f(λ) [−n + λ^{-1} Σ_{i=1}^{n} k_i],

we obtain as the estimator for the mean

λ̂ = k̄ = (1/n) Σ_{i=1}^{n} k_i.

BAYESIAN APPROACH: Likelihood (as before), with S = Σ_{i=1}^{n} k_i:

P(D|MI) = ∏_{i=1}^{n} P(k_i) = [1 / ∏_{i=1}^{n} k_i!] λ^S e^{-nλ}.

Prior: assume a flat prior, P(M|I) = P(λ|I) = U(λ), taken as unity over a range much wider than the likelihood.

Prior predictive:

P(D|I) = ∫ dλ U(λ) P(D|MI) = Γ(S+1) / [n^{S+1} ∏_{i=1}^{n} k_i!].

Combining all the above, we find

P(λ|{k_i} I) = [n^{S+1} / Γ(S+1)] λ^S e^{-nλ}.

Note that rather than getting a point estimate for the mean, we get a PDF for its value. For hypothesis testing, this is much more useful than a point estimate.
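Under the flat prior the posterior above is a Gamma distribution with shape S + 1 and rate n. A numeric sanity check with simulated Poisson counts (all values illustrative):

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(1)
k = rng.poisson(lam=3.0, size=40)     # simulated counts
n, S = len(k), int(k.sum())

lam = np.linspace(1e-6, 10.0, 4001)
# log posterior = log likelihood (flat prior) + analytic normalization
log_post = S * np.log(lam) - n * lam + (S + 1) * np.log(n) - lgamma(S + 1)
post = np.exp(log_post)

print(np.trapz(post, lam))            # ~1.0: the analytic constant is correct
print(S / n)                          # frequentist MLE (the sample mean)
print(np.trapz(lam * post, lam))      # posterior mean, ~(S + 1)/n
```

The posterior mean (S + 1)/n differs from the MLE S/n by 1/n, a reminder that even a "non-informative" prior contributes at finite sample size.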

Issues

Bayesian inference can look deceptively simple (especially for the examples given). Issues that arise:

The underlying form of the likelihood function may not be known, so an analytical form is not available.

The posterior PDF may not be easily integrated, especially if the dimensionality is high and its shape is not simple.

Finding parameter values does not necessarily require normalization, but comparison of models does.

A vast literature exists on how to sample and integrate the posterior PDF (e.g. MCMC and its variants).

Question How do we calculate the likelihood function if we do not know the underlying PDF for the data errors and cannot argue from the CLT that it is Gaussian?

Bayesian Priors: Art or Science?

The prior PDF f(θ|I) for a parameter vector θ is used to impose a priori information about parameter values, when known. If the prior information is constraining (i.e. the prior PDF has a strong influence on the shape of the posterior PDF), the prior is said to be informative.

When explicit constraints are not known, one often uses a non-informative prior. For example, suppose we have a parameter which is largely unconstrained and for which we want to calculate the posterior PDF while allowing a wide range of possible values. We might then use a flat prior in the statistical inference. But is a flat prior really the best one for expressing ignorance of the actual value of a parameter? The answer is: not necessarily.

To illustrate the issues, we will consider two kinds of parameters: a location parameter and a scale parameter. For example, consider data that we assume are described by a N(µ, σ²) distribution whose parameters µ (the mean) and σ² (the variance) are not known and are not constrained a priori. What should we use as priors for these parameters? We can write the likelihood function as

L = f(D|θI) = ∏_i (1/σ) f((d_i − µ)/σ),    (1)

where {d_i, i = 1, ..., N} are the data and f(x) ∝ e^{−x²/2}. Note that µ shifts the PDF while σ scales the PDF.

Choosing a prior for µ: We use translation invariance. Suppose we make a change of variable so that

d_i' = d_i + c.    (2)

Then

[d_i' − (µ + c)] / σ = (d_i − µ) / σ.    (3)

Since c is arbitrary, if we don't know µ and hence do not know µ + c, it is plausible that we should search uniformly in µ, i.e. the prior for µ should be flat.

We can see this also by the following. Suppose the prior for µ is f_µ(µ). Then the prior for µ' = µ + c is

f_{µ'}(µ') = f_µ(µ' − c) / |dµ'/dµ| = f_µ(µ' − c).    (4)

We would like the inference to be independent of any such change of variable, so the form of the prior for µ should be translation invariant. In order for the left-hand and right-hand sides of Eq. 4 to have the same form, the prior needs to be independent of its argument, i.e. flat.

Thus an appropriate prior would be of the form

f_µ(µ) = 1/(µ_2 − µ_1)  for µ_1 ≤ µ ≤ µ_2,
       = 0              otherwise,    (5)

where µ_1,2 are chosen to encompass all plausible values of µ. Note that in calculating the posterior PDF, the 1/(µ_2 − µ_1) factor drops out if the range µ_2 − µ_1 is much wider than the likelihood function L(θ). An example of a noninformative prior is shown in Figure 1.

Figure 1: A noninformative prior for the mean, µ. In this case, a flat prior PDF, f_µ(µ), is shown along with a likelihood function, L(µ), that is much narrower than the prior. The peak of L is the maximum likelihood estimate for µ and is the arithmetic mean of the data: µ̂ = N^{-1} Σ_i d_i. For a case like this, the actual interval for the prior, [µ_1, µ_2], will drop out of the posterior PDF because it appears in both the numerator and denominator.

Choosing a prior for σ: Here we use scale invariance. Consider a change of variable

d_i' = c d_i.    (6)

Now, with µ' = cµ and σ' = cσ,

(d_i' − µ')/σ' = (c d_i − cµ)/(cσ) = (d_i − µ)/σ.    (7)

If the prior for σ is f_σ(σ), then the prior for σ' is

f_{σ'}(σ') = f_σ(σ'/c) / |dσ'/dσ| = (1/c) f_σ(σ'/c).    (8)

We would like f_σ and f_{σ'} to have the same shape. Consider a power-law form, f_σ ∝ σ^{−n}. Then Eq. 8 implies

σ'^{−n} = (1/c)(σ'/c)^{−n} = c^{n−1} σ'^{−n},    (9)

which can be satisfied only for n = 1.

Thus the scale-invariant prior for σ is

f_σ(σ) ∝ σ^{−1}  for σ_1 ≤ σ ≤ σ_2,
        = 0       otherwise,    (10)

where σ_1,2 are chosen to encompass all plausible values of σ.

Reality check: we can show that the scale-invariant, non-informative prior for σ is reasonable by considering another change of variable. Suppose we want to use the reciprocal of σ as our parameter rather than σ:

s = σ^{−1}.    (11)

The prior for s is

f_s(s) = f_σ(s^{−1}) / |ds/dσ| = |dσ/ds| f_σ(s^{−1}) = s^{−2} f_σ(s^{−1}) ∝ s^{−2} (s^{−1})^{−1} = s^{−1}.    (12)

Thus the prior has the same form for σ and its reciprocal. This is desirable because it would not be reasonable for the parameter inference to depend on which variable we used. Thus we can use either σ or s and then derive one from the other.
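A numeric check of the scale-invariance argument (bounds illustrative): draw σ from p(σ) ∝ 1/σ on [0.1, 10] by inverse-CDF sampling; then s = 1/σ spans the same interval, and both variables put equal probability in each logarithmic decade, as a 1/x density must.

```python
import numpy as np

rng = np.random.default_rng(2)
sig1, sig2 = 0.1, 10.0
u = rng.uniform(size=200_000)
sigma = sig1 * (sig2 / sig1) ** u      # inverse CDF of p ~ 1/sigma (log-uniform)
s = 1.0 / sigma                        # also spans [0.1, 10]

frac_sigma = np.mean((sigma >= 0.1) & (sigma < 1.0))
frac_s = np.mean((s >= 0.1) & (s < 1.0))
print(frac_sigma, frac_s)              # both ~0.5: half the log range per decade
```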

Some Stochastic Processes of Interest

Stochastic Processes II

Useful Processes:

A. Gaussian noise: n(t) is a Gaussian random process if
1. f_n(x) = 1D Gaussian PDF,
2. f_{n(t), n(t+τ)}(x, y) = 2D joint Gaussian PDF,
3. all higher-order PDFs and moments can be written in terms of the first and second moments.

Note that Gaussian noise can be either stationary or nonstationary. For example, the mean ⟨X(t)⟩ and variance σ_X²(t) can both be time dependent.

B. White noise has a particular spectral shape (flat), but the 1D PDF is unspecified:

S_n(f) = constant.

The autocorrelation function is

R(τ) = σ_n² δ(τ)     (continuous case),
R(τ) = σ_n² δ_{τ,0}  (discrete case).

Thus, white noise need not be Gaussian noise and vice versa. However, white Gaussian noise is often used or assumed.

Example of white, non-Gaussian noise constructed from white, Gaussian noise: Let X_k = white Gaussian noise: ⟨X_k X_k'⟩ = σ_x² δ_{kk'}. Let Y_k = sgn(X_k) = ±1. Then Y_k is white noise but it is not Gaussian. The PDF of Y_k is

f_Y(Y) = (1/2)[δ(Y + 1) + δ(Y − 1)].

It may be shown that the autocorrelation function of Y is a function of the ACF of X:

⟨Y_k Y_k'⟩ = (2/π) sin^{−1}(⟨X_k X_k'⟩ / σ_x²).

This relation (the van Vleck relation) is the basis for autocorrelation spectrometers.
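A quick simulation of the hard-limited (clipped) noise above. Since the input here is white, I correlate two Gaussian streams with a known ρ rather than using a time lag; this tests the same van Vleck relation ρ_Y = (2/π) arcsin(ρ_X). All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.7
n = 500_000
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # corr(x1, x2) = rho

y1, y2 = np.sign(x1), np.sign(x2)      # hard limiter: Y = sgn(X)
rho_y = np.mean(y1 * y2)               # correlation of the clipped signals
print(rho_y)                           # ~ (2/pi) * arcsin(0.7) ≈ 0.494
```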

C. Shot noise is associated with Poisson events, each having a shape h(t):

x(t) = Σ_i h(t − t_i),

where the events occur at a rate λ. If h(t) decays to zero as t → ±∞, then x(t) has stationary statistics. If h(t) does not decay, x(t) has nonstationary statistics.

C1. White noise: As h(t) → δ(t), x(t) tends to white noise.

C2. Bandlimited white noise: If h(t) has a power spectrum |H(f)|² that is low-pass in form (it goes to zero above some cutoff frequency f_c), then x(t) will have a flat spectrum for f ≪ f_c. Similarly for bandpass noise, where the centroid frequency of the nonzero part of the spectrum is at some frequency f_0 ≠ 0.
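A sketch of shot noise with one-sided exponential pulses h(t) = e^{−t/τ} at Poisson arrival times (my choice of pulse shape; all parameters illustrative). Campbell's theorem gives the mean level as λ ∫ h dt = λτ, which the time average should reproduce once the startup transient is discarded.

```python
import numpy as np

rng = np.random.default_rng(6)
lam, tau, T, dt = 5.0, 0.2, 200.0, 0.01
t = np.arange(0.0, T, dt)
n_events = rng.poisson(lam * T)        # Poisson number of events in [0, T]
t_i = rng.uniform(0.0, T, n_events)    # arrival times, uniform given the count

x = np.zeros_like(t)
for ti in t_i:                         # superpose a decaying pulse per event
    m = t >= ti
    x[m] += np.exp(-(t[m] - ti) / tau)

mean_x = x[t > 5 * tau].mean()         # skip the startup transient
print(mean_x)                          # ~ lam * tau = 1.0 (Campbell's theorem)
```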

Figure 1: A single realization of Gaussian white noise and random walks derived from it. Since individual steps occur frequently, the random walks are termed dense.

Figure 2: A single realization of non-Gaussian white noise (shot noise) and sparse random walks derived from it.

D. Autoregressive (AR) process: depends on past values plus white noise:

x_t = n_t − Σ_{j=1}^{M} α_j x_{t−j},

where
n_t = discrete white noise,
M = order of the AR model,
α_j = coefficients of the AR model.

AR processes play a role in maximum entropy spectral estimators. By taking the Fourier transform of the expression for x_t we can solve for

X_f = Ñ_f / (1 + Σ_j α_j e^{−2πijf}).
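A minimal AR(1) check with the sign convention above, x_t = n_t − a x_{t−1} (coefficient value illustrative): the lag-1 autocorrelation of the process should be −a.

```python
import numpy as np

rng = np.random.default_rng(4)
a = 0.5
n = rng.standard_normal(200_000)       # driving white noise
x = np.empty_like(n)
x[0] = n[0]
for t in range(1, len(n)):             # x_t = n_t - a * x_{t-1}
    x[t] = n[t] - a * x[t - 1]

rho1 = np.mean(x[1:] * x[:-1]) / np.var(x)   # lag-1 autocorrelation
print(rho1)                            # ~ -a = -0.5
```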

E. Moving average (MA) process: a moving average of white noise:

x_t = Σ_{j=0}^{N} β_j n_{t−j}.

F. ARMA process: AR and MA combined.

G. ARIMA process: an integrated ARMA process.

H. Markov chain: a process whose present state depends probabilistically on some number p of previous values. A first-order Markov process has p = 1, etc. For a chain with n states, e.g. S = {s_1, s_2, ..., s_n}, the probability of being in a given state at discrete time t is given by the state probability vector, the row vector

P_t = (p_1, p_2, ..., p_n),

and the probability vector for time t + 1 is

P_{t+1} = P_t Q,

where Q is the transition matrix whose elements are the probabilities q_ij of transitioning from the ith state to the jth state. The sum of the elements along a row of Q is unity because the chain has to be in some state at any time. A two-state chain, for example, has the transition matrix

Q = | q_11      1 − q_11 |
    | 1 − q_22  q_22     |
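A sketch of the two-state chain above (transition probabilities illustrative): iterating P_{t+1} = P_t Q converges to the stationary distribution, which is left unchanged by a further application of Q.

```python
import numpy as np

q11, q22 = 0.9, 0.6                    # illustrative self-transition probabilities
Q = np.array([[q11, 1 - q11],
              [1 - q22, q22]])         # each row sums to 1

P = np.array([1.0, 0.0])               # start in state 1
for _ in range(200):                   # iterate P_{t+1} = P_t Q
    P = P @ Q

print(P)                               # stationary distribution, here (0.8, 0.2)
print(np.allclose(P, P @ Q))           # True: P is unchanged by Q
```

For these values the balance condition π_1 (1 − q_11) = π_2 (1 − q_22) gives π = (0.8, 0.2).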

I. Random walks: Any integral of noise with stationary statistics leads to a process having nonstationary statistics, with random-walk-like behavior. E.g.

x(t) = ∫_0^t dt' n(t'),

where n(t) is white noise.

J. Higher-order random walks: If white noise is integrated M times, the resultant process is an Mth-order random walk.
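A quick demonstration of item I: cumulatively summing unit-variance white noise gives a random walk whose ensemble variance grows linearly with time, i.e. nonstationary statistics.

```python
import numpy as np

rng = np.random.default_rng(5)
steps = rng.standard_normal((5000, 1000))   # 5000 realizations, 1000 steps each
walk = np.cumsum(steps, axis=1)             # discrete integral of white noise

var_t = walk.var(axis=0)                    # ensemble variance at each time
print(var_t[99], var_t[999])                # ~100 and ~1000: Var[x_t] = t
```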

Got to here 2015