A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2015


Lecture 8

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2015

Applications:
- Bayesian inference: overview and examples
- Introduction to data mining in large-scale surveys

Reading: Gregory, chapters 5, 3

Lecture 10 (Thursday 26 Feb): Adam Brazier (Cornell Center for Advanced Computing) will talk about astronomy-survey workflows and the how-to of databases.

Topics for Lecture 10 (next week)

Sensor data (e.g. telescope data) often require further filtering and cross-comparison of the global output. By storing output in a database we can query our data products efficiently and with a wide variety of qualifiers and filters. Databases, particularly relational databases, are used in many fields, including industry, to store information in a form that can be queried efficiently. We will introduce the relational database structure, how such databases can be queried, how they should be designed, and how they can be incorporated into the scientific workflow.

Topics Plan

Bayesian inference
- Detection problems
- Matched filtering and localization
- Modeling (linear, nonlinear)
- Cost functions
- Parameter estimation and errors
- Optimization methods: hill climbing, annealing, genetic algorithms
- MCMC variants (Gibbs, Hamiltonian)

Generalized spectral analysis
- Lomb-Scargle
- Maximum entropy
- High-resolution methods
- Bayesian approaches
- Wavelets
- Principal components
- Cholesky decomposition

Large-scale surveys in astronomy
- Time domain
- Spectral line
- Images and image cubes
- Detection & characterization of events, sources, objects
- Known object types
- Unknown object types
- Current algorithms
- Data mining tools
- Databases
- Distributed processing

Gibbs sampling references:
- .../fedc_homepage/xplore/ebooks/html/csa/node28.html
- .../tutorial/documents/gibbssampling.html
- .../....pdf
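
As a concrete illustration of the idea behind these references, here is a minimal Gibbs-sampling sketch (my own example, not taken from the linked tutorials): for a zero-mean bivariate Gaussian with correlation rho, each full conditional is a 1-D Gaussian, so the sampler simply alternates two normal draws.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho=0.8, n_samples=5000, seed=0):
    """Gibbs sampler for (x, y) ~ N(0, [[1, rho], [rho, 1]]).

    Each full conditional is Gaussian:
      x | y ~ N(rho * y, 1 - rho**2),  y | x ~ N(rho * x, 1 - rho**2).
    """
    rng = np.random.default_rng(seed)
    samples = np.empty((n_samples, 2))
    x, y = 0.0, 0.0
    sig = np.sqrt(1.0 - rho**2)
    for i in range(n_samples):
        x = rng.normal(rho * y, sig)   # draw x from p(x | y)
        y = rng.normal(rho * x, sig)   # draw y from p(y | x)
        samples[i] = (x, y)
    return samples

samples = gibbs_bivariate_gaussian()
print("sample correlation:", np.corrcoef(samples[1000:].T)[0, 1])  # ~0.8 after burn-in
```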


Bayesian Inference

Probability = a measure of our state of knowledge before/after acquiring data ≠ frequency of occurrence.

Let D = a vector of data points and θ = a vector of parameters for some model. The parameters might be those for a straight line or for a more complex model (some have hundreds of parameters or more). The simplest form of Bayes' law for model fitting (parameter estimation) is

P(θ|D) = P(θ) P(D|θ) / P(D),

where P(D|θ) = sampling distribution. Before acquiring data, you can view the parameters as fixed and the data as variable. After getting data, the unknown parameter values are a function of the fixed data. We then rename P(D|θ) ≡ L(θ|D) = likelihood function.

Note that this form of Bayes' theorem follows from conditional probabilities for a pair of propositions:

P(AB) = P(A|B) P(B) = P(B|A) P(A)  ⇒  P(A|B) = P(B|A) P(A) / P(B).

Let A → θ and B → D.

We infer the posterior probability (or PDF) of parameter values as

P(θ|D) = P(θ) L(θ|D) / P(D) = Prior × Likelihood / Normalization.

The normalization is simply the integral of the numerator if we want the posterior PDF to be normalized (which we often do). In the simplest case we have no prior information, so the posterior PDF is simply

P(θ|D) = L(θ|D) / ∫ dθ L(θ|D).

The normalization is sometimes referred to as the prior predictive probability or the global likelihood.
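
A minimal numerical sketch of this recipe (illustrative, not from the notes): evaluate prior × likelihood on a grid of θ values and divide by the integral of the numerator to normalize. The Gaussian data model and all parameter values below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                                   # assumed-known noise level
data = rng.normal(2.0, sigma, size=20)        # simulated data with true mean 2

theta = np.linspace(-5.0, 10.0, 2001)         # grid of parameter values
prior = np.ones_like(theta)                   # flat (unnormalized) prior

# log-likelihood of the whole data set at each grid value of theta
loglike = -0.5 * np.sum((data[:, None] - theta[None, :])**2, axis=0) / sigma**2
like = np.exp(loglike - loglike.max())        # subtract max to avoid underflow

numer = prior * like
posterior = numer / np.trapz(numer, theta)    # normalization = integral of numerator

print("posterior mean :", np.trapz(theta * posterior, theta))
print("sample mean    :", data.mean())
```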

A Form for More Detailed Inference (model comparisons, hypothesis testing)

Use 3-proposition probabilities written in two ways:

P(ABC) = P(A|BC) P(BC) = P(A|BC) P(B|C) P(C)

and

P(ABC) = P(B|AC) P(AC) = P(B|AC) P(A|C) P(C).

Equating, we get

P(A|BC) P(B|C) P(C) = P(B|AC) P(A|C) P(C),

which gives

P(A|BC) = P(A|C) P(B|AC) / P(B|C).    (1)

Now let

A → θ, the parameters of a model
B → D, the data
C → I, background information (laws of physics, empirical results, wild guesses, ...)

⇒ P(θ|DI) = P(θ|I) P(D|θI) / P(D|I).

What do we do with posterior probabilities or PDFs?

Answer: the usual stuff: we characterize the quantity of interest according to what our goals are.
- Best value? Mean, mode, median.
- How well do we know it? Variance, confidence or credible region. The credible region for a parameter is the range of values that covers X% of the PDF (e.g. 68%, 95%). These regions may or may not correspond to 1σ or 3σ regions, depending on how Gaussian-like the PDF is.
- Is it consistent with being Gaussian distributed? Kurtosis, skewness.
- If there are multiple parameters: are they correlated or independent? There may be underlying physics or phenomena of interest.
- Maybe only a subset of parameters is of interest. We then marginalize over the uninteresting or nuisance parameters: let θ = (φ, ψ) with ψ = nuisance parameters. We integrate the total posterior PDF to get the PDF of the parameters of interest (see the sketch after this list):

P(φ|DI) = ∫ dψ P(φ, ψ|DI).
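
Below is a small numerical sketch of marginalization (an illustration with a made-up 2-D posterior): the nuisance parameter ψ is integrated out on a grid to leave P(φ|DI).

```python
import numpy as np

phi = np.linspace(-3.0, 3.0, 301)
psi = np.linspace(-3.0, 3.0, 301)
PHI, PSI = np.meshgrid(phi, psi, indexing="ij")

# Toy correlated 2-D posterior (unnormalized); rho mimics a parameter covariance
rho = 0.6
post2d = np.exp(-(PHI**2 - 2*rho*PHI*PSI + PSI**2) / (2*(1 - rho**2)))
post2d /= np.trapz(np.trapz(post2d, psi, axis=1), phi)     # normalize on the grid

# Marginalize over the nuisance parameter: P(phi|DI) = integral dpsi P(phi, psi|DI)
post_phi = np.trapz(post2d, psi, axis=1)
print("marginal integrates to:", np.trapz(post_phi, phi))  # ~1
```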

Sequential Learning

Start with a prior P(θ|I).

Acquire the first data point or set D_1:   posterior_1 ∝ prior × L_1

Acquire the second data point or set D_2:  posterior_2 ∝ posterior_1 × L_2 ∝ prior × L_1 L_2

...

After the n-th data point or set D_n:      posterior_n ∝ prior × ∏_{j=1}^{n} L_j
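
A sketch of sequential updating on a grid (illustrative data and a Gaussian likelihood assumed): the posterior after each batch becomes the prior for the next, and the final result matches a single analysis of all the data.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.linspace(-2.0, 6.0, 2001)
sigma = 1.0                                       # assumed-known noise level

def likelihood(batch, theta):
    # Gaussian likelihood of one data batch, evaluated on the theta grid
    return np.exp(-0.5 * np.sum((batch[:, None] - theta[None, :])**2, axis=0) / sigma**2)

posterior = np.ones_like(theta)                   # flat prior to start
batches = [rng.normal(2.0, sigma, size=5) for _ in range(4)]
for batch in batches:
    posterior = posterior * likelihood(batch, theta)   # posterior_k ∝ posterior_(k-1) * L_k
    posterior /= np.trapz(posterior, theta)            # renormalize after each update

# A single analysis of all the data gives the same posterior
all_data = np.concatenate(batches)
oneshot = likelihood(all_data, theta)
oneshot /= np.trapz(oneshot, theta)
print("max |difference|:", np.abs(posterior - oneshot).max())   # ~0 up to rounding
```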

Examples
- Poisson event rate (photon counting)
- Gaussian mean and standard deviation

Example

Data: {k_i}, i = 1, ..., n, i.i.d., drawn from a Poisson process.
Poisson PDF: P_k = λ^k e^{−λ} / k!
Want: an estimate of the mean of the process.

FREQUENTIST APPROACH: We need an estimator for the mean; consider the likelihood

f(λ) = ∏_{i=1}^{n} P(k_i) = (1 / ∏_{i=1}^{n} k_i!) λ^{Σ_i k_i} e^{−nλ}.

Maximizing,

df/dλ = 0 = f(λ) [−n + λ^{−1} Σ_{i=1}^{n} k_i],

so the estimator for the mean is

λ̂ = k̄ = (1/n) Σ_{i=1}^{n} k_i.

BAYESIAN APPROACH:

Likelihood (as before):

P(D|MI) = ∏_{i=1}^{n} P(k_i) = (1 / ∏_{i=1}^{n} k_i!) λ^{n k̄} e^{−nλ},   with k̄ = (1/n) Σ_i k_i.

Prior: assume a flat prior, P(M|I) = P(λ|I) = U(λ).

Prior predictive:

P(D|I) = ∫ dλ U(λ) P(D|MI) = Γ(n k̄ + 1) / (n^{n k̄ + 1} ∏_{i=1}^{n} k_i!).

Combining all the above, we find

P(λ|{k_i} I) = [n^{n k̄ + 1} / Γ(n k̄ + 1)] λ^{n k̄} e^{−nλ}.

Note that rather than getting a point estimate for the mean, we get a PDF for its value. For hypothesis testing, this is much more useful than a point estimate.
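
A numerical sketch of the same Poisson example (the simulated rate and sample size are arbitrary choices): the frequentist answer is the single number k̄, while the Bayesian answer with a flat prior is a gamma-shaped posterior density for λ, from which a credible interval follows directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k = rng.poisson(lam=4.2, size=30)            # counts k_i, i = 1..n
n, ksum = len(k), k.sum()

# Frequentist: maximum-likelihood point estimate of the rate
lam_hat = k.mean()

# Bayesian with a flat prior on lambda: posterior ∝ lambda^ksum e^{-n lambda},
# i.e. a Gamma distribution with shape ksum + 1 and rate n (scale 1/n)
posterior = stats.gamma(a=ksum + 1, scale=1.0 / n)
lo, hi = posterior.ppf([0.16, 0.84])         # ~68% credible interval

print(f"MLE               : {lam_hat:.3f}")
print(f"posterior mean    : {posterior.mean():.3f}")
print(f"68% credible int. : [{lo:.3f}, {hi:.3f}]")
```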

Issues

Bayesian inference can look deceptively simple (especially for the examples given). Issues that arise:
- The underlying form of the likelihood function may not be known, so an analytical form is not available.
- The posterior PDF may not be easily integrated, especially if the dimensionality is high and its shape is not simple.
- Finding parameter values does not necessarily require normalization, but comparison of models does.

A vast literature exists on how to sample and integrate the posterior PDF (e.g. MCMC and its variants).

Question

How do we calculate the likelihood function if we do not know the underlying PDF for the data errors and cannot argue from the CLT that it is Gaussian?

Bayesian Priors: Art or Science?

The prior PDF f(θ|I) for a parameter vector θ is used to impose a priori information about parameter values, when known. If prior information is constraining (i.e. the prior PDF has a strong influence on the shape of the posterior PDF), it is said to be informative.

When explicit constraints are not known, one often uses a non-informative prior. For example, suppose we have a parameter which is largely unconstrained and for which we want to calculate the posterior PDF while allowing a wide range of possible values. We might then use a flat prior in the statistical inference.

But is a flat prior really the best one for expressing ignorance of the actual value of a parameter? The answer is: not necessarily.

To illustrate the issues, we will consider two kinds of parameters: a location parameter and a scale parameter.

For example, consider data that we assume are described by an N(µ, σ²) distribution whose parameters µ (the mean) and σ² (the variance) are not known and are not constrained a priori. What should we use as priors for these parameters?

We can write the likelihood function as

L = f(D|θI) = ∏_i f((d_i − µ)/σ | θI),    (1)

where {d_i, i = 1, ..., N} are the data and f(x) = e^{−x²/2}. Note that µ shifts the PDF while σ scales the PDF.

Choosing a prior for µ: We use translation invariance. Suppose we make a change of variable so that

d_i′ = d_i + c.    (2)

Then

(d_i′ − µ′)/σ = (d_i + c − (µ + c))/σ = (d_i − µ)/σ.    (3)

Since c is arbitrary, if we don't know µ and hence do not know µ′ = µ + c, it is plausible that we should search uniformly in µ, i.e. the prior for µ should be flat.

We can see this also by the following. Suppose the prior for µ is f_µ(µ). Then the prior for µ′ = µ + c is

f_{µ′}(µ′) = f_µ(µ′ − c) / |dµ′/dµ| = f_µ(µ′ − c).    (4)

We would like the inference to be independent of any such change of variable, so the form of the prior for µ should be translation invariant. In order for the left-hand and right-hand sides of Eq. (4) to be equal, the form of the prior needs to be independent of its argument, i.e. flat.

Thus an appropriate prior would be of the form

f_µ(µ) = 1/(µ_2 − µ_1)   for µ_1 ≤ µ ≤ µ_2,
f_µ(µ) = 0               otherwise,    (5)

where µ_1,2 are chosen to encompass all plausible values of µ. Note that in calculating the posterior PDF, the 1/(µ_2 − µ_1) factor drops out if the range µ_2 − µ_1 is much wider than the likelihood function L(θ). An example of a noninformative prior is shown in Figure 1.

Figure 1: A noninformative prior for the mean µ. In this case, a flat prior PDF f_µ(µ) is shown along with a likelihood function L(µ) that is much narrower than the prior. The peak of L is the maximum likelihood estimate for µ and is the arithmetic mean of the data: µ̂ = N^{−1} Σ_i d_i. For a case like this, the actual interval for the prior, [µ_1, µ_2], will drop out of the posterior PDF because it appears in both the numerator and denominator.
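
A quick numerical check of the last point in the caption (with made-up Gaussian data): widening the flat prior interval [µ_1, µ_2] leaves the posterior essentially unchanged once the interval is much broader than the likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
d = rng.normal(5.0, 1.0, size=50)            # made-up data; sigma = 1 assumed known
mu = np.linspace(-50.0, 50.0, 20001)

loglike = -0.5 * np.sum((d[:, None] - mu[None, :])**2, axis=0)
like = np.exp(loglike - loglike.max())

def posterior(mu1, mu2):
    prior = ((mu >= mu1) & (mu <= mu2)) / (mu2 - mu1)   # flat prior on [mu1, mu2]
    numer = prior * like
    return numer / np.trapz(numer, mu)

narrow = posterior(-10.0, 20.0)
wide = posterior(-50.0, 50.0)
print("max |difference|:", np.abs(narrow - wide).max())  # effectively zero
```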

Choosing a prior for σ: Here we use scale invariance. Consider a change of variable

d_i′ = c d_i.    (6)

Now, with µ′ = cµ and σ′ = cσ,

(d_i − µ)/σ = (d_i′/c − µ)/σ = (d_i′ − cµ)/(cσ) = (d_i′ − µ′)/σ′.    (7)

If the prior for σ is f_σ(σ), then the prior for σ′ is

f_{σ′}(σ′) = f_σ(σ′/c) / |dσ′/dσ| = (1/c) f_σ(σ′/c).    (8)

We would like f_σ and f_{σ′} to have the same shape. Consider a power-law form, f_σ ∝ σ^{−n}. Then Eq. (8) implies that

(σ′)^{−n} = (1/c) (σ′/c)^{−n},    (9)

which can be satisfied only for n = 1.

Thus the scale-invariant prior for σ is

f_σ(σ) ∝ σ^{−1}   for σ_1 ≤ σ ≤ σ_2,
f_σ(σ) = 0        otherwise,    (10)

where σ_1,2 are chosen to encompass all plausible values of σ.

Reality check: we can show that the scale-invariant, non-informative prior for σ is reasonable by considering another change of variable. Suppose we want to use the reciprocal of σ as our parameter rather than σ:

s = σ^{−1}.    (11)

The prior for s is

f_s(s) = f_σ(s^{−1}) / |ds/dσ| = |dσ/ds| f_σ(s^{−1}) = s^{−2} f_σ(s^{−1}) ∝ s^{−2} (s^{−1})^{−1} = s^{−1}.    (12)

Thus the prior has the same form for σ and its reciprocal. This is desirable because it would not be reasonable for the parameter inference to depend on which variable we used. Thus we can use either σ or s and then derive one from the other.
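
The same change-of-variables result can be checked by Monte Carlo (a sketch under the 1/σ prior, with arbitrary bounds): draw σ from f_σ ∝ 1/σ, transform to s = 1/σ, and confirm that the histogram of s follows a 1/s density.

```python
import numpy as np

rng = np.random.default_rng(5)
sig1, sig2 = 0.1, 10.0

# Sample from f(sigma) ∝ 1/sigma on [sig1, sig2]: log(sigma) is uniform
sigma = np.exp(rng.uniform(np.log(sig1), np.log(sig2), size=200_000))
s = 1.0 / sigma                                      # transformed parameter

# Histogram of s, compared with the predicted 1/s density on [1/sig2, 1/sig1]
bins = np.logspace(np.log10(1 / sig2), np.log10(1 / sig1), 30)
hist, edges = np.histogram(s, bins=bins, density=True)
centers = np.sqrt(edges[:-1] * edges[1:])
predicted = 1.0 / (centers * np.log(sig2 / sig1))    # normalized 1/s density
print(np.allclose(hist, predicted, rtol=0.1))        # True to within sampling noise
```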

Some Stochastic Processes of Interest

Stochastic Processes II

Useful processes:

A. Gaussian noise: n(t) is a Gaussian random process if
1. f_n(x) = 1-D Gaussian PDF,
2. f_{n(t), n(t+τ)}(x, y) = 2-D joint Gaussian PDF,
3. all higher-order PDFs and moments can be written in terms of the first and second moments.

Note that Gaussian noise can be either stationary or nonstationary. For example, the mean ⟨X(t)⟩ and variance σ_X²(t) can both be time dependent.

B. White noise has a particular spectral shape (flat), but the 1-D PDF is unspecified:

S_n(f) = constant.

The autocorrelation function is

R(τ) = σ_n² δ(τ)       (continuous case)
R(τ) = σ_n² δ_{τ,0}    (discrete case)

Thus, white noise need not be Gaussian noise and vice versa. However, white Gaussian noise is often used or assumed.

Example of white, non-Gaussian noise constructed from white, Gaussian noise:

Let X_k = white Gaussian noise: ⟨X_k X_{k′}⟩ = σ_x² δ_{kk′}.
Let Y_k = sgn(X_k) = ±1.

Then Y_k is white noise but it is not Gaussian. The PDF of Y_k is

f_Y(Y) = ½ [δ(Y + 1) + δ(Y − 1)].

It may be shown that the autocorrelation function of Y is a function of the ACF of X (for Gaussian X, the arcsine law ρ_Y = (2/π) arcsin ρ_X in terms of normalized ACFs). This relation (the van Vleck relation) is the basis for autocorrelation spectrometers.
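
A short sketch of the construction above, including a check of the van Vleck (arcsine) relation. White noise would give a trivial ACF, so the Gaussian input here is weakly correlated (an AR(1) sequence, my choice for illustration) before it is clipped with sgn().

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200_000
a = 0.7                                          # AR(1) coefficient (illustrative)

# Weakly correlated Gaussian noise with unit variance
x = np.empty(N)
x[0] = rng.normal()
for k in range(1, N):
    x[k] = a * x[k - 1] + np.sqrt(1 - a**2) * rng.normal()

y = np.sign(x)                                   # 1-bit clipped, non-Gaussian noise

def acf(z, maxlag):
    # Normalized autocorrelation function at lags 0..maxlag-1
    z = z - z.mean()
    n = len(z)
    return np.array([np.mean(z[:n - l] * z[l:]) for l in range(maxlag)]) / np.var(z)

rho_x = acf(x, 10)
rho_y = acf(y, 10)
vanvleck = (2 / np.pi) * np.arcsin(rho_x)        # arcsine (van Vleck) prediction
print(np.allclose(rho_y, vanvleck, atol=0.01))   # True to within sampling noise
```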

C. Shot noise is associated with Poisson events, each having a shape h(t):

x(t) = Σ_i h(t − t_i),

where events occur at a rate λ. If h(t) decays to zero as t → ±∞, then x(t) has stationary statistics. If h(t) does not decay, x(t) has nonstationary statistics.

C1. White noise: as h(t) → δ(t), x(t) tends to white noise.

C2. Bandlimited white noise: if h(t) has a power spectrum |H(f)|² that is low-pass in form (it goes to zero above some cutoff frequency f_c), then x(t) will have a flat spectrum for f ≪ f_c. Similarly for bandpass noise, where the centroid frequency of the nonzero part of the spectrum is at some frequency f ≠ 0.
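
A sketch of shot noise as defined above (rate, pulse shape, and time span are illustrative choices): Poisson event times t_i, each contributing a decaying pulse h(t − t_i).

```python
import numpy as np

rng = np.random.default_rng(7)
T, dt = 100.0, 0.01                    # total time and sample spacing
t = np.arange(0.0, T, dt)
rate = 2.0                             # event rate lambda (events per unit time)
tau = 0.5                              # decay time of the pulse shape h(t)

# Poisson number of events, uniformly distributed event times
n_events = rng.poisson(rate * T)
t_events = np.sort(rng.uniform(0.0, T, size=n_events))

# x(t) = sum_i h(t - t_i) with h(t) = exp(-t/tau) for t >= 0
x = np.zeros_like(t)
for ti in t_events:
    m = t >= ti
    x[m] += np.exp(-(t[m] - ti) / tau)

print(f"{n_events} events; mean level ~ lambda*tau = {rate*tau:.2f}, measured {x.mean():.2f}")
```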

Figure 1: A single realization of Gaussian white noise and random walks derived from it. Since individual steps occur frequently, the random walks are termed dense.

Figure 2: A single realization of non-Gaussian white noise (shot noise) and sparse random walks derived from it.

D. Autoregressive (AR) process: depends on past values + white noise:

x_t = n_t − Σ_{j=1}^{M} α_j x_{t−j},

where
n_t = discrete white noise,
M = order of the AR model,
α_j = coefficients of the AR model.

AR processes play a role in maximum entropy spectral estimators. By taking the Fourier transform of the expression for x_t we can solve for

X_f = Ñ_f / (1 + Σ_j α_j e^{−2πijf}).
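
A sketch that generates an AR(2) realization from the recursion above and compares its periodogram with the model spectrum |1 + Σ_j α_j e^{−2πijf}|^{−2}; the coefficients are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha = np.array([-1.3, 0.7])            # AR(2) coefficients (illustrative choice)
M, N = len(alpha), 16384

# x_t = n_t - sum_j alpha_j x_{t-j}, with n_t discrete white Gaussian noise
x = np.zeros(N)
n = rng.normal(size=N)
for t in range(M, N):
    x[t] = n[t] - np.dot(alpha, x[t - M:t][::-1])

# Periodogram of x versus the AR model spectrum |1 + sum_j alpha_j e^{-2 pi i j f}|^-2
f = np.fft.rfftfreq(N)                   # frequency in cycles per sample
pxx = np.abs(np.fft.rfft(x))**2 / N
denom = 1 + sum(a * np.exp(-2j * np.pi * (j + 1) * f) for j, a in enumerate(alpha))
model = 1.0 / np.abs(denom)**2

print("periodogram peak at f   =", f[np.argmax(pxx)])
print("model spectrum peak at f =", f[np.argmax(model)])
```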

E. Moving average (MA) process: a moving average of white noise:

x_t = Σ_{j=0}^{N} β_j n_{t−j}.

F. ARMA process: AR and MA combined.

G. ARIMA process: an integrated ARMA process.

H. Markov chain: a process whose present state depends probabilistically on some number p of previous values. A first-order Markov process has p = 1, etc.

For a chain with n states, e.g. S = {s_1, s_2, ..., s_n}, the probability of being in a given state at discrete time t is given by the state probability vector, the row vector

P_t = (p_1, p_2, ..., p_n),

and the probability vector for time t + 1 is

P_{t+1} = P_t Q,

where Q is the transition matrix whose elements are the probabilities q_ij of transitioning from the i-th state to the j-th state. The sum of the elements along a row of Q is unity because the chain has to be in some state at any time. A two-state chain, for example, has the transition matrix

Q = [ q_11      1 − q_11 ]
    [ 1 − q_22  q_22     ].
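
A sketch of the two-state chain above: propagate the state probability row vector via P_{t+1} = P_t Q and compare with the stationary distribution; the q_11 and q_22 values are illustrative.

```python
import numpy as np

q11, q22 = 0.9, 0.6                      # illustrative self-transition probabilities
Q = np.array([[q11, 1 - q11],
              [1 - q22, q22]])           # rows sum to 1

P = np.array([1.0, 0.0])                 # start in state 1 with certainty
for t in range(50):
    P = P @ Q                            # P_{t+1} = P_t Q

# Stationary distribution: left eigenvector of Q with eigenvalue 1
w, v = np.linalg.eig(Q.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

print("P after 50 steps:", P)
print("stationary pi   :", pi)
```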

I. Random walks: Any integral of noise with stationary statistics leads to a process having nonstationary statistics, with random-walk-like behavior. E.g.

x(t) = ∫_0^t dt′ n(t′),

where n(t) is white noise.

J. Higher-order random walks: If white noise is integrated M times, the resultant process is an M-th-order random walk.
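
A sketch of items I and J: cumulatively summing white noise once gives an ordinary (first-order) random walk, and summing again gives a second-order random walk; the growth of the variance with time shows the nonstationarity.

```python
import numpy as np

rng = np.random.default_rng(9)
n = rng.normal(size=(1000, 4096))        # 1000 realizations of discrete white noise

walk1 = np.cumsum(n, axis=1)             # first-order random walk (one integration)
walk2 = np.cumsum(walk1, axis=1)         # second-order random walk (two integrations)

# Variance across realizations grows with time: ~t for walk1, much faster (~t^3) for walk2
t = np.array([512, 1024, 2048, 4095])
print("var(walk1):", walk1[:, t].var(axis=0))
print("var(walk2):", walk2[:, t].var(axis=0))
```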

Got to here 2015
