A6523 Modeling, Inference, and Mining Jim Cordes, Cornell University. Motivations: Detection & Characterization. Lecture 2.


A6523 Modeling, Inference, and Mining
Jim Cordes, Cornell University

Lecture 2
- Probability basics
- Fourier transform basics
- Typical problems

Overall mantra: Discovery and critical thinking with data.
"The more I practice the luckier I get" (Arnold Palmer)

Motivations: Detection & Characterization
- Detection problem: We seek the existence of a signal in some data. It may not exist in the data, or it may be very weak. We want to detect it with maximum robustness if it is there; if not, we want a robust description of how strong it could be while remaining consistent with non-detection.
- Characterization problem: We know that a signal is in the data and we want to determine its properties optimally.
- What do "robust" and "optimal" mean? How we incorporate prior knowledge into an analysis is a key difference between frequentist and Bayesian methods.

Broad Classes of Problems
- Many measured quantities ("raw data") are the outputs of linear systems, e.g. wave propagation (EM, gravitational, seismic, acoustic).
- Many signals are the result of nonlinear operations in natural systems or in apparatus.
- Many analyses of data are linear operations acting on the data to produce some desired result (detection, modeling), e.g. Fourier-transform-based spectral analysis.
- Many analyses are nonlinear, e.g. maximum entropy and Bayesian spectral analysis.

Broad Classes of Problems (continued)
Detection, analysis, and modeling of a signal (natural or artificial): Is it there? What are its properties?
Optimal detection schemes:
- Maximize the S/N of a test statistic.
- For a population of signals: maximize detections of real signals; minimize false positives and false negatives (null hypothesis: no signal present).
- Parametric approaches (e.g. least-squares fitting of a model with parameters).
- Non-parametric approaches (e.g. relative comparison of distributions [KS test]).

Probability Definitions
- Random variable: an event is mapped to a number; X = random variable, x = an instance. Events can be quantitative or qualitative.
- Cumulative distribution function F_X(x): CDF = cumulative probability. Used for characterization, inference, and generation of random samples.
- Probability density function f_X(x) = dF_X(x)/dx: PDF, normalized to unity; used as above.
- Characteristic function: Fourier transform of the PDF; a moment generating function.
- Joint/multivariate PDFs and CDFs: collections of random variables; stochastic processes and correlation functions.
- Moments: averages of any quantity over a PDF.
- Comparison of data sets: via moments, PDFs, and CDFs, e.g. the Kolmogorov-Smirnov test for CDFs (see the sketch below).

Notions of Probability
Qualitative (knowledge): a measure of the degree to which propositions, hypotheses, or quantities are known. The measure can be an intrinsic property of an event in and of itself, or relative to other events.
- e.g. "It is not probable that the Sun will explode as a supernova" is an absolute statement based on our background knowledge of stellar evolution.
- e.g. "Horse A is more probable to win, place, or show than horse B" is a relative statement that includes the constraint that one horse wins to the exclusion of any other horse in the race.
Bayesian inference makes use of this qualitative form of probability along with the quantitative aspects discussed below. In this sense, Bayesian methods attribute probabilities to more entities than do some of the more formalistic approaches.
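A minimal sketch of the Kolmogorov-Smirnov comparison mentioned in the table, using SciPy; the sample sizes, seed, and distributions are illustrative choices, not from the lecture:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two illustrative data sets: one Gaussian, one (zero-mean shifted) exponential.
x = rng.normal(loc=0.0, scale=1.0, size=500)
y = rng.exponential(scale=1.0, size=500) - 1.0

# One-sample KS test: is x consistent with a standard normal CDF?
res1 = stats.kstest(x, "norm")

# Two-sample KS test: are x and y drawn from the same (unknown) distribution?
res2 = stats.ks_2samp(x, y)

print("one-sample KS:", res1.statistic, res1.pvalue)
print("two-sample KS:", res2.statistic, res2.pvalue)
```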

Quantitative (frequentist approach: random variables and ensembles)

Classical: Probabilities of events in an event space S are found, a priori, by considering the possible outcomes of experiments and the manner in which the outcomes may be achieved. For example, a single die yields probabilities of 1/6 for the outcomes of a single toss. A problem with the classical definition (which often calculates probabilities from line-segment lengths, areas, volumes, etc.) is that probabilities cannot be evaluated for all cases (e.g. an unfair die).

Relative frequency: The probability of an event $\epsilon$ is defined to be the limit of the frequency of occurrence of the event in $N$ repeated experiments as $N \to \infty$:
$$P\{\epsilon\} = \lim_{N\to\infty} \frac{N_\epsilon}{N}.$$
The problem with frequencies is that in real experiments $N$ is always finite, so probabilities can only be estimated. Such estimates are a poor basis for deductive probabilistic theories.

Axiomatic: A deductive theory of probability can be based on axioms of probability for allowed events in an overall event space. For $\epsilon$ an element of the event space $S$ and $P(\epsilon)$ the probability of the event $\epsilon$, the axioms are:
i) $0 \le P(\epsilon) \le 1$
ii) $P\{S\} = 1$ ($S$ = set of all events, so $P\{S\}$ is the probability that any event will be the outcome of an experiment).
iii) If $\epsilon_1$ and $\epsilon_2$ are mutually exclusive, then the probability of $\epsilon_1 \cup \epsilon_2$ (the event that $\epsilon_1$ or $\epsilon_2$ occurs) is $P(\epsilon_1 \cup \epsilon_2) = P(\epsilon_1) + P(\epsilon_2)$.
These axioms imply further that:
iv) If $\bar\epsilon$ = the event that $\epsilon$ does not occur, then $P(\bar\epsilon) = 1 - P(\epsilon)$.
v) If $\epsilon_1$ is a sufficient condition for $\epsilon_2$, then $P(\epsilon_1) \le P(\epsilon_2)$. Equality holds when $\epsilon_2$ is also a necessary condition for $\epsilon_1$.
vi) Let $P(\epsilon_1\epsilon_2)$ = the probability of the event that both $\epsilon_1$ and $\epsilon_2$ occur (the overlap in a Venn diagram). Mutually exclusive events have $P(\epsilon_1\epsilon_2) = 0$, while in general $P(\epsilon_1\epsilon_2) \ge 0$. In general we also have
$$P(\epsilon_1 \cup \epsilon_2) = P(\epsilon_1) + P(\epsilon_2) - P(\epsilon_1\epsilon_2) \le P(\epsilon_1) + P(\epsilon_2).$$

Conditional Probabilities: Conditional probabilities also satisfy the axioms. Consider the probability that an event $\epsilon_2$ occurs given that the event $\epsilon_1$ occurs. It may be shown that this probability is
$$P\{\epsilon_2|\epsilon_1\} = \frac{P\{\epsilon_1\epsilon_2\}}{P\{\epsilon_1\}}.$$
Similarly,
$$P\{\epsilon_1|\epsilon_2\} = \frac{P\{\epsilon_1\epsilon_2\}}{P\{\epsilon_2\}}.$$
Bayes' Theorem: Solving for $P\{\epsilon_1\epsilon_2\}$ from the two preceding equations and setting the solutions equal yields Bayes' theorem:
$$P\{\epsilon_2|\epsilon_1\} = \frac{P\{\epsilon_1|\epsilon_2\}\,P\{\epsilon_2\}}{P\{\epsilon_1\}}.$$
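A quick numerical check of the relations above, as a hedged sketch; the joint probabilities below are invented purely for illustration and are not from the lecture:

```python
# Illustrative (invented) joint probabilities for two events; they sum to 1.
p_e1_and_e2 = 0.12
p_e1_not_e2 = 0.28
p_e2_not_e1 = 0.18
p_neither   = 0.42

p_e1 = p_e1_and_e2 + p_e1_not_e2          # marginal P(e1) = 0.40
p_e2 = p_e1_and_e2 + p_e2_not_e1          # marginal P(e2) = 0.30

p_e2_given_e1 = p_e1_and_e2 / p_e1        # definition of conditional probability
p_e1_given_e2 = p_e1_and_e2 / p_e2

# Bayes' theorem: P(e2|e1) = P(e1|e2) P(e2) / P(e1)
p_e2_given_e1_bayes = p_e1_given_e2 * p_e2 / p_e1

print(p_e2_given_e1, p_e2_given_e1_bayes)  # both 0.3, as Bayes' theorem requires
```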

Bayesian Inference: Bayesian inference is based on the preceding equation where, with relaxed definitions of the event space, hypotheses and parameters are attributed probabilities based on knowledge before and after an experiment is conducted. Bayes' theorem combined with the qualitative interpretation of probability therefore allows the sequential acquisition of knowledge (i.e. learning) to be handled. The implied temporal sequence of events, by which data are accumulated and the likelihood of a hypothesis being true increases or decreases, represents the power of the Bayesian outlook. Moreover, with Bayesian inference, the assumptions behind the inference are often brought up front as conditions upon which probabilities are calculated.

Probability on One Page
1. PDF and CDF: $f_X(x)$ has unit area and $F_X(x)$ is in $[0, 1]$; $f_X(x) = dF_X(x)/dx$.
2. Characteristic function: $\Phi_X(\omega) = \int dx\, f_X(x)\, e^{i\omega x}$.
3. Moments (moment generating function): $\langle X^n \rangle = i^{-n}\, \left.\dfrac{d^n \Phi_X(\omega)}{d\omega^n}\right|_{\omega=0}$.
4. Change of variable: $y = g(x)$ with solutions $x_j = g^{-1}(y)$, $j = 1, \dots, n$. Probability is conserved, $f_Y(y)\,|dy| = f_X(x)\,|dx|$:
$$f_Y(y) = \sum_{j=1}^{n} \frac{f_X(x_j)}{\left|dg(x)/dx\right|_{x=x_j}}.$$
For an $N \times N$ transformation, the derivative is replaced by the determinant of the Jacobian matrix.
5. Sum of independent RVs, $Z = X + Y$:
$$f_Z(z) = f_X * f_Y \ (\text{convolution}) \implies \Phi_Z(\omega) = \Phi_X(\omega)\,\Phi_Y(\omega) \ (\text{product}) \implies \sigma_Z^2 = \sigma_X^2 + \sigma_Y^2 \ (\text{variances add}).$$
Easily extendable to a sum of $N$ RVs.
6. Conditional PDFs: $f_X(x|y)\, f_Y(y) = f_{XY}(x, y)$.
7. Bayes' theorem: $f_Y(y|x) = \dfrac{f_X(x|y)\, f_Y(y)}{f_X(x)}$.
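A minimal Monte Carlo sketch of item 5 above: for independent X and Y the variances add and the characteristic functions multiply. The sample size, seed, and distributions are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent RVs with different (arbitrary) distributions.
x = rng.uniform(-0.5, 0.5, size=n)        # variance 1/12
y = rng.exponential(scale=2.0, size=n)    # variance 4
z = x + y

# Variances add for independent RVs.
print(np.var(x) + np.var(y), np.var(z))   # both close to 4.083

# Characteristic functions multiply: Phi_Z(w) ~ Phi_X(w) * Phi_Y(w).
w = 0.7                                   # an arbitrary test frequency
phi = lambda s: np.mean(np.exp(1j * w * s))
print(phi(x) * phi(y), phi(z))            # nearly equal complex numbers
```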

Generating Pseudo-Random Numbers (cf. Section 5.13 of Gregory)
Several methods:
1. Transformation method (exponential example), e.g. X = g(Y).
2. CDF mapping of uniformly distributed numbers: works for an arbitrary output PDF (see the sketch below).
3. Measurements of natural systems.
For higher-order statistics (e.g. specifying a power spectral shape), other methods are needed, to be discussed.

[Figure: CDF mapping of a uniform deviate U in [0, 1] to an arbitrary output PDF.]
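A minimal sketch of method 2 (inverse-CDF mapping of uniform deviates), using the exponential PDF as the target because its CDF inverts in closed form; the mean, sample size, and seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: exponential PDF f(x) = (1/mu) exp(-x/mu), x >= 0.
# CDF: F(x) = 1 - exp(-x/mu)  =>  inverse mapping: x = -mu * ln(1 - u).
mu = 3.0                       # arbitrary mean for illustration
u = rng.uniform(0.0, 1.0, size=100_000)
x = -mu * np.log(1.0 - u)      # CDF mapping: F(x) = u

print(x.mean(), x.var())       # should be close to mu and mu**2
```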


Basic Probability Tools
- Random variables, event space: ζ = event, mapped to X = random variable
- PDF, CDF, characteristic function
- Median, mode, mean
- Conditional probabilities and PDFs
- Bayes' theorem
- Comparing PDFs
- Moments and moment tests
- Sums of random variables and the convolution theorem
- Central Limit Theorem
- Changes of variable
- Functions of random variables
- Sequences of random variables
- Stochastic processes = sequences of random variables vs. t, f, etc.
- Power spectrum, autocorrelation, autocovariance, and structure functions
- Bispectrum = 2D spectrum of the third moment
- Random walks, shot noise, autoregressive, moving average, and Markov processes

Relevant PDFs
Gaussian or Normal, $N(\mu, \sigma^2)$ ($x$ is the random-variable argument):
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}$$
Exponential:
$$f_X(x) = \langle x \rangle^{-1}\, e^{-x/\langle x \rangle}\, H(x)$$
Chi-squared: $X = \sum_{j=1}^{N} x_j^2$ with the $x_j$ i.i.d. Gaussian RVs drawn from $N(0, 1)$:
$$f_X(x) = \frac{1}{\Gamma(N/2)\, 2^{N/2}}\, x^{(N/2)-1}\, e^{-x/2}$$
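A hedged sketch checking the chi-squared construction above: sum N squared unit Gaussians and compare the sample mean and variance to the known chi-squared values (mean N, variance 2N). N, the sample size, and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)

N = 5                                          # degrees of freedom (arbitrary)
g = rng.normal(0.0, 1.0, size=(100_000, N))    # i.i.d. N(0,1) deviates
chi2 = (g**2).sum(axis=1)                      # X = sum_j x_j^2

print(chi2.mean(), chi2.var())                 # expect ~N = 5 and ~2N = 10
```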

[Figures: chi-squared distribution (http://en.wikipedia.org/wiki/chi-); Gaussian empirical rule (http://upload.wikimedia.org/wikipedia/commons/a/a9/empirical_rule.png)]

[Figure: confidence intervals of a Gaussian or Normal RV (from Wikipedia); χ² distributions.]

Discrete RV: Poisson Processes
If we throw points onto the time axis randomly but with a uniform average rate $\lambda$, the probability of having $k$ points in an interval $[0, t]$ is
$$P(k) = \frac{e^{-\lambda t} (\lambda t)^k}{k!}. \qquad (1)$$
Note this is the limiting case of a binomial distribution,
$$P(k) = \binom{n}{k} p^k (1-p)^{n-k},$$
where $p$ is the probability of having one point in the interval $[0, t]$ when $n$ points occur in a larger interval $[0, T]$; i.e. $p = t/T$. In the limit where $n \to \infty$ so that $p \to 0$ while $np = \text{constant} > 0$, the Poisson expression follows from the binomial probability. The mean and second moment of $k$ are
$$\langle k \rangle = \lambda t, \qquad \langle k^2 \rangle = \langle k \rangle^2 + \lambda t.$$

Central Limit Theorem
The Central Limit Theorem is a powerful tool for observational science because it can be used to invoke a priori distributions for measured quantities. In simple terms, any quantity that is the sum of many independent ones will have Gaussian statistics. We need to understand what constitutes independence and how Gaussian statistics play out for stochastic processes, which may be viewed as sequences of a large number of random variables. In addition, the CLT does not always apply!
Consider the sum
$$Z_N = \frac{1}{N} \sum_{j=1}^{N} X_j,$$
where the $X_j$ are independent, identically distributed (iid) random variables with means and variances
$$\mu_j = \langle X_j \rangle, \qquad \sigma_j^2 = \langle X_j^2 \rangle - \langle X_j \rangle^2,$$
and the PDFs of the $X_j$'s are almost arbitrary. Restrictions on the distributions of each $X_j$ are that
i) $\sigma_j^2 > m > 0$, $m$ = constant;
ii) $\langle |X_j|^n \rangle < M$ = constant for $n > 2$.
In the limit $N \to \infty$, $Z_N$ becomes a Gaussian random variable with mean and variance
$$\langle Z_N \rangle = \frac{1}{N} \sum_{j=1}^{N} \mu_j, \qquad \sigma_Z^2 = \frac{1}{N^2} \sum_{j=1}^{N} \sigma_j^2.$$
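A short sketch of the binomial-to-Poisson limit stated above: hold $np = \lambda t$ fixed while $n$ grows and compare the binomial $P(k)$ to the Poisson expression. The values of $\lambda t$ and $k$ are illustrative choices:

```python
from math import comb, exp, factorial

lam_t = 3.0      # lambda * t, held fixed (illustrative value)
k = 2

poisson = exp(-lam_t) * lam_t**k / factorial(k)

for n in (10, 100, 1000, 10000):
    p = lam_t / n                                 # p = t/T with n points in the larger interval
    binom = comb(n, k) * p**k * (1 - p)**(n - k)  # binomial probability of k points
    print(n, binom)

print("Poisson limit:", poisson)                  # the binomial values approach this
```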

Example: arithmetic sums of uniformly distributed numbers
Consider the sum
$$\bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j,$$
where the $x_j$ are drawn from a uniform PDF: $x_j \in [-1/2, 1/2]$. The figure shows counts of $\bar{x}$ for averages of length $N = 1, 2, 4, 8$, and $16$ for 100 realizations.

Example: arithmetic sums of uniformly distributed numbers (continued)
Histograms based on $10^5$ realizations are shown in the figure below. Two important features are:
1. the width of the PDF scales as $N^{-1/2}$;
2. the shape of the PDF tends toward a Gaussian form with $\langle \bar{x} \rangle = 0$ and $\sigma_{\bar{x}} = \langle \bar{x}^2 \rangle^{1/2} = 1/\sqrt{12N}$.
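A minimal sketch reproducing feature 1 above (width scaling as $N^{-1/2}$, with $\sigma_{\bar{x}} = 1/\sqrt{12N}$), using $10^5$ realizations as in the lecture; the seed is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n_real = 100_000

for N in (1, 2, 4, 8, 16):
    # xbar for each realization: mean of N uniform deviates on [-1/2, 1/2]
    xbar = rng.uniform(-0.5, 0.5, size=(n_real, N)).mean(axis=1)
    print(N, xbar.std(), 1.0 / np.sqrt(12 * N))   # sample width vs. predicted 1/sqrt(12N)
```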

Cauchy random numbers
A Cauchy distribution is
$$f_X(x) = \frac{1}{\pi (1 + x^2)}.$$
Its characteristic function is
$$\Phi_X(\omega) = e^{-|\omega|}.$$
Using uniformly distributed numbers $u \in [-1/2, 1/2]$, Cauchy random numbers $x$ can be generated using the transformation
$$x = \tan(\pi u).$$
Check: Using the change of variable theorem,
$$f_X(x) = \frac{f_U(u(x))}{|dx/du|} = \frac{1}{\pi (1 + x^2)},$$
where
$$\frac{dx}{du} = \frac{d}{du} \tan(\pi u) = \frac{\pi}{\cos^2(\pi u)} = \pi (1 + x^2)$$
has been used.
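A minimal sketch of the transformation above: generate Cauchy deviates from uniform ones and compare an empirical density to the analytic PDF $1/[\pi(1+x^2)]$. The bin range, sample size, and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)

u = rng.uniform(-0.5, 0.5, size=100_000)
x = np.tan(np.pi * u)                       # Cauchy deviates via the tangent transformation

# Empirical density on a modest range, normalized by the full sample size.
edges = np.linspace(-10, 10, 201)
counts, _ = np.histogram(x, bins=edges)
dx = edges[1] - edges[0]
hist = counts / (x.size * dx)

centers = 0.5 * (edges[1:] + edges[:-1])
analytic = 1.0 / (np.pi * (1.0 + centers**2))

print(np.max(np.abs(hist - analytic)))      # small residual (finite-sample noise)
```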

Example: arithmetic sums of Cauchy distributed numbers
Histograms based on $10^5$ realizations are shown in the figure below. Here neither the width nor the shape of the PDF changes as $N$ gets larger! What's happening?

First let's understand the case of summing uniformly distributed numbers. Again consider $X_j$ that are all uniformly distributed between $\pm 1/2$. The PDF of a single random variable is the unit boxcar, $f_X(x) = \Pi(x)$, and it is a Fourier transform pair with its characteristic function,
$$\Phi_j(\omega) = \langle e^{i\omega x_j} \rangle = \frac{\sin(\omega/2)}{\omega/2}.$$
Graphically: the boxcar $\Pi(x)$ pairs with the sinc function $\sin(\omega/2)/(\omega/2)$.

Now consider sums of $N$ RVs of this type (i.e. $\sum_j X_j$) and use the convolution theorem to evaluate the characteristic function of the sum:
$$\Phi_{\sum_j X_j}(\omega) = \left[\frac{\sin(\omega/2)}{\omega/2}\right]^N.$$
Graphically, the curves $[\sin(\omega/2)/(\omega/2)]^N$ for $N = 2, 3, 4, \dots$ approach a Gaussian form in $\omega$.
We need to rescale the sum to find the characteristic function of
$$Z_N = \frac{1}{N} \sum_{j=1}^{N} x_j$$
defined earlier. From the convolution results we have
$$\Phi_{N Z_N}(\omega) = \left[\frac{\sin(\omega/2)}{\omega/2}\right]^N.$$
From the transformation of random variables we have that
$$f_{Z_N}(x) = N f_{N Z_N}(N x),$$
and by the scaling theorem for Fourier transforms
$$\Phi_{Z_N}(\omega) = \Phi_{N Z_N}(\omega/N) = \left[\frac{\sin(\omega/2N)}{\omega/2N}\right]^N.$$

If the CLT holds, then
$$\lim_{N\to\infty} \Phi_{Z_N}(\omega) = e^{-\omega^2 \sigma_Z^2 / 2} \qquad \text{or} \qquad f_{Z_N}(x) = \frac{1}{\sqrt{2\pi\sigma_Z^2}}\, e^{-x^2/2\sigma_Z^2}.$$
Consistency with this limiting form can be seen by expanding $\Phi_{Z_N}$ for small $\omega$:
$$\Phi_{Z_N}(\omega) = \left[\frac{\sin(\omega/2N)}{\omega/2N}\right]^N \approx \left[1 - \frac{1}{3!}\left(\frac{\omega}{2N}\right)^2\right]^N \approx 1 - \frac{\omega^2}{24N},$$
which is identical to the expansion of $\exp(-\omega^2 \sigma_Z^2/2)$ with $\sigma_Z^2 = 1/(12N)$.

Why the CLT does not work for a sum of Cauchy variables: the Cauchy distribution and its characteristic function are
$$f_X(x) = \frac{1}{\pi(1+x^2)}, \qquad \Phi_X(\omega) = e^{-|\omega|}.$$
In this case
$$Z_N = \frac{1}{N} \sum_{j=1}^{N} x_j$$
has a characteristic function
$$\Phi_{Z_N}(\omega) = \left[e^{-|\omega|/N}\right]^N = e^{-|\omega|}.$$
By inspection the exponential will not converge to a Gaussian. Instead, the sum of $N$ Cauchy RVs is a Cauchy RV.
Is the Cauchy distribution a legitimate PDF? It is normalized, but its variance diverges,
$$\langle X^2 \rangle = \int_{-\infty}^{\infty} dx\, \frac{x^2}{\pi(1+x^2)} \to \infty,$$
so the conditions of the CLT are not met.
The Cauchy distribution is an example of a stable distribution, defined as one with the property that a linear combination of two independent copies of the variable has the same distribution to within a location and a scale parameter. The family of stable distributions is sometimes called the Lévy alpha-stable family (Lévy flights, etc.). There is a generalized CLT involving PDFs with long power-law tails.
http://en.wikipedia.org/wiki/Lévy_distribution
http://en.wikipedia.org/wiki/Stable_distribution
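A hedged sketch illustrating the failure of the CLT for Cauchy deviates: the mean of N Cauchy RVs has the same distribution as a single one, so its spread does not shrink as N grows (contrast the uniform case above). The interquartile range is used as the width measure because the variance diverges; the sample sizes and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
n_real = 100_000

for N in (1, 4, 16, 64):
    u = rng.uniform(-0.5, 0.5, size=(n_real, N))
    zbar = np.tan(np.pi * u).mean(axis=1)     # mean of N Cauchy deviates per realization
    q25, q75 = np.percentile(zbar, [25, 75])
    print(N, q75 - q25)                       # width stays ~2 (the IQR of a unit Cauchy)
```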

I. Ensemble vs. Time Averages
- In practice we are forced to compute sample averages of various types.
- Our goal is often, however, to learn about the parent population, or ensemble, from which the data are conceptually drawn.
- In some cases, sample averages converge to good estimates of ensemble averages.
- In others, convergence can be very slow or can fail (e.g. red-noise processes).

[Figures: realizations I(t, ζ) of a random process. As the data span length T grows, the time average approaches the ensemble average for an ergodic process.]

Types of Random Processes
[Figure: classification of random processes, after Goodman, Statistical Optics.]

Example: the Universe
- Measurements of the CMB and large-scale structure are made on a single realization.
- The goal of cosmology is to learn about the (notional) ensemble of conditions that lead to what we see.
- Quantitatively these are cast in questions like "what was the primordial spectrum of density fluctuations?", and that spectrum is usually parameterized as a power law.
- Perhaps the multiverse = the ensemble. Are all universes the same (statistically)? Do measurements on our universe typify all universes? (Conventional wisdom says no.)
