Statistical Methods for Astronomy

"If your experiment needs statistics, you ought to have done a better experiment." - Ernest Rutherford

Lectures 1 and 2: Why do we need statistics? Definitions. Statistical distributions: the binomial distribution, the Poisson distribution, the Gaussian distribution, and the central limit theorem. Your statistical toolbox: Bayes' theorem, the F-test, the K-S test, Monte Carlo methods and transforming deviates, least squares, and chi-squared significance.
References Data Reduction and Error Analysis, Bevington and Robinson Practical Statistics for Astronomers, Wall and Jenkins Numerical Recipes, Press et al. Understanding Data Better with Bayesian and Global Statistical Methods, Press, 1996 (on astro-ph)
Another look at the problem Knowing the distribution allows us to predict what we will observe. We often know what we have observed and want to determine what that tells us about the distribution.
Bayesian Statistics Frequentist approaches are computationally easy, but often solve the inverse of the problem we want. Bayesian approaches use both the data and any prior information to develop a posterior distribution, which allows more direct calculation of parameter uncertainties and more easily incorporates outside information.
An Example I flip a coin 10 times and obtain 7 heads. What is the probability of flipping heads? A frequentist statistician would say 0.7. A Bayesian statistician might define a prior probability with mean = 0.5 and sigma = 0.2 (for example). Who would you side with?
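The coin example can be made concrete with a small numerical sketch. This assumes the Gaussian prior suggested above (mean 0.5, sigma 0.2) and a binomial likelihood for 7 heads in 10 flips; the grid resolution and the function name `posterior_mean` are illustrative choices, not part of the lecture.

```python
import math

def posterior_mean(heads=7, flips=10, prior_mu=0.5, prior_sigma=0.2, npts=1001):
    """Grid-based posterior mean for the coin's heads probability p."""
    ps = [i / (npts - 1) for i in range(npts)]
    # Unnormalized posterior: binomial likelihood times Gaussian prior
    weights = [
        p**heads * (1 - p)**(flips - heads)
        * math.exp(-0.5 * ((p - prior_mu) / prior_sigma) ** 2)
        for p in ps
    ]
    norm = sum(weights)
    return sum(p * w for p, w in zip(ps, weights)) / norm

mean = posterior_mean()
# The posterior mean lands between the prior (0.5) and the frequentist 0.7
```

The prior pulls the answer back toward 0.5, illustrating how a Bayesian estimate blends the data with outside information.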
Obtaining the Posterior Distribution Bayes' Theorem states:

P(B|A) = P(A|B) P(B) / P(A)

P(A|B) should be read as "the probability of A given B." A is typically the data, B the statistic we want to know. P(B) is the prior information we may know about the experiment. P(A) = P(data) is just a normalization constant, so:

P(B|data) ∝ P(data|B) P(B)
Using Bayes' theorem Assume we are looking for faint companions, and expect them to be around 1% of the stars we observe. From putting in fake companions we know that we can detect planets 90% of the time. We also know that we see false planets in 3% of the observations. What is the probability that an object we see is actually a planet?

Given: P(planet) = 0.01, P(det.|planet) = 0.9, P(det.|no planet) = 0.03.

P(planet|det.) = P(det.|planet) P(planet) / P(det.)
= P(det.|planet) P(planet) / [P(det.|planet) P(planet) + P(det.|no planet) P(no planet)]
= (0.9 × 0.01) / (0.9 × 0.01 + 0.03 × 0.99) = 0.23
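The planet example above translates directly into a few lines of arithmetic; the variable names are illustrative.

```python
# Bayes' theorem applied to the faint-companion example
p_planet = 0.01              # prior: ~1% of stars have a detectable companion
p_det_given_planet = 0.90    # detection efficiency from fake-companion tests
p_det_given_no_planet = 0.03 # false-positive rate

# Normalization: total probability of a detection
p_det = (p_det_given_planet * p_planet
         + p_det_given_no_planet * (1 - p_planet))

p_planet_given_det = p_det_given_planet * p_planet / p_det
# ~0.23: most "detections" are false positives despite 90% efficiency
```

The low prior dominates: even a 90% detection efficiency leaves most candidate detections spurious.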
General Bayesian Guidance Focuses on probability rather than accept/reject. Bayesian approaches allow you to calculate probabilities the parameters have a range of values in a more straightforward way. A common concern about Bayesian statistics is that it is subjective. This is not necessarily a problem. Bayesian techniques are generally more computationally intensive, but this is rarely a drawback for modern computers.
Hypothesis Testing Hypothesis testing uses some metric to determine whether two data sets, or a data set and a model, are distinct. Typically, the problem is set up so that the hypothesis is that the data sets are consistent (the null hypothesis). A probability is calculated that the value found would be obtained again with another sample. Based on the required level of confidence, the hypothesis is rejected or accepted.
Are two data sets drawn from the same distribution? The t statistic quantifies the likelihood that the means are the same. The F statistic quantifies the likelihood that the variances of two data sets are the same. Consider two data sets, x and y, with n and m data points respectively:

t = (x̄ − ȳ) / [s √(1/n + 1/m)]

F = [Σ(x_i − x̄)² / (n − 1)] / [Σ(y_j − ȳ)² / (m − 1)]

s² = (n s_x² + m s_y²) / (n + m)
Student's t test Calculate the t statistic; perfect agreement is t = 0. Then evaluate the probability that |t| exceeds the value found.

t = (x̄ − ȳ) / [s √(1/n + 1/m)],  where s² = (n s_x² + m s_y²) / (n + m)
F test Calculate the F statistic:

F = [Σ(x_i − x̄)² / (n − 1)] / [Σ(y_j − ȳ)² / (m − 1)]

Then calculate the probability that F exceeds the value found.
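The t and F statistics above can be computed in a few lines. This sketch uses the standard pooled-variance form of s² (equivalent to the slide's expression when s_x², s_y² are the per-sample variances); the sample data and function name are made up for illustration, and converting t or F to a probability would require the t and F distributions (e.g. from scipy.stats), which is not shown.

```python
import math

def t_and_F(x, y):
    """Return the Student's t and F statistics for two samples."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    ssx = sum((xi - xbar) ** 2 for xi in x)  # sum of squared residuals, x
    ssy = sum((yi - ybar) ** 2 for yi in y)  # sum of squared residuals, y
    # Pooled variance, then t for the difference of means
    s2 = (ssx + ssy) / (n + m - 2)
    t = (xbar - ybar) / math.sqrt(s2 * (1 / n + 1 / m))
    # F: ratio of the two sample variances
    F = (ssx / (n - 1)) / (ssy / (m - 1))
    return t, F

t, F = t_and_F([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
# Equal variances give F = 1; shifted means give t != 0
```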
The Kolmogorov-Smirnov Test Calculate the cumulative distribution function for your model, C_model(x). Calculate the cumulative distribution function for your data, C_data(x). Find the maximum of |C_model(x) − C_data(x)|; this is the statistic D. The variable x must be continuous to use the K-S test.
K-S test example (figure: D is the maximum vertical separation between the model and data cumulative distributions)
Monte Carlo Simulation Often it is easiest simply to replicate an experiment or observation on the computer. In general these tools are referred to as Monte Carlo methods. The general idea is to simulate the randomness and reproduce the observations for comparison with data. First we need a random number sequence.
Creating Random Numbers A proper random sequence of numbers is a whole topic in itself; Numerical Recipes discusses this in some detail. A simple example of a random number generator is the sequence:

I_{j+1} = (a I_j) mod m

where a and m are large integers and I_j / m gives a deviate between 0 and 1. The starting value I_0 is the seed; a given seed always gives us the exact same sequence of random numbers.
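The generator above is a linear congruential generator, and can be sketched in a few lines. The constants a = 16807, m = 2³¹ − 1 are the classic Park-Miller "minimal standard" values, chosen here for illustration; production code should use a well-tested library generator.

```python
def lcg(seed, a=16807, m=2**31 - 1):
    """Yield uniform deviates in (0, 1) via I_{j+1} = (a * I_j) mod m."""
    i = seed
    while True:
        i = (a * i) % m
        yield i / m  # normalize to (0, 1)

gen = lcg(seed=42)
u = [next(gen) for _ in range(5)]
# The same seed always reproduces the same sequence
```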
Random Numbers The example gives a uniformly distributed set of random numbers. That is,

p(x) dx = dx for 0 ≤ x ≤ 1, and 0 otherwise.

We would like useful distributions, such as Poisson, etc. To do so, we need to transform the random numbers.
Transformation Method Starting from the law for transformation of probabilities:

p(y) dy = p(x) dx

we can rewrite to solve for the probability we want:

p(y) = p(x) |dx/dy|

1. Integrate the probability distribution.
2. Solve for the new variable (y) in terms of the uniform variable (x).
Example I want to simulate the time it takes between arrivals of photons at the detector. This is given by an exponential probability distribution:

P(t) dt = λ e^{−λt} dt

Use the transformation of probabilities. Integrating:

λ e^{−λt} dt = dx  →  e^{−λt} = x  →  t = −ln(x) / λ

A random number in the range 0 to 1 will be transformed to one between +∞ and 0.
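The inverse transform derived above, t = −ln(x)/λ, takes one line of code. The rate λ = 2.0 here is an illustrative parameter, not from the lecture.

```python
import math
import random

def exponential_deviate(lam, rng=random.random):
    """Exponential inter-arrival time via the inverse transform t = -ln(x)/lam."""
    x = 1.0 - rng()  # uniform in (0, 1]; avoids log(0)
    return -math.log(x) / lam

random.seed(0)
times = [exponential_deviate(lam=2.0) for _ in range(10000)]
# The sample mean should approach 1/lam = 0.5
```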
Limitations Transformation methods are limited to analytical probability distributions. One also needs to be able to integrate the probability distribution and invert the equation to solve for the new variable. Often one of these criteria is not satisfied. You can still generate useful random numbers using the rejection method.
Rejection Method Generate two uniform random deviates, x and y. Adjust x to span the range of values expected for the random number (x' = f(x)), and scale y to span from 0 to the maximum of the probability distribution. Compare the value of y to the value of the probability distribution at x' (y' = p(x')). If y < y', use the value of x' in your simulation; if y ≥ y', reject this pair and start over.
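The steps above can be sketched as follows. The target distribution p(x) = 2x on [0, 1] and the bound pmax = 2 are illustrative choices; pmax must be at least the maximum of p over the sampled range for the method to be correct.

```python
import random

def rejection_sample(p, pmax, lo, hi, rng=random.random):
    """Draw one deviate from density p on [lo, hi] by rejection."""
    while True:
        xp = lo + (hi - lo) * rng()  # x' spans the expected range of values
        y = pmax * rng()             # y spans [0, max of p]
        if y < p(xp):                # keep only points under the curve
            return xp
        # otherwise reject this pair and start over

random.seed(1)
samples = [rejection_sample(lambda x: 2 * x, pmax=2.0, lo=0.0, hi=1.0)
           for _ in range(20000)]
# For p(x) = 2x on [0, 1], the mean should approach 2/3
```

The method wastes the rejected pairs, so its efficiency is the ratio of the area under p to the area of the bounding box, but it needs no integration or inversion of p.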