Statistical Methods for Astronomy

Lecture 1 (Probability): Why do we need statistics? Probability distributions; Binomial Distribution; Poisson Distribution; Gaussian Distribution; Bayes' Theorem; Central Limit Theorem.

Lecture 2 (Statistics): Useful Statistics Definitions; Error Analysis; Error Propagation; Least Squares; chi-squared; Significance; Comparison Statistics.
Possible Statistics

Average: $\mu = \frac{1}{n}\sum_j x_j$

Most Likely Statistics
Median: order the data; $\mathrm{med} = x_j$ where $j = N/2 + 1/2$ (odd $N$), or $\mathrm{med} = \frac{x_j + x_{j+1}}{2}$ where $j = N/2$ (even $N$).
Mode: the most frequently occurring value(s).

Spread Statistics
Variance: $s^2 = \sum_j \frac{(x_j - \mu)^2}{n-1}$
Root mean square: $s = \sqrt{\sum_j \frac{(x_j - \mu)^2}{n-1}}$
Mean deviation: $\Delta x = \sum_j \frac{|x_j - \mu|}{n-1}$
Sample Maximum
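The statistics above can be sketched in NumPy; the sample values here are invented for illustration, and the $n-1$ denominators follow the slide's conventions:

```python
import numpy as np

x = np.array([2.0, 3.0, 3.0, 5.0, 7.0])            # made-up sample

mean     = x.mean()                                 # average
median   = np.median(x)                             # middle value of the ordered data
variance = x.var(ddof=1)                            # sum (x_j - mu)^2 / (n - 1)
rms      = np.sqrt(variance)                        # square root of the variance
mean_dev = np.abs(x - mean).sum() / (x.size - 1)    # mean deviation, n - 1 as on the slide

print(mean, median, variance, rms, mean_dev)        # 4.0 3.0 4.0 2.0 2.0
```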
Know your statistic [figure from Biostatistical Analysis, fourth edition, Simon & Schuster 1999]
Good Statistics
Good statistics should be:
Unbiased - should converge to the right value with more data points.
Robust - should not be strongly affected by a few bad data points.
Consistent - should not be systematically affected by the size of your sample.
Close - should converge as quickly as possible with increasing data.
Relation to Probability Distributions
Statistics are based on data only! However, they are often most useful as estimators of the parameters of a probability distribution. This is a frequentist approach: the distribution is used to determine how often we might obtain the resulting statistic, so that we can decide whether the model is correct.
Chebyshev Inequality
What if we don't know the underlying distribution? If we know the mean and sample variance, Chebyshev's inequality is very useful:
$P(|x - \mu| > n\sigma) < \frac{1}{n^2}$
Be careful of the typo in Wall and Jenkins's text!

n   P_Chebyshev (<)   P_Gauss (=)
1   1                 0.32
2   0.25              0.05
3   0.11              0.003
4   0.06              0.00006
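A quick numerical check that Gaussian tail fractions always fall below the distribution-free Chebyshev bound (a sketch; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 100_000)          # unit-variance Gaussian sample

for n in (1, 2, 3, 4):
    tail  = np.mean(np.abs(x) > n)         # empirical P(|x - mu| > n*sigma)
    bound = 1 / n**2                       # Chebyshev's distribution-free bound
    print(n, tail, bound)                  # the Gaussian tail sits below the bound
```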
Error Analysis
If I take N measurements, how precisely can I determine the sample mean, compared to a set of 2N measurements?
$\langle \Delta\mu^2 \rangle = \left\langle \left( \frac{1}{N} \sum_j (x_j - \mu) \right)^2 \right\rangle$
$\langle \Delta\mu^2 \rangle = \frac{\sigma^2}{N} + \frac{1}{N^2} \sum_{i \neq j} \langle (x_i - \mu)(x_j - \mu) \rangle$
For independent measurements the cross terms vanish, so the uncertainty in the mean falls as $\sigma/\sqrt{N}$.
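The $\sqrt{N}$ scaling can be checked by simulation (a sketch; the sample sizes and seed are arbitrary): the scatter of the mean of 2N independent measurements should be smaller by a factor of $\sqrt{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, N, trials = 1.0, 100, 20_000

# sample means of N and of 2N independent measurements, many times over
means_N  = rng.normal(0.0, sigma, (trials, N)).mean(axis=1)
means_2N = rng.normal(0.0, sigma, (trials, 2 * N)).mean(axis=1)

ratio = means_N.std() / means_2N.std()     # expect ~ sqrt(2) ~ 1.414
print(ratio)
```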
Propagation of Errors
If the quantity of interest is, say, $f = x + y$, then
$\sigma_f^2 = \langle f^2 \rangle - \langle f \rangle^2 = \sigma_x^2 + \sigma_y^2$
In general:
$\sigma^2(f(x_i)) = \sum_i \left( \frac{\partial f}{\partial x_i} \right)^2 \sigma^2(x_i)$
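A Monte Carlo check of the quadrature rule for $f = x + y$ (the means, errors, and seed here are made up):

```python
import numpy as np

sx, sy = 0.3, 0.4
sf = np.sqrt(sx**2 + sy**2)                # analytic: sigma_f^2 = sigma_x^2 + sigma_y^2

rng = np.random.default_rng(1)
x = rng.normal(10.0, sx, 200_000)          # two independent measured quantities
y = rng.normal(5.0, sy, 200_000)

print(sf, (x + y).std())                   # both ~ 0.5
```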
Comparing a Data Set to a Distribution
Suppose we have N data points and a model with M parameters that we think describes them. Our model: $y(x) = y(x; a_1 \ldots a_M)$.
An intuitive metric is the distance of each data point from the model. Let's use the square of the difference between the data and the model:
$LS = \sum_{i=1}^{N} \left( y_i - y(x_i; a_1 \ldots a_M) \right)^2$
Why is this a reasonable metric for determining the best fit to the data?
Justification for Least-Squares
What is the probability that a certain data point is drawn from a given model?
$P \propto e^{-(y_i - y(x_i))^2 / 2\sigma^2}$
For N points the overall probability for a given model is
$P \propto \prod_{i=1}^{N} e^{-(y_i - y(x_i))^2 / 2\sigma^2}$
To maximize the probability we should minimize the exponent:
$\sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{2\sigma^2}$
Chi-squared
The exponent is referred to as the statistic chi-squared:
$\chi^2 \equiv \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_i^2}$
Chi-squared is not a unique metric, but it is commonly used.
Mean: $\mu_{\chi^2} = \nu = N - M$; Variance: $\sigma^2_{\chi^2} = 2\nu$
Often, reduced chi-squared ($\chi^2/\nu$) is quoted. Mean: $\mu_{\chi^2_\nu} = 1$; Variance: $\sigma^2_{\chi^2_\nu} = 2/\nu$
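Computing $\chi^2$ and its reduced form for a hypothetical straight-line model with M = 2 parameters (the data, errors, and model below are invented):

```python
import numpy as np

x     = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y     = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # invented measurements
sigma = np.full_like(y, 0.2)                  # assumed measurement errors
model = 2.0 * x + 1.0                         # model y(x) with M = 2 parameters

chi2    = np.sum(((y - model) / sigma) ** 2)
nu      = y.size - 2                          # nu = N - M
reduced = chi2 / nu

print(chi2, nu, reduced)                      # ~2.75, 3, ~0.92
```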
Expressing Confidence
A fit with 10 d.o.f. would have a reduced chi-squared higher than 0.83 60% of the time. The fit is plausible for reduced chi-squared < 1.45; if ν = 100, reduced chi-squared < 1.15. [Figure from Bevington.]
Using Chi-squared
How do we decide whether the model describes the data? Rule of thumb: reduced chi-squared should be within 1-2 sigma of 1 for a valid model. Say I had reduced chi-squared = 2 for ν = 10. The statement you could make is: "A reduced chi-squared > 2 would occur 3% of the time." This suggests the data do not support the model. You need to decide how much you trust your error estimates in order to make this statement. See Wall and Jenkins Table A.2.6 for probability values.
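The 3% figure can be reproduced with SciPy's χ² survival function (a sketch assuming SciPy is available; the slide's Wall and Jenkins table gives the same probabilities):

```python
from scipy.stats import chi2

nu = 10
p = chi2.sf(2.0 * nu, nu)      # P(chi^2 > 20) for 10 degrees of freedom
print(round(p, 3))             # ~0.029, i.e. about 3% of the time
```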
Parameter and Error Estimation
To estimate the uncertainty in parameters, you can vary each parameter until χ² goes up by 2/ν. Beware of correlations between parameters!! Joint variation should be carried out to avoid underestimating the uncertainty.
Also, if you didn't know the errors in your data, you have no way of determining whether the model is valid. You can still use chi-squared to derive errors for your data, if you are certain of the model:
$\sigma^2 = \frac{\chi^2}{\nu}$
Statistics for Hypothesis Testing Hypothesis testing uses some metric to determine whether two data sets, or a data set and a model, are distinct. Typically, the problem is set up so that the hypothesis is that the data sets are consistent (the null hypothesis). A probability is calculated that the value found would be obtained again with another sample. Based on the required level of confidence, the hypothesis is rejected or accepted.
Are two data sets drawn from the same distribution?
The t statistic quantifies the likelihood that the means are the same. The F statistic quantifies the likelihood that the variances of the two data sets are the same.
Consider two data sets, x and y, with n and m data points:
$t = \frac{\bar{x} - \bar{y}}{s\sqrt{1/m + 1/n}}$
$F = \frac{\sum_i (x_i - \bar{x})^2 / (n-1)}{\sum_i (y_i - \bar{y})^2 / (m-1)}$
$s^2 = \frac{n S_x + m S_y}{n + m}$, where $S_x = \frac{\sum_i (x_i - \bar{x})^2}{n}$
Student's t Test
Calculate the t statistic; perfect agreement is t = 0. Evaluate the probability that t > value, for $\nu = m + n - 2$ degrees of freedom:
$t = \frac{\bar{x} - \bar{y}}{s\sqrt{1/m + 1/n}}, \quad s^2 = \frac{n S_x + m S_y}{n + m}$
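SciPy's two-sample t test uses the conventional pooled variance with $m + n - 2$ in the denominator (slightly different from the slide's $s^2$); a sketch with invented samples drawn from the same distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, 50)               # two samples drawn from the
b = rng.normal(0.0, 1.0, 60)               # same distribution

t, p = stats.ttest_ind(a, b)               # pooled-variance two-sample t test

# the same t by hand, with the nu = m + n - 2 pooled variance
m, n = a.size, b.size
s2 = ((m - 1) * a.var(ddof=1) + (n - 1) * b.var(ddof=1)) / (m + n - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(s2 * (1 / m + 1 / n))
print(t, t_manual, p)
```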
F Test
Calculate the F statistic:
$F = \frac{\sum_i (x_i - \bar{x})^2 / (n-1)}{\sum_i (y_i - \bar{y})^2 / (m-1)}$
Calculate the probability that F > value.
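A sketch of the F test using SciPy's F distribution (the samples are invented; this computes the one-sided tail probability):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(0.0, 1.0, 50)

F = x.var(ddof=1) / y.var(ddof=1)          # ratio of sample variances
p = f_dist.sf(F, x.size - 1, y.size - 1)   # one-sided P(F > value)
print(F, p)
```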
The Kolmogorov-Smirnov Test
Calculate the cumulative distribution function for your model, $C_{model}(x)$. Calculate the cumulative distribution function for your data, $C_{data}(x)$. Find the maximum of $|C_{model}(x) - C_{data}(x)|$. The variable x must be continuous to use the K-S test.
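The steps above can be sketched with SciPy's one-sample K-S test against a standard-normal model (the data are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, 500)           # invented continuous data

# D = max |C_data(x) - C_model(x)|, with the model a standard-normal CDF
D, p = stats.kstest(data, "norm")
print(D, p)
```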
K-S test example