Practical Statistics
Lecture 1 (Nov. 9):
- Correlation
- Hypothesis Testing
Lecture 2 (Nov. 16):
- Error Estimation
- Bayesian Analysis
- Rejecting Outliers
Lecture 3 (Nov. 18):
- Monte Carlo Modeling
- Bootstrap + Jack-knife
Lecture 4 (Nov. 30):
- Detection Effects
- Survival Analysis
Lecture 5 (Dec. 2):
- Fourier Techniques
- Filtering
- Unevenly Sampled Data
Good Reference: Hogg et al. 2010, http://arxiv.org/pdf/1008.4686v1
Review: Process of Decision Making
- Ask a Question
- Take Data
- Reduce Data
- Derive Statistics describing data
- Does the Statistic answer your question?
  - Yes: Publish!
  - No: Reflect on what is needed (Probability Distribution, Error Analysis, Hypothesis Testing, Simulation)
Review: The Binomial distribution
You are observing something that has a probability, p, of occurring in a single observation. You observe it N times and want the chance of obtaining n successes.
For one particular sequence of observations the probability is:
$$P_1(n) = p^n (1-p)^{N-n}$$
There are many sequences which yield n successes:
$$P(n) = \frac{N!}{n!\,(N-n)!}\, p^n (1-p)^{N-n} = \binom{N}{n} p^n (1-p)^{N-n}$$
The binomial coefficient $\binom{N}{n}$ is often said "N choose n".
- Mean: Np
- Variance: Np(1-p)
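As a quick check of the binomial formula and its mean and variance, here is a minimal Python sketch. The values N = 20 and p = 0.1 are illustrative assumptions, not numbers from the lecture, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(n, N, p):
    """Probability of exactly n successes in N trials with per-trial probability p."""
    return comb(N, n) * p**n * (1 - p) ** (N - n)

# Illustrative values (assumed for this sketch only).
N, p = 20, 0.1
pmf = [binomial_pmf(n, N, p) for n in range(N + 1)]

total = sum(pmf)                                   # probabilities should sum to 1
mean = sum(n * P for n, P in enumerate(pmf))       # compare with N*p
variance = sum((n - mean) ** 2 * P for n, P in enumerate(pmf))  # compare with N*p*(1-p)
print(total, mean, variance)
```

Summing over all n is only practical for small N, but it makes the "Mean Np, Variance Np(1-p)" statement easy to verify numerically.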
Review: Mean and Variance of Distributions
Distribution     Mean       Variance
Binomial         Np         Np(1-p)
Poisson          µ          µ
Gaussian         µ          σ^2
Uniform [a,b)    (a+b)/2    (b-a)^2/12
Review: Comparing a data set to a distribution
Suppose we have N data points and a model with M parameters that we think describes them. Our model:
$$y(x) = y(x, a_1, \ldots, a_M)$$
An intuitive metric is the distance of each data point from the model. Let's use the square of the difference between the data and the model:
$$LS = \sum_{i=1}^{N} \left( y_i - y(x_i, a_1, \ldots, a_M) \right)^2$$
Why is this a reasonable metric for determining the best fit to the data?
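To make the least-squares metric concrete, the sketch below fits a straight line (M = 2 parameters) by minimizing LS in closed form. The data points and the helper names `fit_line` and `least_squares` are assumptions made up for illustration. A straight line is only one possible choice of model.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing LS = sum (y_i - (a + b*x_i))^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def least_squares(xs, ys, model):
    """Sum of squared differences between the data and model(x)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly following y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

a, b = fit_line(xs, ys)
ls_best = least_squares(xs, ys, lambda x: a + b * x)
# Any other slope gives a larger LS than the best fit.
ls_other = least_squares(xs, ys, lambda x: a + (b + 0.1) * x)
print(a, b, ls_best, ls_other)
```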
Review: Chi-squared
The statistic chi-squared is defined as:
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_i^2}$$
Chi-squared is not a unique metric, but is commonly used:
- Mean: $\mu_{\chi^2} = \nu = N - M$
- Variance: $\sigma^2_{\chi^2} = 2\nu$
Often, reduced chi-squared is quoted:
- Mean: $\mu_{\mathrm{reduced}\,\chi^2} = 1$
- Variance: $\sigma^2_{\mathrm{reduced}\,\chi^2} = 2/\nu$
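A minimal sketch of computing chi-squared and reduced chi-squared (chi-squared divided by ν = N - M) follows. The data, the equal error bars of 0.2, and the function names are assumptions for illustration. The line y = 1.04 + 1.99x used as the model is the least-squares best fit to these particular points, so M = 2.

```python
def chi_squared(ys, model_ys, sigmas):
    """Sum of squared, error-normalized residuals."""
    return sum(((y - m) / s) ** 2 for y, m, s in zip(ys, model_ys, sigmas))

def reduced_chi_squared(ys, model_ys, sigmas, n_params):
    """Chi-squared divided by the degrees of freedom, nu = N - M."""
    nu = len(ys) - n_params
    return chi_squared(ys, model_ys, sigmas) / nu

# Hypothetical data with assumed, equal measurement errors.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
sigmas = [0.2] * len(ys)

# Best-fit straight line for these points (two fitted parameters).
model_ys = [1.04 + 1.99 * x for x in xs]

chi2 = chi_squared(ys, model_ys, sigmas)
chi2_red = reduced_chi_squared(ys, model_ys, sigmas, n_params=2)
print(chi2, chi2_red)
```

A reduced chi-squared near 1 is what the mean listed above leads one to expect when the model and the error bars are both reasonable. With only ν = 3 degrees of freedom the expected scatter about 1 is large.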
HW 2, Problem 2
10% of G-type stars have detectable RV. How many stars should I observe to determine whether M-type stars are similar?
Exam 1: Problem 2
A detector has 12000 digital units (DU) of measured flux, and 3 DU of measured RMS noise at this level. How many photons does this correspond to?
At the no-light level, we measure 1 DU of RMS noise. How much noise does this add?
Correlation
Often the first approach to analyzing data is to look for correlations in various parameters.
- May or may not be physically motivated.
- Understand experimental effects first (be skeptical).
- Be careful of subclusters of points.
- Correlation is not (necessarily) causation (remain skeptical).
A mass-separation correlation?
Are people born early in the year better hockey players?
See Malcolm Gladwell's book Outliers.
Correlation coefficient
The correlation coefficient for two parameters, x and y, is defined as the covariance between the parameters divided by the scatter in the distribution of each parameter:
$$\rho = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$
The correlation coefficient can be estimated directly from the data:
$$r = \frac{\sum_i (X_i - \langle X \rangle)(Y_i - \langle Y \rangle)}{\sqrt{\sum_i (X_i - \langle X \rangle)^2 \sum_i (Y_i - \langle Y \rangle)^2}}$$
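The estimator r can be computed directly from paired data. The sketch below is one straightforward way to do it. The sample values and the name `pearson_r` are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(xs, ys)
print(r)
```

Points lying exactly on a rising straight line give r = 1, and on a falling line r = -1.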
Probability of correlation
For a bivariate Gaussian distribution, Bayes' theorem can be used to estimate the probability of correlation:
$$\mathrm{prob}(\rho \mid \mathrm{data}) \propto \frac{(1-\rho^2)^{(N-1)/2}}{(1-\rho r)^{N-3/2}} \left( 1 + \frac{1}{N - 1/2}\,\frac{1+\rho r}{8} + \ldots \right)$$
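One way to use this expression is to evaluate it on a grid of ρ values and normalize numerically. The sketch below does this, keeping only the first correction term of the series. The observed r = 0.5, N = 20, the grid spacing, and the function name are all assumptions for illustration.

```python
def correlation_posterior(rho, r, N):
    """Unnormalized prob(rho | data), truncated after the first correction term:
    (1 - rho^2)^((N-1)/2) / (1 - rho*r)^(N - 3/2) * (1 + (1 + rho*r) / (8*(N - 1/2)))
    """
    leading = (1 - rho**2) ** ((N - 1) / 2) / (1 - rho * r) ** (N - 1.5)
    correction = 1 + (1 + rho * r) / (8 * (N - 0.5))
    return leading * correction

# Assumed example: r = 0.5 measured from N = 20 points.
r_obs, n_points = 0.5, 20

# Grid over -1 < rho < 1 (the endpoints are excluded).
step = 0.001
rhos = [-0.999 + i * step for i in range(1999)]
weights = [correlation_posterior(rho, r_obs, n_points) for rho in rhos]

# Normalize with a simple Riemann sum so the posterior integrates to 1.
norm = sum(weights) * step
posterior = [w / norm for w in weights]

rho_peak = rhos[weights.index(max(weights))]
prob_positive = sum(p for rho, p in zip(rhos, posterior) if rho > 0) * step
print(rho_peak, prob_positive)
```

The integral of the normalized curve over ρ > 0 is then the probability that the two parameters are positively correlated, under the bivariate Gaussian assumption.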
What if we see a correlation?
It's common (but dangerous!) to just fit a line to the data.
Anscombe's quartet illustrates the potential pitfalls of line fitting.
Principal Component Analysis
If we have N objects and n measured variables ($x_1, \ldots, x_n$) for each object, then:
- We want a minimum number of variables that are independent.
- These variables will be linear combinations of the observed variables:
$$\xi_i = \sum_{j=1}^{n} a_{ij} x_j$$
The goal is to define the new variables to minimize the residual variance in the data.
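For the special case of n = 2 measured variables, the first principal component can be written in closed form from the sample covariances. The sketch below is limited to that case and is not a general PCA implementation. The data and the name `first_principal_axis` are assumptions for illustration.

```python
from math import atan2, cos, sin

def first_principal_axis(xs, ys):
    """Direction of maximum variance for two variables, plus the variance
    along that direction and the residual variance perpendicular to it."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    c, s = cos(theta), sin(theta)
    var_along = sxx * c * c + 2 * sxy * s * c + syy * s * s
    var_across = sxx * s * s - 2 * sxy * s * c + syy * c * c
    return (c, s), var_along, var_across

# Hypothetical measurements of two correlated variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
direction, var_along, var_across = first_principal_axis(xs, ys)
print(direction, var_along, var_across)
```

Here the coefficients (c, s) play the role of the a_ij for the first new variable, and `var_across` is the residual variance left if only that one variable is kept.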
Geometrical view of PCA
Iterative approach of finding the component with maximum variance.
PCA manipulation
Statistics for Hypothesis Testing
Hypothesis testing uses some metric to determine whether two data sets, or a data set and a model, are distinct.
Typically, the problem is set up so that the hypothesis is that the data sets are consistent (the null hypothesis).
A probability is calculated that a value of the metric at least as extreme as the one found would be obtained with another sample if the null hypothesis were true.
Based on the required level of confidence, the hypothesis is rejected or accepted.
Parametric Tests
Often, the most intuitive way to understand our data is to choose the parameter of interest (say, the mean) and compare it to a model.
Alternatively, we might be comparing two data sets by asking whether the differences in a statistic are meaningful.
These general tests are called parametric tests:
- They can use frequentist approaches to accept or reject the hypothesis.
- They can use Bayesian approaches to calculate probabilities of different results.
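Are two data sets drawn from the same distribution?
- The t statistic quantifies the likelihood that the means are the same.
- The F statistic quantifies the likelihood that the variances of two data sets are the same.
Consider two data sets, x and y, with n and m data points, respectively:
$$t = \frac{\bar{x} - \bar{y}}{s \sqrt{1/m + 1/n}}$$
$$s^2 = \frac{n S_x + m S_y}{n + m - 2}, \qquad S_x = \frac{\sum_i (x_i - \bar{x})^2}{n}, \qquad S_y = \frac{\sum_i (y_i - \bar{y})^2}{m}$$
$$F = \frac{\sum_i (x_i - \bar{x})^2 / (n-1)}{\sum_i (y_i - \bar{y})^2 / (m-1)}$$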
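Student's t test
- Calculate the t statistic. A perfect agreement is t = 0.
- Evaluate the probability for t > value, using the t distribution with ν degrees of freedom.
$$\nu = m + n - 2$$
$$t = \frac{\bar{x} - \bar{y}}{s \sqrt{1/m + 1/n}}, \qquad s^2 = \frac{n S_x + m S_y}{n + m - 2}$$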
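A minimal sketch of the t statistic for two samples is given below. It uses the conventional pooled variance, in which the summed squared deviations of both samples are divided by ν = n + m - 2. The sample values and the name `t_statistic` are assumptions for illustration. Converting t to a probability needs the Student's t cumulative distribution, which is not computed here.

```python
from math import sqrt

def t_statistic(xs, ys):
    """Return (t, nu) for two samples, using the pooled variance."""
    n, m = len(xs), len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / m
    ss_x = sum((x - mean_x) ** 2 for x in xs)   # equals n * S_x
    ss_y = sum((y - mean_y) ** 2 for y in ys)   # equals m * S_y
    nu = n + m - 2
    s = sqrt((ss_x + ss_y) / nu)                # pooled standard deviation
    t = (mean_x - mean_y) / (s * sqrt(1 / m + 1 / n))
    return t, nu

# Hypothetical samples whose means differ by 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 4.0, 5.0, 6.0]
t, nu = t_statistic(xs, ys)
print(t, nu)
```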
F test
- Calculate the F statistic:
$$F = \frac{\sum_i (x_i - \bar{x})^2 / (n-1)}{\sum_i (y_i - \bar{y})^2 / (m-1)}$$
- Calculate the probability that F > value.
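The F statistic is the ratio of the two sample variances, as in this short sketch. The data and the name `f_statistic` are assumptions for illustration. The probability that F exceeds the observed value comes from the F distribution with (n - 1, m - 1) degrees of freedom, which is not evaluated here.

```python
def f_statistic(xs, ys):
    """Return (F, dof_x, dof_y): ratio of the sample variances of xs and ys."""
    n, m = len(xs), len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / m
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (m - 1)
    return var_x / var_y, n - 1, m - 1

# Hypothetical samples: ys has twice the spread of xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
F, dof_x, dof_y = f_statistic(xs, ys)
print(F, dof_x, dof_y)
```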
Non-Parametric Tests
If we don't know the underlying distribution, or have small-number statistics, there are still tests that can be used to accept or reject a hypothesis.
Non-parametric tests still make some assumption about the data. Usually this is something related to the data following counting statistics, or the binomial distribution (randomness assumed, in the appropriate form).
Chi-squared test
The chi-squared statistic can be used to compare any model to a data set:
$$\chi^2 = \sum_{i=1}^{N} \frac{(E_i - O_i)^2}{E_i}$$
- Assumes variation in data is due to counting statistics.
- Data must be binned so that E_i is reasonable for the model.
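For binned counts, the statistic is a direct sum over bins. The sketch below uses a made-up example of 60 die rolls compared with a fair-die model. The counts and the name `chi_squared_binned` are assumptions for illustration.

```python
def chi_squared_binned(observed, expected):
    """Sum over bins of (E_i - O_i)^2 / E_i."""
    return sum((e - o) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for each face of a die.
observed = [8, 12, 9, 11, 10, 10]
n_rolls = sum(observed)

# Model: a fair die, so every face is expected equally often.
expected = [n_rolls / 6] * 6

chi2 = chi_squared_binned(observed, expected)
# The total count is fixed by the data, which removes one degree of freedom.
dof = len(observed) - 1
print(chi2, dof)
```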
The Kolmogorov-Smirnov Test
- Calculate the cumulative distribution function for your model, $C_{\mathrm{model}}(x)$.
- Calculate the cumulative distribution function for your data, $C_{\mathrm{data}}(x)$.
- Find the maximum of $|C_{\mathrm{model}}(x) - C_{\mathrm{data}}(x)|$.
The variable, x, must be continuous to use the K-S test.
You don't need to bin the data.
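The steps above can be sketched in a few lines. The data CDF is a staircase, so the largest gap is checked just below and just above each data point. The conversion to a probability uses Kolmogorov's asymptotic series, which is an addition not stated on the slide and is only approximate for small samples. The data, the uniform model, and the function names are assumptions for illustration.

```python
from math import exp, sqrt

def ks_statistic(data, model_cdf):
    """Maximum absolute difference between the data CDF and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = model_cdf(x)
        # The data CDF steps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, abs(c - i / n), abs((i + 1) / n - c))
    return d

def ks_probability(d, n, terms=100):
    """Asymptotic probability of a K-S distance larger than d arising by chance."""
    lam = sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here, and the value is 1 to high accuracy
    total = sum((-1) ** (j - 1) * exp(-2 * j * j * lam * lam) for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))

# Hypothetical data compared with a uniform model on [0, 1).
data = [0.1, 0.35, 0.5, 0.8]
D = ks_statistic(data, lambda x: x)
prob = ks_probability(D, len(data))
print(D, prob)
```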
K-S test example
Assignment: Test your toolbox
- Download Matlab (or use another tool for this).
- Download the plot data set at:
  - http://zero.as.arizona.edu/ast518
- Familiarize yourself with plotting data, error bars, etc.
(This data set will be the basis of HW 7.)
Matlab download
Go to: http://sitelicense.arizona.edu/matlab
- Follow instructions to download and install.
- Make sure to use your you@email.arizona.edu address for MathWorks registration.