Data Mining In Modern Astronomy Sky Surveys: Hypothesis Testing, Bayes Theorem, & Parameter Estimation Ching-Wa Yip cwyip@pha.jhu.edu; Bloomberg 518
Erratum of Last Lecture The Central Limit Theorem was proved by Bernoulli back in 17 th century. The Michelson-Morley speed-of-light experiment was carried out in 18 th century. Michelson & Morley could have accessed to the Central Limit Theorem and decided to carry out many, repeated measurements. (Need more research)
From Data to Information We don t just want data. We want information from the data. Information Database Sensors Data Analysis or Data Mining
From Data to Information We don t just want data. We want information from the data. Information Database Sensors Data Analysis or Data Mining
Probability Density Function (PDF) A function which tells the probability of an event, e.g.: Temperature T lies between 34F and 50F. Variable X lies between X 1 and X 2. A well-known PDF is the Standard Normal Distribution:
Number/Total Number PDFs Come in Many Shapes E.g., Color of galaxies Color (Redder ) (Yip, Connolly, Szalay, et al. 2004)
Properties of PDFs The total area under the function is 1 (= 100% probability): p x dx = 1 The probability of the variable x lying between a and b is: P x beween a and b b = p x dx a
2D Probability Density Function (PDF) A function which tells the probability of an event, just like the 1D PDF but with 2 variables: the variables X lies between X 1 and X 2 AND Y between Y 1 and Y 2 The total area under the curve is still 1. p x, y dxdy = 1
Example 2D PDF Seeing disk of stellar image. p(x, Y) Y X
Standard Normal Distribution Revisit: Some Terminologies Significance Level = Area at the tail of the distribution = = 2.5% = 0.025 Critical Values = Z = The value corresponds to a Significance Level
(Abridged, Taken from Dekking et al.)
If: = 0.025 (tabulated as 0250) We get: Z = 1.96 (Abridged, Taken from Dekking et al.)
Significance Level ( ) and its Critical No more tables! Values (Z)
Hypothesis Testing Goal: To test whether a hypothesis is true or false. The beginning hypothesis is our best knowledge for the problem (also called the Null Hypothesis). If Null Hypothesis is FALSE, Alternative Hypothesis is TRUE.
Hypothesis Testing Suppose we think the average of a sample is some number. We want to test this hypothesis. Null hypothesis: some # Alternative hypothesis: some # +
Hypothesis Testing: (Step 1) Set Significance Levels It is the area at the tail(s). +
Hypothesis Testing: (Step 2) Find Critical Values Use: - Look-up tables - Computational software +
Hypothesis Testing: (Step 3) Calculate Statistics From Data Also called Test Statistics. T.S. +
Hypothesis Testing: (Step 3) Calculate Statistics From Data T.S. + Also called Test Statistics. Test Statistics is usually mean-subtracted. Therefore, we can use mean=0 Normal Distribution to carry on the hypothesis testing.
Hypothesis Testing: (Step 4) Draw Conclusion Fail to reject the (null) hypothesis.
Hypothesis Testing: (Step 4) Draw Conclusion Reject the (null) hypothesis
Hypothesis Testing: (Step 4) Draw Conclusion Reject the (null) hypothesis
Hypothesis Testing Example A galaxy formation theory predicted that the average radius of galaxies in the nearby universe is 35 kpc (1pc = 1parsec = 10 16 m). A random sample of 225 galaxies has a mean radius x = 30 kpc, and the S.D. of radius = 20 kpc. Task: Set up an hypothesis test at 5% significance level. Null hypothesis: 35 kpc Alt. hypothesis: 35 kpc Use two-tailed test. (M. Calvo)
Details: Calculate Sampling Distribution of the mean: μ x = by Central Limit Thm = μ(theory) = 35 σ x = by Central Limit Thm = S.D.(theory) ~ S.D. = 20 = 4 n n 15 3 Calculate Test Statistics from data: Z = x μ x σ x = 30 35 4/3 = 3.75
n = 225 30; Almost Normal -3.75 ( = 35 kpc from theory)
n = 225 30; Almost Normal Conclusion: Reject the galaxy formation theory. Accept 35 kpc at 5% significance level. -3.75 ( = 35 kpc from theory)
Meaning of the Confidence Level It is the chance we will make Type I error. Type I error is when we reject Null Hypothesis which is actually true: There is 5% chance we are wrong by rejecting the theory (that average galaxy size being 35 kpc).
Hypothesis Testing Example: 28 Sources in IceCube (2013)
Null Hypothesis: Sources are uniformly distributed on the sky. Alternative Hypothesis: Sources are originated from the Milky Way center.
Conditional Probability: Discrete Drawing a random card from a stack of 52 playing cards, what is the probability of a card being a Jack given it is a Face card (J, Q, K)? P Jack Face = Number of Jack s Number of Face s = 4 12 = 1 3
Thinking Like Bayesian There will be a bicycle race tomorrow, what is the probability a particular athlete will win? Problem: The event only occurs once, we do not have a sample to calculate the probability of a particular athlete winning. Solution: Bayesian Statistics.
Bayes Theorem (1763) P A B = P B A P(A) P(B) Posterior = Probability of A after B occurs. Probability of A before B occurs. Likelihood Prior Normalization Factor (Thomas Bayes, 1701-1761)
Conditional Probability: Discrete Drawing a random card from a stack of 52 playing cards, what is the probability of a card being a Jack given it is a Face card (J, Q, K)? When I know nothing beforehand: P Jack = 4 52 If my friend told me it is a Face card (Added Evidence): P Jack Face = P Face Jack P(Jack) P(Face) = 1 4 52 12 52 = 1 3
Main Idea behind Bayes Theorem When we add a new piece of evidence, we change our outlook on the probability of an event.
Frequentist vs. Bayesian Frequentist: Bayesians assume a prior probability. Bayesian: Frequentist cannot assign a probability to a single event.
Parameter Estimation: The Problem Estimate parameter from data given model. 1 modeled parameter: θ Multi modeled parameters: θ = (θ 1, θ 2, θ 3,, θ N )
Quality of an Estimator: Bias and Variance Usually, the best estimator is somewhere in-between.
Least-Square Fitting (LSQ) A popular way to find linear model that bestfit the data. A linear model is a model in which there is no square, cube,, and higher order power terms in the variables. Example: straight lines Y = Slope * X + Constant Slope and Constant are the parameters in the model. A Parameter Estimation problem.
Least-Square Fitting (LSQ): Linear model Y = Slope * X + Constant X: Independent Variable, Input Variable, etc. Y: Dependent Variable, Output Variable, etc.
Example of LSQ Fitting in R
Goodness of Fit A goodness of fit measures how well the model fit the data. E.g., Chi-sq (the sum of square of the difference over all of the N data points) x 2 = N i=1 D M 2 σ 2
Graphical Meaning of Chi-sq Y Model Fit Data Points X
Graphical Meaning of Chi-sq Y Model Fit Data Points Chi-sq measures how far away the points are from the model. X
Graphical Meaning of Chi-sq Y M D Model Fit Data Points Chi-sq measures how far away the points are from the model. X
(Astronomy) Data are Imperfect Random Error Photon counts follow Poisson distribution Random error for Poisson = photon count Systematic Error Bad CCD pixels Cosmic Rays Sky Emissions Etc.
Sky Emission (or Skylines) Average Sky Emissions From SDSS DR6 (Yip)
Image Stacking: Central Limit Theorem Revisit Fruchter & Hook (2002): Stack images in order to remove cosmic ray (= systematic error in pixel flux)
Similarly, Spectra Stacking (Yip, Connolly et al. 2004)
whereas individual spectrum is noisy. (SDSS Data Release 6)
Outliers: Using Median vs. Mean When there are outliers, the median could be more robust than the mean as a measure for the average. Outliers are difficult to define, because we need to know the average distribution as well (Next Lecture: Unsupervised Machine Learning). Subfield of study: Robust Statistics.
Median as the Robust Average
Homework 2014 Jan 14 (due Monday noon, Jan 20) The data file (saved in the course website as hubbletable1.csv ) contains the Object Name, Distance (in 10 6 pc = Mpc), and Recession Velocity (in km/s) from Hubble s 1929 work. 1) Find the Hubble s Constant by using Least Square Fitting. 2) Plot the Velocity vs. Distance; and the best-fit model. 3) The Hubble s constant from WMAP survey is determined to be 71 km/s/mpc. Comment on the comparison between the calculated and the WMAP values. Read the article on Bayes theorem. A CCD records a signal of 100 photons. What is the signal-to-noise ratio (SNR)? If the human eyes can discern features with 100% certainty in an image which has SNR 5 (*), what is the minimum number of photons we need for 100% certainty? (*) This is a simplified version of the Rose Criterion, 1948. Hints: Use read.csv() in R to read Comma Separated Values. To extract a column from the data, use x$column. For example, x$distance_mpc.