Practical Statistics

Lecture 1 (Nov. 9):
- Correlation
- Hypothesis Testing
Lecture 2 (Nov. 16):
- Error Estimation
- Bayesian Analysis
- Rejecting Outliers
Lecture 3 (Nov. 18):
- Monte Carlo Modeling
- Bootstrap + Jack-knife
Lecture 4 (Nov. 30):
- Detection Effects
- Survival Analysis
Lecture 5 (Dec. 2):
- Fourier Techniques
- Filtering
- Unevenly Sampled Data
Good reference: Hogg et al. 2010, http://arxiv.org/pdf/1008.4686v1

Review: Process of Decision Making
- Ask a question
- Take data
- Reduce data
- Derive statistics describing the data
- Does the statistic answer your question?
  - No: reflect on what is needed (probability distribution, error analysis, hypothesis testing, simulation)
  - Yes: publish!

Review: The Binomial distribution
You are observing something that has a probability, p, of occurring in a single observation. You observe it N times and want the chance of obtaining n successes.
For one particular sequence of observations the probability is:
$$P_1(n) = p^n (1-p)^{N-n}$$
There are many sequences which yield n successes:
$$P(n) = \frac{N!}{n!\,(N-n)!}\, p^n (1-p)^{N-n} = \binom{N}{n} p^n (1-p)^{N-n}$$
The binomial coefficient is often said "N choose n".
Mean: Np
Variance: Np(1-p)
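
As a quick numerical check of the formulas above, the following minimal sketch evaluates P(n) with the standard library and recovers the mean and variance by direct summation. The values N = 20 and p = 0.1 are illustrative assumptions, not numbers from the lecture.

```python
from math import comb

def binomial_pmf(n, N, p):
    """Probability of exactly n successes in N independent trials."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

# Illustrative values (assumed, not from the lecture).
N, p = 20, 0.1

pmf = [binomial_pmf(n, N, p) for n in range(N + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(n * P for n, P in enumerate(pmf))
variance = sum((n - mean) ** 2 * P for n, P in enumerate(pmf))

print(mean, N * p)                # both close to 2.0
print(variance, N * p * (1 - p))  # both close to 1.8
```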

Review: Mean and Variance of Distributions
Distribution      Mean       Variance
Binomial          Np         Np(1-p)
Poisson           µ          µ
Gaussian          µ          σ^2
Uniform [a,b)     (a+b)/2    (b-a)^2/12

Review: Comparing a data set to a distribution
Suppose we have N data points and a model with M parameters that we think describes them. Our model:
$$y(x) = y(x, a_1 \ldots a_M)$$
An intuitive metric is the distance of each data point from the model. Let's use the square of the difference between the data and the model:
$$LS = \sum_{i=1}^{N} \left(y_i - y(x_i, a_1 \ldots a_M)\right)^2$$
Why is this a reasonable metric for determining the best fit to the data?
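
The LS metric can be sketched in a few lines. The straight-line model, the function names, and the data values here are all made up for illustration. The point is only that a parameter set closer to the data gives a smaller sum of squared differences.

```python
def sum_squared_residuals(xs, ys, model, params):
    """LS = sum over data points of (y_i - y(x_i, a_1 ... a_M))^2."""
    return sum((y - model(x, *params)) ** 2 for x, y in zip(xs, ys))

def line(x, a1, a2):
    """Hypothetical two-parameter model: y = a1 + a2 * x."""
    return a1 + a2 * x

# Made-up data scattered around y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

ls_good = sum_squared_residuals(xs, ys, line, (1.0, 2.0))
ls_bad = sum_squared_residuals(xs, ys, line, (0.0, 1.0))

print(ls_good, ls_bad)  # the better parameters give the smaller LS
```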

Review: Chi-squared
The statistic chi-squared is defined as:
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma^2}$$
Chi-squared is not a unique metric, but is commonly used:
Mean: $\mu_{\chi^2} = \nu = N - M$
Variance: $\sigma^2_{\chi^2} = 2\nu$
Often, reduced chi-squared is quoted:
Mean: $\mu_{\mathrm{reduced}\,\chi^2} = 1$
Variance: $\sigma^2_{\mathrm{reduced}\,\chi^2} = 2/\nu$
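
A minimal sketch of the calculation, assuming a single known measurement uncertainty sigma for every point and taking reduced chi-squared to be chi-squared divided by ν. The data, the model values, and sigma are invented for illustration.

```python
from math import sqrt

def chi_squared(ys, model_ys, sigma):
    """Sum of squared residuals, each scaled by the measurement uncertainty."""
    return sum(((y - m) / sigma) ** 2 for y, m in zip(ys, model_ys))

# Made-up data and a hypothetical two-parameter line y = 1 + 2x at x = 0..4.
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
model_ys = [1.0, 3.0, 5.0, 7.0, 9.0]
sigma = 0.2      # assumed uncertainty on each y value
n_params = 2     # M, the number of fitted parameters

chi2 = chi_squared(ys, model_ys, sigma)
nu = len(ys) - n_params            # degrees of freedom, N - M
reduced_chi2 = chi2 / nu
expected_scatter = sqrt(2.0 / nu)  # standard deviation of reduced chi-squared

print(chi2, nu, reduced_chi2, expected_scatter)
```

A reduced chi-squared within roughly one expected scatter of 1 is consistent with the model and the assumed uncertainty.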

HW 2, Problem 2
10% of G-type stars have detectable RV. How many stars should I observe to determine whether M-type stars are similar?

Exam 1: Problem 2
Detector has 12000 digital units (DU) of measured flux, and 3 DU measured RMS noise at this level. How many photons does this correspond to?
At no-light level, we measure 1 DU of RMS noise. How much noise does this add?

Correlation
Often the first approach to analyzing data is to look for correlations in various parameters.
- May or may not be physically motivated.
- Understand experimental effects first (be skeptical).
- Be careful of subclusters of points.
- Correlation is not (necessarily) causation (remain skeptical).

A mass-separation correlation?

Are people born early in the year better hockey players?
See the book Outliers by Malcolm Gladwell.

Correlation coefficient
The correlation coefficient for two parameters, x and y, is defined as the covariance between the parameters divided by the scatter in the distribution of each parameter:
$$\rho = \frac{\mathrm{covariance}(x, y)}{\sigma_x \sigma_y}$$
The correlation coefficient can be estimated directly from the data:
$$r = \frac{\sum_i (X_i - \langle X \rangle)(Y_i - \langle Y \rangle)}{\sqrt{\sum_i (X_i - \langle X \rangle)^2 \sum_i (Y_i - \langle Y \rangle)^2}}$$
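
The estimator r translates directly into code. This is a minimal standard-library sketch; the function name and the sample values are illustrative assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up examples: a perfect linear relation and its mirror image.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0])
r_down = pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0])

print(r_up, r_down)  # close to +1 and -1
```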

Probability of correlation
For a bivariate Gaussian distribution, Bayes' theorem can be used to estimate the probability of correlation:
$$\mathrm{prob}(\rho \mid \mathrm{data}) \propto \frac{(1-\rho^2)^{(N-1)/2}}{(1-\rho r)^{N-3/2}} \left(1 + \frac{1}{N-1/2}\,\frac{1+\rho r}{8} + \ldots\right)$$
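
One way to use this expression is to evaluate it on a grid of ρ values and normalize numerically. The sketch below keeps only the leading factor and drops the small series correction in the parentheses, which is an assumption made for brevity. The values r = 0.5 and N = 50 are illustrative.

```python
from math import exp, log

def log_posterior_rho(rho, r, N):
    """Leading-order log of prob(rho | data), up to an additive constant."""
    return 0.5 * (N - 1) * log(1 - rho**2) - (N - 1.5) * log(1 - rho * r)

# Illustrative measured correlation and sample size (assumed values).
r, N = 0.5, 50

# Grid over -0.999 <= rho <= 0.999, avoiding the endpoints at +/-1.
step = 0.001
grid = [-0.999 + step * k for k in range(1999)]

unnormalized = [exp(log_posterior_rho(rho, r, N)) for rho in grid]
norm = sum(unnormalized) * step
posterior = [u / norm for u in unnormalized]

# Most probable rho, and the probability that the correlation is positive.
mode = grid[max(range(len(grid)), key=posterior.__getitem__)]
prob_positive = sum(p for rho, p in zip(grid, posterior) if rho > 0) * step

print(mode, prob_positive)
```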

What if we see a correlation?
It's common (but dangerous!) to just fit a line to the data.
Anscombe's quartet illustrates the potential pitfalls of line fitting.
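
The pitfall can be seen numerically. The sketch below fits an ordinary least-squares line to each of the four data sets of Anscombe's quartet, using the commonly tabulated values. The helper `fit_line` is a hypothetical name introduced here. All four sets give nearly the same slope, intercept, and r, even though plots of them look completely different.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (slope, intercept, r) of an ordinary least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept, r) in results.items():
    print(f"{name}: slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

The summary numbers alone cannot distinguish a genuine linear trend from a curve, a single outlier, or one high-leverage point, so the data should always be plotted as well.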

Principal Component Analysis
If we have N objects and n measured variables ($x_j$, j = 1...n) for each object, then:
- We want a minimum number of variables that are independent.
- These variables will be linear combinations of the observed variables:
$$\xi_i = \sum_{j=1}^{n} a_{ij} x_j$$
The goal is to define the new variables to minimize the residual variance in the data.
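
A common way to obtain such linear combinations is an eigen-decomposition of the covariance matrix. The sketch below takes that route with NumPy on synthetic data, and it is not necessarily the procedure used in the lecture. The data set (three variables, two of them strongly correlated) and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 objects, 3 measured variables, two strongly correlated.
n_objects = 500
t = rng.normal(size=n_objects)
X = np.column_stack([
    t + 0.1 * rng.normal(size=n_objects),
    2 * t + 0.1 * rng.normal(size=n_objects),
    rng.normal(size=n_objects),
])

# Center the data and form the covariance matrix of the variables.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix give the coefficients a_ij.
# eigh returns eigenvalues in ascending order, so reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# New variables: each column is one linear combination of the originals.
components = Xc @ eigvecs
explained = eigvals / eigvals.sum()

print(explained)  # fraction of the total variance carried by each component
```

Here the first component carries most of the variance, so the two correlated variables are effectively replaced by one.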

Geometrical view of PCA
Iterative approach of finding the component with maximum variance.

PCA manipulation

Statistics for Hypothesis Testing
Hypothesis testing uses some metric to determine whether two data sets, or a data set and a model, are distinct.
Typically, the problem is set up so that the hypothesis is that the data sets are consistent (the null hypothesis).
A probability is calculated that a value at least as extreme as the one found would be obtained with another sample if the null hypothesis were true.
Based on the required level of confidence, the hypothesis is rejected or accepted.

Parametric Tests
Often, the most intuitive way to understand our data is to choose the parameter of interest (say the mean) and compare it to a model. Alternatively, we might be comparing two data sets by asking whether the differences in a statistic are meaningful.
These general tests are called parametric tests.
- They can use frequentist approaches to accept or reject the hypothesis.
- They can use Bayesian approaches to calculate probabilities of different results.

Are two data sets drawn from the same distribution?
The t statistic quantifies the likelihood that the means are the same.
The F statistic quantifies the likelihood that the variances of two data sets are the same.
Consider two data sets, x and y, with n and m data points:
$$t = \frac{\bar{x} - \bar{y}}{s\sqrt{1/m + 1/n}}$$
$$F = \frac{\sum_i (x_i - \bar{x})^2/(n-1)}{\sum_i (y_i - \bar{y})^2/(m-1)}$$
$$s^2 = \frac{n S_x + m S_y}{n + m}, \qquad S_x = \frac{\sum_i (x_i - \bar{x})^2}{n}$$
with $S_y$ defined likewise from the m values of y.

Student's t test
Calculate the t statistic. A perfect agreement is t = 0.
$$t = \frac{\bar{x} - \bar{y}}{s\sqrt{1/m + 1/n}}, \qquad s^2 = \frac{n S_x + m S_y}{n + m}$$
Evaluate the probability for t > value, using $\nu = m + n - 2$ degrees of freedom.
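
The t statistic as written on this slide can be sketched with the standard library. The function name and the two samples are made up. Note that many textbooks and libraries divide the pooled variance by n + m - 2 rather than n + m, which gives a slightly different t for small samples. Turning t into a probability needs the Student's t distribution with ν degrees of freedom, which is not in the Python standard library, so that step is left out here.

```python
from math import sqrt

def t_statistic(xs, ys):
    """Return (t, nu) using the slide's pooled scatter s^2 = (n*S_x + m*S_y)/(n + m)."""
    n, m = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / m
    S_x = sum((x - mean_x) ** 2 for x in xs) / n
    S_y = sum((y - mean_y) ** 2 for y in ys) / m
    s = sqrt((n * S_x + m * S_y) / (n + m))
    t = (mean_x - mean_y) / (s * sqrt(1 / m + 1 / n))
    return t, n + m - 2

# Made-up samples with visibly different means.
xs = [5.1, 4.9, 5.3, 5.0, 5.2]
ys = [5.6, 5.8, 5.5, 5.9, 5.7, 5.6]

t, nu = t_statistic(xs, ys)
print(t, nu)  # large |t| suggests the means differ
```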

F test
Calculate the F statistic:
$$F = \frac{\sum_i (x_i - \bar{x})^2/(n-1)}{\sum_i (y_i - \bar{y})^2/(m-1)}$$
Calculate the probability that F > value.
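
A minimal sketch of the F statistic as the ratio of the two sample variances. The function name and data are illustrative assumptions. The probability that F exceeds the computed value comes from the F distribution with (n-1, m-1) degrees of freedom, which needs a statistics library and is not computed here.

```python
def f_statistic(xs, ys):
    """Return (F, n-1, m-1): the ratio of sample variances and its degrees of freedom."""
    n, m = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / m
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (m - 1)
    return var_x / var_y, n - 1, m - 1

# Made-up samples: ys has twice the spread of xs, so F should be 1/4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

F, dof_x, dof_y = f_statistic(xs, ys)
print(F, dof_x, dof_y)
```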

Non-Parametric Tests
If we don't know the underlying distribution, or have small number statistics, there are still tests that can be used to accept or reject a hypothesis.
Non-parametric tests still make some assumption about the data. Usually this is something related to the data following counting statistics, or the binomial distribution (randomness assumed, in the appropriate form).

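Chi-squared test
The chi-squared statistic can be used to compare any model to a data set:
$$\chi^2 = \sum_{i=1}^{N} \frac{(E_i - O_i)^2}{E_i}$$
where $O_i$ is the observed count in bin i and $E_i$ is the count expected from the model.
- Assumes variation in the data is due to counting statistics.
- Data must be binned so that $E_i$ is reasonable for the model.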
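
For binned counts the statistic is a short sum. The observed counts and the flat model below are invented for illustration. The degrees of freedom are taken as the number of bins minus one, on the assumption that the model is normalized to the total number of counts and has no other fitted parameters.

```python
def chi_squared_counts(observed, expected):
    """Chi-squared for binned counts: sum of (E_i - O_i)^2 / E_i."""
    return sum((e - o) ** 2 / e for o, e in zip(observed, expected))

# Made-up example: 100 events in 4 bins, compared with a flat model.
observed = [18, 22, 29, 31]
expected = [25.0, 25.0, 25.0, 25.0]

chi2 = chi_squared_counts(observed, expected)
dof = len(observed) - 1  # model normalized to the total counts

print(chi2, dof)
```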

The Kolmogorov-Smirnov Test
- Calculate the cumulative distribution function for your model, C_model(x).
- Calculate the cumulative distribution function for your data, C_data(x).
- Find the maximum of |C_model(x) - C_data(x)|.
The variables, x, must be continuous to use the K-S test.
Don't need to bin the data.
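
The steps above can be sketched with the standard library. The function name `ks_statistic`, the sample values, and the choice of a standard normal model are assumptions for illustration. Because the empirical CDF is a staircase, the largest difference is checked on both sides of each step. Converting the statistic into a probability needs the K-S distribution from a statistics library and is not shown.

```python
from statistics import NormalDist

def ks_statistic(data, model_cdf):
    """Maximum absolute difference between the model CDF and the empirical CDF."""
    xs = sorted(data)
    n = len(xs)
    d_max = 0.0
    for i, x in enumerate(xs):
        c_model = model_cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d_max = max(d_max, abs(c_model - i / n), abs((i + 1) / n - c_model))
    return d_max

# Made-up sample compared with a standard normal model.
data = [-1.2, -0.5, -0.1, 0.3, 0.4, 0.9, 1.5, 2.1]
d_normal = ks_statistic(data, NormalDist(mu=0.0, sigma=1.0).cdf)

print(d_normal)
```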

K-S test example

Assignment: Test your toolbox
- Download Matlab (or use another tool for this).
- Download the plot data set at: http://zero.as.arizona.edu/ast518
- Familiarize yourself with plotting data, error bars, etc.
(This data set will be the basis of HW 7.)

Matlab download
Go to: http://sitelicense.arizona.edu/matlab
- Follow the instructions to download and install.
- Make sure to use a you@email.arizona.edu address for Mathworks registration.