MIT Spring 2015

MIT 18.443 Dr. Kempthorne Spring 2015

Outline

Batches of data: single or multiple
  $x_1, x_2, \ldots, x_n$
  $y_1, y_2, \ldots, y_m$
  $w_1, w_2, \ldots, w_l$
  etc.
Graphical displays
Summary statistics:
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, and likewise $\bar{y}, s_y, \bar{w}, s_w$
Model: $X_1, \ldots, X_n$ independent with $\mu = E[X_i]$ and $\sigma^2 = \mathrm{Var}[X_i]$.
  Confidence intervals for $\mu$ (apply the CLT)
  If identically distributed, evaluate goodness of fit (GOF) of specific distribution family(s)
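
The CLT-based confidence interval for $\mu$ mentioned above can be sketched as follows. This is a minimal illustration, not the course's own code: the function name `clt_mean_interval` and the data in `batch` are made up for the example, and the interval $\bar{x} \pm z(\alpha/2)\, s_x/\sqrt{n}$ is the usual large-sample form.

```python
import math
from statistics import NormalDist, mean, stdev


def clt_mean_interval(x, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for mu, based on the CLT."""
    n = len(x)
    xbar = mean(x)
    s = stdev(x)  # uses the 1/(n-1) divisor, matching s_x
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile of N(0, 1)
    half_width = z * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Made-up illustrative batch (not data from the lecture)
batch = [4.1, 5.0, 3.8, 4.6, 5.3, 4.9, 4.4, 5.1]
lo, hi = clt_mean_interval(batch)
```

For small $n$ the normal quantile tends to understate the uncertainty, so this sketch is best read as the large-sample case only.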

(continued)
Cumulative Distribution Functions (CDFs)
  Empirical analogs of Theoretical CDFs
Histograms
  Empirical analog of Theoretical PDFs/PMFs
Summary Statistics
  Central Value (mean/average/median)
  Spread (standard deviation/range/inter-quartile range)
  Shape (symmetry/skewness/kurtosis)
Boxplots: graphical display of distribution
Scatterplots: relationships between variables

Methods Based on Cumulative Distribution Functions
Empirical CDF
Batch of data: $x_1, x_2, \ldots, x_n$ (if the data are a random sample, the batch is an i.i.d. sample)
Def: Empirical CDF (ECDF)
  $F_n(x) = \frac{\#(x_i \le x)}{n}$
Define the ECDF using the ordered batch $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$:
  $F_n(x) = 0$ for $x < x_{(1)}$,
  $F_n(x) = k/n$ for $x_{(k)} \le x < x_{(k+1)}$, $k = 1, \ldots, n-1$,
  $F_n(x) = 1$ for $x \ge x_{(n)}$.
The ECDF is the CDF of the discrete uniform distribution on the values $\{x_1, x_2, \ldots, x_n\}$ (values weighted by multiplicity).
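
As a small illustration of the definition $F_n(x) = \#(x_i \le x)/n$, the ECDF can be built from the ordered batch with a binary search. The helper name `ecdf` and the four-point batch are assumptions made for this sketch.

```python
from bisect import bisect_right


def ecdf(batch):
    """Return F_n as a function, where F_n(x) = #(x_i <= x) / n."""
    ordered = sorted(batch)
    n = len(ordered)

    def F_n(x):
        # bisect_right counts how many ordered values are <= x
        return bisect_right(ordered, x) / n

    return F_n


# Made-up batch with a repeated value, to show weighting by multiplicity
F_n = ecdf([3.0, 1.0, 2.0, 2.0])
values = [F_n(0.5), F_n(1.0), F_n(2.0), F_n(2.5), F_n(3.0)]
```

The repeated value 2.0 produces a jump of size 2/4 at $x = 2$, which is the "weighted by multiplicity" behaviour.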

Empirical CDF
For each data value $x_i$ define the indicator
  $I_{(-\infty, x]}(x_i) = 1$ if $x_i \le x$, and $0$ if $x_i > x$,
so that
  $F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty, x]}(x_i)$.
If the batch is a sample from a distribution with theoretical cdf $F(\cdot)$:
  the $I_{(-\infty, x]}(x_i)$ are i.i.d. Bernoulli(prob = $F(x)$),
  $nF_n(x) \sim$ Binomial(size = $n$, prob = $F(x)$).
Thus:
  $E[F_n(x)] = F(x)$
  $\mathrm{Var}[F_n(x)] = \frac{F(x)[1 - F(x)]}{n}$
For samples, $F_n(x)$ is unbiased for $\theta = F(x)$, and the variance is largest at the median, where $F(x) = 1/2$ and $\mathrm{Var}[F_n(x)] = 1/(4n)$.

Estimating $\theta = F(x)$ Using the Empirical CDF
Confidence Interval Based on Binomial $[nF_n(x)]$:
  $\hat{\theta}_n(x) - z(\alpha/2)\,\hat{\sigma}_{\hat{\theta}} < \theta < \hat{\theta}_n(x) + z(\alpha/2)\,\hat{\sigma}_{\hat{\theta}}$
  where
    $\hat{\theta}_n(x) = F_n(x)$
    $\hat{\sigma}_{\hat{\theta}} = \sqrt{\frac{F_n(x)[1 - F_n(x)]}{n}}$
    $z(\alpha/2)$ = upper $\alpha/2$ quantile of $N(0, 1)$
  Applied to single values of $(x, \theta = F(x))$
Kolmogorov-Smirnov Test Statistic:
  If the $x_i$ are a random sample from a continuous distribution with true CDF $F(x)$, then
    KSstat $= \sup_{-\infty < x < +\infty} |F_n(x) - F(x)| \sim T_n$
  where $T_n$ has the asymptotic distribution
    $\sqrt{n}\, T_n \xrightarrow{D} K$, the Kolmogorov distribution (which does not depend on $F$!)
  $\Longrightarrow$ Simultaneous confidence intervals for all $(x, \theta = F(x))$
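
The two quantities on this slide can be computed directly. The sketch below is illustrative only: `ecdf_interval` and `ks_statistic` are hypothetical helper names, and the caller supplies the hypothesized CDF as a plain function. Because $F_n$ is a step function, the supremum of $|F_n(x) - F(x)|$ is attained at (or just before) a data point, so it is enough to check both sides of each jump.

```python
import math
from bisect import bisect_right
from statistics import NormalDist


def ecdf_interval(batch, x, alpha=0.05):
    """Pointwise normal-approximation interval for theta = F(x)."""
    n = len(batch)
    theta_hat = bisect_right(sorted(batch), x) / n  # F_n(x)
    se = math.sqrt(theta_hat * (1 - theta_hat) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta_hat - z * se, theta_hat + z * se


def ks_statistic(batch, cdf):
    """sup over x of |F_n(x) - F(x)| for a hypothesized continuous CDF."""
    ordered = sorted(batch)
    n = len(ordered)
    d = 0.0
    for k, x in enumerate(ordered, start=1):
        f = cdf(x)
        # F_n jumps from (k-1)/n to k/n at x_(k); check both sides of the jump
        d = max(d, k / n - f, f - (k - 1) / n)
    return d


# Made-up example: three points tested against the Uniform(0, 1) CDF
d_uniform = ks_statistic([0.1, 0.4, 0.7], lambda u: u)
```

Note that the pointwise interval can extend outside $[0, 1]$ when $F_n(x)$ is near 0 or 1; the sketch does not clip it.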

Survival Functions
Definition: For a r.v. $X$ with CDF $F(x)$, the Survival Function is
  $S(x) = P(X > x) = 1 - F(x)$
  $X$: time until death/failure/event
Empirical Survival Function: $S_n(x) = 1 - F_n(x)$
Example 10.2.2.A Survival Analysis of Test Treatments
  5 Test Groups (I, II, III, IV, V) of 72 animals/group.
  1 Control Group of 107 animals.
  Animals in each test group received the same dosage of tubercle bacilli inoculation.
  Test groups varied from low dosage (I) to high dosage (V).
  Survival lifetimes measured over a 2-year period.
Study Objectives/Questions:
  What is the effect of increased exposure?
  How does the effect vary with level of resistivity?

Hazard Function: Mortality Rate
Definition: For a lifetime distribution $X$ with cdf $F(x)$ and survival function $S(x) = 1 - F(x)$, the hazard function $h(x)$ is the instantaneous mortality rate at age $x$. For small $\delta > 0$:
  $h(x)\,\delta = P(x < X < x + \delta \mid X > x) \approx \frac{f(x)\,\delta}{P(X > x)} = \frac{\delta f(x)}{1 - F(x)} = \delta\,\frac{f(x)}{S(x)}$
  $\Longrightarrow h(x) = f(x)/S(x)$.
Alternate Representations of the Hazard Function
  $h(x) = \frac{f(x)}{S(x)} = -\frac{\frac{d}{dx}S(x)}{S(x)} = -\frac{d}{dx}\left(\log[S(x)]\right) = -\frac{d}{dx}\left(\log[1 - F(x)]\right)$

Hazard Function: Mortality Rate
Special Cases:
  $X \sim$ Exponential($\lambda$) with cdf $F(x) = 1 - e^{-\lambda x}$
    $h(x) \equiv \lambda$ (constant mortality rate!)
  $X \sim$ Rayleigh($\sigma^2$) with cdf $F(x) = 1 - e^{-x^2/2\sigma^2}$
    $h(x) = x/\sigma^2$ (mortality rate increases linearly)
  $X \sim$ Weibull($\alpha, \beta$) with cdf $F(x) = 1 - e^{-(x/\alpha)^\beta}$
    $h(x) = \beta x^{\beta - 1}/\alpha^\beta$
    Note: the value of $\beta$ determines whether $h(x)$ is increasing ($\beta > 1$), constant ($\beta = 1$), or decreasing ($\beta < 1$).
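
The special cases can be checked numerically from the general relation $h(x) = f(x)/S(x)$. The sketch below uses the Weibull form, which contains the other two: Exponential($\lambda$) corresponds to $\alpha = 1/\lambda$, $\beta = 1$, and Rayleigh($\sigma^2$) to $\alpha = \sigma\sqrt{2}$, $\beta = 2$. All function names are hypothetical, and the parameter values are arbitrary.

```python
import math


def weibull_pdf(x, alpha, beta):
    """Density f(x), obtained by differentiating F(x) = 1 - exp(-(x/alpha)^beta)."""
    return (beta / alpha) * (x / alpha) ** (beta - 1) * math.exp(-((x / alpha) ** beta))


def weibull_survival(x, alpha, beta):
    """S(x) = 1 - F(x)."""
    return math.exp(-((x / alpha) ** beta))


def hazard(x, alpha, beta):
    """General definition h(x) = f(x) / S(x)."""
    return weibull_pdf(x, alpha, beta) / weibull_survival(x, alpha, beta)


def weibull_hazard_closed_form(x, alpha, beta):
    """Closed form beta * x^(beta - 1) / alpha^beta."""
    return beta * x ** (beta - 1) / alpha ** beta


lam, sigma = 0.5, 2.0  # arbitrary illustrative parameters
exponential_hazard = hazard(3.0, 1 / lam, 1.0)             # should equal lam
rayleigh_hazard = hazard(3.0, sigma * math.sqrt(2), 2.0)   # should equal 3 / sigma^2
```

The sketch evaluates only at $x > 0$; at $x = 0$ with $\beta < 1$ the hazard is unbounded.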

Hazard Function
Log Survival Functions
Ordered times: $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$
For $x = X_{(j)}$: $F_n(x) = j/n$ and $S_n(x) = 1 - j/n$.
Plot the Log Survival Function versus age/lifetime:
  $\log[S_n(x_{(j)})]$ versus $x_{(j)}$
Note: to handle the $j = n$ case (where $S_n = 0$), apply the modified definition of $S_n(\cdot)$:
  $S_n(x_{(j)}) = 1 - j/(n + 1)$
In the plot of the log survival function, the hazard rate is the negative slope of the plotted function (a straight line corresponds to a constant hazard/mortality rate).
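
A short sketch of the log-survival construction, using the modified definition $S_n(x_{(j)}) = 1 - j/(n+1)$. To keep the example deterministic, the "sample" is an idealized one made of exact Exponential($\lambda$) quantiles rather than random draws; that choice, and the helper names, are assumptions of this illustration. For such data the points fall on a line whose negative slope recovers the constant hazard $\lambda$.

```python
import math


def log_survival_points(lifetimes):
    """Pairs (x_(j), log S_n(x_(j))) with S_n(x_(j)) = 1 - j/(n + 1)."""
    ordered = sorted(lifetimes)
    n = len(ordered)
    return [(x, math.log(1 - j / (n + 1))) for j, x in enumerate(ordered, start=1)]


def least_squares_slope(points):
    """Slope of the ordinary least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


lam, n = 0.5, 200
# Idealized sample: exact Exponential(lam) quantiles at p_j = j/(n + 1)
lifetimes = [-math.log(1 - j / (n + 1)) / lam for j in range(1, n + 1)]
estimated_hazard = -least_squares_slope(log_survival_points(lifetimes))
```

With real (random) lifetimes the largest order statistics are noisy, so the upper end of the plot should be read with some caution.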

Quantile-Quantile Plots
One-Sample Quantile-Quantile Plots
  $X$ a continuous r.v. with CDF $F(x)$.
  $p$th quantile of the distribution: $x_p$: $F(x_p) = p \iff x_p = F^{-1}(p)$
  Empirical-Theoretical Quantile-Quantile Plot:
    Plot $x_{(j)}$ versus $x_{p_j} = F^{-1}(p_j)$, where $p_j = j/(n + 1)$
Two-Sample Empirical-Empirical Quantile-Quantile Plot
  Context: two groups
    Control Group: $x_1, \ldots, x_n$ i.i.d. cdf $F(x)$
    Test Treatment: $y_1, \ldots, y_n$ i.i.d. cdf $G(y)$
  Plot order statistics of $\{y_i\}$ versus order statistics of $\{x_i\}$
  Testing for No Treatment Effect
    $H_0$: No treatment effect, $G(\cdot) \equiv F(\cdot)$
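
The one-sample plot needs only the pairs $(F^{-1}(p_j), x_{(j)})$ with $p_j = j/(n+1)$. In the sketch below, the hypothetical helper `qq_points` returns those pairs, and the standard normal is used as the reference distribution purely as an example.

```python
from statistics import NormalDist


def qq_points(batch, inverse_cdf):
    """Pairs (theoretical quantile F^{-1}(p_j), ordered value x_(j)), p_j = j/(n + 1)."""
    ordered = sorted(batch)
    n = len(ordered)
    return [(inverse_cdf(j / (n + 1)), x) for j, x in enumerate(ordered, start=1)]


# Made-up batch compared against N(0, 1) quantiles
points = qq_points([2.0, 0.0, 1.0], NormalDist().inv_cdf)
```

Points lying close to a straight line suggest that the batch is consistent with the reference family up to location and scale.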

Two-Sample Empirical Quantile-Quantile Plots
Testing for Additive Treatment Effect
Hypotheses:
  $H_0$: No treatment effect, $G(\cdot) \equiv F(\cdot)$
  $H_1$: Expected response increases by $h$ units: $y_p = x_p + h$
Relationship between $G(\cdot)$ and $F(\cdot)$:
  $G(y_p) = F(x_p) = p \Longrightarrow G(y_p) = F(y_p - h)$.
  CDF $G(\cdot)$ is the same as $F(\cdot)$ but shifted $h$ units to the right
Q-Q Plot is linear with slope = 1 and intercept = $h$.

Two-Sample Empirical Quantile-Quantile Plots
Testing for Multiplicative Treatment Effect
Hypotheses:
  $H_0$: No treatment effect, $G(\cdot) \equiv F(\cdot)$
  $H_1$: Expected response increases by a factor of $c$ ($> 0$): $y_p = c\,x_p$
Relationship between $G(\cdot)$ and $F(\cdot)$:
  $G(y_p) = F(x_p) = p \Longrightarrow G(y_p) = F(y_p / c)$.
  CDF $G(\cdot)$ is the same as $F(\cdot)$ when plotted on a log horizontal scale (shifted $\log(c)$ units to the right on the log scale).
Q-Q Plot is linear with slope = $c$ and intercept = 0.
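
The slope and intercept claims for additive and multiplicative effects can be illustrated with a tiny two-sample Q-Q computation. The control values are made up, the treated groups are constructed as exact shifts or rescalings of them, and the helper names are hypothetical; real data would of course scatter around the line.

```python
def two_sample_qq(x, y):
    """Pairs of order statistics (x_(j), y_(j)) for two equal-size batches."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(x), sorted(y)))


def fit_line(points):
    """Ordinary least-squares (slope, intercept) through the points."""
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


control = [1.0, 2.5, 4.0, 5.5, 7.0]        # made-up control responses
additive = [v + 3.0 for v in control]       # additive effect with h = 3
multiplicative = [2.0 * v for v in control]  # multiplicative effect with c = 2

slope_add, intercept_add = fit_line(two_sample_qq(control, additive))
slope_mult, intercept_mult = fit_line(two_sample_qq(control, multiplicative))
```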

Methods For Displaying Distributions (relevant R functions)
Histogram: hist(x, nclass = 20, probability = TRUE)
Density Curve: plot(density(x, bw = "sj"))
  Kernel function $w(x)$ with bandwidth $h$:
    $w_h(x) = \frac{1}{h}\, w\!\left(\frac{x}{h}\right)$
    $f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i)$
  Options for Kernel function:
    Gaussian: $w(x)$ = Normal(0, 1) pdf $\Longrightarrow$ $w_h(x - X_i)$ is the $N(X_i, h^2)$ density (standard deviation $h$)
    Rectangular, triangular, cosine, bi-weight, etc.
  See: Scott, D. W. (1992) Multivariate Density Estimation: Theory, Practice and Visualization. New York: Wiley
Stem-and-Leaf Plot: stem(x)
Boxplots: boxplot(x)
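
The kernel density estimate $f_h(x) = \frac{1}{n}\sum_i w_h(x - X_i)$ with $w_h(x) = \frac{1}{h} w(x/h)$ is short enough to write out directly. The sketch below uses the Gaussian kernel with a fixed, hand-picked bandwidth; it does not attempt the data-driven bandwidth selection that R's `density` offers, and the names `gaussian_kernel` and `kde` are made up for the example.

```python
import math


def gaussian_kernel(u):
    """Standard normal pdf, used as the kernel w."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(sample, h, kernel=gaussian_kernel):
    """Return f_h, where f_h(x) = (1/n) * sum_i (1/h) * w((x - X_i) / h)."""
    n = len(sample)

    def f_h(x):
        return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

    return f_h


# Made-up sample and bandwidth, chosen only for illustration
f_h = kde([0.0, 1.0, 2.0], h=0.5)
grid = [-5 + 0.01 * i for i in range(1201)]
total_mass = sum(f_h(g) for g in grid) * 0.01  # Riemann sum, should be close to 1
```

A smaller $h$ gives a bumpier estimate and a larger $h$ a smoother one, which is the trade-off the bandwidth selectors try to balance.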

Boxplot Construction
Vertical axis = scale of sample $X_1, \ldots, X_n$
Horizontal lines are drawn at the upper quartile and the lower quartile, and vertical lines join them to form the box
A horizontal line is drawn at the median inside the box
A vertical line is drawn up from the upper quartile to the most extreme data point within 1.5 × IQR of the upper quartile (IQR = Inter-Quartile Range)
Also, a vertical line is drawn down from the lower quartile to the most extreme data point within 1.5 × IQR of the lower quartile
Short horizontal lines are added to mark the ends of the vertical lines
Each data point beyond the ends of the vertical lines is marked individually with a plotting symbol
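
The construction rules translate into a few lines of arithmetic. One assumption to flag: sample quartiles have several competing definitions, and this sketch uses the linear-interpolation rule from Python's `statistics.quantiles(..., method="inclusive")`, which may differ slightly from the hinges that a given plotting routine uses. The function name and the data are illustrative.

```python
from statistics import quantiles


def boxplot_stats(sample):
    """Quartiles, whisker ends, and outlying points for a boxplot."""
    ordered = sorted(sample)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "lower_whisker": inside[0],   # most extreme point within 1.5 * IQR below q1
        "q1": q1,
        "median": med,
        "q3": q3,
        "upper_whisker": inside[-1],  # most extreme point within 1.5 * IQR above q3
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Made-up data with one clear outlier
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```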

Location Measures
Objective: measure center of $\{x_1, x_2, \ldots, x_n\}$
Arithmetic Mean
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  Issues: not robust to outliers.
Median (with $x_{[j]}$ the $j$-th smallest value)
  median$(\{x_i\}) = x_{[j]}$, $j = \frac{n+1}{2}$, if $n$ is odd
  median$(\{x_i\}) = \frac{x_{[j]} + x_{[j+1]}}{2}$, $j = \frac{n}{2}$, if $n$ is even
  Note: confidence intervals for $\eta = \mathrm{median}(X)$ are of the form $[x_{[k]}, x_{[n+1-k]}]$
  Rice Section 10.4.2 applies the binomial distribution for $\#(X_i > \eta)$
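
The order-statistic interval for the median can be sketched with exact binomial probabilities. For a continuous distribution, the number of observations below $\eta$ is Binomial($n$, 1/2), so $[x_{[k]}, x_{[n+1-k]}]$ covers $\eta$ with probability $1 - 2P(B \le k - 1)$. The helper below picks the largest $k$ whose coverage is at least $1 - \alpha$; its name and that selection rule are assumptions of this illustration.

```python
from math import comb


def median_interval(sample, alpha=0.05):
    """Return (x_[k], x_[n+1-k], coverage) with exact coverage >= 1 - alpha."""
    ordered = sorted(sample)
    n = len(ordered)

    def coverage(k):
        # P(B <= k - 1) for B ~ Binomial(n, 1/2), counted once for each tail
        tail = sum(comb(n, j) for j in range(k)) / 2 ** n
        return 1 - 2 * tail

    if coverage(1) < 1 - alpha:
        raise ValueError("sample too small for the requested confidence level")
    k = 1
    while k + 1 <= (n + 1) // 2 and coverage(k + 1) >= 1 - alpha:
        k += 1
    return ordered[k - 1], ordered[n - k], coverage(k)


# Made-up batch of ten values
low, high, achieved = median_interval([7, 3, 9, 1, 5, 10, 2, 8, 4, 6])
```

Because the binomial distribution is discrete, the achieved coverage usually exceeds the nominal level rather than matching it exactly.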

Location Measures
Trimmed Mean: x.trimmedmean = mean(x, trim = 0.10)
  10% of lowest values dropped
  10% of highest values dropped
  mean of remaining values computed
  (the trim = parameter must be less than 0.5)
M Estimates
  Least-squares estimate: Choose $\hat{\mu}$ to minimize
    $\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2$
    (the minimizer is the sample mean)
  MAD estimate: Choose $\hat{\mu}$ to minimize
    $\sum_{i=1}^{n} \left|\frac{X_i - \mu}{\sigma}\right|$
    (minimize mean absolute deviation; the minimizer is the sample median)
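
A sketch of the trimmed mean, together with the two loss functions from the M-estimate definitions (with $\sigma$ fixed at 1, which does not change the minimizer). The trimming rule, dropping `floor(n * trim)` values from each end, is an assumption intended to mirror the behaviour of `mean(x, trim = ...)`; the function names and the data are illustrative.

```python
import math
from statistics import mean, median


def trimmed_mean(x, trim=0.10):
    """Mean after dropping floor(n * trim) values from each end."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(x)
    g = math.floor(len(ordered) * trim)
    return mean(ordered[g:len(ordered) - g])


def sum_of_squares(x, mu):
    """Least-squares criterion (sigma taken as 1)."""
    return sum((v - mu) ** 2 for v in x)


def sum_of_absolutes(x, mu):
    """Absolute-deviation criterion (sigma taken as 1)."""
    return sum(abs(v - mu) for v in x)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # made-up batch with one gross outlier
plain = mean(data)            # pulled far to the right by the outlier
robust = trimmed_mean(data)   # drops 1 and 1000, then averages 2..9
```

On this batch the plain mean is dominated by the single outlier, while the trimmed mean and the median both stay near the bulk of the data.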

M Estimates (continued)
Huber Estimate: Choose $\hat{\mu}$ to minimize
  $\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right)$
  where $\Psi(\cdot)$ behaves like the sum-of-squares criterion near 0 (within $k\sigma$) and like the sum-of-absolutes criterion far from 0 (more than $k\sigma$)
  In R:
    library(MASS)
    x.mestimate = huber(x, k = 1.5)
Comparing Location Estimates
  For a symmetric distribution: same location parameter for all methods!
    Apply the Bootstrap to estimate variability
  For an asymmetric distribution: the location parameter varies
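
The following is a rough sketch of a Huber-type location estimate and of a bootstrap standard error for comparing estimators. It is not a reimplementation of `MASS::huber`: the fixed scale (median absolute deviation divided by 0.6745), the winsorizing fixed-point iteration, the stopping rule, and all names are assumptions chosen to keep the example short. The fixed point of the iteration satisfies the Huber estimating equation with the clipped score function.

```python
import random
from statistics import mean, median, stdev


def huber_location(x, k=1.5, tol=1e-9, max_iter=500):
    """Huber-type location estimate with a fixed robust scale."""
    med = median(x)
    scale = median([abs(v - med) for v in x]) / 0.6745
    if scale == 0:
        return med  # degenerate batch: fall back to the median
    mu = med
    for _ in range(max_iter):
        lo, hi = mu - k * scale, mu + k * scale
        # Winsorize: pull values beyond k * scale back to the boundary, then average
        new_mu = mean([min(max(v, lo), hi) for v in x])
        if abs(new_mu - mu) < tol * scale:
            return new_mu
        mu = new_mu
    return mu


def bootstrap_se(x, estimator, reps=500, seed=0):
    """Bootstrap standard error: resample with replacement and re-estimate."""
    rng = random.Random(seed)
    n = len(x)
    estimates = [estimator(rng.choices(x, k=n)) for _ in range(reps)]
    return stdev(estimates)


batch = [1, 2, 3, 4, 100]  # made-up batch with one outlier
huber_estimate = huber_location(batch)
se_mean = bootstrap_se(batch, mean)
se_huber = bootstrap_se(batch, huber_location)
```

On this batch the Huber estimate stays close to the median (3), whereas the mean is 22; the bootstrap standard errors give a rough sense of how variable each estimator is under resampling.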

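Measures of Dispersion
Sample Standard Deviation $s$: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$
  $s^2$ is unbiased for $\mathrm{Var}(X)$ if $\{X_i\}$ are i.i.d.
  $s$ is biased for $\sqrt{\mathrm{Var}(X)}$.
  For $X_i$ i.i.d. $N(\mu, \sigma^2)$: $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$.
Interquartile Range: $IQR = q_{0.75} - q_{0.25}$, where $q_p$ is the $p$-th quantile of $F_X$
  For $X_i$ i.i.d. $N(\mu, \sigma^2)$: $IQR = 1.35\,\sigma$
  For Normal Sample: $\hat{\sigma} = \frac{\text{sample } IQR}{1.35}$
Mean Absolute Deviation: $MAD = \frac{1}{n}\sum_{i=1}^{n} |X_i - \bar{X}|$
  For $X_i$ i.i.d. $N(\mu, \sigma^2)$: $E[MAD] \approx \sqrt{2/\pi}\,\sigma \approx 0.798\,\sigma$ for large $n$
  For Normal Sample: $\hat{\sigma} = \frac{MAD}{0.798}$
  The constant 0.675 applies to the median absolute deviation about the median, $\mathrm{median}_i\,|X_i - \mathrm{median}(X)| \approx 0.675\,\sigma$, for which $\hat{\sigma}$ = (median absolute deviation)$/0.675$.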
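
The normal-theory scale estimates can be compared side by side. Two points are assumptions of this sketch rather than statements from the slides: the mean absolute deviation about $\bar{X}$ is rescaled by $\sqrt{2/\pi} \approx 0.798$ (its large-sample expectation under normality), while the constant 0.675 is used for the median absolute deviation about the median. The test batch is an idealized one made of exact normal quantiles so the result is deterministic; all names are hypothetical.

```python
import math
from statistics import NormalDist, mean, median, quantiles, stdev


def sigma_estimates(x):
    """Estimates of sigma that are consistent for normal data."""
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    xbar, med = mean(x), median(x)
    mean_abs_dev = mean([abs(v - xbar) for v in x])
    median_abs_dev = median([abs(v - med) for v in x])
    return {
        "s": stdev(x),
        "iqr": (q3 - q1) / 1.35,
        "mean_abs_dev": mean_abs_dev / math.sqrt(2 / math.pi),
        "median_abs_dev": median_abs_dev / 0.675,
    }


sigma, n = 2.0, 999
# Idealized N(10, sigma^2) batch: exact quantiles at p_j = j/(n + 1)
sample = [NormalDist(10, sigma).inv_cdf(j / (n + 1)) for j in range(1, n + 1)]
estimates = sigma_estimates(sample)
```

On normal data all four values should land near $\sigma$; on heavy-tailed or contaminated data the IQR and median-based versions are typically much less affected than $s$.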

MIT OpenCourseWare http://ocw.mit.edu 18.443 Statistics for Applications Spring 2015 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.