(Re)introduction to Statistics
Dan Lizotte
2017-01-17

Statistics

"The systematic collection and arrangement of numerical facts or data of any kind; (also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data." [OED]

Why statistics?

- It can tell you whether you should be surprised by your data.
- It can help predict what future data will look like.

Data

## We'll use data on the duration and spacing of eruptions
## of the Old Faithful geyser.
## Data are eruption duration and waiting time to next eruption.
data("faithful")  # load data
str(faithful)     # display the internal structure of an R object

## 'data.frame': 272 obs. of 2 variables:
##  $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num 79 54 74 62 85 55 88 85 51 85 ...

Data summaries

A statistic is the result of applying a function (a summary) to the data: statistic <- function(data). Examples based on ranks: min, quantiles, median, mean, max.

summary(faithful$eruptions)

##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  1.600   2.163   4.000   3.488   4.454   5.100

Roughly, the quantile for a proportion p is a value x for which a proportion p of the data are less than or equal to x. The first quartile, median, and third quartile are the quantiles for p = 0.25, p = 0.5, and p = 0.75, respectively.
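To make the quantile definition concrete, here is a minimal check in base R (I leave the type argument at its default; R implements several quantile conventions, so in general the values may differ slightly between methods):

quantile(faithful$eruptions, probs = c(0.25, 0.5, 0.75))  # 2.163, 4.000, 4.454, matching summary()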

Visual Summary 1: Box Plot

boxplot(faithful$eruptions, main="Eruption time", horizontal=TRUE)

[Figure: horizontal box plot of eruption times; the axis runs from about 1.5 to 5.0 minutes.]

Visual Summary 1.5: Box Plot, Jitter Plot

library(ggplot2); library(gridExtra)
# boxplot relatives
b1 <- ggplot(faithful, aes(x="All", y=eruptions)) + labs(x=NULL) + geom_boxplot()
# jitter plot
b2 <- ggplot(faithful, aes(x="All", y=eruptions)) + labs(x=NULL) +
  geom_jitter(position=position_jitter(height=0, width=0.25))
grid.arrange(b1, b2, nrow=1)

[Figure: side-by-side box plot and jitter plot of eruptions (y axis roughly 2 to 5) for the single group "All".]

Visual Summary 2: Histogram

## Construct histogram of eruption times, plot data points on the x axis
hist(faithful$eruptions, main="Eruption time", xlab="Time (minutes)", ylab="Count")
points(x=faithful$eruptions, y=rep(0, length(faithful$eruptions)), lwd=4, col='blue')

[Figure: histogram of eruption times, counts up to about 60, over 2 to 5 minutes.]

Visual Summary 2.5: Histogram

## Construct different histogram of eruption times
ggplot(faithful, aes(x=eruptions)) + labs(y="Proportion") +
  geom_histogram(aes(y = ..count../sum(..count..)))

[Figure: histogram of eruptions with the y axis showing proportions from 0.000 to 0.100.]

Visual Summary 3: Empirical Cumulative Distribution Function

## Construct ECDF of eruption times, plot data points on the x axis
plot(ecdf(faithful$eruptions), main="Eruption time", xlab="Time (minutes)", ylab="Proportion")
points(x=faithful$eruptions, y=rep(0, length(faithful$eruptions)), lwd=4, col='blue')

[Figure: ECDF of eruption times, rising from 0.0 to 1.0 over 2 to 5 minutes.]

Visual Summary 3.5: Empirical Cumulative Distribution Function

## Different picture of ECDF, with jitter plot
ggplot(faithful, aes(x=eruptions)) + labs(x="Eruption Time", y="Proportion") +
  stat_ecdf() +
  geom_jitter(aes(y=0.125), position=position_jitter(width=0, height=0.1))

[Figure: ECDF of eruption times with jittered data points plotted near the x axis.]

Replicates

A common assumption is that the data consist of replicates that are "the same":
- They come from the same population.
- They come from the same process.
The goal of data analysis is to understand what the data tell us about the population.

Randomness

We often assume that we can treat items as if they were distributed "randomly."
- "That's so random!"
- "Result of a coin flip is random."
- "Passengers were screened at random."
- "random" does not mean "uniform"
Mathematical formalism: events and probability.

Sample Spaces and Events

The sample space S is the set of all possible outcomes we might observe. It depends on context:
- Coin flips: S = {h, t}
- Eruption times: $S = \mathbb{R}_{\ge 0}$

- (Eruption time, eruption wait) pairs: $S = \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0}$

An event is a subset of the sample space:
- Observe heads: {h}
- Observe an eruption lasting 2 minutes: {2.0}
- Observe an eruption with length between 1 and 2 minutes and a wait between 50 and 70 minutes: [1, 2] × [50, 70]

Event Probabilities

Any event can be assigned a probability between 0 and 1 (inclusive), e.g.:
- Pr({h}) = 0.5
- Pr([1, 2] × [50, 70]) = 0.10

Probability (OED)

"Math. As a measurable quantity: the extent to which a particular event is likely to occur, or a particular situation be the case, as measured by the relative frequency of occurrence of events of the same kind in the whole course of experience, and expressed by a number between 0 and 1. An event that cannot happen has probability 0; one that is certain to happen has probability 1. Probability is commonly estimated by the ratio of the number of successful cases to the total number of possible cases, derived mathematically using known properties of the distribution of events, or estimated logically by inferential or inductive reasoning (when mathematical concepts may be inapplicable or insufficient)."

Axioms of probability

Pr is a probability function over S iff
1. For all events A, $\Pr(A) \in \mathbb{R}$ and $\Pr(A) \ge 0$
2. $\Pr(S) = 1$
3. If $A_1, A_2, \ldots$ are disjoint, then $\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i)$

Interpreting probability: Objectivist view

Suppose we observe n replications of an experiment, and let n(A) be the number of times event A was observed. Then
$$\lim_{n \to \infty} \frac{n(A)}{n} = \Pr(A)$$
This is (loosely) Borel's Law of Large Numbers. (A more precise statement of this idea is coming up in a few slides.) A subjective interpretation is possible as well. ("Bayesian" statistics is related to this idea; more later.)
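The objectivist reading is easy to see by simulation; a minimal sketch in R (the fair coin and the sample sizes are my choices, not from the slides):

## Relative frequency of heads (coded as 1) approaches Pr(heads) = 0.5
## as the number of replications n grows.
set.seed(1)
for (n in c(10, 100, 1000, 100000)) {
  flips <- sample(c(0, 1), size = n, replace = TRUE)
  cat("n =", n, " n(A)/n =", mean(flips), "\n")
}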

Abstraction of data: Random Variable

We often reduce data to numbers, e.g. "1 means heads, 0 means tails."
- A random variable is a mapping from the sample space to a number (or vector).
- Usually rendered in uppercase italics: X is every statistician's favourite, followed closely by Y and Z.
- Realizations of X are written in lower case, e.g. $x_1, x_2, \ldots$
- We write the set of possible realizations as $\mathcal{X}$ for X, $\mathcal{Y}$ for Y, and so on.

Distributions of random variables

- Realizations are observed according to probabilities specified by the distribution of X.
- You can think of X as an infinite supply of data.
- Separate realizations of the same r.v. X are independent and identically distributed (i.i.d.).
- The formal definition of a random variable requires measure theory, which is not covered here.

Probabilities for random variables

For a random variable X with realization x, what is the probability we see x? Write Pr(X = x) (if lazy, Pr(x), but don't do this). Subsets of the domain of a random variable correspond to events, e.g. Pr(X > 0) is the probability that I see a realization that is positive.

Discrete Random Variables

Discrete random variables take values from a countable set:
- Coin flip X: $\mathcal{X} = \{0, 1\}$
- Number of snowflakes that fall in a day Y: $\mathcal{Y} = \{0, 1, 2, \ldots\}$

Probability Mass Function (PMF)

For a discrete X, $p_X(x)$ gives $\Pr(X = x)$. Requirement: $\sum_{x \in \mathcal{X}} p_X(x) = 1$. Note that the sum can have an infinite number of terms.
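A quick sketch of a PMF in R, using the binomial distribution that appears in the next example (dbinom is base R):

## PMF of the number of heads in 20 flips of a fair coin
x  <- 0:20
px <- dbinom(x, size = 20, prob = 0.5)
px[x == 10]  # Pr(X = 10), the most likely value
sum(px)      # the PMF sums to 1 over the whole support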

Probability Mass Function (PMF) Example

X is the number of heads in 20 flips of a fair coin; $\mathcal{X} = \{0, 1, \ldots, 20\}$.

[Figure: bar plot of the PMF $p_X(x)$ over x = 0, ..., 20, peaked at x = 10.]

Cumulative Distribution Function (CDF)

For a discrete X, $P_X(x)$ gives $\Pr(X \le x)$. Requirements:
- P is nondecreasing
- $\sup_{x \in \mathcal{X}} P_X(x) = 1$
Note:
$$P_X(b) = \sum_{x \le b} p_X(x), \qquad \Pr(a < X \le b) = P_X(b) - P_X(a)$$

Cumulative Distribution Function (CDF) Example

X is the number of heads in 20 flips of a fair coin.
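For this example, the identities above can be checked numerically with base R's binomial functions; a small sketch:

## CDF identities for X = number of heads in 20 fair-coin flips
pbinom(8, size = 20, prob = 0.5)           # P_X(8) = Pr(X <= 8)
sum(dbinom(0:8, size = 20, prob = 0.5))    # same value via the PMF sum
pbinom(12, 20, 0.5) - pbinom(8, 20, 0.5)   # Pr(8 < X <= 12)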

[Figure: step plot of the CDF $P_X(x)$, rising from 0 to 1.]

Continuous random variables

Continuous random variables take values in intervals of $\mathbb{R}$:
- Mass M of a star: $\mathcal{M} = (0, \infty)$
- Oxygen saturation S of blood: $\mathcal{S} = [0, 1]$
For a continuous r.v. X, $\Pr(X = x) = 0$ for all x, so there is no probability mass function. However, $\Pr(X \in (a, b)) \ne 0$ in general.

Probability Density Function (PDF)

For continuous X, $\Pr(X = x) = 0$ and the PMF does not exist. However, we define the probability density function $f_X$ by
$$\Pr(a \le X \le b) = \int_a^b f_X(x)\,dx$$
Requirements: $f_X(x) \ge 0$ for all x, and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.
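These requirements can be checked numerically; a minimal sketch using the standard normal density (the choice of density is mine):

## The standard normal density integrates to 1, and
## Pr(a <= X <= b) is the integral of the density over [a, b].
integrate(dnorm, lower = -Inf, upper = Inf)  # total probability: 1
integrate(dnorm, lower = -1, upper = 1)      # Pr(-1 <= X <= 1), about 0.683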

Probability Density Function (PDF) Example

[Figure: a smooth density curve over x from 0 to 20, with density values up to about 0.15.]

Cumulative Distribution Function (CDF)

For a continuous X, $F_X(x)$ gives $\Pr(X \le x) = \Pr(X \in (-\infty, x])$. Requirements:
- F is nondecreasing
- $\sup_{x \in \mathcal{X}} F_X(x) = 1$
Note:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt, \qquad \Pr(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1)$$
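Again a numerical sketch with the standard normal (pnorm is the normal CDF in base R):

## F_X via pnorm, and an interval probability as a difference of CDF values
pnorm(1.5)                    # F_X(1.5) = Pr(X <= 1.5)
integrate(dnorm, -Inf, 1.5)   # same value via the density
pnorm(2) - pnorm(1)           # Pr(1 < X <= 2)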

Cumulative Distribution Function (CDF) Example

[Figure: a smooth CDF curve rising from 0 to 1 over x from 0 to 20.]

Expectation

The expected value of a discrete random variable X is
$$E[X] = \sum_{x \in \mathcal{X}} x\, p_X(x)$$
The expected value of a continuous random variable Y is
$$E[Y] = \int_{y \in \mathcal{Y}} y\, f_Y(y)\,dy$$
E[X] is called the mean of X, often denoted $\mu$ or $\mu_X$.

Sample Mean

Given a dataset (a collection of realizations) $x_1, x_2, \ldots, x_n$ of X, the sample mean is
$$\bar{x}_n = \frac{1}{n} \sum_i x_i$$
Given a dataset, $\bar{x}_n$ is a fixed number. We use $\bar{X}_n$ to denote the random variable corresponding to the sample mean computed from a randomly drawn dataset of size n.
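To connect the two ideas, a small sketch in R (the fair-die example is mine, not from the slides):

## Expected value of one fair die roll: the sum of x * p_X(x)
x  <- 1:6
px <- rep(1/6, 6)
sum(x * px)                                    # E[X] = 3.5

## The sample mean of many realizations estimates E[X]
set.seed(2)
mean(sample(x, size = 10000, replace = TRUE))  # close to 3.5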

Datasets and sample means

[Figure: several datasets of size n = 15, with each dataset's sample mean plotted in red.]

(Weak) Law of Large Numbers

Informally: if n is large, then $\bar{x}_n$ is probably close to $\mu_X$. Formally:
$$\lim_{n \to \infty} \Pr(|\bar{X}_n - \mu_X| > \varepsilon) = 0$$
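The convergence in the Law of Large Numbers is easy to see by simulation; a minimal sketch (the exponential population and the sample sizes are my choices):

## Running sample means drift toward mu_X = 1 for an Exponential(rate = 1) population.
set.seed(3)
draws <- rexp(100000, rate = 1)
running_mean <- cumsum(draws) / seq_along(draws)
running_mean[c(10, 100, 10000, 100000)]  # approaches 1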

Statistics, Parameters, and Estimation

- A statistic is any summary of a dataset (e.g. $\bar{X}_n$, the sample median). A statistic is the result of a function applied to a dataset.
- A parameter is any summary of the distribution of a random variable (e.g. $\mu_X$, the median). A parameter is the result of a function applied to a distribution.
- Estimation uses a statistic (e.g. $\bar{X}_n$) to estimate a parameter (e.g. $\mu_X$) of the distribution of a random variable.
  - Estimate: the value obtained from a specific dataset
  - Estimator: the function (e.g. sum, divide by n) used to compute the estimate
  - Estimand: the parameter of interest

Consistency

We often use $\bar{X}_n$ to estimate $\mu_X$; the Law of Large Numbers is one bit of theory that justifies this choice. An estimator is consistent for an estimand if it converges to the estimand in probability.

Sampling Distributions

Given an estimate, how good is it? The distribution of an estimator is called its sampling distribution.
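A sampling distribution can be approximated by simulating many datasets, which also previews the bias and variance ideas below; a minimal sketch (the normal population with mu = 3 and n = 15 are my choices, not from the slides):

## Approximate the sampling distribution of the sample mean
## by drawing many datasets of size n and computing the mean of each.
set.seed(4)
xbars <- replicate(5000, mean(rnorm(15, mean = 3, sd = 1)))
hist(xbars)      # roughly normal, centred at mu = 3
mean(xbars) - 3  # near 0: the sample mean is unbiased for mu
var(xbars)       # near 1/15, i.e. sigma^2 / n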

Bias

The bias of an estimator is the expected difference between the estimator and the parameter:
$$E[\bar{X}_n - \mu_X]$$
If this is 0, the estimator is unbiased. Sometimes $\bar{X}_n > \mu_X$ and sometimes $\bar{X}_n < \mu_X$, but the long-run average of these differences is zero.

Variance

The variance of an estimator is the expected squared difference between the estimator and its mean:
$$E[(\bar{X}_n - E[\bar{X}_n])^2]$$
It is positive for all interesting estimators. For an unbiased estimator,
$$E[(\bar{X}_n - E[\bar{X}_n])^2] = E[(\bar{X}_n - \mu_X)^2]$$
Sometimes $\bar{X}_n > \mu_X$ and sometimes $\bar{X}_n < \mu_X$, but the squared differences are all positive and do not cancel out.

Central Limit Theorem

Informally: the sampling distribution of $\bar{X}_n$ is approximately normal if n is big enough. More formally, for X with finite variance:
$$F_{\bar{X}_n}(\bar{x}) \approx \int_{-\infty}^{\bar{x}} \frac{1}{\sigma_n \sqrt{2\pi}}\, e^{-\frac{(t - \mu_X)^2}{2\sigma_n^2}}\,dt, \qquad \sigma_n^2 = \frac{\sigma^2}{n}$$
where $\sigma_n$ is called the standard error and $\sigma^2$ is the variance of X. NOTE: more data means a lower standard error.

Normal (Gaussian) Distribution

$$f_X(x) = \frac{1}{\sigma_X \sqrt{2\pi}}\, e^{-\frac{(x - \mu_X)^2}{2\sigma_X^2}}$$
The normal distribution is special (among other reasons) because many estimators have approximately normal sampling distributions, or have sampling distributions that are closely related to the normal. Reminder: $\sigma_X^2 = E[(X - \mu_X)^2]$. If X is normal and we let
$$Z = \frac{X - \mu_X}{\sigma_X}$$
we have
$$f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$$

Who cares?

The eruptions dataset has n = 272 observations, and our estimate of the mean eruption time is $\bar{x}_{272} = 3.4877831$. What is the probability of observing an $\bar{x}_{272}$ that is within 10 seconds of the true mean?

Who cares?

Let $\sigma_{\bar{X}_{272}} = \sigma_X / \sqrt{272}$, and let $Z = \frac{\bar{X}_{272} - \mu_X}{\sigma_{\bar{X}_{272}}}$ be a new r.v. (10 seconds is about 0.17 minutes.) By the C.L.T.,
$$\Pr(-0.17 \le \bar{X}_{272} - \mu_X \le 0.17) = \Pr\left(\frac{-0.17}{\sigma_{\bar{X}_{272}}} \le Z \le \frac{0.17}{\sigma_{\bar{X}_{272}}}\right) \approx \int_{-0.17/\sigma_{\bar{X}_{272}}}^{0.17/\sigma_{\bar{X}_{272}}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\,dz = \int_{-2.456}^{2.456} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\,dz = 0.986$$
Note! I estimated $\sigma_X$ here. (Look up "t-test.")

[Figure: standard normal density over z, with the central region shaded.]

Confidence Intervals

Typically, we specify a confidence level given by $1 - \alpha$ and use the sampling distribution to get an interval that "traps" the parameter (estimand) with probability $1 - \alpha$. The 95% C.I. for the mean eruption time is (3.35, 3.62).
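Both numbers above can be reproduced in R; a sketch using the sample standard deviation of the eruptions as the estimate of sigma_X:

## Standard error, standardized cutoff, and the normal probability
se <- sd(faithful$eruptions) / sqrt(272)  # estimated standard error of the sample mean
z  <- 0.17 / se                           # about 2.456
pnorm(z) - pnorm(-z)                      # about 0.986

## The 95% confidence interval quoted above
t.test(faithful$eruptions)$conf.int       # roughly (3.35, 3.62)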

95% Confidence Region

[Figure: standard normal density over z with the central 95% region shaded.]

What a Confidence Interval Means

[Figure.]

Effect of n on width

[Figure.]

The Bootstrap

The CLT gives a theoretical approximate sampling distribution of $\bar{X}_n$. We could also estimate the sampling distribution of $\bar{X}_n$ by drawing many datasets of size n, computing $\bar{X}_n$ on each, and constructing a histogram. This is impossible, but we can use the data we have as a surrogate.

The Bootstrap

Call our dataset D.
- Draw B new datasets by sampling observations with replacement from D. (B is often at least 1000.)
- Compute $\bar{X}_n^{(b)}$ for each of the B datasets.
- Use the histogram/empirical distribution of these "pretend" $\bar{X}_n$ values to determine confidence limits.

Bootstrap example

library(boot)
bootstraps <- boot(faithful$eruptions, function(d, i){ mean(d[i]) }, R = 5000)
bootdata <- data.frame(xbars = bootstraps$t)
limits <- quantile(bootdata$xbars, c(0.025, 0.975))

ggplot(bootdata, aes(x=xbars)) + labs(y="Prop.") +
  geom_histogram(aes(y = ..density..)) +
  geom_errorbarh(aes(xmin=limits[[1]], xmax=limits[[2]], y=c(0)),
                 height=0.25, colour="red", size=2)

[Figure: histogram of the 5000 bootstrap sample means, centred near 3.4 to 3.5, with the 2.5% and 97.5% limits marked in red.]
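For reference, the boot package can also compute the same percentile limits directly from the boot object above; a minimal sketch:

## Percentile bootstrap confidence interval
boot.ci(bootstraps, conf = 0.95, type = "perc")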

Reality Check

[Figure: histogram of the eruptions data themselves, with proportions up to about 0.8 on the y axis and eruption times 2 to 5 on the x axis.]

How much data do I need?

Performance measurement: Preview

"My classifier is correct 20 times out of 30 on this test set!" Let X be an r.v. representing correctness as {0, 1}. What does $\mu_X$ mean? Have 50 observations of X.

binom.test(20, 30)

##
## Exact binomial test

##
## data:  20 and 30
## number of successes = 20, number of trials = 30, p-value = 0.09874
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4718800 0.8271258
## sample estimates:
## probability of success
##              0.6666667

Test set sample size calculation

Suppose the true accuracy is 0.66. Can I tell the difference from 0.5 with a sample size of 30? How much data would I need to distinguish my classifier from 0.5 with probability $(1 - \beta) = 0.8$ at a significance level of $\alpha = 0.05$?

library(pwr)
pwr.p.test(h = ES.h(p1 = 0.5, p2 = 0.66), n = NULL, power = 0.8, sig.level = 0.05)

##
## proportion power calculation for binomial distribution (arcsine transformation)
##
##           h = 0.3257295
##           n = 73.97628
##   sig.level = 0.05
##       power = 0.8
## alternative = two.sided

R summary commands

- str() shows the structure of a vector, matrix, table, data frame, etc.
- summary() shows basic summary statistics
- foo$bar extracts column bar from data frame foo
- length(), nrow(), ncol() give size information for a vector, data frame, etc.
- min(), max(), median(), IQR(), quantile(data, prob) do what you expect
  - IQR is the Inter-Quartile Range: 3rd quartile minus 1st quartile
- mean(), var(), sd()
  - Note: variance and standard deviation use n - 1 in the denominator

Utility commands

- rep(e, n) creates a vector by repeating element e, n times
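The effect size h reported above is Cohen's h, a difference of arcsine-transformed proportions; a one-line check of its magnitude:

## Magnitude of Cohen's h for proportions 0.5 and 0.66
abs(2*asin(sqrt(0.5)) - 2*asin(sqrt(0.66)))  # about 0.3257, matching h above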

- which(b) returns the list of indices for which the boolean expression b is true

R commands

- hist() computes and plots a histogram (probability=TRUE shows proportions instead of frequencies)
- ecdf() computes an empirical cumulative distribution function
- boxplot() draws a box plot; whiskers extend at most 1.5 IQR from the nearest quartile
- density() constructs a kernel density estimate from the given data
- plot() creates a new scatterplot of given x, y coordinates; can also be used to plot many other R objects. Try it!
- points() adds additional points to an existing plot

Common function arguments

- main - plot title
- xlab - x label for the plot
- ylab - y label for the plot
- pch - plotting character: what shape to use for points
- cex - character expansion: multiplicative factor to enlarge/shrink points

ggplot2

- Very stylish
- Learning curve is steep but worth it
- Examples in this document
- Lots of resources on the web
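Putting a few of these commands and arguments together, a minimal sketch (base R only, using the faithful data from earlier):

## Scatterplot of the two faithful variables, using the arguments above
plot(faithful$eruptions, faithful$waiting,
     main="Old Faithful", xlab="Eruption time (min)",
     ylab="Waiting time (min)", pch=19, cex=0.8)
## Mark the point of sample means with a large red cross
points(mean(faithful$eruptions), mean(faithful$waiting),
       col="red", pch=4, cex=2)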