A Short Course in Basic Statistics


Ian Schindler
November 5, 2017
Creative Commons license: share and share alike (CC BY-SA)

1 Descriptive Statistics

1.1 Presenting statistical data

Definition 1. A statistical population is a set $\Omega$ of objects of study. We may assume that $\mathrm{card}(\Omega) < \infty$. Therefore there is a bijection $i : \Omega \to \{1, \dots, n\}$ for some $n \in \mathbb{N}$, which lets us use $\Omega$ and $\{1, \dots, n\}$ interchangeably. A random variable is a function $X : \Omega \to \omega$. The goal of descriptive statistics is to summarise the range of random variables.

1.1.1 Raw data

Let $X : \Omega = \{1, \dots, n\} \to \omega$. If $\omega \subset \mathbb{R}$, the value of $X$ at $i \in \{1, \dots, n\}$, $X(i)$, will be denoted $x_i$.

- Continuous: $\omega = I \subset \mathbb{R}$ is an interval.
- Discrete: $\omega \subset \mathbb{R}$ and $\mathrm{card}(\omega)$ is small.
- Categorical: $\omega$ is nominal.

1.1.2 A table of a simple statistical series

For a discrete variable $\{x_i\}_{i=1}^n$ taking $K$ distinct values, we define:

- The class value $x_k$, ordered so that $x_k < x_{k+1}$ for $k \in \{1, \dots, K-1\}$.
- The number of elements in class $k$: $n_k$, with $n = \sum_{k=1}^K n_k$.
- The frequency: $f_k = n_k / n$, with $\sum_{k=1}^K f_k = 1$.

A discrete variable can be represented in a table:

$(k, x_k, n_k)_{k=1}^K$   (1)

or, more completely:

$(k, x_k, n_k, f_k)_{k=1}^K$   (2)

The cumulative number of elements is $s_k = \sum_{l=1}^k n_l$, with $s_K = n$. The cumulative frequency is $\Phi_k = \sum_{l=1}^k f_l$.

Below is a table representing a simple statistical series where $\Omega$ is 40 children and $X : \Omega \to \mathbb{R}$ is their age in years.

k   x_k   n_k   s_k   f_k    Φ_k
1    6     4     4    0.10   0.10
2    7     8    12    0.20   0.30
3    8    14    26    0.35   0.65
4    9    12    38    0.30   0.95
5   10     2    40    0.05   1.00

The table can be represented by a bar graph.

The empirical distribution function $\Phi : \mathbb{R} \to [0, 1]$ is defined by $\Phi(x) :=$ the proportion of the range of $X$ that is $\le x$. Thus $\Phi$ is monotone increasing with $\Phi(-\infty) = 0$, $\Phi(+\infty) = 1$, and $\Phi(x_k) = \Phi_k$; between the values of the range of $X$, $\Phi$ is constant. The function $\Phi$ can be deduced from the bar graph of $X$.
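As an illustration (not part of the original notes), the columns of the table above can be computed from raw data. A minimal Python sketch using the 40 ages:

```python
from collections import Counter
from itertools import accumulate

# Ages of the 40 children from the table above.
ages = [6]*4 + [7]*8 + [8]*14 + [9]*12 + [10]*2
n = len(ages)

counts = sorted(Counter(ages).items())   # [(x_k, n_k), ...] sorted by value
xs   = [x for x, _ in counts]            # class values x_k
nks  = [c for _, c in counts]            # class counts n_k
fks  = [c / n for c in nks]              # frequencies f_k
sks  = list(accumulate(nks))             # cumulative counts s_k
phis = list(accumulate(fks))             # cumulative frequencies Phi_k

def ecdf(x):
    """Empirical distribution function: proportion of values <= x."""
    return sum(v <= x for v in ages) / n
```

For example, `ecdf(8)` returns 0.65, matching $\Phi_3$ in the table.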

The stem-and-leaf diagram, introduced by J. W. Tukey (1977), gives more information than a barplot:

The decimal point is at the |

   6 | 0000
   7 | 00000000
   8 | 00000000000000
   9 | 000000000000
  10 | 00

If the variable $X$ is continuous rather than discrete, classes are prescribed so that it can be treated as if it were discrete. The number of classes and the bounds of the classes are left to the software package, or chosen to make the graphics look good. A rule of thumb is to use about $\sqrt{n}$ classes. The data can then be presented in a table of the form

$\left(k, [b_{k-1}, b_k[, n_k\right)_{k=1}^K$ such that $\sum_{k=1}^K n_k = n$.

We define the amplitude of a class as $a_k := b_k - b_{k-1}$. To simplify, we will only consider constant amplitudes. Rather than a barplot representing the frequencies of $X$, we use a histogram. The height of a class in the histogram usually corresponds to the number of elements in the class; the frequency of the class can also be used.
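The class-construction procedure just described can be sketched in Python. The data below are simulated and the variable names are illustrative, not part of the course:

```python
import math
import random

random.seed(0)
data = [random.gauss(170, 10) for _ in range(100)]  # hypothetical continuous sample
n = len(data)

K = round(math.sqrt(n))            # rule of thumb: about sqrt(n) classes
lo, hi = min(data), max(data)
a = (hi - lo) / K                  # constant amplitude a_k = b_k - b_{k-1}
bounds = [lo + k * a for k in range(K + 1)]  # class bounds b_0, ..., b_K

# n_k: number of observations in [b_{k-1}, b_k[ (last class closed on the right).
counts = [0] * K
for x in data:
    k = min(int((x - lo) // a), K - 1)
    counts[k] += 1
```

The `counts` list then plays the role of the $n_k$ column, and a histogram is just a bar plot of these counts over the bounds.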

The definition of the empirical distribution function does not change. One can think of the empirical distribution function as the integral of the histogram of $X$ (using frequencies), just as a distribution function is the integral of the probability density function in probability theory.

1.2 Two variables

Suppose that $X : \Omega \to \omega_1$ and $Y : \Omega \to \omega_2$; that is, using the equivalence $\Omega \simeq \{1, \dots, n\}$, we have couples $(x_i, y_i)_{i=1}^n$. Let $X_1, \dots, X_j, \dots, X_J$ and $Y_1, \dots, Y_k, \dots, Y_K$ designate the states of $X$ and $Y$ respectively (quantitative or nominal), and let $n_{jk}$ be the number of elements with state $X_j$ and state $Y_k$.

X\Y     Y_1  ...  Y_k  ...  Y_K  | total
X_1     n_11 ...  n_1k ...  n_1K | n_1.
...
X_j     n_j1 ...  n_jk ...  n_jK | n_j.
...
X_J     n_J1 ...  n_Jk ...  n_JK | n_J.
total   n_.1 ...  n_.k ...  n_.K | n

Table 1: Contingency table

where $n_{j.} = \sum_k n_{jk}$, $n_{.k} = \sum_j n_{jk}$, and $n = \sum_j n_{j.} = \sum_k n_{.k}$. The contingency table can be normalised to a frequency table by dividing each entry by $n$ to obtain $f_{jk} = n_{jk}/n$. The conditional frequencies are used to investigate the independence of the variables $X$ and $Y$:

- Conditional row profile for row $j$: $n_{jk}/n_{j.}$.
- Conditional column profile for column $k$: $n_{jk}/n_{.k}$.

1.3 Summary statistics of a quantitative variable X

1.3.1 Measures of the center

Definition 2.
- The mode := the class (or center of the class) with the highest frequency.
- The median $Q_2$ is defined by $\Phi(Q_2) \ge 0.5$ and $\Phi(x) < 0.5$ for all $x < Q_2$.
- The arithmetic mean or average: $\bar{x} := \frac{1}{n} \sum_{i=1}^n x_i$.
- The geometric mean: $GM := \left(\prod_{i=1}^n x_i\right)^{1/n}$.

Note that if $y_i = x_i + a$, then $\bar{y} = \bar{x} + a$.

1.3.2 Measures of dispersion

- The range: $R = x_{\max} - x_{\min}$.
- The interquartile range: $IQR := Q_3 - Q_1$.
- The variance: $\mathrm{Var}\,X = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 = \left(\frac{1}{n} \sum_{i=1}^n x_i^2\right) - \bar{x}^2$.

- To estimate the variance, the unbiased estimator is used: $s_X^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$.
- The $k$th central moment: $\mu_k := \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^k$.
- The standard deviation: $\sigma_X = \sqrt{\mathrm{Var}\,X} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2}$.
- The coefficient of variation: $c_X = \sigma_X / \bar{x}$.
- etc.

Note that $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}\,X$.

2 Graphics

Graphics should clarify, and must therefore be readable: a graph should have a title and clearly labelled scales.

2.0.3 Nominal variables
- Column graph
- Bar graph
- Pie chart

2.0.4 Discrete quantitative variable
- Bar graph
- Empirical distribution function

2.1 Continuous quantitative variable
- Histogram
- Empirical distribution function

2.1.1 Two quantitative variables
- Scatterplot
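Returning to the numerical summaries of Section 1.3, here is a minimal Python sketch (not from the original notes) computing them for the ages data of Section 1.1.2, including a check of the shift property $\bar{y} = \bar{x} + a$:

```python
import math

xs = [6]*4 + [7]*8 + [8]*14 + [9]*12 + [10]*2   # ages data from Section 1.1.2
n = len(xs)

mean = sum(xs) / n
var_pop = sum((x - mean)**2 for x in xs) / n     # Var X (divides by n)
var_unbiased = n / (n - 1) * var_pop             # s^2_X, unbiased estimator
sigma = math.sqrt(var_pop)                       # standard deviation
cv = sigma / mean                                # coefficient of variation

# Shift property: if y_i = x_i + a, then the mean shifts by a
# while the variance is unchanged.
shift = 3.0
ys = [x + shift for x in xs]
```

Here `mean` is 8.0 and `var_pop` is 1.1, so $s_X^2 = 44/39 \approx 1.128$.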

2.2 Link between two variables

The scalar product of two vectors $X$ and $Y$ in $\mathbb{R}^n$ is defined as $(X, Y) = \sum_{i=1}^n x_i y_i$. Note that $\|X\|^2 = (X, X)$. We also have $(X, Y) = \|X\| \|Y\| \cos\theta$, where $\theta$ is the angle between $X$ and $Y$.

2.2.1 Two quantitative variables X and Y

The data $(i, x(i), y(i))_{i=1}^n$ can be represented as vectors $X, Y \in \mathbb{R}^n$. We also define $\bar{X} := \bar{x}(1, \dots, 1)$. We have

$\mathrm{Var}\,X = \frac{1}{n}(X - \bar{X}, X - \bar{X}) = \frac{1}{n}\|X - \bar{X}\|^2$.

We compute:

- The covariance: $\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \left(\frac{1}{n} \sum_{i=1}^n x_i y_i\right) - \bar{x}\bar{y} = \frac{1}{n}(X - \bar{X}, Y - \bar{Y})$.

  Property: $\mathrm{Var}(X + Y) = \mathrm{Var}\,X + \mathrm{Var}\,Y + 2\,\mathrm{Cov}(X, Y)$.

- The Bravais–Pearson correlation coefficient: $r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \cos\theta$, where $\theta$ is the angle between $X - \bar{X}$ and $Y - \bar{Y}$.

- $R^2 = \|\hat{Y} - \bar{Y}\|^2 / \|Y - \bar{Y}\|^2 = \cos^2\theta = r^2$, where $\hat{Y}$ is defined below.

Linear regression, or least squares regression, consists of projecting $Y - \bar{Y}$ onto the span of $X - \bar{X}$. In other words, we solve the problem

$\min_{a \in \mathbb{R}} \|Y - \bar{Y} - a(X - \bar{X})\|^2$   (3)

to obtain the projection of $Y - \bar{Y}$ onto the space spanned by $X - \bar{X}$. The projection $\hat{a}(X - \bar{X})$ is called $\hat{Y} - \bar{Y}$. Equivalently, we suppose $y_i = f(x_i) = a x_i + b + \epsilon_i$ and compute $\hat{a}$ and $\hat{b}$ such that $\sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat{a} x_i - \hat{b})^2$ is minimal. One obtains

$\hat{a} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}\,X}, \qquad \hat{b} = \bar{y} - \hat{a}\bar{x}$.

Remark 1. The equation $\hat{Y} = \hat{a}(x - \bar{x}) + \bar{y}$ defines the linear regression line, which goes through the point $(\bar{x}, \bar{y})$.
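The least-squares formulas above can be sketched in Python. The paired data are made up for illustration; the point is to check that $\hat{a} = \mathrm{Cov}(X,Y)/\mathrm{Var}\,X$, $\hat{b} = \bar{y} - \hat{a}\bar{x}$, and $R^2 = r^2$:

```python
# Hypothetical paired data (x_i, y_i); values are illustrative only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.9]
n = len(xs)

xbar = sum(xs) / n
ybar = sum(ys) / n
cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
var_x = sum((x - xbar)**2 for x in xs) / n
var_y = sum((y - ybar)**2 for y in ys) / n

a_hat = cov / var_x                   # slope: Cov(X, Y) / Var X
b_hat = ybar - a_hat * xbar           # intercept
r = cov / (var_x**0.5 * var_y**0.5)   # Bravais-Pearson correlation

# R^2 = ||Yhat - Ybar||^2 / ||Y - Ybar||^2 equals r^2 in simple regression.
y_hat = [a_hat * x + b_hat for x in xs]
r2 = sum((yh - ybar)**2 for yh in y_hat) / sum((y - ybar)**2 for y in ys)
```

With these numbers the fitted line is roughly $\hat{y} = 1.98\,x + 0.1$, and it passes through $(\bar{x}, \bar{y})$ as Remark 1 states.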

From the contingency table (Table 1) one can calculate:

- The mean: $\bar{x} := \frac{1}{n} \sum_{i=1}^I n_{i.} x_i$.
- The variance: $\sigma^2(X) := \frac{1}{n} \sum_{i=1}^I n_{i.} (x_i - \bar{x})^2$.
- The conditional means: $\bar{x}_j := \frac{1}{n_{.j}} \sum_{i=1}^I n_{ij} x_i$.
- The conditional variances: $\sigma_j^2(X) := \frac{1}{n_{.j}} \sum_{i=1}^I n_{ij} (x_i - \bar{x}_j)^2$.
- The variance of the conditional means of $X$, or explained variance: $\sigma^2(\bar{x}_j) := \frac{1}{n} \sum_{j=1}^J n_{.j} (\bar{x}_j - \bar{x})^2$.
- The average conditional variance of $X$, or residual variance: $\bar{\sigma}_j^2(X) := \frac{1}{n} \sum_{j=1}^J n_{.j} \sigma_j^2(X)$.
- The correlation ratios $\eta^2_{X/Y}$ and $\eta^2_{Y/X}$:

$\eta^2_{X/Y} := \frac{\sigma^2(\bar{x}_j)}{\sigma^2(X)}, \qquad \eta^2_{Y/X} := \frac{\sigma^2(\bar{y}_i)}{\sigma^2(Y)}$.

The correlation ratios are used to test for a functional relationship (not necessarily linear) between $X$ and $Y$.
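The correlation ratio can be computed directly from grouped data. A sketch with made-up groups (the group labels and values are illustrative), which also checks the decomposition "total variance = explained + residual":

```python
# Hypothetical grouped data: values of X observed within J = 3 states of Y.
groups = {
    "Y1": [4.0, 5.0, 6.0],
    "Y2": [7.0, 8.0, 9.0, 8.0],
    "Y3": [12.0, 11.0, 13.0],
}

allx = [x for g in groups.values() for x in g]
n = len(allx)
xbar = sum(allx) / n
total_var = sum((x - xbar)**2 for x in allx) / n

# Explained variance: variance of the conditional means, weighted by group size.
explained = sum(len(g) * (sum(g)/len(g) - xbar)**2 for g in groups.values()) / n
# Residual variance: average of the conditional variances.
residual = sum(
    sum((x - sum(g)/len(g))**2 for x in g) for g in groups.values()
) / n

eta2 = explained / total_var   # correlation ratio eta^2_{X/Y}
```

Here `eta2` is about 0.925, indicating a strong (not necessarily linear) dependence of $X$ on the state of $Y$.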

2.2.2 Time Series

Indices.

Elementary indices: $I_0^n = P_n / P_0$. Properties:
- Circularity: $I_0^n = I_m^n \, I_0^m$ for any intermediate date $m$.
- Reversibility: $I_n^0 = 1 / I_0^n$.
- Chains: $I_0^n = I_{n-1}^n \, I_{n-2}^{n-1} \cdots I_0^1$.

Synthetic indices: let $\{B_i\}_{i=1}^N$ be $N$ goods with respective prices $\{p_0^i\}_{i=1}^N$ at $t = t_0$ and $\{p_n^i\}_{i=1}^N$ at $t = t_n$, bought in respective quantities $\{q_0^i\}$ at $t = t_0$ and $\{q_n^i\}$ at $t = t_n$.

- Simple average index: $I_0^n = \frac{\sum_{i=1}^N p_n^i}{\sum_{i=1}^N p_0^i}$.
- Average of indices: $I_0^n = \frac{1}{N} \sum_{i=1}^N I_0^{n,i}$, where $I_0^{n,i} = p_n^i / p_0^i$ is the elementary index of good $i$.
- The Laspeyres index: $L_0^n := \frac{\sum_{i=1}^N p_n^i q_0^i}{\sum_{i=1}^N p_0^i q_0^i} = \sum_i \rho_i \frac{p_n^i}{p_0^i}$, where $\rho_i := p_0^i q_0^i / \sum_j p_0^j q_0^j$ is the expenditure share of good $i$ at $t_0$.
- The Paasche index: $P_0^n := \frac{\sum_{i=1}^N p_n^i q_n^i}{\sum_{i=1}^N p_0^i q_n^i} = \frac{1}{\sum_i \rho_i' \frac{p_0^i}{p_n^i}}$, where $\rho_i' := p_n^i q_n^i / \sum_j p_n^j q_n^j$ is the expenditure share of good $i$ at $t_n$.

A simple time series is a data set of the form $(t, x_t)_{t=1}^T$. Let

$T_t :=$ the trend   (4)
$s_t :=$ seasonal variations   (5)
$c_t :=$ cyclical variations   (6)
$\epsilon_t :=$ random variations.   (7)

We consider the model $x_t = T_t + s_t + c_t + \epsilon_t$.
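The Laspeyres and Paasche indices above are easy to compute directly; here is a small sketch with made-up prices and quantities for three goods:

```python
# Hypothetical basket of N = 3 goods: base-period (t0) and current (tn) data.
p0 = [10.0, 4.0, 2.5]   # prices at t0
pn = [12.0, 5.0, 2.0]   # prices at tn
q0 = [3.0, 10.0, 8.0]   # quantities at t0
qn = [2.0, 12.0, 9.0]   # quantities at tn

# Laspeyres: current prices weighted by base-period quantities.
laspeyres = sum(p * q for p, q in zip(pn, q0)) / sum(p * q for p, q in zip(p0, q0))
# Paasche: current prices weighted by current-period quantities.
paasche = sum(p * q for p, q in zip(pn, qn)) / sum(p * q for p, q in zip(p0, qn))
```

With these numbers $L_0^n = 102/90 \approx 1.133$ and $P_0^n = 102/90.5 \approx 1.127$; the two indices differ because the quantity weights differ.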

In certain cases $x_t = T_t \, s_t \, c_t \, \epsilon_t$, in which case we consider $\log x_t$ to reduce to the first case. We suppose $s_t$ to be periodic with zero average; that is, there is a $k$ such that

$s_t = s_{t+k}$ for all $t$, and $\sum_{i=0}^{k-1} s_{t+i} = 0$ for all $t$.

The moving averages of order $k$:

$y_t := \frac{x_{t-p} + x_{t-p+1} + \cdots + x_t + \cdots + x_{t+p}}{2p + 1}$ for $k = 2p + 1$,   (8)

$y_t := \frac{x_{t-p}/2 + x_{t-p+1} + \cdots + x_{t+p-1} + x_{t+p}/2}{2p}$ for $k = 2p$.   (9)

Forecasting:
- Step 1: linear regression of $y_t$ as a function of $t$: $\hat{y}_t = \hat{a} t + \hat{b}$.
- Step 2: estimate the seasonal values: $\hat{s}_t = x_t - \hat{T}_t = x_t - \hat{y}_t$.
- Step 3: because each $s_t$ has been estimated several times, take the average $\hat{s}_t := \frac{1}{I} \sum_{i=1}^I \hat{s}_{t+ik}$.

Thus $\hat{x}_t = \hat{y}_t + \hat{s}_t$.

2.2.3 Two nominal variables

From the contingency table (cf. Table 1) we may test whether two variables $X$ and $Y$ are independent with the $\chi^2$ test of independence:

$\chi^2 = \sum_{l=1}^L \sum_{c=1}^C \frac{(n_{lc} - E_{lc})^2}{E_{lc}}$,

where $n_{lc} :=$ the observed count and $E_{lc} := \frac{n_{l.} \, n_{.c}}{n}$. Note that $E_{lc}$ is the expected count if the variables are independent.
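The $\chi^2$ statistic can be computed straight from the formula. A minimal sketch with a hypothetical $2 \times 3$ table of counts:

```python
# Observed contingency table (hypothetical counts), L = 2 rows, C = 3 columns.
obs = [
    [20, 30, 10],
    [10, 20, 30],
]
L, C = len(obs), len(obs[0])

row = [sum(r) for r in obs]                                  # n_l.
col = [sum(obs[l][c] for l in range(L)) for c in range(C)]   # n_.c
n = sum(row)

# Expected counts under independence: E_lc = n_l. * n_.c / n
exp = [[row[l] * col[c] / n for c in range(C)] for l in range(L)]

chi2 = sum(
    (obs[l][c] - exp[l][c])**2 / exp[l][c]
    for l in range(L) for c in range(C)
)
```

For this table $\chi^2 = 46/3 \approx 15.33$, well within the bound $n[\min(L, C) - 1] = 120$ discussed in the remark below.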

Remark 2.
1. If $X$ and $Y$ have no interdependence, one expects $\chi^2$ to be close to 0.
2. $\chi^2$ depends on $n$, $L$, and $C$, and $\lim_{n \to \infty} \chi^2$ follows a known distribution (the $\chi^2$ distribution with $(L-1)(C-1)$ degrees of freedom). If $L, C \ge 2$, we have the following bounds on $\chi^2$: $0 \le \chi^2 \le n[\min(L, C) - 1]$.

For large $n$, we approximate $\chi^2$ with the limit distribution to test the hypothesis

$H_0$: $X$ and $Y$ are independent.

$H_0$ is rejected if the $\chi^2$ statistic is sufficiently large. More precisely, we fix our tolerance for error, usually at either 5% or 1%. A bound is computed such that the probability of $\chi^2$ exceeding the bound under $H_0$ equals the tolerated error; if our computed statistic exceeds this bound, $H_0$ is rejected.
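This decision rule can be checked by hand in the special case $(L-1)(C-1) = 2$ degrees of freedom, where the $\chi^2$ survival function has the closed form $P(\chi^2_2 > x) = e^{-x/2}$ (in general one would use a $\chi^2$ table or a statistics library). A sketch with a hypothetical statistic value:

```python
import math

# For df = 2 the chi-squared survival function is P(chi2_2 > x) = exp(-x/2),
# so the p-value can be computed without a table.
L, C = 2, 3
df = (L - 1) * (C - 1)            # 2 degrees of freedom

chi2_stat = 15.33                 # hypothetical statistic from an observed 2x3 table
p_value = math.exp(-chi2_stat / 2)
reject_at_5pct = p_value < 0.05   # reject H0 at the 5% error tolerance
```

Equivalently, the 5% critical bound for 2 degrees of freedom is $-2 \ln 0.05 \approx 5.99$; since $15.33 > 5.99$, $H_0$ is rejected.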