Lecture 3. The Population Variance


Lecture 5

Lecture 3. The Population Variance

The population variance, denoted $\sigma^2$, is the sum of the squared deviations about the population mean divided by the number of observations in the population, $N$:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}.$$

An alternative formula is:

$$\sigma^2 = \frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2 = \frac{\sum x_i^2}{N} - \mu^2.$$

REMARK: To avoid round-off errors, which accumulate quickly in these formulas, do not round until the last computation, and use as many decimal places as your calculator allows.
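As a rough illustration of the two population variance formulas, the following Python sketch computes both on a small data set and shows that they agree. The data values and the function names are made up for this example and do not come from the lecture.

```python
def population_variance(values):
    """Definition: mean of the squared deviations about the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Alternative form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


# Hypothetical population of N = 8 observations (assumed for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]

var_definition = population_variance(data)
var_shortcut = population_variance_shortcut(data)
print(var_definition, var_shortcut)  # both are 4.0 for this data set
```

No rounding is done inside either function, in line with the remark about rounding only at the last step.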

The Sample Variance

When the population is large, we approximate the population mean $\mu$ with the sample mean, $\bar{x}$. Similarly, we approximate the population variance $\sigma^2$ by the sample variance, denoted $s^2$:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}.$$

The alternative form is:

$$s^2 = \frac{\sum x_i^2}{n - 1} - \frac{\left(\sum x_i\right)^2}{n(n - 1)}.$$

REMARK: Notice that we divide by the sample size minus one (this is different from the formula for the population variance).

Informally, we say: a sample of size $n$ has $n$ degrees of freedom; one degree of freedom is used up in computing $\bar{x}$, so only $n - 1$ degrees of freedom remain for the sample variance.
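The sample variance can be sketched in Python the same way, dividing by n - 1 rather than n. The data set and the function names below are assumptions chosen for illustration; the result is cross-checked against `statistics.variance` from the standard library.

```python
import statistics


def sample_variance(values):
    """Definition: squared deviations about the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Alternative form: sum(x^2)/(n-1) - (sum x)^2 / (n(n-1))."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return total_sq / (n - 1) - total ** 2 / (n * (n - 1))


# Hypothetical sample of n = 8 observations (assumed for illustration).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

s2_definition = sample_variance(sample)
s2_shortcut = sample_variance_shortcut(sample)
s2_library = statistics.variance(sample)  # also divides by n - 1
print(s2_definition, s2_shortcut, s2_library)
```

For the same numbers, the sample variance comes out larger than the population variance, because the same sum of squared deviations is divided by 7 instead of 8.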

The Standard Deviation

For both cases (the population or the sample), the standard deviation is the square root of the corresponding variance.

The population standard deviation is denoted by $\sigma$: $\sigma = \sqrt{\sigma^2}$.

The sample standard deviation is denoted by $s$: $s = \sqrt{s^2}$.

Advantage of the (population or sample) standard deviation: it is given in the same units as the observations.

Advantage of the (population or sample) variance: it is easier to manipulate algebraically, in some cases.

Both the standard deviations and the variances are interpreted as follows: the larger they are, the more spread out the distribution is (if they equal 0, the smallest possible value, then all observations must be equal).

Remark 1. Standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of center.

Remark 2. Standard deviation is not robust.

Remark 3. The sum of the deviations of the observations from their mean is always zero.
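Remarks 2 and 3 can be checked numerically. The short Python sketch below uses a made-up data set (an assumption for illustration): it sums the deviations from the mean, then replaces one observation with an extreme value to show how strongly the standard deviation reacts.

```python
import statistics

# Hypothetical observations (assumed for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Remark 3: deviations from the mean sum to zero.
mean = statistics.mean(data)
deviation_sum = sum(x - mean for x in data)

# Remark 2: a single outlier changes the standard deviation a lot.
with_outlier = data[:-1] + [90]  # replace the last value, 9, by 90
spread_original = statistics.stdev(data)
spread_outlier = statistics.stdev(with_outlier)

print("sum of deviations:", deviation_sum)
print("s without outlier:", spread_original)
print("s with outlier:   ", spread_outlier)
```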

Density curves

Histograms are approximations to the exact distribution of a variable. Increasing the number of classes in a histogram makes each rectangle narrower, and as the number of rectangles approaches infinity, the graph becomes a curve, called a density curve.

Properties of the density curve
1. The curve is always on or above the x-axis: the function f(x) describing the curve is nonnegative (it could be zero) for all x.
2. The total area underneath the curve and above the x-axis equals 1.
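The two properties can be checked numerically for a specific curve. The sketch below is only an illustration: the triangular density f(x) = 1 - |x| on [-1, 1] and the trapezoid-rule helper are assumptions chosen for this example, not something taken from the lecture.

```python
def triangular_density(x):
    """Hypothetical density: f(x) = 1 - |x| on [-1, 1], and 0 elsewhere."""
    return max(0.0, 1.0 - abs(x))


def trapezoid_area(f, a, b, steps=1000):
    """Approximate the area under f between a and b with the trapezoid rule."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


# Property 1: the curve never goes below the x-axis (checked on a grid).
grid = [-1.5 + i * 0.01 for i in range(301)]
lowest_value = min(triangular_density(x) for x in grid)

# Property 2: the total area under the curve is 1.
total_area = trapezoid_area(triangular_density, -1.0, 1.0)

print(lowest_value, total_area)
```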

Density curves, as we saw, have means, medians, and modes, as well as standard deviations. The notation is similar to that used for the population mean and standard deviation (why?). Most of the time we use software to estimate density curves. Often we assume that the data follow a certain density curve.

The normal distribution

Often called the Gaussian curve, the normal curve was introduced by Carl Friedrich Gauss in 1809 as an error curve for least squares regression, about which we will talk next time. There are other symmetric bell-shaped density curves that are not normal.

Remark 4. The curve is described completely by 2 parameters: µ, the mean, and σ, the standard deviation.

The Empirical Rule

If the distribution is approximately bell shaped (not only normal), then:
1. Approximately 68% of the data will lie within one standard deviation of the mean. That is, about 68% of the data will be between µ − σ and µ + σ.
2. Approximately 95% of the data will lie within two standard deviations of the mean.
3. Approximately 99.7% of the data will lie within three standard deviations of the mean.

For exact values, we need to integrate to find the area between two points.
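For a normal curve specifically, the exact areas behind the 68%, 95%, and 99.7% figures can be computed without a table. The sketch below relies on a standard identity that is not stated in the lecture: for a normal distribution, the area within k standard deviations of the mean equals erf(k / sqrt(2)). The function name is made up for this example.

```python
import math


def normal_area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_area_within(k):.4f}")
# prints 0.6827, 0.9545, 0.9973 for k = 1, 2, 3
```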

In general, for any distribution, not only the normal distribution, Chebyshev's rule can be applied: the proportion of values from a data set that fall within $k$ standard deviations of the mean is at least

$$\left(1 - \frac{1}{k^2}\right) \cdot 100\%, \quad \text{where } k > 1.$$

This rule can be applied to samples too.
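Chebyshev's rule can be tried on a data set that is far from bell shaped. In the sketch below, the skewed data set and the helper names are assumptions made up for illustration; the observed proportion within k standard deviations is compared with the guaranteed minimum 1 - 1/k².

```python
import statistics


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


# Hypothetical, strongly skewed data set (assumed for illustration).
skewed = [1, 1, 2, 2, 3, 3, 4, 50]

for k in (2, 3):
    print(k, chebyshev_bound(k), proportion_within(skewed, k))
```

The bound is only a guarantee: the observed proportion is usually well above it, as the empirical rule suggests for bell-shaped data.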

Finding the area under the normal density curve is not an easy task. It requires a lot of calculus. One way of avoiding this is to use tables that give us these areas (probabilities). But for each µ and σ we would need a new table. How can we avoid this? By transforming all these curves into a standard one: choose µ = 0 and σ² = 1.

Standardizing

Convert other values to standard units, or z-scores, by subtracting the mean and dividing by the standard deviation:

$$z = \frac{x - \mu}{\sigma}$$

Example:
1. Standardize x = 3 with µ = 2 and σ = 4.
2. What z-score range corresponds to (8, 17) with µ = 12 and σ² = 9?
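One way to work through this example is the short Python sketch below. The function name is an assumption for illustration; note that the second part gives the variance σ² = 9, so the standard deviation used in the formula is σ = 3.

```python
import math


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


# Part 1: x = 3 with mean 2 and standard deviation 4.
z_single = z_score(3, 2, 4)

# Part 2: the interval (8, 17) with mean 12 and variance 9.
sigma = math.sqrt(9)  # standard deviation is the square root of the variance
z_low = z_score(8, 12, sigma)
z_high = z_score(17, 12, sigma)

print(z_single)       # 0.25
print(z_low, z_high)  # about -1.33 and 1.67
```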

Interpretation: z is the number of standard deviations that x is away from the mean. The z-score is unit free. We can use it to compare observations from different sources ("apples to oranges").

Notation

The standard normal distribution is denoted by N(0, 1), and any other normal distribution with mean µ and standard deviation σ by N(µ, σ).

Relations between variables. Scatter diagrams

In practice, statisticians are interested in relationships among multiple variables. For 2 variables, each observation is a pair of values, one for each variable. Sometimes we use the value of one variable in order to predict another variable. The response variable is the variable whose value can be explained by, or is determined by, the value of the explanatory variable. The response variable measures the outcome of a study. An explanatory variable explains or causes changes in the response variable.

Example:

The relationship between two variables can be represented by crosstabulation, side-by-side or clustered bar graphs, and scatterplots.

Definition 5. A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual.

How to draw a scatter diagram: The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis. Each individual in the data set is represented by a point in the scatter diagram. Do not connect the points when drawing a scatter diagram.

How we interpret a scatter diagram

A scatter diagram may suggest a linear relationship, a nonlinear relationship, or no relation.

Definition 6. Two variables that are linearly related are said to be positively associated if, whenever the values of the predictor variable increase, the values of the response variable also increase, and they are said to be negatively associated if, whenever the values of the predictor variable increase, the values of the response variable decrease.

Be careful! Do not conclude causation from association.

Definition 7. The linear correlation coefficient is a measure of the strength of the linear relation between two quantitative variables. The sample correlation coefficient is computed by:

$$r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

where
$\bar{x}$ is the sample mean of the predictor variable,
$s_x$ is the sample standard deviation of the predictor variable,
$\bar{y}$ is the sample mean of the response variable,
$s_y$ is the sample standard deviation of the response variable,
$n$ is the number of individuals in the sample.

The population correlation coefficient is denoted by ρ.

Example: (0, 0), (1, 2), (2, 2), (3, 5), (4, 6)
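The sample correlation coefficient for these five points can be computed directly from the definition: standardize each x and each y with the sample mean and sample standard deviation, multiply the pairs, sum, and divide by n - 1. The Python sketch below is one possible way to do this; the function name is an assumption for illustration.

```python
import statistics


def sample_correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # divide by n - 1
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)


# The five points from the example.
points = [(0, 0), (1, 2), (2, 2), (3, 5), (4, 6)]
xs = [x for x, _ in points]
ys = [y for _, y in points]

r = sample_correlation(xs, ys)
print(round(r, 4))  # 0.9682
```

By hand: the means are 2 and 3, the sum of cross-products of deviations is 15, and the sums of squared deviations are 10 and 24, so r = 15 / sqrt(10 · 24) ≈ 0.968, a strong positive linear relation.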

Interpretation and properties of r

−1 ≤ r ≤ 1.

If r = 1, there is a perfect positive linear relation between the 2 variables.

If r = −1, there is a perfect negative linear relation between the 2 variables.

The closer r is to 1, the stronger the evidence of a positive linear relation, and the closer r is to −1, the stronger the evidence of a negative linear relation between the two variables.

If r is close to 0, there is little or no evidence of a linear relation between the 2 variables. This does not mean no relation, just no linear relation.

r is a unitless measure of association.

r is not resistant. It is strongly affected by outliers.

Both variables should be quantitative.