Week 11 Sample Means, CLT, Correlation

Similar documents
COMP6053 lecture: Sampling and the central limit theorem. Markus Brede,

appstats8.notebook October 11, 2016

COMP6053 lecture: Sampling and the central limit theorem. Jason Noble,

Lecture 30. DATA 8 Summer Regression Inference

Statistics and Data Analysis in Geology

Business Statistics. Lecture 5: Confidence Intervals

SAMPLING DISTRIBUTIONS

One-sample categorical data: approximate inference

Review of Multiple Regression

7 Estimation. 7.1 Population and Sample (P.91-92)

Sampling Distribution Models. Chapter 17

Z score indicates how far a raw score deviates from the sample mean in SD units. score Mean % Lower Bound

STAT 201 Assignment 6

STA Why Sampling? Module 6 The Sampling Distributions. Module Objectives

MATH 1150 Chapter 2 Notation and Terminology

Lecture 26: Chapter 10, Section 2 Inference for Quantitative Variable Confidence Interval with t

Section 5.4. Ken Ueda

Introduction to Linear Regression

THE SAMPLING DISTRIBUTION OF THE MEAN

Statistical Intervals (One sample) (Chs )

Section I Questions with parts

Lecture 18: Simple Linear Regression

Lecture 27. DATA 8 Spring Sample Averages. Slides created by John DeNero and Ani Adhikari

Sampling Distribution Models. Central Limit Theorem

Chapter 6 The Normal Distribution

Regression and correlation. Correlation & Regression, I. Regression & correlation. Regression vs. correlation. Involve bivariate, paired data, X & Y

MA 1125 Lecture 15 - The Standard Normal Distribution. Friday, October 6, Objectives: Introduce the standard normal distribution and table.

Series, Parallel, and other Resistance

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation?

Stat 101: Lecture 12. Summer 2006

Lecture # 31. Questions of Marks 3. Question: Solution:

ACMS Statistics for Life Sciences. Chapter 13: Sampling Distributions

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b).

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc.

HOLLOMAN S AP STATISTICS BVD CHAPTER 08, PAGE 1 OF 11. Figure 1 - Variation in the Response Variable

Lesson 3-2: Solving Linear Systems Algebraically

Math Lecture 3 Notes

Class 15. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700

Chapter 18. Sampling Distribution Models. Copyright 2010, 2007, 2004 Pearson Education, Inc.

STA Module 8 The Sampling Distribution of the Sample Mean. Rev.F08 1

CS 147: Computer Systems Performance Analysis

Business Statistics. Lecture 10: Course Review

Lecture 10: The Normal Distribution. So far all the random variables have been discrete.

Numerical Methods Lecture 7 - Statistics, Probability and Reliability

Vectors. Vector Practice Problems: Odd-numbered problems from

Lecture 17: Small-Sample Inferences for Normal Populations. Confidence intervals for µ when σ is unknown

Chapter 24. Comparing Means. Copyright 2010 Pearson Education, Inc.

Unit 4 Probability. Dr Mahmoud Alhussami

Correlation. January 11, 2018

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing

Business Statistics. Lecture 9: Simple Regression

Directional Derivatives and the Gradient

Chapter 6 Part 4. Confidence Intervals

You have 3 hours to complete the exam. Some questions are harder than others, so don t spend too long on any one question.

EXPERIMENT: REACTION TIME

Comparing Means from Two-Sample

Confidence intervals

Correlation and regression

SESSION 5 Descriptive Statistics

Lecture 2: Change of Basis

3.1 Inequalities - Graphing and Solving

Section B. The Theoretical Sampling Distribution of the Sample Mean and Its Estimate Based on a Single Sample

Ch. 7: Estimates and Sample Sizes

Review. Midterm Exam. Midterm Review. May 6th, 2015 AMS-UCSC. Spring Session 1 (Midterm Review) AMS-5 May 6th, / 24

Chapter 10 Regression Analysis

1.1 Administrative Stuff

Elementary Statistics Triola, Elementary Statistics 11/e Unit 17 The Basics of Hypotheses Testing

1 Review of the dot product

Intro to probability concepts

1. Descriptive stats methods for organizing and summarizing information

STA Module 5 Regression and Correlation. Learning Objectives. Learning Objectives (Cont.) Upon completing this module, you should be able to:

Introduction to Survey Analysis!

Chapter 8. Linear Regression. The Linear Model. Fat Versus Protein: An Example. The Linear Model (cont.) Residuals

Homework 1 2/7/2018 SOLUTIONS Exercise 1. (a) Graph the following sets (i) C = {x R x in Z} Answer:

review session gov 2000 gov 2000 () review session 1 / 38

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

POL 681 Lecture Notes: Statistical Interactions

STAT/SOC/CSSS 221 Statistical Concepts and Methods for the Social Sciences. Random Variables

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

10. Alternative case influence statistics

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Lecture 8 CORRELATION AND LINEAR REGRESSION

Unit 22: Sampling Distributions

Calculating Normal Distribution Probabilities

L6: Regression II. JJ Chen. July 2, 2015

Section 7.2 Homework Answers

Probability and Data Management AP Book 8, Part 2: Unit 2

5.9 Representations of Functions as a Power Series

Vector Basics, with Exercises

Reminders. Homework due tomorrow Quiz tomorrow

MTH What is a Difference Quotient?

Linear Algebra. Introduction. Marek Petrik 3/23/2017. Many slides adapted from Linear Algebra Lectures by Martin Scharlemann

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Descriptive Statistics (And a little bit on rounding and significant digits)

Vector Multiplication. Directional Derivatives and the Gradient

KCP e-learning. test user - ability basic maths revision. During your training, we will need to cover some ground using statistics.

CHAPTER 7 THE SAMPLING DISTRIBUTION OF THE MEAN. 7.1 Sampling Error; The need for Sampling Distributions

CSE 241 Class 1. Jeremy Buhler. August 24,

Understanding Inference: Confidence Intervals I. Questions about the Assignment. The Big Picture. Statistic vs. Parameter. Statistic vs.

Lecture 10A: Chapter 8, Section 1 Sampling Distributions: Proportions

Transcription:

Week 11 Sample Means, CLT, Correlation Slides by Suraj Rampure Fall 2017

Administrative Notes Complete the mid semester survey on Piazza by Nov. 8! If 85% of the class fills it out, everyone will get a bonus point (on something).

Variability of the Sample Mean

Variability of the Sample Mean Though you ve likely already seen it, this relationship will be very important when we start dealing with the Central Limit Theorem and correlation/regression. Source: https://www.inferentialthinking.com/chapters/12/5/variability-of-the-sample-mean.html

Sample Size SD of Sample Means

Overview of Data 8 Part 1 Part 2 Part 3 Tables and Visualization Statistical Inference Prediction We re about to transition to Part 3.

The Normal Distribution and the Central Limit Theorem (Review from lecture)

Normal Distribution Also known as the bell curve Several natural phenomena (eg. human heights and weights) are distributed normally. Distributions are unique defined by their parameters. The normal distribution has two parameters: the mean and the variance.

Histogram of human heights, separated into male and female Source: http://www.usablestats.com/lessons/normal Histogram of CS 70 midterm scores Source: My own misery

Properties of the Normal Distribution If some quantity has an (approximately) normal distribution, then 68% of values are within 1 SD of the mean 95% of values are within 2 SD of the mean 99.7% of values are within 3 SD of the mean These three facts are crucial to this chapter. Remember them!

Properties of the Normal Distribution If some quantity has an (approximately) normal distribution, then These three facts are crucial to this chapter. Remember them!

What is the Central Limit Theorem?

This picture appeared when I searched Central Limit Theorem so I thought I d include it.

Central Limit Theorem If you have a sample that is both: LARGE AND DRAWN AT RANDOM WITH REPLACEMENT THEN regardless of the distribution of the population, the probability distribution of the sample sum/average is approximately normal.

Why do we care about the CLT? If we have a large, random sample, and are looking to estimate means, the CLT holds. If the CLT holds, we can assume several useful properties of the normal distribution. We ll usually use the fact that we have a good estimate on the proportion of values that lie between n standard deviations of the mean. This will help us calculate confidence intervals quickly!

Confidence Intervals with the CLT Consider the population of 30,000 of Berkeley undergraduates. As per usual, we d like to find the mean height of all students with a high degree of confidence. Of course, we don t have the time (nor energy) to survey all 30,000 students for their heights, but we can survey a reasonably large sample. Assume the population has a standard deviation of 3 inches. Suppose we d like the width of our 95% confidence interval to be 1 inch. How many samples must we take in this case?

Suppose we d like the width of our 95% confidence interval to be 1 inch. How many samples must we take in this case? Since we re dealing with a large, random sample, the CLT holds! We know that 95% of values lie within 2 standard deviations of the mean. This means that the width of our confidence interval will be 4 standard deviations. Therefore, we must choose a sample size of at least 144 in order to have a 95% CI of 1 inch.

Confidence Intervals with the CLT Consider the population of 30,000 of Berkeley undergraduates. As per usual, we d like to find the mean height of all students with a high degree of confidence. Of course, we don t have the time (nor energy) to survey all 30,000 students for their heights, but we can survey a reasonably large sample. Assume the population has a standard deviation of 3 inches. Now, suppose that we took a sample (with replacement) of 3,650 students, and we found that the sample mean was 58 inches. What distribution will your sample means follow, and what parameters will it have?

Now, suppose that we took a sample (with replacement) of 3,650 students, and we found that the sample mean was 58 inches. What distribution will your sample means follow, and what parameters will it have? Since we have a large, random sample, we can apply the CLT. The sample means will follow the normal distribution, with parameters mean (µ) = 58 and standard deviation = 3/sqrt(3650), by the relationship between the SD of sample means and the SD of populations.

What is this concept of standard units?

Standard Units If you look at graphs of several normal distributions with different means and SDs, you ll probably notice something. Their shape is more or less the same. In order to make calculations convenient and standardized, we always convert our values to standard units when dealing with normal distributions. (You ll see later we convert to standard units for other purposes, as well.) In statistics, we deal with the standard normal distribution, that is, the normal distribution with mean 0 and standard deviation 1.

How to Convert to Standard Units Given a set of values x 1, x 2, x n, with mean µ and standard deviation s, we convert each x i to standard units by the rule zi = (xi- µ) / s By converting to standard units, we re effectively changing each value from its raw quantity to the number of standard deviations it is away from the mean. Standard units have no units! Sometimes, in other contexts, this is referred to as the z-score. We won't use that terminology in this class, but now you know!

How to Convert to Standard Units For example, let s convert the following five weights to standard units. 120 lbs, 160 lbs, 230 lbs, 140 lbs, 180 lbs µ = (120 + 160 + 230 + 140 + 180 lbs) / 5 = 166 lbs s = np.std(make_array(120, 160, 230, 140, 180)) 38 lbs original 120 lb std units -1.21 160 lb -0.16 230 lb 1.68 140 lb -0.68 180 lb 0.37

Correlation

Correlation Correlation gives us a way to describe the linear relationship between two sets of data. More specifically, the correlation coefficient r is a number between -1 and 1. If r = 1, then the two sets of data lie exactly on some line with a positive slope. If r = 0, then the two sets of data are uncorrelated. If r = -1, then the two sets of data lie exactly on some line with a negative slope. Values of r in between these describe relationships that are weakly/ strongly positively/negatively correlated, depending on the value.

Various Values of the Correlation Coefficient r = 0.9 r=0 r = -0.55

Various Values of the Correlation Coefficient r=0 With the correlation coefficient, we only look at one type of relationship linear. Even though the graph to the left plots points on the equation y = x**2, that isn t a linear relationship, so r = 0.

Calculation and Properties of r r is usually calculated using a relatively complicated expression. With the functions we have at our disposal, though, it s relatively simple. r is the mean of the product of x and y, when both are measured in standard units. We first convert to standard units before calculating r because correlation only looks at the shape of the distribution of two variables, not the values themselves. By converting to std. units, we can compare values that are at different scales easily.

Calculation and Properties of r r is a unit-less value. This is because it is based off of values in standard units, which themselves are unit-less values. Remember, r just looks at the shape of the distribution. Therefore, you can add, subtract, multiply or divide all of your values by any constant, and the value of r won t change, so long as you make this change to each of your values. correlation(<tbl name>, <col A name>, <col B name>) calculates r for columns A, B in the specified table. This definition isn t built-in to datascience, but this is often how it will be implemented in assignments.

std_units converts arrays to standard units, and correlation finds r for any two given arrays. Notice that when we shift the values of x and y, the value of r doesn t change.

Let heights and weights be columns in the table people. Is correlation(people, heights, weights )always equal to correlation(people, weights, heights )? Yes. r is the average of the product of X, Y in standard units, and multiplication is commutative (a b = b a).