Announcements. Final exam, Saturday 9AM to Noon, usual classroom cheat sheet (1 page, front&back) + calculator

Similar documents
Announcements. Final Review: Units 1-7

Annoucements. MT2 - Review. one variable. two variables

Unit5: Inferenceforcategoricaldata. 4. MT2 Review. Sta Fall Duke University, Department of Statistical Science

FinalExamReview. Sta Fall Provided: Z, t and χ 2 tables

STA 101 Final Review

STATISTICS 141 Final Review

Sets and Set notation. Algebra 2 Unit 8 Notes

Nicole Dalzell. July 2, 2014

MATH 1150 Chapter 2 Notation and Terminology

Sociology 6Z03 Review II

Review of Statistics 101

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

Business Statistics. Lecture 10: Course Review

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

An introduction to biostatistics: part 1

AP Statistics Cumulative AP Exam Study Guide

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics

Lecture 5: ANOVA and Correlation

SMAM 314 Practice Final Examination Winter 2003

This document contains 3 sets of practice problems.

Mr. Stein s Words of Wisdom

REVIEW: Midterm Exam. Spring 2012

Ch. 1: Data and Distributions

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

STA220H1F Term Test Oct 26, Last Name: First Name: Student #: TA s Name: or Tutorial Room:

Mathematical Notation Math Introduction to Applied Statistics

Confidence Intervals, Testing and ANOVA Summary

ANOVA: Analysis of Variation

Contents. Acknowledgments. xix

Exam details. Final Review Session. Things to Review

Chapter 23. Inferences About Means. Monday, May 6, 13. Copyright 2009 Pearson Education, Inc.

Tables Table A Table B Table C Table D Table E 675

ECON Semester 1 PASS Mock Mid-Semester Exam ANSWERS

Final Exam STAT On a Pareto chart, the frequency should be represented on the A) X-axis B) regression C) Y-axis D) none of the above

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score

Determining the Spread of a Distribution

Determining the Spread of a Distribution

:the actual population proportion are equal to the hypothesized sample proportions 2. H a

FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE

dates given in your syllabus.

Chapter 1 Statistical Inference

INFERENCE FOR REGRESSION

Chapter 3. Measuring data

Business 320, Fall 1999, Final

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

Macomb Community College Department of Mathematics. Review for the Math 1340 Final Exam

Chapter 2: Tools for Exploring Univariate Data

Overview. INFOWO Statistics lecture S1: Descriptive statistics. Detailed Overview of the Statistics track. Definition

Index I-1. in one variable, solution set of, 474 solving by factoring, 473 cubic function definition, 394 graphs of, 394 x-intercepts on, 474

Statistics 1. Edexcel Notes S1. Mathematical Model. A mathematical model is a simplification of a real world problem.

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

appstats27.notebook April 06, 2017

Announcements. Lecture 1 - Data and Data Summaries. Data. Numerical Data. all variables. continuous discrete. Homework 1 - Out 1/15, due 1/22

Collaborative Statistics: Symbols and their Meanings

are the objects described by a set of data. They may be people, animals or things.

STAT 4385 Topic 01: Introduction & Review

Chapter 27 Summary Inferences for Regression

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things.

IT 403 Statistics and Data Analysis Final Review Guide

GLOSSARY. a n + n. a n 1 b + + n. a n r b r + + n C 1. C r. C n

Describing Distributions

AIM HIGH SCHOOL. Curriculum Map W. 12 Mile Road Farmington Hills, MI (248)

Lecture 3B: Chapter 4, Section 2 Quantitative Variables (Displays, Begin Summaries)

Correlation and Regression

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling

Nemours Biomedical Research Statistics Course. Li Xie Nemours Biostatistics Core October 14, 2014

Chapter 4.notebook. August 30, 2017

Conditions for Regression Inference:

Elementary Statistics

Statistical Modeling and Analysis of Scientific Inquiry: The Basics of Hypothesis Testing

Harvard University. Rigorous Research in Engineering Education

Introduction to Statistics

Chapter 6 The Normal Distribution

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL - MAY 2005 EXAMINATIONS STA 248 H1S. Duration - 3 hours. Aids Allowed: Calculator

Comparing Several Means

Practical Statistics for the Analytical Scientist Table of Contents

Stat 412/512 REVIEW OF SIMPLE LINEAR REGRESSION. Jan Charlotte Wickham. stat512.cwick.co.nz

13.4 Probabilities of Compound Events.notebook May 29, I can calculate probabilities of compound events.


Stat 2300 International, Fall 2006 Sample Midterm. Friday, October 20, Your Name: A Number:

Sampling, Frequency Distributions, and Graphs (12.1)

Chi-square tests. Unit 6: Simple Linear Regression Lecture 1: Introduction to SLR. Statistics 101. Poverty vs. HS graduate rate

Inferences for Regression

Extra Exam Empirical Methods VU University Amsterdam, Faculty of Exact Sciences , July 2, 2015

AP Final Review II Exploring Data (20% 30%)

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output

Marquette University Executive MBA Program Statistics Review Class Notes Summer 2018

Topic 3: Introduction to Statistics. Algebra 1. Collecting Data. Table of Contents. Categorical or Quantitative? What is the Study of Statistics?!

MGEC11H3Y L01 Introduction to Regression Analysis Term Test Friday July 5, PM Instructor: Victor Yu

Chapter 1 - Lecture 3 Measures of Location

Descriptive Univariate Statistics and Bivariate Correlation

2. Outliers and inference for regression

Sociology 6Z03 Review I

Basic Statistics. Resources. Statistical Tables Murdoch & Barnes. Scientific Calculator. Minitab 17.

Class 24. Daniel B. Rowe, Ph.D. Department of Mathematics, Statistics, and Computer Science. Marquette University MATH 1700

STATISTICS 1 REVISION NOTES

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

M & M Project. Think! Crunch those numbers! Answer!

Transcription:

Announcements Announcements FINAL REVIEW: UNITS 1-7 STATISTICS 101 Nicole Dalzell August 7, 2014 Final exam, Saturday 9AM to Noon, usual classroom cheat sheet (1 page, front&back) + calculator Check grades on Sakai over the weekend and let me know if anything is missing/wrong before the final, no grade changes after the final exam Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 2 / 23 Which of the following is true? Modeling ( response) Which of the following is the best visualization for evaluating the relationship between two variables? Modeling ( response) (a) If the sample size is large enough, conclusions can be generalized to the population. (b) If subjects are randomly assigned to treatments, conclusions can be generalized to the population. (c) Blocking in experiments serves a similar purpose as stratifying in observational studies. (d) Representative samples allow us to make causal conclusions. (e) Statistical inference requires normal distribution of the response variable. (a) side-by-side box plots (b) mosaic plot (c) pie chart (d) segmented frequency bar plot (e) relative frequency histogram Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 3 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 4 / 23

Which of the following is false? Modeling ( response) Which of the following is false? data Inference Modeling ( response) (a) Box plots are useful for highlighting outliers, but we cannot determine skew based on a box plot. (b) Median and IQR are more robust statistics than mean and SD, respectively, since they are not affected by outliers or extreme skewness. (c) When the response variable is extremely right skewed, it may be useful to apply a log transformation to obtain a more symmetric distribution, and model the logged data. (d) Segmented frequency bar plots are good enough for evaluating the relationship between two variables if the sample sizes are the same for various levels of the explanatory variable. (a) If A and B are independent, then having information on A does not tell us anything about B. (b) If A and B are disjoint, then knowing that A occurs tells us that B cannot occur. (c) Disjoint (mutually exclusive) events are always dependent since if one event occurs we know the other one cannot. (d) If A and B are independent, then P(A and B) = P(A) + P(B). (e) If A and B are not disjoint, then P(A or B) = P(A) + P(B) - P(A and B). Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 5 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 6 / 23 Which of the following is the least useful method for assessing if the data follow a normal distribution? Modeling ( response) Two students in an introductory statistics class choose to conduct similar studies estimating the proportion of smokers at their school. Student A collects data from 100 students, and student B collects data from 50 students. How will the standard errors used by the two students compare? Assume both are simple random samples. Modeling ( response) (a) Check if 68% of the data are within 1 SD of the mean, 95% of data are within 2 SDs of the mean, and 99.7% of data are within 3 SDs of the mean. (b) Check if the points are on a straight line on a normal probability plot. (c) Check if the mean and median are equal. (d) Check if the distribution is unimodal and symmetric. (e) Generate normally distributed random data with same mean and standard deviation as the observed data, overlay the plots of the generated and observed data, and check if they line up. (a) SE used by Student A < SE used as Student B. (b) SE used by Student A > SE used as Student B. (c) SE used by Student A = SE used as Student B. (d) SE used by Student A SE used as Student B. (e) Cannot tell without knowing the true proportion of smokers at this school. Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 7 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 8 / 23

Application exercise: Test the hypothesis H 0 : µ = 10 vs. H A : µ > 10 for the following 6 samples. Assume σ = 2. x 10.05 10.1 10.2 n = 30 R Jacob and the R-gonauts Rdolla p value ArrStudio Kool Kucumbers Star-tisticians Belle Curve Mighty Ducks Supa Hot Fire Can We Phone a Friend? MJRJ Team 2 Group 4.0 Mu Chi Sigma The 99% n = 5000 The Alternative Hypotheses The Database The Unexpected Errors The Big Three The Outliers The X Factor The Confounding Variables The Pythons Very-Ables The Control Group The Standard Deviants What are the Odds Which of the following is the best method for evaluating the relationship between two variables? (a) chi-square test of independence (b) chi-square test of goodness of fit (c) anova (d) linear regression (e) t-test Modeling ( response) p value The Critical Values The Statletes Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 9 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 10 / 23 Inference for data: Which of the following is the best method for evaluating the relationship between a and a variable with many levels? (a) z-test (b) chi-square test of goodness of fit (c) anova (d) linear regression (e) t-test Modeling ( response) One : Parameter of interest: µ n 30 Z, n < 30 T One vs. one (with 2 levels): Parameter of interest: µ 1 µ 2 n 1 and n 2 30 Z, n 1 or n 2 < 30 T If samples are dependent (paired), first find differences between paired observations One vs. one (with > 2 levels) - mean: Parameter of interest: NA ANOVA HT only For all other parameters of interest: simulation Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 11 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 12 / 23

Inference for data - binary outcome: Inference for data - > 2 outcomes: Binary outcome: One : Parameter of interest: p S/F condition met Z, if not simulation One vs. one, each with only 2 outcomes: Parameter of interest: p 1 p 2 S/F condition met Z, if not simulation S/F: use obs. S and F for CIs and exp. for HT > 2 outcomes: One, compared to hypothetical distribution: Parameter of interest: NA At least 5 exp. successes in each cell χ 2 GOF, if not simulation HT only One vs. one, either with > 2 outcomes: Parameter of interest: NA At least 5 exp. successes in each cell χ 2 Independence, if not simulation HT only Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 13 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 14 / 23 Data are collected at a bank on 6 tellers randomly sampled transactions. Do average transaction times vary by teller? Modeling ( response) Response variable:, Explanatory variable: ANOVA Summary statistics: n_1 = 14, mean_1 = 65.7857, sd_1 = 15.2249 n_2 = 23, mean_2 = 79.9174, sd_2 = 23.284 n_3 = 15, mean_3 = 82.66, sd_3 = 18.1842 n_4 = 15, mean_4 = 77.9933, sd_4 = 23.2754 n_5 = 44, mean_5 = 81.7295, sd_5 = 21.5768 n_6 = 29, mean_6 = 75.3069, sd_6 = 20.4814 H_0: All means are equal. H_A: At least one mean is different. Analysis of Variance Table 20 40 60 80 100 120 1 2 3 4 5 6 1.508 Response: data Df Sum Sq Mean Sq F value Pr(>F) group 5 3315 663.06 1.508 0.1914 Residuals 134 58919 439.69 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 15 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 16 / 23

Application exercise: ANOVA Data are collected on download times at three different times during the day. We want to evaluate whether average download times vary by time of day. Fill in the??s in the ANOVA output below. Modeling ( response) What is the result of the ANOVA? Response variable:, Explanatory variable: Summary statistics: n_early (7AM) = 16, mean_early (7AM) = 113.375, sd_early (7AM) = 47.6541 n_eve (5 PM) = 16, mean_eve (5 PM) = 273.3125, sd_eve (5 PM) = 52.1929 n_late (12 AM) = 16, mean_late (12 AM) = 193.0625, sd_late (12 AM) = 40.9023 Analysis of Variance Table Response: data Df Sum Sq Mean Sq F value Pr(>F) group???????? 1.306e-11 Residuals?? 100020?? Total?? 304661 100 150 200 250 300 350 Early (7AM) Evening (5 PM) Late Night (12 AM) 46.0349 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 17 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 18 / 23 Application exercise: ANOVA (cont.) The next step is to evaluate the pairwise tests. There are 3 pairs of times of day 1 Early vs. Evening R, ArrStudio, Belle Curve, Can We Phone a Friend?, Group 4.0, Jacob and the R-gonauts, Kool Kucumbers, Mighty Ducks, MJRJ, Mu Chi Sigma 2 Evening vs. Late Night Rdolla, Star-tisticians, Supa Hot Fire, Team 2, The 99%, The Alternative Hypotheses, The Big Three, The Confounding Variables, The Control Group, The Critical Values 3 Early vs. Late Night The Database, The Outliers, The Pythons, The Standard Deviants, The Statletes, The Unexpected Errors The X Factor, Very-Ables, What are the Odds? Determine the appropriate significance level for these tests, and then complete the test assigned to your team. Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 19 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 20 / 23

What percent of variability in download times is explained by time of day? Modeling ( response) n = 50 and ˆp = 0.80. Hypotheses: H 0 : p = 0.82; H A : p < 0.82. We use a randomization test because the sample size isn t large enough for ˆp to be distributed nearly normally (50 0.82 = 41 < 10; 50 0.18 = 9 < 10) Which of the following is the correct set up for this hypothesis test? Red: success, blue: failure, ˆp sim = proportion of reds in simulated samples. Modeling ( response) Response: data Df Sum Sq Mean Sq F value Pr(>F) group 2 204641 102320 46.035 1.306e-11 Residuals 45 100020 2223 (a) 204641 204641+100020 (b) 204641 100020 (c) 100020 (d) 204641 102320 102320+2223 (a) Place 80 red and 20 blue chips in a bag. Sample, with replacement, 50 chips proportion of simulations where ˆp sim 0.82. (b) Place 82 red and 18 blue chips in a bag. Sample, without replacement, 50 chips proportion of simulations where ˆp sim 0.80. (c) Place 82 red and 18 blue chips in a bag. Sample, with replacement, 50 chips proportion of simulations where ˆp sim 0.80. (d) Place 82 red and 18 blue chips in a bag. Sample, with replacement, 100 chips proportion of simulations where ˆp sim 0.80. Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 21 / 23 Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 22 / 23 Randomization distribution What is the center of the randomization distribution? What is the result of the hypothesis test? observed 0.8 0.70 0.75 0.80 0.85 0.90 randomization statistic Statistics 101 ( Nicole Dalzell ) Final August 7, 2014 23 / 23