Announcements. Final Review: Units 1-7


Statistics 104, Mine Çetinkaya-Rundel, June 24, 2013

Announcements

Final on Wed: cheat sheet (one sheet, front and back) and calculator. You must have webcam and audio on at all times and be visible. You are welcome to mute your audio so that you do not hear others, but I should be able to hear you.

Graded work: Everything, including the project, will be returned by tonight; the last PS will be returned by 10am tomorrow morning. Check your grades on Sakai and let me know if anything is missing. No grade changes will be made after the final exam.

The final exam will be graded by 1pm on Thursday, and you will be able to see the results once the grades are released. Final course grades will be submitted to the Registrar's on Friday.

Which of the following is true?
(a) If the sample size is large enough, conclusions can be generalized to the population.
(b) If subjects are randomly assigned to treatments, conclusions can be generalized to the population.
(c) Blocking in experiments serves a similar purpose as stratifying in observational studies.
(d) Representative samples allow us to make causal conclusions.
(e) Statistical inference requires normal distribution of the response variable.

Which of the following is the best visualization for evaluating the relationship between two variables?
(a) side-by-side box plots
(b) mosaic plot
(c) pie chart
(d) segmented frequency bar plot
(e) relative frequency histogram

Which of the following is false?
(a) Box plots are useful for highlighting outliers, but we cannot determine skew based on a box plot.
(b) Median and IQR are more robust statistics than mean and SD, respectively, since they are not affected by outliers or extreme skewness.
(c) When the response variable is extremely right skewed, it may be useful to apply a log transformation to obtain a more symmetric distribution, and model the logged data.
(d) Segmented frequency bar plots are good enough for evaluating the relationship between two variables if the sample sizes are the same for various levels of the explanatory variable.

Which of the following is false?
(a) If A and B are independent, then having information on A does not tell us anything about B.
(b) If A and B are disjoint, then knowing that A occurs tells us that B cannot occur.
(c) Disjoint (mutually exclusive) events are always dependent since if one event occurs we know the other one cannot.
(d) If A and B are independent, then P(A and B) = P(A) + P(B).
(e) If A and B are not disjoint, then P(A or B) = P(A) + P(B) - P(A and B).

Which of the following is the least useful method for assessing if the data follow a normal distribution?
(a) Check if 68% of the data are within 1 SD of the mean, 95% of the data are within 2 SDs of the mean, and 99.7% of the data are within 3 SDs of the mean.
(b) Check if the points are on a straight line on a normal probability plot.
(c) Check if the mean and median are equal.
(d) Check if the distribution is unimodal and symmetric.
(e) Generate normally distributed random data with the same mean and standard deviation as the observed data, overlay the plots of the generated and observed data, and check if they line up.

About 30% of human twins are identical and the rest are fraternal. Identical twins are necessarily the same sex: half are males and the other half are females. One-quarter of fraternal twins are both male, one-quarter both female, and one-half are mixes: one male, one female. You have just become a parent of twins and are told they are both girls. Given this information, what is the posterior probability that they are identical?

Type of twins     Gender               Joint probability
identical, 0.3    males, 0.5           0.3 × 0.5 = 0.15
identical, 0.3    females, 0.5         0.3 × 0.5 = 0.15
identical, 0.3    male&female, 0.0     0.3 × 0 = 0
fraternal, 0.7    males, 0.25          0.7 × 0.25 = 0.175
fraternal, 0.7    females, 0.25        0.7 × 0.25 = 0.175
fraternal, 0.7    male&female, 0.50    0.7 × 0.5 = 0.35

$$P(\text{iden} \mid \text{f}) = \frac{P(\text{iden} \,\&\, \text{f})}{P(\text{f})} = \frac{0.15}{0.15 + 0.175} = 0.46$$
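The twins calculation above can be checked with a few lines of Python. This is only a sketch of the same tree: the variable names (`prior`, `likelihood_girls`, `joint`) are illustrative and not from the slides, and the probabilities are the ones given in the problem statement.

```python
# Prior probabilities for the type of twins (from the problem statement).
prior = {"identical": 0.3, "fraternal": 0.7}

# P(both girls | type of twins), also from the problem statement.
likelihood_girls = {"identical": 0.5, "fraternal": 0.25}

# Joint probabilities P(type and both girls): one per "females" branch of the tree.
joint = {kind: prior[kind] * likelihood_girls[kind] for kind in prior}

# Marginal probability of two girls, then Bayes' rule for the posterior.
p_girls = sum(joint.values())
posterior_identical = joint["identical"] / p_girls

print(f"P(identical | both girls) = {posterior_identical:.2f}")
```

The denominator sums only the two "females" branches, which is why the male and mixed branches of the tree do not enter the answer.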

Which of the following is false?
(a) Suppose you're evaluating 4 claims. If prior to data collection you don't have a preference for one claim over another, you should assign 0.25 as the prior probability to each claim.
(b) Posterior probability and the p-value are equivalent.
(c) One advantage of Bayesian inference is that data can be integrated into the inferential scheme as they are collected.
(d) Suppose a patient tests positive for a disease that 2% of the population are known to have. A doctor wants to confirm the test result by retesting the patient. In the second test the prior probability for having the disease should be more than 2%.

Posterior: P(hypothesis | data); p-value: P(data | hypothesis)

Two students in an introductory statistics class choose to conduct similar studies estimating the proportion of smokers at their school. Student A collects data from 100 students, and student B collects data from 50 students. How will the standard errors used by the two students compare? Assume both are simple random samples.
(a) SE used by Student A < SE used by Student B.
(b) SE used by Student A > SE used by Student B.
(c) SE used by Student A = SE used by Student B.
(d) SE used by Student A SE used by Student B.
(e) Cannot tell without knowing the true proportion of smokers at this school.

Test the hypothesis H_0: µ = 10 vs. H_A: µ > 10 for the following 6 samples. Assume σ = 2.

n = 30      Jacob     Shelby    Daryn
x̄           10.05     10.1      10.2
p-value     0.45      0.39      0.29

n = 5000    Thomas    Cece      Lili
x̄           10.05     10.1      10.2
p-value     0.04      0.0002    0

When n is large, even small deviations from the null (small effect sizes), which may be considered practically insignificant, can yield statistically significant results.

Which of the following is the best method for evaluating the relationship between two variables?
(a) chi-square test of independence
(b) chi-square test of goodness of fit
(c) anova
(d) linear regression
(e) t-test
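The p-values in the table of sample means above can be reproduced with a one-sided Z test. The sketch below assumes the setup stated on the slide (H_0: µ = 10, H_A: µ > 10, σ = 2); the function name `upper_tail_p_value` is illustrative, and the normal tail area is computed with `math.erfc` from the standard library rather than a Z table.

```python
import math

def upper_tail_p_value(x_bar, mu_0, sigma, n):
    """One-sided p-value for H0: mu = mu_0 vs. HA: mu > mu_0 with known sigma."""
    standard_error = sigma / math.sqrt(n)
    z = (x_bar - mu_0) / standard_error
    # Upper-tail area of the standard normal: P(Z > z) = erfc(z / sqrt(2)) / 2.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Same three sample means, evaluated at both sample sizes.
for n in (30, 5000):
    for x_bar in (10.05, 10.1, 10.2):
        p = upper_tail_p_value(x_bar, mu_0=10, sigma=2, n=n)
        print(f"n = {n:4d}  x_bar = {x_bar:5.2f}  p-value = {p:.4f}")
```

The sample means are identical across the two rows of output; only the standard error changes, which is what drives the p-values toward zero when n = 5000.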

Which of the following is the best method for evaluating the relationship between a numerical and a categorical variable with many levels?
(a) z-test
(b) chi-square test of goodness of fit
(c) anova
(d) linear regression
(e) t-test

Inference for numerical data:

One numerical variable:
  Parameter of interest: µ
  n ≥ 30: Z; n < 30: T

One numerical vs. one categorical (with 2 levels):
  Parameter of interest: µ_1 − µ_2
  n_1 and n_2 ≥ 30: Z; n_1 or n_2 < 30: T
  If samples are dependent (paired), first find differences between paired observations

One numerical vs. one categorical (with > 2 levels) - mean:
  Parameter of interest: NA
  ANOVA
  HT only

For all other parameters of interest: simulation

Inference for categorical data - binary outcome:

One categorical variable:
  Parameter of interest: p
  S/F condition met: Z; if not: simulation

One categorical vs. one categorical, each with only 2 outcomes:
  Parameter of interest: p_1 − p_2
  S/F condition met: Z; if not: simulation

S/F: use observed successes and failures for CIs and expected for HT

Inference for categorical data - > 2 outcomes:

One categorical variable, compared to hypothetical distribution:
  Parameter of interest: NA
  At least 5 expected successes in each cell: χ² GOF; if not: simulation
  HT only

One categorical vs. one categorical, either with > 2 outcomes:
  Parameter of interest: NA
  At least 5 expected successes in each cell: χ² Independence; if not: simulation
  HT only
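The condition checks in the summary above can be written as small helper functions. This is a rough sketch, not part of the slides: the function names are hypothetical, and the success/failure threshold of 10 is an assumption taken from the randomization example later in these slides (where 50 × 0.18 = 9 is judged too small).

```python
def statistic_for_means(*sample_sizes):
    """Return 'Z' if every sample has n >= 30, otherwise 'T'."""
    return "Z" if all(n >= 30 for n in sample_sizes) else "T"

def method_for_proportion(n, p, threshold=10):
    """Return 'Z' if the success/failure condition holds, otherwise 'simulation'.

    For a hypothesis test, p is the null value (expected counts);
    for a confidence interval, p is the observed proportion.
    """
    successes, failures = n * p, n * (1 - p)
    return "Z" if min(successes, failures) >= threshold else "simulation"

def method_for_counts(expected_counts, threshold=5):
    """Return 'chi-square' if every expected cell count is at least 5."""
    ok = all(count >= threshold for count in expected_counts)
    return "chi-square" if ok else "simulation"

print(statistic_for_means(30))          # one sample of size 30
print(statistic_for_means(14, 44))      # two samples, one below 30
print(method_for_proportion(50, 0.82))  # only 9 expected failures
print(method_for_counts([12, 8, 5]))    # all expected counts at least 5
```

These helpers only encode the sample-size conditions; independence of observations and, for the T case, approximate normality still have to be judged separately.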

Data are collected at a bank on 6 tellers' randomly sampled transactions. Do average transaction times vary by teller?

Response variable: numerical. Explanatory variable: categorical. Method: ANOVA.

Summary statistics:
n_1 = 14, mean_1 = 65.7857, sd_1 = 15.2249
n_2 = 23, mean_2 = 79.9174, sd_2 = 23.284
n_3 = 15, mean_3 = 82.66, sd_3 = 18.1842
n_4 = 15, mean_4 = 77.9933, sd_4 = 23.2754
n_5 = 44, mean_5 = 81.7295, sd_5 = 21.5768
n_6 = 29, mean_6 = 75.3069, sd_6 = 20.4814

H_0: All means are equal.
H_A: At least one mean is different.

[Figure: transaction times by teller (1 to 6).]

Analysis of Variance Table
Response:
             Df    Sum Sq    Mean Sq    F value    Pr(>F)
group         5      3315     663.06      1.508    0.1914
Residuals   134     58919     439.69

Data are collected on download times at three different times during the day. We want to evaluate whether average download times vary by time of day. Fill in the ??s in the ANOVA output below.

Response variable: numerical. Explanatory variable: categorical.

Summary statistics:
n_early (7AM) = 16, mean_early (7AM) = 113.375, sd_early (7AM) = 47.6541
n_eve (5 PM) = 16, mean_eve (5 PM) = 273.3125, sd_eve (5 PM) = 52.1929
n_late (12 AM) = 16, mean_late (12 AM) = 193.0625, sd_late (12 AM) = 40.9023

[Figure: download times by time of day: Early (7AM), Evening (5 PM), Late Night (12 AM).]

Analysis of Variance Table
Response:
             Df    Sum Sq    Mean Sq    F value    Pr(>F)
group        ??        ??         ??         ??    1.306e-11
Residuals    ??    100020         ??
Total        ??    304661

What is the result of the ANOVA?

Since 1.306e-11 < 0.05, we reject the null hypothesis. The data provide convincing evidence that the average download time is different for at least one pair of times of day.
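One way to fill in the missing entries of the download-times ANOVA table is to rebuild the sums of squares from the summary statistics. The sketch below assumes the usual one-way ANOVA decomposition (between-group and within-group sums of squares); the function name `anova_from_summary` and the dictionary keys are illustrative. It stops at the F statistic because the p-value needs an F distribution, which the Python standard library does not provide.

```python
def anova_from_summary(ns, means, sds):
    """One-way ANOVA table entries from group sizes, means, and SDs."""
    k = len(ns)
    n_total = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total

    # Between-group and within-group sums of squares.
    ss_group = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_error = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))

    df_group, df_error = k - 1, n_total - k
    ms_group, ms_error = ss_group / df_group, ss_error / df_error
    return {
        "df_group": df_group, "df_error": df_error, "df_total": n_total - 1,
        "ss_group": ss_group, "ss_error": ss_error,
        "ms_group": ms_group, "ms_error": ms_error,
        "F": ms_group / ms_error,
    }

# Download times: early (7AM), evening (5 PM), late night (12 AM).
table = anova_from_summary(
    ns=[16, 16, 16],
    means=[113.375, 273.3125, 193.0625],
    sds=[47.6541, 52.1929, 40.9023],
)
for name, value in table.items():
    print(f"{name:>9}: {value:.4f}")
```

The group sum of squares can also be read off the table directly as Total minus Residuals, 304661 − 100020 = 204641, which agrees with the value rebuilt from the group means.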

The next step is to evaluate the pairwise tests. There are 3 pairs of times of day:
1. Early vs. Evening - Cece and Jacob
2. Evening vs. Late Night - Lili and Daryn
3. Early vs. Late Night - Shelby and Thomas

Determine the appropriate significance level for these tests, and then complete the test assigned to you.

α = 0.05 / 3 = 0.0167

(1) Early vs. Evening
$$T_{45} = \frac{113.375 - 273.3125}{\sqrt{\frac{2223}{16} + \frac{2223}{16}}} = \frac{-159.9375}{16.67} = -9.59, \quad p\text{-value} < 0.01$$

(2) Evening vs. Late Night
$$T_{45} = \frac{273.3125 - 193.0625}{\sqrt{\frac{2223}{16} + \frac{2223}{16}}} = \frac{80.25}{16.67} = 4.81, \quad p\text{-value} < 0.01$$

(3) Early vs. Late Night
$$T_{45} = \frac{113.375 - 193.0625}{\sqrt{\frac{2223}{16} + \frac{2223}{16}}} = \frac{-79.6875}{16.67} = -4.78, \quad p\text{-value} < 0.01$$

What percent of variability in download times is explained by time of day?

Response:
             Df    Sum Sq    Mean Sq    F value    Pr(>F)
group         2    204641     102320     46.035    1.306e-11
Residuals    45    100020       2223

(a) $\frac{204641}{204641 + 100020} = 0.67$
(b) $\frac{204641}{100020}$
(c) $\frac{100020}{204641}$
(d) $\frac{102320}{102320 + 2223}$

n = 50 and p̂ = 0.80. Hypotheses: H_0: p = 0.82; H_A: p < 0.82. We use a randomization test because the sample size isn't large enough for p̂ to be distributed nearly normally (50 × 0.82 = 41 ≥ 10, but 50 × 0.18 = 9 < 10).

Which of the following is the correct setup for this hypothesis test? Red: success, blue: failure, p̂_sim = proportion of reds in simulated samples.
(a) Place 80 red and 20 blue chips in a bag. Sample, with replacement, 50 chips. Record the proportion of simulations where p̂_sim 0.82.
(b) Place 82 red and 18 blue chips in a bag. Sample, without replacement, 50 chips. Record the proportion of simulations where p̂_sim 0.80.
(c) Place 82 red and 18 blue chips in a bag. Sample, with replacement, 50 chips. Record the proportion of simulations where p̂_sim 0.80.
(d) Place 82 red and 18 blue chips in a bag. Sample, with replacement, 100 chips. Record the proportion of simulations where p̂_sim 0.80.

Randomization distribution

What is the center of the randomization distribution? What is the result of the hypothesis test?

[Figure: randomization distribution of the randomization statistic, with the observed value 0.8 marked.]
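A randomization test of this kind can be simulated directly. The sketch below is one possible implementation under stated assumptions: the bag matches the null value p = 0.82, each simulated sample has the same size as the observed one (n = 50) and is drawn with replacement, and because H_A is p < 0.82 the p-value counts simulations with p̂_sim at or below the observed 0.80. The function name, the number of simulations, and the seed are all illustrative choices, not values from the slides.

```python
import random

def randomization_p_value(n=50, reds=82, blues=18, p_hat_obs=0.80,
                          n_sims=10_000, seed=1):
    """Estimate P(p_hat_sim <= p_hat_obs) when the null proportion is true."""
    rng = random.Random(seed)
    bag = ["red"] * reds + ["blue"] * blues  # 82% successes, matching H0
    max_reds = round(p_hat_obs * n)          # 40 reds corresponds to p_hat = 0.80

    at_or_below = 0
    for _ in range(n_sims):
        sample = rng.choices(bag, k=n)       # sampling with replacement
        if sample.count("red") <= max_reds:  # p_hat_sim <= p_hat_obs
            at_or_below += 1
    return at_or_below / n_sims

p_value = randomization_p_value()
print(f"estimated p-value: {p_value:.3f}")
```

Comparing counts of reds rather than proportions avoids floating-point equality issues at the boundary value 0.80. The simulated proportions center near the null value 0.82, so an observed 0.80 is not unusual and the estimated p-value comes out large.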