Part III: Unstructured Data


Inf1-DA 2010 2011 III: 51 / 89 Part III: Unstructured Data

Data Retrieval:
  III.1 Unstructured data and data retrieval
Statistical Analysis of Data:
  III.2 Data scales and summary statistics
  III.3 Hypothesis testing and correlation
  III.4 χ² and collocations

Inf1-DA 2010 2011 III: 52 / 89 Several variables

Often, one wants to relate data in several variables (i.e., multi-dimensional data). For example, the table below tabulates, for eight students (A-H), their weekly time (in hours) spent studying for Data & Analysis, drinking and eating. This is juxtaposed with their Data & Analysis exam results.

          A     B     C     D     E     F     G     H
Study     0.5   1     1.4   1.2   2.2   2.4   3     3.5
Drinking  25    20    22    10    14    5     2     4
Eating    4     7     4.5   5     8     3.5   6     5
Exam      16    35    42    45    60    72    85    95

Thus, we have four variables: study, drinking, eating and exam. (This is four-dimensional data.)
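As a quick sketch, the table can be held as plain Python lists (the variable names are mine, not from the notes):

```python
# Weekly hours per student A-H, transcribed from the table above,
# plus the exam results. Names are illustrative only.
study    = [0.5, 1, 1.4, 1.2, 2.2, 2.4, 3, 3.5]
drinking = [25, 20, 22, 10, 14, 5, 2, 4]
eating   = [4, 7, 4.5, 5, 8, 3.5, 6, 5]
exam     = [16, 35, 42, 45, 60, 72, 85, 95]

# Sanity check against the summary statistics used later (slide III: 64).
mean_study = sum(study) / len(study)  # 15.2 / 8 = 1.9
mean_exam = sum(exam) / len(exam)     # 450 / 8 = 56.25
```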

Inf1-DA 2010 2011 III: 53 / 89 Correlation We can ask if there is any relationship between the values taken by two variables. If there is no relationship, then the variables are said to be independent. If there is a relationship, then the variables are said to be correlated. Caution: A correlation does not imply a causal relationship between one variable and another. For example, there is a positive correlation between incidences of lung cancer and time spent watching television, but neither causes the other. However, in cases in which there is a causal relationship between two variables, then there often will be an associated correlation between the variables.

Inf1-DA 2010 2011 III: 54 / 89 Visualising correlations One way of discovering correlations is to visualise the data. A simple visual guide is to draw a scatter plot using one variable for the x-axis and one for the y-axis. Example: In the example data on Slide III: 52, is there a correlation between study hours and exam results? What about between drinking hours and exam results? What about eating and exam results?

Inf1-DA 2010 2011 III: 55 / 89 Studying vs. exam results This looks like a positive correlation.

Inf1-DA 2010 2011 III: 56 / 89 Drinking vs. exam results This looks like a negative correlation.

Inf1-DA 2010 2011 III: 57 / 89 Eating vs. exam results There is no obvious correlation.

Inf1-DA 2010 2011 III: 58 / 89 Statistical hypothesis testing The last three slides use data visualisation as a tool for postulating hypotheses about data. One might also postulate hypotheses for other reasons, e.g.: intuition that a hypothesis may be true; a perceived analogy with another situation in which a similar hypothesis is known to be valid; existence of a theoretical model that makes a prediction; etc. Statistics provides the tools needed to corroborate or refute such hypotheses with scientific rigour: statistical tests.

Inf1-DA 2010 2011 III: 59 / 89 The general form of a statistical test One applies an appropriately chosen statistical test to the data and calculates the result R. Statistical tests are usually based on a null hypothesis that there is nothing out of the ordinary about the data. The result R of the test has an associated probability value p. The value p represents the probability that we would obtain a result similar to R if the null hypothesis were true. N.B., p is not the probability that the null hypothesis is true. This is not a quantifiable value.

Inf1-DA 2010 2011 III: 60 / 89 The general form of a statistical test (continued) The value p represents the probability that we would obtain a result similar to R if the null hypothesis were true. If the value of p is sufficiently small, then we conclude that the null hypothesis is a poor explanation for our data. Thus we reject the null hypothesis and replace it with a better explanation for our data. Standard significance thresholds are to require p < 0.05 (i.e., there is a less than 1/20 chance that we would have obtained our test result were the null hypothesis true) or, better, p < 0.01 (i.e., a less than 1/100 chance).

Inf1-DA 2010 2011 III: 61 / 89 Correlation coefficient

The correlation coefficient is a statistical measure of how closely the data values x_1, ..., x_N are correlated with y_1, ..., y_N. Let μ_x and σ_x be the mean and standard deviation of the x values. Let μ_y and σ_y be the mean and standard deviation of the y values. The correlation coefficient ρ_{x,y} is defined by:

ρ_{x,y} = Σ_{i=1}^{N} (x_i - μ_x)(y_i - μ_y) / (N σ_x σ_y)

If ρ_{x,y} is positive, this suggests x and y are positively correlated. If ρ_{x,y} is negative, this suggests they are negatively correlated. If ρ_{x,y} is close to 0, this suggests there is no correlation.
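The definition above can be sketched in Python (the function name is mine; the slide gives only the formula):

```python
from math import sqrt

def correlation_coefficient(xs, ys):
    """Population correlation coefficient rho_{x,y} from the definition above."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    return sum((x - mu_x) * (y - mu_y)
               for x, y in zip(xs, ys)) / (n * sigma_x * sigma_y)

# A perfectly increasing linear relationship gives rho = 1;
# a perfectly decreasing one gives rho = -1.
```

On the study and exam values tabulated on slide III: 52 this yields a strongly positive value, in line with the slide's reported figure (small differences can arise from rounding in intermediate statistics).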

Inf1-DA 2010 2011 III: 62 / 89 Correlation coefficient as a statistical test In a test for correlation between two variables x and y (e.g., exam result and study hours), we are looking both for a correlation and for its direction (positive or negative). The null hypothesis is that there is no correlation. We calculate the correlation coefficient ρ_{x,y}. We then look up its significance in a critical values table for the correlation coefficient. Such tables can be found in statistics books (and on the Web). This gives us the associated probability value p. The value of p tells us whether we have significant grounds for rejecting the null hypothesis, in which case our better explanation is that there is a correlation.

Inf1-DA 2010 2011 III: 63 / 89 Critical values table for the correlation coefficient

The table has rows for N values and columns for p values.

N    p = 0.1   p = 0.05   p = 0.01   p = 0.001
7    0.669     0.754      0.875      0.951
8    0.621     0.707      0.834      0.925
9    0.582     0.666      0.798      0.898

The table shows that for N = 8 a value of |ρ_{x,y}| > 0.834 has probability p < 0.01 of occurring (that is, less than a 1/100 chance) if the null hypothesis is true. Similarly, for N = 8 a value of |ρ_{x,y}| > 0.925 has probability p < 0.001 of occurring (less than a 1/1000 chance) if the null hypothesis is true.
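Looking up significance in the table can be sketched as follows (only the three rows shown above are covered, and the names are mine):

```python
# Critical values transcribed from the table above: for each N, pairs of
# (p threshold, critical value), tightest threshold first.
CRITICAL_VALUES = {
    7: [(0.001, 0.951), (0.01, 0.875), (0.05, 0.754), (0.1, 0.669)],
    8: [(0.001, 0.925), (0.01, 0.834), (0.05, 0.707), (0.1, 0.621)],
    9: [(0.001, 0.898), (0.01, 0.798), (0.05, 0.666), (0.1, 0.582)],
}

def significance_level(n, rho):
    """Smallest tabulated p such that |rho| exceeds its critical value,
    or None if the result is not significant at any tabulated level."""
    for p, critical in CRITICAL_VALUES[n]:
        if abs(rho) > critical:
            return p
    return None

# significance_level(8, 0.985)  -> 0.001  (the study vs. exam case)
# significance_level(8, -0.914) -> 0.01   (the drinking vs. exam case)
```

Taking the absolute value of rho reflects the two-tailed test used here: both strongly positive and strongly negative coefficients count as significant.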

Inf1-DA 2010 2011 III: 64 / 89 Studying vs. exam results

We use the data from III: 52 (see also III: 55), with the study values for x_1, ..., x_N and the exam values for y_1, ..., y_N, where N = 8. The relevant statistics are:

μ_x = 1.9    σ_x = 0.981    μ_y = 56.25    σ_y = 24.979    ρ_{x,y} = 0.985

Our value of 0.985 is (much) higher than the critical value 0.925. Thus we reject the null hypothesis with very high confidence (p < 0.001) and conclude that there is a correlation. It is a positive correlation, since ρ_{x,y} is positive.

Inf1-DA 2010 2011 III: 65 / 89 Drinking vs. exam results

We now use the drinking values from III: 52 (see also III: 56) as the values for x_1, ..., x_8. (The y values are unchanged.) The new statistics are:

μ_x = 12.75    σ_x = 8.288    ρ_{x,y} = -0.914

Since |-0.914| = 0.914 > 0.834, we can reject the null hypothesis with confidence (p < 0.01). This result is still significant, though less so than the previous one. This time the value -0.914 of ρ_{x,y} is negative, so we conclude that there is a negative correlation.
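Putting the pieces together for the drinking vs. exam test, with the values transcribed from slide III: 52 (variable names are mine):

```python
from math import sqrt

drinking = [25, 20, 22, 10, 14, 5, 2, 4]
exam = [16, 35, 42, 45, 60, 72, 85, 95]

n = len(drinking)
m_x = sum(drinking) / n
m_y = sum(exam) / n
# Equivalent to the slide's formula: the factors of N in sigma_x, sigma_y
# cancel against the N in the denominator.
rho = sum((x - m_x) * (y - m_y) for x, y in zip(drinking, exam)) / sqrt(
    sum((x - m_x) ** 2 for x in drinking) * sum((y - m_y) ** 2 for y in exam))
# rho is about -0.914; |rho| > 0.834, the N = 8 critical value for p < 0.01,
# so the negative correlation is significant.
```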

Inf1-DA 2010 2011 III: 66 / 89 Estimating correlation from a sample

As on slides III: 47-48, assume samples x_1, ..., x_n and y_1, ..., y_n from a population of size N, where n ≪ N. Let m_x and m_y be the estimates of the means of the x and y values (III: 47), and let s_x and s_y be the estimates of the standard deviations (III: 48). The best estimate r_{x,y} of the correlation coefficient is given by:

r_{x,y} = Σ_{i=1}^{n} (x_i - m_x)(y_i - m_y) / ((n - 1) s_x s_y)

The correlation coefficient is sometimes called Pearson's correlation coefficient, particularly when it is estimated from a sample using the formula above.
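A sketch of the sample estimate in Python (again, the naming is mine). Note that the (n - 1) factors inside s_x and s_y cancel against the (n - 1) in the denominator, so on the same numbers r_{x,y} agrees with the population formula; the difference lies in how the standard deviations themselves are estimated.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Pearson's r estimated from a sample, per the (n - 1) formula above."""
    n = len(xs)
    m_x = sum(xs) / n
    m_y = sum(ys) / n
    # Sample standard deviations: divide by (n - 1), not n.
    s_x = sqrt(sum((x - m_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - m_y) ** 2 for y in ys) / (n - 1))
    return sum((x - m_x) * (y - m_y)
               for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
```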

Inf1-DA 2010 2011 III: 67 / 89 Correlation coefficient subtleties

The correlation coefficient measures how close a scatter plot of x, y values is to a straight line. Nonetheless, a high correlation does not mean that the relationship between x and y is linear; it just means it can be reasonably closely approximated by a linear relationship.

Critical value tables for the correlation coefficient are often given with rows indexed by degrees of freedom rather than by N. For the correlation coefficient, the number of degrees of freedom is N - 2, so it is easy to translate such a table into the form given here. (The notion of degrees of freedom, in the case of correlation, is too advanced a concept for D&A.)

Also, critical value tables often have two classifications: one for one-tailed tests and one for two-tailed tests. Here, we are applying a two-tailed test: we consider both positive and negative values as significant. In a one-tailed test, we would be interested in just one of these possibilities.