Data Collection and Statistical Inference

1 MWO Lecture 2
Artificial Intelligence Laboratory, Vrije Universiteit Brussel
katrien kevin@arti.vub.ac.be
October 8, 2010

2 The Research Process Empirical Data

3 Data gathering
Outline: Data Sampling, Generating Hypotheses, Variables, Distributions
Data gathering is already a source of many mistakes (conscious or unconscious).
Examples:
- A railway company investigating the punctuality of its trains
- The government investigating the happiness of the people
- A questionnaire about hygiene
In all of these cases, the results are likely to be biased. Be aware of possible biases, and report them together with your statistics!

4 Methods of data gathering
- Random sampling: Each case has an equal chance of becoming part of the sample. Needed: a well-defined population, a list of all cases and a random number generator.
- Systematic sampling: The first case is picked randomly, the rest according to a specific procedure (e.g. start at a random roll number, then increment by 10). This possibly introduces a bias. (Example?)
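The two sampling procedures can be sketched in Python as follows. This is a minimal illustration only: it assumes the list of all cases is available as a Python list, and the function names and the step of 10 are chosen for this example.

```python
import random

def random_sample(population, k, rng=random):
    """Random sampling: every case has an equal chance of being selected."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng=random):
    """Systematic sampling: random first case, then every `step`-th case."""
    start = rng.randrange(step)
    return population[start::step]

if __name__ == "__main__":
    cases = list(range(1, 101))  # hypothetical list of all cases
    print(random_sample(cases, 10))
    print(systematic_sample(cases, 10))
```

If the list itself has a periodic structure that matches the step (for instance, every 10th entry being a special kind of case), the systematic sample over- or under-represents that kind of case, which is one way a bias can enter.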

5 How much data is required?
- The more, the better! In practice, research is of course limited by money, time and space.
- The amount of data needed sometimes depends on the distribution of data points (e.g. some may be very rare but still have a major influence). This could require iterated sampling.
- Always report the number of samples and how they were obtained. If you don't, you could just as well be showing that the probability of throwing a 6 when rolling a die is 100%.
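The remark about the die can be made concrete with a small simulation. The sketch below assumes a fair die simulated with Python's pseudo-random generator; the seed and the sample sizes are arbitrary choices for illustration.

```python
import random

def estimate_p_six(n_rolls, rng):
    """Relative frequency of a 6 in n_rolls simulated rolls of a fair die."""
    sixes = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 6)
    return sixes / n_rolls

if __name__ == "__main__":
    rng = random.Random(2010)  # fixed seed so the run is repeatable
    for n in (1, 10, 100, 10_000):
        print(n, estimate_p_six(n, rng))
```

With a single roll the estimate can only be 0% or 100%, which is why a reported probability means little without the number of samples behind it.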

7 Importance of Hypotheses
- Science and engineering proceed by the formulation of hypotheses and the provision of supporting (or refuting) evidence for them.
- Informatics should be no exception.
- But the provision of explicit hypotheses in Informatics is rare!
- This causes lots of problems:
  - There are usually many possible hypotheses.
  - Ambiguity is a major cause of referee/reader misunderstanding.
  - Vagueness is a major cause of poor methodology (inconclusive evidence, unfocussed research direction).

8 Evaluation begins with claims
Hypotheses in Informatics can be:
- Claims about a task, system, technique or parameter, e.g.:
  - System X performs better than System Y on dimension Z
  - Technique X has property Y
  - X is the optimal setting of parameter Y
- Properties and relations along scientific, engineering or cognitive science dimensions.

9 Scientific Hypotheses
For the first claim (System X performs better than System Y on dimension Z), relevant hypotheses would be:
- Experimental Hypothesis (H1): The mean of the ratings for the new system is higher than the mean of the ratings for the baseline system.
- Null Hypothesis (H0): There is no difference between the mean of the ratings for the new system and the mean of the ratings for the baseline system.
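Written symbolically, the pair of hypotheses might look as follows. The symbols are an assumption made for this sketch: $\mu_{\text{new}}$ and $\mu_{\text{baseline}}$ stand for the mean ratings of the new and the baseline system.

```latex
H_0:\ \mu_{\text{new}} = \mu_{\text{baseline}}
\qquad
H_1:\ \mu_{\text{new}} > \mu_{\text{baseline}}
```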

10 Variables
The data of an experiment is a set of observations characterised by one or more properties that are extracted as variables:
- Independent variable: A variable that indicates something you manipulate in an experiment, or some supposedly causal factor that you can't manipulate, such as Corpus and System in the sentence compression experiment.
- Dependent variable: A variable that indicates to a greater or lesser degree the causal effects of the factors represented by the independent variables. Examples for sentence compression are compression rate (percentage of words removed) and sentence ratings (1-5).

11 Levels of Measurement
Variables can be split into categorical and continuous, and within these types there are different levels of measurement:
- Categorical (entities are divided into distinct categories)
  - Binary variable: There are only two categories
  - Nominal variable: There are more than two categories
  - Ordinal variable: The same as a nominal variable, but the categories have a logical order
- Continuous (entities get a distinct score)
  - Interval variable: Equal intervals on the variable represent equal differences in the property being measured
  - Ratio variable: The same as an interval variable, but the ratios of scores on the scale must also make sense

12 Data as Distributions
A distribution depicts the frequency of each value of a measured variable.
[Figure: tally and bar chart "Distribution of fruit preferences". x-axis: type of fruit (Pear, Orange, Banana, Apple); y-axis: numbers of choosers = frequency; tallied frequencies: 5, 4, 7, 4. A single snapshot of the distribution gives the answer at a glance.]
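Turning raw observations into a frequency distribution can be sketched as below. The answers are made up for illustration: the slide gives the tallies 5, 4, 7 and 4, but which count belongs to which fruit is an assumption here, as are the function names.

```python
from collections import Counter

def frequency_distribution(observations):
    """Count how often each value of a categorical variable occurs."""
    return Counter(observations)

def text_bar_chart(freq):
    """Render the distribution as one line of '#' marks per category."""
    width = max(len(str(k)) for k in freq)
    return "\n".join(
        f"{str(k):<{width}} {'#' * n} {n}" for k, n in freq.most_common()
    )

if __name__ == "__main__":
    # Hypothetical answers to "what is your favourite fruit?"
    answers = ["apple"] * 5 + ["banana"] * 4 + ["orange"] * 7 + ["pear"] * 4
    print(text_bar_chart(frequency_distribution(answers)))
```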

13 Probability Distributions
- Statistics is used to analyse experimental results.
- Probability theory is a mathematical abstraction. To apply it, you need an interpretation of how real-world concepts relate to the mathematics.
- Probability as a degree of belief: P = 1 is certainty; P = 0 is impossibility.
- We write P(X = x) for the probability (mass or density) that the random variable X takes the value x.

14 What do we need to describe about a distribution?
Outline: Central Tendency, Dispersion, Symmetry
- Where is it on the scale axis? = central tendency
- What kind of shape does it have? Does it spread out or bunch up? = dispersion
- Is it symmetrical? = symmetry

15 Representative values: Useful Terms
- Mean: The average value, $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$.
- Median: The value which splits a sorted distribution in half. The 50th quantile of the distribution.
- Quantile: A "cut point" q that divides the distribution into pieces of size q/100 and 1 - (q/100). Examples:
  - The 50th quantile cuts the distribution in half.
  - The 25th quantile cuts off the lower quartile.
  - The 75th quantile cuts off the upper quartile.
- Mode: Most frequent value.
[Example figure: sorted values 1 2 3 7 7 8 14 15 17 21 22 with Median: 8; a second data set with Mean: 93.1 and Median: 11]
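These measures can be computed with Python's standard `statistics` module. The sketch below uses the sorted example values 1 2 3 7 7 8 14 15 17 21 22; note that several conventions for computing quantiles exist, and the one used here is simply the library default, not necessarily the one intended on the slide.

```python
import statistics

data = [1, 2, 3, 7, 7, 8, 14, 15, 17, 21, 22]  # sorted example values

mean = statistics.mean(data)      # sum of the values divided by N
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value
# Quartiles: cut points at the 25th, 50th and 75th quantile
# (library default method, "exclusive").
q25, q50, q75 = statistics.quantiles(data, n=4)

if __name__ == "__main__":
    print(f"mean={mean:.2f} median={median} mode={mode}")
    print(f"quartiles: {q25}, {q50}, {q75}")
```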

16 Reporting a statistic

17 Variables and their CT measures
Scale | Consistent coding labels | Order | Intervals | 0-point | Permitted operations | Examples | Permitted measures of central tendency
Category/nominal | Y | N | N | N | counting frequencies | favourite fruit, part of speech, native language | mode
Ordinal | Y | Y | N | N | counting frequencies, ranking | degree class, letter grade, beg-interm-adv | mode, median
Interval | Y | Y | Y | N | counting frequencies, ranking, +, - | skirt length from knee, shoe size | mode, median, mean
Ratio | Y | Y | Y | Y | counting frequencies, ranking, +, -, ×, ÷, etc. | skirt length from waist, % correct, height in in/cm | mode, median, mean

18 Measures of Dispersion
Some measure of dispersion should always accompany representative values. A good measure of dispersion should:
- take into account all data points;
- describe the average deviation of data points with respect to the mean;
- increase when data heterogeneity increases.
Examples:
- Range = $\max(x) - \min(x)$
- Deviation = difference of a score from the sample mean: $(x_i - \bar{x})$
- Variance = average of squared deviations from the mean: $V = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$ (used when scale is no issue)
- Standard Deviation: $S = \sqrt{V} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$
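The dispersion measures above translate directly into code. This is a minimal sketch that follows the 1/N convention of the slide; the function name and the example values are chosen for illustration.

```python
import math

def dispersion(xs):
    """Range, variance and standard deviation with the 1/N convention."""
    n = len(xs)
    mean = sum(xs) / n
    deviations = [x - mean for x in xs]            # (x_i - mean)
    variance = sum(d * d for d in deviations) / n  # average squared deviation
    return {
        "range": max(xs) - min(xs),
        "variance": variance,
        "std_dev": math.sqrt(variance),
    }

if __name__ == "__main__":
    print(dispersion([1, 2, 3, 7, 7, 8, 14, 15, 17, 21, 22]))
```

Many statistics packages divide by N - 1 instead of N by default (the sample variance), so their output can differ slightly from this version.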

19 Symmetry versus Skew
Describing symmetry requires an inherent order (ordinal scale or better).
- Symmetrical distributions have the same volume on either side of their point of balance.
- Skewed distributions are asymmetrical:
  - Negative skew: pulled out towards low values
  - Positive skew: pulled out towards high values
[Figure: measures of shape, panels (a), (b), (c); a red line marks the point of balance]
Relationship to measures of central tendency: where mean and median are appropriate, positively skewed distributions often have mean > median.

20 Relationship to measures of central tendency
- Where mean and median are appropriate, positively skewed distributions often have mean > median.
- Where mean and median are appropriate, negatively skewed distributions often have mean < median.

21 Symptoms of bad methodology
- Where there is a minimum or maximum score and the distribution is pushed up against it:
  - Ceiling effect (negative skew) (e.g. topmost score)
  - Floor effect (positive skew) (e.g. fastest possible reading time)
- Where there are a few outliers = cases separated from the bulk of cases and from the central tendency.
If you don't examine the distribution of results in such studies, you may be drawing incorrect conclusions from your results. E.g. an outlier affects the mean disproportionately, for example at the college with the highest mean salary for its graduates. (Use standard deviation!)
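The effect of a single outlier on the mean can be seen in a few lines. The salary figures below are invented for illustration and do not come from the lecture.

```python
import statistics

def summarise(values):
    """Return (mean, median, population standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
    )

if __name__ == "__main__":
    # Hypothetical graduate salaries (in thousands); then one outlier is added.
    salaries = [30, 32, 33, 35, 36, 38, 40]
    with_outlier = salaries + [2000]
    print("without outlier:", summarise(salaries))
    print("with outlier:   ", summarise(with_outlier))
```

The median barely moves when the outlier is added, while the mean and the standard deviation change dramatically, so reporting the standard deviation alongside the mean at least signals that something unusual is in the data.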

22 Using statistics to test hypotheses
Outline: Using statistics to test hypotheses, Significance testing, Examples
Hypothesis: A die is crooked.
- I roll it twice, and 6 shows up both times.
Hypothesis: Using Microsoft Windows makes people angry.
- A friend of mine is using Windows and he's always complaining to me about how unstable his computer is.
Hypothesis: Using Microsoft Windows makes people angry.
- I ask 312 VUB students to fill out a questionnaire after using the computer lab, stating which operating system they used and whether they felt happy or angry when leaving the lab. Operating system usage is roughly the same, but while only 12% of Linux and MacOS users felt angry, 37% of Windows users did.

23 Using statistics to test hypotheses
Hypothesis: Using Microsoft Windows makes people angry.
- I run a large-scale evaluation with participants who have to perform standardised tasks on different operating systems. The participants are evenly distributed across all ages; half of them are male and half of them female.
- Right before and right after working at the computer for 30 minutes they undergo a standardised psychological test to evaluate their aggressiveness before and after the task.
- I end up with ordinal results for each participant stating whether they became less (<) or more aggressive (>), or whether their aggressiveness level stayed approximately the same (=).
- The percentage of Windows users with a > result seems disproportionately higher than in the other operating system groups.

24 The plural of anecdote isn't data
- The more data, the better.
- Hypothesis: A coin is biased towards heads.
  [Table with columns N, Heads, Tails]
- A higher N is closer in size to an infinite population.
- Remember: we want to make a general claim.
- But: we want a measure of how much data we need.

25 Significance testing
- We cannot simply compare descriptive parameters; we have to compare entire distributions of data.
- Null Hypothesis Significance Testing:
  - H0 comes from the sampling distribution: we are seeing only variations we would expect to find by chance when sampling the population.
  - H1 is defined against that distribution: the result is very unlikely to belong to the distribution of chance outcomes according to H0.
- In order to have a standard case to compare against (H0), we need a model of the data:
  - For the coin: p(heads) = p(tails) = 0.5
  - For the die: p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
  - To test the effectiveness of a medical treatment: the mortality rate of untreated patients

26 Significance level
- Statistical significance tests give a probability p.
- p = the probability of obtaining data at least as extreme as the given data set if it is simply sampled from the H0 distribution, i.e. if any variation from it can be accounted for by the deviation which we would expect for the sample size N.
- If we want to support our H1, we want this p to be low.
- Typical significance levels: α = .05, .01, .001, .0001 (*, **, ***)
- A low p does not mean that H1 is proven, only that H0 doesn't account well for the observed data.
- Say we run and publish 20 experiments where the result is p < .05. What does this mean?
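One way to approach the closing question is a short calculation. It rests on assumptions that the slide does not state: H0 is in fact true in every one of the 20 experiments, and the experiments are independent. Each test at α = .05 then still comes out "significant" with probability .05:

```latex
\begin{align*}
\mathbb{E}[\text{false positives in 20 tests}] &= 20 \times 0.05 = 1, \\
P(\text{at least one false positive}) &= 1 - (1 - 0.05)^{20} \approx 0.64 .
\end{align*}
```

Under these assumptions, about one "significant" result in 20 is expected from chance alone.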

27 Types of error in statistical hypothesis testing
- Type I error (false positive): reject the null hypothesis when it is actually true
- Type II error (false negative): fail to reject (accept) the null hypothesis when it is actually false
- Since we want to be conservative with the claims we make based on our experiments, we want to keep the Type I error rate α low.
- The lower we set our mandatory significance level α, the higher our Type II error rate β gets: we increase the chance that we reject our H1 when it is actually true.

28 The simplest case with two outcomes: Binomial test
- The math is easy, so we can use the Binomial distribution to get an exact result.
- For the coin example, use B(N, 1/2) to run a one-tailed test.
- How likely is it that we got H or even more heads, assuming the coin is not biased?
  [Table with columns N, Heads, Tails, P(B(N, 1/2) ≥ H)]
- Remember: None of this means that H1 is proven!
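The one-tailed binomial test can be computed exactly by summing the upper tail of the binomial distribution. The sketch below is an illustration only; the function name and the (N, heads) pairs are hypothetical and are not the values from the lecture's table.

```python
from math import comb

def binomial_upper_tail(n, h, p=0.5):
    """P(X >= h) for X ~ B(n, p): one-tailed p-value for 'too many heads'."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(h, n + 1))

if __name__ == "__main__":
    # Hypothetical outcomes: the same 80% share of heads at growing N.
    for n, heads in [(5, 4), (10, 8), (50, 40)]:
        print(n, heads, binomial_upper_tail(n, heads))
```

The same function covers the die example from earlier in the lecture: `binomial_upper_tail(2, 2, 1/6)` gives the chance of two sixes in two rolls of a fair die, which is 1/36.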

29 Multiple nominal outcomes: χ²-test
For more than two possible outcomes and large sample sizes we can't afford to run an exact test but have to use an approximation (e.g. Pearson's χ²-test):
$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
- $O_i$ = observed frequency
- $E_i$ = expected frequency
- $n$ = number of possible outcomes (bins)
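The χ² statistic is a plain sum over the bins. The following sketch computes it for a die and compares it with a tabulated critical value; the observed counts are invented, and the comparison against a table value (instead of computing a p-value) is a simplification made here to keep the example within the standard library.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square: sum over bins of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one entry per bin")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value of the chi-square distribution for alpha = .05 and
# 5 degrees of freedom (6 bins minus 1), taken from a standard table.
CRITICAL_05_DF5 = 11.070

if __name__ == "__main__":
    observed = [5, 8, 9, 8, 10, 20]  # hypothetical counts from 60 rolls of a die
    expected = [60 / 6] * 6          # H0: all six faces are equally likely
    chi2 = chi_square_statistic(observed, expected)
    print(chi2, "reject H0" if chi2 > CRITICAL_05_DF5 else "do not reject H0")
```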

30 Pearson's χ²-test
$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
Additional conditions for Pearson's χ²-test:
- Unrelated design: different individuals in different bins. Otherwise you would have to use a multinomial test. Why?
- Bins are mutually exclusive.
- If the sample size or the expected frequencies for each bin are too low, the approximation of Pearson's χ²-test to a real χ² distribution is not reliable. In these cases Fisher's exact test can be used instead.
Can also be used to check whether 2 observed sample sets are likely to come from the same distribution!

31 Directionality of hypotheses
A hypothesis can be:
- directed/one-tailed: there is a bias in one specific direction; we are only interested in the probability of that one tail of possible outcomes
- undirected/two-tailed: there is some bias; unusually high as well as unusually low outcomes confirm our hypothesis
Some tests (like χ²) can only be used for two-tailed tests. Why? (Hint: only when n = 2 can you inquire about the directionality of the bias)

32 Other univariate significance tests
- For ordinal data: Wilcoxon, Mann-Whitney, Friedman, Kruskal-Wallis, Cohen's Kappa, ...
- For normally distributed interval data: z-test (simply the continuous version of the Binomial distribution/test)
- For normally distributed interval data of which the underlying distribution of H0 is not known: t-test
- For experimental designs involving more than 2 conditions: ANOVA (Analysis of Variance)

33 Multivariate statistics
- So far we have only looked at univariate statistics: only a single dependent variable.
- What about correlations between multiple dependent variables measured in the same experiment?
  - Non-parametric (ordinal): Spearman Rank Order Correlation
  - Parametric (interval/ratio): Pearson Product-Moment Correlation
  - Joint probability distributions, covariance
- Correlation will not tell you the direction of a potential causal relationship.
- Correlation ≠ causation! There is a strong negative correlation between the number of mules and the number of PhDs among American states.
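Both correlation coefficients can be sketched with the standard library alone. In this illustration the Spearman coefficient is computed as the Pearson coefficient of the ranks, with tied values sharing their average rank; the function names are chosen for this example.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]
    ys = [1, 4, 9, 16, 25]  # monotone but not linear in xs
    print(pearson(xs, ys), spearman(xs, ys))
```

For a monotone but non-linear relationship like the one in the usage example, the rank-based coefficient reaches 1 while the Pearson coefficient stays below it.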

34 Linear regression
- Simple linear regression: treating Y as a function of X
- Multiple linear regression: treating Y as a function of any number of Xs
- Different from multivariate statistics: we are investigating the conditional rather than the joint probability distributions.
- Outcome:
  - Linear model: $Y = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$
  - Quantitative measure of the strength (significance) of the relationship between Y and every $X_i$
- Just like with correlation coefficients, only linear relationships can be detected.
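For the simple case with one X, the least-squares estimates have a closed form. The sketch below assumes a model with an explicit intercept term (equivalent to adding a constant regressor to the linear model); the data points and the function name are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

if __name__ == "__main__":
    # Hypothetical measurements, roughly following y = 1 + 2x.
    xs = [1, 2, 3, 4, 5]
    ys = [3.1, 4.9, 7.2, 8.8, 11.0]
    intercept, slope = fit_simple_linear_regression(xs, ys)
    print(f"Y = {intercept:.2f} + {slope:.2f} * X")
```

This only estimates the coefficients; judging the significance of the relationship additionally requires a test on the slope, which is left out of this sketch.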

35 Statistical significance tests in practice
- Widespread in psychology and medicine
- Controlled experiment setups (often geared towards suiting a particular test...)
- Software: SPSS, MATLAB, R, ...
- Sometimes tests are run on data that they aren't actually made for...
  - Interval and ratio data can be binned in order to run ordinal and nominal tests on them (e.g. χ²).
  - The conditions for many tests aren't strictly adhered to, and there are a number of established corrections that have been proven to work in practice (e.g. Yates' correction for χ² with low expected frequencies, ...).
- Many experimental setups (e.g. complex interactions in multi-agent systems) are hard to capture with statistical tests.

36 Final thoughts on statistical significance tests
- Not all correlations are interesting, relevant or important.
  - You shouldn't run random or exhaustive tests.
  - Testing should be motivated by your theory and hypotheses.
  - Results should be analysed and interpreted in terms of your theory (and beyond: keep thinking!).
- Statistical parameters don't capture everything.
  - Eyeball your data closely before running tests; it can give you important clues about what you actually want to look for.
  - A picture is worth a thousand words.

37 Exercise
Running and reporting some simple significance tests using R
- You can download it from http://www.r-project.
- A manual can be found here: co.uk/education/lectures/r/basics.htm
- Data files are easily loadable into R via scan() and read.csv().
- Identify appropriate tests and run them.
- Report the reasons for your choice of tests, the commands you ran and the results in a report (max. 1 page).
