Statistiek I ATW, CIW, IK John Nerbonne, spreekuur H , di. 11:15-12

Similar documents
Samples and Populations Confidence Intervals Hypotheses One-sided vs. two-sided Statistical Significance Error Types. Statistiek I.

Basics on t-tests Independent Sample t-tests Single-Sample t-tests Summary of t-tests Multiple Tests, Effect Size Proportions. Statistiek I.

CIVL 7012/8012. Collection and Analysis of Information

Chapter 2: Tools for Exploring Univariate Data

compare time needed for lexical recognition in healthy adults, patients with Wernicke s aphasia, patients with Broca s aphasia

SESSION 5 Descriptive Statistics

Analysis of Variance Inf. Stats. ANOVA ANalysis Of VAriance. Analysis of Variance Inf. Stats

Chapter 2 Class Notes Sample & Population Descriptions Classifying variables

MATH 117 Statistical Methods for Management I Chapter Three

MATH 1150 Chapter 2 Notation and Terminology

Vocabulary: Samples and Populations

2011 Pearson Education, Inc

TOPIC: Descriptive Statistics Single Variable

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

BNG 495 Capstone Design. Descriptive Statistics

(quantitative or categorical variables) Numerical descriptions of center, variability, position (quantitative variables)

Statistiek II. John Nerbonne using reworkings by Hartmut Fitz and Wilbert Heeringa. February 13, Dept of Information Science

MATH 10 INTRODUCTORY STATISTICS

Unit 2. Describing Data: Numerical

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 3.1- #

Introduction to Statistics

After completing this chapter, you should be able to:

Practice problems from chapters 2 and 3

CHAPTER 1. Introduction

STAT 200 Chapter 1 Looking at Data - Distributions

LC OL - Statistics. Types of Data

Z score indicates how far a raw score deviates from the sample mean in SD units. score Mean % Lower Bound

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data

Review for Exam #1. Chapter 1. The Nature of Data. Definitions. Population. Sample. Quantitative data. Qualitative (attribute) data

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

The empirical ( ) rule

Descriptive Statistics Class Practice [133 marks]

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

AP Final Review II Exploring Data (20% 30%)

Chapter 2 Solutions Page 15 of 28

Continuous random variables

STT 315 This lecture is based on Chapter 2 of the textbook.

Probabilities and Statistics Probabilities and Statistics Probabilities and Statistics

Descriptive statistics

16.400/453J Human Factors Engineering. Design of Experiments II

Descriptive Statistics-I. Dr Mahmoud Alhussami

Sampling, Frequency Distributions, and Graphs (12.1)

Descriptive Univariate Statistics and Bivariate Correlation

A is one of the categories into which qualitative data can be classified.

Elementary Statistics

Chapter 3. Data Description

Example 2. Given the data below, complete the chart:

Module 1. Identify parts of an expression using vocabulary such as term, equation, inequality

Measures of Central Tendency and their dispersion and applications. Acknowledgement: Dr Muslima Ejaz

Descriptive Data Summarization

Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

Announcements. Lecture 1 - Data and Data Summaries. Data. Numerical Data. all variables. continuous discrete. Homework 1 - Out 1/15, due 1/22

1. Exploratory Data Analysis

Chapter 2. Mean and Standard Deviation

Describing distributions with numbers

6 THE NORMAL DISTRIBUTION

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Overview. INFOWO Statistics lecture S1: Descriptive statistics. Detailed Overview of the Statistics track. Definition

2.0 Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

FREQUENCY DISTRIBUTIONS AND PERCENTILES

Introduction to Statistics

GRACEY/STATISTICS CH. 3. CHAPTER PROBLEM Do women really talk more than men? Science, Vol. 317, No. 5834). The study

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

Chapter 4. Displaying and Summarizing. Quantitative Data

Describing Distributions With Numbers Chapter 12

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004

Summarizing Measured Data

Statistics for Managers using Microsoft Excel 6 th Edition

SUMMARIZING MEASURED DATA. Gaia Maselli

Midrange: mean of highest and lowest scores. easy to compute, rough estimate, rarely used

P8130: Biostatistical Methods I

Types of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

21 ST CENTURY LEARNING CURRICULUM FRAMEWORK PERFORMANCE RUBRICS FOR MATHEMATICS PRE-CALCULUS

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

PubHlth 540 Fall Summarizing Data Page 1 of 18. Unit 1 - Summarizing Data Practice Problems. Solutions

Determining the Spread of a Distribution

Draft Proof - Do not copy, post, or distribute

Determining the Spread of a Distribution

Unit 1: Statistics. Mrs. Valentine Math III

Units. Exploratory Data Analysis. Variables. Student Data

8. f(x)= x 3 + 9x 2 + 6x 56

Descriptive Statistics Solutions COR1-GB.1305 Statistics and Data Analysis

Description of Samples and Populations

Statistics 1. Edexcel Notes S1. Mathematical Model. A mathematical model is a simplification of a real world problem.

Scales of Measuement Dr. Sudip Chaudhuri

Why Is It There? Attribute Data Describe with statistics Analyze with hypothesis testing Spatial Data Describe with maps Analyze with spatial analysis

Histograms allow a visual interpretation

Chapter 2: Summarizing and Graphing Data

Descriptive Statistics C H A P T E R 5 P P

M 140 Test 1 B Name (1 point) SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 75

Chapter 1 - Lecture 3 Measures of Location

Frequency Distribution Cross-Tabulation

STRAND E: STATISTICS. UNIT E4 Measures of Variation: Text * * Contents. Section. E4.1 Cumulative Frequency. E4.2 Box and Whisker Plots

Course Review. Kin 304W Week 14: April 9, 2013

Transcription:

Statistics Statistiek I ATW, CIW, IK John Nerbonne, JNerbonne@rugnl, spreekuur H1311436, di 11:15-12 Eleanora Rossi, ERossi@rugnl Mik van Es, MvanEs@rugnl 1

Statistics Statistics collecting, ordering, analyzing data Why in general? Wherever studies are empirical (involving data collection), and where that data is variable Most areas of applied science require statistical analysis General education eg, political, economic discussion is statistical (see newspapers) 2

Why Statistics in Humanities Studies? Linguistics Experiments inter alia in communications, information science, linguistics Characterizing geographical, social, sexual s Processing uncertain input speech, OCR, text(!) History, esp social, economic advantages of agriculture (over hunting)? economic benefits of slavery (to slaveholders) colonialism and development Literature Characteristics of authors, genres, epochs diction; sentence structure, length Authorship studies (eg Federalist Papers) Stemmata in philology (RuG diss, JBrefeld) Availability of online data increases opportunities for statistical analysis! 3

Statistics in Humanities This Course Practical approach Emphasis on statistical reasoning Understand uses (in other courses) Conduct basic statistical analysis Look at data before and during stat analysis De-emphasis on mathematics no prerequisite Use of SPSS Illustrates concepts, facilitates learning (eventually) Bridge to later use simpler Topics, examples from Humanities studies 4

Formal Requirements Weekly lecture (attendance required) Five exercises with SPSS (labs) Six weekly quizzes One exam (in het Nederlands) Grades Lectures (5%) Attendance required at all lectures Check based on at least five (of seven) times Quizzes (5%) wwwletrugnl/nerbonne/teach/statistiek-i SPSS Labs (15%); Complete/Incomplete (50% if late less one week) Exam (75%) 5

Role of Labs Walk through case studies Think through what statistical software is demonstrating Acquire facility with SPSS Practice statistical reporting How to approach labs Chance to try out ideas from lecture, book Ask whether your labs jibe with theory How to waste time with labs Copy results from others Go through the motions without thinking 6

Descriptive Statistics Descriptive Statistics describe data without trying to make further conclusions Example: describe average, high and low scores from a set of test scores Purpose: characterizing data more briefly, insightfully Inferential Statistics describe data and its likely relation to a larger set Example: scores from sample of 100 students justify conclusions about all Purpose: learn about large population from study of smaller, selected sample, esp where the larger population is inaccessible or impractical to study Note sample vs population 7

Common Pitfalls ignoratio elenchi: (missing the point) the most common error in arguments involving statistics is not mathematical or even technical Most common error: getting off track L is a better cold medicine It kills 10% more germs Retail food is a rough business Profit margins are as low as 2%! XXX is completely normal 317% of the population reports that they have engaged in XXX Of course, this is not limited to statistical argumentation! 8

Terminology We refer to a property or a measurement as a variable, which can take on different values Variable Typical Values height 170 cm, 171 cm, 183 cm, 197 cm, sex male, female reaction time 305 ms, 3762 ms, 497 ms, 5039 ms, language Dutch, English, Urdu, Khosa, corpus frequency 000205, 000017, 000018, age 19, 20, 25, Variables tell us the the properties of individuals or cases 9

A More Formal View Terminology: we speak of CASES, eg, Joe, Sam, and VARIABLES, eg height (h) and native language (l) Then each variable has a VALUE for each case, h j is Joe s height, and l s is Sam s native language When we examine relations, we always examine the realization of two variables on each of a group of cases height vs weight on each of a group of Dutch adults effectiveness vs a design feature of group of web sites, eg use of menus, use of frames, use of banners pronunciation correctness vs syntactic category of a group of words phonetic vs geographic distance on a group of pairs of Dutch towns 10

Tabular Presentation Example: A test is given to students of Dutch from non-dutch countries Variables: Variable Values area of origin EUrope, AMerica, AFrica, ASia test score 0-40 sex Male, Female Here is part of the results area score sex EU 22 M AM 21 F Three variables, where only score is numeric, & others nominal Each row is a CASE Tables show all data, which is nice, but large tables are not insightful 11

Coding It is often necessary to code information in a particular way for a particular software package In general, SPSS allows fewer manipulations and analyses for data coded in letters Use numbers as a matter of course This causes us to recode area of origin and sex, since these were coded in letters area of origin EUrope AMerica AFrica ASia 0 1 2 3 sex Male Female 1 2 Notate bene: this is a weakness in SPSS In general, it is good practice to use meaningful codings But in SPSS, this will limit what you can do use numbers! 12

Classifying It is also sometimes useful to group numeric values into classes We ll group score into 0-16 (beginner), 17-24 (advanced beginner), 25-32 (intermediate), and 33-40 (advanced) area score sex score class 0 22 1 1 1 21 2 1 2 15 2 0 3 26 1 2 Grouping numerical information into classes loses information Care! Reminder: area of origin EUrope AMerica AFrica ASia 0 1 2 3 sex Male Female 1 2 13

Data/Measurement Scales nonnumeric scales nominal, ordinal numeric scales interval, ratio, etc Scale determines type of statistics possible We can average numeric data, but not non-numeric data We speak of the average height of an individual (numeric), but not his average native language (nonnumeric) 14

Variable Subtypes Non-numeric nominal/categorical categorized, but not ordered: male, female part of speech, POS in linguistics, eg noun, verb, countries, languages, type of artefact, ordinal ordered (ranked), but s not comparable rank listing of job candidates lots of test scores! marks of satisfaction, agreement, etc Circle the answer that most closely fits Taxes must decline 1 2 3 4 5 "strongly "strongly agree" disagree" 15

Variable Subtypes Numeric interval ordered, s comparable, but no true zero (needed for multiplication) temperature (in Celsius of Fahrenheit) ratio like interval plus zero available frequency of occurence, eg 3 times per week height, weight, age elapsed time, reaction time logarithmic like ratio, but successive intervals multiply in size Richter scale in earthquakes loudness (auditory perception) improvement (in error) rates (often) 16

Measures of Central Tendency mode most frequent element the only meaningful measure for nominal data median half of cases are above, half below the median available for ordinal data mean arithmetic average x = x 1 + x 2 + + x n n 1 nx n i=1 µ for populations, m (and x) for samples x i 17

Measures of Central Tendency need not coincide from How to Lie with Statistics 18

x-ile s Quartiles, quintiles, percentiles divide a set of scores into equal-sized groups quartiles: 37 68 78 90 49 71 79 90 54 71 79 90 56 73 83 92 60 75 83 94 64 76 85 95 65 77 87 96 65 77 88 97 q 1 1 st quartile -dividing pt between 1 st & 2 nd groups; q 2 div pt 2 nd & 3 rd (= median!) percentiles: divide into 100 groups thus q 1 = 25th percentile, median = 50th, Score at nth percentile is better than n% of scores 19

Measures of Variation none for nonnumeric data! why? minimum, maximum lowest, highest values range difference between minimum and maximum interquartile range (q 3 q 1 ) center where half of all scores lie semi-interquartile range (q 3 q 1 )/2 box-n-whiskers diagram showing q 2 & q 3, range sometimes median included 20

Visualizing Variation box-n-whiskers diagram showing q 2 & q 3, range; sometimes median included 40 30 toets nlvoor anderstalige 20 10 0 N = 10 Europa 10 America 10 Africa 10 Azie gebied van afkomst Test results Dutch for Foreigners for four groups of students Boxes show q 3 q 1, line is median Whiskers show first and last quartiles 21

Measures of Variation deviation is difference between observation and mean variance average square of deviation σ 2 = 1 n 1 nx (x i x) 2 i=1 standard deviation square root of variance σ = σ 2 σ 2 for population, s 2 for sample square allows orthogonal sources of deviation (error) to be analyzed e 2 = e 2 1 + e 2 2 + + e2 n 22

Other Statistical Measures skew scheefheid measure of balance of distribution = 8 < : if more on left of mean 0 if balanced + if more on right kurtosis relative flatness/peakedness in distribution = 8 < seen in SPSS, not used further in this course : if relatively flat 0 if as expected + if peak is relatively sharp 23

Other Measures index numbers eg, Consumer Price Index, Composite Index of Leading Indicators, Producer Price Index, measures the value of a variable relative to its value at a base period Example an apple cost Dfl 020 in 1990 but Dfl 022 in 1995 The apple price index in 1995 with 1990 as base is: 22 20 100 = 110 always relative to some fixed base therefore not per annum percentage changes exception: one year after base real (composite) indices are weighted averages of simple indices weight reflecting relative share of costs, values 24

Standardized Scores Tom got 112, and Sam only got 105 What do scores mean? Knowing µ, σ one can transform raw scores into standardized scores, aka z- scores: z = x µ σ = deviation standard deviation 25

Standardized Scores Suppose µ = 108, σ = 10, then z 112 = 112 108 10 04 z 105 = 105 108 10 03 z shows distance from mean in number of standard deviations 26

Standardized Scores If we transform all raw scores into z-scores using: z = x µ σ = deviation standard deviation We obtain a new variable z, whose z-score = distance from µ in σ s mean is 0 standard deviation is 1 uses: sampling, hypothesis testing 27

Toward Distributions DISTRIBUTION is the pattern of variation of a variable Example: Number of health web-site visitors for 57 consecutive days 279 244 318 262 335 321 165 180 201 252 145 192 217 179 182 210 271 302 169 192 156 181 156 125 166 248 198 220 134 189 141 142 211 196 169 237 136 203 184 224 178 279 201 173 252 149 229 300 217 203 148 220 175 188 160 176 128 stem n leaf diagram sorts by most significant (leftmost) digit rightmost digit As above, ignoring 1 2233444445566666777778888889999 2 000011112222344556777 3 00123 28

Displaying Distributions Histograms show how frequently all values appear, often require categorization into small number of ranges ( 10) Look for general pattern, outliers, symmetry/skewness 29

Time Series Same variable at regular intervals eg, indices, web site visits, Change often focus of attention 30

Special Moving Averages Some measures fluctuate due to weather, business cycles, chance moving average sums over overlapping intervals to eliminate some effects of fluctuation Year Export 5-yr Ave 6-yr Ave 1855 957 1856 1158 1857 1220 1161 1858 1166 1241 1218 1859 1304 1260 1250 1860 1359 1264 1277 1861 1251 1324 1334 1862 1240 1384 1400 1863 1465 1444 1864 1604 1865 1658 from JTLindblad Statistiek voor Historici 31

Distribution Functions Frequency distributions verdelingen show how often various values occur absolute frequency How many times values are seen, eg, 16 men, 24 women relative frequency What percentage or fraction of all occurrences, eg, 40% (= 16/40) men, 60% (= 24/40) women Example: relative frequency of an honest die P (d = d) Distribution of die toss where f 3, is paid per dot 1/6 3 6 9 12 15 18 32

Distribution Functions cumulative frequency how often values at least as large as a given value occur Example: cumulative relative frequency of an honest die P (d d) 1 Cumulative distribution of d, = expectation that d d 1/2 1/6 3 6 9 12 15 18 33

Numeric Variables Most numeric variables take any number of values (Ordinal) variables that take more than about 7 values are often analysed as numeric eg, test scores We display their frequency distributions by grouping values Grouped frequencies in reaction time experiment 40 30 20 10 2 4 6 8 1 12 sec 34

Density Displays Example: reaction time results appear to fit on the curve 3 f(x, 06, 015) 25 2 15 1 05 0 0 02 04 06 08 1 12 Most very close to 06 sec (600ms) interpret as p% of reaction times = 600ms 700ms reaction time 25% maybe no reaction time was exactly 600ms 35

Density Displays 3 f(x, 06, 015) 25 2 15 1 05 0 0 02 04 06 08 1 12 Interpretation: plot frequency DENSITY, so area under curve corresponds to percentage of values that fall within area 36

Probability Density Functions assign (fractional) values to events, 0 P (e) 1, where an event is a collection of (possible) occurences sum to one (all possible events) R P (x)dx = 1 lots of possibilities, most famously normal distributions bell-shaped curve µ-3σ µ-2σ µ-σ µ µ+σ µ+2σ µ+3σ 37

Normal Curve In normal distribution, the mean is always exactly at the center, and the standard deviations appear at fixed proportions We refer to a particular normal curve using the mean and standard deviation, N(µ, σ), eg, N(100, 16) (the distribution of IQ s) µ-3σ µ-2σ µ-σ µ µ+σ µ+2σ µ+3σ Very important in statistics because sample averages are always normally distributed 38

Normal Curve Interpretation of normal curve fixed for standardized variables (z): 95% z = 1645 In every normal curve, 95% of the mass is under the curve below the point which is 1645 standard deviations above the mean 39

Normal Curve Tables See M&M, Tabel A, pp696-97 z 00 01 02 03 04 05 06 16 9452 9463 9474 9484 9495 9505 9515 where z is the standardized variable: z = x µ σ = deviation standard deviation 40

Interpreting z-scores If distribution is normal, then standardized scores correspond to percentiles z 00 01 02 03 04 05 06 16 9452 9463 9474 9484 9495 9505 9515 Table specifies the correspondence ( 100), containing the fraction of the frequency distribution less than the specified z value Tables in other books give, eg, 1 (Percentile 100) 41

Interpreting z- Scores Typical questions, where tables can be applied P (z > 15) =? What s the chance of a z value greater than 15? P (z 15) =? P (z 15) =? P ( 1 z 1) =? We assume normally distributed variables Exercises: Interpretation of Normal Distribution 42

Is the Distribution Normal? Some statistical techniques can only be applied if the data is (roughly) normally distributed, eg, t-tests, ANOVA How can one check whether the data is normally distributed? Normal Quantile Plots show (roughly) straight lines if data is (roughly) normal Sort data from smallest to largest showing its organisation into quantiles Calculate the z-value that would be appropriate for the quantile value (normalquantile value), eg, z = 0 for 50 th percentile, z = 1 for 16 th, z = 2 for 975 th, etc Plot data values against normal-quantile values 43

Normal Quantile Plots Example: Verbal reasoning scores of 20 children 90 verbal reasoning score 80 70 60-2 -1 0 1 2 Normal Distribution Plot expected normal distribution quantiles (x axis) against quantiles in samples If distribution is normal, the line is roughly straight Here: distribution roughly normal M&M show normal quantile values on x-axis, SPSS on y but check is always for straight line 44

Next Samples 45