INFOWO Statistics lecture S1: Descriptive statistics
Peter de Waal
Department of Information and Computing Sciences, Faculty of Science, Universiteit Utrecht

Overview
  Introduction to statistics
  Descriptive statistics

Detailed overview of the Statistics track
  S1 Descriptive statistics
  S2 Scores and probability distributions
  S3 Hypothesis testing and t-test
  S4 More t-tests
  S5 Correlation and prediction
  M5 Homogeneity and reliability
  S6 Analysis of variance
  S7 Chi-square ($\chi^2$) test
  Q&A lecture

Definition
  Statistics: The study of the collection, organization, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments. (from Wikipedia)
Statistics are everywhere
  For: information, argumentation, infotainment, commercial purposes
  [Figures: "Use of equipment for mobile internet"; "Frequent e-shoppers by gender and age"]

The usefulness of statistics
  To contribute to the accuracy and reliability of the evidence with which we argue for our ideas:
  Summarise and systematise data.
  Interpret research findings on the basis of numbers: Is there a systematic factor behind observed differences? Are heavy Facebook users more assertive/aggressive/autistic?
  Bridge the gap between sample and population (statistical inference): Can we generalise our findings from this group to all students?

The bad reputation of statistics
  Complicated and difficult
  Biased predictions
  Varying definitions
  Distorted images
  False conclusions...
  But statistics can be fun too!
  Advice: Keep up with the course. Test yourself.

Distorting images: UU Jaarbeeld 2012
  [Figure: "Core figures" infographic from the UU Jaarbeeld 2012, "When well-placed, students flourish"]
                                                   2011      2012
  RESEARCH
    Scientific publications                        7773      8114
    PhD degrees                                     485       518
    Indirect and contract funding (in millions)     190       194
  STAFF
    Appointed professors                             70        72
    (label lost in source)                          313       301
    Academic staff (in FTE)                        2919      2828
    Support and administrative staff (in FTE)      2376      2278
  TEACHING
    Student enrolment                            30,449    29,755
    Bachelor's programmes                            45        45
    Master's programmes                             100        75
    Teacher training programmes                      32        20
  GRANTS
    ERC Advanced                                      2         3
    ERC Starting                                      3         7
    VICI                                              6         6
  FINANCES
    Budget (x 1000)                             767,354
What is measured?
  Objects: Things
    Concrete things: people, students, companies, books, cars, countries...
  Properties: Characteristics of objects
    Physical properties: weight, height, posture
    Psychological properties: attitude, intelligence, opinion
    Social properties: status, number of friends, peer-group pressure...
  Measurements: indicants of properties (of objects)

Measurement scales for variables
  Nominal
  Ordinal
  Interval
  Ratio

Nominal scale
  Comparison possible for: (in)equality
  Values are exhaustive and mutually exclusive
  Example: Gender

Ordinal scale
  Comparison possible for: (in)equality, order
  Example: Highest attained education:
    1 primary school
    2 high school
    3 university
Interval scale
  Comparison possible for: (in)equality, order, distance/difference (equality of differences)
  No natural zero value!
  Example: Temperature in °C

Ratio scale
  Comparison possible for: (in)equality, order, distance/difference, proportion (equality of ratios)
  Has natural zero value, and no negative values!
  Example: Weight

Measurement scale?
  Apple growing areas by variety
  [Figure]

Measurement scale?
  Rank of students on final grade of INFOWO:
    1 Jansen
    2 Pietersen
    3 Jones
    4 ...
    76 Zijlstra
Measurement scale?
  Age (years): Indicate your age (tick one box!):
    1: 15-24
    2: 25-34
    3: 35-44
    4: 45-54
    5: 55-64
    6: 65 or older
  Questions: What is the measurement scale? Why would you want to measure age like this?

Measurement scale?
  Caracal course evaluation:
  Question: "I learned a lot during the lecture (so far)":
    Totally disagree  1  2  3  4  5  Totally agree

Summarizing data
  Descriptive measures:
    Frequency measurements
    Measure of location/central tendency
    Measure of dispersion
    Measures of shape

Frequency measurements (Frequency table)
  Indicates how often different values occur in measurements.
  Example: Consumer choice of smartphone type
    Absolute frequencies: 13 (out of 42)
    Relative frequencies: 26.5%, or 0.265. Also called: proportion.
Frequency measurements (Pie chart)
  Relative frequencies: percentages
  Example: Consumer choice of smartphone type
    Absolute frequencies: 13 (out of 42)
    Relative frequencies: 26.5%, or 0.265. Also called: proportion.

Frequency measurements (Frequency graph)
  Example: Consumer choice of smartphone type (same frequencies as in the pie chart)

Frequency Tables in SPSS
  How-to: Menu Analyze > Descriptive Statistics > Frequencies
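The lecture builds frequency tables in SPSS. As a rough counterpart, the sketch below shows how absolute and relative frequencies could be computed in Python. The survey answers and the helper name `frequency_table` are hypothetical and only serve as illustration; they are not the lecture's data set.

```python
from collections import Counter

# Hypothetical survey answers (illustrative only, not the lecture's data).
choices = ["Android", "iPhone", "Android", "Other",
           "iPhone", "Android", "Android", "Other"]


def frequency_table(values):
    """Return {value: (absolute frequency, relative frequency)}."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, count / total) for value, count in counts.items()}


table = frequency_table(choices)

# Print one row per value: label, absolute frequency, percentage.
for value, (absolute, relative) in sorted(table.items()):
    print(f"{value:<8} {absolute:>3} {relative:6.1%}")
```

The relative frequencies always sum to 1 (100%), which is a quick sanity check on any frequency table.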
Frequency Bar Graph in SPSS
  How-to: Menu Analyze > Descriptive Statistics > Frequencies

Frequencies: Histogram in SPSS
  How-to: Menu Analyze > Descriptive Statistics > Frequencies

Percentiles
  Percentile: The score of the n-th percentile ($P_n$) is the score below which n% of the scores in the distribution lie and above which (100 - n)% lie.
  Example: $P_{90} = 189$ means that 90% of the scores have a value below 189 and 10% have a value above 189.
  Frequently used percentiles are:
    $P_{25}$: First quartile
    $P_{50}$: Second quartile (also: median)
    $P_{75}$: Third quartile

Percentiles: example
  Age   Frequency   Cumulative   Percentile
  23        1           1           12.5
  24        3           4           50.0
  25        2           6           75.0
  26        0           6           75.0
  27        2           8          100.0
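The percentile table above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and `percentile_score` assumes one particular convention (the smallest observed value whose cumulative percentage reaches n). Statistical packages use several slightly different conventions.

```python
# Frequency table from the percentile example: age -> frequency.
age_frequencies = {23: 1, 24: 3, 25: 2, 26: 0, 27: 2}


def cumulative_percentages(frequencies):
    """Return rows of (value, frequency, cumulative frequency, cumulative %)."""
    total = sum(frequencies.values())
    rows = []
    running = 0
    for value in sorted(frequencies):
        running += frequencies[value]
        rows.append((value, frequencies[value], running, 100 * running / total))
    return rows


def percentile_score(frequencies, n):
    """Smallest observed value whose cumulative percentage is at least n.

    Assumed convention; other definitions interpolate between values.
    """
    for value, frequency, _, cumulative_pct in cumulative_percentages(frequencies):
        if frequency > 0 and cumulative_pct >= n:
            return value
    raise ValueError("n must lie between 0 and 100")


rows = cumulative_percentages(age_frequencies)
for row in rows:
    print(row)
```

Under this convention the median ($P_{50}$) of the example is the age at which the cumulative percentage first reaches 50.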
Summarizing data
  Descriptive measures:
    Frequency measurements
    Measure of location/central tendency
    Measure of dispersion
    Measures of shape

Frequency graph versus histogram
  [Figure]

Measures of location / central tendency
  Purpose:
    Identify the center of the distribution
    Identify the best representative score
  Mode: Most frequently occurring value
    Bimodal/multimodal: more than one value is most frequent
  Median: Midpoint of the distribution
    Insensitive to outliers (unlike the mean)
  Mean: Equilibrium or balance point of the distribution

Median: Midpoint of the distribution
  The median represents the midpoint of the scores in a distribution when they are listed in order from smallest to largest.
  The median equals the 50th percentile ($P_{50}$).
  The median divides the scores into two groups of equal size.
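As an illustration of the mode and median definitions above, the following sketch uses Python's standard `statistics` module. The score lists are hypothetical and chosen only to show a single mode, two modes, and a median taken from an even number of scores.

```python
import statistics

# Hypothetical scores (illustrative only).
scores = [3, 7, 7, 2, 9, 4, 5]      # one most frequent value
bimodal = [1, 2, 2, 3, 4, 4, 5]     # two values tie for most frequent
even_count = [1, 2, 3, 4]           # even number of scores

# multimode returns every most frequent value, so it also covers
# bimodal and multimodal distributions.
modes = statistics.multimode(scores)
both_modes = statistics.multimode(bimodal)

# The median is the middle score after sorting; with an even number of
# scores, statistics.median averages the two middle scores.
median_scores = statistics.median(scores)
median_even = statistics.median(even_count)

print(modes, both_modes, median_scores, median_even)
```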
Mean: Balance point of distribution
  Population: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
  Sample: $\bar{X} = M = \frac{\sum_{i=1}^{n} X_i}{n}$

Population versus sample
  Why are there two formulas for the mean?
  Population
    Set of all the individuals of interest in a particular study.
    The size of the population is usually denoted as $N$.
    The mean $\mu$ is a parameter of the population, and usually unknown.
  Sample
    Selection of individuals from a population, usually to represent the population in a particular study.
    The size of the sample is usually denoted as $n$.
    The mean $\bar{X}$ is a statistic, a value obtained from the sample, which is used as an estimate for the unknown population parameter.

Mean versus median
  Example: Sample 1 2 2 3 5 6 7 8 11   Mean: 5   Median: 5
  Example: Sample 1 2 2 3 5 6 7 8 20   Mean: 6   Median: 5

Which measure for which scale?
  Nominal:  Mode
  Ordinal:  Mode, Median
  Interval: Mode, Median, Mean
  Ratio:    Mode, Median, Mean
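The two samples from the mean versus median example can be checked directly. The sketch below also encodes the "which measure for which scale" overview as a lookup table; the names `PERMITTED_MEASURES` and `is_permitted` are hypothetical helpers introduced here for illustration.

```python
import statistics

# The two samples from the "Mean versus median" example.
sample_a = [1, 2, 2, 3, 5, 6, 7, 8, 11]
sample_b = [1, 2, 2, 3, 5, 6, 7, 8, 20]   # largest score replaced by an outlier

mean_a, median_a = statistics.mean(sample_a), statistics.median(sample_a)
mean_b, median_b = statistics.mean(sample_b), statistics.median(sample_b)

# The outlier pulls the mean upwards but leaves the median unchanged.
print(mean_a, median_a)
print(mean_b, median_b)

# Location measures per measurement scale, as listed in the overview.
PERMITTED_MEASURES = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean"},
}


def is_permitted(measure, scale):
    """Return True if the location measure is meaningful for the scale."""
    return measure in PERMITTED_MEASURES[scale]
```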
Measures of spread / dispersion / variability
  Only for interval or ratio scales!
  Range: Difference between largest and smallest score of distribution.
  Interquartile range (IQR): Difference between first and third quartiles of distribution.
  Variance: A weighted sum of the squared deviations from the mean.
  Standard deviation: Square root of the variance.

Range: Example 1
  What is the range for this frequency distribution? And the IQR?

  Age in years   Frequency   Cumulative percent
  18                 1              5.0
  20                 1             10.0
  22                 1             15.0
  28                 2             25.0
  32                 2             35.0
  41                 2             45.0
  48                 1             50.0
  53                 3             65.0
  57                 2             75.0
  62                 1             80.0
  66                 2             90.0
  70                 2            100.0

Range: Example 2A
  [Figure]

Range: Example 2B
  [Figure: age in years]
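For the age table of Range: Example 1, the range and IQR can be worked out as in the sketch below. It assumes the same cumulative-percentage convention as the percentile table (the quartile is the smallest age whose cumulative percentage reaches 25% or 75%); software that interpolates between values may report slightly different quartiles. The function name `percentile_score` is hypothetical.

```python
# Age frequency table from "Range: Example 1": age -> frequency.
age_frequencies = {18: 1, 20: 1, 22: 1, 28: 2, 32: 2, 41: 2,
                   48: 1, 53: 3, 57: 2, 62: 1, 66: 2, 70: 2}


def percentile_score(frequencies, n):
    """Smallest observed value whose cumulative percentage is at least n.

    Assumed convention, matching the cumulative percent column of the table.
    """
    total = sum(frequencies.values())
    running = 0
    for value in sorted(frequencies):
        running += frequencies[value]
        if frequencies[value] > 0 and 100 * running / total >= n:
            return value
    raise ValueError("n must lie between 0 and 100")


observed = [age for age, freq in age_frequencies.items() if freq > 0]

# Range: largest minus smallest observed score.
score_range = max(observed) - min(observed)

# IQR: third quartile minus first quartile.
first_quartile = percentile_score(age_frequencies, 25)
third_quartile = percentile_score(age_frequencies, 75)
iqr = third_quartile - first_quartile

print(score_range, first_quartile, third_quartile, iqr)
```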
Variance and standard deviation
  Variance:
    Population (parameter): $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
    Sample (statistic): $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
  Notice the differences in the formulas!!

Sum of squares
  Population and sample variance use the same sum of squared deviations, or Sum of Squares (SS) for short:
    $SS = \sum_{i=1}^{N} (X_i - \mu)^2$ (Population)
  or
    $SS = \sum_{i=1}^{n} (X_i - \bar{X})^2$ (Sample)
  This term will re-appear in later chapters.

Degrees of freedom
  Population variance:
    Mean is known
    Deviations are computed from a known mean
  Sample variance as estimate of population:
    Population mean is unknown
    Using sample mean restricts variability
  Degrees of freedom: Number of scores in sample that are independent and free to vary.
    Degrees of freedom: $df = n - 1$.

Variance and standard deviation
  Variance: Average squared distance from the mean (formulas as above).
  Standard deviation:
    Population (parameter): $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
    Sample (statistic): $s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
  Measured in the same dimension as the mean.
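The difference between the population and sample formulas is easiest to see on a small data set. The sketch below reuses the first sample from the mean versus median example (1 2 2 3 5 6 7 8 11) as an illustration; treating the same scores once as a population and once as a sample is an assumption made purely for comparison. `sum_of_squares` is a hypothetical helper name.

```python
import math
import statistics

# First sample from the "Mean versus median" example (mean 5).
scores = [1, 2, 2, 3, 5, 6, 7, 8, 11]


def sum_of_squares(data):
    """Sum of squared deviations from the mean of the data (SS)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


ss = sum_of_squares(scores)

# Same SS, different denominators: N for a population, df = n - 1 for a sample.
population_variance = ss / len(scores)
sample_variance = ss / (len(scores) - 1)

# Standard deviation: square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(ss, population_variance, sample_variance, population_sd, sample_sd)
```

Python's standard library follows the same split: `statistics.pvariance` divides by N, while `statistics.variance` divides by n - 1.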
Measure of shape
  Skewness (sk): Measures the distribution's deviation from symmetry.
    $sk = \frac{\frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^3}{\left( \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2 \right)^{3/2}}$
  Symmetric: sk = 0.
  "Tilted towards left": sk > 0 ("Positive skew")
  "Tilted towards right": sk < 0 ("Negative skew")

Skewness example
  Statement: In a distribution with negative skew, the mode is larger than the mean. (True or False?)
  Answer: True

Lessons learnt
  Why you want to learn all about statistics
  What descriptive statistics is
  The four different types of data
  The main descriptive measures for data

What's next
  Now: Research practicum meeting
  Thursday: Methods lecture 2, Exercise class
  Saturday: submit Deliverable P1a
  Do not forget to fill in the INFOWO questionnaire! (see website)
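As a supplement to the Measure of shape slide, the skewness formula can be sketched in Python as follows. The function implements the formula as stated there (third central moment divided by the second central moment to the power 3/2); the symmetric data set is hypothetical, and the right-tailed one reuses the outlier sample from the mean versus median example.

```python
def skewness(data):
    """Skewness: mean cubed deviation over (mean squared deviation)^(3/2).

    Undefined (division by zero) when all scores are identical.
    """
    n = len(data)
    mean = sum(data) / n
    second_moment = sum((x - mean) ** 2 for x in data) / n
    third_moment = sum((x - mean) ** 3 for x in data) / n
    return third_moment / second_moment ** 1.5


symmetric = [1, 2, 3, 4, 5]                  # hypothetical, symmetric around 3
right_tail = [1, 2, 2, 3, 5, 6, 7, 8, 20]    # long tail to the right
left_tail = [-x for x in right_tail]         # mirrored: long tail to the left

print(skewness(symmetric))    # zero for a symmetric distribution
print(skewness(right_tail))   # positive skew
print(skewness(left_tail))    # negative skew
```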