Statistics I Chapter 1: Introduction

Contents

1. What is Statistics? Definition
2. Key words: population, parameter, sample, statistic, population size, sample size, individuals, objects
3. Types of variables: categorical (ordinal, nominal) and numerical (discrete, continuous)
4. Why sample?
5. Definition of a simple random sample
6. Frequencies and frequency distributions/tables: absolute, cumulative absolute, relative, cumulative relative. Properties.

Recommended reading

- Peña, D., Romo, J., Introducción a la Estadística para las Ciencias Sociales, Chapters 1, 2, 3
- Newbold, P., Estadística para los Negocios y la Economía (2009), Chapter 1; Sections 2.1, 2.4, 2.7
- How to Lie with Statistics

Definition of Statistics

Def. Statistics is a science that deals with:
- collecting, organizing, summarizing, presenting, interpreting and processing data, to transform data into information (Descriptive Statistics)
- predictions, forecasts and estimation (Inferential Statistics)

Where have you heard or seen the word "statistics"?
- football/tennis match summaries
- unemployment rates, number of people injured in car accidents

There is much more to statistics than percentages and counts!

Key words

- A population is the complete collection of all items/individuals/objects/subjects of interest or under investigation. N represents the population size.
- A sample is an observed subset of the population, typically chosen to investigate the properties of the parent population. n represents the sample size.
- A parameter is a specific characteristic of a population (fixed).
- A statistic is a specific characteristic of a sample (varies from sample to sample).
- A variable is a characteristic of an individual.

Examples

1. Population: all students at UC3M
   Variable: height, with values in (0, ∞)
   Parameter: average height of all students
   Statistic: average height of the sampled students

2. Population: all fish in a sea
   Variable: size, with values in {L, M, S}
   Parameter: number of small fish in the entire sea
   Statistic: number of small fish caught

3. Population: all patients of Getafe Hospital
   Variable: blood type, with values in {A, B, AB, O}
   Parameter: percentage of all patients with type AB
   Statistic: percentage of sampled patients with type AB

4. Population: all Philips light bulbs
   Variable: life expectancy in days, with values in {0, 1, 2, ...}
   Parameter: variation in life expectancy of all light bulbs
   Statistic: variation in life expectancy of the sampled light bulbs
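The difference between a fixed parameter and a statistic that varies from sample to sample can be seen in a small simulation. This is only a sketch: the population of heights is simulated (not real UC3M data), and the mean of 170 cm, the spread of 10 cm, the seed and the sample size of 50 are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

# Hypothetical population: simulated heights (cm) of N = 10,000 students.
# The values 170 and 10 are illustrative assumptions, not real data.
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: computed from the whole population, so it is one fixed number.
parameter = statistics.mean(population)

def sample_mean(n):
    """Statistic: the mean of a random sample of size n (without replacement)."""
    return statistics.mean(rng.sample(population, n))

# Five different samples give five different values of the statistic.
sample_means = [sample_mean(50) for _ in range(5)]

print(f"parameter (population mean): {parameter:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"statistic (mean of sample {i}): {m:.2f}")
```

Each run of `sample_mean` selects different individuals, so the printed statistics scatter around the single population value.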

Types of data

Data (variable)
- Categorical (qualitative)
  - Ordinal: classes can be ranked. Example: clothes size, L > M > S
  - Nominal: no natural order. Example: blood type, A, B, AB, O
- Numerical (quantitative)
  - Discrete: integer values. Example: number of children, 0, 1, 2, ...
  - Continuous: non-integer values. Example: height, 1.55 m, 1.71 m

Notation: Letters X, Y, Z are typically used. Example:
- X = height in m (upper-case letters for the variable itself)
- x = 1.55 (lower-case letters for specific values)
- $x_1 = 1.55$, $x_2 = 1.71$ (add subscripts if there is more than one value)

Why sample?

In practice we don't study the whole population because:
- We may destroy the population (e.g. life expectancy of a light bulb)
- The population may exist as a concept but not in reality (e.g. population of defective items)
- It is impractical (e.g. population of all fish in a sea)
- It is too expensive
- It is too time-consuming

Definition of a simple random sample (SRS)

Def. A simple random sample is obtained in such a way that
- each member of the population is chosen strictly by chance,
- each member of the population is equally likely to be chosen, and
- every possible sample of n objects is equally likely to be chosen.

Notation: A sample of size n from a variable X means that:
- We have n individuals selected at random from a population.
- For each of the individuals we report the value of the variable X.
- If X is categorical or discrete, it is convenient to write the different sample values that X takes as $x_1, x_2, \ldots, x_k$, with $k \le n$ (ranked from the smallest to the largest, unless X is nominal).

Frequencies and frequency distribution

Def. A frequency distribution is a list or a table containing class groupings (categories or ranges within which the data fall) and the corresponding frequencies with which data fall within each class or category.

Frequencies:
- absolute (number of times the value appeared in the sample)
- relative (proportion of times the value appeared in the sample)
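Returning to the definition of a simple random sample earlier in this section, the sketch below draws one with Python's standard `random.sample`, which selects without replacement so that every subset of size n is equally likely. The population of 1000 ID numbers, the helper name `simple_random_sample` and the seed are hypothetical, introduced only for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population: N = 1000 student ID numbers.
population = list(range(1, 1001))

sample = simple_random_sample(population, n=10, seed=1)
print(sample)
```

Passing a different seed (or none) gives a different sample, which is why any statistic computed from it varies from sample to sample.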

Why use frequency distributions?

- A frequency distribution is a way to summarize data.
- The distribution condenses the raw data into a more useful form and allows for a quick visual interpretation of the data.

Grouping by classes: categorical and discrete data

Class    Absolute freq.   Relative freq.    Cumulative absolute freq.   Cumulative relative freq.
$x_i$    $n_i$            $f_i$             $N_i$                       $F_i$
$x_1$    $n_1$            $f_1 = n_1/n$     $N_1 = n_1$                 $F_1 = f_1$
$x_2$    $n_2$            $f_2 = n_2/n$     $N_2 = N_1 + n_2$           $F_2 = F_1 + f_2$
...      ...              ...               ...                         ...
$x_k$    $n_k$            $f_k = n_k/n$     $N_k = n$                   $F_k = 1$
Total    $n$              $1$               (empty)                     (empty)

Note:
- $n_i$ = number of times $x_i$ appears in the sample, $f_i = n_i/n$
- $N_i = N_{i-1} + n_i$, $F_i = F_{i-1} + f_i$
- $0 \le f_i, F_i \le 1$
- $F_i$ and $N_i$ do not make sense for categorical nominal variables
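The four kinds of frequency defined above can be computed in a few lines. This is a minimal sketch using only the standard library; the function name `frequency_table` and the small sample of counts are hypothetical.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(sample, order=None):
    """Return rows (x_i, n_i, f_i, N_i, F_i) for the distinct values in sample.

    `order` lets the caller fix the class order (needed for ordinal data);
    by default the distinct values are sorted.
    """
    counts = Counter(sample)
    classes = order if order is not None else sorted(counts)
    n = len(sample)
    n_i = [counts[c] for c in classes]   # absolute frequencies
    f_i = [c / n for c in n_i]           # relative frequencies
    N_i = list(accumulate(n_i))          # N_i = N_{i-1} + n_i
    # F_i = F_{i-1} + f_i, computed as N_i / n to avoid accumulating rounding error
    F_i = [N / n for N in N_i]
    return list(zip(classes, n_i, f_i, N_i, F_i))

# Hypothetical sample of a discrete variable (e.g. number of children).
sample = [0, 2, 1, 1, 0, 3, 1, 2]
table = frequency_table(sample)
for row in table:
    print(row)
```

For a nominal variable the last two columns should simply be ignored, since the classes have no natural order.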

Example 1: The data below show the blood types reported for a sample of 40 individuals.

AB, A, B, O, A, A, A, B, O, AB,
B, O, B, B, B, A, A, A, AB, B,
O, A, A, A, AB, AB, O, B, B, AB,
O, B, O, O, A, A, O, B, AB, AB

1. What kind of variable is blood type?
2. Find a frequency distribution of the data.
3. What percentage of the sampled people have blood type A?
4. What percentage of the individuals have a blood type other than O?

Example 1 cont.:

1. Categorical, nominal with 4 different classes.
2. The frequency distribution is:

   Class   Absolute frequency   Relative frequency
   A       12                   0.300
   B       11                   0.275
   AB       8                   0.200
   O        9                   0.225
   Total   40                   1

3. 30%
4. 100% - 22.5% = 77.5%
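The counts in Example 1 can be checked directly from the raw data. The sketch below uses `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# The 40 blood types from Example 1, in the order given.
blood_types = """AB A B O A A A B O AB B O B B B A A A AB B
O A A A AB AB O B B AB O B O O A A O B AB AB""".split()

n = len(blood_types)
counts = Counter(blood_types)  # absolute frequencies

for blood_type in ("A", "B", "AB", "O"):
    print(blood_type, counts[blood_type], counts[blood_type] / n)

# Percentages asked for in the example.
pct_A = 100 * counts["A"] / n
pct_not_O = 100 * (n - counts["O"]) / n
print(pct_A, pct_not_O)
```

No cumulative columns are computed here because blood type is nominal.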

Example 2: The table below shows different levels of satisfaction (VU = very unsatisfied, U = unsatisfied, S = satisfied, VS = very satisfied) for 901 employees.

Class   Absolute frequency
VU       62
U       108
S       319
VS      412
Total   901

1. What type of variable is being studied?
2. Find a frequency distribution of the data.
3. What percentage of the sampled people are satisfied?
4. How many individuals are unsatisfied or worse? In %?
5. How many individuals are at least satisfied? In %?

Example 2 cont.:

1. Categorical, ordinal with 4 different classes.
2. The frequency distribution is:

   Class   Absolute    Relative    Cumulative absolute   Cumulative relative
           frequency   frequency   frequency             frequency
   VU       62         0.07         62                   0.07
   U       108         0.12        170                   0.19
   S       319         0.35        489                   0.54
   VS      412         0.46        901                   1
   Total   901         1

3. 35%
4. 170, which is 19%
5. 319 + 412 = 731 or 901 - 170 = 731; 35% + 46% = 81% or 100% - 19% = 81%
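Because the classes in Example 2 are ordered, the cumulative columns are meaningful and answer the "or worse" and "at least" questions directly. A minimal sketch with `itertools.accumulate` follows; the variable names are illustrative.

```python
from itertools import accumulate

# Classes listed from worst to best, with absolute frequencies from Example 2.
classes = ["VU", "U", "S", "VS"]
n_i = [62, 108, 319, 412]

n = sum(n_i)
N_i = list(accumulate(n_i))     # cumulative absolute frequencies
F_i = [N / n for N in N_i]      # cumulative relative frequencies

# "Unsatisfied or worse" is the cumulative count up to class U.
unsatisfied_or_worse = N_i[classes.index("U")]
# "At least satisfied" is everyone else.
at_least_satisfied = n - unsatisfied_or_worse

print(unsatisfied_or_worse, round(100 * unsatisfied_or_worse / n))
print(at_least_satisfied, round(100 * at_least_satisfied / n))
```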

Example 3: To evaluate the performance of a new pesticide, a sample of 50 plants was selected from those treated with the new pesticide. The number of leaves attacked by a pest was counted for each of the sampled plants. The results are shown below.

$x_i$   Absolute frequency
0        6
1       10
2       12
3        8
4        5
5        4
6        3
8        1
10       1
Total   50

Example 3 cont.:

1. What can you say about the variable in the study?
2. Find its frequency distribution.
3. What percentage of the sampled plants had exactly 3 leaves attacked?
4. How many plants had no more than 3 leaves attacked?
5. How many plants had at least 6 leaves attacked?
6. What percentage of plants had between 3 and 5 leaves attacked?
7. What percentage of plants had at least 8 leaves attacked?
8. What percentage of plants had at most 2 leaves attacked?

Example 3 cont.:

1. Numerical, discrete with 9 different values.
2. The frequency distribution is:

   $x_i$   Absolute    Relative    Cumulative absolute   Cumulative relative
           frequency   frequency   frequency             frequency
   0        6          0.12         6                    0.12
   1       10          0.20        16                    0.32
   2       12          0.24        28                    0.56
   3        8          0.16        36                    0.72
   4        5          0.10        41                    0.82
   5        4          0.08        45                    0.90
   6        3          0.06        48                    0.96
   8        1          0.02        49                    0.98
   10       1          0.02        50                    1
   Total   50          1

3. 16%
4. 36
5. 3 + 1 + 1 = 5 or 50 - 45 = 5
6. 16% + 10% + 8% = 34% or (8 + 5 + 4)/50 = 34%
7. 2% + 2% = 4% or 100% - 96% = 4%
8. 56%
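Each answer in Example 3 is a sum of absolute frequencies over the values that satisfy a condition. The sketch below makes that explicit; the helper names `count_where` and `percent_where` are hypothetical.

```python
# Absolute frequencies from Example 3: leaves attacked -> number of plants.
freq = {0: 6, 1: 10, 2: 12, 3: 8, 4: 5, 5: 4, 6: 3, 8: 1, 10: 1}
n = sum(freq.values())

def count_where(condition):
    """Number of sampled plants whose value x_i satisfies the condition."""
    return sum(n_i for x_i, n_i in freq.items() if condition(x_i))

def percent_where(condition):
    """Same count expressed as a percentage of the sample."""
    return 100 * count_where(condition) / n

print(percent_where(lambda x: x == 3))        # exactly 3 leaves
print(count_where(lambda x: x <= 3))          # no more than 3 leaves
print(count_where(lambda x: x >= 6))          # at least 6 leaves
print(percent_where(lambda x: 3 <= x <= 5))   # between 3 and 5 leaves
print(percent_where(lambda x: x >= 8))        # at least 8 leaves
print(percent_where(lambda x: x <= 2))        # at most 2 leaves
```

The "no more than" and "at most" answers can also be read straight off the cumulative columns of the table.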

Grouping by class intervals: continuous (and discrete) data

Class interval       Midpoint                      Abs.    Rel.    Cum. abs.   Cum. rel.
$[l_{i-1}, l_i)$     $x_i = (l_i + l_{i-1})/2$     $n_i$   $f_i$   $N_i$       $F_i$
$[l_0, l_1)$         $x_1$                         $n_1$   $f_1$   $N_1$       $F_1$
$[l_1, l_2)$         $x_2$                         $n_2$   $f_2$   $N_2$       $F_2$
...                  ...                           ...     ...     ...         ...
$[l_{k-1}, l_k]$     $x_k$                         $n_k$   $f_k$   $n$         $1$
Total                                              $n$     $1$     (empty)     (empty)

Note:
- The left end-point is included, but the right end-point is excluded (typical convention). The last interval includes both end-points.
- The reverse end-point convention can also be applied; check how your software defines it.
- Class intervals are also useful for tabulating discrete data if X takes many values.

Choosing the intervals:
- Very often class intervals have the same width.
- Determine the width w of each interval by

  $w = \dfrac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}}$

- How many intervals? Roughly between 5 and 20. More specifically:
  - $k \approx \sqrt{n}$ if n is small
  - $k \approx 1 + 3.322 \log_{10}(n)$ if n is large (Sturges' rule)
- Intervals never overlap.
- Round up the interval width to get desirable interval end-points.
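The rules of thumb above for the number of intervals and the interval width can be sketched as follows. The large-n rule is written as 1 + log2(n), the usual form of Sturges' rule that the logarithmic formula above approximates; rounding k up to an integer is an assumption of this sketch, and the function names and data are hypothetical.

```python
import math

def suggested_number_of_intervals(n):
    """Return (k for small n, k for large n), both rounded up to integers."""
    k_small = math.ceil(math.sqrt(n))       # k ~ sqrt(n)
    k_large = math.ceil(1 + math.log2(n))   # Sturges' rule, k ~ 1 + log2(n)
    return k_small, k_large

def raw_interval_width(data, k):
    """(largest - smallest) / k, before rounding up to a convenient value."""
    return (max(data) - min(data)) / k

# Hypothetical sample of n = 10 measurements.
data = [12, 15, 21, 22, 27, 30, 34, 41, 45, 58]

k_small, k_large = suggested_number_of_intervals(len(data))
width = raw_interval_width(data, k=5)
print(k_small, k_large, width, math.ceil(width))
```

Both rules are only starting points: the final k and width are usually adjusted by hand so that the end-points are round numbers.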

Example 4: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature (in degrees Fahrenheit):

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Find the frequency distribution of the data.

1. Sort the raw data in ascending order:
   12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
2. Find the range: 58 - 12 = 46
3. Select the number of classes: say k = 5
4. Compute the interval width: 10 (46/5 = 9.2, then round up)
5. Determine the end-points: 10 but less than 20, 20 but less than 30, etc.
6. Count the observations and assign them to classes

Example 4 cont.:

Class interval   Midpoint   $n_i$   $f_i$   $N_i$   $F_i$
[10, 20)         15          3      0.15     3      0.15
[20, 30)         25          6      0.30     9      0.45
[30, 40)         35          5      0.25    14      0.70
[40, 50)         45          4      0.20    18      0.90
[50, 60]         55          2      0.10    20      1
Total                       20      1

- On how many days was the temperature below 30°F? In %? (3 + 6 = 9, which is 45%)
- On how many days (approximately) was the temperature at least 45°F? In %? ($2 + 4 \cdot \frac{50 - 45}{50 - 40} = 4$, which is 20%; this assumes the 4 observations in [40, 50) are spread evenly over the interval)
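The grouping in Example 4 can be reproduced with a short pure-Python sketch. It follows the convention used in the table (left end-point included, right end-point excluded, last interval closed); the function name `bin_counts` is hypothetical.

```python
from itertools import accumulate

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
edges = [10, 20, 30, 40, 50, 60]   # end-points l_0, ..., l_k with width 10

def bin_counts(data, edges):
    """Count observations per interval [l_{i-1}, l_i); the last one is closed."""
    k = len(edges) - 1
    counts = [0] * k
    for x in data:
        for i in range(k):
            in_last_closed = (i == k - 1 and x == edges[-1])
            if edges[i] <= x < edges[i + 1] or in_last_closed:
                counts[i] += 1
                break
    return counts

n = len(temps)
n_i = bin_counts(temps, edges)
f_i = [c / n for c in n_i]
N_i = list(accumulate(n_i))
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

# Approximate number of days at or above 45: all of [50, 60] plus the share
# of [40, 50) lying above 45, assuming observations are spread evenly.
approx_at_least_45 = n_i[4] + n_i[3] * (50 - 45) / (50 - 40)

# Exact count from the raw data, for comparison.
exact_at_least_45 = sum(x >= 45 for x in temps)

print(n_i, N_i, approx_at_least_45, exact_at_least_45)
```

The estimate from the grouped table is 4 days, while the raw data contain 3 such days (46, 53, 58); the difference is the price of working only with the grouped table.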