Introduction to Statistics
By A.V. Vedpuriswar, October 2, 2016
Introduction
The word Statistics is derived from the Italian word stato, which means state. A Statista is a person involved with the affairs of state. Therefore, statistics originally meant the collection of facts useful to the Statista.
Nominal Scale, Ordinal Scale
Nominal Scale: In the nominal scale of measurement, numbers are used simply as labels for groups or classes. Ordinal Scale: In the ordinal scale of measurement, data elements may be ordered according to their relative size or quality. We do not know how much better one element is than another, only that it is better.
Interval Scale, Ratio Scale
Interval Scale: In the interval scale of measurement, we can assign a meaning to distances between any two observations. The distances between elements can be measured in units. Ratio Scale: The ratio scale is the most sophisticated scale of measurement. Here not only do distances between paired observations have a meaning, but so do the ratios of the distances.
Samples and Populations
The population consists of the set of all measurements in which we are interested. The population is also called the universe. A sample is a subset of measurements selected from the population. If sampling is done randomly, such that every possible sample of n elements has an equal chance of being selected, it is called a simple random sample, or just a random sample.
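As an illustration, Python's random.sample draws a simple random sample in which every subset of n elements is equally likely. The population and sample size here are invented for the example:

```python
import random

random.seed(42)  # seeded only so the illustration is reproducible

population = list(range(1, 101))        # the set of all measurements
sample = random.sample(population, 10)  # simple random sample of n = 10
# random.sample draws without replacement, so all 10 elements are distinct
```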
Statistical Inference
A conclusion drawn about a population based on the information in a sample from the population is called a statistical inference.
Percentiles and Quartiles
The Pth percentile of a group of numbers is that value below which lie P% (P percent) of the numbers in the group. Quartiles are the percentiles that break the data set into quarters. The first quartile is the 25th percentile, below which lie 25% of the data. The median is the 50th percentile, below which lie half the data. The third quartile is the 75th percentile, below which lie 75% of the data. The interquartile range is the difference between the first and third quartiles.
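A minimal sketch of a Pth-percentile function, using linear interpolation between closest ranks. Note that several percentile conventions exist, so other software may give slightly different values:

```python
def percentile(data, p):
    """Pth percentile: the value below which roughly p% of the data lie.
    Uses linear interpolation between closest ranks (one common convention)."""
    xs = sorted(data)
    k = (len(xs) - 1) * p / 100   # fractional rank of the percentile
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

data = [1, 2, 3, 4, 5]
q1 = percentile(data, 25)        # first quartile
med = percentile(data, 50)       # median
iqr = percentile(data, 75) - q1  # interquartile range
```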
Frequency and Histogram
The number of times a data point occurs in a data set is called its frequency. Relative frequency is the frequency divided by the total frequency. Data points are often classified into class intervals. The number of data points lying within each class interval is the frequency of the interval. A histogram is a plot of the frequencies of the class intervals.
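The frequency and relative-frequency calculations can be sketched with the standard library's Counter (the data are invented for the example):

```python
from collections import Counter

data = [2, 3, 3, 5, 5, 5, 7]

freq = Counter(data)                # frequency of each data point
total = sum(freq.values())          # total number of observations
rel_freq = {x: f / total for x, f in freq.items()}  # relative frequencies
# The relative frequencies always sum to 1
```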
Frequency Polygons and Ogives
A frequency polygon is similar to a histogram except that there are no rectangles, only a point in the middle of each interval at a height proportional to the frequency of the interval. By adding up the frequencies, we get the cumulative frequency. An ogive is a cumulative-frequency (or cumulative relative-frequency) graph. An ogive starts at 0 and goes to 1.00 (for a relative-frequency ogive) or to the maximum cumulative frequency.
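The heights of a relative-frequency ogive are just the running totals of the interval frequencies divided by the grand total, which can be sketched as (interval frequencies invented for the example):

```python
from itertools import accumulate

interval_freqs = [2, 5, 8, 4, 1]  # frequencies of consecutive class intervals
total = sum(interval_freqs)

# Cumulative relative frequencies: the heights of the ogive at each
# interval's upper boundary; they rise monotonically and end at 1.0.
ogive = [c / total for c in accumulate(interval_freqs)]
```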
Box Plots
A box plot is based on a set of five summary measures of the distribution of the data: the smallest observation, the lower quartile, the median, the upper quartile, and the largest observation.
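A sketch of the five summary measures using statistics.quantiles (available in Python 3.8+; its default "exclusive" method is one of several quartile conventions, so other tools may differ slightly):

```python
import statistics

data = [5, 7, 8, 12, 13, 14, 18, 21, 22, 30]

q1, med, q3 = statistics.quantiles(data, n=4)  # lower quartile, median, upper quartile
five_number = (min(data), q1, med, q3, max(data))  # the box plot's five measures
```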
Measures of Central Tendency
The median lies at the center of the data. Half the data lie below it and half above it. The median is thus a measure of centrality. The mode of the data set is the value that occurs most frequently. The mean of a set of observations is their average. It is equal to the sum of all observations divided by the number of observations in the set.
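All three measures are available in the standard library's statistics module (the data are invented for the example):

```python
import statistics

data = [2, 3, 3, 5, 9]

m = statistics.mean(data)      # sum of observations / number of observations
med = statistics.median(data)  # the middle observation of the sorted data
mode = statistics.mode(data)   # the most frequent value
```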
More about the Mean
The mean is the most commonly used measure of central tendency. The mean summarizes all of the information in the data. The mean is the point where all the mass of the observations is concentrated. It is the centre of mass of the data.
Mean vs Median
The mean is based on information contained in all the observations in the data set, rather than being an observation lying in the middle of the set. The mean also has some desirable mathematical properties that make it useful in statistical inference. In cases where we want to guard against the influence of a few outlying observations (called outliers), however, we may prefer to use the median. The median is resistant to extreme observations.
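A small made-up example showing that a single outlier drags the mean far more than the median:

```python
import statistics

data = [10, 11, 12, 13, 14]
with_outlier = data + [1000]  # one extreme observation

# How much each measure moves when the outlier is added:
mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)
# The mean jumps by more than 100; the median moves by only 0.5
```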
Mean vs Mode
The mode is less useful than the mean or even the median. There may be several modes in a data set. If a data set or population is symmetric and if the distribution of the observations has only one mode, then the mode, the median, and the mean are all equal.
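For instance (symmetric data invented for the example; statistics.multimode is available in Python 3.8+):

```python
import statistics

sym = [1, 2, 2, 3]  # symmetric, with a single mode
m = statistics.mean(sym)
med = statistics.median(sym)
mode = statistics.mode(sym)
# For symmetric, single-mode data all three coincide

modes = statistics.multimode([1, 1, 2, 2, 3])  # a data set can have several modes
```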
Range
The range of a set of observations is the difference between the largest observation and the smallest observation. The range may get distorted by outliers. The interquartile range is more resistant to extreme observations.
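A made-up example in which the range is dominated by a single outlier while the interquartile range barely notices it (statistics.quantiles, Python 3.8+, with its default quartile convention):

```python
import statistics

data = [5, 7, 8, 12, 13, 14, 18, 21, 22, 300]  # 300 is an outlier

data_range = max(data) - min(data)   # stretched to 295 by the outlier
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                        # stays small: based on the middle of the data
```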
Variance and Standard Deviation
Variance = (1/N) Σ (Xi − m)², where Xi is the value of the variable, m is the mean and N is the number of observations. The standard deviation is the square root of the variance. The variance and the standard deviation are more useful than the range and the interquartile range. Like the mean, they use the information contained in all the observations in the data set or population. We square the deviations to ensure that the positive and negative deviations do not cancel each other. We work a lot with the variance because it has an additive property. We need the standard deviation because it has the same unit as the variable.
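The formula can be checked directly on a small data set (the numbers are chosen so the arithmetic comes out evenly):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

m = sum(data) / n                                # mean: 5.0
variance = sum((x - m) ** 2 for x in data) / n   # population variance (1/N form)
std_dev = math.sqrt(variance)                    # same unit as the variable
```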
Skewness
Skewness is a measure of the degree of asymmetry of a frequency distribution. Skewness = (1/(N s³)) Σ (Xi − m)³, where Xi is the value of the variable, s is the standard deviation, m is the mean and N is the number of observations. A distribution which stretches to the right more than it does to the left is right-skewed. Similarly, a left-skewed distribution is one that stretches asymmetrically to the left. Generally, for a right-skewed distribution, the mean is to the right of the median, which in turn lies to the right of the mode (assuming a single mode). The opposite is true for left-skewed distributions.
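A sketch of the skewness calculation, assuming the population form with the 1/N factor (the data are invented so the intermediate values come out evenly):

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

m = sum(data) / n                                     # mean
s = (sum((x - m) ** 2 for x in data) / n) ** 0.5      # population std deviation
skewness = sum((x - m) ** 3 for x in data) / (n * s ** 3)
# skewness > 0 here, so this distribution stretches to the right (right-skewed)
```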
Kurtosis
Kurtosis is a measure of the flatness (versus peakedness) of a frequency distribution. Kurtosis = (1/(N s⁴)) Σ (Xi − m)⁴, where Xi is the value of the variable, s is the standard deviation, m is the mean and N is the number of observations. Flat distributions are called platykurtic. Peaked distributions are called leptokurtic. Neutral distributions, not too flat and not too peaked, are called mesokurtic.
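A sketch of the kurtosis calculation, assuming the population form with the 1/N factor (a normal distribution has kurtosis 3; smaller values indicate a flatter, platykurtic shape):

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

m = sum(data) / n                                     # mean
s = (sum((x - m) ** 2 for x in data) / n) ** 0.5      # population std deviation
kurtosis = sum((x - m) ** 4 for x in data) / (n * s ** 4)
# Here kurtosis is below 3, i.e. flatter than a normal distribution
```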
Chebyshev's Theorem
A mathematical theorem attributed to Chebyshev establishes the following rules: (1) At least 3/4 of the observations in a data set will lie within 2 standard deviations of the mean. (2) At least 8/9 of the observations will lie within 3 standard deviations of the mean. (3) In general, at least (1 − 1/k²) of the observations will lie within k standard deviations of the mean, for any k > 1.
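Chebyshev's bounds can be verified on any data set, since the theorem makes no assumption about the distribution's shape (the data here are arbitrary):

```python
data = list(range(1, 101))  # any data set works for Chebyshev's theorem
n = len(data)

m = sum(data) / n                                     # mean
s = (sum((x - m) ** 2 for x in data) / n) ** 0.5      # population std deviation

# Fraction of observations within k standard deviations of the mean
fraction_within = {
    k: sum(abs(x - m) <= k * s for x in data) / n
    for k in (2, 3)
}
# Chebyshev guarantees at least 3/4 for k = 2 and at least 8/9 for k = 3
```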
Useful Empirical Rules
If the distribution of the data is mound-shaped, that is, if the histogram of the data is more or less symmetric with a single mode or high point, then the following rules apply: (1) Approximately 68% of the observations will be within 1 standard deviation of the mean. (2) Approximately 95% of the observations will be within 2 standard deviations of the mean. (3) A vast majority of the observations (all of them, or almost all of them) will be within 3 standard deviations of the mean.
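The rules can be checked empirically on a simulated mound-shaped (normal) sample; the exact fractions vary from sample to sample, so the comparison is only approximate:

```python
import random

random.seed(0)  # seeded only so the illustration is reproducible
data = [random.gauss(0, 1) for _ in range(10_000)]  # mound-shaped sample
n = len(data)

m = sum(data) / n                                     # sample mean
s = (sum((x - m) ** 2 for x in data) / n) ** 0.5      # sample std deviation

within1 = sum(abs(x - m) <= 1 * s for x in data) / n  # approx 0.68
within2 = sum(abs(x - m) <= 2 * s for x in data) / n  # approx 0.95
```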