1. Descriptive stats methods for organizing and summarizing information


Lionel Fitzgerald

Two basic types of statistics:
1. Descriptive stats: methods for organizing and summarizing information. Stats in sports are a great example. Usually we use graphs, charts, and tables showing averages and associated measures of variation.
2. Inferential stats: methods for drawing, and measuring the reliability of, conclusions about a population based on information obtained from a sample of the population. When we poll a small group of potential voters (the sample), we can infer something about the sentiment of the entire voting population.
The two types are interrelated because we use descriptive stats to organize and summarize sample information before carrying out an inferential analysis.
We can conduct a census, obtaining information from the entire population, but that is usually time consuming, costly, or impossible. Instead, we can do a survey in which we take information from a sample.

Variable: a characteristic that varies from one person or thing to another.
For people: height, weight, number of siblings, gender, marital status, eye color.
The first 3 variables yield numerical information and are called quantitative variables. The last 3 yield non-numerical information and are called qualitative or categorical variables.
There are 2 types of quantitative variables:
1. Discrete variables have values that can be listed, although the list may continue indefinitely. The variable may have only a finite number of possible values, or its values are some collection of whole numbers. Usually a count of something, as in number of siblings.
2. Continuous variables have possible values that form some interval of numbers. Usually a measurement of something, as in the height or weight of a person.

Grouping discrete quantitative data
To get a clear picture of trends in a list of observations (together called a data set), we need to group the observations into classes. First, decide on class intervals (10s, 100s, etc. are a good start).
[Table 2: Days to Maturity | Tally | No. of Investments; 40 investments in total]
We can see several important pieces of information; most importantly, one day range holds more investments than any other.
Generally:
The number of classes should be between 5 and 20, although fewer may be used for categorical data.
Classes should have the same width.
Frequency and relative frequency
The number of observations that fall into each class is called the frequency (or count).

A table that provides all classes and their frequencies is called a frequency distribution.
Many times, we are interested in the relative frequency or percentage of each class, so we divide the frequency of each class by the total number of observations.
In the table, for class 50–59: 8/40 = 0.20 = 20%. Here 0.20 is the relative frequency and 20% is the percentage of observations in the class.
Now, we can construct a relative frequency distribution table.
[Table 3: Days to Maturity | Relative Frequency | Percentage; for the first two classes, 3/40 = 0.075 × 100 = 7.5% and 1/40 = 0.025 × 100 = 2.5%]
Single value grouping
In some cases, we are interested in classes that each represent a single possible value. This is usually the case for discrete data where there are only a few possible observations.

Number of TV sets in 50 households
Single value grouped data table:
[Table 5: No. of TVs | Frequency | Relative Frequency]
Grouping continuous quantitative data
An important difference between grouping discrete and continuous data is that, for continuous data, you must decide exactly where to separate classes along the continuum of real numbers. For the data in Table 6, more than one range would work as a first group, so we pick one and build the remaining classes from it.

Also, note that relative frequencies may not always sum to exactly 1.0 because of rounding error (0.999 in the table below). If we carried each relative frequency to more decimal places we might be able to sum to 1.0, but it is not of great importance.
[Table 7: Weight (lb) | Frequency | Relative Frequency]
Grouping qualitative data
Classes are simply the observed values of the corresponding variable. For the question "Are you male or female?", Male is one class.
For a survey of political affiliation of Intro to Stats students:
D R O R R R R R D O R D O O R D D R O D
R R O R D O D D D R O D O R D R R R R D
Classes will be Democratic, Republican, or Other.
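The tally for the survey above can be sketched in a few lines of Python. This is only an illustration: the variable names are made up here, and the 40 responses are copied from the list above.

```python
from collections import Counter

# The 40 survey responses listed above (D = Democratic, R = Republican, O = Other).
responses = (
    "D R O R R R R R D O R D O O R D D R O D "
    "R R O R D O D D D R O D O R D R R R R D"
).split()
labels = {"D": "Democratic", "R": "Republican", "O": "Other"}

counts = Counter(responses)   # frequency of each class
n = len(responses)            # total number of observations

# Relative frequency = class frequency / total number of observations.
table = {labels[k]: (counts[k], counts[k] / n) for k in labels}

for party, (freq, rel) in table.items():
    print(f"{party:<12}{freq:>4}{rel:>8.3f}")
```

The relative frequencies sum to 1 here because no rounding is applied before summing.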

Relative frequency distribution table:
Party        Frequency   Relative Frequency
Democratic   13          0.325
Republican   18          0.450
Other         9          0.225
From this data, we can say that the largest group of stat students is Republican at 45.0%, fewer are Democrats at 32.5%, and 22.5% have other political affiliations.
Next, we'll see how to visually represent summarized data.
Frequency histograms are graphs that display the classes on the horizontal axis and the frequencies of the classes on the vertical axis. The frequency of each class is represented by a vertical bar whose height is equal to the frequency of the class.
Histograms are used for quantitative data to visualize the actual distribution of data across a scale.
[Figure: frequency histogram for the data in Tables 1–3; horizontal axis: Days to maturity, vertical axis: Frequency]

We can also graph relative frequency, but the overall shape will be the same because frequency and relative frequency are proportional.
For classes that include a range of values, the range should be listed under each bar, or cutoff values should be placed at each tick mark.
[Figure: relative frequency histogram; horizontal axis: Days to maturity, vertical axis: Relative Frequency]
For single value grouped data (TV data in Tables 4 and 5), place the middle of each histogram bar directly over the single value represented by the class.
[Figures: frequency histogram and relative frequency histogram; horizontal axis: No. of TVs]

Graphical displays for qualitative data
Bar graphs are used for qualitative data because there is no hierarchy (ascending levels) or scale of values. We could order the classes in any way to visualize relative frequency (not so with quantitative data). Bars should be separated and class labels should be centered underneath.
[Figure: bar graph; horizontal axis: Party (Democratic, Republican, Other), vertical axis: Relative frequency]
Pie charts have wedge-shaped pieces that are proportional to the relative frequencies.
[Figure: pie chart, "Political Party Affiliations", with wedges for Democratic, Republican (45%), and Other]

MA 113 Lecture 1
Table 1. Days to maturity for 40 short term investments.
Table 2. Classes and counts for above data. (Days to Maturity | Tally | No. of Investments/Frequency)
Equation 1. Relative frequency and percentage of a class. 8/40 = 0.20 = 20%
Table 3. Relative frequency distribution and percentage for above data. (Days to Maturity | Relative Frequency | Percentage; first two classes: 3/40 = 0.075 × 100 = 7.5%, 1/40 = 0.025 × 100 = 2.5%)

Table 4. Number of TV sets in 50 randomly selected households.
Table 5. Grouped data table for number of TV sets. (No. of TVs | Frequency | Relative Frequency)
Table 6. Weights of 37 males, aged 18–24 years.

Table 7. Grouped data table for the weights of 37 males, aged 18–24 years. (Weight (lb) | Frequency | Relative Frequency)
Table 8. Political party affiliations of the students in introductory stats (Dem, Rep, or Other).
D R O R R R R R D O R D O O R D D R O D
R R O R D O D D D R O D O R D R R R R D
Table 9. Frequency and relative frequency distribution tables for political party affiliations.
Party        Frequency   Relative Frequency
Democratic   13          0.325
Republican   18          0.450
Other         9          0.225

More descriptive stats
Measures of center: descriptive measures that indicate where the center or most typical value of a data set lies.
Most commonly used is the mean, the sum of the observations divided by the number of observations:
\[ \text{Mean} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}, \qquad n = \text{number of observations} \]
For Table 1: the sum is 6,290 and n = 13, so Mean 1 = 6,290/13 ≈ $483.85.
For Table 2: Mean 2 = (sum of the 10 observations)/10.
We may also use the median, the number that divides the bottom 50% of the data from the top 50%.
If the number of observations is odd, the median is the observation exactly in the middle of the ordered list.
If the number of observations is even, the median is the mean of the 2 middle observations of the ordered list.

For Table 1 (don't forget to order the data from small to large), the median is the middle (7th) of the 13 ordered observations.
For Table 2: median = (sum of the 2 middle observations)/2 = 350.
When we compare mean and median, we see that the mean is greater than the median. This is because the mean is strongly affected by a few relatively large salaries ($900, $1,050) but the median is not. Any time there are a few relatively large or small values in the data set, the mean will be pulled toward those values.
Summation notation
In stats, letters such as x, y, and z are used to denote variables. If we take height and weight data from people, x = the variable height and y = the variable weight.


Measures of variation
Two data sets can have the same mean and median but still differ in other ways.
Consider the heights (inches) of the 5 starting players on 2 basketball teams:
Team 1: 72, 73, 76, 76, 78
Team 2: 67, 72, 76, 76, 84
Mean for both = 75.0; median for both = 76.
Heights on Team 2 vary more than those on Team 1, and we need to describe that difference quantitatively.
The sample standard deviation (SD) is most commonly used to quantify variation in a data set. SD measures variation by indicating how far, on average, observations are from the mean.
For a data set with a small amount of variation (Team 1), observations will, on average, be closer to the mean and the SD is smaller.
For a data set with a large amount of variation (Team 2), observations will, on average, be farther from the mean and the SD is larger.


It seems natural to divide the sum of squared deviations by the number of observations, n, but statistical theory shows that this underestimates the population variance. Instead we divide by n − 1, which gives, on average, a better estimate of the population variance. The result is the sample variance, denoted by \( s^2 \):
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
For Team 1 heights:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{24}{4} = 6 \]
But remember, the sample variance is in units that are the square of the original units. We want a descriptive measure expressed in the original units, so to get the sample standard deviation, SD, we take the square root of the sample variance. SD is denoted by s:
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
For Team 1 heights, simply take the square root of \( s^2 \): \( s = \sqrt{6} \approx 2.4 \).
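As a quick check of the arithmetic, the team statistics can be reproduced with Python's standard `statistics` module. This sketch assumes the first Team 1 height and the second Team 2 height are both 72 inches, which is what the stated common mean of 75.0 requires.

```python
import statistics

# Heights in inches; the two 72s are assumptions implied by the stated mean of 75.0.
team1 = [72, 73, 76, 76, 78]
team2 = [67, 72, 76, 76, 84]

results = {}
for name, heights in (("Team 1", team1), ("Team 2", team2)):
    results[name] = {
        "mean": statistics.mean(heights),
        "median": statistics.median(heights),
        "variance": statistics.variance(heights),  # sample variance, divides by n - 1
        "sd": statistics.stdev(heights),           # square root of the sample variance
    }
    print(name, results[name])
```

Both teams share the same mean and median, while the standard deviation of Team 2 is more than twice that of Team 1.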

This is interpreted as: on average, the heights of players on Team 1 vary from the mean height of 75 inches by 2.4 inches.
Team 2 standard deviation (s), from the table of \( x \), \( x_i - \bar{x} \), and \( (x_i - \bar{x})^2 \):
\[ s^2 = \frac{156}{4} = 39, \qquad s = \sqrt{39} \approx 6.2 \]
Quartiles and Boxplots
Descriptive measures of variation based on quartiles.
Remember, the median divides the data into the bottom 50% and the top 50%.
Percentiles divide the data into hundredths, or 100 equal parts. Percentile one, \( P_1 \), divides the bottom 1% of the data from the top 99%.
Deciles divide the data into 10 equal parts, and quartiles into 4 equal parts or quarters.


We will focus on quartiles.
A data set has 3 quartiles, or dividing lines: \( Q_1 \), \( Q_2 \), \( Q_3 \). \( Q_1 \) is the number that divides the bottom 25% from the top 75%, \( Q_2 \) is the median, and \( Q_3 \) divides the bottom 75% from the top 25%.
For Table 7 data:
First, arrange the data in increasing order and get the median, \( Q_2 \): Median = 30.5.
Now, get the median of the lower 50% (\( Q_1 \)) and of the upper 50% (\( Q_3 \)) of the data: \( Q_1 = 23 \), \( Q_3 = 36.5 \).


We interpret this as:
25% of TV viewing times are < 23 hours
25% are between 23 and 30.5 hours
25% are between 30.5 and 36.5 hours
25% are > 36.5 hours
With the median as our measure of center, we like to use the interquartile range (IQR) as the associated measure of variation. The IQR is the difference between the 1st and 3rd quartiles: IQR = \( Q_3 - Q_1 \).
For Table 7 data: IQR = 36.5 − 23 = 13.5 hours.
We can get measures of variation for the two middle quarters similarly: \( Q_2 - Q_1 \) and \( Q_3 - Q_2 \). But that tells us nothing about variation in quarters 1 and 4. We use the minimum value with \( Q_1 \) and the maximum value with \( Q_3 \) to get the variation for those quarters: \( Q_1 - \text{Min} \) and \( \text{Max} - Q_3 \).
To summarize the data set, we use a five-number summary: Min, \( Q_1 \), \( Q_2 \), \( Q_3 \), Max, or 5, 23, 30.5, 36.5, 66.
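The "median of each half" procedure can be sketched as a small Python function. The function name and the sample values below are hypothetical (they are not the Nielsen data), and the handling of an odd number of observations is one convention among several: here the overall median is left out of both halves.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n this sketch excludes the middle observation from both halves;
    # some textbooks include it in both instead.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

# Hypothetical weekly viewing hours, for illustration only.
hours = [5, 12, 20, 26, 30, 31, 35, 38, 44, 66]
summary = five_number_summary(hours)
iqr = summary[3] - summary[1]
print(summary, iqr)
```

With an even number of observations, as in this example, the two conventions give the same quartiles.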

MA 113 Lecture 2
Table 1. Weekly income ($) for Office Staff.
Table 2. Weekly income ($) for Office Staff.
Equation 1. Mean of a data set.
\[ \text{Mean} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}, \quad \text{where } n = \text{number of observations} \]
Equation 2. Mean of a data set with summation notation.
\[ \text{Mean} = \bar{x} = \frac{\sum x_i}{n} \]
Table 3. Height (ft) of sweetgum trees on a selected study site on Noxubee National Wildlife Refuge.

Table 4. Deviations from mean for heights of players from Team 1. (Height \( x \) | Deviation from mean \( x_i - \bar{x} \))
Table 5. Sum of squared deviations for heights of players from Team 1. (Height \( x \) | Deviation from mean \( x_i - \bar{x} \) | Squared deviation \( (x_i - \bar{x})^2 \))
Equation 3. Sample variance.
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
Equation 4. Sample standard deviation.
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]

Table 6. Sum of squared deviations for heights of players from Team 2. (\( x \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \))
Table 7. Weekly number of hours of TV watched by 20 Americans from Nielsen ratings.

Outliers
Observations that fall well outside the overall pattern of the data.
Reasons for outliers:
Measurement or recording error
An observation from a different population
An unusually extreme observation
We can use the IQR to identify outliers. First, we need to define the lower limit and upper limit of a data set. Observations that lie below the lower limit or above the upper limit are potential outliers.
For Table 1 (TV viewing) data:
Lower limit = \( Q_1 - (1.5 \times \text{IQR}) \)
Upper limit = \( Q_3 + (1.5 \times \text{IQR}) \)
IQR = 13.5, \( Q_1 = 23.0 \), \( Q_3 = 36.5 \)
Lower limit = 23.0 − (1.5 × 13.5) = 23.0 − 20.25 = 2.75 hrs
Upper limit = 36.5 + (1.5 × 13.5) = 36.5 + 20.25 = 56.75 hrs
Anything below the lower limit or above the upper limit is a potential outlier.
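The limit calculation is easy to wrap in a helper. A minimal sketch follows, assuming the quartiles \( Q_1 = 23.0 \) and \( Q_3 = 36.5 \) for the TV viewing data; the function name is made up for illustration, and only the minimum (5) and maximum (66) of the data are checked.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles assumed from the TV viewing example.
lower, upper = outlier_limits(23.0, 36.5)
print(lower, upper)

# Check the two extreme observations of the data set against the limits.
extremes = [5, 66]
potential_outliers = [x for x in extremes if x < lower or x > upper]
print(potential_outliers)
```

Only the maximum falls outside the limits, so it is the one potential outlier among the extremes.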

In Table 1, the only extreme value is 66, and we consider this an unusually extreme observation. One person in the sample watches much more TV than the others. We can easily see this in a histogram.
Boxplots
Also called a box-and-whisker diagram.
Based on the five-number summary, it graphically displays the center and variation in a data set.
Additionally, we need to identify adjacent values, the most extreme observations that still lie within the lower and upper limits. If there are no outliers, the adjacent values are the min and max.
Steps to construct box plots:
1. Determine quartiles.
2. Determine outliers and adjacent values.
3. Above an x axis, draw vertical lines to mark the quartiles (long lines) and the adjacent values (short lines).
4. Connect the quartile lines to make a box, then connect the box to the adjacent value lines.
5. Plot each outlier with an asterisk (*).
With multiple samples, we can compare boxplots among samples to visualize the differences in median values and variation in the data sets.
[Five-number summaries and boxplots for Tables 2a and 2b]

Linear Equations
Often, it is important to know if 2 or more variables are related and how they are related. Linear equations are a good way to assess relationships and even predict future values.
As an example, we could examine the height and shoe size of a sample group of people and determine if there is any relationship between the variables. We can also determine the strength of the relationship: is it a strong or weak connection?
General form of a linear equation:
\[ y = b_0 + b_1 x \]
\( b_0 \) and \( b_1 \) are constants (fixed numbers), x is the independent variable, and y is the dependent variable.
The graph of a linear equation with one independent variable is a straight line.
Linear equations are one of the most commonly used statistical tools in practically all fields of research and business (management, marketing, physical and mathematical sciences, etc.).


Intercept and slope, \( b_0 \) and \( b_1 \)
\( b_0 \) is the y value of the point where the line crosses the y axis, so we call it the y intercept.
\( b_1 \) measures the steepness of the line or, in other words, how much the y value changes when the x value increases by 1 unit on a graph, so it is the slope of the equation.
[Figure: some example equations and their graphs]
A practical example of application in business:
Business Services offers word processing at $20/hr plus a $25 disk charge. Total cost depends on the number of hours needed to complete a job.

For Table 3 (word processing):
Time (hr) x   Cost ($) y
 5.0          125
 7.5          175
15.0          325
20.0          425
...
The total cost, y, of a job that takes x hours is \( y = 25 + 20x \), so \( b_0 = 25 \) and \( b_1 = 20 \). If we know the number of hours required, we can predict the cost.
To graph a linear equation, you only need 2 values of x.
To graph the equation \( y = 5 - 3x \), let's use x values of 1 and 3 (it can be any values, but use some logic and consider scale). Also, do not forget that the y intercept (where x = 0) is a value you can graph.
\( y = 5 - (3 \times 1) = 2 \), so (x, y) = (1, 2)
\( y = 5 - (3 \times 3) = -4 \), so (x, y) = (3, −4)
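Both equations in this example can be evaluated with a couple of one-line functions. This sketch assumes a rate of $20/hr and a $25 disk charge (the values that reproduce the costs in the word processing table) and the equation \( y = 5 - 3x \) for the graphing example; the function names are illustrative only.

```python
def total_cost(hours, rate=20.0, disk_charge=25.0):
    """Cost of a word processing job: y = b0 + b1 * x with b0 = 25, b1 = 20 (assumed)."""
    return disk_charge + rate * hours

for hours in (5.0, 7.5, 15.0, 20.0):
    print(hours, total_cost(hours))

def line(x, b0=5, b1=-3):
    """The graphing example y = 5 - 3x."""
    return b0 + b1 * x

# The y intercept (x = 0) plus two more x values give the points to plot.
points = [(x, line(x)) for x in (0, 1, 3)]
print(points)
```

Plotting any two of the three points and connecting them with a straight line gives the graph of the equation.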

The Regression Equation
Rarely are applications of the linear equation as simple as the word processing example, where one variable (cost) can be predicted exactly in terms of another variable (time). So, many times we must rely on rough predictions from a sample data set.
We can't predict the exact price, y, of a make and model of used car just by knowing its age, x. We have to rely on a rough prediction using an estimate of the mean price of a sample of other cars of the same age.
For Table 4 data (cars):
[Table 4: Car | Age (yr) x | Price ($100) y]
To visualize the relationship, if any, between age and price we will use a scatterplot. A scatterplot is a graph of data from 2 quantitative variables.

[Figure: scatterplot; horizontal axis: Age (yr), vertical axis: Price ($100)]
Although the age and price data points do not fall exactly on a line, they do appear to cluster around a line (there appears to be a relationship). With regression, we can fit a line (equation) to the sample data and use that line to predict, or give a rough estimate of, the price of a used Orion car based on its age.

MA 113 Lecture 3
Table 1. Weekly number of hours of TV watched by 20 Americans from Nielsen ratings.
Equation 1. Lower and upper limits to identify potential outliers in a data set.
Lower limit = \( Q_1 - (1.5 \times \text{IQR}) \)
Upper limit = \( Q_3 + (1.5 \times \text{IQR}) \)
Table 2a. Skinfold thickness (mm) for sample of elite runners.
Table 2b. Skinfold thickness (mm) for random people of similar age.
Equation 2. The general formula for a linear equation.
\[ y = b_0 + b_1 x \]

Table 3. Times and costs for five word processing jobs. (Time (hr) x | Cost ($) y)
Table 4. Age and price data for a sample of 11 used Orion cars. (Car | Age (yr) x | Price ($100) y)


The Regression Equation
In the last lecture, we demonstrated that you can place a straight line (with a given equation) through a scatterplot in an attempt to fit it to the data. We can come up with many candidate equations and lines.
We can compare how well each line fits the data by comparing error values between equations. The error value measures how far the observed y value for each data point is from the predicted y value given by the equation.
An example with a small data set, Table 1:
[Table 1: x | y]
First, make a scatterplot of the data to check for a linear trend.
Next, let's propose 2 candidate equations, Line A and Line B, each of the form \( y = b_0 + b_1 x \), and determine which best fits the real data.

When we graph the linear equations with the scatterplot of real data, we see both seem to fit the data well, but which is best?
First, calculate the differences between the real values of y and the predicted values of y from each equation.
When x = 2, the real value is y = 2 (from Table 1).
The predicted values for y when x = 2, which we will denote as ŷ, from Lines A and B are:
Line A: ŷ = 3
Line B: ŷ = 2.75
And the differences between real and predicted values are:
Line A: Error = y − ŷ = 2 − 3 = −1
Line B: Error = y − ŷ = 2 − 2.75 = −0.75
Now, we can make a table of error values for both equations and determine which is the better equation.

Next, construct a table to see which equation provides the lowest sum of squared error values.
[Tables for Line A and Line B: x | y | ŷ | y − ŷ | (y − ŷ)²]
The bottom value in the 5th column is the sum of squared errors, \( \sum (y - \hat{y})^2 \).
We can see that Line B provides the smaller sum of squared errors, so it fits the data better.
We still do not know if Line B is the best line because there are many more candidate lines to compare. In fact, we can propose an infinite number of lines.
Now, we introduce an equation that will give us the best line, the Regression Equation. We will calculate the best \( b_1 \) and \( b_0 \) for the equation.
First, some notation we need to know:
\[ S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} \]
\[ S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \]

As an example, let's look back at the used car data (Table 2) from last week in Lecture 3.
Step 1: Construct a table with the following columns:
[Table: Car | Age (yr) x | Price ($100) y | xy | x²]
This table provides all the numbers we need to complete the Regression Equation.
The Regression Equation for a set of n data points is \( \hat{y} = b_0 + b_1 x \), where
\[ b_1 = \frac{S_{xy}}{S_{xx}} \qquad \text{and} \qquad b_0 = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right) \]
Step 2: Calculate \( b_1 \), the slope of the regression line.
\[ b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{4732 - (58)(975)/11}{326 - (58)^2/11} = -20.26 \]

Step 3: Calculate \( b_0 \), the y intercept.
\[ b_0 = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right) = \frac{1}{11}\left[975 - (-20.26)(58)\right] = 195.47 \]
Step 4: Fill in the Regression Equation.
\[ \hat{y} = 195.47 - 20.26x \]
Once you have the regression equation, you can graph the line over a scatterplot of the observed data.
Predicted values of y for x = 2, 4, and 6: (2, 154.95), (4, 114.43), (6, 73.91)
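Steps 2 through 4 can be checked numerically from the column sums alone. The sketch below assumes the sums \( \sum x = 58 \), \( \sum y = 975 \), \( \sum xy = 4732 \), and \( \sum x^2 = 326 \) for n = 11 cars; these values reproduce the predicted price of 73.91 at x = 6, and the variable names are illustrative.

```python
# Column sums for the used car data (assumed values, see the note above).
n = 11
sum_x, sum_y = 58, 975
sum_xy, sum_x2 = 4732, 326

# Preliminary quantities S_xy and S_xx.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

b1 = s_xy / s_xx                  # slope of the regression line
b0 = (sum_y - b1 * sum_x) / n     # y intercept

def predict(age):
    """Predicted price (in $100) for a car of the given age in years."""
    return b0 + b1 * age

print(round(b1, 2), round(b0, 2))
for age in (2, 4, 6):
    print(age, round(predict(age), 2))
```

The printed predictions can differ from hand calculations in the last decimal place because the code does not round \( b_0 \) and \( b_1 \) before using them.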

MA 113 Lecture 4
Table 1. Simple (x, y) data set. (x | y)
Equations 1 and 2. Preliminary calculations for the Regression Equation:
\[ S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}, \qquad S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \]
Equation 3. The Regression Equation:
\[ \hat{y} = b_0 + b_1 x \]
Table 2. Age and price data for a sample of 11 used Orion cars. (Car | Age (yr) x | Price ($100) y)

Equation 4. Slope of the regression line, \( b_1 \).
\[ b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} \]
Equation 5. The y intercept, \( b_0 \).
\[ b_0 = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right) \]


The Coefficient of Determination
Now, we can check the usefulness of the regression equation by using some diagnostic techniques. For our used car example, is the regression equation useful for predicting price, or could we do just as well by ignoring age?
One method is to determine the percentage of variation in the observed values of the response variable (y) that is explained by the regression.
To find this percentage, we need to calculate 2 measures of variation:
a) Total variation in the observed values of the response variable
b) Amount of variation in the observed values of the response variable that is explained by the regression
Step 1: Get the sum of squared deviations of the observed values of y from the mean of y (remember the sample variance of heights of basketball players). This is the total sum of squares:
\[ SST = \sum (y_i - \bar{y})^2 \]
Step 2: Get the sum of squared deviations of the predicted values of y from the mean of y. This is the regression sum of squares:
\[ SSR = \sum (\hat{y}_i - \bar{y})^2 \]

Step 3: Now, we use SST and SSR to get the percentage of variation in the observed values of y that is explained by the regression, or the coefficient of determination, denoted by \( r^2 \):
\[ r^2 = \frac{SSR}{SST} \]
\( r^2 \) always lies between 0 and 1.
A value near 0 suggests the regression equation is not very useful for predictions.
A value near 1 suggests the regression equation is very useful for predictions.
Let's return to the used car example (Table 1) and calculate \( r^2 \).
Table for computing SST, with \( \bar{y} = 88.6 \):
[Table: x | y | y − ȳ | (y − ȳ)²]

Table for computing SSR:
[Table: x | y | ŷ | ŷ − ȳ | (ŷ − ȳ)²]
\( r^2 = SSR/SST \approx 0.853 \), and 0.853 × 100 = 85.3%.
This is a very good regression equation for predicting the price of this type of used car based on age.

MA 113 Lecture 5
Equation 1. Total sum of squares, SST.
\[ SST = \sum (y_i - \bar{y})^2 \]
Equation 2. Regression sum of squares, SSR.
\[ SSR = \sum (\hat{y}_i - \bar{y})^2 \]
Equation 3. Coefficient of determination, \( r^2 \).
\[ r^2 = \frac{SSR}{SST} \]
Table 1. Age and price data for a sample of 11 used Orion cars. (Car | Age (yr) x | Price ($100) y)

Linear Correlation
With the last procedure, we assessed the fit of a constructed linear equation to a set of (x, y) data points. Now, we are going to assess the correlation between two variables with a linear correlation coefficient, r. It is also called the Pearson product moment correlation coefficient.
r will always lie between −1 and 1.
Some properties of r:
1. The value of r reflects the slope of the scatterplot. It is positive when the scatterplot shows a positive slope and negative when the scatterplot shows a negative slope.
2. The magnitude of r indicates the strength of the linear relationship. A value close to −1 or 1 indicates a strong relationship, and the variable x is a good linear predictor of y. A value near 0 indicates a weak linear relationship between x and y.
3. The sign of r suggests the type of linear relationship. A positive value means that y tends to increase as x increases, and the tendency is greater as the value approaches 1. A negative value means that y tends to decrease as x increases, and the tendency is greater as the value approaches −1.
The formula for r:
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]
You will notice that the denominator contains most of the equation for the standard deviation of x and of y.

For an example, we return to the used Orion car data, where x is the age of the car in years and y is the price in units of $100.
[Table: x | y | (x − x̄) | (y − ȳ) | (x − x̄)(y − ȳ) | (x − x̄)² | (y − ȳ)²]
The only column we have not computed in the past is column 5.
\( \bar{x} = 5.27 \), \( \bar{y} = 88.64 \)
Now plug the sums from the table into the equation:
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{-408.91}{(4.49)(98.54)} = -0.924 \]
Also, the coefficient of determination, \( r^2 \), equals the square of the linear correlation coefficient, r. For this example:
\( r^2 = 0.853 \) (from Lecture 5)
\( \sqrt{r^2} = \sqrt{0.853} = 0.924 \) (the square root cannot recover the negative sign)
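The correlation coefficient can be recomputed from summary quantities. This is a sketch under stated assumptions: the column sums \( \sum x = 58 \), \( \sum y = 975 \), \( \sum xy = 4732 \), \( \sum x^2 = 326 \) with n = 11 are taken from the regression example, and the value 98.54 for \( \sqrt{\sum (y - \bar{y})^2} \) is taken from the denominator shown above.

```python
import math

# Assumed column sums for the 11 used cars (see the note above).
n = 11
sum_x, sum_y = 58, 975
sum_xy, sum_x2 = 4732, 326

s_xy = sum_xy - sum_x * sum_y / n   # equals the sum of (x - xbar)(y - ybar)
s_xx = sum_x2 - sum_x ** 2 / n      # equals the sum of (x - xbar)^2
root_s_yy = 98.54                   # sqrt of the sum of (y - ybar)^2, from the text

r = s_xy / (math.sqrt(s_xx) * root_s_yy)
r_squared = r ** 2

print(round(r, 3), round(r_squared, 3))
```

The sign of r matches the negative slope of the regression line, while \( r^2 \) carries no sign information.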

MA 113 Lecture 6
Equation 1. Formula for r, Pearson correlation coefficient.
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]
Table 1. Orion used car data, where x is age of car (years) and y is price (× $100), with columns for calculating r. (x | y | (x − x̄) | (y − ȳ) | (x − x̄)(y − ȳ) | (x − x̄)² | (y − ȳ)²)

Random Variables
A random variable is a quantitative variable whose value depends on chance.
As an example, I can ask each student how many siblings they have. The number of siblings will vary among students, and if I select a student at random, the value of the variable is random. The value depends on the chance of which student is selected.
A discrete random variable is a random variable whose possible values can be listed.
Notation is a bit different for random variables vs. variables. Instead of x, y, and z, we use the upper case letters X, Y, and Z. If X is the number of siblings of a student, then P(X = 2) is the notation for the probability that a student has 2 siblings.
Like earlier, we can take the values we obtain, construct a probability distribution, and then graph the information as a probability histogram. Before, we called these a relative frequency distribution and histogram.
Note that the sum of the probabilities will equal 1.

Example: Enrollment data for U.S. public schools by grade level.
[Table: Grade (y) | Freq | P(Y = y)]
What is P(Y = 5)? It is (frequency for grade 5)/33,647 ≈ 0.111, so 11.1% of elementary school students in the U.S. are in 5th grade.
A bit more complex example: We toss a dime 3 times, giving 8 equally likely outcomes. Our variable of interest (X) is the total number of heads obtained in 3 tosses.
HHH HTH THH TTH HHT HTT THT TTT
[Table: # of heads | P(X = x)]
What is P(X = 2)? 3/8 = 0.375 (outcomes HTH, THH, HHT).
What is P(X ≤ 2)? P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = 1/8 + 3/8 + 3/8 = 0.875 (every outcome except HHH).
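The coin toss distribution can be built by listing the 8 outcomes and counting heads. A minimal sketch, assuming a fair coin so that all outcomes are equally likely; exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 equally likely outcomes of 3 tosses, e.g. "HHT".
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# Number of outcomes for each possible count of heads.
heads = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution of X = number of heads.
pmf = {k: Fraction(v, len(outcomes)) for k, v in sorted(heads.items())}

p_two = pmf[2]
p_at_most_two = sum(p for k, p in pmf.items() if k <= 2)

print(pmf)
print(p_two, p_at_most_two)
```

The same pattern extends to more tosses by changing `repeat`, although the number of outcomes doubles with each added toss.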


Now, let's compare a real example with the probability table we just constructed. We flipped 3 dimes 1,000 times and recorded the number of heads we observed each time.
[Table: # of heads | Freq | Proportion]
[Figures: Expected Results and Experimental Results]
Mean and standard deviation of a discrete random variable
Notation for a sample (rather than a population):
For a sample: \( \bar{x} = \frac{\sum x_i}{n} \)
For a population: \( \mu = \frac{\sum x_i}{N} \)
We can express the mean value in terms of the probability distribution of X, here the ages of 8 students: \( \mu \) = (sum of the 8 ages)/8.

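The probability distribution for X:
[Table: Age (x) | P(X = x)]
\( \mu = [19 \cdot P(X=19)] + [20 \cdot P(X=20)] + [21 \cdot P(X=21)] + [27 \cdot P(X=27)] \)
[Table: Age (x) | P(X = x) | x · P(X = x)]
So, the mean of a discrete random variable is
\[ \mu = \sum x \, P(X = x) \]
Standard deviation and variance
For a sample:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \text{variance}, \qquad s = \sqrt{s^2} = \text{SD} \]
For a population:
\[ \sigma^2 = \sum (x - \mu)^2 \, P(X = x) = \text{variance}, \qquad \sigma = \sqrt{\sigma^2} = \text{SD} \]
For the data for age of 8 students (μ = 20.88):
[Table: Age (x) | P(X = x) | x − μ | (x − μ)² | (x − μ)² P(x)]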

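Using the formula instead of the table:
\[ \sigma^2 = [(19 - 20.88)^2 \cdot 0.250] + [(20 - 20.88)^2 \cdot 0.375] + [(21 - 20.88)^2 \cdot 0.250] + [(27 - 20.88)^2 \cdot 0.125] = 5.86 \]
and \( \sigma = \sqrt{\sigma^2} = \sqrt{5.86} = 2.42 \)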
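The mean, variance, and standard deviation of the age distribution can be computed directly from the probabilities. This sketch assumes the probabilities 0.250, 0.375, 0.250, and 0.125 for ages 19, 20, 21, and 27, which sum to 1 and correspond to 2, 3, 2, and 1 of the 8 students.

```python
import math

# Probability distribution of X = age of a randomly selected student (assumed values).
pmf = {19: 0.250, 20: 0.375, 21: 0.250, 27: 0.125}

# Mean of a discrete random variable: sum of x * P(X = x).
mu = sum(x * p for x, p in pmf.items())

# Variance: sum of (x - mu)^2 * P(X = x); SD is its square root.
variance = sum((x - mu) ** 2 * p for x, p in pmf.items())
sigma = math.sqrt(variance)

print(mu, round(variance, 2), round(sigma, 2))
```

The unrounded mean is 20.875; the text rounds it to 20.88 before squaring, which does not change the variance at two decimal places.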


The Normal Distribution
In life, we deal with a variety of variables, and many of them have a common distribution in the shape of a bell-shaped curve. We call it a normal curve because researchers found that it was a common occurrence for a variable to have this distribution.
If a population variable is normally distributed, we say that we have a normally distributed population. But, in practice, we rarely see a distribution that is exactly this shape, so we often say a variable is approximately normally distributed.
A normal distribution is determined by the mean and SD, so we call these measurements the parameters of the normal curve. Given just these parameters, we can graph any normally distributed variable.

Frequency and relative frequency distributions for heights of female college students (n = 3,264) at a small mid-western college, with μ = 64.4 and σ = 2.4:
[Table: Ht (in) | Freq | Rel Freq]
[Figure: relative frequency distribution graph of the data with the normal curve with parameters μ = 64.4 and σ = 2.4; horizontal axis: Ht (in), vertical axis: Rel Freq]
Remember, when we add up the proportions of all the relative frequency bars we get 1.0. The same applies to the area under the curve: the area equals 1.0.

Equation for the distribution of a Normal Random Variable x:
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(1/2)[(x - \mu)/\sigma]^2} \]
where μ = mean of the normal random variable x, σ = standard deviation, π ≈ 3.1416, and e ≈ 2.7183.
For the prior example on women's height:
[Table: x | f(x)]
Now, for example, what is P(X = 67)? According to the relative frequency, it is 0.0735 or 7.35%, the crosshatched area of the bar. If we superimpose the normal curve over the actual distribution, we find that the blue shaded area approximates the area of the crosshatched bar.

Standardizing a Normally Distributed Variable
Now, how do we find areas under a normal curve? We would need a table of areas for each conceivable normal curve (all σ and μ), an infinite number. So we standardize, or transform, every normal distribution into one in particular, the Standard Normal distribution. This distribution has a mean of 0 and an SD of 1.0.
We do this by transforming our observed variables into z scores:
\[ z = \frac{x - \mu}{\sigma} \]
For a series of numbers x = −1, 3, 3, 3, 5, 5: μ = 3.0 and σ = 2.0, so
\[ z = \frac{x - \mu}{\sigma} = \frac{x - 3}{2}, \qquad z_1 = \frac{-1 - 3}{2} = -2, \quad z_2 = \frac{3 - 3}{2} = 0, \text{ etc.} \]
[Table: x | z]
Now, treat z as any variable when you compute the mean and SD, and you will always get μ = 0 and σ = 1:
\[ \mu_z = \frac{\sum z_i}{N} = \frac{0}{6} = 0 \]


For the SD of z:
\[ \sigma_z = \sqrt{\frac{\sum (z_i - \mu_z)^2}{N}} = \sqrt{\frac{6}{6}} = 1 \]
In the numerator, \( (-2 - 0)^2 + (0 - 0)^2 + \dots + (1 - 0)^2 = 6 \).
[Figure: the example normal curves from earlier, before and after standardizing]
For practical use, this tells us that if we have any variable (x) that is normally distributed with mean (μ) and SD (σ), we can use the z conversion to find areas of interest under the standardized curve. We will use a standard normal table to find values for area.
If we want the area between points a and b (real numbers) where a < b, we can find the percentage of all possible observations of x that lie between a and b by calculating (a − μ)/σ and (b − μ)/σ and looking up the values in our z table.
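The claim that z scores always have mean 0 and SD 1 can be verified for the small example. This sketch assumes the data are −1, 3, 3, 3, 5, 5 (the values consistent with μ = 3.0) and uses the population standard deviation, which divides by N.

```python
import statistics

# Example data (assumed values; they give mu = 3.0 and sigma = 2.0).
xs = [-1, 3, 3, 3, 5, 5]

mu = statistics.fmean(xs)
sigma = statistics.pstdev(xs)   # population SD, divides by N

# Standardize each observation: z = (x - mu) / sigma.
zs = [(x - mu) / sigma for x in xs]

mu_z = statistics.fmean(zs)
sigma_z = statistics.pstdev(zs)

print(zs, mu_z, sigma_z)
```

The same check works for any data set with a nonzero SD, because subtracting the mean centers the values and dividing by the SD rescales them.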

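60 For our example of heights of college women: what is the probability of selecting a woman who is 67 inches in height or shorter (that is, under 68 inches)?
From the frequency table, the cumulative relative frequency is 3,040/3,264 = 0.9314.
Or, using the normal curve: $z = \frac{x - \mu}{\sigma} = \frac{68 - 64.4}{2.4} = 1.5$
In the z table, 1.5 corresponds to an area of 0.9332.
[Table: Ht (in), Freq, Rel Freq]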

61 We have found the area under the normal curve that represents the whole range of possibilities below 68 inches, or 93.3% of observations.
Finding areas under the Normal Curve
Properties of the Standard Normal Curve (SNC):
Total area under the curve is 1.0
It is symmetric about 0
Almost all area under the curve lies between −3 and 3
Now, we will calculate areas under the curve under various scenarios: to the right, to the left, and between values.

62 1st curve: simply the area to the left of z
2nd curve: 1 − (area to left of z)
3rd curve: (area to left of $z_2$) − (area to left of $z_1$)
Examples:
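The three cases (left, right, between) can be sketched in code in place of a z table lookup. The function names here are assumptions for illustration; the left-tail area is computed from the error function, which gives the same values a standard normal table lists, to more decimal places.

```python
import math

def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_right(z):
    """Area to the right of z: 1 minus the area to the left."""
    return 1 - area_left(z)

def area_between(z1, z2):
    """Area between z1 and z2 (z1 < z2): left of z2 minus left of z1."""
    return area_left(z2) - area_left(z1)

print(round(area_left(1.5), 4))         # left-tail area for z = 1.5
print(round(area_right(1.5), 4))        # right-tail area for z = 1.5
print(round(area_between(-3, 3), 4))    # almost all of the area
```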

63 For z = 1.3 For z = 0.76 For z = 0.68 For z =

64 Examples with real data
Intelligence quotients (IQs) are normally distributed with μ = 100 and σ = 16. What is the % of people who have IQs between 115 and 140?
For x = 115, z = (115 − μ)/σ = (115 − 100)/16 = 0.94
For x = 140, z = (140 − μ)/σ = (140 − 100)/16 = 2.50
Area to left of 0.94 (from z table) = 0.8264
Area to left of 2.50 = 0.9938
So, 0.9938 − 0.8264 = 0.1674, and we can say that 16.74% of people have an IQ between 115 and 140.
We can also ask about a certain area, find a z score, then calculate a value (reversing the process we just covered).
What is the IQ that is the cutoff for the top 10% of all people?
In the z table, we see that 0.8997 is the value closest to 0.90. That corresponds to a z score of 1.28.
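The IQ calculation can be reproduced with Python's standard library. This is a sketch, not the method used in the notes: `statistics.NormalDist` stands in for the printed z table, and z scores are rounded to two decimals to mimic a table lookup.

```python
from statistics import NormalDist

mu, sigma = 100, 16
std_normal = NormalDist()  # standard normal curve: mean 0, SD 1

# Table-style calculation: round each z score to 2 decimals first
z1 = round((115 - mu) / sigma, 2)   # 0.94
z2 = round((140 - mu) / sigma, 2)   # 2.5
table_share = std_normal.cdf(z2) - std_normal.cdf(z1)

# Same question without rounding the z scores
exact_share = NormalDist(mu, sigma).cdf(140) - NormalDist(mu, sigma).cdf(115)

print(round(table_share, 4), round(exact_share, 4))
```

The two answers differ slightly in the third decimal place because the table method rounds 0.9375 up to 0.94.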


65 z = 1.28 = (x − 100)/16
Multiply both sides by the SD: 1.28 × 16 = x − 100, so 20.48 = x − 100
Add the mean to both sides: 20.48 + 100 = x, so x = 120.48
So, 90% of people have IQs below 120.48 and 10% have higher IQs.
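The reverse process (area to z score to value) can be sketched the same way. Here `inv_cdf` from `statistics.NormalDist` replaces the reverse table lookup, and the helper name `value_from_z` is an assumption for this example.

```python
from statistics import NormalDist

mu, sigma = 100, 16

def value_from_z(z, mu, sigma):
    """Undo the z conversion: x = mu + z * sigma."""
    return mu + z * sigma

# Table route: the notes read z = 1.28 for an area of 0.90 to the left
cutoff_table = value_from_z(1.28, mu, sigma)

# Direct route: ask for the z score with 90% of the area to its left
z_exact = NormalDist().inv_cdf(0.90)
cutoff_exact = value_from_z(z_exact, mu, sigma)

print(round(cutoff_table, 2), round(cutoff_exact, 2))
```

The small gap between the two cutoffs comes from the table giving z to only two decimals.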

66 Determine the area under the SNC that lies to the left of:
Determine the area under the SNC that lies to the right of:
Determine the area under the SNC that lies between:.18 and 1.44 and and 1.51
Find the z score for which the area under the SNC to its left is 0.025 or 2.5%
Find the z score that has area of 0.70 to its right

67 Frequency and relative frequency distributions for heights of female college students (n = 3,264) at a small midwestern college: μ = 64.4, σ = 2.4
[Table: Ht (in), Freq, Rel Freq]
What is the % of female students with heights between 65 and 70 inches?
$z_2 = (70 - 64.4)/2.4 = 2.33$
$z_1 = (65 - 64.4)/2.4 = 0.25$
Area to left of $z_2$ = 0.9901
Area to left of $z_1$ = 0.5987
Area between = 0.9901 − 0.5987 = 0.3914, or 39.1% of female students will have heights between 65 and 70 inches.
In the table, the relative frequency between 65 and 70 inches is 40.6%.



70 Sample Distribution and Sampling Error
We've talked about how much time and money we can save by taking a sample from a large population instead of a census.
But a sample guarantees a certain amount of sampling error (x̄ and s will rarely equal the population mean and SD exactly).
If we are sampling from a normal population, we can expect that our sample will be normally distributed and x̄ will be close to μ.
For a variable x and a given sample size, the distribution of the variable x̄ is called the sampling distribution of the sample mean.
Example: Population is 5 starting players on a men's basketball team, players A, B, C, D, and E

    Player     A    B    C    D    E
    Ht (in)    76   78   79   81   86

$\mu = \frac{\sum x_i}{N} = \frac{400}{5} = 80$

If we take a random sample of 2 players, we have the following sampling distribution of x̄ with 10 possible combinations:

    Sample    Hts      x̄
    A,B       76,78    77.0
    A,C       76,79    77.5
    A,D       76,81    78.5
    A,E       76,86    81.0
    B,C       78,79    78.5
    B,D       78,81    79.5
    B,E       78,86    82.0
    C,D       79,81    80.0
    C,E       79,86    82.5
    D,E       81,86    83.5

71 We see that the mean height of a sample of 2 players isn't likely to equal μ = 80.0; only 1 of the 10 sample means equals 80.
So, we can say that there is a 1/10 = 10% chance that x̄ = μ.
Also, 3/10 samples have means within 1 inch of μ, so we can say that the probability is 0.3, or there is a 30% chance, that a sample mean will be within 1 inch of μ.
Now, let's choose 4 players at random (only 5 possibilities):

    Sample     x̄
    A,B,C,D    78.50
    A,B,C,E    79.75
    A,B,D,E    80.25
    A,C,D,E    80.50
    B,C,D,E    81.00

[Figure: graph of the distribution of sample means]
Now, none of the sample means equal μ, but 4/5 or 80% are within 1 inch of μ, or P(79.0 ≤ x̄ ≤ 81.0) = 0.8
[Figure: sampling distributions of x̄ as sample size increases]

72 This demonstrates that sampling error tends to be smaller for larger samples. This is what we see with our basketball player example:

    Sample     # possible   # within     % within     # within       % within
    size (n)   samples      1" of μ      1" of μ      0.5" of μ      0.5" of μ
    1          5            2            40%          0              0%
    2          10           3            30%          2              20%
    3          10           5            50%          2              20%
    4          5            4            80%          3              60%
    5          1            1            100%         1              100%

There is a simple relationship between the mean of the variable x̄ and the population mean, μ:

$\mu_{\bar{x}} = \mu$

This means that if we take all possible x̄ for any sample size and take their mean ($\mu_{\bar{x}}$, the population mean of the sampling distribution), we will get μ, the mean of the entire population.
For our basketball example for sample size 4:

$\mu_{\bar{x}} = \frac{78.5 + 79.75 + 80.25 + 80.5 + 81.0}{5} = \frac{400}{5} = 80.0$

There is also a relationship between the standard deviation (SD) of the variable x̄ and the population SD, σ:

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
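The basketball table can be rebuilt by brute force. The sketch below lists every possible sample of each size, then counts how many sample means land within 1 inch and 0.5 inch of μ. Function and variable names are assumptions for this illustration; the heights are the five from the example.

```python
from itertools import combinations
from statistics import fmean

heights = [76, 78, 79, 81, 86]   # players A-E, from the notes
mu = fmean(heights)              # population mean, 80.0

def sample_means(n):
    """Means of every possible sample of n players (no replacement)."""
    return [fmean(sample) for sample in combinations(heights, n)]

def count_within(means, tolerance):
    """How many sample means fall within `tolerance` inches of mu."""
    eps = 1e-9  # guards against floating-point round-off at the boundary
    return sum(1 for m in means if abs(m - mu) <= tolerance + eps)

summary = {}
for n in range(1, len(heights) + 1):
    means = sample_means(n)
    summary[n] = (len(means), count_within(means, 1.0),
                  count_within(means, 0.5), fmean(means))
    # columns: n, # samples, # within 1", # within 0.5", mean of sample means
    print(n, *summary[n])
```

The last column shows that the mean of all possible sample means equals μ for every sample size.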

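73 For our basketball example for sample size 4:

$\sigma_{\bar{x}} = \sqrt{\frac{(78.5-80)^2 + (79.75-80)^2 + (80.25-80)^2 + (80.5-80)^2 + (81-80)^2}{5}} = \sqrt{\frac{3.625}{5}} = \sqrt{0.725} = 0.85$

Note: when sampling is done without replacement from a finite population (basketball example), the formula $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ will not give you the exact SD of x̄. Here σ = 3.41, so σ/√4 = 1.70, twice the exact value of 0.85.
For all possible sample sizes:

    Sample size (n)    SD of x̄
    1                  3.41
    2                  2.09
    3                  1.39
    4                  0.85
    5                  0.00

Example from U.S. Census Bureau
Mean living space, μ, for a single family detached home is 1,742 sq. ft. and the SD, σ, is 568 sq. ft.
a) For a sample of 25 single family homes, what are the mean and SD of the variable x̄?

$\mu_{\bar{x}} = \mu = 1{,}742$, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{568}{\sqrt{25}} = 113.6$

b) For a sample size of 500?

$\mu_{\bar{x}} = \mu = 1{,}742$, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{568}{\sqrt{500}} = 25.4$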
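A short sketch of the σ/√n calculation follows. The second function applies the standard finite population correction, which is not in the notes; it is included only to show why σ/√n misses the exact value when sampling without replacement from a small population. Both function names are assumptions.

```python
import math
from statistics import pstdev

def standard_error(sigma, n):
    """SD of the sample mean for a large population: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def standard_error_finite(sigma, n, population_size):
    """Same, with the finite population correction (not covered in the notes)."""
    correction = math.sqrt((population_size - n) / (population_size - 1))
    return standard_error(sigma, n) * correction

# Census example: sigma = 568 sq. ft.
se_25 = standard_error(568, 25)
se_500 = standard_error(568, 500)

# Basketball example: 5 players, samples of 4, drawn without replacement
sigma_players = pstdev([76, 78, 79, 81, 86])
se_naive = standard_error(sigma_players, 4)
se_corrected = standard_error_finite(sigma_players, 4, 5)

print(round(se_25, 1), round(se_500, 1))
print(round(se_naive, 2), round(se_corrected, 2))
```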


74 Confidence Intervals for a Population Mean
Remember, when we get a sample mean (x̄), we are getting an estimate of the population mean (μ), which we will call a point estimate.
A sample mean is usually not equal to the population mean because we will have sampling error.
Now, we will attach information to the estimate that indicates its accuracy: a confidence interval estimate for μ, or CI.
The more information we have for our sample (greater sample size), the more confident we will be that we are close to μ.
We should now know how to compute areas under the standard normal curve (SNC) between two critical values. You should be able to do this whether you start with z scores or values of x.
From lecture 9: Computing a 95% CI is similar to finding the critical values that are the boundaries for the middle 95% of data for a population (as opposed to a sample).


75 The difference is that we are computing critical z values that are the boundaries for a CI surrounding our sample mean.
In other words, how confident are we that the population mean (μ) lies within the CI we have constructed around our sample mean (x̄)?
If we use calculations for a 95% CI, then we are 95% confident that μ is within the CI.
An increase in sample size will lead to a narrower CI surrounding x̄.
For a 95% CI, the critical z values will always be −1.96 to the left and 1.96 to the right.
1 − 0.95 = 0.05, and the area of 0.05 is divided by 2 (for the right and left sides), so 0.05/2 = 0.025.
In the z table, the z value of −1.96 corresponds to an area under the SNC of 0.025.
The area of equal size to the right always has the same z value but positive.


76 These critical values of z are given a special notation depending on our desired level of confidence (CL).
We need to write the CL in the form 1 − α, where α is the number that must be subtracted from 1 to get the CL.
If we want a 95% CI, 1 − 0.95 = 0.05 = α and the associated z value is $z_{0.05}$.
But because we want to split the area between the left and right sides, we want $z_{\alpha/2}$, or $z_{0.05/2} = z_{0.025}$.
This gives us the formula: α/2 = (1 − CL)/2
For a CL of 90%: α/2 = (1 − 0.90)/2 = 0.05
So, we need critical cutoff values of $\pm z_{0.05}$.
In the z table, the area under the SNC of 0.05 corresponds to a z value of −1.64.
In most cases, we are interested in CLs of 90%, 95%, and 99% (z = 1.64, 1.96, and 2.57, respectively; more precisely 1.645, 1.960, and 2.576).
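The lookup of a critical z value from a confidence level can be sketched as follows. The function name `critical_z` is an assumption, and `statistics.NormalDist.inv_cdf` stands in for reading the z table in reverse.

```python
from statistics import NormalDist

def critical_z(confidence_level):
    """Positive critical z value for a two-sided CI at the given confidence level."""
    alpha = 1 - confidence_level      # total area left outside the interval
    tail_area = alpha / 2             # split evenly between the two tails
    # z score with (1 - tail_area) of the area to its left
    return NormalDist().inv_cdf(1 - tail_area)

for cl in (0.90, 0.95, 0.99):
    print(cl, round(critical_z(cl), 3))
```

The negative critical value is the same number with the sign flipped, since the curve is symmetric about 0.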

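77 Confidence Intervals with known σ
If we know the population standard deviation (σ), the following formulas give us the CI surrounding x̄:

$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ (lower limit) and $\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ (upper limit), or $\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$

Example: Take a sample of verbal SAT scores from n = 40 high school students. We know the population SD is σ = 83.8 and our sample mean is x̄ = 450.5. Construct a 95% CI.
First we need the critical values $-z_{\alpha/2}$ and $z_{\alpha/2}$:
α = (1 − CL) = (1 − 0.95) = 0.05, so α/2 = 0.025 and $z_{0.025} = 1.96$
For our 95% CI:

$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 450.5 - 1.96 \cdot \frac{83.8}{\sqrt{40}} = 450.5 - 26.0 = 424.5$

$\bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 450.5 + 1.96 \cdot \frac{83.8}{\sqrt{40}} = 450.5 + 26.0 = 476.5$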
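The known-σ interval can be computed in a few lines. This is a minimal sketch using the SAT numbers from the example (x̄ = 450.5, σ = 83.8, n = 40); the function name `z_interval` is an assumption.

```python
import math

def z_interval(xbar, sigma, n, z_crit=1.96):
    """CI for the population mean when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    margin = z_crit * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

# Verbal SAT example: 95% CI, so z_crit = 1.96
lower, upper = z_interval(xbar=450.5, sigma=83.8, n=40)
print(round(lower, 1), round(upper, 1))
```

Passing a different `z_crit` (1.64 for 90%, 2.57 for 99%) gives the other common confidence levels.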

78 This is interpreted as: we are 95% confident that the population mean (μ) lies within the range of 424.5 to 476.5.
This also implies that about 5% of intervals constructed this way will not contain μ.
Actually, μ = from a population of 78 students
[Figure: the 95% CI around x̄, with μ marked]


79 Confidence Intervals with unknown σ
This is very similar to the CI with known σ, but there are 2 things we must do differently:
1. First, we have to estimate σ using our sample data to calculate a sample SD (s)
2. Next, we can't use the z table (that's only for large samples), so we use a t table (for small samples)
The numbers in the body (middle) of the table represent an estimation of z scores, but for smaller samples.
To get values of t, we need 2 things: α and the degrees of freedom (df), which is n − 1 (n = sample size).


80 In the example above, n = 14 so df = 13, and if we're looking for the critical value of $t_{\alpha}$, we look for $t_{0.05}$ across the top of the table and df along the side to get 1.771.
We find α/2 with the same equation: α/2 = (1 − CL)/2
The equations to calculate CIs are solved the same way, but we use a t value instead of z and the sample SD (s) instead of σ:

$\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}}$ and $\bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$

We will use the same data set of verbal SAT scores for an example (sample of 40 scores).
From the beginning of the semester, we used these formulas to calculate s:

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, $s = \sqrt{s^2}$


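81 The sample SD from this sample data set is s = 96.9 and the sample mean is the same, x̄ = 450.5.
The t value for a 95% CI is $t_{0.025}$ (remember α/2) with df = 39, so the critical $t_{0.025}$ = 2.023.
We put these values into the new equations:

$\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} = 450.5 - 2.023 \cdot \frac{96.9}{\sqrt{40}} = 450.5 - 31.0 = 419.5$

$\bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}} = 450.5 + 2.023 \cdot \frac{96.9}{\sqrt{40}} = 450.5 + 31.0 = 481.5$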
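The unknown-σ interval follows the same pattern with t and s. In this sketch the critical t value is passed in by hand, as read from a t table (2.023 for df = 39), because Python's standard library has no t distribution; the function name `t_interval` is an assumption.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """CI for the population mean when sigma is unknown: xbar +/- t * s / sqrt(n).

    t_crit is the table value of t for alpha/2 with n - 1 degrees of freedom.
    """
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Verbal SAT example: s = 96.9, n = 40, df = 39, t_0.025 = 2.023
t_lower, t_upper = t_interval(xbar=450.5, s=96.9, n=40, t_crit=2.023)
print(round(t_lower, 1), round(t_upper, 1))
```

With these inputs the interval is about 62 points wide, compared with about 52 points for the known-σ interval on the same sample.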

82 This has the same interpretation: we are 95% confident that the population mean (μ) lies within the range of 419.5 to 481.5.
But notice that the 95% CI is wider using a sample SD.
Remember, μ = from a population of 78 students
[Figure: comparison of the 95% CI with σ known and with s]


84 Hypothesis Testing
One of the primary uses of calculated statistics is to make decisions about the value of a parameter (inferential stats).
Examples:
Is the mean weight, μ, of bags of pretzels produced by a company really 454 grams as advertised?
Has the mean age of all cars in use increased from the 1995 mean of 8.5 years?
Has the mean weight of white-tailed deer does in a particular area of Mississippi increased or decreased from year to year?
We can attempt to answer these questions by setting up a null hypothesis, $H_0$, to test.
If we have a null hypothesis, we need at least one alternative hypothesis, $H_1$ or $H_a$, to test against.
For the pretzel example:
$H_0$: mean wt. of bags is 454 g; $H_0$: μ = 454 g
$H_a$: mean wt. of bags is not 454 g; $H_a$: μ ≠ 454 g
We call this a two-tailed test because under $H_a$ the mean can be > or < 454 g.
There are also one-tailed tests, where $H_a$ is either > or <.


85 We will concentrate on two-tailed tests.
We reject our $H_0$ if our calculated z value falls outside the "do not reject" region of the standard normal curve (SNC).
Notice that the two-tailed SNC above is identical to the CI curve we discussed in Lectures 9 and 10 (below).
For a hypothesis test, we call the associated α value the significance level.
Example: If we want to be 95% confident in deciding whether our sample mean, x̄, is consistent with our hypothesized mean, μ, we use a significance level of α = 0.05.
Recall from Lecture 9, the associated critical z value for α = 0.05 with two tails (so α/2) is ±1.96.
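The rejection rule can be sketched as below. Two assumptions are made here that this part of the notes does not state: the test statistic is taken to be the usual z = (x̄ − μ₀)/(σ/√n) for a known σ, and the sample numbers (x̄, σ, n) are hypothetical values invented only to exercise the function. Only the 454 g null value and the ±1.96 cutoff come from the notes.

```python
import math

def two_tailed_z_test(xbar, mu0, sigma, n, z_crit=1.96):
    """Return the z statistic and whether H0: mu = mu0 is rejected.

    Assumes sigma is known and uses z = (xbar - mu0) / (sigma / sqrt(n)).
    H0 is rejected when z falls outside -z_crit .. +z_crit.
    """
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return z, abs(z) > z_crit

# Pretzel example, H0: mu = 454 g. Sample values below are hypothetical.
z_a, reject_a = two_tailed_z_test(xbar=450.0, mu0=454.0, sigma=7.8, n=25)
z_b, reject_b = two_tailed_z_test(xbar=453.0, mu0=454.0, sigma=7.8, n=25)

print(round(z_a, 2), reject_a)   # far from 454: falls in a rejection tail
print(round(z_b, 2), reject_b)   # close to 454: stays in the do-not-reject region
```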
