Statistical Concepts. Distributions of Data

Module 1.1: Review of Basic Statistical Concepts

Understanding Probability Distributions, Parameters and Statistics

A variable that can take on any value in a range is called a continuous variable.
Example: The concentration of a contaminant in water samples.

A variable that can take on only certain values is called a discrete variable.
Example: The number of animals visiting a contaminated site in a single day.

A probability distribution describes the values that a variable can take on and the probabilities associated with those values.
We use probability density functions (pdfs) to describe these distributions. We can also use cumulative distribution functions (cdfs).
For continuous variables, common distributions are the uniform, the triangular, the normal, and the lognormal.

Discrete probability distributions simply show the probability of each value occurring.
Note that the sum of all of the probabilities in a discrete pdf is one. For continuous pdfs, the area under the curve equals one.
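The two "adds up to one" statements can be checked numerically. The following is a minimal sketch, not part of the original slides: the fair die and the uniform range of 0 to 10 are assumed purely for illustration, and `area_under` is a hypothetical helper that approximates the area with the midpoint rule.

```python
from fractions import Fraction

# Discrete case (assumed example): a fair six-sided die, each face with probability 1/6.
die_pdf = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die_pdf.values())  # the probabilities sum to exactly 1


def uniform_pdf(x, low=0.0, high=10.0):
    """Continuous case (assumed range): constant height 1 / (high - low) inside the range."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def area_under(pdf, low, high, steps=10_000):
    """Approximate the area under pdf between low and high with the midpoint rule."""
    width = (high - low) / steps
    return sum(pdf(low + (i + 0.5) * width) for i in range(steps)) * width


area = area_under(uniform_pdf, 0.0, 10.0)
print("sum of die probabilities:", total_probability)
print("area under uniform pdf:", round(area, 6))
```

The discrete check is a plain sum, while the continuous check needs an area, which is why the height of a continuous pdf is not itself a probability.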

Examples of Discrete Probability Distribution Functions (pdfs)

[Figure: three bar charts of probability against value, titled "Toss of a Fair Coin" (results Head and Tail), "Roll of a Fair Die" (values 1 to 6), and "General Example of a Discrete pdf".]

If you sum up the probabilities shown, they sum to 1.

The Binomial Distribution

Applies in a situation where there are two possible outcomes (success and failure) and the probability of success is constant.
Example: Failure is defined to be contamination above a regulatory limit. Assume contamination is uniformly dispersed throughout an area such as a lake and n samples are collected. There will be variability in the amount of measured contamination in the samples due to sampling and measurement errors. There is a probability p that each of the samples will show contamination above the limit.
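For the contamination example, the binomial distribution gives the probability that exactly k of the n samples exceed the limit. Here is a minimal sketch, assuming the samples are independent; the values n = 10 and p = 0.2 are hypothetical and were chosen only to make the arithmetic concrete.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability that exactly k of n independent samples exceed the limit."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10   # hypothetical number of samples collected
p = 0.2  # hypothetical probability that any one sample exceeds the limit

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

prob_none = pmf[0]                 # no sample shows contamination above the limit
prob_at_least_one = 1 - prob_none  # one or more samples above the limit

print("P(no exceedances) =", round(prob_none, 4))
print("P(at least one)   =", round(prob_at_least_one, 4))
print("sum of all probabilities =", round(sum(pmf), 6))
```

As with any discrete distribution, the probabilities over k = 0 to n sum to one.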

Examples of Continuous Probability Distribution Functions (pdfs)

[Figure: four plots of probability against value, titled "Uniform Distribution", "Triangular Distribution", "Normal Distribution", and "Lognormal Distribution".]

Examples of Continuous Probability Distribution Functions (cdfs)

[Figure: four plots of cumulative probability against value, titled "Uniform Distribution", "Triangular Distribution", "Normal Distribution", and "Lognormal Distribution".]

Values that define key characteristics of probability distributions are called parameters.
Parameters are true values that are unknown and generally unknowable.

Parameters and Statistics

A parameter is a characteristic of a population. It is a value that we would only know if we had perfect information about the entire population. Since we never have this kind of knowledge, parameters can be considered unknown. They are the quantities that we try to estimate from our data.
Statistics are quantities calculated from data. For each parameter, there are one or more statistics that estimate it.

Parameters and Statistics

Example: The population mean is a parameter denoted by μ. The sample mean estimates μ and is denoted by Y with a bar over it, called Y-bar ($\bar{Y}$).

Notation:
N = number of units in the population
n = number of units in the sample

$$\mu = \frac{\sum_{i=1}^{N} Y_i}{N} \qquad\qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

Parameters and Statistics

The population standard deviation is a parameter denoted by σ. The sample standard deviation estimates it and is denoted by s.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}} \qquad\qquad s = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
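The calculations for the mean and the standard deviation are short enough to write out directly. This is a minimal sketch using the usual definitions, in which the population version divides by N and the sample version divides by n - 1. The eight data values are hypothetical.

```python
from math import sqrt


def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def population_sd(values):
    """sigma: divide by N. Appropriate only if values are the entire population."""
    mu = mean(values)
    return sqrt(sum((y - mu) ** 2 for y in values) / len(values))


def sample_sd(values):
    """s: divide by n - 1. Used when values are a sample from a larger population."""
    y_bar = mean(values)
    return sqrt(sum((y - y_bar) ** 2 for y in values) / (len(values) - 1))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements

print("mean:", mean(data))
print("standard deviation dividing by N:    ", population_sd(data))
print("standard deviation dividing by n - 1:", sample_sd(data))
```

For the same data, dividing by n - 1 always gives a slightly larger value than dividing by N, and the difference shrinks as n grows.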

Parameters and Statistics

Why square the deviations and then have to take the square root?
If you added up all of the deviations from the mean, the sum would be zero.
We must get them all to be positive values.
It is easiest to square them and then take a square root.

Parameters and Statistics

Once we have data and an equation to calculate a statistic, it's simple arithmetic to get the estimate.
However, the estimate is just that; it's not the actual value of the parameter. The true value of the parameter might be higher or lower than our estimate.
If the population was defined by the students registered for this class today, there is a true mean height of that population.
However, even if I tried to collect data on this population, I couldn't know the true mean. Why?
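On the earlier question of why the deviations are squared, a few lines of arithmetic show the cancellation. This is a small sketch with hypothetical heights in centimetres: the raw deviations sum to zero, while the squared deviations do not.

```python
from math import sqrt

heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]  # hypothetical sample of heights
y_bar = sum(heights_cm) / len(heights_cm)

deviations = [y - y_bar for y in heights_cm]

sum_of_deviations = sum(deviations)               # positives and negatives cancel
sum_of_squares = sum(d**2 for d in deviations)    # every term is zero or positive
s = sqrt(sum_of_squares / (len(heights_cm) - 1))  # back on the original scale (cm)

print("deviations:", deviations)
print("sum of deviations:", sum_of_deviations)
print("sum of squared deviations:", sum_of_squares)
print("sample standard deviation:", round(s, 3))
```

Taking the square root at the end returns the result to the units of the data, which is why the standard deviation is reported rather than the sum of squares.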

Parameters and Statistics

However, I can collect a sample and calculate a sample mean that would estimate the true mean.
I could also calculate a sample standard deviation and create a confidence interval for the true mean.
The confidence interval would be a range with a probability attached. It has that probability of including the true mean.

The uniform distribution means that there is a range of values defined by the parameters minimum and maximum. All of the values in between have an equal probability of occurring.

[Figure: "Uniform Distribution", probability against value.]

The triangular distribution has a minimum, a maximum, and a most likely value.

[Figure: "Triangular Distribution", probability against value.]

The normal is the bell-shaped curve; its parameters are the mean and standard deviation.

[Figure: "Normal Distribution", probability against value.]
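The parameters named for the uniform, triangular, and normal distributions are exactly what a random number generator needs in order to draw values from them. The sketch below uses Python's standard `random` module; all of the parameter values are hypothetical, and the seed is fixed only so that the run is repeatable.

```python
import random
from statistics import fmean

rng = random.Random(42)  # fixed seed so the sketch gives the same draws each run
n_draws = 10_000

low, high, most_likely = 1.0, 9.0, 3.0  # hypothetical minimum, maximum, most likely value
mu, sigma = 5.0, 2.0                    # hypothetical mean and standard deviation

uniform_draws = [rng.uniform(low, high) for _ in range(n_draws)]
triangular_draws = [rng.triangular(low, high, most_likely) for _ in range(n_draws)]
normal_draws = [rng.gauss(mu, sigma) for _ in range(n_draws)]

# Uniform and triangular draws never leave [low, high]; normal draws can.
print("uniform range:   ", round(min(uniform_draws), 2), "to", round(max(uniform_draws), 2))
print("triangular range:", round(min(triangular_draws), 2), "to", round(max(triangular_draws), 2))
print("normal range:    ", round(min(normal_draws), 2), "to", round(max(normal_draws), 2))

# Sample means should land near (low + high) / 2, (low + high + most_likely) / 3, and mu.
print("sample means:", round(fmean(uniform_draws), 2),
      round(fmean(triangular_draws), 2), round(fmean(normal_draws), 2))
```

Note that the sample means are statistics: they come close to the values implied by the parameters but do not match them exactly.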

The lognormal also has parameters of the mean and standard deviation. The lognormal has smaller values having a higher probability of occurring and larger values having a smaller and smaller probability of occurring.

[Figure: "Lognormal Distribution", probability against value.]

A distribution that is not symmetric is said to be skewed.
The direction of the skewness is the direction of the long tail.
A lognormal distribution is said to be skewed right.
This is often counter-intuitive.
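One way to see right skew in numbers is that the long right tail pulls the mean above the median. The following is a rough sketch using the standard library. Note the assumption: `random.lognormvariate` takes the mean and standard deviation of the logarithm of the variable, not of the variable itself, and the values 0 and 1 are hypothetical.

```python
import random
from math import exp
from statistics import fmean, median

mu_log, sigma_log = 0.0, 1.0  # hypothetical mean and sd of log(X), not of X itself

rng = random.Random(1)        # fixed seed so the sketch is repeatable
draws = [rng.lognormvariate(mu_log, sigma_log) for _ in range(20_000)]

sample_mean = fmean(draws)
sample_median = median(draws)

# Known results for a lognormal distribution:
theoretical_median = exp(mu_log)                   # 1.0 for these parameters
theoretical_mean = exp(mu_log + sigma_log**2 / 2)  # about 1.649 for these parameters

print("sample median:", round(sample_median, 3), " theoretical:", theoretical_median)
print("sample mean:  ", round(sample_mean, 3), " theoretical:", round(theoretical_mean, 3))
```

Most of the draws are small, but the occasional very large value sits far out in the right tail and raises the mean.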

More on the Normal Distribution

The normal distribution is the bell-shaped curve. Many things that occur in nature follow a normal distribution.
Some characteristics:
  It has two parameters: the mean mu (μ) and the standard deviation sigma (σ).
  It is shown as N(μ, σ).

The Normal Distribution

Some characteristics:
  68.2% of the probability of a normal lies within plus and minus 1 standard deviation from the mean.
  95.4% lies within plus and minus 2 standard deviations from the mean.
  99.7% lies within plus and minus 3 standard deviations from the mean.
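These three percentages can be reproduced without a table. This is a minimal sketch that uses the error function from Python's `math` module; the helper name `prob_within` is made up for this illustration. The results carry one more digit than the rounded figures quoted above.

```python
from math import erf, sqrt


def prob_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# prints:
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on μ or σ, which is why the same three percentages apply to every normal distribution.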

The Standard Normal Distribution

The standard normal has μ = 0 and σ = 1.
It is shown as N(0, 1).
Any normal distribution can be transformed into a standard normal by Z = (X - μ)/σ.

The Standard Normal Distribution

There are an infinite number of normal distributions because there are an infinite number of combinations of μ and σ.
By transforming to a standard normal using Z = (X - μ)/σ, you only need a table of the standard normal.
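To see why one table is enough, compare the probability computed on the original scale with the probability computed after transforming to Z. This is a small sketch using `statistics.NormalDist` from the Python standard library; the values μ = 50, σ = 10, and x = 65 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 50.0, 10.0  # hypothetical mean and standard deviation
x = 65.0                # hypothetical value of interest

z = (x - mu) / sigma    # the transformation Z = (X - mu) / sigma; here z = 1.5

p_original = NormalDist(mu, sigma).cdf(x)  # P(X <= 65) for N(50, 10)
p_standard = NormalDist(0.0, 1.0).cdf(z)   # P(Z <= 1.5) for N(0, 1)

print("z =", z)
print("P(X <= x) on the original scale:", round(p_original, 4))
print("P(Z <= z) on the standard scale:", round(p_standard, 4))
```

The two probabilities agree, so any question about N(μ, σ) can be answered from the standard normal alone.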

Using the Table of the Normal Distribution

Tables such as Table A2.1 in Manly relate values of Z to the probability (area) under the standard normal pdf from zero to that value.
Example: The probability of a value sampled randomly from the standard normal distribution falling between the mean and one standard deviation above it is found by looking up the probability in the table associated with 1.0. It is 0.341. So, the probability of a value falling within 1 standard deviation from the mean is double that, or 0.682.
Likewise, the probability of a value falling within plus and minus 2 standard deviations is 2 × 0.477 = 0.954.

Using the Table of the Normal Distribution

Another Example: Let's say you want to look up the Z value associated with a 95% confidence interval. You need the value with 2.5% or 0.025 in the tail. For this table, you need to subtract that value from 0.5. So 0.5 - 0.025 = 0.475. So, look for 0.475 in the table and then find the Z value associated with it. There it is: Z = 1.96. That's the Z value to use for a 95% confidence interval.
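Both table lookups can be mirrored in code. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the printed table. `table_area` is a hypothetical helper that follows the table's convention of measuring area from zero to Z.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def table_area(z):
    """Area under the standard normal pdf between 0 and z (the table's convention)."""
    return standard_normal.cdf(z) - 0.5


# Forward lookups: from a Z value to an area.
area_to_1sd = table_area(1.0)     # about 0.341
within_1sd = 2 * area_to_1sd      # about 0.683
within_2sd = 2 * table_area(2.0)  # about 0.954

# Reverse lookup for a 95% interval: an area of 0.5 - 0.025 = 0.475 between 0 and z
# is the same as 0.5 + 0.475 = 0.975 of the probability lying below z.
z_95 = standard_normal.inv_cdf(0.5 + 0.475)  # about 1.96

print(round(area_to_1sd, 4), round(within_1sd, 4), round(within_2sd, 4), round(z_95, 3))
```

The small differences from the hand calculation (for example 0.683 against 0.682) come from rounding the table entry to three digits before doubling it.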

The Student's t Distribution

The Student's t distribution is similar to the normal but with fatter tails. It is used when the true population standard deviation is not known (most of the time).
The exact shape of the t distribution is controlled by the number of data points used to calculate the sample standard deviation. When n is small, the distribution is wide with fat tails. When n gets large, the estimate of σ is good and the t distribution approaches the shape of a normal distribution.
The index that controls the shape is called degrees of freedom (df). For use with the t distribution, df = n - 1.

Using the Table of the t Distribution

Table A2.2 of Manly gives some selected values from t distributions with degrees of freedom ranging from 1 to infinity.
Example: If you have 10 data points, you have 9 degrees of freedom. If you want the value along the t scale that has 97.5% of the probability below it and 2.5% above, use the second column. That t value is 2.262.
If you have an infinite number of data points, it's 1.96, just like the Z table.
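The table values can also be approximated numerically. The sketch below is one way to do it with only the Python standard library, and it is not how the printed table was produced: it integrates the t density with Simpson's rule and searches for the critical value by bisection. The function names, the ten data values, and the use of the usual interval formula (sample mean plus or minus t times s over the square root of n) are assumptions made for illustration.

```python
from math import exp, lgamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(t, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    coeff = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0: one half plus the area from 0 to t (Simpson's rule)."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(p, df, lo=0.0, hi=100.0):
    """t value with probability p below it (0.5 < p < 1), found by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


t_9 = t_critical(0.975, 9)           # about 2.262, as in the table example
t_large = t_critical(0.975, 10_000)  # about 1.96 when n is very large

# Hypothetical data: 10 measured concentrations, so df = 9.
concentrations = [4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 4.4, 5.1, 4.7]
n = len(concentrations)
y_bar = mean(concentrations)
half_width = t_critical(0.975, n - 1) * stdev(concentrations) / sqrt(n)
ci_95 = (y_bar - half_width, y_bar + half_width)

print(round(t_9, 3), round(t_large, 3))
print("95% interval:", tuple(round(v, 3) for v in ci_95))
```

With only 9 degrees of freedom the multiplier is 2.262 rather than 1.96, so the interval is wider than the Z value alone would suggest.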

That's it for now!

That's a quick review of some basic concepts. We'll continue in Module 1.2.