Producing data Toward statistical inference Section 3.3
Toward statistical inference Idea: Use sampling to understand statistical inference Statistical inference is when a conclusion about a population is inferred from the characteristics of a sample drawn from it [Figure labels: Population; Sample]
Terminology A parameter is a number that describes a characteristic of a population. Ex: p is the proportion with some trait in the population. A statistic is a number that describes a characteristic of a sample. Ex: p̂ is the proportion with the trait in the sample. The observed value of a statistic is used to estimate the unobserved value of a parameter. Ex: p̂ estimates p.
Sampling variability Sampling variability is the phenomenon by which repeated implementation of the sampling mechanism produces distinct samples Suppose a statistic is recalculated for each sample under repeated sampling. The distribution of its values is its sampling distribution
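The idea of a sampling distribution can be illustrated by simulation. The sketch below is only an illustration under assumed values: the population proportion p = 0.3, the sample size n = 100, and the number of repetitions are hypothetical choices, and the helper names are not from the slides.

```python
import random
import statistics

def sample_proportion(p, n, rng):
    """Draw one sample of size n and return the proportion with the trait."""
    return sum(rng.random() < p for _ in range(n)) / n

def simulate_sampling_distribution(p, n, repetitions, seed=0):
    """Recalculate the statistic for each of many repeated samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [sample_proportion(p, n, rng) for _ in range(repetitions)]

# Assumed values, for illustration only.
p_hats = simulate_sampling_distribution(p=0.3, n=100, repetitions=2000)

print("smallest and largest sample proportion:", min(p_hats), max(p_hats))
print("mean of the simulated sampling distribution:", statistics.mean(p_hats))
```

The list p_hats approximates the sampling distribution: different samples give different values, and their mean lands close to the assumed p.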
Bias of a statistic The bias of a statistic is described by the center of its sampling distribution A statistic is unbiased if the mean of its sampling distribution is the same as the parameter it is intended to estimate Use random sampling to produce unbiased estimates
Variability of a statistic The variability of a statistic is described by the spread of its sampling distribution A margin of error is determined by the variability of a statistic The variability of a statistic will be smaller if it is calculated from a larger sample Variability can be made arbitrarily small with a large enough sample (but sampling costs money, time, effort, etc.)
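The effect of sample size on variability can be checked numerically. The following sketch assumes a population proportion of 0.3 and three hypothetical sample sizes; it measures the spread of the simulated sampling distribution by its standard deviation.

```python
import random
import statistics

def spread_of_p_hat(p, n, repetitions=1000, seed=1):
    """Standard deviation of the sample proportion over repeated samples of size n."""
    rng = random.Random(seed)
    p_hats = [sum(rng.random() < p for _ in range(n)) / n
              for _ in range(repetitions)]
    return statistics.stdev(p_hats)

# Assumed population proportion and sample sizes, for illustration only.
spreads = {n: spread_of_p_hat(0.3, n) for n in (25, 100, 400)}

for n, s in spreads.items():
    print(f"n = {n:4d}: spread of sample proportion = {s:.4f}")
```

In this setup each fourfold increase in sample size roughly halves the spread, which is why a smaller margin of error costs disproportionately more sampling.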
Producing data Data ethics Section 3.4
Risks of data production Ethical issues may arise in the production of data, especially when people are involved as subjects Examples of risks to participating subjects: Direct risk to physical health Violations of personal space and privacy Target of deception
Standards of data ethics Oversight by an institutional review board Charged to protect the interest of subjects Participation only after informed consent Inform of the nature of the experiment and risks Consent in writing, if possible Confidentiality of raw data Only release statistical summaries publicly
Probability and Sampling Distributions Randomness Chapter 4.1
Randomness and probability Observations of random phenomena: Patterns emerge in the long run after many repetitions of a chance-happening Short-term patterns are unpredictable Probability attempts to describe the long-term patterns of random phenomena
Long-run probabilities A probability is the proportion of times that some interesting outcome is observed in the long run. [Figure panels: First series of tosses; Second series of tosses]
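A short simulation shows the long-run behavior described here. It is a sketch that assumes the tosses are of a fair coin (probability 0.5 of heads) and uses two different seeds to stand in for the two series; the function name and the number of tosses are hypothetical.

```python
import random

def running_proportion(n_tosses, seed):
    """Proportion of heads after each toss of an assumed fair coin."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for i in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1
        proportions.append(heads / i)
    return proportions

first = running_proportion(5000, seed=1)   # first series of tosses
second = running_proportion(5000, seed=2)  # second series of tosses

for n in (10, 100, 1000, 5000):
    print(n, round(first[n - 1], 3), round(second[n - 1], 3))
```

The two series can differ noticeably after a handful of tosses, but both proportions settle near 0.5 as the number of tosses grows.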
Probability and Sampling Distributions Probability models Chapter 4.2
Probability models A probability model is a mathematical framework for describing random phenomena An assignment of probabilities to a set of outcomes An outcome is a possible value generated by the chance-happening of interest Probability rules are the mathematical laws required for a probability model to make sense
Basic setup of a probability model The sample space, S, is the set of all possible outcomes Represents a single repetition of a chance-happening An event, A, B, C, etc., is a subset of the sample space Represents the occurrence of a certain interesting thing
Relationships between events The complement, A^c, of an event, A, is the set of outcomes that are not in A. Represents the nonoccurrence of a certain interesting thing. Events A and B are disjoint if they share no outcomes. Represent things that cannot occur simultaneously. [Figure panels: Disjoint; Not disjoint]
Probability rules: 0 ≤ P(A) ≤ 1. P(S) = 1. Complement rule: P(A^c) = 1 − P(A). Addition rule for disjoint events: If A and B are disjoint then P(A or B) = P(A) + P(B).
Finite probabilities Probability rules simplify when there are a finite number of possible outcomes. Each probability is a number between zero and one The sum of all probabilities is one The probability of an event is the sum of the probabilities of outcomes comprising that event.
Example: equally likely outcomes A couple wants to have three children. Observe the possible sequences of boys (B) and girls (G). S = { BBB, BBG, BGB, BGG, GBB, GBG, GGB, GGG }. Assign equal probability of 1/8 to each outcome. [Figure: tree diagram of the eight outcomes] A = exactly two girls = { BGG, GBG, GGB }. P(A) = P(BGG) + P(GBG) + P(GGB) = 1/8 + 1/8 + 1/8 = 3/8.
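The same calculation can be done by enumerating the sample space. This is a minimal sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all sequences of B and G of length three.
sample_space = ["".join(seq) for seq in product("BG", repeat=3)]

# Equally likely outcomes: each gets probability 1/8.
probability = {outcome: Fraction(1, len(sample_space)) for outcome in sample_space}

# Event A: exactly two girls.
event_a = [outcome for outcome in sample_space if outcome.count("G") == 2]

# The probability of an event is the sum over the outcomes comprising it.
p_a = sum(probability[outcome] for outcome in event_a)

print(sample_space)
print(event_a, p_a)
```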
Example: Benford's Law Empirical probabilities of first digits in financial docs
1st digit    1      2      3      4      5      6      7      8      9
Probability  0.301  0.176  0.125  0.097  0.079  0.067  0.058  0.051  0.046
[Figure: probability histogram; horizontal axis: Outcomes (1 to 9), vertical axis: Probability]
P(1st digit ≥ 6) = 0.067 + 0.058 + 0.051 + 0.046 = 0.222
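The finite probability rules can be checked directly on this table. The sketch below copies the probabilities from the slide; the helper name prob is an illustrative choice.

```python
# First-digit probabilities as given in the table above.
benford = {1: 0.301, 2: 0.176, 3: 0.125, 4: 0.097, 5: 0.079,
           6: 0.067, 7: 0.058, 8: 0.051, 9: 0.046}

def prob(event):
    """Probability of an event, given as an iterable of first digits."""
    return sum(benford[d] for d in event)

total = prob(benford)                              # should be one
p_ge_6 = prob(d for d in benford if d >= 6)        # sum over outcomes 6..9
p_ge_6_by_complement = 1 - prob(d for d in benford if d <= 5)

print(round(total, 3), round(p_ge_6, 3), round(p_ge_6_by_complement, 3))
```

Both routes, the direct sum and the complement rule, give the same probability up to floating-point rounding.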
Example: Two die rolls Thirty-six possible die rolls, equal probabilities: P(sum is 5) = 4/36 = 0.111 P(doubles) = 6/36 = 0.167, etc. Note: X = sum is an example of a random variable (more later)
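These two probabilities can be reproduced by listing all thirty-six rolls. The sketch assumes two fair six-sided dice, as on the slide; the helper name prob is illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die), equally likely.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability that a roll satisfies the predicate `event`."""
    favorable = sum(1 for roll in rolls if event(roll))
    return Fraction(favorable, len(rolls))

p_sum_5 = prob(lambda roll: sum(roll) == 5)
p_doubles = prob(lambda roll: roll[0] == roll[1])

print(len(rolls), p_sum_5, float(p_sum_5))
print(p_doubles, float(p_doubles))
```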
Probabilities of intervals If S is a continuum of values then probabilities are assigned using a density curve. No part of a density curve can be negative. The total area under the curve must be one. The probability P(A) of an event A = { a ≤ X ≤ b }, where X is a random variable, is the area under the curve between a and b.
Example: Uniform density curve Probabilities of a random number generator, S = { numbers between 0 and 1 } P(0.3 ≤ X ≤ 0.7) = 0.7 − 0.3 = 0.4
Example: General uniform density curve Probabilities of a custom random number generator, S = { numbers between c1 and c2 } P(a ≤ X ≤ b) = (b − a) / (c2 − c1)
Example: Sum of two random numbers Sum of two numbers from a random number generator, S = { numbers between 0 and 2 } P(X > 1.3) = ½ × base × height = ½ (2 − 1.3)(2 − 1.3) = 0.245
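The uniform formula translates into a one-line area calculation. The function below is a sketch; its name and its handling of intervals that extend beyond the generator's range (clipping them to the range) are assumptions added for illustration.

```python
def uniform_prob(a, b, c1=0.0, c2=1.0):
    """P(a <= X <= b) for X uniform between c1 and c2.

    The interval [a, b] is clipped to [c1, c2], since the density
    is zero outside the range of the generator.
    """
    if c2 <= c1:
        raise ValueError("c2 must be greater than c1")
    lo = max(a, c1)
    hi = min(b, c2)
    width = max(hi - lo, 0.0)      # overlap of [a, b] with [c1, c2]
    return width / (c2 - c1)       # area = width times height 1/(c2 - c1)

print(uniform_prob(0.3, 0.7))            # standard generator on 0 to 1
print(uniform_prob(2, 4, c1=0, c2=10))   # hypothetical custom generator
```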
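The triangle area can be compared with a simulation. This sketch assumes the two numbers are independent draws between 0 and 1, so the density of their sum is a triangle peaking at 1; the function name and the number of trials are illustrative.

```python
import random

def triangle_tail(x):
    """P(X > x) for the sum X of two independent random numbers, 1 <= x <= 2.

    The region right of x under the density is a triangle with
    base 2 - x and height 2 - x.
    """
    if not 1 <= x <= 2:
        raise ValueError("this formula covers 1 <= x <= 2 only")
    base = 2 - x
    height = 2 - x
    return 0.5 * base * height

exact = triangle_tail(1.3)

# Monte Carlo check with a fixed seed for reproducibility.
rng = random.Random(0)
trials = 100_000
hits = sum(rng.random() + rng.random() > 1.3 for _ in range(trials))
estimate = hits / trials

print(round(exact, 3), round(estimate, 3))
```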
Example: Normal curves X = ACT college entrance exam scores Suppose X is N(µ =18.6, σ = 5.9) Probability interpretation: X is the score of a randomly selected student
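Areas under this Normal curve can be computed with Python's standard library. The sketch below uses the slide's mean and standard deviation; the cutoff score of 21 is a hypothetical value chosen only for illustration.

```python
from statistics import NormalDist

# ACT scores modeled as N(mu = 18.6, sigma = 5.9), as on the slide.
act = NormalDist(mu=18.6, sigma=5.9)

# Probability that a randomly selected student scores within one
# standard deviation of the mean (between 12.7 and 24.5).
p_within_one_sd = act.cdf(18.6 + 5.9) - act.cdf(18.6 - 5.9)

# Probability of scoring above a hypothetical cutoff of 21.
p_above_21 = 1 - act.cdf(21)

print(round(p_within_one_sd, 3), round(p_above_21, 3))
```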
Probability and Sampling Distributions Random variables Chapter 4.3
Random variables A random variable, X, is an idealization of quantitative data recorded from many repetitions of a chance-happening. Example: A couple wants to have three children. X = # girls. S = { 0, 1, 2, 3 } [Figure: tree diagram of the eight outcomes BBB, BBG, BGB, BGG, GBB, GBG, GGB, GGG]
Probability distribution of a random variable The probability distribution of a random variable is its assignment of probabilities in an underlying probability model. Example: X = # girls among three children P(X = 0) = P(BBB) = 1/8 P(X = 1) = P(BBG) + P(BGB) + P(GBB) = 3/8 P(X = 2) = P(BGG) + P(GBG) + P(GGB) = 3/8 P(X = 3) = P(GGG) = 1/8
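The distribution of X can be derived from the underlying model by counting. This is a minimal sketch assuming the eight sequences are equally likely, as in the earlier example; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Underlying probability model: eight equally likely birth sequences.
outcomes = ["".join(seq) for seq in product("BG", repeat=3)]

# X maps each outcome to its number of girls; count outcomes per value of X.
counts = Counter(outcome.count("G") for outcome in outcomes)

# Probability distribution of X as a table of exact fractions.
distribution = {x: Fraction(count, len(outcomes))
                for x, count in sorted(counts.items())}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p}")
```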
Types of random variables A discrete random variable represents data whose possible values can be counted. Probability distribution is given as a table of probabilities A continuous random variable represents data whose values lie on a continuum. Probability distribution is given as a density curve
The continuous probability of an exact value Suppose X is a continuous random variable: Natural probability questions involve events A = { a ≤ X ≤ b }. Events A = { X = a } are uninformative, since P(X = a) = 0. P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a < X < b). Caution: For discrete r.v.'s, < versus ≤ matters critically.
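The contrast between the two types of random variable can be seen numerically. The sketch below reuses the number-of-girls distribution and the ACT Normal model from earlier slides; the endpoints 18 and 24 are hypothetical values chosen for illustration.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete: X = number of girls among three children.
girls = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
p_less_than_2 = sum(p for x, p in girls.items() if x < 2)    # excludes X = 2
p_at_most_2 = sum(p for x, p in girls.items() if x <= 2)     # includes X = 2

# Continuous: probabilities are areas, computed as differences of the cdf.
act = NormalDist(mu=18.6, sigma=5.9)
p_interval = act.cdf(24) - act.cdf(18)   # same area whether endpoints are included or not
p_exact = act.cdf(24) - act.cdf(24)      # zero-width interval has zero area

print(p_less_than_2, p_at_most_2)
print(round(p_interval, 3), p_exact)
```

For the discrete variable the two answers differ by P(X = 2) = 3/8, while for the continuous variable a single point contributes no area.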