Lecture 4

Last time
Numerical summaries for continuous variables
  Center: mean and median
  Spread: standard deviation and inter-quartile range
Exploratory graphics
  Histogram (revisit modes)

Histograms
[Figure: histogram of income; x-axis: income, y-axis: frequency]

Histograms: Skew (heavy right tail)

Histograms: Skewed and bimodal

Histograms: Trimodal

Histograms: Normal (Bell curve)

Histograms: Symmetric

Histograms: Symmetric (with four modes)

5-number summary (Min, Lower Q, Med, Upper Q, Max)
Example: Average daily temperatures in Philadelphia (1974-1986), n=5479
Min      Lower Qu   Median   Upper Qu   Max
-0.25    40.0       55.25    70.25      88.50

Boxplots
[Figure: boxplot of temperature]
Built around the 5-number summary
Box constructed from the lower and upper quartiles
Median marked with a horizontal bar
Whiskers designed to capture most of the data (about 99.3% for a normal or bell shape)
Points for data beyond the whiskers, possible outliers

Boxplots: Whiskers
[Figure: boxplot of mortality]
Find the largest value that is no more than 1.5 IQR above the upper quartile; mark that point with a bar
Find the smallest value that is no more than 1.5 IQR below the lower quartile; mark that point with a bar
Points for data beyond the whiskers, possible outliers
Mortality summary:
Min    Lower Qu   Median   Upper Qu   Max
3.0    12.0       15.0     18.0       36.0
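
The whisker rule above can be sketched in a few lines of Python. This is only an illustration: the `deaths` values are made up (they are not the lecture's mortality data), the function name `boxplot_stats` is hypothetical, and the quartiles use the "inclusive" interpolation from the standard library, which may differ slightly from the convention used in the course software.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Five-number summary, whisker ends, and possible outliers (1.5 IQR rule)."""
    data = sorted(values)
    # Quartile conventions vary; "inclusive" interpolation is an assumption here.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - 1.5 * iqr
    hi_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = [x for x in data if x < lo_fence or x > hi_fence]
    return {
        "five_number": (data[0], q1, med, q3, data[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


# Made-up daily counts, for illustration only.
deaths = [3, 9, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 21, 36]
stats = boxplot_stats(deaths)
print(stats["five_number"])
print(stats["whiskers"])
print(stats["outliers"])
```

Note that the whiskers stop at actual data values, not at the fences themselves; anything beyond the fences is drawn as a separate point.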

Histograms of Daily Mortality
[Figure: histogram of non-accidental mortality; x-axis: number of deaths, y-axis: frequency]

More boxplots: Non-accidental mortality
[Figure: boxplots of daily mortality by season (fall, winter, spring, summer)]

Histograms of Daily Temperatures
[Figure: histogram of average daily temperatures; x-axis: temperature, y-axis: frequency]

Boxplots of Daily Temperatures
[Figure: boxplots of average daily temperature by season (fall, winter, spring, summer)]

A similar idea: Conditioning
[Figure: histograms of temperature conditioned on season (panels: summer, winter, fall, spring); x-axis: temperature, y-axis: count]

Summaries for discrete or qualitative data
Frequency tables and bar graphs
Example: BRFSS. What is the highest grade or year of school you completed?
            freq      %       cumulative proportion
None        188       0.2     0.002
Gd 1-8      3413      3.1     0.033
Gd 9-11     7251      6.7     0.100
Gd 12       31483     29.0    0.390
Col 1-3     30415     28.0    0.670
Col 4       35654     32.8    0.998
Refused     257       0.2     1.000
n=108,661             100.0
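
As a rough sketch of how such a frequency table is built, the Python below recomputes the percentage and cumulative columns from the counts on the slide. The function name `frequency_table` is hypothetical, and the column layout is just one reasonable choice.

```python
# Counts copied from the BRFSS education example on the slide.
counts = {
    "None": 188,
    "Gd 1-8": 3413,
    "Gd 9-11": 7251,
    "Gd 12": 31483,
    "Col 1-3": 30415,
    "Col 4": 35654,
    "Refused": 257,
}


def frequency_table(counts):
    """Return n and rows of (label, freq, percent, cumulative proportion)."""
    n = sum(counts.values())
    rows = []
    running = 0
    for label, freq in counts.items():
        running += freq
        rows.append((label, freq, 100 * freq / n, running / n))
    return n, rows


n, rows = frequency_table(counts)
for label, freq, pct, cum in rows:
    print(f"{label:<8} {freq:>6} {pct:>6.1f} {cum:>6.3f}")
print(f"n={n}")
```

The cumulative column only makes sense because the categories have a natural order; for an unordered qualitative variable it would be dropped.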

Barplots (note difference from histograms)
[Figure: barplot of counts by education level (None, Gd 1-8, Gd 9-11, Gd 12, Col 1-3, Col 4, Refused)]

Barplots
[Figure: barplot of proportions by education level]

Two continuous variables
Scatter plots can display the relationship between two continuous variables (conditioning can give more control)
They can be used to spot trends and relationships in the data, as well as the strength of those relationships
Random versus controlled variables in the plot

Scatter plot (BRFSS)
[Figure: scatter plot of weight (y-axis) against height (x-axis)]

Scatter plot (BRFSS)
[Figure: scatter plot of height (y-axis) against weight (x-axis)]

Scatter plot (Time series of mortality)
[Figure: daily mortality plotted against date]

Scatter plot (Time series of temperature)
[Figure: daily temperature plotted against date]

Lecture 5

Last time
Graphics for exploratory analysis
  Histograms, boxplots, barplots, scatterplots
When faced with several variables...
  Interactive analysis
  Conditioning when creating plots
How's the lab?

Probability
Toss a coin. What is the probability of heads?
Roll a die. What is the probability of getting a five?

Some historical examples
Count Buffon (1707-1788) tossed a coin 4,040 times, with heads coming up 2,048 times, or 50.69 percent of the time
Karl Pearson (1857-1936) tossed a coin 24,000 times, with heads coming up 12,012 times, or 50.05 percent of the time
John Kerrich tossed a coin 10,000 times (while he was a prisoner of war in WWII), and heads came up 5,067 times, or 50.67 percent of the time

Some historical examples
F.N. David and Roman dice, 204 tosses each (counts for the six faces)
Rock crystal   30, 38, 31, 34, 34, 37
Iron           35, 39, 30, 21, 37, 42
Marble         27, 28, 23, 47, 25, 54

Some historical examples
Diaconis's mechanical coin flipper: the coin always lands the same way

Computer experiment
[Figure: relative frequency plotted against flip number, for flips 0 to 500]

Probability and relative frequency
Toss a coin many, many times
Examine the proportion of flips that turn up heads
What should you get?

Probability and relative frequency
Perform a large number of independent repetitions of a random phenomenon
After each trial, record the proportion of trials so far in which an event has occurred
This relative frequency approaches a fixed number that we call the probability of the event
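
The relative-frequency idea above can be tried out with a small simulation. This is a minimal sketch, assuming a fair coin modelled by Python's pseudo-random generator; the function name `running_relative_frequency` and the seed are arbitrary choices, not part of the lecture.

```python
import random


def running_relative_frequency(n_flips, seed=0):
    """Proportion of heads after each of n_flips simulated fair-coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    freqs = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1 (a head)
        freqs.append(heads / i)
    return freqs


freqs = running_relative_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

Early values bounce around quite a bit; the proportion only settles near 0.5 once the number of flips is large.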

Probability models
With coins or dice, we often speak of fairness, or rather that there is no preference for one event over another
This is an idealization of a random phenomenon
Probability models are mathematical descriptions that account for unpredictable factors in random events

Three kinds of probabilities
Probabilities from models
Probabilities from data (relative frequencies)
Subjective probabilities (beliefs)

Some terminology
A random experiment is some situation with an unpredictable outcome
The sample space of an experiment is the set of all possible outcomes
An event is a collection of outcomes

Some terminology
Random experiments: tossing a coin, measuring your blood pressure, taking a test
Sample space: H/T, (whatever BP is measured in), A-F grade
Event: the coin turns up heads, my blood pressure is in the normal range, I pass the exam

Some notation
Sample space is denoted by the symbol S
  Tossing a coin 2 times, S = {HH, HT, TH, TT}
Events (your text uses labels and names events, or simply assigns symbols A, B, C...)
  Tossing at least one head, A = {HH, HT, TH}
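
The two-toss example above can be written out directly with Python sets. This is only an illustrative sketch; the variable names are made up for the example.

```python
from itertools import product

# Sample space for tossing a coin 2 times: every ordered pair of H and T.
sample_space = {"".join(tosses) for tosses in product("HT", repeat=2)}

# Event A: at least one head. An event is just a subset of the sample space.
at_least_one_head = {outcome for outcome in sample_space if "H" in outcome}

print(sorted(sample_space))
print(sorted(at_least_one_head))
```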

Lecture 6

Last time
Conceptual definition of probability
  Long-run averages of outcomes from repeated independent experiments
Probability models
  Mathematical descriptions that account for unpredictable factors in random events
Terminology and notation describing events

Working with events
The complement of an event A occurs if A does not occur; we denote it by $\bar{A}$
We combine events with the set operations of intersection and union ("and" and "or")
Two events are mutually exclusive if they cannot occur at the same time
[Figure: Venn diagram of two events A and B]

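Sample space and events
A = {even spots}
[Figure: Venn diagram of the sample space S with event A]

Sample space and events
B = {fewer than 5 spots}
[Figure: Venn diagram of the sample space S with event B]

Complement
B = {fewer than 5 spots}
[Figure: Venn diagram showing B and its complement $\bar{B}$]

Intersection (and)
B = {fewer than 5 spots}, A = {even spots}
[Figure: Venn diagram with events A and B]

Intersection (and)
A and B = {fewer than 5 spots and even}
[Figure: Venn diagram with the intersection highlighted]

Union (or)
B = {fewer than 5 spots}, A = {even spots}
[Figure: Venn diagram with events A and B]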

Union (or)
A or B = {fewer than 5 spots or even}
[Figure: Venn diagram with the union highlighted]
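
The die events from the preceding slides map directly onto set operations. The sketch below assumes a single roll of a six-sided die; the variable names are chosen for illustration only.

```python
# Sample space for one roll of a six-sided die (number of spots).
S = set(range(1, 7))

A = {s for s in S if s % 2 == 0}  # even spots
B = {s for s in S if s < 5}       # fewer than 5 spots

not_B = S - B    # complement of B
A_and_B = A & B  # intersection: both A and B occur
A_or_B = A | B   # union: at least one of A, B occurs

# Mutually exclusive events share no outcomes.
throw_1, throw_6 = {1}, {6}

print(sorted(not_B), sorted(A_and_B), sorted(A_or_B))
print(throw_1.isdisjoint(throw_6))
```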

Probability distributions
For a sample space $S = \{s_1, s_2, s_3, \ldots\}$ we assign values $p_1, p_2, p_3, \ldots$
Each value is between 0 and 1
The sum of the values adds to one: $1 = p_1 + p_2 + p_3 + \cdots$
The probability of an event A is the sum of the values for all the outcomes in A

Equally likely outcomes
By symmetry we might believe no one outcome occurs more frequently than any other
The probability of an event A is then
$\mathrm{pr}(A) = \dfrac{\text{number of outcomes in } A}{\text{total number of outcomes in } S}$

B = {fewer than 5 spots}
$\mathrm{pr}(B) = 4/6 = 2/3$
[Figure: Venn diagram of the sample space S with event B]
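
Under the equally-likely assumption, the counting rule above is a one-liner. The sketch below uses exact fractions; the helper name `pr` mirrors the slide notation but is otherwise a made-up function for this example.

```python
from fractions import Fraction

S = set(range(1, 7))  # one roll of a six-sided die, all outcomes equally likely


def pr(event, sample_space=S):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


B = {s for s in S if s < 5}  # fewer than 5 spots
print(pr(B))  # 2/3
```

The rule only applies when symmetry justifies treating the outcomes as equally likely; for a loaded die the values $p_i$ would have to be assigned some other way.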

Rules for working with probabilities
The sample space is certain to occur: $\mathrm{pr}(S) = 1$
pr(A does not occur) = 1 - pr(A does occur): $\mathrm{pr}(\bar{A}) = 1 - \mathrm{pr}(A)$
The probability that one of two mutually exclusive events occurs is the sum of their probabilities: $\mathrm{pr}(A \text{ or } B) = \mathrm{pr}(A) + \mathrm{pr}(B)$

Mutually exclusive events
B = {throw a 1}, A = {throw a 6}
$\mathrm{pr}(A \text{ or } B) = 1/6 + 1/6 = 1/3$

Rules for working with probabilities
The addition rule holds for any number of mutually exclusive events
$\mathrm{pr}(A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_k) = \mathrm{pr}(A_1) + \mathrm{pr}(A_2) + \cdots + \mathrm{pr}(A_k)$

... and if they're not mutually exclusive?
When two events are not mutually exclusive, adding their probabilities double counts the intersection; instead we use the rule
$\mathrm{pr}(A \text{ or } B) = \mathrm{pr}(A) + \mathrm{pr}(B) - \mathrm{pr}(A \text{ and } B)$
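
The general addition rule can be checked numerically on the die events used earlier (A = even spots, B = fewer than 5 spots). This is a small sketch assuming equally likely outcomes; `pr` is a hypothetical helper, not a library function.

```python
from fractions import Fraction

S = set(range(1, 7))  # one roll of a fair six-sided die


def pr(event):
    """Equally-likely probability of an event within S."""
    return Fraction(len(event & S), len(S))


A = {2, 4, 6}     # even spots
B = {1, 2, 3, 4}  # fewer than 5 spots

direct = pr(A | B)                     # count the union directly
by_rule = pr(A) + pr(B) - pr(A & B)    # general addition rule
naive = pr(A) + pr(B)                  # double counts the outcomes 2 and 4

print(direct, by_rule, naive)
```

The naive sum exceeds 1 here, which is a quick sign that the intersection has been counted twice.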

Screening tests
These are tests designed to determine if someone might have a particular medical condition; typically they are applied to a large segment of the population, and the positives are subjected to further diagnostic tests

Screening tests and a 2-by-2 table
                        Disease status
                        Y               N
Test result   Pos       sick and pos    well and pos
              Neg       sick and neg    well and neg

Screening tests and a 2-by-2 table
                        Disease status
                        Y                   N
Test result   Pos       pr(sick and pos)    pr(well and pos)    pr(pos)
              Neg       pr(sick and neg)    pr(well and neg)    pr(neg)
                        pr(sick)            pr(well)

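Conditional probability
The conditional probability of A occurring given that B has occurred is denoted $\mathrm{pr}(A \mid B)$ and is defined by
$\mathrm{pr}(A \mid B) = \dfrac{\mathrm{pr}(A \text{ and } B)}{\mathrm{pr}(B)}$
This gives us the multiplicative rule for probabilities
$\mathrm{pr}(A \text{ and } B) = \mathrm{pr}(A \mid B)\,\mathrm{pr}(B)$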

Conditional probability
                        Disease status
                        Y                   N
Test result   Pos       pr(sick and pos)    pr(well and pos)    pr(pos)
              Neg       pr(sick and neg)    pr(well and neg)    pr(neg)
                        pr(sick)            pr(well)

Conditional probability
The conditional probability that a test is positive given that someone is sick is
$\mathrm{pr}(\text{positive} \mid \text{sick}) = \dfrac{\mathrm{pr}(\text{positive and sick})}{\mathrm{pr}(\text{sick})}$
What is the probability that someone is sick given that the test is negative?
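
To see how the 2-by-2 table feeds the definition, here is a minimal sketch with made-up joint probabilities. The four cell values are purely hypothetical (they are not from the lecture or from any real screening test), and the names `joint`, `marginal` and `conditional` are invented for this example.

```python
from fractions import Fraction

# Hypothetical joint probabilities for (disease status, test result).
joint = {
    ("sick", "pos"): Fraction(9, 1000),
    ("well", "pos"): Fraction(99, 1000),
    ("sick", "neg"): Fraction(1, 1000),
    ("well", "neg"): Fraction(891, 1000),
}


def marginal(value, position):
    """Row or column total: position 0 is disease status, 1 is test result."""
    return sum(p for key, p in joint.items() if key[position] == value)


def conditional(status, result, given):
    """pr(result | status) if given == 'status', else pr(status | result)."""
    if given == "status":
        return joint[(status, result)] / marginal(status, 0)
    return joint[(status, result)] / marginal(result, 1)


pr_pos_given_sick = conditional("sick", "pos", given="status")
pr_sick_given_neg = conditional("sick", "neg", given="result")
print(pr_pos_given_sick, pr_sick_given_neg)
```

Both conditional probabilities use the same cell of the table in the numerator only when the two events match; what changes is the margin in the denominator, which is why pr(pos | sick) and pr(sick | pos) can be very different numbers.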