Class: Taylor
January 12, 2011 (pdf version)

Story time: Dan Willingham, the Cog Psych

Willingham: Professor of cognitive psychology at the University of Virginia, author of "Why Don't Students Like School?"

We know lots about psychology, but amazingly little about education:
* The Perry Preschool Project is still a state-of-the-art education experiment.
* There are few other good controlled experiments.
But we can use what we know about psychology to inform education. That is the approach of this book.

For example: he pushes stories. People relate to humans with special hardware in their brain (cards vs. liars):

* With cards: the rule is, red on one side means an even number on the other side. Which do you check: a red card? An even card? An odd card?
* With people: drinking means over 21. Which do you check: a drinker? An old person? A young person?

So try to tell stories. People pay attention at the start of class; new stuff is always interesting, so there is no need for it to have a connected theme. Hence when I start classes with a short story, blam: Dan, the cognitive psychologist.
Administrivia

Homework / cases, a project, and a final.

Email statistics.assignments@gmail.com for questions and for turning in assignments. Both Sathya and I get messages sent here, so you get ahold of whoever is currently online sooner.

Books:
* Introductory Statistics with R by Peter Dalgaard, 2nd edition, ISBN 978-0-387-79053-4, Springer 2008.
* Linear Models with R by Julian J. Faraway, ISBN 1-58488-425-8, Chapman & Hall/CRC Press 2005.

Software: R. Other software is allowed, but R is recommended. It's free, available on OS X / Linux / Windows, and it is what production-level statisticians use. Friday at noon, Sathya will give an intro to using R. (And our computer person, Anand, will be there to help load it if you have problems.)
The triangle of statistics

Statistics has three major pieces:
* mathematics
* data analysis (i.e. science)
* communication

To be good, you need all three (or at least two of the three). Doing only one isn't as powerful:

* Only mathematics: Terence Tao, maybe the smartest guy on the planet. I would have recommended him for the genius award, but he already has one.
* Only data analysis: called a master's-level statistician. Employable at big pharma, but low pay.
* Only communication: called bloggers. Basically unpaid!

My goal is to make sure you can make more money than any of these pure states! So:
* MBAs: closer to the communication corner
* math undergrads: closer to the math corner
* stat concentrators: closer to the data analysis corner

But by the end, I want you all to have moved a bit towards the middle. I'll present more mathematics and data analysis, since that is what I know best.

Today's Topic: Simple linear regression

Review of the standard linear model

The standard linear regression model is:

    Y_i = α + β x_i + ε_i,    ε_i iid N(0, σ²)

You will see this equation written in almost any research paper which uses data. The names are often changed, but it is there somewhere. For example, it is basically equation 2.17 in Berndt of the reading. The entire chapter is designed to motivate that one equation. Let's break it down into pieces.

The fit: the most fun part is the fit, the α + β x_i part of the equation. It describes the relationship between x and Y. This version describes a linear relationship.
Residuals / errors: the ε_i part of

    Y_i = α + β x_i + ε_i,    ε_i iid N(0, σ²)

is the residuals (aka errors) themselves. Describing them, looking at them, investigating them is the primary activity of a statistician. It is all about error!

The "i.d.": The i.i.d. part can be broken into two pieces, "i." and "i.d." The easier is the "identically distributed": it means each error looks like any other error.

The "i.": The first "i" in IID is for independence. We will spend an entire class on this piece. It is the most important assumption in the entire model.

The "N": Means normal. Look at a q-q plot to check it. It is easy to check (hence we cover it in intro classes). We won't discuss it here, since I assume you already know how to check it.

Style: iid = i.i.d. = IID = I.I.D. = independent and identically distributed. It is often even left off entirely, since it is always assumed.

Y is upper case, x is lower case: Recall from probability that random variables are often written as upper-case letters. This is why Y is written as upper case: it is random. The x are thought of as inputs, and hence not random.
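Since the "N" check above comes down to a q-q plot, here is a minimal sketch: simulate data from the standard model and q-q plot the errors. The values of n, α, β, and σ below are made up purely for illustration.

```r
# Simulate from Y_i = alpha + beta * x_i + eps_i, eps_i iid N(0, sigma^2).
# All constants here are made-up illustration values, not the course data.
set.seed(1)
n     <- 100
alpha <- 2
beta  <- 3.5
sigma <- 7
x   <- runif(n, 0, 15)         # x treated as fixed inputs (lower case!)
eps <- rnorm(n, 0, sigma)      # iid normal errors
Y   <- alpha + beta * x + eps  # Y is random (upper case!)
qqnorm(eps)                    # points near a straight line => normality looks fine
qqline(eps)
```

In real data you would q-q plot the residuals from a fitted model, since the true ε_i are never observed.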
i is the row index. We might even say how many rows we have by the cryptic addition to the equation:

    Y_i = α + β x_i + ε_i,    ε_i iid N(0, σ²)    (i = 1, ..., n)

Is linear good enough? The triangle answers:

Communication: Littlewood's principle: almost all functions are almost continuous almost everywhere. And from Stone–Weierstrass, all continuous functions are approximately equal to a polynomial. And all polynomials look like lines if you investigate them close enough to a zero.

Mathematics: Taylor (wiki) tells us that everything can be approximated by a linear equation. So if there is a true relationship between Y and x that is non-linear, then we could say

    E(Y | x) = f(x)

(This is yet another cryptic form of our main equation. It could be written as Y = f(x) + ε to make it look more like our previous equation.) So Taylor's theorem says that

    E(Y | x) ≈ α + βx

and even tells us what α and β are.
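A tiny numerical sketch of Taylor's point, with made-up choices (f(x) = exp(x), expansion point x₀ = 1): the first-order expansion has slope β = f'(x₀) and intercept α = f(x₀) − f'(x₀)·x₀, and near x₀ the line is nearly indistinguishable from f.

```r
# Illustration only: f and x0 are arbitrary choices, not from the course data.
f  <- function(x) exp(x)          # a decidedly non-linear f
x0 <- 1                           # expansion point
beta  <- exp(x0)                  # slope = f'(x0), since (e^x)' = e^x
alpha <- f(x0) - beta * x0        # intercept of the tangent line at x0
x <- seq(0.9, 1.1, by = 0.01)     # a window close to x0
max(abs(f(x) - (alpha + beta * x)))  # worst gap between f and its tangent line
```

The worst gap over this window is around 0.015, tiny compared to f itself; widen the window and the linear approximation degrades, which is exactly Taylor's warning.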
Data analysis: Linear is easiest to look at, so start there. Then use residuals to decide if it is good enough.

Practice

First get the data. For me, I use the command line, just like your grandfather used:

    wget http://www-stat.wharton.upenn.edu/~waterman/fsw/datasets/txt/clea

You of course have this newfangled device called a mouse, so use it!

Now start R. First read in the file:

    > read.table("cleaning.txt")

Oops, that generates too much output, and doesn't put it anywhere. So let's assign all this mess to a data frame:

    > clean = read.table("cleaning.txt")

Just look at what we have by typing clean again. Oops, we have the first row with the names of the variables in it. So let's try again:

    > clean = read.table("cleaning.txt", header = TRUE)
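A self-contained sketch of what header = TRUE changes, using a two-row inline stand-in for cleaning.txt (the real file's layout is assumed to look like this, with the column names used in the regression output below):

```r
# Inline stand-in for the first lines of cleaning.txt (assumed layout).
txt <- "roomsclean numberofcrews
20 4
60 16"
read.table(textConnection(txt))                       # columns V1/V2; names end up as a data row
clean <- read.table(textConnection(txt), header = TRUE)
clean$roomsclean                                      # now a proper numeric column
```

Without header = TRUE, the name row is mixed in with the data, so every column is read as text instead of numbers.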
Checking with clean shows we only have numbers. How happy can you get?!?

Now for the fun part, let's run a regression:

    > lm(clean$roomsclean ~ clean$numberofcrews)

    Call:
    lm(formula = clean$roomsclean ~ clean$numberofcrews)

    Coefficients:
            (Intercept)  clean$numberofcrews
                  1.785                3.701

Kinda a different world view than JMP. It just gives the minimal amount of output possible. So to see a bit more, try

    > summary(lm(clean$roomsclean ~ clean$numberofcrews))

    Call:
    lm(formula = clean$roomsclean ~ clean$numberofcrews)

    Residuals:
         Min       1Q   Median       3Q      Max
    -15.9990  -4.9901   0.8046   4.0010  17.0010

    Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
    (Intercept)           1.7847     2.0965   0.851    0.399
    clean$numberofcrews   3.7009     0.2118  17.472   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 7.336 on 51 degrees of freedom
    Multiple R-squared: 0.8569,  Adjusted R-squared: 0.854
    F-statistic: 305.3 on 1 and 51 DF,  p-value: < 2.2e-16

That should look very similar to other tables you have seen. But what of pictures? Well, let's do a plot:

    > plot(lm(clean$roomsclean ~ clean$numberofcrews))
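Rather than retyping lm(...) for each of summary() and plot(), it is idiomatic to store the fit once. A hedged, self-contained sketch, using simulated stand-in data (the sample size, intercept, slope, and error SD are chosen to roughly match the output above, not read from the real file):

```r
# Simulated stand-in for the cleaning data: 53 rows, true line ~ 1.78 + 3.70x.
set.seed(2)
numberofcrews <- rep(2:16, each = 4)[1:53]
roomsclean    <- 1.78 + 3.70 * numberofcrews + rnorm(53, 0, 7.3)
clean <- data.frame(roomsclean, numberofcrews)

fit <- lm(roomsclean ~ numberofcrews, data = clean)  # store the fit once
coef(fit)             # estimates land near the true 1.78 and 3.70
par(mfrow = c(2, 2))  # view all four diagnostic plots on one screen
plot(fit)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Leverage
```

The data = clean formula style also gives cleaner labels than the clean$... form, and plot(fit) on an lm object produces four diagnostic plots, which is why the single plot() call above pages through several pictures.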
[Figure: "Residuals vs Fitted" diagnostic plot for lm(clean$roomsclean ~ clean$numberofcrews); residuals (roughly -20 to 20) plotted against fitted values (roughly 10 to 60), with points 46, 31, and 5 flagged as extreme.]