Review session, Gov 2000 (1 / 38)

Overview: Random Variables and Probability; Univariate Statistics; Bivariate Statistics; Multivariate Statistics; Causal Inference.

Probability. Why is probability important? Probability involves reasoning about samples given the truth. (Ex.: you know the probability of heads on a coin flip is 1/2; you then ask how many heads you expect if you flip the coin six times.) Inference is about reasoning about the truth given a sample. (Ex.: you've flipped a coin six times and gotten five heads; could you reasonably expect that from a fair coin?) Probability and inference are two sides of the same coin.

Random variables. (From Wikipedia.) A random variable can be thought of as an unknown value that may change every time it is inspected. Formally, a random variable is a function mapping the sample space of a random process to the real numbers. A few examples highlight this: a coin toss with H or T as the two equally likely outcomes; the toss of a die with the numbers 1-6 as the equally likely outcomes; a spinning wheel that can choose a real number from the interval [0, 2π), with all values equally likely.

Random variables, continued. For discrete distributions, the random variable X takes on a finite or countably infinite number of values. (Ex.: the number of heads in six coin flips.) For continuous distributions, X takes on values in a continuum, uncountably many of them. (Ex.: the distribution of weights of Gov 2000 students.)

Random variables, continued. A probability mass/density function (PMF/PDF) and a cumulative distribution function (CDF) are two common ways to define the distribution of a random variable. The PMF/PDF, f(x), describes the relative likelihood that the random variable takes a given value in the observation space. The CDF, F(x), gives the probability that the random variable X takes on a value less than or equal to x.
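The PMF and CDF on this slide can be made concrete with the six-coin-flip example from earlier. The course uses R, but here is a minimal Python sketch; the random variable X, the number of heads in six fair flips, is Binomial(6, 1/2):

```python
from math import comb

# PMF of X = number of heads in six fair coin flips: Binomial(6, 0.5)
def pmf(x, n=6, p=0.5):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# CDF: F(x) = P(X <= x), the running sum of the PMF up to x
def cdf(x, n=6, p=0.5):
    return sum(pmf(k, n, p) for k in range(x + 1))

print(pmf(3))  # most likely outcome: 20/64 = 0.3125
print(cdf(6))  # the PMF sums to 1 over all outcomes: 1.0
```

Note how the CDF is just the accumulated PMF, which is exactly the "less than or equal to" definition above.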

Graphical representation. It's sometimes easier to understand the PDF and CDF graphically. [Figure: two panels, "PDF, Normal Distribution" (bell-shaped density curve) and "CDF, Normal Distribution" (S-shaped curve rising from 0 to 1), both plotted over roughly −4 to 4.]

Overview: Random Variables and Probability; Univariate Statistics; Bivariate Statistics; Multivariate Statistics; Causal Inference.

Univariate statistics. We started off talking about inference by covering univariate statistics, where you have only one variable of interest. Some examples: you want to estimate the level of support for one candidate in the general population; you want to estimate the subprime lending rate in a depressed county in Florida; you want to estimate the mean democracy score for countries in South America. This was the material covered up through the midterm.

Terminology. At this point, we introduced: Parameters, characteristics of the population distribution (e.g., the mean), often denoted with a Greek letter (µ, σ²). Estimators, random quantities, written as X̄ or Ȳ. Estimates, realized values of an estimator, hence not random (e.g., x̄).

Sampling distributions. A key concept is that of a sampling distribution. Here's how it works. We have a large population ("The Truth"). We take a sample from the population (usually, as researchers, this sample is all we get to see). We calculate our estimate (using our estimator) from this sample. Then we put the sample back, draw another, and calculate our estimate again. We repeat. Be sure you understand this concept, as it drives our thinking about confidence intervals and hypothesis testing.
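The draw-estimate-repeat loop on this slide can be simulated directly. A minimal Python sketch (the course uses R; the population parameters here are hypothetical):

```python
import random

random.seed(42)

# A hypothetical "population" (The Truth): 100,000 fixed values
population = [random.gauss(50, 10) for _ in range(100_000)]

# Repeated sampling: draw a sample, compute the estimate, put it back, repeat
estimates = []
for _ in range(2_000):
    sample = random.sample(population, 100)
    estimates.append(sum(sample) / len(sample))

# The estimates themselves have a distribution: the sampling distribution.
# Its center sits near the population mean (about 50 here).
mean_of_estimates = sum(estimates) / len(estimates)
```

A histogram of `estimates` is what the slides mean by "sampling distribution": it is a distribution of estimates, not of data.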

Estimating the population mean. We usually want to estimate the population mean. Let's use as our estimator the mean of the sample, X̄; our point estimate will be x̄. Under repeated sampling, X̄ will be normally distributed. Why? The Central Limit Theorem! Because X̄ is normally distributed, we can construct confidence intervals and hypothesis tests. Takeaway: the Central Limit Theorem is amazing and makes many inferences possible.

Confidence intervals. To construct confidence intervals for the true population parameter: for a large sample, we construct the (1 − α) confidence interval as x̄ ± z_{α/2} · SE. For a small sample (n < 32), we use x̄ ± t_{α/2} · SE. Remember the correct interpretation of a confidence interval: if you repeatedly drew samples, and for each sample constructed a confidence interval, then (1 − α) × 100% of the intervals would contain the true population parameter.
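The "repeated samples" interpretation on this slide can be checked by simulation: build many 95% intervals and count how often they cover the truth. A Python sketch with hypothetical parameters (the course uses R):

```python
import random
import statistics

random.seed(0)
mu, sigma, n = 0.0, 1.0, 100
z = 1.96  # z_{alpha/2} for a 95% interval

covered = 0
trials = 2_000
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5   # estimated standard error
    lo, hi = xbar - z * se, xbar + z * se
    covered += (lo <= mu <= hi)

coverage = covered / trials   # should land close to 0.95
```

The point of the exercise: any single interval either contains µ or it doesn't; the 95% is a property of the procedure across repeated samples.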

Hypothesis testing. The CLT also allows us to construct hypothesis tests. Most frequently, we want to test the null that the parameter is zero. Here's how it works. We assume that the observed estimate comes from the null distribution. We then transform the null (which is normal) into a standard normal by subtracting the mean and dividing by the standard error; the mean under the null is zero. This gives you your test statistic, (x̄ − 0)/SE. Then you compare this to a standard normal to see how unusual the observation would be. If it's pretty unlikely to have happened, we reject the null. Note that the p-value just quantifies this: it's the probability of obtaining a test statistic at least as extreme as the one actually observed.
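The standardize-then-compare recipe on this slide fits in a few lines. A Python sketch (the inputs below are hypothetical; the normal CDF Φ is built from the error function):

```python
import math

def two_sided_p(xbar, se, mu0=0.0):
    """Standardize under the null H0: mean = mu0, return the two-sided p-value."""
    z = (xbar - mu0) / se                          # the test statistic (xbar - 0)/SE
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # Phi(|z|), standard normal CDF
    return 2 * (1 - phi)                           # P(|Z| >= |z|) under the null

# Hypothetical example: estimate 0.5 with standard error 0.2 gives z = 2.5
p = two_sided_p(xbar=0.5, se=0.2)   # about 0.0124, so reject at alpha = 0.05
```

An estimate exactly at the null (`two_sided_p(0.0, 1.0)`) gives p = 1, the least surprising observation possible.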

Hypothesis testing graphically. It's again sometimes easier to see this graphically. [Figure: density of the null distribution, a standard normal curve over roughly −6 to 4.]

Overview: Random Variables and Probability; Univariate Statistics; Bivariate Statistics; Multivariate Statistics; Causal Inference.

Bivariate regression. We went from being interested in just one variable to being interested in two variables and the relationship between them. Univariate examples: estimating the support for a candidate in a population; estimating the true subprime mortgage lending rate. Bivariate examples: the relationship between the unemployment rate and suicide; the relationship between perceptions of CEO pay and ideal CEO pay. Note: the term regression, which we used from this point forward, can describe the relationship between any independent and dependent variables.

Introducing least squares. We discussed various techniques for summarizing the relationship between two variables (LOESS, etc.), but we decided that ordinary least squares (OLS) regression would be the best. How do we derive OLS? We plot the data, then fit the line that minimizes the squared vertical distances between the line and the actual observations. Check out the calculus in Adam's slides. The point is that we get estimates for the line's intercept, β₀, and slope, β₁. We are especially interested in the slope, as it nicely summarizes the relationship between X and Y.

Least squares, continued. More specifically, we end up with a line that looks like Ŷᵢ = β̂₀ + β̂₁Xᵢ, where β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂₀ = ȳ − β̂₁x̄. You can also calculate r², the share of the variation in Y accounted for by X.
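The two estimator formulas on this slide can be computed by hand. A Python sketch on hypothetical toy data (the course itself works in R):

```python
# Hypothetical toy data: y is roughly 2x with a little noise
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)

# Slope: sum of cross-deviations over sum of squared x-deviations
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
# Intercept: forces the fitted line through (xbar, ybar)
b0 = ybar - b1 * xbar
```

On this data the slope works out to 1.97 and the intercept to 0.09, so the fitted line passes almost exactly through the cloud's center at (3, 6).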

A graphical representation of OLS. It's sometimes easier to understand this graphically. [Figure: scatterplot of Y against X, both ranging from roughly 70 to 130, with the fitted regression line through the points.]

OLS under repeated sampling. Again, think about this in the context of sampling: we draw one sample from the true population, calculate the OLS estimates, put the sample back, draw another sample, calculate the OLS estimates again, and so on. In repeated sampling, β̂₁ and β̂₀ will be normally distributed due to the Central Limit Theorem. More specifically, β̂₁ ~ N(β₁, σ² / Σ(xᵢ − x̄)²). This means we can use the same techniques as before to express uncertainty around our β̂₁ and β̂₀ estimates.
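The claim on this slide, that β̂₁ centers on the true β₁ across repeated samples, can be checked by simulation. A Python sketch under hypothetical true values β₀ = 1, β₁ = 2, σ = 1 (the course uses R):

```python
import random

random.seed(1)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = [i / 10 for i in range(50)]          # a fixed design, held constant across samples

b1_draws = []
for _ in range(2_000):
    # Draw a fresh sample: same x's, new errors
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    b1_draws.append(num / den)

# The sampling distribution of the slope centers on the true beta1 = 2
mean_b1 = sum(b1_draws) / len(b1_draws)
```

A histogram of `b1_draws` would look normal with variance σ²/Σ(xᵢ − x̄)², matching the distribution stated above.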

Confidence intervals for OLS estimates. Because β̂₁ and β̂₀ are normally distributed, we can construct confidence intervals. For the slope: β̂₁ ± t_{α/2} · SE[β̂₁]. For the intercept: β̂₀ ± t_{α/2} · SE[β̂₀]. The interpretation is the same as before: with repeated calculations, we'd expect (1 − α) × 100% of the confidence intervals we create to contain the true values of β₁ and β₀.

Hypothesis testing for OLS estimates. The most commonly tested null is that the slope equals zero; that is, that there is no relationship between the independent and dependent variables (H₀: β₁ = 0). As before, we know the null distribution is normal. We want to get everything to a standard normal (N(0, 1)), so we standardize, including our coefficient, using the same formula: (β̂₁ − 0) / SE[β̂₁]. We then compare this to a standard normal to see how likely (or unlikely) the coefficient would be if the null were true.
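The standardized slope on this slide can be computed end to end on the same hypothetical toy data as before, with SE[β̂₁] built from the residual variance (a Python sketch; the course uses R, where `summary(lm(...))` reports this t value directly):

```python
# Test statistic for H0: beta1 = 0, with SE[b1] = sqrt(sigma2_hat / Sxx)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residual variance estimate, with n - 2 degrees of freedom
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2 = sum(e * e for e in resid) / (n - 2)
se_b1 = (sigma2 / sxx) ** 0.5

t = (b1 - 0) / se_b1   # compare to a t distribution with n - 2 df
```

Here t comes out near 36, far into the tail, so the null of no relationship would be rejected at any conventional level.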

Overview: Random Variables and Probability; Univariate Statistics; Bivariate Statistics; Multivariate Statistics; Causal Inference.

Multivariate regression. We went from being interested in one variable, to two variables, to now many independent variables (the covariates, or X variables) and one dependent variable (Y). Univariate example: estimating the support for a candidate in a population. Bivariate example: the relationship between the unemployment rate and suicide. Multivariate examples: the relationship between British colonial history, Islam, and democracy scores; the relationship between unemployment, inflation, and candidate support. Note: most social science research is multivariate.

Multivariate regression, continued. Everything works basically the same way, except that now we minimize the sum of squared residuals with respect to a regression plane (not a line). The regression equation looks like ŷᵢ = β̂₀ + β̂₁X₁ᵢ + β̂₂X₂ᵢ + ... + β̂ₖXₖᵢ. All of the β̂'s are normally distributed, so we can make the same kinds of inferences as before. (Just note that the standard error is now adjusted by the variance inflation factor; see the lecture notes!) As before, we can calculate r², but we might want to use an adjusted r² to compensate downward for the number of additional covariates.

Interaction terms. Something that comes up a lot in political science is the use of interaction terms. Interactions should be used when you think the effect of one variable increases or decreases along with another variable: ŷᵢ = β̂₀ + β̂₁Xᵢ + β̂₂Zᵢ + β̂₃(Xᵢ · Zᵢ). β̂₁ now represents the coefficient on X when Z = 0, and β̂₂ is the coefficient on Z when X = 0. When you are unsure how to interpret a regression output involving an interaction term, write out the regression equation and use algebra to help you. Never omit the lower-order terms unless you have thought carefully about the substance of the constraints you are placing on your model.
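The "write out the equation and use algebra" advice on this slide is easy to mechanize: with an interaction, the effect of a one-unit change in X is β₁ + β₃Z, so it shifts with Z. A Python sketch with hypothetical coefficients:

```python
# Hypothetical fitted coefficients for y = b0 + b1*X + b2*Z + b3*(X*Z)
b0, b1, b2, b3 = 1.0, 0.5, -0.2, 0.3

def yhat(x, z):
    return b0 + b1 * x + b2 * z + b3 * (x * z)

# Effect of a one-unit change in X, holding Z fixed: b1 + b3 * Z
effect_at_z0 = yhat(1, 0) - yhat(0, 0)   # Z = 0: just b1 = 0.5
effect_at_z2 = yhat(1, 2) - yhat(0, 2)   # Z = 2: b1 + 2*b3 = 1.1
```

This is why β̂₁ alone is only "the effect of X" at Z = 0, and why dropping the lower-order Z term would force a constraint on the model.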

Matrix algebra. A quick note about matrix notation: we use it (y, X, β) because it gets really cumbersome to write out y₁, y₂, ..., yₙ. Using matrix notation we get the same OLS estimator as before, only now in matrix form: β̂ = (X′X)⁻¹X′y. It also makes it a lot easier to do things like F tests.
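To see that the matrix formula on this slide gives the same numbers as the scalar formulas, here is a Python sketch for one regressor plus an intercept, where X′X is a 2×2 matrix that can be inverted by hand (same hypothetical toy data as earlier; in R this is a one-liner with `solve(t(X) %*% X) %*% t(X) %*% y`):

```python
# Matrix-form OLS, beta_hat = (X'X)^{-1} X'y, with design matrix X = [1, x]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

# X'X and X'y written out elementwise
xtx = [[n, sum(x)],
       [sum(x), sum(xi * xi for xi in x)]]
xty = [sum(y), sum(xi * yi for xi, yi in zip(x, y))]

# Invert the 2x2 matrix X'X and multiply by X'y
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[ xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det,  xtx[0][0] / det]]
b0 = inv[0][0] * xty[0] + inv[0][1] * xty[1]
b1 = inv[1][0] * xty[0] + inv[1][1] * xty[1]
```

The result matches the deviation-from-means formulas exactly: intercept 0.09, slope 1.97. The matrix form simply scales to k regressors without new algebra.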

F test. The F test makes it handy to test multiple hypotheses at once; for example, testing the null that a subset of your coefficients equal zero, or that a subset of your coefficients equal each other. To see the specifics of how to do an F test, go through Adam's notes. You can also use the linear.hypothesis function in the car library in R.

Diagnostics. Diagnostics are an important final step for both bivariate and multivariate regressions. After you're done with your analysis, check for possible leverage points, influence points, and outliers. Also check that the modeling assumptions hold; in particular, check for possible non-constant error variance. If you find evidence of non-constant error variance, you might want to use Huber-White standard errors.

Overview: Random Variables and Probability; Univariate Statistics; Bivariate Statistics; Multivariate Statistics; Causal Inference.

Causal inference. So far, this review session has dealt exclusively with predictive inferences, not causal ones. Making causal inferences is much harder. Why? It involves many more assumptions. You have to think carefully about what your treatment is and what your treatment and control groups are. You have to think carefully about the moment of treatment and which of your variables are pre-treatment and which are post-treatment. You have to consider whether other factors not included in your model could be driving the results.

Key assumptions. Ignorability: you must be able to assume that there are no confounding variables that affect both the probability of treatment and the outcome variable. SUTVA (Stable Unit Treatment Value Assumption): make sure your treatment isn't seeping from one subject to another. Post-treatment issues: also be wary of introducing into your model variables that are realized after the treatment has been administered.

Causal inference bottom line. If you have a randomized experiment, you can assume there are no confounders. Political scientists often look for natural experiments, like the one in Andy's paper; these can be really useful, but be sure to justify the reasons why you think the treatment is still random. If you have an observational study, you'll want to control for all the confounders you possibly can; making causal inferences becomes a lot more difficult. In both contexts, think about possible SUTVA violations and be careful not to introduce post-treatment bias. In general, be very careful about making causal claims, and consider other techniques: matching, regression discontinuity, experimentation.

Conclusion: Random Variables and Probability; Univariate Statistics; Bivariate Statistics; Multivariate Statistics; Causal Inference.

Immediately following this course. Gov 2001 will cover dichotomous dependent variables and maximum likelihood, as well as key topics (missing data, matching). You'll go through a replication of a published paper from beginning to end. Completing the 2000/2001 sequence will put you at the forefront of mainstream political scientists.

Down the road. Stat 110/210, probability theory (essential for understanding the foundations of likelihood and Bayesian inference; no computation). Stat 111/211, inference (a lot of proofs involving what we learned in this class and will learn in Gov 2001; no computation). Gov 2003, Bayesian methods (when offered; lots of computation). Completing this sequence in addition to the 2000/2001 sequence will put you at the forefront of quant-oriented political scientists.

Other courses to take. Stat 245, statistics and litigation. Econ 1128, Don Rubin's causal inference class. Gov 2010, research methodology. Gov 2002, topics (when offered; usually more advanced causal inference, textual analysis). Stat 139/149, generalized linear models (similar to what's covered in 2001, but more focus on the math, less on computation). Stat 220, Bayesian statistics. Stat 221, Bayesian computing. Taking all these courses would basically make you a methodologist.