Statistics, inference and ordinary least squares. Frank Venmans

1 Statistics, inference and ordinary least squares Frank Venmans

2 Statistics

3 Conditional probability
Consider 2 events for one throw of a die:
A: die shows 1, 3 or 5 => $P(A) = 3/6$
B: die shows 3 or 6 => $P(B) = 2/6$
$A \cap B$: A and B occur: die shows 3 => $P(A \cap B) = 1/6$
$A \cup B$: A or B occurs: die shows 1, 3, 5 or 6 => $P(A \cup B) = 4/6$
Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ (~ Venn diagram)
Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ (~ Venn diagram)
$P(A \mid B)$: probability of event A given that B occurs = 1/2
$P(B \mid A)$: probability of event B given that A occurs = 1/3
Bayes' law: $P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)$
An event can be any set of outcomes. Example:
A: random draw from the Belgian population with income > 30,000
B: random draw from the Belgian population with education > 12 years
$P(A \mid B) \neq P(A)$
[Venn diagram: Income > 30,000 and Education > 12]
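The die example can be checked by enumerating the six faces. The sketch below is only an illustration: the helper names `prob` and `cond_prob` are made up here, and exact fractions are used so that the addition rule and the product rule hold without rounding.

```python
from fractions import Fraction

# Sample space of one fair six-sided die; each face has probability 1/6.
OUTCOMES = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}   # event A from the slide
B = {3, 6}      # event B from the slide

def prob(event):
    """Probability of an event (a set of faces) under a fair die."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(A & B)   # set intersection: A and B
p_a_or_b = prob(A | B)    # set union: A or B

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p_a_or_b == p_a + p_b - p_a_and_b
# Product rule behind Bayes' law: P(A and B) = P(A|B)P(B) = P(B|A)P(A)
assert p_a_and_b == cond_prob(A, B) * p_b == cond_prob(B, A) * p_a

print(p_a_or_b, cond_prob(A, B), cond_prob(B, A))
```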

4 Independence
2 events A and B: $P(A \mid B) = P(A) \Leftrightarrow P(B \mid A) = P(B) \Leftrightarrow$ A and B are independent
Two variables X and Y: $f(x \mid y) = f(x) \Leftrightarrow f(y \mid x) = f(y) \Leftrightarrow$ X and Y are independent
X and Y are independent if the conditional distribution of X given Y is the same as the unconditional distribution of X.
Causally unrelated variables do not necessarily have zero correlation. Example: the height of my son and Indian GDP are correlated (both are affected by time).
Dependent variables may have zero correlation in exceptional cases. Example: selection bias may compensate a causal effect (see further).
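A textbook case of dependence without correlation can be worked out exactly. The example below is not from the slides: it assumes a hypothetical X that takes the values -1, 0 and 1 with equal probability, and sets Y = X². Y is fully determined by X, yet the covariance is zero.

```python
from fractions import Fraction

# Hypothetical example: X is -1, 0 or 1 with probability 1/3 each, and Y = X**2.
support = [-1, 0, 1]
p = Fraction(1, 3)
joint = {(x, x * x): p for x in support}   # joint probability function of (X, Y)

e_x = sum(prob * x for (x, y), prob in joint.items())
e_y = sum(prob * y for (x, y), prob in joint.items())
e_xy = sum(prob * x * y for (x, y), prob in joint.items())
cov_xy = e_xy - e_x * e_y                  # covariance = E[XY] - E[X]E[Y]

# Dependence check: compare P(Y=0) with P(Y=0 | X=0).
p_y0 = sum(prob for (x, y), prob in joint.items() if y == 0)
p_x0 = sum(prob for (x, y), prob in joint.items() if x == 0)
p_y0_given_x0 = joint[(0, 0)] / p_x0

print(cov_xy, p_y0, p_y0_given_x0)
```

The conditional probability differs from the unconditional one, so X and Y are dependent even though their correlation is zero.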

5 Cumulative Distribution Function (CDF), Probability Density Function (PDF)
Notation:
Random variables X, Y: ex. yearly earnings and level of education
Discrete if earnings are multiples of 100 and education is measured in years
Approximately continuous if earnings are expressed in eurocents and education in seconds
Specific values of random variables: a, b or x, y
Cumulative Distribution Function: probability that X is smaller than or equal to a
$F(a) = P(X \le a)$
Probability Density Function:
For discrete variables: $f(a) = P(X = a)$
For continuous variables: $f(a) = \frac{dF(a)}{da} \Leftrightarrow F(a) = \int_{-\infty}^{a} f(x)\,dx$
Area under the pdf = 1 because $F(\infty) = 1$

6 Joint Cumulative Distribution Function
Assume Y is yearly earnings and X the level of education.
$F(x, y) = P(X \le x \text{ and } Y \le y)$


7 Density function
Joint Density Function
For discrete variables: $f(x, y) = P(X = x \text{ and } Y = y)$
For continuous variables: $f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y}$
Marginal Density Function
Discrete variables: $f(x) = P(X = x)$, disregarding y
Continuous variables: $f(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$ (red and blue line)
Conditional Density Function
$f(x \mid y) = \frac{f(x, y)}{f(y)}$; for discrete variables this equals $P(X = x \mid Y = y)$
(intersections through the joint density function)

8 Regression as a conditional density function

9 Expected value
Unconditional expected value
For a discrete random variable: $E[X] = \sum_i x_i P(x_i) = \mu$
For a continuous random variable: $E[X] = \int x f(x)\,dx = \mu$
Conditional expected value (in finance many expectations are conditional on the information set at time t)
$E[X \mid Y] = E_Y[X] = \sum_i x_i P(x_i \mid Y)$
$E[X \mid Y] = \int x f(x \mid y)\,dx$
Variance $= \sigma^2 = E[(X - \mu)^2]$
Covariance between X and Y $= \sigma_{X,Y} = E[(X - \mu_X)(Y - \mu_Y)]$
Skewness $= E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$
Kurtosis $= E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$

10 Normal distribution
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)$
Notation: $X \sim N(\mu, \sigma^2)$
Skewness = 0, kurtosis = 3
Jarque-Bera test for normality: tests whether skewness and kurtosis are close to 0 and 3.
Any linear combination of jointly normally distributed variables (correlated or not) is normally distributed.
Central limit theorem: the standardized sum of a large number of independent, identically distributed random variables with finite variance, whatever their distribution, is approximately normally distributed.

11 Chi square distribution
$Y = \sum_{i=1}^{n} X_i^2$ with $X_i \sim N(0,1)$ and all $X_i$ independent follows a $\chi^2$ distribution with n degrees of freedom: $Y \sim \chi^2_n$

12 Student t distribution
$Z = \frac{X}{\sqrt{Y/n}}$ with $X \sim N(0,1)$, $Y \sim \chi^2_n$ and X independent from Y follows a Student or t-distribution with n degrees of freedom: $Z \sim t_n$
Higher variance and kurtosis than the standard normal distribution
Converges to the normal distribution for large n: $t_\infty = N(0,1)$

13 F distribution
$Z = \frac{X/n}{Y/m}$ with $X \sim \chi^2_n$, $Y \sim \chi^2_m$ and X independent from Y follows an F distribution with n and m degrees of freedom: $Z \sim F_{n,m}$

14 Inference

15 Statistical inference
Try to say something about the real distribution of a random variable based on a sample.
The real distribution corresponds to an infinitely repeated event (e.g. throwing dice), the entire population, the entire set of possible states of the world in a future period, etc.

16 3 types of inference
Point estimator:
  Ex: sample mean, sample variance, marginal effect in a linear regression (beta), correlation
  Concept of repeated sampling: every sample gives another estimate $\hat\theta$ => $\hat\theta$ follows a probability distribution
  Unbiased: the expected value of the estimator corresponds to the real parameter, $E[\hat\theta] = \theta$
  Consistent: the estimator gets arbitrarily close to the real parameter as the sample size increases, $\operatorname{plim}_{n \to \infty} \hat\theta = \theta$
  Ex: the sample variance estimator $\tilde{s}^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2$ is a biased but consistent estimator of the variance
  Efficient estimator: $\mathrm{var}(\hat\theta)$ is small
Interval estimation:
  Ex: given the observed sample, the real mean lies between 1 and 3 with 95% confidence
Hypothesis testing:
  Ex: if the null hypothesis is true ($\mu = 2$), what is the probability that a random sample has a more extreme (less likely) outcome than the observed sample mean of 4 and sample variance of 2?
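The repeated-sampling idea and the bias of the 1/n variance estimator can be made visible with a small simulation. This is a sketch under assumptions that are not in the slides: draws come from an exponential distribution with mean 1 and variance 1 (skewed, like income), the sample size is 5, and the seed is fixed only to make the run reproducible.

```python
import random

random.seed(0)          # fixed seed so the run is reproducible
n = 5                   # a small sample size makes the bias visible
replications = 20000    # number of repeated samples

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(replications):
    # Exponential(1) draws: skewed distribution with mean 1 and variance 1.
    sample = [random.expovariate(1.0) for _ in range(n)]
    ybar = sum(sample) / n
    ss = sum((y - ybar) ** 2 for y in sample)
    sum_biased += ss / n          # divisor n: biased but consistent
    sum_unbiased += ss / (n - 1)  # divisor n-1: unbiased

# Averages over repeated samples approximate the expected value of each estimator.
mean_biased = sum_biased / replications      # theory: (n-1)/n * sigma^2 = 0.8
mean_unbiased = sum_unbiased / replications  # theory: sigma^2 = 1

print(round(mean_biased, 3), round(mean_unbiased, 3))
```

With a larger n the factor (n-1)/n approaches 1, which is why the 1/n estimator is biased in small samples but still consistent.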

17 Example: Sample mean
Income of Belgian households: a random variable following a distribution with mean μ and variance σ² (the distribution is skewed, not normal).
You have a sample of n individuals. You want to say something about μ and σ².
Estimator of μ: sample mean $\bar{y} = \frac{y_1 + y_2 + y_3 + \dots + y_n}{n}$
The estimate will be different each time you draw a different sample => the sample mean follows a distribution, which is different from the distribution of y.
Central limit theorem => the distribution of the sample mean converges to a normal distribution even if y does not follow a normal distribution.

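18 Sample mean: variance known
$\bar{y} \overset{\text{asymptotic}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \Leftrightarrow \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} \overset{\text{asymptotic}}{\sim} N(0,1)$
This allows us to determine a 95% confidence interval:
$P\left(-1.96 < \frac{\bar{y} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$
$P\left(\bar{y} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{y} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
When the interval includes zero, we say that the sample mean is not significantly different from zero at the 5% significance level.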

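19 Sample mean: variance unknown and y normally distributed
Both mean and variance need to be estimated.
Estimator for the variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$
If Y follows a normal distribution:
$\frac{(n-1)s^2}{\sigma^2} = \sum_{i=1}^{n}\frac{(y_i - \bar{y})^2}{\sigma^2} \sim \chi^2_{n-1}$ (no proof but intuitive)
$\frac{\bar{y} - \mu}{s/\sqrt{n}} = \frac{(\bar{y} - \mu)/(\sigma/\sqrt{n})}{s/\sigma} = \frac{(\bar{y} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\frac{(n-1)s^2}{(n-1)\sigma^2}}} \sim \frac{N(0,1)}{\sqrt{\chi^2_{n-1}/(n-1)}} = t_{n-1}$
This allows us to determine a 95% confidence interval (ex. n = 21, so 20 degrees of freedom):
$P\left(-2.086 < \frac{\bar{y} - \mu}{s/\sqrt{n}} < 2.086\right) = 0.95$
$P\left(\bar{y} - 2.086\frac{s}{\sqrt{n}} < \mu < \bar{y} + 2.086\frac{s}{\sqrt{n}}\right) = 0.95$
For large n, the t distribution converges to the normal distribution.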
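The interval formula translates into a few lines of code. In the sketch below the function name `mean_confidence_interval` and the data (the integers 1 to 21) are hypothetical; the critical value 2.086 is the one quoted on the slide for n = 21, and it has to be looked up again for any other sample size.

```python
import math
import statistics

def mean_confidence_interval(sample, critical_value):
    """Interval ybar +/- critical_value * s / sqrt(n) for the population mean."""
    n = len(sample)
    ybar = statistics.mean(sample)
    s = statistics.stdev(sample)      # sample standard deviation, divisor n-1
    half_width = critical_value * s / math.sqrt(n)
    return ybar - half_width, ybar + half_width

# Hypothetical data: 21 observations, so the t distribution has 20 degrees of freedom.
sample = [float(v) for v in range(1, 22)]
low, high = mean_confidence_interval(sample, 2.086)

print(round(low, 3), round(high, 3))
```

Using 1.96 instead of 2.086 gives the known-variance (or large-sample) interval, which is slightly narrower.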

20 Hypothesis testing
Null hypothesis $H_0: \theta = \theta_0$, ex: $H_0: \theta = 0$
One-sided test $H_A: \theta > \theta_0$ (or $\theta < \theta_0$), ex: $H_A: \theta > 0$
Two-sided test $H_A: \theta \neq \theta_0$, ex: $H_A: \theta \neq 0$
2 regions:
If the observed data (test statistic) fall in the rejection region => reject $H_0$
If the observed data (test statistic) fall in the acceptance region => do not reject $H_0$
Imagine you have 10 months of data and observe a mean monthly return of Apple stock of 0.8%, and you want to test whether this mean is different from a zero return. Assume the standard deviation of the return is observed to be 1.58%, so the standard error of the mean is $1.58\%/\sqrt{10} = 0.5\%$.

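21 One-sided test vs two-sided test
One-sided test: if the real mean were zero, what would be the probability of observing a sample mean larger than 0.8% (with a standard deviation of 1.58%)?
Standardize your outcome under $H_0: \mu = 0$, with $T \sim t_{n-1}$:
$P\left(T > \frac{\bar{y} - 0}{s/\sqrt{n}}\right) = P(T > 1.6) \approx 0.07$ with 9 degrees of freedom (the normal approximation gives about 0.05)
Two-sided test: if the real mean were zero, what would be the probability that the standardized sample mean falls outside the interval $\left[-\frac{\bar{y}}{s/\sqrt{n}}, \frac{\bar{y}}{s/\sqrt{n}}\right]$?
$1 - P\left(-\frac{\bar{y}}{s/\sqrt{n}} < T < \frac{\bar{y}}{s/\sqrt{n}}\right) = 1 - P(-1.6 < T < 1.6) \approx 0.14$ (the normal approximation gives about 0.11)
The p-value is reported by Stata.
Remark: n is small, so the assumption of normally distributed returns is needed.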
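The p-values of the Apple example can be reproduced without a statistics package by integrating the Student t density numerically. This is a sketch, not the way Stata computes it: the helper names `t_pdf` and `t_upper_tail` are made up, Simpson's rule is an arbitrary choice of integration method, and the degrees of freedom are taken as n - 1 = 9. Under these assumptions the one-sided p-value comes out near 0.07, somewhat above the roughly 0.05 that the standard normal distribution gives for the same statistic.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: Simpson's rule on [0, t], then symmetry around 0."""
    h = t / steps                     # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

ybar, s, n = 0.8, 1.58, 10            # numbers from the Apple example, in percent
t_stat = (ybar - 0) / (s / math.sqrt(n))
p_one_sided = t_upper_tail(t_stat, n - 1)
p_two_sided = 2 * p_one_sided         # the t distribution is symmetric

print(round(t_stat, 2), round(p_one_sided, 3), round(p_two_sided, 3))
```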

22 Type I and type II errors
             Do not reject H0       Reject H0
H0 true      correct decision       Type I error, α (ex. 5%)
H0 false     Type II error, β       correct decision, 1-β = power of the test
Level of significance: probability of rejecting the null if the null is true
Power of the test: probability of rejecting the null if the null is false
Reduce the probability of a type I error => increase the probability of a type II error
More efficient estimator => reduce the probability of a type II error = increase the power of the test
Increase the sample size => reduce the probability of a type II error = increase the power of the test
General rule: go for a large sample. In small samples you may only see phenomena as big as an elephant, which you already knew about before doing the test; everything else shows up as an insignificant effect.

23 Ordinary least squares

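24 Regression
Assume we want to know the relationship between sales and advertising expenditure.
OLS: minimize the squared distance between the points and the regression line $Y = \alpha + \beta X$.
[Figure: scatter plot of Y = sales against X = advertising expenditure with the regression line; intercept α, slope β, and the error $\varepsilon_i$ as the vertical distance between $Y_i$ and $\alpha + \beta X_i$]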

25 Population regression line vs sample regression line
Estimators and regression line (orange) will be different for each sample.
Would you use a one-sided or two-sided t-test for beta?
[Figure: Y = sales against advertising expenditures, showing the population regression line $Y = \alpha + \beta X$ with errors $\varepsilon_i$ and the sample regression line with its own intercept, slope and residuals]

26 Assumptions of OLS
5 Gauss-Markov assumptions:
1. The true model is $Y = \alpha_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ with $E[\varepsilon] = 0$ (linearity)
2. No perfect collinearity (you cannot write $X_1$ as a linear combination of the other $X_j$'s)
3. Homoscedastic errors: $E[\varepsilon_i^2] = \sigma^2$
4. Uncorrelated errors: $E[\varepsilon_i \varepsilon_j] = 0$ for $i \neq j$ (assumptions 3 and 4 in matrix notation: $E[\varepsilon\varepsilon'] = \sigma^2 I$)
5. $E[\varepsilon \mid X_1, X_2] = 0$ (exogenous explanatory variables, no endogeneity)
If the 5 assumptions are met, OLS is the Best Linear Unbiased Estimator (BLUE).
If the errors are also normally distributed, OLS is the Best Unbiased Estimator (BUE).
OLS with non-normal errors is still unbiased and consistent!
The standardized $\hat\beta$ follows a t-distribution only if errors are normal => be prudent when interpreting confidence intervals in small samples.

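27 The math behind OLS (optional, only for those who like it)
Consider the matrix notation of the model: $Y = X\beta + \varepsilon$. If there is an intercept, X contains a column of ones.
Minimum distance estimator:
$\min_\beta \sum_{i=1}^{n} \varepsilon_i^2 = \min_\beta \varepsilon'\varepsilon = \min_\beta (Y - X\beta)'(Y - X\beta) = \min_\beta \left(Y'Y - 2\beta'X'Y + \beta'X'X\beta\right)$
First derivative: $-2X'Y + 2X'X\hat\beta = 0 \Rightarrow \hat\beta = (X'X)^{-1}X'Y$
Method of moments: errors must be uncorrelated with the regressors:
$X'\varepsilon = 0 \Leftrightarrow X'(Y - X\hat\beta) = 0 \Leftrightarrow \hat\beta = (X'X)^{-1}X'Y$
Maximum likelihood under a normal distribution of the error term:
likelihood $= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right)$
log-likelihood $= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\varepsilon_i^2$
Maximising the log-likelihood with respect to β boils down to the minimum distance estimator => OLS is BUE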
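For one regressor plus an intercept, the normal equations $X'X\hat\beta = X'Y$ form a 2x2 system that can be solved by hand. The sketch below does exactly that; the function name `ols_simple` and the data are hypothetical, and the last lines check the method-of-moments condition that the residuals are orthogonal to both columns of X.

```python
def ols_simple(x, y):
    """OLS for y = alpha + beta*x + eps via the normal equations X'X b = X'y."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy]; invert the 2x2 matrix by hand.
    det = n * sxx - sx * sx
    alpha = (sxx * sy - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    return alpha, beta

# Hypothetical data: advertising expenditure (x) and sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha_hat, beta_hat = ols_simple(x, y)

# Residuals are orthogonal to the column of ones and to x: X'e = 0.
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]
sum_e = sum(residuals)
sum_xe = sum(xi * ei for xi, ei in zip(x, residuals))

print(round(alpha_hat, 3), round(beta_hat, 3))
```

With more regressors the same formula applies, but the inverse of X'X is then better left to a linear algebra routine.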

28 OLS inference
If errors are normally distributed, the standardized estimate $\hat\beta$ follows a Student t distribution (only asymptotically the case if errors are not normally distributed).
If errors are correlated or heteroscedastic, the estimated variance of beta $\sigma^2_{\hat\beta}$ can be adjusted to take that into account (option robust in Stata).
Stata command: regress Y X1 X2, robust
The default includes a constant; you can add the option noconstant.
Exogeneity of the X's cannot be tested. Absence of endogeneity is the most important condition for a causal interpretation of the betas (see next week).

29 Avoid endogeneity
Conditions 1 and 5 imply $E[Y \mid X_1, X_2] = \alpha_0 + \beta_1 X_1 + \beta_2 X_2$. This allows a causal interpretation of the betas: an increase of $X_1$ by one unit, all other relevant factors being equal, will have an effect $\beta_1$ on Y.
Intuition: «all other factors being equal» implies that all factors that drive the error term (and thus Y) are uncorrelated with the variables of interest X.

30 The effect of marketing «all else being equal»
[Diagram with nodes: innovative company, marketing expenditures, sales, error = all other factors (competitors, quality of product, delivery time, business cycle)]
$E[\varepsilon \mid X] \neq 0$, $\mathrm{cov}(\varepsilon, X) \neq 0$: ε and X are driven by common factors.

31 Fixed effect panel regression to avoid endogeneity
[Diagram with nodes: marketing expenditures, innovative company, sales; fixed effect = all factors that are constant over time (competitors, quality of product); idiosyncratic error = all other factors that change over time (delivery time, business cycle)]

32 Fixed effects - random effects - pooled panel
3 ways of writing the same fixed effect model:
1. $Y = X\beta + \sum_i \gamma_i D_i + \varepsilon$, with $D_i$ a dummy variable for company i
2. $Y_{it} = X_{it}\beta + \gamma_i + \varepsilon_{it}$
3. $Y_{it} - \bar{Y}_i = (X_{it} - \bar{X}_i)\beta + \varepsilon_{it} - \bar\varepsilon_i$ = within estimator (obtained by subtracting from eq. 2 its average over time periods)
Beta measures the effect of a deviation from mean marketing expenditures on the deviation from mean sales within a company i.
=> The difference in mean sales between a company with high and a company with low average marketing expenditures does not drive the estimation of beta.
=> Be careful with measurement errors and lagged effects, because part of the variability is filtered out.
If theory indicates that you may avoid a source of endogeneity and you have enough data to find significant effects => use fixed effects!
The random effect model and the pooled panel regression assume that none of the factors that drive a company-specific effect drives any of the X's as well: fixed effects are uncorrelated with the X's.
Pooled panel regression is an OLS as if there were no panel structure: every observation has equal weight.
The random effect model is a Generalized Least Squares (GLS) estimator: the observed heterogeneity and serial correlation are used to make the estimator more efficient compared to the pooled panel regression.
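A tiny noise-free panel shows why the within transformation matters. Everything in this sketch is hypothetical: two companies, three periods, a true beta of 2, and a fixed effect that is high for the company with low marketing expenditures. Pooled OLS picks up the difference between companies, while the within estimator only uses deviations from each company's own mean.

```python
def slope(x, y):
    """OLS slope of y on x in a regression with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical panel: x = marketing expenditures, gamma = company fixed effect.
panel = {
    "A": {"x": [1.0, 2.0, 3.0], "gamma": 10.0},  # low marketing, high fixed effect
    "B": {"x": [4.0, 5.0, 6.0], "gamma": 0.0},   # high marketing, low fixed effect
}
beta_true = 2.0

x_all, y_all, x_within, y_within = [], [], [], []
for firm in panel.values():
    x = firm["x"]
    y = [beta_true * v + firm["gamma"] for v in x]   # eq. 2 without noise
    mx, my = sum(x) / len(x), sum(y) / len(y)
    x_all += x
    y_all += y
    x_within += [v - mx for v in x]                  # deviation from company mean
    y_within += [v - my for v in y]

beta_pooled = slope(x_all, y_all)        # ignores the panel structure
beta_within = slope(x_within, y_within)  # fixed effect (within) estimator

print(round(beta_pooled, 3), round(beta_within, 3))
```

In this constructed case the pooled slope even has the wrong sign, because the fixed effect is negatively correlated with X; the within estimator recovers the true beta.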

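33 Alternative functional forms
If X increases by 1, Y will increase by β:
$Y = \alpha + \beta X + \varepsilon$; $\frac{dY}{dX} = \beta$ = marginal effect
If X increases by 1%, Y will increase by approximately β%:
$\ln Y = \alpha + \beta \ln X + \varepsilon \Leftrightarrow Y = e^{\alpha} X^{\beta} e^{\varepsilon}$; $\frac{d\ln Y}{d\ln X} = \beta = \frac{dY/Y}{dX/X}$ = elasticity
If X increases by 1, Y will increase by approximately 100·β%:
$\ln Y = \alpha + \beta X + \varepsilon \Leftrightarrow Y = e^{\alpha} e^{\beta X} e^{\varepsilon}$; $\frac{d\ln Y}{dX} = \beta = \frac{dY/Y}{dX}$ = growth rate (think of X as time)
If X increases by 1%, Y will increase by approximately β/100:
$Y = \alpha + \beta \ln X + \varepsilon \Leftrightarrow e^{Y} = e^{\alpha} X^{\beta} e^{\varepsilon}$; $\frac{dY}{d\ln X} = \beta = \frac{dY}{dX/X}$
Any transformation (ln X, 1/X, X², X³, exp X) is in principle allowed.
Transformation can be justified by theory (in most cases) or by the data (see graphs).
[Figures: three plots of Y against X]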
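The percentage interpretations of the logarithmic forms are first-order approximations, and the exact changes are easy to compute. The function names and the coefficient values in this sketch are hypothetical and serve only to show how close the approximation is for small changes.

```python
import math

def pct_change_log_level(beta, dx=1.0):
    """Exact % change in Y when X rises by dx in ln(Y) = alpha + beta*X."""
    return 100 * (math.exp(beta * dx) - 1)

def change_level_log(beta, pct):
    """Exact change in Y when X rises by pct percent in Y = alpha + beta*ln(X)."""
    return beta * math.log(1 + pct / 100)

def pct_change_log_log(beta, pct):
    """Exact % change in Y when X rises by pct percent in ln(Y) = alpha + beta*ln(X)."""
    return 100 * ((1 + pct / 100) ** beta - 1)

# Hypothetical coefficients.
growth = pct_change_log_level(0.05)       # approximation: 100 * 0.05 = 5%
level_change = change_level_log(2.0, 1)   # approximation: 2 / 100 = 0.02
elasticity = pct_change_log_log(0.5, 1)   # approximation: 0.5%

print(round(growth, 3), round(level_change, 4), round(elasticity, 4))
```

For a large beta or a large change in X the gap between the exact value and the approximation grows, so the exact formula is safer when reporting results.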

34 Some useful commands in Stata
Type help followed by the command name in the command window for more on the following commands:
summarize : summarize information about a variable or dataset
tabulate var1 var2 : tabulation table to explore data
destring : convert a variable to numeric if it was imported as string (text)
generate var1=var2+var3 : generates a new variable (also for log transformation)
replace var1=0 if var2==. & var3>36 : logical expressions with == ; dot= missing (or ) => if var3 is missing, it satisfies var3>36!
replace var1=1 if l.var2==25 | var3==var4 : lag operator, available only if the dataset is defined as time series or panel
regress y x1 x2 : regression with many options
tsset year : define the dataset to be a time series, with year the name of the time variable

35 Commands in Stata
xtset i t : define a dataset to be a panel, with i the name of the person or company identifier and t the time variable
xtreg : fixed effect or random effect panel regression
Create an id per companyname:
egen id = group(companyname)
egen stands for extensions to generate and operates on groups of observations; generate only applies to one observation at a time.
Calculate the mean (over time) of the variable income per company id (the data must be sorted by id; bysort id: sorts first):
by id: egen meanincome = mean(income)
Eliminate extreme values beyond the 99th percentile of the variable income:
summarize income, detail
replace income=. if income>`r(p99)'
Most commands create different macros (mentioned at the end of the help document). You can use them with `...'. Since they are local macros, they are erased at some point (in this case they last until the next command that uses r to store results).
