Lecture Data Science

Amelia Phillips
5 years ago

1 Web Science & Technologies, University of Koblenz-Landau, Germany. Lecture Data Science: Regression Analysis. JProf. Dr.

2 Last Time
How to find the parameters of a regression model: Normal Equation, Gradient Descent
Probabilistic view on regression: Least Squares Estimate and MLE
Multivariate Linear Regression

3 Today
Model Quality: Bias-Variance Tradeoff
Solutions: Variable Selection, Regularization
Logistic Regression
Causality

4 MODEL QUALITY

5 General Problem
Problem: how do we find the optimal model?
Using all predictor variables gives the best fit on the training data: leaving out a variable can only increase the residuals (the error on the training data).
Using all predictor variables also gives a large error due to variance, because every predictor variable is afflicted with errors.
The more complex the model, the more solutions exist that achieve the same error on the training data.
This is the bias-variance trade-off!

6 Bias and Variance
Each blue point is one estimate $\hat{\theta}$; the red point is the true $\theta$.

7 Linear Regression: high bias, low variance.
Variance = sensitivity to small fluctuations in the training set.
Polynomial Regression: low bias, high variance. Small changes to the data change the solution a lot.

8 Bias and Variance Tradeoff
Parametric algorithms such as linear regression have high bias and low variance (risk of underfitting).
Decision trees or k-nearest neighbours have low bias and high variance (risk of overfitting).

9 Polynomial Regression
$H = b_0 + b_1 W + e$
$H = b_0 + b_1 W + b_2 W^2 + e$
$H = b_0 + b_1 W + b_2 W^2 + b_3 W^3 + b_4 W^4 + b_5 W^5 + e$
Training error is a function of model complexity!
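
The claim that training error depends on model complexity can be checked on a toy data set. The sketch below fits the degree 1, 2 and 5 polynomials from this slide with the normal equation and compares their training sums of squared errors. The values of W and H are made up for illustration, and the small Gaussian-elimination solver is only a stand-in for a linear algebra library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_poly(W, H, degree):
    """Least-squares coefficients b0..b_degree via the normal equation."""
    X = [[w ** k for k in range(degree + 1)] for w in W]
    m = degree + 1
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
    XtH = [sum(r[i] * h for r, h in zip(X, H)) for i in range(m)]
    return solve(XtX, XtH)


def training_sse(W, H, b):
    """Sum of squared residuals on the data the model was fitted to."""
    return sum((h - sum(bk * w ** k for k, bk in enumerate(b))) ** 2
               for w, h in zip(W, H))


# Hypothetical data, not taken from the lecture.
W = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]
H = [0.1, 0.9, 0.4, 1.3, 1.0, 1.9, 1.4, 2.4, 2.1]

errors = {d: training_sse(W, H, fit_poly(W, H, d)) for d in (1, 2, 5)}
for d, sse in errors.items():
    print(f"degree {d}: training SSE = {sse:.4f}")
```

Because each lower-degree model is a special case of the higher-degree one, the training error cannot go up as the degree grows. This says nothing about the error on new data.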

10 Solutions
Model Selection: identify a subset of the p predictors that we believe to be related to the response, then fit a model using least squares on the reduced set of variables. Criteria: adjusted $R^2$, cross-validated MSE, Mallows' $C_p$.
Regularization: fit a model involving all p predictors, but with the estimated coefficients shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) reduces variance and can also perform variable selection. Methods: Ridge Regression and Lasso.

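11 Adjusted $R^2$
$R^2$ never decreases when more variables are added.
Adjust $R^2$ for the number of parameters p:
Unexplained variability: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)$
Total variability: $\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)$
$R^2_{adj} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}$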
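
A minimal sketch of the adjusted $R^2$ computation, assuming the usual form of one minus the ratio of unexplained to total variability. The function name and the example numbers are illustrative only.

```python
def adjusted_r2(y, y_hat, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    n = len(y)
    y_bar = sum(y) / n
    unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / (n - p - 1)
    total = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    return 1.0 - unexplained / total


# Hypothetical observations and fitted values.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]

# Same fit quality, but charged for one versus two predictors.
print(adjusted_r2(y, y_hat, p=1))
print(adjusted_r2(y, y_hat, p=2))
```

With identical residuals, the model that used more predictors gets the lower score, which is the penalty that plain $R^2$ lacks.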

12 Cross-validated Prediction Error
K-fold cross-validation: mean squared error (MSE) on the test data.
Src:
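
K-fold cross-validation can be sketched for a simple linear regression as follows. This is an illustration under stated assumptions: folds are formed by taking every k-th observation without shuffling, the model is a single-predictor least-squares line, and the data are invented.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return my - b1 * mx, b1


def k_fold_mse(x, y, k):
    """Average test-fold MSE; fold f holds out indices f, f + k, f + 2k, ..."""
    n = len(x)
    fold_mse = []
    for f in range(k):
        test_idx = set(range(f, n, k))
        x_train = [x[i] for i in range(n) if i not in test_idx]
        y_train = [y[i] for i in range(n) if i not in test_idx]
        b0, b1 = fit_line(x_train, y_train)
        squared = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx]
        fold_mse.append(sum(squared) / len(squared))
    return sum(fold_mse) / k


# Hypothetical data: an exact line and a noisy version of it.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_exact = [2.0 * xi + 1.0 for xi in x]
y_noisy = [1.2, 2.7, 5.4, 6.8, 9.3, 10.6, 13.5, 14.7, 17.2]

print(k_fold_mse(x, y_exact, 3))
print(k_fold_mse(x, y_noisy, 3))
```

Each observation is used for testing exactly once and never for fitting the model that predicts it, so the averaged MSE estimates the error on unseen data rather than the training error.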

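13 Mean Squared Error (MSE)
The MSE measures the expected squared difference between the estimator and what is estimated: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$
See proof here:
Bias: $\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$
Sampling deviation: $d_x = \hat{\theta}(x) - E[\hat{\theta}]$, where $x$ is the sample and $\hat{\theta}(x)$ is the estimate computed from that one sample.
Variance: $\mathrm{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$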
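
The standard decomposition that links MSE, bias and variance can be worked out in a few lines. It is given here as a sketch in the notation of this slide, treating $\theta$ as a fixed (non-random) quantity.

```latex
\begin{align*}
\mathrm{MSE}(\hat{\theta})
  &= E\big[(\hat{\theta} - \theta)^2\big] \\
  &= E\big[\big((\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta)\big)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
     + 2\,(E[\hat{\theta}] - \theta)\,E\big[\hat{\theta} - E[\hat{\theta}]\big]
     + (E[\hat{\theta}] - \theta)^2 \\
  &= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2
\end{align*}
```

The cross term vanishes because $E[\hat{\theta} - E[\hat{\theta}]] = 0$, so a model can only lower its MSE by reducing variance, bias, or both.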

14 Mallows' $C_p$
Mallows' $C_p$ is a technique for model selection in regression (Mallows 1973).
The $C_p$ statistic is a criterion for assessing the fit of models with different numbers of parameters.
It combines the sum of squared errors of the model with p parameters and the variance of the full model with n parameters.
N is the sample size, P is the number of predictors.
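
The formula itself did not survive on this slide, so the sketch below assumes one commonly used form, $C_p = SSE_p / S^2 - N + 2(P + 1)$, where $S^2$ is the residual mean square of the full model. Other texts count the intercept inside P and write $2P$ instead, so treat the convention as an assumption. The function name and the numbers are hypothetical.

```python
def mallows_cp(sse_p, s2_full, n, p):
    """Mallows' C_p under the assumed convention SSE_p / S^2 - N + 2 * (P + 1).

    sse_p   -- sum of squared errors of the candidate model with p predictors
    s2_full -- residual mean square of the full model
    n       -- sample size
    p       -- number of predictors in the candidate model (intercept excluded)
    """
    return sse_p / s2_full - n + 2 * (p + 1)


# Hypothetical candidate model: 3 predictors, 20 observations.
cp = mallows_cp(sse_p=20.0, s2_full=1.25, n=20, p=3)
print(cp)
```

Under this convention a candidate model with little bias has $C_p$ close to $P + 1$, so the search looks for small models whose $C_p$ is near that value.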

15 Using Mallows' $C_p$
Search for the best model with respect to Mallows' $C_p$ statistic.

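17 Regularization
Estimate the coefficients: $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda\, L_p(\beta)$, where the first term is the loss and the second term is the penalty.
Choose the coefficients so that the least-squares error is minimal but the coefficients are also small (shrink the coefficients).
Here $\lambda \ge 0$ is a tuning parameter that controls the strength of the penalty term.
Estimate $\lambda$ via cross-validation.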
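
The effect of the penalty is easiest to see in the smallest possible case. The sketch below assumes a single predictor, no intercept, and a squared (Ridge) penalty, for which the minimizer of $\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$ has the closed form $\sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The data and the function name are illustrative.

```python
def ridge_slope(x, y, lam):
    """Minimizer of sum((y_i - b * x_i)^2) + lam * b^2 for one predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Hypothetical data lying exactly on y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 14.0, 100.0):
    print(f"lambda = {lam:6.1f}  ->  slope = {ridge_slope(x, y, lam):.4f}")
```

With $\lambda = 0$ the least-squares slope is recovered; as $\lambda$ grows the estimate is pulled towards zero, trading some bias for lower variance.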

18 Regularization
More generally, the penalty can be any $L_p$ regularizer; Ridge is the case $p = 2$.

19 Find a balance between two loss terms: the least-squares error on the data and the regularization term.

20 L1 and L2 Norm
L1 (Lasso) tends to generate sparser solutions.
L2: Tikhonov regularization, Ridge Regression
L1: Lasso regularization
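
Why L1 gives sparser solutions can be illustrated in one dimension. The sketch assumes each coefficient can be treated separately (as with orthonormal predictors), so that the penalized problem for a least-squares coefficient z is $(z - b)^2 + \lambda |b|$ for Lasso and $(z - b)^2 + \lambda b^2$ for Ridge. The coefficient values are invented.

```python
def lasso_1d(z, lam):
    """Minimizer of (z - b)^2 + lam * |b|: soft-thresholding at lam / 2."""
    if abs(z) <= lam / 2.0:
        return 0.0
    return z - lam / 2.0 if z > 0 else z + lam / 2.0


def ridge_1d(z, lam):
    """Minimizer of (z - b)^2 + lam * b^2: proportional shrinkage."""
    return z / (1.0 + lam)


# Hypothetical least-squares coefficients.
ols = [2.0, -0.3, 0.1, -1.5]
lasso = [lasso_1d(z, 1.0) for z in ols]
ridge = [ridge_1d(z, 1.0) for z in ols]

print("lasso:", lasso)
print("ridge:", ridge)
```

Lasso sets the two small coefficients exactly to zero, which is the variable-selection effect, while Ridge shrinks every coefficient but leaves all of them non-zero.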

21 LOGISTIC REGRESSION

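22 OLS & Dichotomous Variables
Strategy 1: Linear Probability Model (LPM). Use linear regression with P(Y=1) as the dependent variable rather than Y.
Regression output for the dependent variable gun: columns Coef., Std. Err., t, P>|t|, [95% Conf. Interval]; rows male, educ, income, south, liberal, _cons.
Model: $Y_i = \beta_0 + \beta_1 \mathrm{Male}_i + \beta_2 \mathrm{Educ}_i + \beta_3 \mathrm{Inc}_i + \beta_4 \mathrm{South}_i + \beta_5 \mathrm{Liberal}_i$
Example prediction: P(Y=1) = ... .015(12) ... .038(6) ... .15(1) ... .03(0)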

23 OLS & Dichotomous Variables
What kinds of problems come up? The model offers nonsensical predicted values: instead of predicting pass (1) or fail (0), the regression line might predict a value outside the range 0 to 1.
Better alternative: Logistic Regression, also called Logit.
A non-linear form of regression that works well for dichotomous dependent variables.
Based on odds rather than probability: rather than modelling P(Y=1), we model the log odds of Y=1.
Logit refers to the natural log of the odds.
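
The nonsensical predictions of the linear probability model are easy to reproduce. The sketch below fits an ordinary least-squares line to a binary pass/fail outcome; the variable names and the data (hours studied against passing) are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    return my - b1 * mx, b1


# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_line(hours, passed)


def predict(h):
    """LPM 'probability' of passing after h hours."""
    return b0 + b1 * h


print(predict(0))  # below 0
print(predict(9))  # above 1
```

Both ends of the fitted line leave the interval from 0 to 1, so the predictions cannot be read as probabilities there.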

24 Logistic Regression
Logit = natural logarithm of the odds. Natural log means base e, not base 10.
We can model a logit as a function of independent variables, just as we model Y or a probability (the LPM):
$\ln \frac{P(Y = 1)}{1 - P(Y = 1)} = \beta_0 + \beta_1 X + \varepsilon$
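
A minimal sketch of the logit and its inverse, showing that a linear predictor on the log-odds scale always maps back to a probability strictly between 0 and 1. The coefficient values are made up, and the error term is left out.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Map log odds z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for log odds = b0 + b1 * X.
b0, b1 = -3.0, 0.5

for X in (-20, 0, 6, 40):
    p = inv_logit(b0 + b1 * X)
    print(f"X = {X:4d}  log odds = {b0 + b1 * X:6.1f}  P(Y=1) = {p:.6f}")
```

Even for extreme values of X the predicted probability stays inside the unit interval, which is what the linear probability model fails to guarantee.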

25 Interpreting Coefficients
Raw coefficients show the effect of a 1-unit change in X on the log odds of Y=1.
Positive coefficients make Y=1 more likely; negative coefficients make it less likely.
But the effects are not linear: the effect of a unit change in X on P(Y=1) isn't the same for all values of X!
Rather, a one-unit change in X has a linear effect on the log odds of Y=1.
It is hard to think in units of log odds, so we need to do further calculations.
[Figure: P(Y=1) and the corresponding log odds ln[p/(1-p)]]
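
The difference between the two scales can be made concrete. In the sketch below, with hypothetical coefficients b0 = 0 and b1 = 1, a one-unit step in X always adds the same amount to the log odds but changes P(Y=1) by very different amounts depending on where the step starts.

```python
import math


def prob(x, b0=0.0, b1=1.0):
    """P(Y=1) for log odds b0 + b1 * x (coefficients are illustrative)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_odds(p):
    return math.log(p / (1.0 - p))


# One-unit steps at a low and at a high value of X.
step_low = prob(1) - prob(0)
step_high = prob(5) - prob(4)

print(f"change in P from X=0 to X=1: {step_low:.4f}")
print(f"change in P from X=4 to X=5: {step_high:.4f}")
print(f"change in log odds, both cases: "
      f"{log_odds(prob(1)) - log_odds(prob(0)):.4f}, "
      f"{log_odds(prob(5)) - log_odds(prob(4)):.4f}")
```

The step near P = 0.5 moves the probability far more than the step near P = 1, while the change in log odds equals b1 in both cases.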

26 Interpreting Coefficients
The best way to interpret logit coefficients is to exponentiate them. This converts from log odds to simple odds.
Exponentiated coefficients are called odds ratios.
An odds ratio of 3.0 indicates the odds are 3 times higher for each unit change in X. Or, you can say the odds increase by a factor of 3.
An odds ratio of 0.5 indicates the odds decrease by ½ for each unit change in X.
Odds ratios < 1 indicate negative effects.

27 Interpreting Coefficients (Example)
Do you drink coffee? Y=1 indicates coffee drinkers; Y=0 indicates no coffee.
Independent variable: year in graduate program.
Observed raw coefficient: b = 0.67. Each year increases the log odds by 0.67.
Exponentiation: $e^{0.67} \approx 1.95$. Each year multiplies the odds by 1.95.
The odds nearly double for each unit change in X.
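
The arithmetic of this example can be reproduced directly. The sketch uses the coefficient b = 0.67 from the slide; the starting odds of 1.0 (a 50% chance of being a coffee drinker) and the helper names are assumptions added for illustration.

```python
import math

b = 0.67                    # raw logit coefficient from the slide
odds_ratio = math.exp(b)    # multiplicative change in the odds per year


def odds_after(years, start_odds=1.0):
    """Odds of Y=1 after a number of years, given assumed starting odds."""
    return start_odds * math.exp(b * years)


def odds_to_prob(odds):
    return odds / (1.0 + odds)


print(round(odds_ratio, 2))
for years in (0, 1, 2, 3):
    odds = odds_after(years)
    print(f"year {years}: odds = {odds:.2f}, P(Y=1) = {odds_to_prob(odds):.3f}")
```

The odds are multiplied by the same factor every year, whereas the probability rises by a smaller amount each year as it approaches 1.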

28 COMMON PROBLEMS WITH REGRESSION

29 Common Problems I: Outliers
Extreme values can have a big impact on the fitted line, since we try to minimize the squared error.
Inclusion or exclusion of extreme values often changes the results significantly.
Solutions: use the log to reduce the impact of large values; use bootstrapped coefficients.
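
How strongly a single extreme value can pull a least-squares line is shown in the sketch below. The data are invented: five points on the line y = 2x, and the same points with the last response replaced by an outlier.

```python
def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_clean = [2.0, 4.0, 6.0, 8.0, 10.0]
y_outlier = [2.0, 4.0, 6.0, 8.0, 50.0]   # one extreme value

slope_clean = slope(x, y_clean)
slope_outlier = slope(x, y_outlier)
print(slope_clean, slope_outlier)
```

One changed observation out of five moves the slope from 2 to 10, because its squared residual dominates the sum being minimized.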

30 Common Problems II: Omitted Variable Bias
Occurs if the omitted variable is a determinant of the DV (outcome) AND the omitted variable is correlated with the included IVs.
The effect of the omitted variable is then attributed to the included variable that is correlated with it.
Heteroskedastic errors can indicate that related variables have been omitted.
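
Omitted variable bias can be demonstrated with a noise-free toy example. The sketch assumes a true model y = 1 * x1 + 2 * x2 in which x2 is correlated with x1, and then regresses y on x1 alone. All values are invented.

```python
def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))


# Hypothetical predictors; x2 tends to grow with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y = [1.0 * a + 2.0 * b for a, b in zip(x1, x2)]   # true effect of x1 is 1

b_short = slope(x1, y)    # regression that omits x2
delta = slope(x1, x2)     # how strongly x2 moves with x1

print(f"estimated effect of x1 when x2 is omitted: {b_short:.4f}")
print(f"true effect plus bias, 1 + 2 * delta:      {1 + 2 * delta:.4f}")
```

The short regression credits x1 with its own effect plus the part of x2's effect that travels with x1, which is the attribution described on this slide.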

31 Common Problems III: Multicollinearity
Multicollinearity exists when two or more of the predictors in a regression model are highly correlated.
The estimated regression coefficient of any one variable depends on which other predictor variables are included in the model. Why?
Detecting multicollinearity: analyze the variance of the coefficients in different models.
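
One common diagnostic, which the slide does not name and is swapped in here as an illustration, is the variance inflation factor (VIF). For two predictors it reduces to $1 / (1 - r^2)$, where r is their correlation, and it indicates how much the variance of a coefficient is inflated by the overlap between predictors. The data are invented.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF of either predictor in a model with exactly two predictors."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors that measure almost the same thing.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

vif = vif_two_predictors(x1, x2)
print(f"r = {correlation(x1, x2):.4f}, VIF = {vif:.1f}")
```

A frequently quoted rule of thumb treats values above roughly 10 as a sign of problematic multicollinearity; the threshold is a convention, not a strict test.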

32 CAUSALITY

33 Correlation
Independent Variable (Cause), Dependent Variable (Effect)

34 This graph tells a different story

35 Feedback cycles
No unidirectional, monocausal link.

36 Storks deliver babies

38 Common Cause
Country size influences both the number of babies and the number of storks.
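
A small simulation shows how a common cause produces a strong correlation without any causal link. The sketch assumes that country size drives both quantities with independent noise; the numbers, the seed and the effect sizes are all invented.

```python
import math
import random


def correlation(a, b):
    """Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def residuals(y, x):
    """What is left of y after removing a least-squares line in x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


rng = random.Random(0)  # fixed seed so the run is repeatable
size = [rng.uniform(1, 100) for _ in range(200)]       # country size
babies = [10 * s + rng.gauss(0, 20) for s in size]     # depends on size only
storks = [2 * s + rng.gauss(0, 5) for s in size]       # depends on size only

raw = correlation(babies, storks)
adjusted = correlation(residuals(babies, size), residuals(storks, size))
print(f"correlation of babies and storks:         {raw:.3f}")
print(f"same, after accounting for country size:  {adjusted:.3f}")
```

Storks and babies are strongly correlated in the raw data, yet once country size is accounted for the remaining association is close to zero.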

39 Example
8 times more people watched the inauguration of Trump via video stream.
Can we conclude that he is more popular? Or that people were more interested in his inauguration?

40 What is Causality?
The basic idea goes back to John Stuart Mill ( ).
Counterfactual model of causal inference (also known as the Rubin Causal Model or the Potential Outcomes Model of causal inference).
Individual-level effect!
