Python Data Analysis: Supplementary Material. 윤형기


1 Python Data Analysis: Supplementary Material. 윤형기

2 REGRESSION: Simple / Multiple Regression Analysis, Logistic Regression

3 Regression Overview
A functional relationship between a single numeric D.V. (the value to be predicted) and one or more numeric I.V.s (the predictors). "Regression" = the process of fitting lines to data (Galton). Also used for hypothesis testing: determining whether the data indicate that a presupposition is more likely to be true or false. It applies to a variety of models: SLR, MLR, GLM, link functions, logistic regression, Poisson regression, ...

4 Simple Regression Analysis
Dependent variable = the variable to be predicted (y). Independent variable = explanatory variable = the predictor (x). Scope: only a straight-line relationship between two variables. Determining the regression line: the deterministic regression model is y = β0 + β1x; the probabilistic regression model is y = β0 + β1x + ε.
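
As a quick illustration of the probabilistic model above, here is a minimal NumPy sketch (the data values are made up for this example) that fits y = b0 + b1·x by the usual least-squares formulas:

```python
import numpy as np

# Illustrative data (made up for this example): x = predictor, y = response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])

# OLS estimates for the model y = b0 + b1*x:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
```

For this toy data the fitted line is close to y ≈ 0.09 + 1.97x.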

5 OLS (Ordinary Least Squares)

6 Residual Analysis

7 Standard Error of the Estimate
For error analysis, instead of computing the residuals (= the estimation errors for the individual points), use the standard error of the estimate. SSE is partly a function of the number of data pairs used to compute the sum, which lessens its value as a measure of error. A better indicator is the standard error of the estimate (s_e), the standard deviation of the errors of the regression model. (Empirical rule for the normal distribution: 68% of values fall within μ ± 1σ and 95% within μ ± 2σ; a regression assumption is that, for a given x, the error terms are normally distributed.) Since the error terms are ~ ND, s_e is the s.d. of the errors, and the average error is 0, so:
» 68% of the error values (residuals) should be within 0 ± 1 s_e
» 95% of the error values (residuals) should be within 0 ± 2 s_e.
s_e provides a single measure of the magnitude of the errors in the model, and is also used to identify outliers (e.g., residuals outside ±2 s_e or ±3 s_e).
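
The quantities above can be computed directly; a small sketch with the same kind of illustrative data, reusing the least-squares fit:

```python
import numpy as np

# Same illustrative data as before; fit, then summarize the errors with s_e.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)          # estimation error for each point
sse = np.sum(residuals ** 2)           # sum of squared errors
se = np.sqrt(sse / (len(x) - 2))       # standard error of the estimate, s_e
outliers = np.abs(residuals) > 2 * se  # flag residuals outside 0 +/- 2*s_e
```

Here no residual exceeds ±2 s_e, so no point is flagged as an outlier.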

8 Coefficient of Determination
R² = how much of the variability of the D.V. (y) is explained by the I.V. (x); 0 ≤ r² ≤ 1. The D.V. (y) has variation, measured by the sum of squares of y (SS_yy): SS_yy = SSR + SSE. Dividing each term by SS_yy shows that r² is the proportion of y variability explained by the regression model: r² = SSR / SS_yy = 1 − SSE / SS_yy. Relationship between r and r²: r² = (r)², i.e., the coefficient of determination is the square of the coefficient of correlation, so r = ±√(r²). Next: hypothesis testing of the regression slope and of the model overall.
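
The decomposition SS_yy = SSR + SSE and the identity r² = (r)² can be checked numerically; a sketch on the same illustrative data:

```python
import numpy as np

# Illustrative data; verify SS_yy = SSR + SSE and r^2 = (r)^2 numerically.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

ss_yy = np.sum((y - y.mean()) ** 2)        # total variation of y
sse = np.sum((y - (b0 + b1 * x)) ** 2)     # unexplained variation
ssr = ss_yy - sse                          # explained variation
r2 = ssr / ss_yy                           # proportion of y variability explained
r = np.corrcoef(x, y)[0, 1]                # correlation coefficient
```

For this nearly linear data, r2 agrees with r**2 to machine precision and is close to 1.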

9 Coefficient Estimation
OLS chooses β̂0 and β̂1 to minimize the RSS; a little calculus gives the closed-form estimates
β̂1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,  β̂0 = ȳ − β̂1 x̄.

10 Accuracy of the Estimated Coefficients
(Q) How accurate is an estimate μ̂? (A) Compute SE(μ̂) (= the standard error of μ̂), i.e., compute the standard errors of β̂0 and β̂1. The residual standard error is RSE = √(RSS / (n − 2)).

11 Accuracy of the Linear Model
The residual standard error, and the R² statistic. In simple regression, with r = Cor(x, y), R² = r².

12 Multiple Regression Analysis
SLR vs. MLR. Simple regression model: y = β0 + β1x + ε. Multiple regression model: y = β0 + β1x1 + β2x2 + ... + βkxk + ε. First-order MR model with two independent variables: y = β0 + β1x1 + β2x2 + ε. The constant and coefficients are estimated from a sample: ŷ = b0 + b1x1 + b2x2, which defines a response surface / response plane. Significance tests for the model and its coefficients (assessing the adequacy of the regression model), testing the model overall: in simple regression, a t test of the slope of the regression line checks whether it is 0 (i.e., whether the I.V. contributes significantly to predicting the D.V.); in multiple regression, an analogous test makes use of the F statistic.

13 [figure-only slide; no text transcribed]

14 Significance Tests for the Regression Coefficients
Individual significance tests for each regression coefficient use a t test: H0: β1 = 0; H0: β2 = 0; ...; H0: βk = 0 vs. Ha: β1 ≠ 0; Ha: β2 ≠ 0; ...; Ha: βk ≠ 0. The d.f. for each of the individual coefficient tests are n − k − 1.
Residuals, SSE, and the standard error of the estimate: residuals = the errors of the regression model, used to detect outliers and to check the assumptions of the regression analysis. The standard error of the estimate measures the scatter of the points around the line of best fit, i.e., how the actual y values are distributed around the ŷ given by the regression line. SSE = Σ(y − ŷ)². Regression assumptions (error terms ~ ND(0)) plus the empirical rule (roughly 68% of residuals within ±1 s_e, 95% within ±2 s_e) make the standard error of the estimate useful for measuring how well the regression model fits the data.

15 Coefficient Estimation (MLR)
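
The slide's estimation formula did not survive transcription, so here is one standard way to compute the MLR coefficients: solve the least-squares problem for a design matrix with an intercept column (the data are illustrative and lie exactly on a plane):

```python
import numpy as np

# Illustrative two-predictor data lying exactly on the plane y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 5.0], [6.0, 4.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Design matrix with an intercept column; least squares gives b0, b1, b2.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Since the data are noiseless, the solver recovers the plane's coefficients exactly (up to rounding).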

16 Key Issues
(1) Is there a relationship between the response and the predictors?
Hypothesis test: H0: β1 = β2 = ... = βp = 0 vs. Ha: at least one βj is non-zero.
Compute the F-statistic: F = ((TSS − RSS)/p) / (RSS/(n − p − 1)), where TSS = Σ(y_i − ȳ)² and RSS = Σ(y_i − ŷ_i)².
IF H0 is true (= no relationship between the response and the predictors), THEN the F value is close to 1. IF Ha is true, THEN E{(TSS − RSS)/p} > σ², so we expect F > 1.
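
The F-statistic above can be computed directly from TSS and RSS; a sketch on simulated illustrative data where a real relationship exists, so F should land far above 1:

```python
import numpy as np

# Simulated data (illustrative): only x1 is related to y.
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
rss = np.sum((y - A @ coef) ** 2)
tss = np.sum((y - y.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))  # overall F-statistic
# Because a real relationship exists, F comes out far above 1.
```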

17 (2) Determining the importance of each variable: Variable Selection
Criteria: Mallow's Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted R². But there are 2^p possible models, so in practice: forward selection, backward selection, mixed selection.

18 (3) Model Fit
In SLR, R² is the square of the correlation between the response and the explanatory variable; in MLR it equals Cor(Y, Ŷ)². A property of the fitted linear model: it maximizes this correlation among all possible linear models. The p-values quantify how much each variable improves R². Definition of RSE in MLR: RSE = √(RSS / (n − p − 1)). Thus, models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.

19 (4) Predictions
Even if we knew the true values of β0, β1, ..., βp, perfect prediction would be impossible because of the random error ε (i.e., the irreducible error). Hence both confidence intervals and prediction intervals are used.

20 Other Key Issues
Interaction terms, non-linear effects, multicollinearity, model selection.

21 Non-linear Models via Mathematical Transformation
<first-order model> One independent variable: y = β0 + β1x1 + ε. Two independent variables: y = β0 + β1x1 + β2x2 + ε.
<polynomial regression model> Contains squared, cubed, or higher powers of the predictor variable(s), and its response surfaces are curvilinear. Yet these are still special cases of the general linear model: y = β0 + β1x1 + β2x2 + ... + βkxk + ε.
<second-order model with one independent variable> y = β0 + β1x1 + β2x1² + ε.
<Quadratic model> Degree 2 (= a polynomial equation of degree 2) = a special case of the general linear model; curvilinear regression is obtained by recoding the data before the multiple regression analysis is attempted.
Quadratic form / quadratic curve: xᵀAx = [x1 x2] [[a, b], [c, d]] [x1, x2]ᵀ = a·x1² + (b + c)·x1x2 + d·x2².
Specialized non-linear models are discussed later.
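
The "recoding" idea above is concrete: add a squared column, then fit with ordinary multiple regression, since the quadratic model is still linear in its coefficients. A sketch on illustrative noiseless data:

```python
import numpy as np

# Illustrative data lying exactly on the quadratic y = 1 + 0.5*x + 2*x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 0.5 * x + 2.0 * x ** 2

# Recoding: add an x^2 column, then fit as an ordinary linear model,
# because the quadratic model is linear in its coefficients.
A = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```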

22 Model Transformation
Concepts: the exponential model (log / antilog), the inverse model, and Tukey's Ladder of Transformations.

23 Examples of Non-linear Relationships: polynomial regression.

24 Interaction in Regression Analysis
Examine the interaction term as a separate independent variable: an interaction predictor variable can be created by multiplying the data values of one variable by the values of another, y = β0 + β1x1 + β2x2 + β3x1x2 + ε (the x1x2 term = the interaction term). Even though the highest power of any single variable in this model is 1, it is considered a second-order equation because of the x1x2 term.
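
The interaction predictor really is just the elementwise product of the two variables; a sketch on simulated illustrative data with a genuine interaction effect:

```python
import numpy as np

# Simulated illustrative data with a genuine interaction effect.
rng = np.random.default_rng(1)
x1 = rng.uniform(0.0, 10.0, 40)
x2 = rng.uniform(0.0, 10.0, 40)
y = 2.0 + 1.0 * x1 + 0.5 * x2 + 0.3 * x1 * x2

# The interaction predictor is the elementwise product of x1 and x2.
A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Because the data are noiseless, the fit recovers all four coefficients, including the interaction coefficient 0.3.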

25 Model Building: Search Procedures
Goals of regression model development: (i) maximize the explained proportion of the variation of the y values; (ii) be as parsimonious as possible.
Search procedures:
All Possible Regressions: if a data set contains k independent variables, all possible regressions will examine 2^k − 1 different models.
Stepwise Regression: starts from a single predictor variable, then adds and deletes predictors one step at a time (steps 1, 2, 3, ...), examining the fit of the model at each step, until no more significant predictors remain outside the model.
Forward Selection: the same as stepwise regression, except that once a variable is entered into the process, it is never dropped.
Backward Elimination: starts from the full model and removes predictors one at a time.

26 Multicollinearity
= two or more independent variables are highly correlated (two: collinearity; several: multicollinearity). Consequences:
1. It is difficult to interpret the estimates of the regression coefficients.
2. Inordinately small t values for the regression coefficients may result.
3. The standard deviations of the regression coefficients are overestimated.
4. The algebraic sign of an estimated regression coefficient may be the opposite of what would be expected for a particular predictor.
The multicollinearity problem also affects the t values used to evaluate the regression coefficients: since multicollinearity can inflate the standard deviations of the coefficients, t values tend to be understated when it is present.
(Approaches) Examine the correlation matrix to search for possible intercorrelations; use stepwise regression to prevent the problem of multicollinearity.
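
One common numeric check, not named on the slide but following directly from the intercorrelation idea, is the variance inflation factor: regress each predictor on the others and compute 1 / (1 − R_j²). A sketch on simulated illustrative data:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of predictor j: 1 / (1 - R_j^2), where R_j^2
    comes from regressing column j on all the other predictors."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Illustrative predictors: x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])
# vif(X, 0) and vif(X, 1) come out large; vif(X, 2) stays near 1.
```

A rule of thumb is that VIF values well above 5 or 10 signal problematic multicollinearity.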

27 The Interaction Concept
Example: the effect on Y of increasing X1 depends on another variable X2. Advertising example: TV and radio advertising both increase sales. Sales = b0 + b1·TV + b2·Radio + b3·TV·Radio.
[JMP-style parameter-estimates table (Term / Estimate / Std Error / t Ratio / Prob>|t|): the numeric values were lost in transcription; the Intercept, TV, and TV*Radio rows are flagged significant at <.0001, and Radio is also flagged significant.]

28 Dummy Coding
Example: men and women (category listings). Code as indicator variables (dummy variables): Male = 0, Female = 1. Suppose we want to include income and gender: β2 = the average extra balance each month that females have, for a given income level. Males are the baseline.
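
A small sketch of this coding (the income/balance numbers are made up): the dummy column enters the design matrix like any other predictor, and its coefficient is the female effect at a given income level.

```python
import numpy as np

# Illustrative data: Male = 0 (baseline), Female = 1; balance depends on income,
# and females carry an extra 5 per month at any given income level.
income = np.array([30.0, 50.0, 70.0, 30.0, 50.0, 70.0])
female = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
balance = 10.0 + 0.2 * income + 5.0 * female

A = np.column_stack([np.ones(6), income, female])
coef, *_ = np.linalg.lstsq(A, balance, rcond=None)
# coef[2] recovers the female effect (the average extra balance), here 5.
```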

29 [Figure: fitted salary-vs-position lines for women and men with different intercepts but the same slope, from the regression salary = b0 + b1·position + b2·Gender_Female. The accompanying coefficient table (Coefficient / Std Err / t-value / p-value for Constant, Income, Gender_Female) lost its numeric values in transcription.]

30 Model Evaluation and Variable Selection
Concept: when the number of independent variables is large, reduce it to simplify the model; i.e., minimize MSE through fitting procedures that are alternatives to plain OLS. Motivation: prediction accuracy and model interpretability. Topics: Subset Selection (stepwise selection, choosing the optimal model) and Shrinkage Methods (ridge regression, the lasso).

31 1. Prediction Accuracy
When the relationship between X and Y is linear and n >> p, OLS has relatively low bias and low variance (where n = # of observations, p = # of predictors). But when n is not much larger than p, the OLS fit can have high variance and may result in overfitting and poor estimates on unseen observations; when n < p, the variability of the least squares fit increases dramatically, and the variance of these estimates is infinite.
2. Model Interpretability
When there are many independent variables X, many of them contribute little to Y. Leaving these variables in the model makes it harder to see the big picture, i.e., the effect of the important variables. The model would be easier to interpret by removing (i.e., setting the coefficients to zero) the unimportant variables.

32 Solutions
Subset Selection: identify a subset of the p explanatory variables X, then fit the model using only that subset. Examples: best subset selection, stepwise selection.
Shrinkage: shrink the estimated coefficients towards zero, which reduces variance. Some of the coefficients may shrink to exactly zero, so shrinkage methods can also perform variable selection. Examples: ridge regression, the lasso.
Dimension Reduction: project all p predictors into an M-dimensional space where M < p, then fit a linear regression model. Example: principal components regression.

33 Best Subset Selection
One simple approach = take the subset with the smallest RSS or the largest R². However, R² increases (equivalently, RSS decreases) every time a variable is added to the model, so these criteria always favor the largest model. (Example figure omitted.)

34 Measures of Comparison
Add a penalty to RSS for the number of variables (complexity). Criteria: adjusted R², AIC (Akaike information criterion), BIC (Bayesian information criterion), Cp (equivalent to AIC for linear regression).

35 Stepwise Selection
Background: best subset selection is computationally intensive, especially when we have a large number of predictors (large p). More attractive methods:
Forward Stepwise Selection: begin with the model containing no predictors, then add one predictor at a time, each time the one that improves the model the most, until no further improvement is possible.
Backward Stepwise Selection: begin with the model containing all predictors, then delete one predictor at a time, each time the one whose removal improves the model the most, until no further improvement is possible.
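
Forward stepwise selection can be sketched in a few lines; this toy version (illustrative, using raw RSS as the improvement criterion rather than AIC/BIC) greedily adds the predictor that most reduces RSS:

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy forward stepwise selection: at each step, add the predictor
    whose inclusion most reduces the RSS of the fitted model."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        def rss_with(j):
            cols = [np.ones(len(y))] + [X[:, c] for c in chosen + [j]]
            A = np.column_stack(cols)
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            return np.sum((y - A @ coef) ** 2)
        best = min(remaining, key=rss_with)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Simulated illustrative data: only columns 2 and 4 actually drive y.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 2] + 1.5 * X[:, 4] + rng.normal(scale=0.5, size=80)
```

With this data, the procedure picks the two true signal columns first, strongest signal first.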

36 Shrinkage Methods: Ridge Regression
Ordinary least squares (OLS) estimates β by minimizing RSS = Σ_i (y_i − β0 − Σ_j βj x_ij)². Ridge regression minimizes a slightly different quantity: RSS + λ Σ_j βj². The tuning parameter λ is a positive value; it has the effect of shrinking large values of βj towards zero. It turns out that such a constraint should improve the fit, because shrinking the coefficients can significantly reduce their variance. Notice that when λ = 0, we get OLS.
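
Ridge regression has a closed-form solution; a minimal sketch (assuming centered predictors so the unpenalized intercept separates out, with simulated illustrative data):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimal ridge sketch: minimize RSS + lam * sum(beta_j^2).
    Predictors are centered so the (unpenalized) intercept separates out;
    closed form: beta = (Xc'Xc + lam*I)^(-1) Xc'yc."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b0 = y.mean() - X.mean(axis=0) @ beta
    return b0, beta

# Simulated illustrative data.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=60)

b0_ols, beta_ols = ridge(X, y, 0.0)    # lam = 0 recovers plain OLS
b0_r, beta_r = ridge(X, y, 100.0)      # a large lam shrinks the coefficients
```

As the slide says, λ = 0 recovers OLS, and a large λ produces a coefficient vector with a strictly smaller norm.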

37 As λ increases, the standardized coefficients shrink towards 0.
Effects: OLS estimates generally have low bias but can be highly variable, in particular when n and p are of similar size or when n < p, in which case the OLS estimates are extremely variable. The penalty term makes the ridge regression estimates biased, but can also substantially reduce their variance.

38 Effects (continued)
In general, ridge regression estimates will be more biased than OLS but have lower variance. Ridge regression will work best in situations where the OLS estimates have high variance. If p is large, the best subset selection approach requires searching through an enormous number of possible models, whereas for ridge regression, for any given λ, we only need to fit one model, and the computations turn out to be very simple. Ridge regression can even be used when p > n!

39 The Lasso
Concept (background): ridge regression isn't perfect; its penalty term will never force any of the coefficients to be exactly zero, so the final model will include all variables, which makes it harder to interpret. The LASSO is similar, but its penalty term differs: ridge regression minimizes RSS + λ Σ_j βj², while the LASSO estimates β by minimizing RSS + λ Σ_j |βj|.
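
The slide does not specify a solver; coordinate descent with soft-thresholding is one common choice, sketched here under the assumption of standardized predictors and a centered response (simulated illustrative data):

```python
import numpy as np

def soft_threshold(z, g):
    # Shrink z toward zero by g; values inside [-g, g] become exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Minimal lasso sketch via coordinate descent (one common solver, not the
    only one): minimize (1/2)*RSS + lam * sum|beta_j|.
    Assumes standardized X columns and centered y, so no intercept is needed."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual without x_j
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j])
    return beta

# Simulated illustrative data: only the first predictor matters.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 4.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
y = y - y.mean()

beta = lasso(X, y, lam=50.0)
# Unlike ridge, the noise coefficients are set exactly to zero.
```

This illustrates the slide's main point: the L1 penalty zeroes out the irrelevant coefficients, so the lasso performs variable selection.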

40 Choosing the Tuning Parameter λ
Select a grid of candidate values; use cross-validation to estimate the test error for each value of λ, and select the value that gives the smallest error.
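
The grid-plus-cross-validation recipe above can be sketched as follows (simulated illustrative data; the ridge fit here omits the intercept since the simulated data are roughly centered):

```python
import numpy as np

def ridge_fit(Xtr, ytr, lam):
    # Closed-form ridge solution (no intercept; data assumed roughly centered).
    p = Xtr.shape[1]
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)

def cv_error(X, y, lam, k=5):
    """k-fold cross-validation estimate of test MSE for one value of lambda."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errs))

# Simulated illustrative data; try a small grid of lambda values.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=100)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(grid, key=lambda lam: cv_error(X, y, lam))
```

In practice one would use a finer grid and a library routine (e.g., scikit-learn's cross-validated estimators), but the logic is the same: estimate test error per λ and keep the minimizer.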


More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 704: Data Analysis I, Fall 2014 1 / 13 Chapter 8: Polynomials & Interactions

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 .0.0 5 5 1.0 7 5 X2 X2 7 1.5 1.0 0.5 3 1 2 Hierarchical clustering

More information

Multiple Linear Regression. Chapter 12

Multiple Linear Regression. Chapter 12 13 Multiple Linear Regression Chapter 12 Multiple Regression Analysis Definition The multiple regression model equation is Y = b 0 + b 1 x 1 + b 2 x 2 +... + b p x p + ε where E(ε) = 0 and Var(ε) = s 2.

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Multiple Regression Analysis. Part III. Multiple Regression Analysis

Multiple Regression Analysis. Part III. Multiple Regression Analysis Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant

More information

Lab 10 - Binary Variables

Lab 10 - Binary Variables Lab 10 - Binary Variables Spring 2017 Contents 1 Introduction 1 2 SLR on a Dummy 2 3 MLR with binary independent variables 3 3.1 MLR with a Dummy: different intercepts, same slope................. 4 3.2

More information

Chapter 7 Student Lecture Notes 7-1

Chapter 7 Student Lecture Notes 7-1 Chapter 7 Student Lecture Notes 7- Chapter Goals QM353: Business Statistics Chapter 7 Multiple Regression Analysis and Model Building After completing this chapter, you should be able to: Explain model

More information

Matematické Metody v Ekonometrii 7.

Matematické Metody v Ekonometrii 7. Matematické Metody v Ekonometrii 7. Multicollinearity Blanka Šedivá KMA zimní semestr 2016/2017 Blanka Šedivá (KMA) Matematické Metody v Ekonometrii 7. zimní semestr 2016/2017 1 / 15 One of the assumptions

More information

Variable Selection in Predictive Regressions

Variable Selection in Predictive Regressions Variable Selection in Predictive Regressions Alessandro Stringhi Advanced Financial Econometrics III Winter/Spring 2018 Overview This chapter considers linear models for explaining a scalar variable when

More information

Sociology 593 Exam 2 Answer Key March 28, 2002

Sociology 593 Exam 2 Answer Key March 28, 2002 Sociology 59 Exam Answer Key March 8, 00 I. True-False. (0 points) Indicate whether the following statements are true or false. If false, briefly explain why.. A variable is called CATHOLIC. This probably

More information

Statistics 262: Intermediate Biostatistics Model selection

Statistics 262: Intermediate Biostatistics Model selection Statistics 262: Intermediate Biostatistics Model selection Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Today s class Model selection. Strategies for model selection.

More information

Making sense of Econometrics: Basics

Making sense of Econometrics: Basics Making sense of Econometrics: Basics Lecture 4: Qualitative influences and Heteroskedasticity Egypt Scholars Economic Society November 1, 2014 Assignment & feedback enter classroom at http://b.socrative.com/login/student/

More information

Regression Models. Chapter 4. Introduction. Introduction. Introduction

Regression Models. Chapter 4. Introduction. Introduction. Introduction Chapter 4 Regression Models Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna 008 Prentice-Hall, Inc. Introduction Regression analysis is a very valuable tool for a manager

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 Hierarchical clustering Most algorithms for hierarchical clustering

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

4. Nonlinear regression functions

4. Nonlinear regression functions 4. Nonlinear regression functions Up to now: Population regression function was assumed to be linear The slope(s) of the population regression function is (are) constant The effect on Y of a unit-change

More information

Prediction of Bike Rental using Model Reuse Strategy

Prediction of Bike Rental using Model Reuse Strategy Prediction of Bike Rental using Model Reuse Strategy Arun Bala Subramaniyan and Rong Pan School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA. {bsarun, rong.pan}@asu.edu

More information

Introduction to Regression

Introduction to Regression Introduction to Regression ιατµηµατικό Πρόγραµµα Μεταπτυχιακών Σπουδών Τεχνο-Οικονοµικά Συστήµατα ηµήτρης Φουσκάκης Introduction Basic idea: Use data to identify relationships among variables and use these

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same

More information

STAT Chapter 11: Regression

STAT Chapter 11: Regression STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

The simple linear regression model discussed in Chapter 13 was written as

The simple linear regression model discussed in Chapter 13 was written as 1519T_c14 03/27/2006 07:28 AM Page 614 Chapter Jose Luis Pelaez Inc/Blend Images/Getty Images, Inc./Getty Images, Inc. 14 Multiple Regression 14.1 Multiple Regression Analysis 14.2 Assumptions of the Multiple

More information

Linear Models: Comparing Variables. Stony Brook University CSE545, Fall 2017

Linear Models: Comparing Variables. Stony Brook University CSE545, Fall 2017 Linear Models: Comparing Variables Stony Brook University CSE545, Fall 2017 Statistical Preliminaries Random Variables Random Variables X: A mapping from Ω to ℝ that describes the question we care about

More information

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont. TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted

More information

LECTURE 10: LINEAR MODEL SELECTION PT. 1. October 16, 2017 SDS 293: Machine Learning

LECTURE 10: LINEAR MODEL SELECTION PT. 1. October 16, 2017 SDS 293: Machine Learning LECTURE 10: LINEAR MODEL SELECTION PT. 1 October 16, 2017 SDS 293: Machine Learning Outline Model selection: alternatives to least-squares Subset selection - Best subset - Stepwise selection (forward and

More information

Regression. Simple Linear Regression Multiple Linear Regression Polynomial Linear Regression Decision Tree Regression Random Forest Regression

Regression. Simple Linear Regression Multiple Linear Regression Polynomial Linear Regression Decision Tree Regression Random Forest Regression Simple Linear Multiple Linear Polynomial Linear Decision Tree Random Forest Computational Intelligence in Complex Decision Systems 1 / 28 analysis In statistical modeling, regression analysis is a set

More information

Multiple Linear Regression CIVL 7012/8012

Multiple Linear Regression CIVL 7012/8012 Multiple Linear Regression CIVL 7012/8012 2 Multiple Regression Analysis (MLR) Allows us to explicitly control for many factors those simultaneously affect the dependent variable This is important for

More information

Föreläsning /31

Föreläsning /31 1/31 Föreläsning 10 090420 Chapter 13 Econometric Modeling: Model Speci cation and Diagnostic testing 2/31 Types of speci cation errors Consider the following models: Y i = β 1 + β 2 X i + β 3 X 2 i +

More information

Sociology Research Statistics I Final Exam Answer Key December 15, 1993

Sociology Research Statistics I Final Exam Answer Key December 15, 1993 Sociology 592 - Research Statistics I Final Exam Answer Key December 15, 1993 Where appropriate, show your work - partial credit may be given. (On the other hand, don't waste a lot of time on excess verbiage.)

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

The general linear regression with k explanatory variables is just an extension of the simple regression as follows

The general linear regression with k explanatory variables is just an extension of the simple regression as follows 3. Multiple Regression Analysis The general linear regression with k explanatory variables is just an extension of the simple regression as follows (1) y i = β 0 + β 1 x i1 + + β k x ik + u i. Because

More information