
Linear Regression

Prepare Data

To begin fitting a regression, put your data into a form that fitting functions expect. All regression techniques begin with input data in an array X and response data in a separate vector y, or input data in a table or dataset array tbl and response data as a column in tbl. Each row of the input data represents one observation. Each column represents one predictor (variable).

For a table or dataset array tbl, indicate the response variable with the 'ResponseVar' name-value pair:

mdl = fitlm(tbl,'ResponseVar','BloodPressure');
% or
mdl = fitglm(tbl,'ResponseVar','BloodPressure');

The response variable is the last column by default.

You can use numeric categorical predictors. A categorical predictor is one that takes values from a fixed set of possibilities.

For a numeric array X, indicate the categorical predictors using the 'Categorical' name-value pair. For example, to indicate that predictors 2 and 3 out of six are categorical:

mdl = fitlm(X,y,'Categorical',[2,3]);
% or
mdl = fitglm(X,y,'Categorical',[2,3]);
% or equivalently
mdl = fitlm(X,y,'Categorical',logical([0 1 1 0 0 0]));

For a table or dataset array tbl, fitting functions assume that these data types are categorical:

Logical
Categorical (nominal or ordinal)
String or character array

If you want to indicate that a numeric predictor is categorical, use the 'Categorical' name-value pair.

Represent missing numeric data as NaN. To represent missing data for other data types, see Missing Group Values.

Dataset Array for Input and Response Data

To create a dataset array from an Excel spreadsheet:

ds = dataset('XLSFile','hospital.xls',...
    'ReadObsNames',true);

To create a dataset array from workspace variables:

load carsmall
ds = dataset(MPG,Weight);
ds.Year = ordinal(Model_Year);

Table for Input and Response Data

To create a table from an Excel spreadsheet:

regression model workflow.html#btb50q6 1/28

tbl = readtable('hospital.xls',...
    'ReadRowNames',true);

To create a table from workspace variables:

load carsmall
tbl = table(MPG,Weight);
tbl.Year = ordinal(Model_Year);

Numeric Matrix for Input Data, Numeric Vector for Response

For example, to create numeric arrays from workspace variables:

load carsmall
X = [Weight Horsepower Cylinders Model_Year];
y = MPG;

To create numeric arrays from an Excel spreadsheet:

[X Xnames] = xlsread('hospital.xls');
y = X(:,4); % response y is systolic pressure
X(:,4) = []; % remove y from the X matrix

Notice that the nonnumeric entries, such as sex, do not appear in X.

Choose a Fitting Method

There are three ways to fit a model to data:

Least Squares Fit

Use fitlm to construct a least-squares fit of a model to the data. This method is best when you are reasonably certain of the model's form, and mainly need to find its parameters. This method is also useful when you want to explore a few models. The method requires you to examine the data manually to discard outliers, though there are techniques to help (see Residuals Model Quality for Training Data).

Robust Fit

Use fitlm with the 'RobustOpts' name-value pair to create a model that is little affected by outliers. Robust fitting saves you the trouble of manually discarding outliers. However, step does not work with robust fitting. This means that when you use robust fitting, you cannot search stepwise for a good model.

Stepwise Fit

Use stepwiselm to find a model, and fit parameters to the model. stepwiselm starts from one model, such as a constant, and adds or subtracts terms one at a time, choosing an optimal term each time in a greedy fashion, until it cannot improve further. Use stepwise fitting to find a good model, which is one that has only relevant terms. The result depends on the starting model. Usually, starting with a constant model leads to a small model. Starting with more terms can lead to a more complex model, but one that has lower mean squared error.
See Compare large and small stepwise models. You cannot use robust options along with stepwise fitting. So after a stepwise fit, examine your model for outliers (see Residuals Model Quality for Training Data).

Choose a Model or Range of Models

There are several ways of specifying a model for linear regression. Use whichever you find most convenient.
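As background for all of these specifications, the least-squares fit described above reduces to one linear algebra problem: augment the predictors with an intercept column and minimize the sum of squared residuals. The following is a minimal NumPy sketch of that computation (an illustration with synthetic data, not Statistics Toolbox code); with noise-free data the true coefficients are recovered exactly.

```python
import numpy as np

# Synthetic, noise-free data with known coefficients (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1]

# What a least-squares fit computes: solve min ||y - Xd*b||^2.
Xd = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # [intercept, b1, b2]
```

In MATLAB the same computation is the backslash operator, beta = [ones(n,1) X] \ y; fitlm adds the diagnostics on top of this solve.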

For fitlm, the model specification you give is the model that is fit. If you do not give a model specification, the default is 'linear'.

For stepwiselm, the model specification you give is the starting model, which the stepwise procedure tries to improve. If you do not give a model specification, the default starting model is 'constant', and the default upper bounding model is 'interactions'. Change the upper bounding model using the 'Upper' name-value pair.

Note: There are other ways of selecting models, such as using lasso, lassoglm, sequentialfs, or plsregress.

Brief String

'constant': Model contains only a constant (intercept) term.
'linear': Model contains an intercept and linear terms for each predictor.
'interactions': Model contains an intercept, linear terms, and all products of pairs of distinct predictors (no squared terms).
'purequadratic': Model contains an intercept, linear terms, and squared terms.
'quadratic': Model contains an intercept, linear terms, interactions, and squared terms.
'polyijk': Model is a polynomial with all terms up to degree i in the first predictor, degree j in the second predictor, etc. Use numerals 0 through 9. For example, 'poly2111' has a constant plus all linear and product terms, and also contains terms with predictor 1 squared.

For example, to specify an interaction model using fitlm with matrix predictors:

mdl = fitlm(X,y,'interactions');

To specify a model using stepwiselm and a table or dataset array tbl of predictors, suppose you want to start from a constant and have a linear model upper bound. Assume the response variable in tbl is in the third column.

mdl2 = stepwiselm(tbl,'constant',...
    'Upper','linear','ResponseVar',3);

Terms Matrix

A terms matrix is a t-by-(p + 1) matrix specifying terms in a model, where t is the number of terms, p is the number of predictor variables, and plus one is for the response variable. The value of T(i,j) is the exponent of variable j in term i.
Suppose there are three predictor variables A, B, and C:

[0 0 0 0] % Constant term or intercept
[0 1 0 0] % B; equivalently, A^0 * B^1 * C^0
[1 0 1 0] % A*C
[2 0 0 0] % A^2
[0 1 2 0] % B*(C^2)

The 0 at the end of each term represents the response variable. In general, if you have the variables in a table or dataset array, then the column of 0s for the response variable must appear in the position corresponding to the response variable. The following example illustrates this.

Load the sample data and define the dataset array.

load hospital
ds = dataset(hospital.Sex,hospital.BloodPressure(:,1),hospital.Age,...
    hospital.Smoker,'VarNames',{'Sex','BloodPressure','Age','Smoker'});

Represent the linear model 'BloodPressure ~ 1 + Sex + Age + Smoker' in a terms matrix. The response variable is in the second column of the dataset array, so there must be a column of 0s for the response variable in the second column of the terms matrix.

T = [0 0 0 0;1 0 0 0;0 0 1 0;0 0 0 1]

Redefine the dataset array.

ds = dataset(hospital.BloodPressure(:,1),hospital.Sex,hospital.Age,...
    hospital.Smoker,'VarNames',{'BloodPressure','Sex','Age','Smoker'});

Now, the response variable is the first term in the dataset array. Specify the same linear model, 'BloodPressure ~ 1 + Sex + Age + Smoker', using a terms matrix.

T = [0 0 0 0;0 1 0 0;0 0 1 0;0 0 0 1]

If you have the predictor and response variables in a matrix and column vector, then you must include 0 for the response variable at the end of each term. The following example illustrates this.

Load the sample data and define the matrix of predictors.

load carsmall
X = [Acceleration,Weight];

Specify the model 'MPG ~ Acceleration + Weight + Acceleration:Weight + Weight^2' using a terms matrix and fit the model to the data. This model includes the main effect and two-way interaction terms for the variables Acceleration and Weight, and a second-order term for the variable Weight.

T = [0 0 0;1 0 0;0 1 0;1 1 0;0 2 0]
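Each row of a terms matrix like this expands into one design-matrix column: raise each predictor to the listed exponent and take the product. A small NumPy sketch of that expansion (expand_terms is an illustrative helper, not a Toolbox function):

```python
import numpy as np

def expand_terms(X, T):
    """Build a design matrix from data X (n-by-p) and a terms matrix T
    (t-by-(p+1)); the trailing response column of each row is ignored."""
    cols = []
    for row in T:
        exps = row[:-1]                       # drop the response-variable 0
        cols.append(np.prod(X ** exps, axis=1))
    return np.column_stack(cols)

X = np.array([[2.0, 3.0]])                    # one observation: x1 = 2, x2 = 3
T = np.array([[0, 0, 0],                      # constant
              [1, 0, 0],                      # x1
              [0, 1, 0],                      # x2
              [1, 1, 0],                      # x1*x2
              [0, 2, 0]])                     # x2^2
D = expand_terms(X, T)                        # -> [[1., 2., 3., 6., 9.]]
```

The row [0 2 0], for example, becomes x1^0 * x2^2 = 9 for this observation, which is exactly the Weight^2 term above.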

Fit a linear model.

mdl = fitlm(X,MPG,T)

mdl =

Linear regression model:
    y ~ 1 + x1*x2 + x2^2

Estimated Coefficients: (table lost in transcription)

Number of observations: 94, Error degrees of freedom: 89
Root Mean Squared Error: 4.1
R-squared: 0.751
F-statistic vs. constant model: 67, p-value = 4.99e-26

Only the intercept and x2 term, which correspond to the Weight variable, are significant at the 5% significance level.

Now, perform a stepwise regression with a constant model as the starting model and a linear model with interactions as the upper model.

T = [0 0 0;1 0 0;0 1 0;1 1 0];
mdl = stepwiselm(X,MPG,[0 0 0],'Upper',T)

1. Adding x2, FStat = 259, pValue = 1.64e-28

mdl =

Linear regression model:
    y ~ 1 + x2

Estimated Coefficients: (table lost in transcription)

Number of observations: 94, Error degrees of freedom: 92
Root Mean Squared Error: 4.13
R-squared: 0.738
F-statistic vs. constant model: 259, p-value = 1.64e-28

The results of the stepwise regression are consistent with the results of fitlm in the previous step.

Formula

A formula for a model specification is a string of the form 'Y ~ terms', where Y is the response name.

terms contains:

Variable names
+ to include the next variable
- to exclude the next variable
: to define an interaction, a product of terms
* to define an interaction and all lower-order terms
^ to raise the predictor to a power, exactly as in * repeated, so ^ includes lower-order terms as well
() to group terms

Tip: Formulas include a constant (intercept) term by default. To exclude a constant term from the model, include -1 in the formula.

Examples:

'Y ~ A + B + C' is a three-variable linear model with intercept.
'Y ~ A + B + C - 1' is a three-variable linear model without intercept.
'Y ~ A + B + C + B^2' is a three-variable model with intercept and a B^2 term.
'Y ~ A + B^2 + C' is the same as the previous example, since B^2 includes a B term.
'Y ~ A + B + C + A:B' includes an A*B term.
'Y ~ A*B + C' is the same as the previous example, since A*B = A + B + A:B.
'Y ~ A*B*C - A:B:C' has all interactions among A, B, and C, except the three-way interaction.
'Y ~ A*(B + C + D)' has all linear terms, plus products of A with each of the other variables.

For example, to specify an interaction model using fitlm with matrix predictors:

mdl = fitlm(X,y,'y ~ x1*x2*x3 - x1:x2:x3');

To specify a model using stepwiselm and a table or dataset array tbl of predictors, suppose you want to start from a constant and have a linear model upper bound. Assume the response variable in tbl is named 'y', and the predictor variables are named 'x1', 'x2', and 'x3'.

mdl2 = stepwiselm(tbl,'y ~ 1','Upper','y ~ x1 + x2 + x3');

Fit Model to Data

The most common optional arguments for fitting:

For robust regression in fitlm, set the 'RobustOpts' name-value pair to 'on'.

Specify an appropriate upper bound model in stepwiselm, such as setting 'Upper' to 'linear'.

Indicate which variables are categorical using the 'CategoricalVars' name-value pair. Provide a vector with column numbers, such as [1 6] to specify that predictors 1 and 6 are categorical.
Alternatively, give a logical vector the same length as the data columns, with a 1 entry indicating that the variable is categorical. If there are seven predictors, and predictors 1 and 6 are categorical, specify logical([1,0,0,0,0,1,0]).

For a table or dataset array, specify the response variable using the 'ResponseVar' name-value pair. The default is the last column in the array.

For example,

mdl = fitlm(X,y,'linear',...
    'RobustOpts','on','CategoricalVars',3);
mdl2 = stepwiselm(tbl,'constant',...
    'ResponseVar','MPG','Upper','quadratic');

Examine Quality and Adjust the Fitted Model

After fitting a model, examine the result.
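The model display examined in the following sections reports a standard error and t statistic for each coefficient, computed from the usual OLS formulas: with residual variance s^2 = SSE/(n - p), the coefficient covariance is s^2 (X'X)^-1, and each t statistic is the estimate divided by its standard error. A NumPy sketch of those quantities (synthetic data; obtaining the displayed p-values would additionally require the t distribution's CDF):

```python
import numpy as np

# Same construction as the Model Display example: five predictors,
# true coefficients [1, 0, 3, 0, -1], no intercept term in the data.
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 5))
y = X @ np.array([1.0, 0.0, 3.0, 0.0, -1.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])          # intercept plus 5 predictors
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta
dof = n - Xd.shape[1]                          # error degrees of freedom = 94
s2 = resid @ resid / dof                       # mean squared error
se = np.sqrt(np.diag(s2 * np.linalg.inv(Xd.T @ Xd)))
tstat = beta / se                              # large |t| -> small p-value
```

Coefficients whose true value is nonzero come out with large |t|, which is why the display flags predictors 1, 3, and 5 as significant.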

Model Display

A linear regression model shows several diagnostics when you enter its name or enter disp(mdl). This display gives some of the basic information to check whether the fitted model represents the data adequately.

For example, fit a linear model to data constructed with two out of five predictors not present and with no intercept term:

X = randn(100,5);
y = X*[1;0;3;0;-1] + randn(100,1);
mdl = fitlm(X,y)

mdl =

Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4 + x5

Estimated Coefficients: (table lost in transcription)

Number of observations: 100, Error degrees of freedom: 94
R-squared: 0.93
F-statistic vs. constant model: 248, p-value = 1.5e-52

Notice that:

There is a standard error column for the coefficient estimates.

The display contains the estimated values of each coefficient in the Estimate column. These values are reasonably near the true values [0;1;0;3;0;-1].

The reported p-values (which are derived from the t statistics under the assumption of normal errors) for predictors 1, 3, and 5 are extremely small. These are the three predictors that were used to create the response data y.

The p-values for (Intercept), x2, and x4 are much larger than 0.05. These three predictors were not used to create the response data y.

The display contains R-squared, adjusted R-squared, and F statistics.

ANOVA

To examine the quality of the fitted model, consult an ANOVA table. For example, use anova on a linear model with five predictors:

X = randn(100,5);
y = X*[1;0;3;0;-1] + randn(100,1);
mdl = fitlm(X,y);
tbl = anova(mdl)

tbl = (table lost in transcription; columns SumSq, DF, MeanSq, F, pValue for x1 through x5 and Error)

This table gives somewhat different results than the default display (see Model Display). The table clearly shows that the effects of x2 and x4 are not significant. Depending on your goals, consider removing x2 and x4 from the model.

Diagnostic Plots

Diagnostic plots help you identify outliers, and see other problems in your model or fit. For example, load the carsmall data, and make a model of MPG as a function of Cylinders (nominal) and Weight:

load carsmall
tbl = table(Weight,MPG,Cylinders);
tbl.Cylinders = ordinal(tbl.Cylinders);
mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');

Make a leverage plot of the data and model.

plotDiagnostics(mdl)

There are a few points with high leverage. But this plot does not reveal whether the high-leverage points are outliers. Look for points with large Cook's distance.

plotDiagnostics(mdl,'cookd')
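The Cook's distance this plot displays combines each observation's residual and leverage: D_i = r_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2), where h_ii is the i-th diagonal of the hat matrix and p the number of coefficients. A NumPy sketch of that formula on synthetic data with one planted outlier (this mirrors, but is not, the Toolbox computation):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[10] += 8.0                                    # plant one outlier

Xd = np.column_stack([np.ones(n), x])
H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T        # hat matrix
h = np.diag(H)                                  # leverages
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
r = y - Xd @ beta                               # raw residuals
p = Xd.shape[1]
s2 = r @ r / (n - p)                            # mean squared error
cookd = r**2 * h / (p * s2 * (1 - h) ** 2)      # Cook's distance per point
```

The planted outlier dominates cookd, which is the pattern the 'cookd' diagnostic plot makes visible.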

There is one point with large Cook's distance. Identify it and remove it from the model. You can use the Data Cursor to click the outlier and identify it, or identify it programmatically:

[~,larg] = max(mdl.Diagnostics.CooksDistance);
mdl2 = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2',...
    'Exclude',larg);

Residuals Model Quality for Training Data

There are several residual plots to help you discover errors, outliers, or correlations in the model or data. The simplest residual plots are the default histogram plot, which shows the range of the residuals and their frequencies, and the probability plot, which shows how the distribution of the residuals compares to a normal distribution with matched variance.

Load the carsmall data, and make a model of MPG as a function of Cylinders (nominal) and Weight:

load carsmall
tbl = table(Weight,MPG,Cylinders);
tbl.Cylinders = ordinal(tbl.Cylinders);
mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');

Examine the residuals:

plotResiduals(mdl)

The observations above 12 are potential outliers.

plotResiduals(mdl,'probability')

The two potential outliers appear on this plot as well. Otherwise, the probability plot seems reasonably straight, meaning a reasonable fit to normally distributed residuals.

You can identify the two outliers and remove them from the data:

outl = find(mdl.Residuals.Raw > 12)

outl = (the two observation numbers, lost in transcription)

To remove the outliers, use the 'Exclude' name-value pair:

mdl2 = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2',...
    'Exclude',outl);

Examine a residuals plot of mdl2:

plotResiduals(mdl2)

The new residuals plot looks fairly symmetric, without obvious problems. However, there might be some serial correlation among the residuals. Create a new plot to see if such an effect exists.

plotResiduals(mdl2,'lagged')

The scatter plot shows many more crosses in the upper-right and lower-left quadrants than in the other two quadrants, indicating positive serial correlation among the residuals.

Another potential issue is when residuals are large for large observations. See if the current model has this issue.

plotResiduals(mdl2,'fitted')

There is some tendency for larger fitted values to have larger residuals. Perhaps the model errors are proportional to the measured values.

Plots to Understand Predictor Effects

This example shows how to understand the effect each predictor has on a regression model using a variety of available plots.

1. Create a model of mileage from some predictors in the carsmall data.

load carsmall
tbl = table(Weight,MPG,Cylinders);
tbl.Cylinders = ordinal(tbl.Cylinders);
mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');

2. Examine a slice plot of the responses. This displays the effect of each predictor separately.

plotSlice(mdl)

You can drag the individual predictor values, which are represented by dashed blue vertical lines. You can also choose between simultaneous and non-simultaneous confidence bounds, which are represented by dashed red curves.

3. Use an effects plot to show another view of the effect of predictors on the response.

plotEffects(mdl)

This plot shows that changing Weight from about 2500 to 4732 lowers MPG by about 30 (the location of the upper blue circle). It also shows that changing the number of cylinders from 8 to 4 raises MPG by about 10 (the lower blue circle). The horizontal blue lines represent confidence intervals for these predictions. The predictions come from averaging over one predictor as the other is changed. In cases such as this, where the two predictors are correlated, be careful when interpreting the results.

4. Instead of viewing the effect of averaging over a predictor as the other is changed, examine the joint interaction in an interaction plot.

plotInteraction(mdl,'Weight','Cylinders')

The interaction plot shows the effect of changing one predictor with the other held fixed. In this case, the plot is much more informative. It shows, for example, that lowering the number of cylinders in a relatively light car (Weight = 1795) leads to an increase in mileage, but lowering the number of cylinders in a relatively heavy car (Weight = 4732) leads to a decrease in mileage.

5. For an even more detailed look at the interactions, look at an interaction plot with predictions. This plot holds one predictor fixed while varying the other, and plots the effect as a curve. Look at the interactions for various fixed numbers of cylinders.

plotInteraction(mdl,'Cylinders','Weight','predictions')

Now look at the interactions with various fixed levels of weight.

plotInteraction(mdl,'Weight','Cylinders','predictions')

Plots to Understand Terms Effects

This example shows how to understand the effect of each term in a regression model using a variety of available plots.

1. Create a model of mileage from some predictors in the carsmall data.

load carsmall
tbl = table(Weight,MPG,Cylinders);
tbl.Cylinders = ordinal(tbl.Cylinders);
mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');

2. Create an added variable plot with Weight^2 as the added variable.

plotAdded(mdl,'Weight^2')

This plot shows the results of fitting both Weight^2 and MPG to the terms other than Weight^2. The reason to use plotAdded is to understand what additional improvement in the model you get by adding Weight^2. The coefficient of a line fit to these points is the coefficient of Weight^2 in the full model. The Weight^2 predictor is just over the edge of significance (p-value < 0.05), as you can see in the coefficients table display. You can see that in the plot as well. The confidence bounds look like they could not contain a horizontal line (constant y), so a zero-slope model is not consistent with the data.

3. Create an added variable plot for the model as a whole.

plotAdded(mdl)
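The statement above that the slope of a line fit to the added variable plot equals that term's coefficient in the full model is the Frisch-Waugh-Lovell theorem: regress y and the added predictor each on the remaining predictors, and a simple fit between the two residual series recovers the full-model coefficient. A NumPy check on synthetic data (illustrative, not the plotAdded implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
beta_full, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # full-model coefficients

# Partial out everything except the last predictor (column 3 of Xd).
others = Xd[:, [0, 1, 2]]
def residualize(v):
    b, *_ = np.linalg.lstsq(others, v, rcond=None)
    return v - others @ b

ry = residualize(y)                  # y with the other terms removed
rx = residualize(Xd[:, 3])           # added predictor with them removed
slope = (rx @ ry) / (rx @ rx)        # simple regression through the residuals
```

The slope of that residual-vs-residual fit agrees with beta_full[3] to machine precision, which is exactly what the added variable plot visualizes.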

The model as a whole is very significant, so the bounds don't come close to containing a horizontal line. The slope of the line is the slope of a fit to the predictors projected onto their best-fitting direction, or in other words, the norm of the coefficient vector.

Change Models

There are two ways to change a model:

addTerms and removeTerms: Add or remove specified terms. Give the terms in any of the forms described in Choose a Model or Range of Models.

step: Add or subtract terms one at a time, where step chooses the most important term to add or remove. If you created a model using stepwiselm, step can have an effect only if you give different upper or lower models. step does not work when you fit a model using 'RobustOpts'.

For example, start with a linear model of mileage from the carbig data:

load carbig
tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG);
mdl = fitlm(tbl,'linear','ResponseVar','MPG')

mdl =

Linear regression model:
    MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight

Estimated Coefficients: (table lost in transcription)

Number of observations: 392, Error degrees of freedom: 387
Root Mean Squared Error: 4.25
R-squared: 0.707
F-statistic vs. constant model: 233, p-value = 9.63e-102

Try to improve the model using step for up to 10 steps:

mdl1 = step(mdl,'NSteps',10)

1. Adding Displacement:Horsepower (FStat and pValue lost in transcription; pValue on the order of 1e-19)

mdl1 =

Linear regression model:
    MPG ~ 1 + Acceleration + Weight + Displacement*Horsepower

Estimated Coefficients: (table lost in transcription)

Number of observations: 392, Error degrees of freedom: 386
Root Mean Squared Error: 3.84
R-squared: 0.761
F-statistic vs. constant model: 246, p-value = 1.32e-117

step stopped after just one change.

To try to simplify the model, remove the Acceleration and Weight terms from mdl1:

mdl2 = removeTerms(mdl1,'Acceleration + Weight')

mdl2 =

Linear regression model:
    MPG ~ 1 + Displacement*Horsepower

Estimated Coefficients: (table lost in transcription)

Number of observations: 392, Error degrees of freedom: 388
Root Mean Squared Error: 3.94
R-squared: 0.747
F-statistic vs. constant model: 381, p-value = 3e-115

mdl2 uses just Displacement and Horsepower, and has nearly as good a fit to the data as mdl1 in the Adjusted R-Squared metric.

Predict or Simulate Responses to New Data

There are three ways to use a linear model to predict or simulate the response to new data:

predict

This example shows how to predict and obtain confidence intervals on the predictions using the predict method.

1. Load the carbig data and make a default linear model of the response MPG to the Acceleration, Displacement, Horsepower, and Weight predictors.

load carbig
X = [Acceleration,Displacement,Horsepower,Weight];
mdl = fitlm(X,MPG);

2. Create a three-row array of predictors from the minimal, mean, and maximal values. There are some NaN values, so use functions that ignore NaN values.

Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data

3. Find the predicted model responses and confidence intervals on the predictions.

[NewMPG NewMPGCI] = predict(mdl,Xnew)

(predicted values and confidence intervals lost in transcription)

The confidence bound on the mean response is narrower than those for the minimum or maximum responses, which is quite sensible.

feval

When you construct a model from a table or dataset array, feval is often more convenient for predicting mean responses than predict. However, feval does not provide confidence bounds. This example shows how to predict mean responses using the feval method.

1. Load the carbig data and make a default linear model of the response MPG to the Acceleration, Displacement, Horsepower, and Weight predictors.

load carbig
tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG);
mdl = fitlm(tbl,'linear','ResponseVar','MPG');

2. Create a three-row array of predictors from the minimal, mean, and maximal values. There are some NaN values, so use functions that ignore NaN values.

X = [Acceleration,Displacement,Horsepower,Weight];
Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data

The Xnew array has the correct number of columns for prediction, so feval can use it for predictions.

3. Find the predicted model responses.

NewMPG = feval(mdl,Xnew)

NewMPG = (values lost in transcription)

random

The random method simulates new random response values, equal to the mean prediction plus a random disturbance with the same variance as the training data.

This example shows how to simulate responses using the random method.

1. Load the carbig data and make a default linear model of the response MPG to the Acceleration, Displacement, Horsepower, and Weight predictors.

load carbig
X = [Acceleration,Displacement,Horsepower,Weight];
mdl = fitlm(X,MPG);

2. Create a three-row array of predictors from the minimal, mean, and maximal values. There are some NaN values, so use functions that ignore NaN values.

Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data

3. Generate new predicted model responses including some randomness.

rng('default') % for reproducibility
NewMPG = random(mdl,Xnew)

(values lost in transcription)

Because a negative value of MPG does not seem sensible, try predicting two more times.

NewMPG = random(mdl,Xnew)

NewMPG = random(mdl,Xnew)
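The simulation rule described above, mean prediction plus a normal disturbance with the training variance, can be sketched directly. In this NumPy illustration, predict_mean and sigma are hypothetical stand-ins for the fitted model's mean prediction and root mean squared error; they are not part of any Toolbox API.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate(predict_mean, Xnew, sigma, rng):
    """Mean prediction plus N(0, sigma^2) noise, one draw per new row."""
    mu = predict_mean(Xnew)
    return mu + sigma * rng.normal(size=len(Xnew))

# Hypothetical fitted line y = 2 + 3x standing in for a fitted model.
predict_mean = lambda X: 2.0 + 3.0 * X[:, 0]
Xnew = np.array([[0.0], [1.0], [2.0]])
draws = np.array([simulate(predict_mean, Xnew, 0.5, rng) for _ in range(4000)])
# Across many draws the average tends to the mean prediction [2, 5, 8].
```

Individual draws scatter around the mean prediction, which is why repeated calls to random above return different values for the same Xnew.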

Clearly, the predictions for the third (maximal) row of Xnew are not reliable.

Share Fitted Models

Suppose you have a linear regression model, such as mdl from the following commands:

load carbig
tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG);
mdl = fitlm(tbl,'linear','ResponseVar','MPG');

To share the model with other people, you can:

Provide the model display.

mdl

mdl =

Linear regression model:
    MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight

Estimated Coefficients: (table lost in transcription)

Number of observations: 392, Error degrees of freedom: 387
Root Mean Squared Error: 4.25
R-squared: 0.707
F-statistic vs. constant model: 233, p-value = 9.63e-102

Provide just the model definition and coefficients.

mdl.CoefficientNames

ans =

    '(Intercept)' 'Acceleration' 'Displacement' 'Horsepower' 'Weight'

mdl.Coefficients.Estimate

ans = (values lost in transcription)

mdl.Formula

ans =

MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight

Linear Regression Workflow

This example shows how to fit a linear regression model. A typical workflow involves the following: import data, fit a regression, test its quality, modify it to improve the quality, and share it.

Step 1. Import the data into a table.

hospital.xls is an Excel spreadsheet containing patient names, sex, age, weight, blood pressure, and dates of treatment in an experimental protocol. First read the data into a table.

patients = readtable('hospital.xls',...
    'ReadRowNames',true);

Examine the first row of data.

patients(1,:)

ans = (row display garbled in transcription; columns are name, sex, age, wgt, smoke, sys, dia, trial1, trial2, trial3)

The sex and smoke fields seem to have two choices each. So change these fields to nominal.

patients.smoke = nominal(patients.smoke,{'No','Yes'});
patients.sex = nominal(patients.sex);

Step 2. Create a fitted model.

Your goal is to model the systolic pressure as a function of a patient's age, weight, sex, and smoking status. Create a linear formula for 'sys' as a function of 'age', 'wgt', 'sex', and 'smoke'.

modelspec = 'sys ~ age + wgt + sex + smoke';
mdl = fitlm(patients,modelspec)

mdl =

Linear regression model:
    sys ~ 1 + sex + age + wgt + smoke

Estimated Coefficients: (table lost in transcription)

Number of observations: 100, Error degrees of freedom: 95
Root Mean Squared Error: 4.81
R-squared: 0.508
F-statistic vs. constant model: 24.5, p-value = 5.99e-14

The sex, age, and weight predictors have rather high p-values, indicating that some of these predictors might be unnecessary.

Step 3. Locate and remove outliers.

See if there are outliers in the data that should be excluded from the fit. Plot the residuals.

plotResiduals(mdl)

There is one possible outlier, with a value greater than 12. This is probably not truly an outlier. For demonstration, here is how to find and remove it.

Find the outlier.

outlier = mdl.Residuals.Raw > 12;
find(outlier)

ans =

    84

Remove the outlier.

mdl = fitlm(patients,modelspec,...
    'Exclude',84);

mdl.ObservationInfo(84,:)

ans =

          Weights    Excluded    Missing    Subset
    WXM   1          true        false      false

Observation 84 is no longer in the model.

Step 4. Simplify the model.

Try to obtain a simpler model, one with fewer predictors but the same predictive accuracy. step looks for a better model by adding or removing one term at a time. Allow step to take up to 10 steps.

mdl1 = step(mdl,'NSteps',10)

1. Removing wgt (FStat and pValue lost in transcription)
2. Removing sex (FStat and pValue lost in transcription)

mdl1 =

Linear regression model:
    sys ~ 1 + age + smoke

Estimated Coefficients: (table lost in transcription)

Number of observations: 99, Error degrees of freedom: 96
Root Mean Squared Error: 4.61
R-squared: 0.536
F-statistic vs. constant model: 55.4, p-value = 1.02e-16

step took two steps. This means it could not improve the model further by adding or subtracting a single term.

Plot the effectiveness of the simpler model on the training data.

plotResiduals(mdl1)

The residuals look about as small as those of the original model.
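The FStat that step reports when it considers removing a term is a nested-model F test: compare the sum of squared errors of the full and reduced fits, F = ((SSE_reduced - SSE_full)/q) / (SSE_full/dof). A NumPy sketch on synthetic data shaped like this example, where wgt is truly irrelevant to sys (all names and data here are illustrative, not the hospital data):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 99
age = rng.uniform(20, 80, n)
wgt = rng.uniform(50, 100, n)
y = 100.0 + 0.2 * age + rng.normal(size=n)       # wgt plays no role

def sse(Xd, y):
    """Sum of squared errors of the least-squares fit of y on Xd."""
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ b
    return r @ r

full = np.column_stack([np.ones(n), age, wgt])   # sys ~ 1 + age + wgt
reduced = np.column_stack([np.ones(n), age])     # candidate: drop wgt
q = 1                                            # one term removed
dof = n - full.shape[1]
F = ((sse(reduced, y) - sse(full, y)) / q) / (sse(full, y) / dof)
# Small F: removing wgt barely hurts the fit, so step would drop it.
```

A large F (small p-value) means the term carries real explanatory power and step keeps it; the tiny F values for wgt and sex are why step removed them above.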

Step 5. Predict responses to new data.

Suppose you have four new people, aged 25, 30, 40, and 65, and the first and third smoke. Predict their systolic pressure using mdl1.

ages = [25;30;40;65];
smoker = {'Yes';'No';'Yes';'No'};
systolicnew = feval(mdl1,ages,smoker)

systolicnew = (values lost in transcription)

To make predictions, you need only the variables that mdl1 uses.

Step 6. Share the model.

You might want others to be able to use your model for prediction. Access the terms in the linear model.

coefnames = mdl1.CoefficientNames

coefnames =

    '(Intercept)' 'age' 'smoke_Yes'

View the model formula.

mdl1.Formula

ans =

sys ~ 1 + age + smoke

Access the coefficients of the terms.

coefvals = mdl1.Coefficients(:,1); % table
coefvals = table2array(coefvals)

coefvals = (values lost in transcription)

The model is sys = coefvals(1) + coefvals(2)*age + coefvals(3)*smoke, where smoke is 1 for a smoker, and 0 otherwise.



More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information