Introduction

The purpose of this document is to solve three exercises, or more specifically, to analyze three sets of data, as an examination assignment in the course Linear Models and Extensions. The exercises are taken from the book Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models by Julian J. Faraway (2006). The numbering in this document is consistent with the book's numbering; hence exercise 5:2 is the second exercise in chapter 5.

A generalized linear model, GLM, is, as the name suggests, a generalization of the standard linear regression model. Instead of directly associating the mean response, μ, with the predictor variables in a linear way, a GLM connects the covariates x_i to a linear predictor, η_i = β0 + β1·x_i1 + ... + βq·x_iq, which is in turn a function of the mean response. This is done because there may be some restriction on the response, for example that μ is a count or a proportion. By choosing an appropriate link function, g, we can make sure that only valid values are assigned to the response, for example a value between 0 and 1 if the response is a proportion. The link function thus describes how the response is linked to the covariates through the linear predictor: η = g(μ).

This assignment treats three generalizations of the standard linear model. The first exercise concerns binomial data, or more specifically binary data; the second exercise treats ordered multinomial responses; and in the third exercise the linear model is extended to include both fixed and random effects in a so-called split-plot design. The statistical software R is used to analyze the datasets in all three exercises, and the code can be found in the Appendix, once again with consistent numbering. Complete datasets are not included in this document due to their size, but they are available in the R package faraway, and references for the datasets are given at the end of this document.

Exercise 2:2

Background: The dataset wbca comes from a study of breast cancer in Wisconsin. It consists of medical data from 681 women with potentially cancerous tumors, of which 238 are actually malignant while the remaining 443 are benign. Determining whether a tumor is really malignant has traditionally required an invasive surgical procedure. The purpose of this study was to determine whether a new procedure, called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumor status. The response variable in the dataset is Class, a binary variable that is 0 if the tumor is malignant and 1 if it is benign. The nine predictor variables are: Adhes - marginal adhesion, BNucl - bare nuclei, Chrom - bland chromatin, Epith - epithelial cell size, Mitos - mitoses, NNucl - normal nucleoli, Thick - clump thickness, UShap - cell shape uniformity and USize - cell size uniformity. A doctor who has examined the cells from the small tissue sample determines these predictor values by rating them on a scale from 1 to 10 with respect to the particular characteristic: a predictor is assigned the value 1 if it is normal and the value 10 if it is most abnormal. The first six lines of the dataset have the following variables (the values themselves are omitted here):

Class Adhes BNucl Chrom Epith Mitos NNucl Thick UShap USize
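To make the link-function idea concrete, here is a minimal R sketch (it assumes only the faraway package, which provides the wbca data together with the helper functions logit() and ilogit(); the choice of Thick as covariate is purely illustrative):

# The logit link maps a mean mu in (0,1) to an unconstrained eta.
library(faraway)          # wbca data plus logit() and ilogit()
data(wbca)
eta <- logit(0.9)         # eta = log(mu/(1 - mu)), about 2.20
ilogit(eta)               # inverse link returns 0.9
# A binomial GLM relates Class to a covariate through this link:
fit <- glm(Class ~ Thick, family = binomial(logit), data = wbca)
head(predict(fit, type = "link"))       # fitted values on the eta scale
head(predict(fit, type = "response"))   # fitted probabilities (mu scale)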


Part a) Fit a binomial regression with Class as the response and the other nine variables as predictors. Report the residual deviance and the associated degrees of freedom, and explain why this information can or cannot be used to determine whether the model fits the data.

Solution: A binomial regression is a GLM where the response is a probability, in this case the probability that a tumor is benign, so a link function that guarantees 0 ≤ p ≤ 1 must be used. One common choice is the logit link, η = log(p/(1-p)), which is used in this exercise. The following binomial model, called Model 1, is fitted:

η = β0 + β1·Adhes + β2·BNucl + β3·Chrom + β4·Epith + β5·Mitos + β6·NNucl + β7·Thick + β8·UShap + β9·USize

The output below gives the estimated β-coefficients with standard errors, the residual deviance with its degrees of freedom, and the AIC value (the numeric estimates did not survive reproduction here; the significance codes are retained):

Call: glm(formula = Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl + Thick + UShap + USize, family = binomial(logit), data = wbca)

Coefficients: (Intercept) ***, Adhes **, BNucl ***, Chrom **, Epith, Mitos, NNucl *, Thick ***, UShap, USize
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance on 680 degrees of freedom; residual deviance on 671 degrees of freedom.

The residual deviance for this model has 671 degrees of freedom. The deviance is a measure of fit, since it compares fitted values to the data. Provided that Y (here Class) is truly binomial and that the numbers of trials, n_i, are relatively large, the deviance is approximately χ²-distributed with n - s degrees of freedom, where s is the number of parameters. A p-value can then be calculated for the reported deviance with the corresponding degrees of freedom, and if the p-value is larger than 0.05 we can conclude that the model fits sufficiently well. In this case, however, Y is binary, which means that n_i = 1 since Class is either 0 or 1. When the n_i are small the deviance is not approximately χ²-distributed, so it cannot be used to judge goodness of fit for binary data. Other methods, such as the Hosmer-Lemeshow test, can be used instead.
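As an illustration of such an alternative, below is a minimal sketch of a Hosmer-Lemeshow style computation. The function name, the default of g = 10 groups, and the chi-squared reference with g - 2 degrees of freedom are conventions assumed for this sketch, not part of the original solution; modell1 refers to the full fit in the appendix code.

# Group the observations by fitted probability, then compare observed and
# expected numbers of benign tumours within each group.
hosmer_lemeshow <- function(y, phat, g = 10) {
  # unique() guards against duplicated quantile breakpoints, which can
  # happen here because many fitted probabilities are very close to 1.
  breaks <- unique(quantile(phat, probs = seq(0, 1, length.out = g + 1)))
  grp  <- cut(phat, breaks = breaks, include.lowest = TRUE)
  obs  <- tapply(y, grp, sum)        # observed benign count per group
  expd <- tapply(phat, grp, sum)     # expected benign count per group
  n    <- tapply(y, grp, length)
  stat <- sum((obs - expd)^2 / (expd * (1 - expd / n)))
  c(statistic = stat,
    p.value = pchisq(stat, df = length(n) - 2, lower.tail = FALSE))
}
hosmer_lemeshow(wbca$Class, fitted(modell1))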

Part b) Use AIC as the criterion to determine the best subset of variables.

Solution: AIC (Akaike Information Criterion) is a measure of goodness of fit defined as AIC = deviance + 2·dim(β), where a smaller AIC indicates a better fit. Starting from Model 1, the step function is used to determine the best subset of main effects with AIC as the criterion:

Call: glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap, family = binomial(logit), data = wbca)

Coefficients: (Intercept) ***, Adhes **, BNucl ***, Chrom **, Mitos, NNucl *, Thick ***, UShap (numeric estimates omitted)
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance on 680 degrees of freedom; residual deviance on 673 degrees of freedom.

From the output it can be seen that Epith and USize are no longer in the model, called Model 2, and that the AIC is now smaller than for Model 1. Thus Model 2 is the main-effects model with the lowest AIC value. An ANOVA test comparing Model 1 and Model 2 gives the p-value 0.9, indicating that Epith and USize have no significant effect in the model and hence that Model 2 is better. A model, called Model 3, with all nine main effects and all 36 possible two-way interaction terms is also defined, and once again AIC is used as the criterion for reducing the model. The lowest AIC found in this search belongs to Model 4, which includes all nine main effects and 19 different two-way interactions. Further simplifications of Model 4, where non-significant effects are deleted from the model, do not result in any lower AIC value. A comparison between Model 2 and Model 4, using ANOVA, gives a p-value much smaller than 0.001, indicating that the included two-way interactions add some significant predictive information. Despite this result, Model 2 is chosen for further analysis: it is much smaller than Model 4 (which is overfitted) and hence easier to interpret, and its diagnostic plots, see Figures 1 and 2, are also more satisfactory.

Figure 1: Diagnostic plots for Model 2
Figure 2: Diagnostic plots for Model 4
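The AIC definition above can be verified directly on the fitted objects (a small sketch; modell1 and modell2 are the object names used in the appendix code):

# For binary data the saturated log-likelihood is 0, so the deviance equals
# -2 log-likelihood and AIC = deviance + 2 * number of parameters; for
# Model 2 the parameters are the intercept and the seven remaining slopes.
deviance(modell2) + 2 * length(coef(modell2))
AIC(modell2)                               # agrees with the hand computation
anova(modell2, modell1, test = "Chisq")    # the comparison reported above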

Part c) Use the reduced model to predict the outcome for a new patient with predictor variables Adhes = 1, BNucl = 1, Chrom = 3, Epith = 2, Mitos = 1, NNucl = 1, Thick = 4, UShap = 1 and USize = 1. Give a confidence interval for the prediction.

Solution: The reduced model is Model 2:

η = β0 + β1·Adhes + β2·BNucl + β3·Chrom + β4·Mitos + β5·NNucl + β6·Thick + β7·UShap

Since Epith and USize are not in the model, the values of these covariates are not taken into account in the prediction. Using the estimated coefficients for Model 2 (see part b), a predicted value on the logit scale is calculated and transformed back to the response scale. The predicted probability that the tumor is benign for a woman with the predictor values specified above is about 99 %; equivalently, the predicted probability that this woman's tumor is malignant is about 1 %. An approximate 95 % confidence interval for the predicted probability can be obtained by normal approximation. The interval on the logit scale is [η0 - 1.96·se(η0), η0 + 1.96·se(η0)], where η0 is the predicted value and se(η0) its standard error. Using these numbers and the inverse link p = e^η/(1 + e^η), a confidence interval on the probability scale is obtained, with lower endpoint 0.9757.

Part d) Suppose that a cancer is classified as benign if p > 0.5 and malignant if p < 0.5. Compute the number of errors of both types that will be made if this method is applied to the current data with the reduced model.

Solution: The probabilities for all 681 women in the study are predicted with Model 2, and if the predicted probability is smaller than 0.5 the tumor is classified as malignant. The predicted classification is then compared with the initial data and the result is shown in Table 1. A total of 20 errors are made, corresponding to 2.9 % misclassifications: 11 malignant tumors are classified as benign and 9 benign tumors are classified as malignant.

              Classified as malignant   Classified as benign
Is malignant  227 (33.3 %)              11 (1.6 %)
Is benign     9 (1.3 %)                 434 (63.7 %)

Table 1

Part e) Suppose the cutoff is changed to 0.9, so that p < 0.9 is classified as malignant and p > 0.9 is classified as benign. Compute the number of errors in this case and discuss the issues of determining the cutoff.

Solution: The same procedure as above is followed, but with the cutoff set to 0.9 instead of 0.5, resulting in a total of 17 misclassifications, or 2.5 %, see Table 2. One malignant tumor is classified as benign and 16 benign tumors are classified as malignant.

              Classified as malignant   Classified as benign
Is malignant  237 (34.8 %)              1 (0.15 %)
Is benign     16 (2.3 %)                427 (62.7 %)

Table 2
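Tables 1 and 2 can be reproduced in a couple of lines (a sketch; modell2 is the reduced fit from the appendix code, and the dimension labels are chosen here for readability):

# Cross-tabulate the true class against the classification at each cutoff;
# Class = 0 is malignant and Class = 1 is benign.
phat <- predict(modell2, type = "response")
table(truth = wbca$Class, benign_at_0.5 = as.numeric(phat > 0.5))
table(truth = wbca$Class, benign_at_0.9 = as.numeric(phat > 0.9))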

From the results in parts d) and e) it is clear that the choice of cutoff matters a great deal, but it is a hard choice to make. If the cutoff is high, as the model is defined here where a high probability means a benign tumor, the risk of missing a malignant tumor is small, but more benign tumors will be classified as malignant, probably causing unnecessary pain for the affected women. On the other hand, if the cutoff is too low, more malignant tumors will be misclassified, which, at least in my opinion, is more serious. A missed malignant tumor can in the worst case cause the woman to die if she does not get the right treatment in time. My opinion is that an exact cutoff will never be correct and that an interval is better to use than a single point: if the probability is higher than the interval the tumor is classified as benign, if it is lower the tumor is classified as malignant, and if the probability lies within the interval there will be further examinations. There will still be cases on the wrong side of the interval, but then I think it is better to choose a cutoff that reduces the "classified as benign but actually malignant" errors rather than the opposite error.

Part f) It is usually misleading to use the same data to fit a model and test its predictive ability. To investigate this, split the data into two parts in the following way: assign every third observation to a test set and the remaining two thirds of the data to a training set. Use the training set to determine the model and the test set to assess its predictive performance. Compare the outcome to the previously obtained results.

Solution: Just as specified above, every third observation is assigned to a test set and the remaining two thirds to a training set. A main-effects model with all nine covariates, called Model 6, is then fitted to the training data. For simplicity, both in calculation and in interpretation, no model with interaction terms is specified, though this would be important in a more comprehensive study since the result in part b) indicates that there are some significant interactions between the predictor variables. Using the step function to find the subset of the variables with the smallest AIC results in the same model as was obtained with the entire dataset, namely the model where Epith and USize are excluded as predictors. The output below shows a summary of that model, called Model 7 (numeric estimates again omitted):

Call: glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl + Thick + UShap, family = binomial(logit), data = trainingset)

Coefficients: (Intercept) ***, Adhes **, BNucl **, Chrom *, Mitos, NNucl **, Thick **, UShap
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance on 453 degrees of freedom; residual deviance on 446 degrees of freedom.
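The split and the selection leading to this output can be coded as follows (a sketch mirroring the appendix code for part f; the index name idx is illustrative, while the model names are the ones used there):

# Every third observation forms the test set; the rest is the training set.
idx <- which((1:nrow(wbca)) %% 3 == 0)
testset     <- wbca[idx, ]
trainingset <- wbca[-idx, ]
modell6 <- glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl +
                 Thick + UShap + USize,
               family = binomial(logit), data = trainingset)
modell7 <- step(modell6, trace = FALSE)   # AIC-based reduction
formula(modell7)   # Epith and USize are dropped, as for the full data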

When Model 6 and Model 7 are compared using ANOVA, the resulting p-value is 0.39, indicating that the extra terms in the larger model are not significant and that the smaller Model 7 is preferable. Hence Model 7 is determined to be best for the training set and is used, together with the test set, for prediction. Predictions are made for the 227 women in the test set using Model 7, and the result is then compared to the initial classification. Figure 3 shows the predicted probabilities plotted against the true classifications. Remember that the true classification only takes the value 0 if the tumor is malignant and 1 if it is benign, while the predicted values lie in the range from 0 to 1. Hence values close to the points (0, 0) and (1, 1) are good predictions, while the further a case is from these points, the larger the risk of a misclassification, depending on the chosen cutoff. Most of the cases are in the desirable corners, but some cases are far from them and hence will be misclassified.

Figure 3

Another way to assess the predictive performance, and to compare it with the results when the entire dataset is used, is to apply the same classification method as in parts d) and e), with the cutoff set first to 0.5 and then to 0.9. The resulting errors are shown in Tables 3 and 4.

              Classified as malignant   Classified as benign
Is malignant  70 (30.8 %)               5 (2.2 %)
Is benign     2 (0.88 %)                150 (66.1 %)

Table 3: Cutoff = 0.5

              Classified as malignant   Classified as benign
Is malignant  73 (32.2 %)               2 (0.88 %)
Is benign     3 (1.3 %)                 149 (65.6 %)

Table 4: Cutoff = 0.9

When the cutoff is 0.5, 3.1 % of the cases are misclassified, and when the cutoff is 0.9, 2.2 % of the tumors are misclassified. Compared to the result for the entire dataset, the total error proportion is larger with the test set for the cutoff 0.5 (3.1 % compared to 2.9 %) but smaller for the cutoff 0.9 (2.2 % compared to 2.5 %). The differences are not large, and it can be concluded that the predictions are rather robust.
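The error proportions in Tables 3 and 4 correspond to simple test-set misclassification rates (a sketch; modell7 and testset as above):

# Proportion of test-set tumours on the wrong side of each cutoff.
phat_test <- predict(modell7, newdata = testset, type = "response")
mean((phat_test > 0.5) != testset$Class)   # about 3.1 % at cutoff 0.5
mean((phat_test > 0.9) != testset$Class)   # about 2.2 % at cutoff 0.9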

Exercise 5:2

Background: The dataset happy was collected from 39 students at the University of Chicago Graduate School of Business to test the hypothesis that love and work are the important factors in determining an individual's happiness. The variables money and sex are also included in the study, where sex refers to sexual activity rather than gender. The first six lines of the data have the following variables (the values themselves are omitted here):

happy money sex love work

The response variable happy, representing the students' happiness, is measured on a 10-point scale, with 1 representing a suicidal state, 5 a feeling of "just muddling along," and 10 a euphoric state. The variable money is measured by annual family income in thousands of dollars. Sex is measured by a dummy variable taking the value 1 if the student has a satisfactory level of sexual activity and 0 otherwise. Love is measured on a 3-point scale, with 1 representing loneliness and isolation, 2 a secure relationship, and 3 a deep feeling of belonging and caring in the context of some family or community. The last variable, work, is measured on a 5-point scale, with 1 indicating that the individual has no job or is seeking other employment, 3 that the job is "OK," and 5 that the job is great.

Part a) Build a model for the level of happiness as a function of the other variables.

Solution: The multinomial distribution is an extension of the binomial distribution where the response can take more than two values. The response variable happy in the dataset happy (unfortunately they share the same name) is multinomially distributed, since it takes one of the finite values 1, 2, ..., 10 depending on how happy the student is. The happy variable is also ordered: it is a 10-point scale where 1 is the lowest happiness and 10 the highest. This means that happy is an ordered multinomial response, and methods that take this ordering into account must be used to build the model. With an ordered response Y_i it is often easier to work with the cumulative probabilities γ_ij = P(Y_i ≤ j), where i indexes the individual and j the category, j = 1, ..., J. The γ's are then linked to the covariates x_i in the following way: g(γ_ij) = θ_j - x_i^T β. The θ_j are the intercepts, so the vector x_i does not include an intercept, and β does not depend on the category j. The latent variable Z_i is a continuous variable that can be thought of as the real underlying response, and a discretized version of Z_i is observed in the form of Y_i, where Y_i = j if θ_{j-1} < Z_i ≤ θ_j. This means that the θ_j define a grid of thresholds on the Z scale that separates the different response categories, and the effect of the covariates is to move that grid in different directions. When the link function g is the logit link, the model is called the proportional odds model and is defined as

log(γ_j(x_i)/(1 - γ_j(x_i))) = θ_j - x_i^T β, j = 1, ..., J-1,

where γ_j(x_i) = P(Y_i ≤ j | x_i). A proportional odds model with happy as the response and the other four variables (money, sex, love and work) as covariates is fitted to the dataset using functions in R. An AIC-based variable selection is then used to reduce the model:

Start: happyf ~ money + sexf + lovef + workf
(at this step, dropping sexf gives the smallest AIC)

Step: happyf ~ money + lovef + workf
(<none> now has the smallest AIC, so no further variable is dropped)

The output shows that the model with the smallest AIC value is the model that does not include sex as a covariate. The two models, with and without the variable sex, are then compared using ANOVA, and the resulting p-value is 0.27, indicating that sex has no significant effect in the model and can be deleted. The summary for the chosen model is shown below (numeric estimates omitted):

Call: polr(formula = happyf ~ money + lovef + workf, data = happy)

Coefficients: money, lovef[T.2], lovef[T.3], workf[T.2], workf[T.3], workf[T.4], workf[T.5]
Intercepts for the category boundaries, residual deviance and AIC (values omitted)

The proportional odds model assumes that the relative odds of moving from one response category to another are the same regardless of the current category. Whether this assumption holds for this dataset is questionable. For example, an extra $10000 in annual income for an individual who does not have a high income to start with will probably cause a larger change in happiness than an extra $10000 for an individual who is very rich. Another model for ordered multinomial data is the ordered probit model: if the latent variable Z_i is assumed to have a standard normal distribution, the probit link function is used, i.e.

Φ^(-1)(γ_j(x_i)) = θ_j - x_i^T β, j = 1, ..., J-1.

This model is also fitted to the dataset happy, first with all four predictor variables, but once again both the AIC-based variable selection procedure and the ANOVA test indicate that the variable sex can be deleted from the model.
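Both fits can be reproduced with polr() from the MASS package (a sketch matching the appendix code; the object names po and opr are illustrative):

library(MASS)      # polr() fits ordered logistic/probit models
library(faraway)   # the happy data
data(happy)
happy$happyf <- factor(happy$happy)   # the ordered response as a factor
happy$lovef  <- factor(happy$love)
happy$workf  <- factor(happy$work)
po  <- polr(happyf ~ money + lovef + workf, data = happy)   # prop. odds
opr <- polr(happyf ~ money + lovef + workf, data = happy,
            method = "probit")                              # ordered probit
AIC(po, opr)   # compare the two link choices on the same data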

The summary for the probit model without the covariate sex is shown below (numeric estimates omitted):

Call: polr(formula = happyf ~ money + lovef + workf, data = happy, method = "probit")

Coefficients: money, lovef[T.2], lovef[T.3], workf[T.2], workf[T.3], workf[T.4], workf[T.5]
Intercepts for the category boundaries, residual deviance and AIC (values omitted)

The conclusion from the analysis is that money, love and work have an effect on an individual's happiness, while sexual activity does not. No models with interactions are fitted, since these would be overfitted given how small the dataset is relative to the number of possible parameters. Notice that since no one in the original dataset had a happiness level of 1, the model only fits values for the response levels 2, ..., 10.

Part b) Interpret the parameters of your chosen model.

Solution: The interpretations are done using the proportional odds model with the covariates money, love and work; for the coefficients and intercepts, see the second output in part a). The proportional odds model is log(γ_j(x)/(1 - γ_j(x))) = θ_j - x^T β, so the intercept terms in the output correspond to the θ_j. The chosen model is parameterized so that the default level is money = 0, love = 1 and work = 1, corresponding to a person who has no annual family income, is lonely and has no job. The log-odds for this default person to be in happiness category 2 or lower, against category 3 or higher, is 0.0389; hence the odds are exp(0.0389) = 1.04. This means that the odds favor the lower categories over the higher, and the corresponding probability of being in category 2 or lower is ilogit(0.0389) = 0.510, where ilogit is the inverse logit function. The odds for the same default person to be in category 3 or lower, against 4 or higher, are exp(0.9184) = 2.51, and the probability of being in exactly category 3 is ilogit(0.9184) - ilogit(0.0389) = 0.205. The intercepts for the remaining categories can be interpreted in a similar way, where a positive log-odds corresponds to an odds larger than 1 and hence higher odds of being in the lower happiness categories than the higher ones. From the output it can be seen that all the intercept terms, i.e. the log-odds, for the default person are positive; hence the predicted probability is larger for the lower categories than for the higher, which seems logical since the default person has the lowest possible values on all three covariates.
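These intercept manipulations can be checked numerically (a sketch; modell2 here denotes the reduced proportional odds fit from the appendix code, and ilogit() comes from the faraway package):

# For the default person, x = 0, the cumulative probabilities are
# P(Y <= j) = ilogit(theta_j); category probabilities are their
# successive differences.
theta <- modell2$zeta            # the intercepts theta_j from polr()
gam   <- ilogit(theta)           # cumulative probabilities P(Y <= j)
round(diff(c(0, gam, 1)), 3)     # probability of each happiness category
# Cross-check with the built-in prediction for the default person:
round(predict(modell2, data.frame(money = 0, lovef = "1", workf = "1"),
              type = "probs"), 3)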

The coefficients in the output correspond to the β's and can be interpreted in the following way. If the income is increased by one unit ($1000), the odds of moving from a given happiness category to a higher one increase by a factor exp(β_money). This is equivalent to saying that, standing in happiness category 2 (for example), the log-odds of being in that category or lower become smaller if the money variable is increased by, say, 3 units, i.e. log-odds = θ_2 - β_money·3. In a similar way, when changing from love level 1 to love level 2, the odds of moving one category higher in happiness increase by a factor exp(β_love2), and when changing from love level 1 to 3, the odds of being a level happier increase by a factor exp(β_love3). The coefficients for the different work levels are interpreted in the same way.

Part c) Predict the happiness distribution for an individual whose parents earn $30000 a year, who is lonely, not sexually active and has no job.

Solution: That a person is lonely corresponds to the covariate value love = 1, that he or she does not have a job corresponds to work = 1, and that the parents' annual income is $30000 corresponds to money = 30, since income is measured in thousands of dollars. That the person is not sexually active corresponds to sex = 0, but since sex is not included in the model this value is ignored. The happiness distribution is predicted for a person with these values, both with the proportional odds model and with the ordered probit model (the predicted distributions are produced by the appendix code for part c). Even if the coefficients of the two models differ a bit (see part a), the predictions are rather similar: a person who is lonely, has no job and whose parents earn $30000 a year is most likely to be in happiness category 2, 3 or 4. To test the predictive performance of the two models, predictions are also made for the original dataset, and the predicted happiness is compared to the true happiness level; the results are cross-tabulated in Tables 5 and 6.

Table 5: Proportional odds model (predicted versus true happiness categories; entries omitted)
Table 6: Ordered probit model (predicted versus true happiness categories; entries omitted)

Once again the predictions from the two models are rather similar, and also quite good. Not all predictions are correct, but the biggest difference, or error, between a true and a predicted happiness category is 2 levels; for example, one true category 10 is predicted as category 8.

Exercise 8:7

Background: The dataset semicond is from an experiment that was conducted to optimize the manufacture of semiconductors. The response is the numeric variable resistance, the resistance recorded on the wafer. The experiment was conducted during four different time periods with three different wafers in each period, so the variable ET is a factor with levels 1 to 4 representing the etching time period and the variable Wafer is a factor with levels 1 to 3. The Grp variable is the combination of ET and Wafer. The last variable is position, a factor with levels 1 to 4. The first six lines of the data have the following variables (the values themselves are omitted here):

resistance ET Wafer position Grp

Exercise: Analyze the semicond data as a split-plot experiment where ET and position are considered as fixed effects. Since the wafers differ between experimental time periods, the Grp variable should be regarded as the block or group variable. Determine the best model for the data and check all appropriate diagnostics.

Solution: Split-plot designs originated in agriculture but are frequently used in other areas as well. The design arises as the result of a restriction on full randomization: main plots are split into several subplots, where the main plot is treated with one level of one factor while the levels of some other factor are allowed to vary over the subplots. In this case the etching time, ET, can be considered the main-plot factor, hence its levels are fixed, while Wafer can be considered the subplot factor and is therefore random. This implies that the combination of ET and Wafer, Grp, is also random, because one of its components is random, and the Wafer variable alone should not be included in the model since it is accounted for through Grp. The position variable could actually be thought of as a sub-subplot factor, but in this exercise it is considered a fixed effect. The model first fitted is

Y_ijk = μ + ET_i + position_j + (ET*position)_ij + (ET*Wafer)_ik + ε_ijk,

where μ, ET_i, position_j and (ET*position)_ij are fixed effects and the remaining terms are random effects with variances σ_w² and σ_ε², respectively. The output below is a summary of the fitted model, together with an ANOVA analysis to check the significance of the fixed effects (numeric values omitted):

Linear mixed model fit by REML
Formula: resistance ~ ET * position + (1 | Grp)
Data: semicond
Random effects: Grp (Intercept), Residual
Number of obs: 48; groups: Grp, 12

Fixed effects: (Intercept), ET[T.2], ET[T.3], ET[T.4], position[T.2], position[T.3], position[T.4], and the ET:position interaction terms (estimates omitted)

Analysis of Variance Table: ET, position, ET:position (values omitted)

From the summary it can be seen that the two estimated variance components, σ_w² and σ_ε² = 0.111, are similar, and hence that the variation between wafers and the variation due to random errors contribute about equally to the model variation. The ANOVA table indicates that the interaction term (ET*position) has little effect on the model, since an F-value smaller than 1 is not significant. A look at the t-values for the interaction coefficients gives the same conclusion, since none of them is larger than 1. This implies that the interaction term can be removed, and a model without interaction is fitted:

Linear mixed model fit by REML
Formula: resistance ~ ET + position + (1 | Grp)
Data: semicond
Random effects: Grp (Intercept), Residual
Number of obs: 48; groups: Grp, 12
Fixed effects: (Intercept), ET[T.2], ET[T.3], ET[T.4], position[T.2], position[T.3], position[T.4] (estimates omitted)
Analysis of Variance Table: ET, position (values omitted)
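With a current version of the lme4 package, the variance components discussed here can be pulled out of the fit directly (a sketch; modell2 is the no-interaction model from the appendix code for this exercise, and the helper names are illustrative):

library(lme4)
vc <- VarCorr(modell2)
sigma_w2 <- as.numeric(vc$Grp)    # between-group (wafer-within-ET) variance
sigma_e2 <- sigma(modell2)^2      # residual variance
c(grp = sigma_w2, resid = sigma_e2)
# Share of the total variation attributable to the wafers:
sigma_w2 / (sigma_w2 + sigma_e2)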

The difference between the two variance components is now even smaller, which substantiates the conclusion that the wafers and the random errors contribute about equally to the model variation. Even if the AIC criterion for model selection could be questioned here, it suggests that the model without interaction is better, since that model has the lower AIC value. Together with the previous result that the interaction was non-significant, the conclusion is that the following model is considered best:

Y_ijk = μ + ET_i + position_j + (ET*Wafer)_ik + ε_ijk.

The diagnostic plots for both models (with and without interaction) are shown in Figure 4. The plots are rather similar and no obvious problems with the model assumptions can be seen: the residuals do not show any obvious trend or pattern (which they should not), and the normality assumption looks appropriate according to the QQ-plots.

Figure 4: Diagnostic plots for the split-plot models.

A look at the estimated coefficients and their t-values, together with the ANOVA table, suggests that the fixed effects ET and position are probably both significant. Time period 4 seems to differ most from the other time periods, and position 3 differs most from the other three positions. The intercept term, 5.64, corresponds to the resistance in time period 1 and position 1.
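The two mixed models can also be compared with a likelihood ratio test; with current lme4, anova() refits both models by maximum likelihood before testing (a sketch; modell1 and modell2 as in the appendix code, where a large p-value would support dropping the interaction, consistent with the F-statistics above):

# Likelihood ratio comparison of the interaction and no-interaction models.
anova(modell2, modell1)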

References

Theory and methods:
Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman & Hall/CRC.

Dataset in Exercise 2:2:
Bennett, K. P., & Mangasarian, O. L. (1992). Neural network training via linear programming. In P. M. Pardalos (Ed.), Advances in Optimization and Parallel Computing. Elsevier Science.

Dataset in Exercise 5:2:
George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88.

Dataset in Exercise 8:7:
Littell, R. C., Milliken, G. A., Stroup, W. W., & Wolfinger, R. D. (1996). SAS System for Mixed Models. SAS Institute (Data Set 2.2(b)).

Appendix

R code for exercise 2:2

####################################
#######   Ch 2: Exercise 2   #######
####################################
library(faraway)
data(wbca)
help(wbca)
attributes(wbca)
head(wbca)
wbca
summary(wbca)

########## (a) ##########
# Model with all main effects:
modell1 <- glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl +
               Thick + UShap + USize, family=binomial(logit), data=wbca)
summary(modell1)
# Diagnostic plots:
halfnorm(residuals(modell1))
par(mfrow=c(2,2))
plot(modell1)

########## (b) ##########
# Reduced model with only main effects:
modell2 <- step(modell1, trace=FALSE)
summary(modell2)
anova(modell2, modell1)
pchisq(0.2, 2, lower.tail=FALSE)   # Model 2 is better!
# Model with all two-way interactions:
modell3 <- glm(Class ~ (Adhes + BNucl + Chrom + Epith + Mitos + NNucl +
               Thick + UShap + USize)^2, family=binomial(logit), data=wbca)
summary(modell3)
# Reduced model with two-way interactions:
modell4 <- step(modell3, trace=FALSE)
summary(modell4)
# Further reduced model with two-way interactions:
modell5 <- glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl +
               Thick + UShap + USize + Adhes:BNucl + Adhes:Epith +
               Adhes:Thick + Adhes:UShap + BNucl:Chrom + BNucl:UShap +
               BNucl:USize + Chrom:UShap + Chrom:USize + Epith:Thick,
               family=binomial(logit), data=wbca)
summary(modell5)
anova(modell2, modell4)
pchisq(62, 21, lower.tail=FALSE)   # Model 4 is better!
# Diagnostic plots:
halfnorm(residuals(modell2))
par(mfrow=c(2,2))
plot(modell2)
halfnorm(residuals(modell4))
par(mfrow=c(2,2))
plot(modell4)

########## (c) ##########
# Prediction:
x0 <- c(1, 1, 1, 3, 1, 1, 4, 1)   # intercept + covariate values for Model 2
eta0 <- sum(x0 * coef(modell2))

ilogit(eta0)
# Alternative:
predict(modell2, newdata=data.frame(Adhes=1, BNucl=1, Chrom=3, Mitos=1,
        NNucl=1, Thick=4, UShap=1), type="response", se.fit=TRUE)
# Confidence interval for the prediction:
modell2sum <- summary(modell2)
(cm <- modell2sum$cov.unscaled)
se <- sqrt(t(x0) %*% cm %*% x0)              # standard error on the logit scale
ilogit(c(eta0 - 1.96*se, eta0 + 1.96*se))    # CI on the probability scale
# Alternative:
predict(modell2, newdata=data.frame(Adhes=1, BNucl=1, Chrom=3, Mitos=1,
        NNucl=1, Thick=4, UShap=1), se.fit=TRUE)
# ilogit(c(fit - 1.96*se.fit, fit + 1.96*se.fit))   # same interval as above,
# with fit and se.fit taken from the predict() output

########## (d) ##########
predsann <- predict(modell2, type="response")
prsa0.5 <- predsann > 0.5
test0.5 <- wbca$Class - prsa0.5
sum(test0.5 == -1)   # -1 = initially malignant but classified as benign
sum(test0.5 == 1)    #  1 = initially benign but classified as malignant
sum(test0.5 == 0)    #  0 = correct classification
# Alternative:
table(wbca$Class, 1*(predsann > 0.5))

########## (e) ##########
prsa0.9 <- predsann > 0.9
test0.9 <- wbca$Class - prsa0.9
sum(test0.9 == -1)   # -1 = initially malignant but classified as benign
sum(test0.9 == 1)    #  1 = initially benign but classified as malignant
sum(test0.9 == 0)    #  0 = correct classification
# Alternative:
table(wbca$Class, 1*(predsann > 0.9))

########## (f) ##########
# Every third observation in a test set:
vartredje <- which((1:681) %% 3 == 0)
testset <- wbca[vartredje, ]
# The remaining two thirds in a training set:
trainingset <- wbca[-vartredje, ]
# Fit a main-effects model:
modell6 <- glm(Class ~ Adhes + BNucl + Chrom + Epith + Mitos + NNucl +
               Thick + UShap + USize, family=binomial(logit),
               data=trainingset)
summary(modell6)
# Reduce the main-effects model:
modell7 <- step(modell6, trace=FALSE)
summary(modell7)
anova(modell7, modell6)
pchisq(1.89, 2, lower.tail=FALSE)   # The smaller model, Model 7, is better.
# Predict values for the test set:
skattn <- predict(modell7, newdata=testset, type="response", se.fit=TRUE)
# Plot predicted values against true values:
plot(skattn$fit, testset$Class, xlab="Predicted probabilities",
     ylab="True classification")
# Test the predictive performance:
table(testset$Class, 1*(skattn$fit > 0.5))
table(testset$Class, 1*(skattn$fit > 0.9))
####################################

R code for exercise 5:2

####################################
#######   Ch 5: Exercise 2   #######
####################################
library(faraway)
data(happy)
help(happy)
attributes(happy)
head(happy)
happy
summary(happy)
library(MASS)

########## (a) ##########
# Turn the categorical variables into factors:
happy$happyf <- factor(happy$happy)
happy$sexf <- factor(happy$sex)
happy$lovef <- factor(happy$love)
happy$workf <- factor(happy$work)
# A proportional odds model:
modell1 <- polr(happyf ~ money + sexf + lovef + workf, happy)
summary(modell1)
c(deviance(modell1), modell1$edf)
# AIC-based variable selection:
modell2 <- step(modell1)   # the variable sex doesn't seem to be significant
summary(modell2)
c(deviance(modell2), modell2$edf)
# Comparison:
anova(modell1, modell2)   # Model 2 is better!
# An ordered probit model:
modell3 <- polr(happyf ~ money + sexf + lovef + workf, method="probit", happy)
summary(modell3)
c(deviance(modell3), modell3$edf)
# AIC-based variable selection:
modell4 <- step(modell3)   # the variable sex doesn't seem to be significant
summary(modell4)
c(deviance(modell4), modell4$edf)
# Comparison:
anova(modell3, modell4)   # Model 4 is better!

########## (b) ##########
# Interpret the variables!

########## (c) ##########
# Predict with the proportional odds model:
round(predict(modell2, data.frame(money=30, sexf="0", lovef="1", workf="1"),
      type="probs"), 3)
# Check the predictive performance of the proportional odds model:
skattningar1 <- predict(modell2)
table(skattningar1, happy$happy)
# Predict with the ordered probit model:
round(predict(modell4, data.frame(money=30, sexf="0", lovef="1", workf="1"),
      type="probs"), 3)
# Check the predictive performance of the ordered probit model:
skattningar2 <- predict(modell4)
table(skattningar2, happy$happy)
####################################

R code for exercise 8:7

####################################
#######   Ch 8: Exercise 7   #######
####################################
library(faraway)
data(semicond)
help(semicond)
attributes(semicond)
head(semicond)
semicond
summary(semicond)
library(MASS)
library(lme4)

#######  Analysis  #######
str(semicond)
contrasts(semicond$ET) <- contr.treatment(4, 1)
contrasts(semicond$position) <- contr.treatment(4, 1)
contrasts(semicond$Wafer)
contrasts(semicond$Grp)
# Model with interaction between ET and position:
modell1 <- lmer(resistance ~ ET * position + (1 | Grp), semicond)
summary(modell1)
# Check the fixed effects for significance:
anova(modell1)   # the interaction is not significant since F < 1
# Model without interaction:
modell2 <- lmer(resistance ~ ET + position + (1 | Grp), semicond)
summary(modell2)
# Check the fixed effects for significance:
anova(modell2)
# Check diagnostic plots:
par(mfrow=c(2,2))
plot(fitted(modell1), resid(modell1), xlab="Fitted (interaction)",
     ylab="Residuals")
abline(0, 0)
qqnorm(resid(modell1), main="Interaction")
plot(fitted(modell2), resid(modell2), xlab="Fitted (no interaction)",
     ylab="Residuals")
abline(0, 0)
qqnorm(resid(modell2), main="No interaction")
####################################


Bayesian Classification Methods Bayesian Classification Methods Suchit Mehrotra North Carolina State University smehrot@ncsu.edu October 24, 2014 Suchit Mehrotra (NCSU) Bayesian Classification October 24, 2014 1 / 33 How do you define

More information

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto. Introduction to Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca September 18, 2014 38-1 : a review 38-2 Evidence Ideal: to advance the knowledge-base of clinical medicine,

More information

Simple, Marginal, and Interaction Effects in General Linear Models

Simple, Marginal, and Interaction Effects in General Linear Models Simple, Marginal, and Interaction Effects in General Linear Models PRE 905: Multivariate Analysis Lecture 3 Today s Class Centering and Coding Predictors Interpreting Parameters in the Model for the Means

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

Logistic & Tobit Regression

Logistic & Tobit Regression Logistic & Tobit Regression Different Types of Regression Binary Regression (D) Logistic transformation + e P( y x) = 1 + e! " x! + " x " P( y x) % ln$ ' = ( + ) x # 1! P( y x) & logit of P(y x){ P(y

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

Lecture 9 STK3100/4100

Lecture 9 STK3100/4100 Lecture 9 STK3100/4100 27. October 2014 Plan for lecture: 1. Linear mixed models cont. Models accounting for time dependencies (Ch. 6.1) 2. Generalized linear mixed models (GLMM, Ch. 13.1-13.3) Examples

More information

STAT 526 Advanced Statistical Methodology

STAT 526 Advanced Statistical Methodology STAT 526 Advanced Statistical Methodology Fall 2017 Lecture Note 10 Analyzing Clustered/Repeated Categorical Data 0-0 Outline Clustered/Repeated Categorical Data Generalized Linear Mixed Models Generalized

More information

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson Lecture 10: Alternatives to OLS with limited dependent variables PEA vs APE Logit/Probit Poisson PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample

More information

Workshop 9.3a: Randomized block designs

Workshop 9.3a: Randomized block designs -1- Workshop 93a: Randomized block designs Murray Logan November 23, 16 Table of contents 1 Randomized Block (RCB) designs 1 2 Worked Examples 12 1 Randomized Block (RCB) designs 11 RCB design Simple Randomized

More information

Random and Mixed Effects Models - Part II

Random and Mixed Effects Models - Part II Random and Mixed Effects Models - Part II Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Two-Factor Random Effects Model Example: Miles per Gallon (Neter, Kutner, Nachtsheim, & Wasserman, problem

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours

STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours Instructions: STATS216v Introduction to Statistical Learning Stanford University, Summer 2017 Remember the university honor code. Midterm Exam (Solutions) Duration: 1 hours Write your name and SUNet ID

More information

NELS 88. Latent Response Variable Formulation Versus Probability Curve Formulation

NELS 88. Latent Response Variable Formulation Versus Probability Curve Formulation NELS 88 Table 2.3 Adjusted odds ratios of eighth-grade students in 988 performing below basic levels of reading and mathematics in 988 and dropping out of school, 988 to 990, by basic demographics Variable

More information

Categorical Predictor Variables

Categorical Predictor Variables Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively

More information

1. Logistic Regression, One Predictor 2. Inference: Estimating the Parameters 3. Multiple Logistic Regression 4. AIC and BIC in Logistic Regression

1. Logistic Regression, One Predictor 2. Inference: Estimating the Parameters 3. Multiple Logistic Regression 4. AIC and BIC in Logistic Regression Logistic Regression 1. Logistic Regression, One Predictor 2. Inference: Estimating the Parameters 3. Multiple Logistic Regression 4. AIC and BIC in Logistic Regression 5. Target Marketing: Tabloid Data

More information

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Douglas Bates 2011-03-16 Contents 1 Generalized Linear Mixed Models Generalized Linear Mixed Models When using linear mixed

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Non-Gaussian Response Variables

Non-Gaussian Response Variables Non-Gaussian Response Variables What is the Generalized Model Doing? The fixed effects are like the factors in a traditional analysis of variance or linear model The random effects are different A generalized

More information

Truck prices - linear model? Truck prices - log transform of the response variable. Interpreting models with log transformation

Truck prices - linear model? Truck prices - log transform of the response variable. Interpreting models with log transformation Background Regression so far... Lecture 23 - Sta 111 Colin Rundel June 17, 2014 At this point we have covered: Simple linear regression Relationship between numerical response and a numerical or categorical

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3 STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection Model Selection in GLMs Last class: estimability/identifiability, analysis of deviance, standard errors & confidence intervals (should be able to implement frequentist GLM analyses!) Today: standard frequentist

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 12: Logistic regression (v1) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 30 Regression methods for binary outcomes 2 / 30 Binary outcomes For the duration of this

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

22s:152 Applied Linear Regression

22s:152 Applied Linear Regression 22s:152 Applied Linear Regression Chapter 7: Dummy Variable Regression So far, we ve only considered quantitative variables in our models. We can integrate categorical predictors by constructing artificial

More information

Logistic Regression - problem 6.14

Logistic Regression - problem 6.14 Logistic Regression - problem 6.14 Let x 1, x 2,, x m be given values of an input variable x and let Y 1,, Y m be independent binomial random variables whose distributions depend on the corresponding values

More information

Mixed models in R using the lme4 package Part 2: Longitudinal data, modeling interactions

Mixed models in R using the lme4 package Part 2: Longitudinal data, modeling interactions Mixed models in R using the lme4 package Part 2: Longitudinal data, modeling interactions Douglas Bates Department of Statistics University of Wisconsin - Madison Madison January 11, 2011

More information

36-463/663: Multilevel & Hierarchical Models

36-463/663: Multilevel & Hierarchical Models 36-463/663: Multilevel & Hierarchical Models (P)review: in-class midterm Brian Junker 132E Baker Hall brian@stat.cmu.edu 1 In-class midterm Closed book, closed notes, closed electronics (otherwise I have

More information

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R 2nd Edition Brian S. Everitt and Torsten Hothorn CHAPTER 7 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, Colonic

More information