HW1 Roshena MacPherson Feb 1, 2017


This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Question 1: In this question we will consider some real-life applications of statistical learning.

a) Unsupervised Learning

1) Inferring driving modes (lane changing, lane keeping, merging, etc.) from unlabeled highway data.

2) Threat detection from unlabeled video data.

3) Grouping politicians based on their voting records.

b) Regression

1) Trying to fit a polynomial to the dynamics of a system. The predictors are the initial conditions and control input; the response is the future state. The application is prediction: we want to be able to predict how the system will react if perturbed.

2) Trying to fit a function that relates student-reported happiness (the response) to predictors such as the number of hours spent working per day, the number of hours of exercise, etc. The goal is inference because we would like to draw conclusions about correlations between certain behaviors and reported happiness.

3) Trying to fit a function that relates the number of attempts a baby has made at completing a certain task (the predictor) to how well it performs the task (the response). The goal is inference because we would like to draw conclusions about the learning rate of the baby.

c) Classification

1) Trying to identify classes of drivers such as aggressive, timid, distracted, etc. based on the average number of lane changes per unit time (response), given their age, ethnicity, and car type (predictors). The goal is prediction because we would like to be able to predict what sorts of actions a driver will take in the future based on their classification.

2) Trying to determine the groups of patients on whom a drug is effective, based on whether they say they are feeling worse, the same, or better (response) after being given different amounts of the drug (predictor).

3) Classifying TV shows into different genres based on the ratings given to them by different types of viewers (response), given the age, ethnicity, and viewing habits of the viewers (predictors).

Question 2: Explain whether each scenario below is a regression, classification, or unsupervised learning problem, and indicate for each supervised learning scenario whether we are more interested in inference or prediction. Finally, provide n and p.

a) This is a classification problem. We are interested in inference because the school wants to understand how different elements act as predictors of admittance. n is 42,000 (the number of students that we have data for) and p is 7.

b) This is an unsupervised learning problem. We are trying to infer subtypes of consumers from unlabeled data. n is 1.5 million (the number of consumers we have data on) and p is 500,000 (the number of products we have data on).

c) This is a regression problem. The outputs (good sell, bad sell, horrible sell, etc.) have a clear ordering, so it makes sense to use regression instead of classification, although classification would also be reasonable. We are interested in inference, to understand how these different factors affect whether a book will sell well. n is 4,000 (the number of books we have data on) and p is 5.

d) This is a regression problem. We are interested in prediction because we would like to predict by how much global temperatures will rise in the coming years. n is 116 (the number of years we have data for) and p is 6.

Question 3:

a) The advantage of a very flexible approach to regression is that it can model a variety of underlying effects. For instance, a simple linear model will almost certainly not capture what is truly going on, whereas a more flexible model with second- and third-order terms can capture higher-order effects that might be present. The disadvantage of a very flexible model is that there are many more parameters to fit, so you can easily end up overfitting the training data, capturing its noise rather than the underlying trends.

b) If you know that the noise in your system is very low (you have already characterized and calibrated the signal-to-noise ratio of your sensing setup and know that the noise is very small), then a very flexible model may be appropriate, since most of the variance you see will be due to actual trends. A more flexible model is also appropriate if we are interested in prediction rather than inference (i.e. we do not care much about the interpretability of the fit) and we have a very large number of samples (i.e. n is very large). An example would be building a predictor for which stocks will increase when we have decades of data.

c) If we know that there is a decent amount of noise in our system, we would prefer a less flexible method, to make sure we are not fitting higher-order terms to the noise. If the number of data points n is small relative to the number of variables p, we would probably also want to use a less flexible method.

Question 4:

    library(MASS)
    attach(Boston)

a) There are 506 rows and 14 columns. The rows represent the different suburbs. The columns represent the different predictors that have been measured.

    nrow(Boston)
    [1] 506
    ncol(Boston)
    [1] 14

b) In the two plots below I have plotted the weighted mean of distances to five Boston employment centers vs the proportion of owner-occupied units built prior to 1940 (plot A) and nitrogen oxides concentration vs proportion of non-retail business acres per town (plot B).

For A, the R^2 value is about 0.55, meaning about 55% of the variance is explained by our model. Considering where our data came from, it is reasonable to assume that there is a decent amount of noise in the data, so this R^2 value suggests that our model is a decent fit. Additionally, the p value for the slope coefficient is less than 2e-16, meaning data like these would be very unlikely if the null hypothesis of a zero slope were true.

For B, the R^2 value is about 0.58, meaning that about 58% of the variance is explained by our model. Again, this seems like a reasonably good model considering we expect our data to be pretty noisy. The p value for the slope coefficient is again <2e-16, so we can be fairly certain that there is a non-zero relationship between the two variables.

    model1 = lm(dis ~ age)
    model2 = lm(nox ~ indus)
    summary(model1)

    Call:
    lm(formula = dis ~ age)

    Coefficients:
                Pr(>|t|)
    (Intercept)   <2e-16 ***
    age           <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    504 residual degrees of freedom
    F-statistic on 1 and 504 DF, p-value: < 2.2e-16

    summary(model2)

    Call:
    lm(formula = nox ~ indus)

    Coefficients:
                Pr(>|t|)
    (Intercept)   <2e-16 ***
    indus         <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    504 residual degrees of freedom
    F-statistic on 1 and 504 DF, p-value: < 2.2e-16

    plot(age, dis)
    abline(model1, col = "blue")

Figure (plot A): dis against age, with the fitted line in blue.

    plot(indus, nox)
    abline(model2, col = "blue")

Figure (plot B): nox against indus, with the fitted line in blue.

c) It appears that all towns with a per capita crime rate noticeably above zero have no residential land zoned for lots over 25,000 sq. ft. So if a town has any residential land zoned for lots over 25,000 sq. ft., its crime rate is very likely to be close to zero. It also appears that almost all towns with a crime rate above 5% per capita have a proportion of non-retail business acres of 18. Any higher or lower than that and the crime rates do not exceed around 5%.

If the town bounds the Charles River, the crime rate does not exceed about 15% per capita. For towns that do not bound the Charles River, the crime rate spans the whole range. For towns with a nox value of less than about 0.55, the crime rate is essentially 0. Above that there does not seem to be any relationship.

There is a linear relationship between age and crim, with a p value of about 2.9e-16, though it seems as if a slightly more flexible model might fit the data better. There appears to be an inverse relationship between crim and the weighted mean of distances to five Boston employment centers.

It appears that if a town has an index of accessibility to radial highways of less than 25, the crime rate is very likely to be very small. Similarly, if a town has a full-value property tax rate per $10,000 of less than 650, the crime rate is very likely to be very small. Also, if the pupil-to-teacher ratio is less than 20, the crime rate is likely to be near zero.

There appears to be a linear relationship between the percentage of lower-status population and crim (p<2e-16, R^2=.21). There seems to be an inverse relationship between crim and the median value of owner-occupied homes in $1000s.

    par(mfrow=c(3,5))
    plot(crim~zn)
    plot(crim~indus)
    plot(crim~chas)
    plot(crim~nox)
    plot(crim~rm)
    plot(crim~age)
    plot(crim~dis)
    plot(crim~rad)
    plot(crim~tax)
    plot(crim~ptratio)
    plot(crim~black)
    plot(crim~lstat)
    plot(crim~medv)

Figure: grid of scatter plots of crim against zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, and medv.

    plot(zn, crim)

Figure: crim against zn.

    plot(age, crim)
    model = lm(crim ~ age)
    summary(model)

    Call:
    lm(formula = crim ~ age)

    Coefficients:
                Pr(>|t|)
    (Intercept)     e-05 ***
    age             e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    504 residual degrees of freedom
    F-statistic on 1 and 504 DF, p-value: 2.855e-16

    abline(model, col = "blue")

Figure: crim against age, with the fitted line in blue.

    plot(dis, crim)
    model = lm(crim ~ dis)
    summary(model)

    Call:
    lm(formula = crim ~ dis)

    Coefficients:
                Pr(>|t|)
    (Intercept)   <2e-16 ***
    dis           <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    504 residual degrees of freedom
    F-statistic on 1 and 504 DF, p-value: < 2.2e-16

    abline(model, col = "blue")

Figure: crim against dis, with the fitted line in blue.

    plot(lstat, crim)
    model = lm(crim ~ lstat)
    summary(model)

    Call:
    lm(formula = crim ~ lstat)

    Coefficients:
                Pr(>|t|)
    (Intercept)     e-06 ***
    lstat        < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    504 residual degrees of freedom
    F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16

    abline(model, col = "blue")
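The single-predictor fits in part c) repeat the same steps for each variable. As a minimal sketch, assuming the `Boston` data frame from the MASS package, the same regressions could be run in a loop and summarised in one table. The names `predictors`, `fits`, and `univariate` are illustrative and do not appear in the notebook.

```r
library(MASS)  # provides the Boston data frame

# Regress crim on each remaining column, one predictor at a time,
# and collect the slope, its p-value, and R-squared.
predictors <- setdiff(names(Boston), "crim")
fits <- lapply(predictors, function(v) {
  s <- summary(lm(reformulate(v, response = "crim"), data = Boston))
  data.frame(predictor = v,
             slope     = s$coefficients[2, "Estimate"],
             p_value   = s$coefficients[2, "Pr(>|t|)"],
             r_squared = s$r.squared)
})
univariate <- do.call(rbind, fits)

# Show the predictors ordered from highest to lowest R-squared.
univariate[order(-univariate$r_squared), ]
```

A table like this makes it easier to compare predictors side by side than reading thirteen separate `summary()` outputs, though it only captures linear relationships.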

Figure: crim against lstat, with the fitted line in blue.

d) There are a few towns with a per capita crime rate above 40, which is not the norm; 96% are in the 0-20 range. For zn, about 75% of towns have a proportion of residential land zoned for lots over 25,000 sq. ft. of less than ten. For indus, the spread is pretty even, from 0 to around 30. For chas, the distribution is obviously binary, as it is a dummy variable that can only be one or zero. For nox, there are a few outliers. For rm, the distribution looks very much like a Gaussian between 3 and 9 rooms per dwelling. Age is pretty evenly distributed between 0 and 100. There are a few outliers for dis. About 70% of towns have a rad value <= 8; the other 30% have rad values of 24. Similarly, about 70% of towns have a tax value <= 437, while the other 30% have values >= 666. ptratio is pretty evenly distributed. The vast majority of towns have a black value between 350 and 400; the rest are spread evenly from 0 to 350. lstat looks like a slightly lopsided Gaussian, ranging from 0 to 40. medv is similar, ranging from 0 to 50.

    par(mfrow=c(3,5))
    hist(crim)
    hist(zn)
    hist(indus)
    hist(chas)
    hist(nox)
    hist(rm)
    hist(age)
    hist(dis)
    hist(rad)
    hist(tax)
    hist(ptratio)
    hist(black)
    hist(lstat)
    hist(medv)
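The percentages quoted in part d) were read off the histograms. A rough way to check them numerically is sketched below, again assuming the `Boston` data frame from MASS. The vector `shares` and its element names are illustrative assumptions, and no particular output is claimed here.

```r
library(MASS)  # provides the Boston data frame

# Share of towns satisfying each condition quoted in part d).
shares <- c(
  crim_at_most_20  = mean(Boston$crim <= 20),
  zn_below_10      = mean(Boston$zn < 10),
  rad_at_most_8    = mean(Boston$rad <= 8),
  tax_at_most_437  = mean(Boston$tax <= 437),
  black_350_to_400 = mean(Boston$black >= 350 & Boston$black <= 400)
)
round(shares, 2)

# Frequency table for rad, to see where the remaining towns fall.
rad_counts <- table(Boston$rad)
rad_counts
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why `mean(condition)` works as a share here.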

Figure: histograms of crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, and medv.

e) 35 towns bound the Charles River.

    sum(chas)
    [1] 35

f) The median of ptratio is returned by:

    median(ptratio)

g) Town numbers 399 and 406 have the smallest median value of owner-occupied homes (tied at 5.0). The crime rates are both pretty high, at 38 and 68. The tax rates are in the high range at 666 each. ptratio is also at the upper end of the spectrum for each, at 20.2. I would want to look at the values of these predictors across other towns with different median values of owner-occupied homes before drawing conclusions.

    idx1 = 399
    idx2 = 406
    crim[idx1]
    crim[idx2]
    tax[idx1]
    [1] 666
    tax[idx2]
    [1] 666
    ptratio[idx1]
    [1] 20.2
    ptratio[idx2]
    [1] 20.2

h) 64 towns have on average more than 7 rooms in each house. 13 towns have on average more than 8 rooms in each house.

    sum(rm>7)
    [1] 64
    sum(rm>8)
    [1] 13
    hist(rm)

Figure: histogram of rm.

Question 5

a. Split the data set into a training set and a test set of approximately equal size.

    smp_size = floor(nrow(Boston)/2)
    seed_num = 25
    set.seed(seed_num)
    train_ind <- sample(seq_len(nrow(Boston)), size = smp_size)
    train <- Boston[train_ind, ]
    test <- Boston[-train_ind, ]

b. Fit a linear model using least squares on the training set, and report the mean training and mean test error obtained.

    model = lm(crim ~ ., train)
    summary(model)

    Call:
    lm(formula = crim ~ ., data = train)

    Coefficients:
                Pr(>|t|)
    (Intercept)
    zn
    indus
    chas
    nox
    rm
    age
    dis                  *
    rad             e-06 ***
    tax
    ptratio
    black
    lstat
    medv                 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    239 residual degrees of freedom
    F-statistic on 13 and 239 DF, p-value: < 2.2e-16

    predicted_vals_test = predict(model, test)
    predicted_vals_train = predict(model, train)
    res = predicted_vals_test - test$crim
    mean(model$residuals^2)   # training MSE
    mean(res^2)               # test MSE

c) It appears from our model that the index of accessibility to radial highways is extremely significant, and that the median value of owner-occupied homes in $1000s and the weighted mean of distances to five Boston employment centres are also significant, but less so. indus and age no longer appear significant, though this may be because they are correlated with dis, rad, or medv. The R-squared value of our model is .42, meaning we have explained 42% of the variance in the training data. Considering that the MSE went from 36 on the training set to 46 on the test set, which is not too big a jump, I would say our model does a pretty good job of predicting crime rate.

Question 6:

The most important predictors in this case are nox, rad, zn, dis, and black. The training misclassification rate is 5% and the test misclassification rate is 14%. This seems to perform much better than the linear regression model when compared to the R^2 values we had.
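The Question 6 procedure (threshold crim at the median, fit a logistic regression, and compare predictions with labels) can be sketched in vectorised form as follows. This is a hedged sketch, not the notebook's own code: it assumes the `Boston` data frame from MASS and reuses seed 25, but because R's default sampling algorithm changed in version 3.6, the split and the resulting error rates will not necessarily match the 5% and 14% reported above. The names `fit`, `train_error`, and `test_error` are illustrative.

```r
library(MASS)  # provides the Boston data frame

set.seed(25)  # same seed value as the notebook; the split may still differ
n <- nrow(Boston)
train_ind <- sample(seq_len(n), size = floor(n / 2))
train <- Boston[train_ind, ]
test  <- Boston[-train_ind, ]

# Label a town 1 when its crime rate is at or above the median of its own subset.
train$crim <- as.integer(train$crim >= median(train$crim))
test$crim  <- as.integer(test$crim  >= median(test$crim))

# Logistic regression on all remaining predictors.
# glm may warn that fitted probabilities of 0 or 1 occurred; the fit is still returned.
fit <- glm(crim ~ ., family = binomial, data = train)

# Classify with a 0.5 probability cutoff.
train_pred <- as.integer(fitted(fit) > 0.5)
test_pred  <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)

train_error <- mean(train_pred != train$crim)
test_error  <- mean(test_pred  != test$crim)
c(train = train_error, test = test_error)
```

Using `as.integer()` on the logical comparison replaces the element-by-element loop, and `mean(pred != truth)` gives the misclassification rate directly.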

    # TRUE when a town's crime rate is at or above the median of its subset
    Y = train$crim >= median(train$crim)
    Ytest = test$crim >= median(test$crim)

    # Recode the logical labels as 1/0
    for (i in 1:nrow(train)){
      if(Y[i]){ Y[i] = 1} else { Y[i] = 0}
    }
    for (i in 1:nrow(test)){
      if(Ytest[i]){ Ytest[i] = 1} else { Ytest[i] = 0}
    }

    # Copies of the data with crim replaced by the binary label
    test_changed_crim = test
    test_changed_crim$crim = Ytest
    train_changed_crim = train
    train_changed_crim$crim = Y

    model = glm(crim ~ ., family = binomial, data = train_changed_crim)

    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

    summary(model)

    Call:
    glm(formula = crim ~ ., family = binomial, data = train_changed_crim)

    Coefficients:
                Pr(>|z|)
    (Intercept)          **
    zn                   *
    indus
    chas
    nox             e-05 ***
    rm
    age
    dis                  *
    rad                  **
    tax                  *
    ptratio
    black                *
    lstat
    medv
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance on 252 degrees of freedom
    Residual deviance on 239 degrees of freedom
    Number of Fisher Scoring iterations: 10

    # Training misclassification rate
    yhat = model$fitted.values > 0.5
    class_rate_train = mean(Y == yhat)
    1 - class_rate_train

    # Test misclassification rate
    y_test = predict(model, newdata = test_changed_crim, type = "response")
    y_test = y_test > 0.5
    class_rate_test = mean(Ytest == y_test)
    1 - class_rate_test

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
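The misclassification rates in Question 6 collapse all errors into one number. A confusion table separates false positives from false negatives. The sketch below uses a hypothetical helper, `confusion()`, and made-up toy inputs purely for illustration; neither comes from the notebook.

```r
# Hypothetical helper: cross-tabulate true 0/1 labels against
# probabilities thresholded at `cutoff`.
confusion <- function(actual, prob, cutoff = 0.5) {
  predicted <- as.integer(prob > cutoff)
  table(actual    = factor(actual,    levels = c(0, 1)),
        predicted = factor(predicted, levels = c(0, 1)))
}

# Toy inputs (made up for illustration, not taken from the notebook).
actual <- c(0, 0, 1, 1, 1, 0)
prob   <- c(0.1, 0.7, 0.8, 0.4, 0.9, 0.2)

tab <- confusion(actual, prob)
tab

# Overall misclassification rate: one minus the share on the diagonal.
misclass <- 1 - sum(diag(tab)) / sum(tab)
misclass
```

With the notebook's objects, the same helper could presumably be called on the 0/1 test labels and the predicted test probabilities, which would show whether the test errors are mostly high-crime towns labelled low or the reverse.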


More information

Logistic & Tobit Regression

Logistic & Tobit Regression Logistic & Tobit Regression Different Types of Regression Binary Regression (D) Logistic transformation + e P( y x) = 1 + e! " x! + " x " P( y x) % ln$ ' = ( + ) x # 1! P( y x) & logit of P(y x){ P(y

More information

Foundations of Correlation and Regression

Foundations of Correlation and Regression BWH - Biostatistics Intermediate Biostatistics for Medical Researchers Robert Goldman Professor of Statistics Simmons College Foundations of Correlation and Regression Tuesday, March 7, 2017 March 7 Foundations

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Neural networks (not in book)

Neural networks (not in book) (not in book) Another approach to classification is neural networks. were developed in the 1980s as a way to model how learning occurs in the brain. There was therefore wide interest in neural networks

More information

Consider fitting a model using ordinary least squares (OLS) regression:

Consider fitting a model using ordinary least squares (OLS) regression: Example 1: Mating Success of African Elephants In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful

More information

Introduction to Statistics and R

Introduction to Statistics and R Introduction to Statistics and R Mayo-Illinois Computational Genomics Workshop (2018) Ruoqing Zhu, Ph.D. Department of Statistics, UIUC rqzhu@illinois.edu June 18, 2018 Abstract This document is a supplimentary

More information

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006 Chapter 17 Simple Linear Regression and Correlation 17.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Linear Regression Measurement & Evaluation of HCC Systems

Linear Regression Measurement & Evaluation of HCC Systems Linear Regression Measurement & Evaluation of HCC Systems Linear Regression Today s goal: Evaluate the effect of multiple variables on an outcome variable (regression) Outline: - Basic theory - Simple

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

The linear model. Our models so far are linear. Change in Y due to change in X? See plots for: o age vs. ahe o carats vs.

The linear model. Our models so far are linear. Change in Y due to change in X? See plots for: o age vs. ahe o carats vs. 8 Nonlinear effects Lots of effects in economics are nonlinear Examples Deal with these in two (sort of three) ways: o Polynomials o Logarithms o Interaction terms (sort of) 1 The linear model Our models

More information

Logistic Regression - problem 6.14

Logistic Regression - problem 6.14 Logistic Regression - problem 6.14 Let x 1, x 2,, x m be given values of an input variable x and let Y 1,, Y m be independent binomial random variables whose distributions depend on the corresponding values

More information

Introduction and Single Predictor Regression. Correlation

Introduction and Single Predictor Regression. Correlation Introduction and Single Predictor Regression Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Correlation A correlation

More information

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

Chapter 16. Simple Linear Regression and dcorrelation

Chapter 16. Simple Linear Regression and dcorrelation Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Non-Gaussian Response Variables

Non-Gaussian Response Variables Non-Gaussian Response Variables What is the Generalized Model Doing? The fixed effects are like the factors in a traditional analysis of variance or linear model The random effects are different A generalized

More information

Math 2311 Written Homework 6 (Sections )

Math 2311 Written Homework 6 (Sections ) Math 2311 Written Homework 6 (Sections 5.4 5.6) Name: PeopleSoft ID: Instructions: Homework will NOT be accepted through email or in person. Homework must be submitted through CourseWare BEFORE the deadline.

More information

Regression Methods for Survey Data

Regression Methods for Survey Data Regression Methods for Survey Data Professor Ron Fricker! Naval Postgraduate School! Monterey, California! 3/26/13 Reading:! Lohr chapter 11! 1 Goals for this Lecture! Linear regression! Review of linear

More information

Lecture 8: Fitting Data Statistical Computing, Wednesday October 7, 2015

Lecture 8: Fitting Data Statistical Computing, Wednesday October 7, 2015 Lecture 8: Fitting Data Statistical Computing, 36-350 Wednesday October 7, 2015 In previous episodes Loading and saving data sets in R format Loading and saving data sets in other structured formats Intro

More information

STA 450/4000 S: January

STA 450/4000 S: January STA 450/4000 S: January 6 005 Notes Friday tutorial on R programming reminder office hours on - F; -4 R The book Modern Applied Statistics with S by Venables and Ripley is very useful. Make sure you have

More information

Stat 5102 Final Exam May 14, 2015

Stat 5102 Final Exam May 14, 2015 Stat 5102 Final Exam May 14, 2015 Name Student ID The exam is closed book and closed notes. You may use three 8 1 11 2 sheets of paper with formulas, etc. You may also use the handouts on brand name distributions

More information

Week 7 Multiple factors. Ch , Some miscellaneous parts

Week 7 Multiple factors. Ch , Some miscellaneous parts Week 7 Multiple factors Ch. 18-19, Some miscellaneous parts Multiple Factors Most experiments will involve multiple factors, some of which will be nuisance variables Dealing with these factors requires

More information

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018 UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018 Work all problems. 60 points needed to pass at the Masters level, 75 to pass at the PhD

More information

Ch 13 & 14 - Regression Analysis

Ch 13 & 14 - Regression Analysis Ch 3 & 4 - Regression Analysis Simple Regression Model I. Multiple Choice:. A simple regression is a regression model that contains a. only one independent variable b. only one dependent variable c. more

More information

Bayesian Model Averaging (BMA) with uncertain Spatial Effects A Tutorial. Martin Feldkircher

Bayesian Model Averaging (BMA) with uncertain Spatial Effects A Tutorial. Martin Feldkircher Bayesian Model Averaging (BMA) with uncertain Spatial Effects A Tutorial Martin Feldkircher This version: October 2010 This file illustrates the computer code to use spatial filtering in the context of

More information

Stat 401B Exam 3 Fall 2016 (Corrected Version)

Stat 401B Exam 3 Fall 2016 (Corrected Version) Stat 401B Exam 3 Fall 2016 (Corrected Version) I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied

More information

(4) 1. Create dummy variables for Town. Name these dummy variables A and B. These 0,1 variables now indicate the location of the house.

(4) 1. Create dummy variables for Town. Name these dummy variables A and B. These 0,1 variables now indicate the location of the house. Exam 3 Resource Economics 312 Introductory Econometrics Please complete all questions on this exam. The data in the spreadsheet: Exam 3- Home Prices.xls are to be used for all analyses. These data are

More information

Analysis of Variance and Co-variance. By Manza Ramesh

Analysis of Variance and Co-variance. By Manza Ramesh Analysis of Variance and Co-variance By Manza Ramesh Contents Analysis of Variance (ANOVA) What is ANOVA? The Basic Principle of ANOVA ANOVA Technique Setting up Analysis of Variance Table Short-cut Method

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

Swarthmore Honors Exam 2012: Statistics

Swarthmore Honors Exam 2012: Statistics Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Linear Regression is a very popular method in science and engineering. It lets you establish relationships between two or more numerical variables.

Linear Regression is a very popular method in science and engineering. It lets you establish relationships between two or more numerical variables. Lab 13. Linear Regression www.nmt.edu/~olegm/382labs/lab13r.pdf Note: the things you will read or type on the computer are in the Typewriter Font. All the files mentioned can be found at www.nmt.edu/~olegm/382labs/

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Solutions to obligatorisk oppgave 2, STK2100

Solutions to obligatorisk oppgave 2, STK2100 Solutions to obligatorisk oppgave 2, STK2100 Vinnie Ko May 14, 2018 Disclaimer: This document is made solely for my own personal use and can contain many errors. Oppgave 1 We load packages and read data

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

STAT 510 Final Exam Spring 2015

STAT 510 Final Exam Spring 2015 STAT 510 Final Exam Spring 2015 Instructions: The is a closed-notes, closed-book exam No calculator or electronic device of any kind may be used Use nothing but a pen or pencil Please write your name and

More information

Classification: Logistic Regression and Naive Bayes Book Chapter 4. Carlos M. Carvalho The University of Texas McCombs School of Business

Classification: Logistic Regression and Naive Bayes Book Chapter 4. Carlos M. Carvalho The University of Texas McCombs School of Business Classification: Logistic Regression and Naive Bayes Book Chapter 4. Carlos M. Carvalho The University of Texas McCombs School of Business 1 1. Classification 2. Logistic Regression, One Predictor 3. Inference:

More information

ORIE 4741: Learning with Big Messy Data. Train, Test, Validate

ORIE 4741: Learning with Big Messy Data. Train, Test, Validate ORIE 4741: Learning with Big Messy Data Train, Test, Validate Professor Udell Operations Research and Information Engineering Cornell December 4, 2017 1 / 14 Exercise You run a hospital. A vendor wants

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Lecture 10: F -Tests, ANOVA and R 2

Lecture 10: F -Tests, ANOVA and R 2 Lecture 10: F -Tests, ANOVA and R 2 1 ANOVA We saw that we could test the null hypothesis that β 1 0 using the statistic ( β 1 0)/ŝe. (Although I also mentioned that confidence intervals are generally

More information

Cherry.R. > cherry d h v <portion omitted> > # Step 1.

Cherry.R. > cherry d h v <portion omitted> > # Step 1. Cherry.R ####################################################################### library(mass) library(car) cherry < read.table(file="n:\\courses\\stat8620\\fall 08\\trees.dat",header=T) cherry d h v 1

More information

Data Analysis 1 LINEAR REGRESSION. Chapter 03

Data Analysis 1 LINEAR REGRESSION. Chapter 03 Data Analysis 1 LINEAR REGRESSION Chapter 03 Data Analysis 2 Outline The Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression Other Considerations in Regression Model Qualitative

More information

Multiple linear regression

Multiple linear regression Multiple linear regression Course MF 930: Introduction to statistics June 0 Tron Anders Moger Department of biostatistics, IMB University of Oslo Aims for this lecture: Continue where we left off. Repeat

More information

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam Booklet of Code and Output for STAD29/STA 1007 Midterm Exam List of Figures in this document by page: List of Figures 1 NBA attendance data........................ 2 2 Regression model for NBA attendances...............

More information