HW1 Roshena MacPherson Feb 1, 2017
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Question 1: In this question we consider some real-life applications of statistical learning.

a) Unsupervised Learning
1) Inferring driving modes (lane changing, lane keeping, merging, etc.) from unlabeled highway data.
2) Threat detection from unlabeled video data.
3) Classifying politicians based on their voting records.

b) Regression
1) Fitting a polynomial to the dynamics of a system. The predictors are the initial conditions and control input; the response is the future state. The application is prediction: we want to be able to predict how the system will react if perturbed.
2) Fitting a function relating student-reported happiness (the response) to predictors such as the number of hours spent working per day, the number of hours of exercise, etc. The goal is inference, because we would like to draw conclusions about correlations between certain behaviors and reported happiness.
3) Fitting a function relating the number of attempts a baby has made at a certain task (predictor) to how well it performs the task (response). The goal is inference, because we would like to draw conclusions about the baby's learning rate.

c) Classification
1) Identifying classes of drivers, such as aggressive, timid, or distracted, based on average number of lane changes per unit time (response) given their age, ethnicity, and car type (predictors).
The goal is prediction, because we would like to predict what sorts of actions a driver will take in the future based on their classification.
2) Determining groups of patients a drug is effective on, based on whether they report feeling worse, the same, or better (response) after being given different amounts of the drug (predictor).
3) Classifying TV shows into genres based on the ratings given to them by different types of viewers (response), given the age, ethnicity, and viewing habits of the viewers (predictors).

Question 2: Explain whether each scenario below is a regression, classification, or unsupervised learning problem, and indicate for each supervised learning scenario whether we are more interested in inference or prediction. Finally, provide n and p.

a) This is a classification problem. We are interested in inference, because the school wants to understand how different elements predict admittance. n is 42,000 (the number of students we have data for) and p is 7.
b) This is an unsupervised learning problem. We are trying to infer subtypes of consumers from unlabeled data. n is 1.5 million (the number of consumers we have data on) and p is 500,000 (the number of products we have data on).
c) This is a regression problem. The outputs (good sell, bad sell, horrible sell, etc.) have a clear ordering, so it makes sense to use regression instead of classification, though classification would also work. We are interested in inference, to understand how these factors affect whether a book will sell well. n is 4,000 (the number of books we have data on) and p is 5.
d) This is a regression problem. We are interested in prediction, because we would like to predict by how much global temperatures will rise in the coming years. n is 116 (the number of years we have data for) and p is 6.

Question 3:
a) The advantage of a very flexible approach for regression is that it allows a variety of underlying effects to be modeled. For instance, a simple linear model will almost certainly not capture what's truly going on, whereas a more flexible model with second- and third-order terms can capture higher-order effects. The disadvantage of a very flexible model is that there are many more parameters to fit, so you can easily end up overfitting to your training data, capturing the noise rather than the underlying trends.

b) If you know that the noise in your system is very low (you have already characterized and calibrated the signal-to-noise ratio of your sensing setup and know the noise is very small), then it may be appropriate to use a very flexible model, since most of the variance we see will be due to actual trends. If we are interested in prediction rather than inference (i.e., we don't care much about the interpretability of the fit) and we have a very large number of samples (n is very large), then a more flexible model is appropriate; an example is building a predictor for which stocks will increase when we have decades of data.

c) If we know there is a decent amount of noise in our system, then we would prefer a less flexible method, to make sure we aren't fitting higher-order terms to the noise. If the number of data points n is small relative to the number of variables p, then we would also want a less flexible method.

Question 4:

library(MASS)
attach(Boston)

a) There are 506 rows and 14 columns. The rows represent the different suburbs.
The columns represent the different predictors that have been measured.

nrow(Boston)
[1] 506
ncol(Boston)
[1] 14

b) In the two plots below I have plotted the weighted mean of distances to five Boston employment centers vs. the proportion of owner-occupied units built prior to 1940 (plot A), and nitrogen oxides concentration vs. the proportion of non-retail business acres per town (plot B). For A, about 55% of the variance is explained by our model (the R^2 value). Considering where our data came from, it's reasonable to assume there is a decent amount of noise in the data, so this R^2 suggests our model is a decent fit. Additionally, the p-value for the slope coefficient is less than 2e-16, meaning it is very unlikely that the null hypothesis is true. For B, about 58% of the variance is explained by our model. Again, this seems like a reasonably good model, considering we do expect our data to be pretty noisy. For the slope coefficient the p-value again is < 2e-16, so it is very unlikely that the null hypothesis is true. We can be fairly certain there is a non-zero relationship between the two variables.

model1 = lm(dis~age)
model2 = lm(nox~indus)
summary(model1)

Call:
lm(formula = dis ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
age                                      <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 504 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 504 DF,  p-value: < 2.2e-16

summary(model2)

Call:
lm(formula = nox ~ indus)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
indus                                    <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 504 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 504 DF,  p-value: < 2.2e-16

plot(age, dis)
abline(model1, col = "blue")
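The R^2 values and slope p-values discussed in part (b) can also be extracted programmatically from the fitted objects rather than read off the printed summaries. A short sketch (refitting the same two regressions on the MASS Boston data):

```r
# Refit the two simple regressions and extract the statistics
# discussed above: R^2 and the slope p-values.
library(MASS)  # provides the Boston data set

model1 <- lm(dis ~ age, data = Boston)
model2 <- lm(nox ~ indus, data = Boston)

r2 <- c(A = summary(model1)$r.squared,
        B = summary(model2)$r.squared)
pvals <- c(A = coef(summary(model1))["age", "Pr(>|t|)"],
           B = coef(summary(model2))["indus", "Pr(>|t|)"])

round(r2, 2)   # roughly 0.55 and 0.58, matching the discussion above
pvals          # both far below any usual significance level
```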
plot(indus, nox)
abline(model2, col = "blue")

c) It appears that all towns with a non-zero per capita crime rate have no residential land zoned for lots over 25,000 sq. ft. So if you have any residential land zoned for lots over 25,000 sq. ft., it is very likely that your crime rate is close to zero. It also appears that almost all towns with a crime rate above 5% per capita have a proportion of non-retail business acres of about 18; any higher or lower than that and the crime rates do not exceed around 5%. If the town is bounded by the Charles River, the crime rate does not exceed about 15% per capita; for towns not bounded by the Charles River, the crime rate spans the whole range. For towns with a nox value of less than about 0.55, the crime rate is essentially 0; above that there doesn't seem to be any relationship. There is a linear relationship between age
and crime rate, with a p-value of < 2e-16, though it seems as if a slightly more flexible model might fit the data better. There appears to be an inverse relationship between crime rate and the weighted mean of distances to five Boston employment centers. It appears that if your town has an index of accessibility to radial highways of less than 25, the crime rate is very likely to be very small. Similarly, if your town has a full property tax rate per $10,000 of less than 650, it is very likely that the crime rate is very small. Also, if your pupil-to-teacher ratio is less than 20, the crime rate is likely to be near zero. There appears to be a linear relationship between the percentage of lower-status population and crime rate (p < 2e-16, R^2 = 0.21). There seems to be an inverse relationship between crime rate and the median value of owner-occupied homes in $1000s.

par(mfrow=c(3,5))
plot(crim~zn)
plot(crim~indus)
plot(crim~chas)
plot(crim~nox)
plot(crim~rm)
plot(crim~age)
plot(crim~dis)
plot(crim~rad)
plot(crim~tax)
plot(crim~ptratio)
plot(crim~black)
plot(crim~lstat)
plot(crim~medv)

plot(zn, crim)
plot(age, crim)
model = lm(crim~age)
summary(model)

Call:
lm(formula = crim ~ age)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-05 ***
age                                        e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 504 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 504 DF,  p-value: 2.855e-16

abline(model, col="blue")
plot(dis, crim)
model = lm(crim~dis)
summary(model)

Call:
lm(formula = crim ~ dis)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
dis                                      <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 504 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 504 DF,  p-value: < 2.2e-16

abline(model, col="blue")
plot(lstat, crim)
model = lm(crim~lstat)
summary(model)

Call:
lm(formula = crim ~ lstat)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-06 ***
lstat                                   < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 504 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: 132 on 1 and 504 DF,  p-value: < 2.2e-16

abline(model, col="blue")
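Part (c) above remarked that a slightly more flexible model might fit the crime-rate-vs.-age relationship better than a straight line. One way to check this (a sketch, not part of the original assignment) is to compare the linear fit against a cubic polynomial with a nested-model F test:

```r
# Compare a linear fit of crim on age against a cubic polynomial.
library(MASS)

fit_lin  <- lm(crim ~ age, data = Boston)
fit_poly <- lm(crim ~ poly(age, 3), data = Boston)

# A small p-value here means the extra polynomial terms help.
anova(fit_lin, fit_poly)

# Overlay the two fits on the scatter plot.
ord <- order(Boston$age)
plot(Boston$age, Boston$crim, xlab = "age", ylab = "crim")
abline(fit_lin, col = "blue")
lines(Boston$age[ord], fitted(fit_poly)[ord], col = "red")
```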
d) There are a few towns with a crime rate per capita above 40, which is not the norm; 96% are in the 0-20 range. For zn, about 75% of towns have a proportion of residential land zoned for lots over 25,000 sq. ft. of less than ten. For indus, the spread is pretty even, from 0 to around 30. For chas, the distribution is obviously binary, as it's a dummy variable that can only be one or zero. For nox, there are a few high outliers. For rm, it looks very much like a Gaussian distribution between 3 and 9 rooms per dwelling. Age is pretty evenly distributed between 0 and 100. There are a few high outliers for dis. About 70% of towns have a rad value <= 8; the other 30% have rad values of 24. Again, about 70% of towns have a tax value <= 437, while the other 30% have values >= 666. ptratio is pretty evenly distributed. The vast majority of towns have a black value between 350 and 400; the rest are evenly distributed from 0 to 350. lstat looks like a slightly lopsided Gaussian, ranging from 0 to 40. medv is similar, ranging from 0 to 50.

par(mfrow=c(3,5))
hist(crim)
hist(zn)
hist(indus)
hist(chas)
hist(nox)
hist(rm)
hist(age)
hist(dis)
hist(rad)
hist(tax)
hist(ptratio)
hist(black)
hist(lstat)
hist(medv)
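Several of the proportions quoted in part d) can be spot-checked directly rather than estimated from the histograms. A short sketch (the thresholds below are my reading of the claims above):

```r
# Spot-check the distributional claims made in part d).
library(MASS)

mean(Boston$crim <= 20)   # share of towns with crime rate in the 0-20 range
mean(Boston$zn < 10)      # share with zn below ten
mean(Boston$rad <= 8)     # share with rad <= 8 (the rest have rad = 24)
table(Boston$chas)        # chas is a 0/1 dummy variable
```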
e) 35 towns bound the Charles River.

sum(chas)
[1] 35

f) The median of ptratio:

median(ptratio)

g) Town numbers 399 and 406 have the smallest median value of owner-occupied homes (tied at 5.0). The crime rates are both pretty high, at 38 and 68. The tax rates are in the high range, at 666 each. ptratio is also at the upper end of the spectrum for each, at 20.2. I would want to look at the values of these predictors across other towns with different median values of owner-occupied homes before drawing conclusions.

idx1 = 399
idx2 = 406
crim[idx1]
crim[idx2]
tax[idx1]
[1] 666
tax[idx2]
[1] 666
ptratio[idx1]
[1] 20.2
ptratio[idx2]
[1] 20.2

h) 64 towns average more than 7 rooms per dwelling; 13 towns average more than 8 rooms per dwelling.

sum(rm>7)
[1] 64
sum(rm>8)
[1] 13
hist(rm)

Question 5

a. Split the data set into a training set and a test set of approximately equal size.

smp_size = nrow(Boston)/2
seed_num = 25
set.seed(seed_num)
train_ind <- sample(seq_len(nrow(Boston)), size = smp_size)
train <- Boston[train_ind, ]
test <- Boston[-train_ind, ]

b. Fit a linear model using least squares on the training set, and report the mean training and mean test error obtained.
model = lm(crim~., train)
summary(model)

Call:
lm(formula = crim ~ ., data = train)

Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
zn
indus
chas
nox
rm
age
dis                                             *
rad                                        e-06 ***
tax
ptratio
black
lstat
medv                                            *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 239 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 13 and 239 DF,  p-value: < 2.2e-16

predicted_vals_test = predict(model, test)
predicted_vals_train = predict(model, train)
res = predicted_vals_test - test$crim
mean(model$residuals^2)
mean(res^2)

c) It appears from our model that the index of accessibility to radial highways is extremely significant, and the median value of owner-occupied homes in $1000s and the weighted mean of distances to five Boston employment centres are also significant, but less so. indus and age no longer appear significant, though this may be because they are correlated with dis, rad, or medv. The R-squared value of our model is 0.42, meaning we have explained 42% of the variance in the system. Considering the MSE went from 36 (training) to 46 (test), which is not too big a jump, I would say our model does a pretty good job of predicting crime rate.

Question 6: The most important predictors in this case are nox, rad, zn, dis, and black. The training misclassification rate is 5% and the test misclassification rate is 14%. This seems to perform much better than the linear regression model when compared to the R^2 values we had.
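As a cross-check on misclassification rates like those quoted above, the same information can be read off a confusion matrix. The sketch below illustrates the mechanics on the full Boston data rather than the train/test split used in the answer:

```r
# Build the binary high-crime response and fit a logistic regression,
# then tabulate predictions against the truth.
library(MASS)

d <- Boston
d$high_crim <- as.numeric(d$crim >= median(d$crim))

fit <- glm(high_crim ~ . - crim, family = binomial, data = d)
pred <- as.numeric(fitted(fit) > 0.5)

conf <- table(predicted = pred, actual = d$high_crim)
conf
1 - sum(diag(conf)) / sum(conf)   # overall misclassification rate
```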
Y = train$crim >= median(train$crim)
Ytest = test$crim >= median(test$crim)
for (i in 1:nrow(train)){
  if(Y[i]){ Y[i] = 1 } else { Y[i] = 0 }
}
for (i in 1:nrow(test)){
  if(Ytest[i]){ Ytest[i] = 1 } else { Ytest[i] = 0 }
}
test_changed_crim = test
test_changed_crim$crim = Ytest
train_changed_crim = train
train_changed_crim$crim = Y
model = glm(crim~., family = binomial, data = train_changed_crim)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model)

Call:
glm(formula = crim ~ ., family = binomial, data = train_changed_crim)

Deviance Residuals:
     Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                    **
zn                                              *
indus
chas
nox                                        e-05 ***
rm
age
dis                                             *
rad                                            **
tax                                             *
ptratio
black                                           *
lstat
medv
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: on 252 degrees of freedom
Residual deviance: on 239 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 10

yhat = model$fit > 0.5
class_rate_train = mean(Y==yhat)
1-class_rate_train
y_test = predict(model, newdata = test_changed_crim, type="response")
y_test = y_test > 0.5
class_rate_test = mean(Ytest==y_test)
1-class_rate_test

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
More informationIntroduction and Single Predictor Regression. Correlation
Introduction and Single Predictor Regression Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Correlation A correlation
More informationLogistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University
Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction
More informationInference for Regression
Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu
More informationChapter 16. Simple Linear Regression and dcorrelation
Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationGlossary. The ISI glossary of statistical terms provides definitions in a number of different languages:
Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the
More informationNon-Gaussian Response Variables
Non-Gaussian Response Variables What is the Generalized Model Doing? The fixed effects are like the factors in a traditional analysis of variance or linear model The random effects are different A generalized
More informationMath 2311 Written Homework 6 (Sections )
Math 2311 Written Homework 6 (Sections 5.4 5.6) Name: PeopleSoft ID: Instructions: Homework will NOT be accepted through email or in person. Homework must be submitted through CourseWare BEFORE the deadline.
More informationRegression Methods for Survey Data
Regression Methods for Survey Data Professor Ron Fricker! Naval Postgraduate School! Monterey, California! 3/26/13 Reading:! Lohr chapter 11! 1 Goals for this Lecture! Linear regression! Review of linear
More informationLecture 8: Fitting Data Statistical Computing, Wednesday October 7, 2015
Lecture 8: Fitting Data Statistical Computing, 36-350 Wednesday October 7, 2015 In previous episodes Loading and saving data sets in R format Loading and saving data sets in other structured formats Intro
More informationSTA 450/4000 S: January
STA 450/4000 S: January 6 005 Notes Friday tutorial on R programming reminder office hours on - F; -4 R The book Modern Applied Statistics with S by Venables and Ripley is very useful. Make sure you have
More informationStat 5102 Final Exam May 14, 2015
Stat 5102 Final Exam May 14, 2015 Name Student ID The exam is closed book and closed notes. You may use three 8 1 11 2 sheets of paper with formulas, etc. You may also use the handouts on brand name distributions
More informationWeek 7 Multiple factors. Ch , Some miscellaneous parts
Week 7 Multiple factors Ch. 18-19, Some miscellaneous parts Multiple Factors Most experiments will involve multiple factors, some of which will be nuisance variables Dealing with these factors requires
More informationUNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018
UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018 Work all problems. 60 points needed to pass at the Masters level, 75 to pass at the PhD
More informationCh 13 & 14 - Regression Analysis
Ch 3 & 4 - Regression Analysis Simple Regression Model I. Multiple Choice:. A simple regression is a regression model that contains a. only one independent variable b. only one dependent variable c. more
More informationBayesian Model Averaging (BMA) with uncertain Spatial Effects A Tutorial. Martin Feldkircher
Bayesian Model Averaging (BMA) with uncertain Spatial Effects A Tutorial Martin Feldkircher This version: October 2010 This file illustrates the computer code to use spatial filtering in the context of
More informationStat 401B Exam 3 Fall 2016 (Corrected Version)
Stat 401B Exam 3 Fall 2016 (Corrected Version) I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied
More information(4) 1. Create dummy variables for Town. Name these dummy variables A and B. These 0,1 variables now indicate the location of the house.
Exam 3 Resource Economics 312 Introductory Econometrics Please complete all questions on this exam. The data in the spreadsheet: Exam 3- Home Prices.xls are to be used for all analyses. These data are
More informationAnalysis of Variance and Co-variance. By Manza Ramesh
Analysis of Variance and Co-variance By Manza Ramesh Contents Analysis of Variance (ANOVA) What is ANOVA? The Basic Principle of ANOVA ANOVA Technique Setting up Analysis of Variance Table Short-cut Method
More informationNature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.
Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences
More informationSwarthmore Honors Exam 2012: Statistics
Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may
More informationHarvard University. Rigorous Research in Engineering Education
Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected
More informationLinear Regression is a very popular method in science and engineering. It lets you establish relationships between two or more numerical variables.
Lab 13. Linear Regression www.nmt.edu/~olegm/382labs/lab13r.pdf Note: the things you will read or type on the computer are in the Typewriter Font. All the files mentioned can be found at www.nmt.edu/~olegm/382labs/
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationSolutions to obligatorisk oppgave 2, STK2100
Solutions to obligatorisk oppgave 2, STK2100 Vinnie Ko May 14, 2018 Disclaimer: This document is made solely for my own personal use and can contain many errors. Oppgave 1 We load packages and read data
More informationSection 3: Simple Linear Regression
Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction
More informationSTAT 510 Final Exam Spring 2015
STAT 510 Final Exam Spring 2015 Instructions: The is a closed-notes, closed-book exam No calculator or electronic device of any kind may be used Use nothing but a pen or pencil Please write your name and
More informationClassification: Logistic Regression and Naive Bayes Book Chapter 4. Carlos M. Carvalho The University of Texas McCombs School of Business
Classification: Logistic Regression and Naive Bayes Book Chapter 4. Carlos M. Carvalho The University of Texas McCombs School of Business 1 1. Classification 2. Logistic Regression, One Predictor 3. Inference:
More informationORIE 4741: Learning with Big Messy Data. Train, Test, Validate
ORIE 4741: Learning with Big Messy Data Train, Test, Validate Professor Udell Operations Research and Information Engineering Cornell December 4, 2017 1 / 14 Exercise You run a hospital. A vendor wants
More informationUnit 10: Simple Linear Regression and Correlation
Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for
More informationLecture 10: F -Tests, ANOVA and R 2
Lecture 10: F -Tests, ANOVA and R 2 1 ANOVA We saw that we could test the null hypothesis that β 1 0 using the statistic ( β 1 0)/ŝe. (Although I also mentioned that confidence intervals are generally
More informationCherry.R. > cherry d h v <portion omitted> > # Step 1.
Cherry.R ####################################################################### library(mass) library(car) cherry < read.table(file="n:\\courses\\stat8620\\fall 08\\trees.dat",header=T) cherry d h v 1
More informationData Analysis 1 LINEAR REGRESSION. Chapter 03
Data Analysis 1 LINEAR REGRESSION Chapter 03 Data Analysis 2 Outline The Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression Other Considerations in Regression Model Qualitative
More informationMultiple linear regression
Multiple linear regression Course MF 930: Introduction to statistics June 0 Tron Anders Moger Department of biostatistics, IMB University of Oslo Aims for this lecture: Continue where we left off. Repeat
More informationBooklet of Code and Output for STAD29/STA 1007 Midterm Exam
Booklet of Code and Output for STAD29/STA 1007 Midterm Exam List of Figures in this document by page: List of Figures 1 NBA attendance data........................ 2 2 Regression model for NBA attendances...............
More information