Statistical Prediction
1 Statistical Prediction
P.R. Hahn
Fall
2 Some terminology
The goal is to use data to find a pattern that we can exploit.
y: response / outcome / dependent / left-hand-side
x: predictor / covariate / feature / independent
f(x): regression / function estimation / prediction / forecasting / classification / curve fitting
The pattern is statistical in the sense that it holds only approximately.
3 Signal + noise
y = f(x) + ε
(scatterplot of y against x)
4 Minimize the average size of the noise
Among a certain function class (such as lines or polynomials), minimize the average deviation from the curve. Squared distance is common (mean squared error), and as we get more data, our predictions get better.
y = α + xβ + ε
f <- function(x, alpha, beta){
  fx <- alpha + beta*x
  return(fx)
}
MSE = (1/n) Σ_i (y_i - f̂(x_i))²
mse <- function(y, fhat){
  return(mean((y - fhat)^2))
}
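As a quick illustration of the last claim, here is a minimal sketch (assuming the f() and mse() helpers above, and using simulated rather than real data): the estimated intercept and slope get closer to the truth as the sample size grows, while the MSE settles near the noise variance.
set.seed(1)
alpha <- 2; beta <- 0.5
for (n in c(50, 500, 5000)) {
  x <- runif(n, 0, 10)
  y <- alpha + beta*x + rnorm(n)     # signal + noise
  fit <- lm(y ~ x)                   # least-squares line
  cat("n =", n,
      " alpha_hat =", round(coef(fit)[[1]], 3),
      " beta_hat =", round(coef(fit)[[2]], 3),
      " MSE =", round(mse(y, f(x, coef(fit)[[1]], coef(fit)[[2]])), 3), "\n")
}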
5 Load the auto data
auto <- read.csv("auto.csv")
str(auto)
'data.frame': 397 obs. of 9 variables:
 $ mpg         : num
 $ cylinders   : int
 $ displacement: num
 $ horsepower  : Factor w/ 94 levels "?","100","102",..:
 $ weight      : int
 $ acceleration: num
 $ year        : int
 $ origin      : int
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..:
Next, we need to clean the data.
auto[auto == "?"] <- NA
auto$horsepower <- as.numeric(as.character(auto$horsepower))  # via as.character, so we get the values rather than the factor codes
auto <- auto[complete.cases(auto),]
auto$origin <- as.factor(auto$origin)
6 Examine the correlations
names(auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"
[5] "weight"       "acceleration" "year"         "origin"
[9] "name"
cormat <- cor(auto[,-c(8,9)])
print(round(matrix(as.numeric(cormat),7,7),2))
(the 7 x 7 matrix of pairwise correlations among the numeric variables is printed here)
7 Examine the scatterplots
plot(auto[,c("mpg","cylinders","displacement","weight")])
(pairs plot of mpg, cylinders, displacement, and weight)
8 Write a function, optimize it
tempf <- function(parms){
  alpha <- parms[1]
  beta <- parms[2]
  y <- auto$mpg
  fhat <- f(auto$weight, alpha, beta)
  return(mse(y, fhat))
}
optim(c(0,0), tempf)
$par
$value
$counts
function gradient
     129       NA
$convergence
[1] 0
$message
NULL
9 Simple linear regression
We use the lm() function to predict mpg using weight.
fit <- lm(mpg ~ weight, data = auto)
summary(fit)
Call:
lm(formula = mpg ~ weight, data = auto)
Residuals:
    Min      1Q  Median      3Q     Max
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
weight                                   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 390 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 390 DF, p-value: < 2.2e-16
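A sanity check worth doing here (a small sketch, assuming the tempf() objective from the previous slide is still defined): the (alpha, beta) found numerically by optim() should essentially match the least-squares coefficients that lm() computes in closed form.
opt <- optim(c(0, 0), tempf)
opt$par                               # (alpha, beta) from the numerical search
coef(lm(mpg ~ weight, data = auto))   # the same quantities from lm()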
10 Now we plot the fit
plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
abline(a=fit$coefficients[1], b=fit$coefficients[2], col='red', lwd=3)
(scatterplot of mpg against weight with the fitted line drawn in red)
11 Now it is your turn
Fit single-variable linear models using cylinders, displacement, and year. Which one looks best?
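One possible starting point for this exercise (a sketch, not the only way): fit each single-predictor model in a loop and compare residual standard errors, where smaller is better.
for (v in c("cylinders", "displacement", "year")) {
  fit_v <- lm(reformulate(v, response = "mpg"), data = auto)   # builds mpg ~ v
  cat(v, ": sigma =", round(summary(fit_v)$sigma, 3), "\n")
}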
12 Can the linear trend be improved?
Here we add nonlinear features within a linear model.
fit_nl <- lm(mpg ~ poly(weight, 2, raw=TRUE), data=auto)
plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
points(auto$weight, fit_nl$fitted.values, col='red', pch=20)
(scatterplot of mpg against weight with the quadratic fit overlaid in red)
13 The quadratic model is better
summary(fit_nl)
Call:
lm(formula = mpg ~ poly(weight, 2, raw = TRUE), data = auto)
Residuals:
    Min      1Q  Median      3Q     Max
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                                                < 2e-16 ***
poly(weight, 2, raw = TRUE)1                               < 2e-16 ***
poly(weight, 2, raw = TRUE)2                                        ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 389 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 389 DF, p-value: < 2.2e-16
Now you: compare the degree 1, 2, and 3 polynomial models in terms of the residual standard error.
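For the comparison asked for above, a short sketch that loops over the polynomial degree:
for (d in 1:3) {
  fit_d <- lm(mpg ~ poly(weight, d, raw = TRUE), data = auto)
  cat("degree", d, ": residual standard error =",
      round(summary(fit_d)$sigma, 3), "\n")
}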
14 The predictive impact of a variable now depends on the level
Here we use the predict() function.
auto$weight[1]
[1] 3504
temp <- auto[1,]
temp$weight <- temp$weight
predict(fit_nl, newdata = temp) - predict(fit_nl, newdata = auto[1,])
temp <- auto[1,]
temp$weight <- temp$weight
predict(fit_nl, newdata = auto[1,]) - predict(fit_nl, newdata = temp)
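To make the level-dependence concrete, here is a sketch that perturbs weight by an assumed increment of 100 lbs (the 100 is an illustrative choice, not taken from the slide) at both a light car and a heavy car; the predicted change in mpg differs because the fit is quadratic.
delta <- 100                                  # assumed increment, for illustration
for (i in c(which.min(auto$weight), which.max(auto$weight))) {
  base   <- auto[i, ]
  bumped <- base
  bumped$weight <- bumped$weight + delta
  cat("weight", base$weight, ": change in predicted mpg =",
      round(predict(fit_nl, newdata = bumped) - predict(fit_nl, newdata = base), 3), "\n")
}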
15 What if more than one thing matters?
Figure 1: With multiple variables we are now fitting a (hyper)plane.
16 The lm() function handles this.
fit_mlr <- lm(mpg ~ . - name - origin + as.factor(origin) + poly(weight,2) - weight, data=auto)
summary(fit_mlr)
Call:
lm(formula = mpg ~ . - name - origin + as.factor(origin) + poly(weight, 2) - weight, data = auto)
Residuals:
    Min      1Q  Median      3Q     Max
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     < 2e-16 ***
cylinders
displacement
horsepower
acceleration                                            *
year                                            < 2e-16 ***
as.factor(origin)2                                      ***
as.factor(origin)3                                      *
poly(weight, 2)1                                < 2e-16 ***
poly(weight, 2)2                                < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 382 degrees of freedom
Multiple R-squared: 0.857, Adjusted R-squared:
F-statistic: on 9 and 382 DF, p-value: < 2.2e-16
17 Actual-vs-Predicted plots
plot(auto$mpg, fit_mlr$fitted.values, pch=20, main='',
     xlab='mpg', ylab='predicted mpg')
abline(0, 1, col='red')
(scatterplot of predicted mpg against actual mpg, with the 45-degree line in red)
18 Interactions
fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data=auto)
round(coef(summary(fit_mlr2)), 4)
(coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|) and rows:)
(Intercept)
poly(weight, 2)1
poly(weight, 2)2
origin2
origin3
year
acceleration
poly(weight, 2)1:origin2
poly(weight, 2)2:origin2
poly(weight, 2)1:origin3
poly(weight, 2)2:origin3
year:acceleration
origin2:year
origin3:year
origin2:acceleration
origin3:acceleration
summary(fit_mlr2)$sigma
summary(fit_mlr2)$adj.r.squared
19 Training and test (validation) sets
n <- nrow(auto)
ntest <- floor(0.2*n)
ntrain <- n - ntest
testrows <- sample(1:n, ntest, replace = FALSE)
auto_test <- auto[testrows,]
auto_train <- auto[-testrows,]
fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data = auto_train)
summary(fit_mlr2)$sigma
yhat <- predict(fit_mlr2, newdata = auto_test)
sqrt(mse(auto_test$mpg, yhat))
fit_mlr <- lm(mpg ~ poly(weight,2) + year, data=auto_train)
summary(fit_mlr)$sigma
yhat <- predict(fit_mlr, newdata = auto_test)
sqrt(mse(auto_test$mpg, yhat))
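Because these numbers depend on which rows happen to land in the test set, it can help to repeat the split several times and average. A sketch (assuming the mse() helper and the n, ntest values defined above):
set.seed(123)
rmse_big <- rmse_small <- numeric(20)
for (r in 1:20) {
  test <- sample(1:n, ntest, replace = FALSE)
  tr <- auto[-test, ]; te <- auto[test, ]
  m_big   <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data = tr)
  m_small <- lm(mpg ~ poly(weight,2) + year, data = tr)
  rmse_big[r]   <- sqrt(mse(te$mpg, predict(m_big,   newdata = te)))
  rmse_small[r] <- sqrt(mse(te$mpg, predict(m_small, newdata = te)))
}
c(big_model = mean(rmse_big), small_model = mean(rmse_small))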
20 Classification
auto$highmpg <- auto$mpg > as.numeric(quantile(auto$mpg, 0.25))
plot(auto$weight, auto$highmpg, main='', xlab='weight', ylab='high MPG')
a <- tapply(auto$weight, cut(auto$weight, seq(1500,5000,by=500)), mean)
b <- tapply(auto$highmpg, cut(auto$weight, seq(1500,5000,by=500)), mean)
points(a, b, col='red', pch=20, cex=2)
(plot of the high-MPG indicator against weight, with binned means in red)
21 Decision boundary
plot(auto$weight[auto$highmpg], auto$year[auto$highmpg], pch=20,
     col='red', xlim=range(auto$weight), ylim = range(auto$year),
     xlab = "weight", ylab = "year")
points(auto$weight[!auto$highmpg], auto$year[!auto$highmpg], pch=20,
       col='cyan')
(scatterplot of year against weight, high-MPG cars in red, the rest in cyan)
22 Logistic regression
These three expressions are all the same:
Pr(y = 1 | x) = exp(α + xβ) / (1 + exp(α + xβ))
Pr(y = 1 | x) = 1 / (1 + exp(-(α + xβ)))
Pr(y = 1 | x) = (1 + exp(-(α + xβ)))^(-1)
A model like this is a type of generalized linear model (GLM).
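A quick numerical check (a sketch with arbitrary made-up values of alpha, beta, and x) confirms that the three expressions agree:
alpha <- -1; beta <- 0.5
x <- seq(-3, 3, by = 0.5)
p1 <- exp(alpha + x*beta) / (1 + exp(alpha + x*beta))
p2 <- 1 / (1 + exp(-(alpha + x*beta)))
p3 <- (1 + exp(-(alpha + x*beta)))^(-1)
all.equal(p1, p2)
all.equal(p2, p3)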
23 Fitting a logistic regression
We use the glm() function.
fit_glm <- glm(highmpg ~ weight + year, family = binomial, data = auto)
summary(fit_glm)
Call:
glm(formula = highmpg ~ weight + year, family = binomial, data = auto)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)                                       *
weight                                    e-13 ***
year                                      e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: on 391 degrees of freedom
Residual deviance: on 389 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 8
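To see how well this fits, a small sketch that thresholds the fitted probabilities at 0.5 and tabulates predictions against the truth (an in-sample check, so it will look optimistic):
pred <- fit_glm$fitted.values > 0.5
table(predicted = pred, actual = auto$highmpg)
mean(pred == auto$highmpg)   # in-sample accuracy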
24 Plot the fitted boundary
Let's do some algebra: the fitted boundary is where the predicted probability equals 1/2, i.e. where α + β_weight*weight + β_year*year = 0, which gives year = -(α + β_weight*weight) / β_year.
w <- seq(1500, 5000)
yr <- (fit_glm$coefficients[1] + fit_glm$coefficients[2]*w) / (-fit_glm$coefficients[3])
lines(w, yr, col='magenta', lwd=3)
(the weight-by-year scatterplot with the fitted boundary drawn in magenta)
25 ROC curves: How sharp is the decision boundary?
The receiver operating characteristic curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
FPR = #(incorrectly predicted positives) / (total # negatives)
TPR = #(correctly predicted positives) / (total # positives)
26 ROC in R
simple_roc <- function(labels, scores){
  labels <- labels[order(scores, decreasing=TRUE)]
  data.frame(TPR=cumsum(labels)/sum(labels),
             FPR=cumsum(!labels)/sum(!labels),
             labels)
}
temp <- simple_roc(auto$highmpg, fit_glm$fitted.values)
plot(temp[,2:1], pch=20)
(plot of TPR against FPR)
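A common one-number summary of this curve is the area under it (AUC). Here is a sketch that approximates the AUC with the trapezoid rule applied to the simple_roc() output:
roc <- simple_roc(auto$highmpg, fit_glm$fitted.values)
auc <- sum(diff(roc$FPR) * (head(roc$TPR, -1) + tail(roc$TPR, -1)) / 2)
auc   # 1 is perfect separation, 0.5 is random guessing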
27 We can add nonlinear terms and interactions, too
fit_glm2 <- glm(highmpg ~ poly(weight,5,raw=TRUE) + year,
                family = binomial, data = auto)
yr <- (fit_glm2$coefficients[1] +
       fit_glm2$coefficients[2:6] %*% t(cbind(w, w^2, w^3, w^4, w^5))) /
      (-fit_glm2$coefficients[7])
lines(w, yr, col='magenta', lwd=3)
(the weight-by-year scatterplot again, now with the degree-5 boundary in magenta)
28 Feature design
The process of constructing transformations and interactions is commonly referred to as feature design. It is more art than science, but can be really effective. I have heard it called the single most important part of applied predictive modeling. It is also difficult and ad hoc.
29 Regression and classification trees
Figure 2: Regression and classification trees automate feature design.
30 Regression and classification trees
Figure 3: Regression and classification trees partition the feature space.
31 Regression and classification trees
Figure 4: They fit a piecewise constant response surface.
32 Regression and classification trees
Figure 5: We grow the tree very deep, then prune it.
33 rpart()
library(rpart); library(rpart.plot)
auto <- auto[,-9]   # get rid of the names column
fit_tree <- rpart(mpg ~ . - highmpg, data=auto)
sqrt(mse(auto$mpg, predict(fit_tree)))
rpart.plot(fit_tree)
(tree diagram from rpart.plot: the root split is on displacement >= 190, with further splits on displacement, weight, and year; leaf predictions range from about 19 to 36 mpg)
34 Exercise
fit_tree <- rpart(mpg ~ ., data=auto_train)
sqrt(mse(auto_test$mpg, predict(prune(fit_tree, cp=0), newdata = auto_test)))
sqrt(mse(auto_train$mpg, predict(prune(fit_tree, cp=0), newdata = auto_train)))
Use the auto_test and auto_train subsets to try out different settings of the cp parameter. Try different splits of the data as well. Do the results change?
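One way to start on this exercise (a sketch: grow a deep tree first, then prune it back at several cp values and compare test error; assumes the mse() helper and the train/test split from before):
deep <- rpart(mpg ~ ., data = auto_train, control = rpart.control(cp = 0))
for (cp in c(0, 0.001, 0.01, 0.05)) {
  pruned <- prune(deep, cp = cp)
  cat("cp =", cp, ": test RMSE =",
      round(sqrt(mse(auto_test$mpg, predict(pruned, newdata = auto_test))), 3), "\n")
}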
35 Random forests and randomForest()
Regression trees are automated and interpretable, but sometimes they are not smooth enough. A random forest combines many different tree fits to get a smoother response surface. It is harder to visualize, but often gives better predictions.
library(randomForest)
randomForest
Type rfnews() to see new features/changes/bug fixes.
auto_train <- auto_train[,-9]   # get rid of names again
auto_train <- auto_train[,-9]
fit_rf <- randomForest(mpg ~ ., data=auto_train)
sqrt(mse(auto_train$mpg, predict(fit_rf, newdata = auto_train)))
sqrt(mse(auto_test$mpg, predict(fit_rf, newdata = auto_test)))
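Although the forest itself is hard to visualize, randomForest does report how much each predictor is used. A short sketch (for a default regression forest, the importance measure reported here is the total decrease in node impurity attributed to each variable):
importance(fit_rf)    # impurity-based importance for each predictor
varImpPlot(fit_rf)    # the same information as a dot plot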
36 Other models/algorithms
Boosted regression trees
Support vector machines
Neural networks (deep learning)
Gaussian processes
There are R package implementations in many cases:
svm() from the e1071 package
nnet() from the nnet package
gbm() from the gbm package
etc.
These methods differ in the representation of the prediction function (response surface).
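As one illustration of these alternatives (a sketch, evaluated the same way as the earlier models and assuming auto_train, auto_test, and the mse() helper are still in the workspace): support vector regression via svm() from the e1071 package.
library(e1071)
fit_svm <- svm(mpg ~ ., data = auto_train)   # default radial-kernel SVM regression
sqrt(mse(auto_test$mpg, predict(fit_svm, newdata = auto_test)))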
37 An aside about neural networks vs. tree methods
Recently neural networks, specifically deep learning networks, have received a lot of press, setting new standards for classification tasks in speech and video domains. It is still unclear, theoretically, what explains their huge success; in practice, the answer seems to be that they fit very complicated functions/surfaces. They seem to work well in low-noise settings with highly complicated decision boundaries. For data with lots of unmeasured factors, leading to high measurement error, neural networks seem to work less well and tree methods seem to work better. Just my two cents...
38 Concept and tool recap
Concepts:
in-sample vs. out-of-sample fits
overfitting
linear vs. nonlinear functions
regression vs. classification
feature design vs. non-parametric regression
Tools:
lm()
optim()
glm()
rpart()
randomForest()
39 Credits
I swiped several pictures from the ISLR book.