17. Introduction to Tree and Neural Network Regression

As we have repeated often throughout this book, the classical multiple regression model is clearly wrong in many ways. One way that the model is wrong is in the assumption of a normal distribution. Instead, the distribution of Y | X = x is correctly given as p(y | x), where p(y | x) can be any distribution (recall that x = (x1, x2, ..., xk)). Another way that the classical model is wrong is in the assumption that the conditional mean function is given by

E(Y | X1 = x1, X2 = x2, ..., Xk = xk) = β0 + β1x1 + β2x2 + ... + βkxk.

Instead, the conditional mean function (assuming the mean is finite) is correctly given as

E(Y | X1 = x1, X2 = x2, ..., Xk = xk) = f(x1, x2, ..., xk),

where f(x1, x2, ..., xk) can have any shape whatsoever in (k + 1)-dimensional space, from curved hyperplanes, to functions with bumps, to discontinuous functions, depending upon the nature of the process that you are studying. As we noted in Chapter 1, Section 1.7, even the simple linear model f(x1) = β0 + β1x1 is usually wrong: we gave a logical argument that there must be some curvature as long as the X1 variable has at least three levels, and as long as there is dependence between Y and X1. This argument becomes even more relevant in multidimensional space, where the curvature can take many forms, including twists, turns, and bumps.

The goal of neural network and tree regression is to estimate f(x1, x2, ..., xk) as a general, nonlinear (or non-planar) function. In particular, these methods allow, and can easily find, strong interactions between various combinations of the X variables, which can be very tricky to tease out using the classical regression model with interaction terms. By allowing greater generality and flexibility than the restrictive classical (planar) model f(x1, x2, ..., xk) = β0 + β1x1 + β2x2 + ... + βkxk, these models can, in some cases, also give you predicted values Ŷ that tend to be closer to future Y values than those from classical linear models.

Since these methods attempt to find unusual features (twists, turns, and bumps) of the function f(x1, x2, ..., xk) in high-dimensional space, they typically require large amounts of data to estimate such features. With the data revolution, such large data sets are routinely available. Thus, it has become nearly a default position in data science to analyze large data sets using tree and neural network regressions rather than the classical regression model.

Tree Regression

Not only does tree regression allow complex functional relationships, it also provides output that is very easy to interpret, even easier than that of the classical linear model. Further, the output of tree regression models can provide a clear recipe for action, e.g., in the determination of profiles (particular combinations of X values) that are associated with fraudulent activity.

To start, let us show an example of what tree regression can do for you. We will use a data set from a survey of n = 1,020 individuals who were asked how they pronounce "data": either as "day-tuh" or "daa-tuh." The raw data show 653/1020 = 64.0% "day-tuh" and 367/1020 = 36.0%

"daa-tuh." As in all regression, we would like to know how this distribution (64.0% "day-tuh," 36.0% "daa-tuh") changes for particular values of the X variables. In addition to the "day-tuh"/"daa-tuh" pronunciation, the survey data also contain demographic information such as gender, age, region of the U.S., and other variables.

You can use the rpart function to construct tree regression estimates; rpart is short for "recursive partitioning," which describes the algorithm that is used. Code to fit a tree model is as follows:

pron = read.csv("...")   # the data URL is truncated in the source
attach(pron)
Y = ifelse(Q4 == 1, 1, 0)
library(rpart)
ft = rpart(Y ~ Q1 + Q2 + Q3 + Q5 + Q6 + Q7 + Q8 + Q9 + Q11 + Q12 + Q13 + Age + Gender)
plot(ft)
text(ft)

The results are shown in Figure 17.1.

Figure 17.1. Results of tree regression to predict pronunciation of "data." Numbers below the nodes are proportions of "day-tuh" pronunciation.
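If the plotted tree is hard to read, you can also print the fitted object; rpart's standard print method lists the same splits in text form. A minimal sketch, assuming the code above has been run:

print(ft)   # each row shows a split rule, n, the deviance, and the node mean of Y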

To interpret Figure 17.1, note the following:

- There is an automatic variable-selection algorithm inside the software, and it picked only the variables Q7, Q2, and Age as predictors of pronunciation. The variable Q7 is colleague pronunciation, with Q7 = 1 indicating that the colleague pronounces "day-tuh," and Q7 = 2 indicating that the colleague pronounces "daa-tuh."

- The first split is on Q7. The left side of the split corresponds to truth of Q7 ≥ 1.5, which means that Q7 = 2, or colleague pronounces "daa-tuh." The right side corresponds to falseness of Q7 ≥ 1.5, which means that Q7 = 1, or colleague pronounces "day-tuh."

- Below the right split there is just one entry, 0.8686, which is the mean of Y for this condition. You can verify this as mean(Y[Q7 < 1.5]), which gives 0.8686. Since the means of 0/1 data are proportions, this result means that 86.86% of the respondents who said their colleagues say "day-tuh" also say "day-tuh."

- Down the left-hand split, which initially considers only people whose colleagues say "daa-tuh," the next split involves Q2, which is region; it takes the values 1, 2, 3, 4 for the West, South, Midwest, and Northeast regions, respectively. The left split again indicates truth of Q2 < 3.5, meaning that the left split contains data from regions 1, 2, and 3. The entry 0.2 means that 20% of the subset determined by (i) colleague says "daa-tuh" and (ii) person lives in region 1, 2, or 3 pronounce "day-tuh." In other words, in this subset, 80% pronounce "data" as their colleagues do. To verify this result manually, enter the command mean(Y[(Q7 >= 1.5) & (Q2 < 3.5)]).

- The right-hand split on Q2 < 3.5 indicates that the condition is false, so that Q2 ≥ 3.5. With these data, the only possibility is Q2 = 4, i.e., people who live in the Northeast region. Among these people (whose colleagues say "daa-tuh," because of the higher-level split), younger people are likely to ignore their colleagues (57.78% say "day-tuh"), but older people are likely to mimic their colleagues (18.42% say "day-tuh").

This is an example of the kind of high-level interaction that tree regression can tease out: the effect of a colleague's pronunciation of "data" depends on both region and age, a three-way interaction. (Self-study problem: find the percentages 57.78% and 18.42% by hand. A sketch after the next paragraph shows one way to check your answers in R.)

As you can see, tree regression is very useful for identifying "pockets," defined by the X variables, where the Y variable differs greatly from pocket to pocket. If the Y variable were an indicator of fraudulent activity, the pockets of observations where Y is high, as shown by the tree output, would identify specific observations to investigate further.
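Here is one way to check the self-study answers in R. The Age cutoff is not given in the text, so the value below is only a placeholder, which you must replace with the split value printed at the Age node of Figure 17.1:

age.cut = 40   # placeholder only; read the actual Age split value off the tree
mean(Y[(Q7 >= 1.5) & (Q2 >= 3.5) & (Age < age.cut)])    # should give about 0.5778
mean(Y[(Q7 >= 1.5) & (Q2 >= 3.5) & (Age >= age.cut)])   # should give about 0.1842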

To understand what the tree regression model is doing in more detail, consider the charity data set again, and consider the prediction of Y = charitable contributions (in log scale) as a function of a single X variable, income (again in log scale). The following R code shows the default analysis from rpart, as well as a simplified analysis that will facilitate explanation.

## Initial analysis.
char = read.csv("...")   # the data URL is truncated in the source
attach(char)
library(rpart)
fit1 = rpart(CHARITY ~ INCOME)
plot(fit1); text(fit1, cex=.7)

This code produces the tree shown in Figure 17.2.

Figure 17.2. Tree regression using rpart to predict log charitable contributions in terms of log income (denoted INCOME in the figure).

Figure 17.2 tells you that, as you might expect, charitable contributions tend to be higher among people with higher incomes. But rather than express that relationship as a linear function, as in the classical regression model, the tree regression expresses it in terms of ranges of values of INCOME where contributions are either lower or higher. To understand the method further, consider the following code, which provides a simplified analysis; the maxdepth=1 option restricts the tree to a single split.

## Simplified analysis.
char = read.csv("...")
attach(char)
library(rpart)
fit2 = rpart(CHARITY ~ INCOME, maxdepth=1)
plot(fit2); text(fit2, cex=.7)

Figure 17.3. Tree regression using rpart with maxdepth=1 to predict log charitable contributions in terms of log income (denoted INCOME in the figure).

In Figure 17.3, notice that we are just looking at the first split from Figure 17.2. The split is the same, at INCOME < 10.91. Notes on these two analyses are as follows:

- The split value, 10.91, is chosen to maximize the separation in the Y variable (here, log charitable contributions) between the two groups. Different measures of separation are possible, such as the F statistic or the log-likelihood statistic. For categorical responses, the default measure used by rpart is somewhat unusual (the Gini index); for a continuous response such as this one, rpart's default is the reduction in the sum of squared errors. (A sketch of split selection appears after this list.)

- The method is called recursive partitioning because the further splits shown in Figure 17.2 use the same splitting method as the first split, except that the method is applied to subsets of the data. For example, after the split on INCOME < 10.91, the subset of data where INCOME < 10.91 is subjected to the same splitting algorithm, which finds the next optimal split at INCOME < 10.32, as shown in Figure 17.2. The right-hand side of that split indicates INCOME ≥ 10.32, but since this split is down the path where INCOME < 10.91, the actual range of INCOME in this group is 10.32 ≤ INCOME < 10.91.

- The predicted values (Ŷ) from the model are constant over the resulting intervals, as shown in the following code and in Figure 17.4.
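To make the split-selection idea concrete before turning to the plotting code, here is a minimal sketch, not rpart's actual implementation, that scores every candidate cut point by the total within-group sum of squared errors and returns the best single split:

## Brute-force search for the best single split of y on x: try each midpoint
## between adjacent x values and keep the cut that minimizes
## SSE(left group) + SSE(right group), i.e., maximizes separation.
best.split = function(x, y) {
  xs   = sort(unique(x))
  cuts = (head(xs, -1) + tail(xs, -1)) / 2
  sse  = sapply(cuts, function(ct)
    sum((y[x < ct] - mean(y[x < ct]))^2) +
    sum((y[x >= ct] - mean(y[x >= ct]))^2))
  cuts[which.min(sse)]
}
best.split(INCOME, CHARITY)   # should land near the 10.91 split found by rpart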

R Code for Figure 17.4:

Income.plot = seq(8.5, 12, .001)
Income.pred = data.frame(Income.plot)
names(Income.pred) = c("INCOME")
Yhat1 = predict(fit1, Income.pred)
Yhat2 = predict(fit2, Income.pred)
par(mfrow=c(1,2))
plot(INCOME, CHARITY, pch=".", cex=1.5)
points(Income.plot, Yhat1, type="l", lwd=2)
abline(lsfit(INCOME, CHARITY), lwd=1.5, lty=2)
plot(INCOME, CHARITY, pch=".", cex=1.5)
points(Income.plot, Yhat2, type="l", lwd=2)
abline(lsfit(INCOME, CHARITY), lwd=1.5, lty=2)

Figure 17.4. Scatterplots of (ln(Income), ln(Charitable Contributions)), with predictions of mean charitable contributions using tree regression (solid lines) and ordinary least squares (dashed lines). Left panel: default tree from rpart. Right panel: tree pruned to depth = 1.

As Figure 17.4 shows, the tree regression estimates of the conditional mean function are flat line segments over interval ranges. Clearly, these functions are not very good as estimates of the true mean function, because Nature favors continuity over discontinuity. On the other hand, the results are very simple to use and interpret.

In multiple regression, the tree functions are not flat lines but flat planes or hyperplanes, defined over various rectangular regions determined by combinations of the X variables. Picture a city building that has been built in pieces, with different parts added on over time, all with different heights. That's what the tree regression function looks like in multiple regression.
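You can also confirm the flat-segment structure numerically: a fitted tree produces exactly one predicted value per terminal node. A quick sketch, assuming fit1 and fit2 are still in the workspace:

sort(unique(predict(fit1)))   # a handful of distinct values, one per terminal node of Figure 17.2
sort(unique(predict(fit2)))   # just two values for the single-split tree of Figure 17.3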

The most interesting and useful applications of tree regression occur when there are many X variables, and where the nature of the relationship between Y and the X variables is highly nonlinear, involving high-level interactions. With simpler, well-behaved examples, such as the regression of CHARITY on INCOME shown above, the simple linear regression is obviously much better, as Figure 17.4 shows.

Neural Network Regression

Neural network regression has the same goal as tree regression; namely, to estimate conditional mean functions that are highly nonlinear, perhaps also involving high-level interactions. An advantage of neural networks over trees is that the functions are estimated as continuous functions rather than as discontinuous flat-line (or flat-plane) segments. Thus, neural network estimates are generally more realistic than tree estimates, because, again, Nature generally favors continuity over discontinuity.

A disadvantage of neural network estimates is that the results are not simple to interpret. Trees are the easiest to interpret, classical regression is next easiest, and neural networks are the hardest. Thus, instead of trying to interpret the results, users of neural networks usually just treat them as a black box that produces a predicted mean, Ŷ = f̂(x1, x2, ..., xk). Like LOESS, the neural network regression function f̂(x1, x2, ..., xk) has a more complicated form than other regression functions, and researchers therefore do not ordinarily examine its functional form.

There are many practical applications where you really do not care about the functional form of Ŷ; all you care about is getting a Ŷ that is close to Y. Examples include:

- The predicted range of your car (in miles or kilometers), which predicts how far your car can travel, given the current fuel in the tank and current (and recent) fuel economy.

- The predicted EKG reading of a heart patient, given current (more simply obtained) readings of pulse, sweat, and blood pressure.

- The creditworthiness of a customer who is applying for a loan (think of your credit score).

These examples are all cases where you want a good prediction of Y, but you don't necessarily care how that prediction was obtained. In such cases, neural network regression can be useful.

Universal Approximators

First, a word about function approximation. A main goal in regression is to approximate the conditional mean function E(Y | X1 = x1, X2 = x2, ..., Xk = xk) = f(x1, x2, ..., xk). As noted repeatedly throughout this book, linear, quadratic, interaction, exponential, and other functions are nearly always different from the true f(x1, x2, ..., xk). But some types of functions can be made arbitrarily close to f(x1, x2, ..., xk); such functions are called universal approximators.
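Stated formally (a standard definition, paraphrased rather than quoted from this book): a class of functions G is a universal approximator over a bounded set X if, for every target function f and every tolerance ε > 0, there is a g in G with

sup_{x in X} |f(x) − g(x)| < ε,

that is, g stays uniformly within ε of f over the whole set. (For the Stone-Weierstrass Theorem discussed next, the target f is assumed continuous and the set X closed and bounded.)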

Polynomial functions are universal approximators. That is, given any continuous function f(x1, x2, ..., xk) defined over a bounded X set, you can approximate f(x1, x2, ..., xk) arbitrarily well by a polynomial function g(x1, x2, ..., xk). You might need an extremely high-order polynomial (recall that such functions involve all possible interaction terms up to the given order), but still, there exists a polynomial g(.) such that the maximum difference between f(.) and g(.) is as small as you would like. This result is mathematically proven as the Stone-Weierstrass Theorem.

While the Stone-Weierstrass Theorem seems to suggest that you should use high-order polynomial models to estimate regression functions, you know from the variance-bias trade-off that, while such higher-order models will be less biased (because of the Stone-Weierstrass Theorem), they also require a huge number of estimated parameters. Hence, high-order polynomial models will suffer from extremely high variance when estimated using data. In addition, polynomial models can have extremely poor behavior at the extremes of the data, or where data are sparse. At the extremes, the predictions shoot off quickly to positive or negative infinity, because that is how high-order polynomial terms like x^3, x^4, x^5, etc., behave. Where data are sparse, the behavior can also be erratic, as the polynomial examples shown earlier in this book illustrate.

There are universal approximators other than polynomials that do not have such wild behavior. Fourier analysis gives you functions g(.) involving sines and cosines that can approximate general functions f(.); such Fourier functions g(.) are also universal approximators, but they do not have the wild extrapolation properties of polynomials. Being based on sines and cosines, Fourier functions g(.) are best¹ for approximating functions f(.) that are cyclical; but, like polynomials, they can approximate any function arbitrarily well.

There are infinitely many universal approximators, depending on the particular set of basis functions that you use. In polynomial approximators, the basis functions are polynomials; in Fourier analysis, the basis functions involve sines and cosines. Pick an appropriate set of basis functions, and voilà! You can construct a universal approximator. "Neural network" is just a fancy-sounding name (meant to suggest that it somehow operates like the brain!) for a universal approximator that involves a particular type of basis function, called an activation function in neural network jargon. The standard activation function is the same one that you use in logistic regression, namely 1/(1 + e^(-x)). Just as polynomial functions are constructed from polynomial terms of various orders, their interactions, and constant multipliers, neural network functions are constructed from logistic functions, their interactions, and constant multipliers.

¹ "Best" means that they require fewer terms to arrive at a good approximation. Statistically, such parsimony is preferred because there will be fewer constant multipliers to estimate.
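To make the idea of a logistic basis function concrete, here is a minimal sketch, with invented weights rather than values estimated from data, of what a one-hidden-unit neural network regression function looks like: a constant plus a scaled logistic function of a linear term in x.

## A one-hidden-unit network is just b0 + b1 * logistic(a0 + a1*x).
## The weights below are made up for illustration only; in practice they
## are estimated from data, as the neuralnet fit in the next example does.
logistic = function(z) 1 / (1 + exp(-z))
nn1 = function(x, a0, a1, b0, b1) b0 + b1 * logistic(a0 + a1 * x)
curve(nn1(x, a0 = -31, a1 = 3, b0 = 4, b1 = 2.5), from = 8.5, to = 12,
      xlab = "x", ylab = "nn1(x)")   # a smooth, S-shaped mean function

Adding more hidden units adds more such logistic terms, and combining them lets the network bend the fitted mean function in flexible, continuous ways.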

Example: Predicting Charitable Contributions Using a Neural Network

Let's see how the neural network works in a real example first, and then we'll explain what it is doing.

R Code for Figure 17.5:

char = read.csv("...")   # the data URL is truncated in the source
attach(char)
f = as.formula(paste("CHARITY ~ INCOME"))
dat = data.frame(CHARITY, INCOME)
library(neuralnet)
fit.nn = neuralnet(f, data=dat, hidden=c(1,1))
Income.plot = seq(8.5, 12, .001)
Income.pred = data.frame(Income.plot)
names(Income.pred) = c("INCOME")
Yhat.nn = compute(fit.nn, Income.pred)$net.result
plot(INCOME, CHARITY, pch=".", cex=1.5)
points(Income.plot, Yhat.nn, type="l", lwd=2)
abline(lsfit(INCOME, CHARITY), lwd=1.5, lty=2)

Figure 17.5. Scatterplot of (ln(Income), ln(Charitable Contributions)) data, with neural network (solid) and ordinary least squares (dashed) fits.

Figure 17.5 shows that the neural network fit is akin to LOESS in that it allows curved relationships. The flattened appearance on the left is due to the use of the logistic basis functions. (We still need to explain what the network is doing; the sketch below gives a first look inside.)
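One way to take a first look inside the black box, assuming the fit.nn object above was estimated successfully: the neuralnet package stores the estimated weights in the fitted object, and its plot method draws the network with the weights on the arrows.

fit.nn$weights   # estimated weight matrices, one set per layer
plot(fit.nn)     # network diagram with the estimated weights on the arrows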
