Train the model with a subset of the data. Test the model on the remaining data (the validation set). What data to choose for training vs. test?
Transcription
2 Train the model with a subset of the data. Test the model on the remaining data (the validation set). What data to choose for training vs. test? In a time-series setting, it is natural to hold out the last year (or time period) of the data, to simulate predicting the future based on all past data. In most settings, however, we'll randomly select our training/test sets.
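A minimal R sketch of a random training/test split and a validation-set MSE; the data frame df and outcome y are hypothetical placeholders, not objects from the slides:

    set.seed(1)                                     # make the random split reproducible
    n <- nrow(df)                                   # df: hypothetical data frame with outcome y
    train_idx <- sample(n, size = floor(0.8 * n))   # hold out 20% as the validation set
    train <- df[train_idx, ]
    test  <- df[-train_idx, ]
    fit  <- lm(y ~ ., data = train)                 # fit on the training data only
    pred <- predict(fit, newdata = test)
    mean((test$y - pred)^2)                         # validation-set estimate of the MSE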
3 Cross-validation is a class of methods that perform many training/test splits and average over all the runs. Here is a simple example of 5-fold cross-validation: it gives 5 test sets and hence 5 estimates of the MSE. The 5-fold CV estimate is obtained by averaging these values.
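In symbols, writing MSE_k for the test error computed on fold k, the K-fold CV estimate is

    \mathrm{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k,

with K = 5 in this example.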
4 Split the data up into K folds. Iteratively leave fold k out of the training data and use it to test. The more folds, the smaller each test set is (and the more training data), but the more times we need to run the estimation procedure. Rules of thumb such as 5 or 10 folds are often used in practice. This can be done with a simple for loop in R. For generalized linear models, the cv.glm() function can be used to perform K-fold cross-validation. For example, this code loops over 10 possible polynomial orders and computes the 10-fold cross-validated error at each step (see the sketch below).
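A sketch of that loop using cv.glm() from the boot package; the dataset and formula (Auto and mpg ~ poly(horsepower, d) from the ISLR package) are illustrative stand-ins, since the slides do not show which data were used:

    library(boot)     # provides cv.glm()
    library(ISLR)     # provides the Auto data, used here only as a stand-in
    cv.error <- rep(0, 10)
    for (d in 1:10) {
      fit <- glm(mpg ~ poly(horsepower, d), data = Auto)   # polynomial of order d
      cv.error[d] <- cv.glm(Auto, fit, K = 10)$delta[1]    # 10-fold CV estimate of test error
    }
    which.min(cv.error)   # polynomial order with the lowest cross-validated error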
5 The validation set methodology works when we specify a class of models. Examples: linear models with p-order polynomials; a set of related models such as LDA, QDA, and logit; non-parametric models with a smoothing parameter. Once I have a class of models, I loop over the class to try each one and compute a validation set error on each pass. K-fold CV helps ensure my estimates will be stable.
9 An alternative to the approach of fitting many models and picking the best is to fit a single model using all the predictors and throw out the less useful ones in one step. This could also help pare down my feature set into a more manageable one before trying out fancier models and deeper analysis. It helps guard against overfitting and gives higher model interpretability. Regularization helps achieve these aims!
10 Suppose we have p features and want to fit the following estimating equation: y_i = \alpha + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i. OLS will find the β̂'s that minimize least squares in the training set. We have learned that this will tend to overfit. Another way of putting this: we end up with overly complex ("squiggly") models. They have too much variance.
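To see the overfitting pattern numerically, one can reuse the train/test split from the earlier sketch and compare training and test MSE as the polynomial order grows (the variables x and y are again hypothetical); training error keeps falling while test error eventually turns back up:

    train_mse <- test_mse <- rep(NA, 10)
    for (d in 1:10) {
      fit <- lm(y ~ poly(x, d), data = train)          # increasingly flexible ("squiggly") fits
      train_mse[d] <- mean((train$y - predict(fit))^2)
      test_mse[d]  <- mean((test$y - predict(fit, newdata = test))^2)
    }
    # train_mse decreases with d; test_mse typically rises again once the model overfits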
11 The lasso is a method to reduce variance in our model by imposing a penalty for having large β̂'s. We can then tune this penalty to minimize MSE in a validation set. To do so, we minimize: \sum_{i=1}^{n} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.
12 The size of the penalty term λ determines how aggressively we shrink the coefficients. Since the penalty function is the absolute value, we will often set β̂'s exactly equal to zero. The idea is that for a feature k that does not improve the fit by very much, it is not worth suffering the penalty, \lambda |\beta_k|, of adding it to the model.
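The slides do not name an implementation for this penalized fit; one common choice in R is the glmnet package. A minimal sketch of tuning λ by cross-validation (df and y are again hypothetical):

    library(glmnet)
    x <- model.matrix(y ~ ., data = df)[, -1]   # predictor matrix, dropping the intercept column
    y <- df$y
    cv.fit <- cv.glmnet(x, y, alpha = 1)        # alpha = 1 gives the absolute-value (lasso) penalty
    cv.fit$lambda.min                           # lambda with the smallest cross-validated error
    coef(cv.fit, s = "lambda.min")              # note: many coefficients are set exactly to zero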
17 Binary outcomes: equal to 1 or 0. Examples: a coin is heads or tails, a person is guilty or innocent, a loan defaults or not, etc. Continuous outcomes: can take any real value. Categorical outcomes: can take one of N qualitative values. Example: mapping symptoms to a well-defined disease. Traditional regression is designed for continuous outcomes. We use classification for binary and categorical outcomes.
18 A mapping of categories to numbers comes with strong implications: it would impose an ordering on the categories and imply particular differences across the qualitative conditions.
19 Outcome: 0 or 1. Given features X, we want to say what the probability is that the outcome equals 1. We can write this as P(outcome = 1 | X). Our model will output a probability even though the realized outcomes will always be 0 or 1.
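A minimal R sketch of one standard choice for such a probability model, logistic regression via glm(); the data frame df with a 0/1 outcome y and features x1, x2 is hypothetical:

    fit <- glm(y ~ x1 + x2, data = df, family = binomial)   # logistic regression
    probs <- predict(fit, type = "response")                 # estimated P(outcome = 1 | X)
    preds <- ifelse(probs > 0.5, 1, 0)                       # threshold probabilities into 0/1 predictions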
26 Inverse Elasticity Rule. The monopolist maximizes profit \pi(q) = q\,p(q) - c(q); the profit-maximizing FOC (MR = MC) is p(q) + q\,p'(q) = c'(q). Rearranging gives the price-cost margin (Lerner index) equal to one over the elasticity: \frac{p - c'(q)}{p} = -\frac{q\,p'(q)}{p} = \frac{1}{\epsilon}. Price minus marginal cost, divided by price, is referred to as the gross margin.
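A quick numerical check of the rule: if the elasticity is \epsilon = 2, the optimal margin is \frac{p - c'(q)}{p} = \frac{1}{2}, so price is twice marginal cost; with a more elastic demand of \epsilon = 4, the margin falls to 25%.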
27 Inverse Elasticity Rule. \frac{p - c'(q)}{p} = \frac{1}{\epsilon}. Suppose MC = 0. Then quantity is chosen so that the elasticity equals 1. Intuition: if marginal costs are zero, then optimize for revenue; total revenue grows until the elasticity reaches 1. If MC > 0, one will stop before reaching an elasticity of 1.
28 Inverse Elasticity Rule 2. Suppose we sell n goods indexed i = 1, ..., n, with demands x_i(p). Profit is \sum_{i=1}^{n} \big(p_i - mc_i\big)\, x_i(p). If we assume constant marginal cost, this setup also covers selling the same good in multiple markets or to multiple customer types. Cross-price elasticity: \epsilon_{ij} = \frac{p_j}{x_i}\frac{\partial x_i}{\partial p_j}. Note there is no minus sign: positive means substitutes; negative means complements.
29 Representative Consumer Assumption. If there is a representative consumer maximizing utility, \max_x u(x) - p \cdot x, the FOC is \nabla u(x) = p. Taking the total derivative of the FOC, D^2u(x)\,Dx(p) = I, so Dx(p) = \big(D^2u(x)\big)^{-1}, which is symmetric because the Hessian is symmetric. Thus there are symmetric cross-derivatives: \frac{\partial x_i}{\partial p_j} = \frac{\partial x_j}{\partial p_i}. Recall this rule from multivariate calculus. It need not hold in practice, but it is a commonly made assumption.
30 In Matrix Notation. Price-cost margins: the FOC for each good i is x_i(p) + \sum_{j=1}^{n} \big(p_j - mc_j\big)\frac{\partial x_j}{\partial p_i} = 0. Dividing through by x_i and writing things in terms of the Lerner indices L_i = \frac{p_i - mc_i}{p_i} and the elasticity matrix E, this stacks into 0 = 1 + E L, and thus L = -E^{-1} 1.
31 Two Good Formula. Rule for inverting a 2×2 matrix: for E = \begin{pmatrix} e_{11} & e_{12} \\ e_{21} & e_{22} \end{pmatrix}, E^{-1} = \frac{1}{e_{11}e_{22} - e_{12}e_{21}} \begin{pmatrix} e_{22} & -e_{12} \\ -e_{21} & e_{11} \end{pmatrix}. Applying L = -E^{-1} 1 (with the negative sign embedded in the own-price terms here, as noted in the review on slide 37): L_1 = \frac{e_{22} - e_{12}}{e_{11}e_{22} - e_{12}e_{21}} = \frac{1 - e_{12}/e_{22}}{e_{11} - e_{12}e_{21}/e_{22}} (divide top and bottom by e_{22}) = \frac{1}{e_{11}} \cdot \frac{1 - e_{12}/e_{22}}{1 - \frac{e_{12}e_{21}}{e_{22}e_{11}}} (factor out e_{11}).
32 Two Good Formula. L = -E^{-1} 1 yields L_1 = \frac{1}{e_{11}} \cdot \frac{1 - e_{12}/e_{22}}{1 - \frac{e_{12}e_{21}}{e_{22}e_{11}}}. The term \frac{e_{12}e_{21}}{e_{22}e_{11}} will be between 0 and 1 because e_{12}e_{21} < e_{22}e_{11}: cross-price elasticities have to be smaller than the relevant own-price elasticities.
33 Two Good Formula. L = -E^{-1} 1 yields L_1 = \frac{1}{e_{11}} \cdot \frac{1 - e_{12}/e_{22}}{1 - \frac{e_{12}e_{21}}{e_{22}e_{11}}}. When the cross-price terms vanish (e_{12} = e_{21} = 0), the correction factor equals one and this reduces to the single-good rule L_1 = \frac{1}{e_{11}}.
34 Two Good Formula for Substitutes. L = -E^{-1} 1 yields L_1 = \frac{1}{e_{11}} \cdot \frac{1 - e_{12}/e_{22}}{1 - \frac{e_{12}e_{21}}{e_{22}e_{11}}}. With substitutes (e_{12} > 0), the correction pushes the margin up: the multiproduct monopolist internalizes that a higher p_1 shifts some sales to good 2, so L_1 exceeds what the single-good inverse elasticity rule would give.
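A numerical illustration of the substitutes case, with hypothetical elasticities e_{11} = e_{22} = -3 and e_{12} = e_{21} = 1:

    E = \begin{pmatrix} -3 & 1 \\ 1 & -3 \end{pmatrix}, \quad E^{-1} = \frac{1}{8}\begin{pmatrix} -3 & -1 \\ -1 & -3 \end{pmatrix}, \quad L = -E^{-1}\mathbf{1} = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}.

Each Lerner index is 1/2, above the single-good benchmark 1/|e_{11}| = 1/3, consistent with the claim that substitutes raise the multiproduct margin.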
36 Two Good Formula for Complements. L = -E^{-1} 1 yields L_1 = \frac{1}{e_{11}} \cdot \frac{1 - e_{12}/e_{22}}{1 - \frac{e_{12}e_{21}}{e_{22}e_{11}}}. With complements (e_{12} < 0), the correction works in the opposite direction: a higher p_1 also reduces sales of good 2, so L_1 falls below what the single-good inverse elasticity rule would give.
37 Two Good Formula Review. L = -E^{-1} 1 yields L_1 = \frac{1}{e_{11}} \cdot \frac{1 - e_{12}/e_{22}}{1 - \frac{e_{12}e_{21}}{e_{22}e_{11}}}. If e_{12} > 0, the goods are substitutes: a price decrease on product 2 decreases sales of product 1 (they go in the same direction). If e_{12} < 0, the goods are complements: a price decrease on product 2 increases sales of product 1 (they go in opposite directions). e_{11} and e_{22} will be negative due to the law of demand (note that earlier we embedded the negative sign).
38 [Figure: bundling diagram in valuation space. Axes are the consumers' valuations v_1 and v_2; thresholds at p_1, p_2, and the bundle price p_B divide the space into regions Buy Nothing, Buy Good 1, Buy Good 2, and Buy Both.] Reducing the bundle price gives the additional sales of both goods with a single price cut.