The risk of machine learning

Size: px

Start display at page:

Download "The risk of machine learning"

Archibald Rice
5 years ago
Views:

1 / 33 The risk of machine learning Alberto Abadie Maximilian Kasy July 27, 27

2 2 / 33 Two key features of machine learning procedures Regularization / shrinkage: Improve prediction or estimation performance by trading off variance and bias ( avoiding overfitting ) 2 Data-dependent choice of tuning parameters: Try to do trade-off optimally Large number of methods: Regularization: Tuning: Ridge, Lasso, Pre-testing, Trees, Neural Networks, Support Vector Machines,... Cross-validation (CV), Stein s Unbiased Risk Estimate (SURE), Empirical Bayes (marginal likelihood),...

3 3 / 33 Questions facing the empirical researcher When should we bother with regularization? 2 What kind of regularization should we choose? What features of the data generating process matter for this choice? 3 When do CV or SURE work for tuning?

4 4 / 33 Roadmap Our answers to these questions A stylized setting: Estimation of many means Backing up our answers, using Theory 2 Empirical examples 3 Simulations Conclusion and summary of our formal contributions

5 ) When should we bother with regularization? Answer: Can reduce risk (mean squared error) when there are many parameters, and we care about point-estimates for each of these. Examples: Causal / predictive effects, many treatment values: Location effects: Chetty and Hendren (2) Teacher effects: Chetty et al. (24) Worker and firm effects: Abowd et al. (999), Card et al. (22) Judge effects: Abrams et al. (22) Binary treatment, many subgroups: Class size effect for different demographic groups: Krueger (999) Event studies: DellaVigna and La Ferrara (2) (many treatments and many treated units) Prediction with many predictive covariates / transformations of covariates: Macro forecasting: Stock and Watson (22) Series regression: Newey (997) / 33

6 6 / 33 2) What kind of regularization should we choose? Answer: Depends on the setting / distribution of parameters. Parameters smoothly distributed, no true zeros: Ridge / linear shrinkage e.g.: location effects (Chetty and Hendren, 2) Arguably most common case in econ settings. Many true zeros, non-zeros well separated: Pre-testing / hard thresholding e.g.: large fixed costs for non-zero behavior (DellaVigna and La Ferrara, 2) Rare! Many true zeros, non-zeros not well separated (intermediate case): Lasso / soft thresholding Robust choice for many settings.

7 7 / 33 3) When do CV or SURE work for tuning? Answer: CV and SURE are uniformly close to the oracle-optimum in high-dimensional settings. Intuition: Optimal tuning depends on the distribution of parameters. When there are many parameters, we can learn this distribution. Requirements: SURE: normality 2 CV: number of observations number of parameters

8 8 / 33 Stylized setting: Estimation of many means We observe n independent random variables X,...,X n, where Componentwise estimators: Squared error loss: E[X i ] = µ i, Var(X i ) = σ 2 i. µ i = m(x i,λ) L( µ, µ) = n ( µ i µ i ) 2 i Many applications: X i equal to OLS estimated coefficients Connection to linear regression and prediction

9 9 / 33 Example Chetty, R. and Hendren, N. (2). The impacts of neighborhoods on intergenerational mobility: Childhood exposure effects and county-level estimates. Working Paper, Harvard University and NBER. µ i : causal effect of growing up in commuting zone i X i : unbiased but noisy estimate of µ i identified from sibling differences of families moving between locations

10 / 33 Componentwise estimators Ridge: Lasso: Pre-test: ( m R (x,λ) = argmin (x c) 2 + λ c 2) c R = + λ x. ( m L (x,λ) = argmin (x c) 2 + 2λ c ) c R = (x < λ)(x + λ) + (x > λ)(x λ). m PT (x,λ) = ( x > λ)x.

11 m / 33 Componentwise estimators Ridge Pretest Lasso X (Ridge: λ =, Pretest: λ = 4, Lasso: λ = 2)

12 2 / 33 Risk Risk = expected loss = mean squared error Conceptual subtlety: Think of µ i (more generally: P i ) as fixed or random? Compound risk: µ,..., µ n as fixed effects, average over their sample distribution: R n (m(,λ),p) = n n i= E[(m(X i,λ) µ i ) 2 P i ] Empirical Bayes risk: µ,..., µ n as random effects, average over their population distribution: where (X i, µ i ) π R(m(,λ),π) = E π [(m(x i,λ) µ i ) 2 ],

13 3 / 33 Characterizing mean squared error Random effects setting: Joint distribution (X, µ) π Conditional expectation: m π(x) = E π [µ X = x] Theorem: The empirical Bayes risk of m(,λ) can be written as R = const. + E π [ (m(x,λ) m π (X)) 2], Performance of estimator m(, λ) depends on how closely it approximates m π( ). Fixed effects / compound risk: Completely analogous, with empirical distribution µ,..., µ n instead of π.

14 4 / 33 A useful family of examples: Spike and normal DGP Assume X i N(µ i,) Distribution of µ i across i: Fraction p µ i = Fraction p µ i N(µ,σ 2) Covers many interesting settings: p = : smooth distribution of true parameters p, µ and σ 2 large: sparsity, non-zeros well separated

15 / 33 Comparing risk Consider ridge, lasso, pre-test, optimal shrinkage function. Assume λ is chosen optimally (will return to that). Can calculate, compare, and plot mean squared error: By construction smaller than ( = risk of unregularized estimator, λ = ), larger than risk for optimal shrinkage m π( ) (by previous theorem). Paper: Analytic risk functions R.

16 Best estimator p =. p = σ σ µ µ p =. p = σ σ µ µ is ridge, x is lasso, is pretest 6 / 33

17 7 / 33 Mean squared error p =... σ µ σ ridge lasso µ.. σ µ σ pretest optimal µ

18 7 / 33 Mean squared error p =.2.. σ µ σ ridge lasso µ.. σ µ σ pretest optimal µ

19 7 / 33 Mean squared error p =.4.. σ µ σ ridge lasso µ.. σ µ σ pretest optimal µ

20 7 / 33 Mean squared error p =.6.. σ µ σ ridge lasso µ.. σ µ σ pretest optimal µ

21 7 / 33 Mean squared error p =.8.. σ µ σ ridge lasso µ.. σ µ σ pretest optimal µ

22 8 / 33 Estimating λ So far: Benchmark of optimal (oracle) λ. Can we consistently estimate λ, and do almost as well as if we knew it? Answer: Yes, for large n, suitably bounded moments. We show this for two methods: Stein s Unbiased Risk Estimate (SURE) (requires normality) 2 Cross-validation (CV) (requires panel data)

23 9 / 33 Uniform loss consistency Shorthand notation for loss: L n (λ) = n (m(x i,λ) µ i ) 2 i Definition: Uniform loss consistency of m(., λ) for m(., λ ): sup π ( Ln P π ( λ) L n ( λ ) > ε as n for all ε >, where P i iid π. )

24 2 / 33 Minimizing estimated risk Estimate λ by minimizing estimated risk: λ = argmin λ R(λ) Different estimators R(λ) of risk: CV, SURE Theorem: Regularization using SURE or CV is uniformly loss consistent as n in the random effects setting under some regularity conditions. Contrast with Leeb and Pötscher (26)! (fixed dimension of parameter vector) Key ingredient: uniform laws of larger numbers to get convergence of L n (λ), R(λ).

25 2 / 33 Two methods to estimate risk Stein s Unbiased Risk Estimate (SURE) Requires normality of X i. R(λ) = n (m(x i,λ) X i ) 2 + penalty i 2 Ridge : +λ penalty = Lasso : 2P n ( X > λ) Pre-test : 2P n ( X > λ) + 2λ ( f( λ) + f(λ)) 2 Cross validation (CV) Requires multiple observations X ij for µ i. R(λ) = kn n k i= j= (m(x i, j,λ) X ij ) 2 X i, j = leave-one-out-mean.

26 22 / 33 Applications Neighborhood effects: The effect of location during childhood on adult income (Chetty and Hendren, 2) Arms trading event study: Changes in the stock prices of arms manufacturers following changes in the intensity of conflicts in countries under arms trade embargoes (DellaVigna and La Ferrara, 2) Nonparametric Mincer equation: A nonparametric regression equation of log wages on education and potential experience (Belloni and Chernozhukov, 2)

27 23 / 33 Estimated Risk Stein s unbiased risk estimate R at the optimized tuning parameter λ for each application and estimator considered. n Ridge Lasso Pre-test location effects 9 R λ arms trade 24 R λ returns to education 6 R λ..9.4

28 SURE(6) Neighborhood effects: SURE estimates.8 SURE as function of 6 ridge lasso pretest / 33

29 bm(x) Neighborhood effects: shrinkage estimators 2. Shrinkage estimators x Kernel estimate of the density of X b f(x) x Solid line in top figure is an estimate of m π(x) 2 / 33

30 SURE(6) Arms event study: SURE estimates.8 SURE as function of 6 ridge lasso pretest / 33

31 bm(x) Arms event study: shrinkage estimators 4 Shrinkage estimators x Kernel estimate of the density of X b f(x) x Solid line in top figure is an estimate of m π(x) 27 / 33

32 SURE(6) Mincer regression: SURE estimates.2. SURE as function of 6 ridge lasso pretest / 33

33 bm(x) Mincer regression: shrinkage estimators Shrinkage estimators x Kernel estimate of the density of X b f(x) x Solid line in top figure is an estimate of m π(x) 29 / 33

34 3 / 33 Monte Carlo simulations Spike and normal DGP Number of parameters n =,2, λ chosen using SURE, CV with 4,2 folds Relative performance: As predicted. Simulation results

35 3 / 33 Summary and Conclusion We study the risk properties of machine learning estimators in the context of the problem of estimating many means µ i based on observations X i We provide a simple characterization of the risk of machine learning estimators based on proximity to the optimal shrinkage function We use a spike-and-normal setting to investigate how simple features of the DGP affect the relative performance of the different estimators We obtain uniform loss consistency results under SURE and CV based choices of regularization parameters We use data from recent empirical studies to demonstrate the practical applicability of our findings

36 32 / 33 Recommendations for empirical work Use regularization / shrinkage when you have many parameters of interest, and high variance (overfitting) is a concern. 2 Pick a regularization method appropriate for your application: Ridge: Smoothly distributed true effects, no special role of zero: 2 Pre-testing: Many zeros, non-zeros well separated 3 Lasso: Robust choice, especially for series regression / prediction 3 Use CV or SURE in high dimensional settings, when number of observations number of parameters.

37 Thank you! 33 / 33

38 Connection to linear regression and prediction Normal linear regression model: Y W N(W β,σ 2 ). Sample W,...,W n. Let Ω = N N j= W jw j. Draw new value of covariates from sample for prediction. Expected squared prediction error R = E [(Y W β) ] ( ) 2 = tr Ω E[( β β)( β β) ] + σ 2. Orthogonalize: Let µ = Ω /2 β, X = Ω /2 β OLS, µi = m(x i,λ). Then and X N (µ, σ 2 N I n ), [ ] R = E ( µ i µ i ) 2 + E[ε 2 ]. i Back / 4

39 2 / 4 Table: Average Compound Loss Across Simulations with N = SURE Cross-Validation Cross-Validation NPEB (k = 4) (k = 2) p µ σ ridge lasso pretest ridge lasso pretest ridge lasso pretest

40 3 / 4 Table: Average Compound Loss Across Simulations with N = 2 SURE Cross-Validation Cross-Validation NPEB (k = 4) (k = 2) p µ σ ridge lasso pretest ridge lasso pretest ridge lasso pretest

41 4 / 4 Table: Average Compound Loss Across Simulations with N = SURE Cross-Validation Cross-Validation NPEB (k = 4) (k = 2) p µ σ ridge lasso pretest ridge lasso pretest ridge lasso pretest Back

Habilitationsvortrag: Machine learning, shrinkage estimation, and economic theory

Habilitationsvortrag: Machine learning, shrinkage estimation, and economic theory Maximilian Kasy May 25, 218 1 / 27 Introduction Recent years saw a boom of machine learning methods. Impressive advances