The risk of machine learning
The Risk of Machine Learning
Alberto Abadie, Maximilian Kasy
July 27, 2017
Two key features of machine learning procedures

1. Regularization / shrinkage: improve prediction or estimation performance by trading off variance and bias (avoiding overfitting).
2. Data-dependent choice of tuning parameters: try to make this trade-off optimally.

Large number of methods:
- Regularization: Ridge, Lasso, Pre-testing, Trees, Neural Networks, Support Vector Machines, ...
- Tuning: Cross-validation (CV), Stein's Unbiased Risk Estimate (SURE), Empirical Bayes (marginal likelihood), ...
Questions facing the empirical researcher

1. When should we bother with regularization?
2. What kind of regularization should we choose? What features of the data generating process matter for this choice?
3. When do CV or SURE work for tuning?
Roadmap

- Our answers to these questions
- A stylized setting: estimation of many means
- Backing up our answers, using: 1. theory, 2. empirical examples, 3. simulations
- Conclusion and summary of our formal contributions
1) When should we bother with regularization?

Answer: Regularization can reduce risk (mean squared error) when there are many parameters and we care about point estimates for each of them.

Examples:
- Causal / predictive effects, many treatment values:
  - Location effects: Chetty and Hendren (2015)
  - Teacher effects: Chetty et al. (2014)
  - Worker and firm effects: Abowd et al. (1999), Card et al. (2012)
  - Judge effects: Abrams et al. (2012)
- Binary treatment, many subgroups:
  - Class size effect for different demographic groups: Krueger (1999)
- Event studies: DellaVigna and La Ferrara (2010) (many treatments and many treated units)
- Prediction with many predictive covariates / transformations of covariates:
  - Macro forecasting: Stock and Watson (2012)
  - Series regression: Newey (1997)
2) What kind of regularization should we choose?

Answer: It depends on the setting, i.e., on the distribution of the parameters.

- Parameters smoothly distributed, no true zeros: Ridge / linear shrinkage.
  e.g., location effects (Chetty and Hendren, 2015). Arguably the most common case in econ settings.
- Many true zeros, non-zeros well separated: Pre-testing / hard thresholding.
  e.g., large fixed costs for non-zero behavior (DellaVigna and La Ferrara, 2010). Rare!
- Many true zeros, non-zeros not well separated (intermediate case): Lasso / soft thresholding.
  A robust choice for many settings.
3) When do CV or SURE work for tuning?

Answer: CV and SURE are uniformly close to the oracle optimum in high-dimensional settings.

Intuition: Optimal tuning depends on the distribution of parameters. When there are many parameters, we can learn this distribution.

Requirements:
1. SURE: normality
2. CV: number of observations ≫ number of parameters
Stylized setting: estimation of many means

We observe $n$ independent random variables $X_1, \ldots, X_n$, where
$E[X_i] = \mu_i, \quad \mathrm{Var}(X_i) = \sigma_i^2.$

Componentwise estimators: $\hat\mu_i = m(X_i, \lambda)$.

Squared error loss: $L(\hat\mu, \mu) = \frac{1}{n} \sum_i (\hat\mu_i - \mu_i)^2.$

Many applications: $X_i$ equal to OLS estimated coefficients (connection to linear regression and prediction; see the backup slides).
Example

Chetty, R. and Hendren, N. (2015). The impacts of neighborhoods on intergenerational mobility: Childhood exposure effects and county-level estimates. Working Paper, Harvard University and NBER.

- $\mu_i$: causal effect of growing up in commuting zone $i$
- $X_i$: unbiased but noisy estimate of $\mu_i$, identified from sibling differences of families moving between locations
Componentwise estimators

Ridge:
$m_R(x, \lambda) = \operatorname{argmin}_{c \in \mathbb{R}} \left( (x - c)^2 + \lambda c^2 \right) = \frac{1}{1 + \lambda}\, x.$

Lasso:
$m_L(x, \lambda) = \operatorname{argmin}_{c \in \mathbb{R}} \left( (x - c)^2 + 2\lambda |c| \right) = \mathbf{1}(x < -\lambda)(x + \lambda) + \mathbf{1}(x > \lambda)(x - \lambda).$

Pre-test:
$m_{PT}(x, \lambda) = \mathbf{1}(|x| > \lambda)\, x.$
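To make these formulas concrete, here is a minimal NumPy sketch (an illustration, not the authors' code) of the three shrinkage functions; the $\lambda$ values in the demo calls are arbitrary:

```python
# Componentwise shrinkage functions; x may be a scalar or a NumPy array.
import numpy as np

def m_ridge(x, lam):
    # argmin_c (x - c)^2 + lam * c^2: linear shrinkage toward zero
    return x / (1.0 + lam)

def m_lasso(x, lam):
    # argmin_c (x - c)^2 + 2 * lam * |c|: soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def m_pretest(x, lam):
    # hard thresholding: keep x unchanged only if |x| clears the threshold
    return np.where(np.abs(x) > lam, x, 0.0)

x = np.linspace(-6.0, 6.0, 7)
print(m_ridge(x, 1.0))    # [-3. -2. -1.  0.  1.  2.  3.]
print(m_lasso(x, 2.0))    # [-4. -2.  0.  0.  0.  2.  4.]
print(m_pretest(x, 4.0))  # [-6.  0.  0.  0.  0.  0.  6.]
```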
Componentwise estimators

[Figure: the componentwise estimators $m(x, \lambda)$ plotted against $x$ for ridge, pretest, and lasso (pretest: $\lambda = 4$, lasso: $\lambda = 2$).]
Risk

Risk = expected loss = mean squared error.

Conceptual subtlety: think of the $\mu_i$ (more generally: $P_i$) as fixed or random?

Compound risk: $\mu_1, \ldots, \mu_n$ as fixed effects, average over their sample distribution:
$R_n(m(\cdot, \lambda), P) = \frac{1}{n} \sum_{i=1}^n E\left[ (m(X_i, \lambda) - \mu_i)^2 \mid P_i \right]$

Empirical Bayes risk: $\mu_1, \ldots, \mu_n$ as random effects, average over their population distribution:
$\bar{R}(m(\cdot, \lambda), \pi) = E_\pi\left[ (m(X_i, \lambda) - \mu_i)^2 \right],$ where $(X_i, \mu_i) \sim \pi$.
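The distinction can be made concrete by simulation. A short sketch (my own, with an arbitrary stand-in for $\pi$ and a lasso with fixed $\lambda$): compound risk holds one draw of $\mu_1, \ldots, \mu_n$ fixed and averages the loss over redraws of $X$, while empirical Bayes risk redraws $\mu$ from $\pi$ in every replication.

```python
import numpy as np

rng = np.random.default_rng(0)

def lasso(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def draw_mu(n):
    # stand-in for pi: half the mu_i at zero, half N(4, 1)
    return np.where(rng.random(n) < 0.5, 0.0, rng.normal(4.0, 1.0, n))

n, reps, lam = 1000, 500, 1.0

mu_fixed = draw_mu(n)  # fixed effects: one realization, held fixed below
compound = np.mean([
    np.mean((lasso(mu_fixed + rng.standard_normal(n), lam) - mu_fixed) ** 2)
    for _ in range(reps)
])

def one_eb_rep():
    mu = draw_mu(n)    # random effects: mu redrawn every replication
    return np.mean((lasso(mu + rng.standard_normal(n), lam) - mu) ** 2)

eb = np.mean([one_eb_rep() for _ in range(reps)])
print(f"compound risk ~ {compound:.3f}, empirical Bayes risk ~ {eb:.3f}")
```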
Characterizing mean squared error

Random effects setting: joint distribution $(X, \mu) \sim \pi$.

Conditional expectation: $m^*_\pi(x) = E_\pi[\mu \mid X = x]$.

Theorem: The empirical Bayes risk of $m(\cdot, \lambda)$ can be written as
$\bar{R} = \text{const.} + E_\pi\left[ (m(X, \lambda) - m^*_\pi(X))^2 \right].$

The performance of the estimator $m(\cdot, \lambda)$ depends on how closely it approximates $m^*_\pi(\cdot)$.

Fixed effects / compound risk: completely analogous, with the empirical distribution of $\mu_1, \ldots, \mu_n$ in place of $\pi$.
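The "const." term is the risk of the optimal shrinkage function itself; a one-line sketch of why the decomposition holds:

```latex
\bar{R}(m(\cdot,\lambda),\pi)
  = E_\pi\big[(m(X,\lambda)-\mu)^2\big]
  = \underbrace{E_\pi\big[(\mu - m^*_\pi(X))^2\big]}_{\text{const., independent of } \lambda}
  + \; E_\pi\big[(m(X,\lambda) - m^*_\pi(X))^2\big],
```

since the cross term $E_\pi\big[(m(X,\lambda)-m^*_\pi(X))(m^*_\pi(X)-\mu)\big]$ vanishes by iterated expectations: conditioning on $X$, the first factor is fixed and the second has conditional mean zero because $m^*_\pi(X) = E_\pi[\mu \mid X]$.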
A useful family of examples: spike and normal DGP

Assume $X_i \sim N(\mu_i, 1)$. Distribution of $\mu_i$ across $i$:
- Fraction $p$: $\mu_i = 0$
- Fraction $1 - p$: $\mu_i \sim N(\mu_0, \sigma_0^2)$

Covers many interesting settings:
- $p = 0$: smooth distribution of true parameters
- $p$ large, $\mu_0$ and $\sigma_0^2$ large: sparsity, non-zeros well separated
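Under this DGP the optimal shrinkage function has a closed form: the posterior mean mixes 0 (the spike) with the usual normal-normal shrinkage of the slab, weighted by the posterior probability of each component. A sketch (derived from the stated DGP, not code from the paper):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def draw_spike_normal(n, p, mu0, sigma0):
    # fraction p of the mu_i at zero, the rest N(mu0, sigma0^2); X_i ~ N(mu_i, 1)
    mu = np.where(rng.random(n) < p, 0.0, rng.normal(mu0, sigma0, n))
    return mu + rng.standard_normal(n), mu

def m_star(x, p, mu0, sigma0):
    # marginal densities of X under each mixture component
    f_spike = p * norm.pdf(x, 0.0, 1.0)
    f_slab = (1.0 - p) * norm.pdf(x, mu0, np.sqrt(1.0 + sigma0**2))
    w = f_slab / (f_spike + f_slab)                            # P(slab | X = x)
    shrunk = mu0 + sigma0**2 / (1.0 + sigma0**2) * (x - mu0)   # E[mu | x, slab]
    return w * shrunk                                          # E[mu | x, spike] = 0

x, mu = draw_spike_normal(100_000, p=0.5, mu0=4.0, sigma0=1.0)
print("risk of optimal shrinkage:", np.mean((m_star(x, 0.5, 4.0, 1.0) - mu) ** 2))
```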
Comparing risk

Consider ridge, lasso, pre-test, and the optimal shrinkage function. Assume $\lambda$ is chosen optimally (we will return to that).

We can calculate, compare, and plot the mean squared error:
- By construction, it is smaller than 1 (the risk of the unregularized estimator, $\lambda = 0$),
- and larger than the risk of optimal shrinkage $m^*_\pi(\cdot)$ (by the previous theorem).

Paper: analytic risk functions $\bar{R}$.
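A simulation sketch of this comparison (my own; the spike-and-normal parameters are arbitrary). With the true $\mu$ in hand, the oracle $\lambda$ for each estimator can be approximated by grid search, and all three losses should land between the risk of optimal shrinkage and 1:

```python
import numpy as np

rng = np.random.default_rng(1)
p, mu0, sigma0, n = 0.5, 4.0, 1.0, 50_000

mu = np.where(rng.random(n) < p, 0.0, rng.normal(mu0, sigma0, n))
x = mu + rng.standard_normal(n)

estimators = {
    "ridge":   lambda x, lam: x / (1.0 + lam),
    "lasso":   lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0),
    "pretest": lambda x, lam: np.where(np.abs(x) > lam, x, 0.0),
}
grid = np.linspace(0.0, 10.0, 401)
for name, m in estimators.items():
    losses = [np.mean((m(x, lam) - mu) ** 2) for lam in grid]
    k = int(np.argmin(losses))
    print(f"{name:8s}: oracle lambda ~ {grid[k]:.2f}, MSE ~ {losses[k]:.3f}")
print("unregularized MSE ~", np.mean((x - mu) ** 2))  # close to 1 by construction
```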
Best estimator

[Figure: the best-performing estimator (ridge, lasso, or pretest) in $(\mu_0, \sigma_0)$-space, one panel per value of $p$; plotting symbols distinguish ridge, lasso ("x"), and pretest.]
Mean squared error

[Figure: mean squared error of ridge, lasso, pretest, and the optimal shrinkage function as functions of $\mu_0$ and $\sigma_0$, shown for increasing values of $p$ (0.2, 0.4, 0.6, 0.8).]
Estimating $\lambda$

So far: the benchmark of optimal (oracle) $\lambda$. Can we consistently estimate $\lambda$, and do almost as well as if we knew it?

Answer: Yes, for large $n$ and suitably bounded moments. We show this for two methods:
1. Stein's Unbiased Risk Estimate (SURE) (requires normality)
2. Cross-validation (CV) (requires panel data)
Uniform loss consistency

Shorthand notation for loss:
$L_n(\lambda) = \frac{1}{n} \sum_i (m(X_i, \lambda) - \mu_i)^2$

Definition: uniform loss consistency of $m(\cdot, \hat\lambda)$ for $m(\cdot, \lambda^*)$:
$\sup_\pi P_\pi\left( \left| L_n(\hat\lambda) - L_n(\lambda^*) \right| > \varepsilon \right) \to 0$
as $n \to \infty$ for all $\varepsilon > 0$, where $P_i \overset{iid}{\sim} \pi$.
Minimizing estimated risk

Estimate $\lambda$ by minimizing estimated risk:
$\hat\lambda = \operatorname{argmin}_\lambda \hat{R}(\lambda)$

Different estimators $\hat{R}(\lambda)$ of risk: CV, SURE.

Theorem: Regularization using SURE or CV is uniformly loss consistent as $n \to \infty$ in the random effects setting, under some regularity conditions.

Contrast with Leeb and Pötscher (2006)! (fixed dimension of the parameter vector)

Key ingredient: uniform laws of large numbers to get convergence of $L_n(\lambda)$ and $\hat{R}(\lambda)$.
Two methods to estimate risk

1. Stein's Unbiased Risk Estimate (SURE). Requires normality of $X_i$.
$\hat{R}(\lambda) = \frac{1}{n} \sum_i (m(X_i, \lambda) - X_i)^2 + \text{penalty}$

$\text{penalty} = \begin{cases} \frac{2}{1+\lambda} - 1 & \text{Ridge} \\ 2 P_n(|X| > \lambda) - 1 & \text{Lasso} \\ 2 P_n(|X| > \lambda) + 2\lambda \left( \hat f(-\lambda) + \hat f(\lambda) \right) - 1 & \text{Pre-test} \end{cases}$

2. Cross-validation (CV). Requires multiple observations $X_{ij}$ for each $\mu_i$.
$\hat{R}(\lambda) = \frac{1}{kn} \sum_{i=1}^n \sum_{j=1}^k \left( m(\bar X_{i,-j}, \lambda) - X_{ij} \right)^2,$
where $\bar X_{i,-j}$ is the leave-one-out mean.
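A sketch of both tuning criteria (my implementation under the stated assumptions: unit variances for SURE, a balanced $n \times k$ panel for CV). The pretest penalty would additionally need a density estimate $\hat f$, omitted here:

```python
import numpy as np

def sure_ridge(x, lam):
    # fit term plus 2 * average derivative of m_R minus 1 (sigma^2 = 1)
    return np.mean((x / (1.0 + lam) - x) ** 2) + 2.0 / (1.0 + lam) - 1.0

def sure_lasso(x, lam):
    m = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
    return np.mean((m - x) ** 2) + 2.0 * np.mean(np.abs(x) > lam) - 1.0

def cv_risk(x_panel, m, lam):
    # x_panel: n x k array of repeated observations X_ij with E[X_ij] = mu_i
    n, k = x_panel.shape
    total = 0.0
    for j in range(k):
        x_out = x_panel[:, j]
        x_in = (x_panel.sum(axis=1) - x_out) / (k - 1)  # leave-fold-out mean
        total += np.mean((m(x_in, lam) - x_out) ** 2)
    return total / k

def tune(criterion, grid):
    vals = [criterion(lam) for lam in grid]
    return grid[int(np.argmin(vals))]

rng = np.random.default_rng(2)
n, k = 2000, 4
mu = np.where(rng.random(n) < 0.5, 0.0, rng.normal(4.0, 1.0, n))
x = mu + rng.standard_normal(n)
x_panel = mu[:, None] + rng.standard_normal((n, k)) * np.sqrt(k)  # row means ~ N(mu_i, 1)

grid = np.linspace(0.0, 8.0, 161)
lasso = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
print("SURE lambda:", tune(lambda lam: sure_lasso(x, lam), grid))
print("CV   lambda:", tune(lambda lam: cv_risk(x_panel, lasso, lam), grid))
```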
Applications

- Neighborhood effects: the effect of location during childhood on adult income (Chetty and Hendren, 2015)
- Arms trade event study: changes in the stock prices of arms manufacturers following changes in the intensity of conflicts in countries under arms trade embargoes (DellaVigna and La Ferrara, 2010)
- Nonparametric Mincer equation: a nonparametric regression of log wages on education and potential experience (Belloni and Chernozhukov, 2011)
Estimated risk

[Table: for each application (location effects, arms trade, returns to education), the number of parameters $n$ and, for each estimator (ridge, lasso, pre-test), Stein's unbiased risk estimate $\hat{R}$ at the optimized tuning parameter $\hat\lambda$.]
Neighborhood effects: SURE estimates

[Figure: SURE as a function of $\lambda$ for ridge, lasso, and pretest.]
Neighborhood effects: shrinkage estimators

[Figure: shrinkage estimators $\hat m(x)$ (the solid line is an estimate of $m^*_\pi(x)$), together with a kernel estimate $\hat f(x)$ of the density of $X$.]
Arms event study: SURE estimates

[Figure: SURE as a function of $\lambda$ for ridge, lasso, and pretest.]
Arms event study: shrinkage estimators

[Figure: shrinkage estimators $\hat m(x)$ (the solid line is an estimate of $m^*_\pi(x)$), together with a kernel estimate $\hat f(x)$ of the density of $X$.]
Mincer regression: SURE estimates

[Figure: SURE as a function of $\lambda$ for ridge, lasso, and pretest.]
Mincer regression: shrinkage estimators

[Figure: shrinkage estimators $\hat m(x)$ (the solid line is an estimate of $m^*_\pi(x)$), together with a kernel estimate $\hat f(x)$ of the density of $X$.]
Monte Carlo simulations

- Spike and normal DGP
- Number of parameters $n = 50, 200, 1000$
- $\lambda$ chosen using SURE, and CV with 4 and 20 folds
- Relative performance: as predicted
- Simulation results: see the backup tables
Summary and conclusion

- We study the risk properties of machine learning estimators in the context of the problem of estimating many means $\mu_i$ based on observations $X_i$.
- We provide a simple characterization of the risk of machine learning estimators based on their proximity to the optimal shrinkage function.
- We use a spike-and-normal setting to investigate how simple features of the DGP affect the relative performance of the different estimators.
- We obtain uniform loss consistency results for SURE- and CV-based choices of regularization parameters.
- We use data from recent empirical studies to demonstrate the practical applicability of our findings.
Recommendations for empirical work

1. Use regularization / shrinkage when you have many parameters of interest and high variance (overfitting) is a concern.
2. Pick a regularization method appropriate for your application:
   1. Ridge: smoothly distributed true effects, no special role of zero
   2. Pre-testing: many zeros, non-zeros well separated
   3. Lasso: robust choice, especially for series regression / prediction
3. Use CV or SURE in high-dimensional settings, when number of observations ≫ number of parameters.
Thank you!
Connection to linear regression and prediction

Normal linear regression model: $Y \mid W \sim N(W'\beta, \sigma^2)$. Sample $W_1, \ldots, W_N$. Let $\Omega = \frac{1}{N} \sum_{j=1}^N W_j W_j'$.

Draw a new value of the covariates from the sample for prediction. Expected squared prediction error:
$\bar{R} = E\left[ (Y - W'\hat\beta)^2 \right] = \operatorname{tr}\left( \Omega \cdot E[(\hat\beta - \beta)(\hat\beta - \beta)'] \right) + \sigma^2.$

Orthogonalize: let $\mu = \Omega^{1/2} \beta$, $X = \Omega^{1/2} \hat\beta_{OLS}$, $\hat\mu_i = m(X_i, \lambda)$. Then
$X \sim N\left( \mu, \tfrac{\sigma^2}{N} I_n \right)$
and
$\bar{R} = E\left[ \textstyle\sum_i (\hat\mu_i - \mu_i)^2 \right] + E[\varepsilon^2].$
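A numerical sketch of this orthogonalization (an illustration; the dimensions and DGP are arbitrary), checking that the transformed OLS coefficients behave like the many-means observations with variance $\sigma^2 / N$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, sigma = 2000, 20, 1.0          # N observations, n regression coefficients

W = rng.standard_normal((N, n))
beta = rng.normal(0.0, 0.5, n)
Y = W @ beta + sigma * rng.standard_normal(N)

Omega = W.T @ W / N
vals, vecs = np.linalg.eigh(Omega)   # symmetric square root of Omega
Omega_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T

beta_ols = np.linalg.lstsq(W, Y, rcond=None)[0]
mu = Omega_half @ beta               # transformed true coefficients
X = Omega_half @ beta_ols            # approximately N(mu, (sigma^2 / N) I_n)

print(np.std(X - mu) * np.sqrt(N) / sigma)  # should be close to 1
```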
[Table: average compound loss across simulations with N = 50, by DGP parameters $(p, \mu_0, \sigma_0)$ and estimator (ridge, lasso, pretest), with $\lambda$ chosen by SURE, cross-validation (k = 4), and cross-validation (k = 20); NPEB included for comparison.]

[Table: the same, with N = 200.]

[Table: the same, with N = 1000.]