Econ.AI: The role for Deep Neural Networks in Economics. Matt Taddy, UChicago and Microsoft

What is AI? Domain Structure + Data Generation + General Purpose ML
- Domain structure: econ theory / business frame, structural econometrics, relaxations and heuristics
- Data generation: reinforcement learning, sensor networks and IoT, simulation/GANs
- General purpose ML: deep neural nets, SGD + OOS + GPUs, video/audio/text
AI: self-training structures of ML predictors that automate and accelerate human tasks.

https://www.youtube.com/watch?time_continue=13&v=zqywmhfjewu

The Economics of AI
- DNNs are a general purpose technology (GPT) and a method of invention
- Broad impact, up and down the value chain
- Gets better, faster, and cheaper over time
- Can suffer from underinvestment; productivity gains lag invention
- Automation, inequality, skill acquisition
- Data ownership, markets, and privacy
- High-info contracts and outcome pricing
[Graph from Cockburn/Henderson/Stern]

What about the impact of AI on the practice of Econom[etr]ics? Susan Athey:

Econometrics breaks systemic questions into sets of prediction tasks:
- Prediction after controlling for confounders
- Two-stage prediction with IVs
- Heterogeneous treatment effect prediction
- Structural equation systems
Machine learning can automate and accelerate tasks in applied econometric workflows.

Example: short-term price elasticity. If I drop price 1%, by what % will quantity sold increase?
- Problem: both prices and sales respond to underlying demand
- We need the causal effect of price on sales, not their co-movement
Beer data:
- A single shared elasticity gives a tiny -0.23
- A separate elasticity for each product: noisy zeros
- We need to group the products together using brand, pack size, etc.

Beer Elasticity. Say w_bk = 1 if word k is in the description for beer b. At transaction t:
  y_tb = γ_b p_tb + f_t(w_b) + ε_tb,  with γ_b = w_b'β
  p_tb = h_t(w_b) + ν_tb
This creates a large number of parameters. Just throw it all in a lasso? That yields unbelievably small elasticities. The naïve ML conflates two problems: selecting controls and predicting the response after controlling for confounders.

Instead, use Orthogonal ML (Chernozhukov et al., 2016, and earlier):
- Estimate the nuisance functions E[y_tb | w_b] and E[p_tb | w_b]
- Orthogonalize the score against these nuisance functions (with data splitting)
- Then estimation for γ is robust to slowly learned nuisances
Estimation breaks into a series of ML tasks (see the sketch below):
1. Predict sales from the demand variables: y_tb ≈ g(t, w_b)
2. Predict prices from the demand variables: p_tb ≈ h(t, w_b)
3. Get OOS residuals: ỹ_tb = y_tb − ĝ(t, w_b), p̃_tb = p_tb − ĥ(t, w_b)
4. Fit the final regression: E[ỹ_t] = Γ p̃_t, with Γ = diag(γ)
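A minimal sketch of steps 1-4 on simulated data. The data-generating process, the random-forest nuisance learners, and the single shared elasticity are illustrative assumptions, not the beer model from the talk:

```python
# Orthogonal (double) ML sketch: predict y and p from the demand features,
# take out-of-sample residuals via cross-fitting, regress residual on residual.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, k = 5000, 10
W = rng.normal(size=(n, k))               # demand features (stand-in for text w_b)
demand = W @ rng.normal(size=k)           # confounder: drives both price and sales
p = 0.5 * demand + rng.normal(size=n)     # log price responds to demand
gamma_true = -1.5                         # true elasticity
y = gamma_true * p + demand + rng.normal(size=n)   # log sales

y_res, p_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(W):
    # steps 1-2: any ML model can stand in for g and h
    g_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(W[train], y[train])
    h_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(W[train], p[train])
    # step 3: out-of-sample residuals
    y_res[test] = y[test] - g_hat.predict(W[test])
    p_res[test] = p[test] - h_hat.predict(W[test])

# step 4: final regression of residualized sales on residualized price
gamma_hat = (p_res @ y_res) / (p_res @ p_res)
print(f"orthogonal-ML elasticity: {gamma_hat:.2f} (truth {gamma_true})")
```

Any sufficiently accurate learner can replace the two nuisance fits; the cross-fitting split is what keeps the final regression honest.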

Orthogonal ML for Beer. The text encodes a natural hierarchy: many beers load on IPA or Cider, but individual brands also load.
[Figure: estimated elasticities for the most price sensitive and least price sensitive beers.]
There's no ground truth, but these are economically realistic. (with Vira Semenova, MIT)

Econ + ML. This is what econometricians do: break systems into measurable pieces.
Another common example: Instrumental Variables.
Endogenous errors: y = g(p, x) + e with E[p e] ≠ 0.
If you estimate this with naïve ML, you'll get E[y | p, x] = E_{e|p}[g(p, x) + e] = g(p, x) + E[e | p, x].
But, with instruments...
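A quick illustration of that bias on simulated data (all numbers are made up for the example): when the error is correlated with price, the naïve regression recovers g(p) + E[e | p] rather than g(p).

```python
# Naive regression under endogeneity: E[y | p] = g(p) + E[e | p] != g(p).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
e = rng.normal(size=n)                 # demand shock, unobserved
p = 0.8 * e + rng.normal(size=n)       # price responds to the same shock
tau_true = -1.0
y = tau_true * p + e                   # true structural effect of price

tau_naive = (p @ y) / (p @ p)          # OLS of y on p
print(f"naive estimate {tau_naive:.2f} vs truth {tau_true}")   # biased toward zero
```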

Instrumental Variables. [DAG: instrument z shifts treatment p; unobserved e drives both p and y; covariates x enter both.] The exclusion structure implies
  E[y | x, z] = ∫ g(p, x) dF(p | x, z)
You can observe and estimate E[y | x, z] and F(p | x, z) to solve for the structural g(p, x): we have an inverse problem. (cf. Newey and Powell 2003)

min_{g ∈ G} Σ_i ( y_i − ∫ g(p, x_i) dF(p | x_i, z_i) )²
2SLS: p = βz + ν and g(p) = τp, so that ∫ g(p) dF(p | z) = τ E[p | z]. So you first regress p on z, then regress y on p̂ to recover τ̂ (see the sketch below).
Sieve: g(p, x_i) ≈ Σ_k γ_k φ_k(p, x_i), with E_F[φ_k(p, x_i)] ≈ Σ_j α_kj β_j(x_i, z_i).
Also Blundell, Chen, and Kristensen; Chen and Pouzo; Darolles et al.; Hall and Horowitz.
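A minimal 2SLS sketch for that linear special case, using numpy only and a simulated instrument (the coefficients are illustrative):

```python
# Two-stage least squares for the linear special case g(p) = tau * p.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
z = rng.normal(size=n)                    # instrument: shifts price, excluded from y
e = rng.normal(size=n)                    # endogenous error
p = 1.0 * z + 0.8 * e + rng.normal(size=n)
tau_true = -1.0
y = tau_true * p + e

# first stage: regress p on z, form fitted values p_hat = E[p | z]
beta_hat = (z @ p) / (z @ z)
p_hat = beta_hat * z

# second stage: regress y on p_hat to recover tau
tau_2sls = (p_hat @ y) / (p_hat @ p_hat)
tau_ols = (p @ y) / (p @ p)
print(f"OLS {tau_ols:.2f}, 2SLS {tau_2sls:.2f}, truth {tau_true}")
```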

min_{g ∈ G} Σ_i ( y_i − ∫ g(p, x_i) dF(p | x_i, z_i) )²
Deep IV uses DNNs to target this integral loss function directly:
- First, fit F̂ using a network with a multinomial response
- Second (preferably on another sample), fit g_θ following the gradient
  ∇_θ L(x_i, y_i, z_i; θ) = −2 ( y_i − g_θ(ṗ, x_i) ) ∇_θ g_θ(p̈, x_i),  with independent draws ṗ, p̈ ~ F̂(p | x_i, z_i)
(Hartford, Lewis, Leyton-Brown, Taddy, ICML 2017)
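A rough second-stage sketch in PyTorch, assuming the first-stage density has already been fitted. The `sample_p` placeholder, the tiny network, and the toy data are hypothetical, and this is not the authors' released implementation; the point is that the product of two independent-draw residuals has a gradient that is unbiased for the gradient of the integral loss above:

```python
# Deep IV second stage (sketch): train g_theta with two independent draws
# p1, p2 ~ F_hat(p | x, z) per observation.
import torch
import torch.nn as nn

g = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # g_theta(p, x)
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

def sample_p(x, z):
    # hypothetical stand-in for the fitted first stage (in practice a
    # multinomial / mixture-density network for p | x, z)
    return z + 0.1 * torch.randn_like(z)

def second_stage_step(x, z, y):
    p1, p2 = sample_p(x, z), sample_p(x, z)       # two independent draws
    r1 = y - g(torch.cat([p1, x], dim=1))         # residual with draw 1
    r2 = y - g(torch.cat([p2, x], dim=1))         # residual with draw 2
    # surrogate used only for its gradient: E[grad of r1*r2] equals the
    # gradient of (y - E[g_theta(p, x)])^2 because the draws are independent
    loss = (r1 * r2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# illustrative toy batch and training loop
x = torch.randn(256, 1)
z = torch.randn(256, 1)
y = -1.5 * sample_p(x, z) + x + 0.1 * torch.randn_like(x)
for _ in range(200):
    second_stage_step(x, z, y)
```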

Stochastic Gradient Descent. You have loss L(D, θ) where D = {d_1, ..., d_N}.
In the usual GD, you iteratively descend: θ_t = θ_{t−1} − C_t ∇L(D, θ_{t−1}).
In SGD, you instead follow noisy but unbiased sample gradients on minibatches of size B:
  θ_t = θ_{t−1} − C_t ∇L({d_tb}_{b=1..B}, θ_{t−1})
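A bare-bones minibatch SGD loop for least squares (the simulated data, batch size, and step-size schedule are arbitrary illustrations):

```python
# Minibatch SGD for least squares: follow noisy but unbiased gradients
# computed on random batches of size B instead of the full data D.
import numpy as np

rng = np.random.default_rng(3)
N, K, B = 10_000, 5, 64
X = rng.normal(size=(N, K))
theta_true = rng.normal(size=K)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(K)
for t in range(1, 2001):
    idx = rng.integers(0, N, size=B)                       # sample a minibatch
    grad = 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / B    # unbiased gradient estimate
    theta -= (0.1 / np.sqrt(t)) * grad                     # step size C_t decays with t
print(np.round(theta - theta_true, 3))                     # should be near zero
```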

A pricing simulation. [Figure: simulated demand with a time-dependent component ψ_t and customer types s.]

Biased? The single-draw version
  ∇_θ L(x_i, y_i, z_i; θ) = −2 ( y_i − g_θ(ṗ, x_i) ) ∇_θ g_θ(ṗ, x_i)
is biased for our loss but unbiased for an upper bound on that loss (via Jensen's inequality). It works pretty well on OOS loss:
[Figure: out-of-sample loss comparison.]

Validation and model tuning. We can do OOS causal validation:
- Leave-out deviance on the first stage: −Σ_{i ∈ LO} log f̂(p_i | x_i, z_i)
- Leave-out loss on the second: Σ_{i ∈ LO} ( y_i − ∫ g_θ(p, x_i) dF̂(p | x_i, z_i) )²
You want to minimize both of these (in order). A small sketch of both criteria follows.
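A sketch of the two leave-out criteria, with a Gaussian first stage and a linear second stage standing in for the fitted networks (the simulation and all names are illustrative):

```python
# Leave-out validation for an IV fit: (1) held-out deviance of f_hat(p | z),
# (2) held-out loss of g_theta against a Monte Carlo integral over F_hat.
import numpy as np

rng = np.random.default_rng(5)
n = 4000
z = rng.normal(size=n)
p = 1.0 * z + rng.normal(size=n)
y = -1.5 * p + rng.normal(size=n)
train, lo = np.arange(0, 3000), np.arange(3000, n)       # leave-out split

# first stage: Gaussian p | z with fitted slope and residual scale
b = (z[train] @ p[train]) / (z[train] @ z[train])
sigma = np.std(p[train] - b * z[train])
def logf(p_, z_):                                        # log f_hat(p | z)
    return -0.5 * ((p_ - b * z_) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# second stage: g_theta(p) = theta * p (placeholder for a DNN)
theta = -1.4
def g(p_):
    return theta * p_

# criterion 1: leave-out deviance of the first stage (smaller is better)
dev = -2 * np.sum(logf(p[lo], z[lo]))

# criterion 2: leave-out integral loss, averaging g over draws from F_hat(p | z)
draws = b * z[lo][None, :] + sigma * rng.normal(size=(200, len(lo)))
loss = np.sum((y[lo] - g(draws).mean(axis=0)) ** 2)
print(f"first-stage deviance {dev:.0f}, second-stage loss {loss:.0f}")
```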

Deep nets are not nonparametric sieves. The 1st layer is a big dimension reduction: word embeddings for text, matrix convolutions for images.

Why? Heterogeneity! Example: ads application from Goldman and Rao (2014).
- 74 million click-rates over 4-hour increments for 10k search terms
- Treatment: ad position (1 vs 2)
- Instrument: background A/B testing (a bench of ~100 tests)
- Covariates: advertiser id and ad properties, search text, time period
The reduction in clicks due to a drop in position is search- and ad-dependent.

Example product (click-rate response):
- The treatment effect is small for on-navigation queries (e.g., searches for "microsoft.com")
- The effect rises for less popular websites (brands)
- The off-brand effect is flat with website popularity

Inference? Good question. Three approaches:
- Data splitting
- Variational dropout
- Quantile regression
(with Jason Hartford, UBC)

Data Split
- Fit DNNs that map from inputs to an output layer ψ_k(x), k = 1, ..., K
- Use out-of-sample x_i to obtain 'features' ψ_ik = ψ_k(x_i)
- Possibly do PCA on ψ_i to get a nonsingular design
- Fit OLS y_i ~ ψ_i'β to get β̂ with sandwich variance
  var(β̂) = (Ψ'Ψ)⁻¹ Ψ' diag{(y − Ψβ̂)²} Ψ (Ψ'Ψ)⁻¹
This can be used to get var(Ê[y | x]); see the sketch below.
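A sketch of that recipe with a hand-coded `psi` standing in for a trained DNN's output layer (the feature map, simulated data, and split sizes are illustrative assumptions):

```python
# Data-split inference: treat out-of-sample top-layer features psi(x) as a fixed
# design, fit OLS, and use the heteroskedasticity-robust sandwich variance.
import numpy as np

rng = np.random.default_rng(4)
n_train, n_out, d = 2000, 2000, 3
X = rng.normal(size=(n_train + n_out, d))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n_train + n_out)

def psi(x):
    # stand-in 'output layer' features; in practice these come from a DNN
    # fit on the training split only
    return np.column_stack([np.ones(len(x)), np.sin(x[:, 0]), x[:, 1] ** 2])

Psi = psi(X[n_train:])                                 # features on the held-out split
y_out = y[n_train:]
beta = np.linalg.solve(Psi.T @ Psi, Psi.T @ y_out)     # OLS on held-out data
resid = y_out - Psi @ beta
bread = np.linalg.inv(Psi.T @ Psi)
meat = Psi.T @ (Psi * resid[:, None] ** 2)
V_beta = bread @ meat @ bread                          # sandwich var(beta)

x_new = psi(rng.normal(size=(1, d)))
fit = (x_new @ beta).item()
se_fit = np.sqrt(x_new @ V_beta @ x_new.T).item()      # std error of E[y | x_new]
print(f"E[y | x_new] ~ {fit:.2f} +/- {1.96 * se_fit:.2f}")
```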

Variational Bayes and Dropout
- VB fits q to minimize E_q[ log q(W) − log p(D | W) − log p(W) ]
- We train with dropout SGD: at each update of weights Ω, use gradients for W = ξ ∘ Ω, with ξ ~ Bern(c)
- This corresponds to VB under q(W) = Π_{lk} [ c 1{W_lk = Ω_lk} + (1 − c) 1{W_lk = 0} ]
This can be used to get var(Ê[y | x]); see the sketch below.
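A sketch of the usual MC-dropout recipe, using unit (activation) dropout as the common stand-in for the weight-dropout scheme above; the architecture, dropout rate, and number of passes are illustrative:

```python
# MC-dropout sketch: keep dropout active at prediction time and use the spread of
# repeated stochastic forward passes as an approximate variance for E[y | x].
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(), nn.Dropout(p=0.5),   # Bernoulli(c) drops
    nn.Linear(64, 1),
)
# ... train `net` with dropout SGD as usual ...

x = torch.randn(10, 3)
net.train()                                           # keep dropout switched on
with torch.no_grad():
    draws = torch.stack([net(x) for _ in range(200)])   # 200 stochastic passes
mean, var = draws.mean(dim=0), draws.var(dim=0)          # approx E and var of E[y | x]
print(mean.squeeze(), var.squeeze())
```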

Quantile Regression. Instead of targeting MSE or logit loss, minimize the quantile (pinball) loss
  L_q = ( y − η_q(x) ) ( q − 1[ y < η_q(x) ] )
where q is your desired probability and η_q(x) is the quantile function. Better yet, architect a net to fit multiple quantiles at once. This can be used to get prediction intervals for y | x (see the sketch below).
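A minimal pinball-loss implementation in PyTorch; the two-quantile usage note is an illustrative assumption about how a multi-quantile net might be wired:

```python
# Pinball (quantile) loss: L_q = (y - eta) * (q - 1[y < eta]),
# written here as max(q * diff, (q - 1) * diff) with diff = y - eta.
import torch

def pinball_loss(y, eta, q):
    # y: targets, eta: predicted q-th quantile, q: desired probability in (0, 1)
    diff = y - eta
    return torch.mean(torch.maximum(q * diff, (q - 1) * diff))

# e.g. train eta_0.05(x) and eta_0.95(x) jointly for a 90% prediction interval:
# loss = pinball_loss(y, net(x)[:, 0:1], 0.05) + pinball_loss(y, net(x)[:, 1:2], 0.95)
```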

PI coverage around random songs
- A dataset of a million songs; inputs are timbre features, output is the release year
- Test and train are split to have no overlap on artists
[Figure: prediction-interval coverage for randomly chosen songs.]

If you want prediction intervals, you should use quantile regression. For confidence intervals, sample splitting can't be beat.

Economic AI
- The ML doesn't create new economic insights or replace economists
- It automates and accelerates subjective, labor-intensive measurement
- Instruments are everywhere inside firms; with reinforcement learning there will be even more
- Reduced forms are low-hanging fruit; structural econometrics is next
- We need to link long-term rewards to short-term signals

Business AI
- The deep learning revolution: good, low-dev-cost, off-the-shelf ML
- As the tools become plug-and-play, teams get interdisciplinary
- The next big gains in AI are coming from domain context
- Use domain structure to break questions into ML problems
- Don't re-learn things you already know with baby AI