Doubly Robust Policy Evaluation and Learning

Doubly Robust Policy Evaluation and Learning
Miroslav Dudik, John Langford and Lihong Li (Yahoo! Research)
Discussed by Miao Liu, October 9, 2011

Outline

1. Introduction
2. Problem Definition and Approaches
   - Two Common Solutions
   - Doubly Robust (DR) Estimator
3. Experiments
   - Multiclass Classification with Bandit Feedback
   - Estimating Average User Visits

Introduction

This paper is about decision making in environments where feedback is received only for the chosen action (medical studies, content recommendation, etc.). Such problems are instances of contextual bandits.

Definition (Contextual bandit problem). In a contextual bandit problem, there is a distribution $D$ over pairs $(x, \vec r)$, where $x \in X$ is the context (feature vector), $a \in A = \{1, \ldots, k\}$ indexes one of the $k$ arms to be pulled, and $\vec r \in [0, 1]^A$ is a vector of rewards. The problem is a repeated game: on each round, a sample $(x, \vec r)$ is drawn from $D$, the context $x$ is announced, and then, for precisely one arm $a$ chosen by the player, its reward $r_a$ is revealed.

Definition (Contextual bandit algorithm). A contextual bandit algorithm $B$ determines an arm $a \in A = \{1, \ldots, k\}$ to pull at each time step $t$, based on the previous observation sequence $(x_1, a_1, r_{a_1,1}), \ldots, (x_{t-1}, a_{t-1}, r_{a_{t-1},t-1})$ and the current context $x_t$.

Definition (Input data generation).
- A new example $(x, \vec r) \sim D$ arrives.
- An action $a \sim p(a \mid x, h)$ is chosen by the logging policy, where $h$ is the history of previous observations, i.e. the concatenation of all preceding contexts, actions and observed rewards.
- The reward $r_a$ is revealed, whereas the other rewards $r_{a'}$ with $a' \neq a$ are not observed.

Note that the distribution $D$ and the policy $p$ are unknown. Given a data set $S = \{(x, h, a, r_a)\}$, the tasks of interest are

1. Policy evaluation: $V^{\pi} = E_{(x, \vec r) \sim D}[r_{\pi(x)}]$
2. Policy optimization: $\pi^{*} = \arg\max_{\pi} V^{\pi}$

Better evaluation generally leads to better optimization. The key challenge: only partial information about the reward is available, which prohibits direct policy evaluation on the data $S$.
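
To make the data-generation protocol concrete, here is a minimal sketch of how a logged data set $S$ of tuples $(x, a, r_a, p(a \mid x))$ might be simulated. Everything in it (the context distribution, the softmax logging policy, the logistic reward model, and all names) is an illustrative assumption, not something taken from the paper.

```python
# Illustrative simulation of logged contextual-bandit data; the context
# distribution, logging policy and reward model below are assumptions.
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 4, 5, 10_000                       # arms, context dimension, log size
theta = rng.normal(size=(k, d))              # hypothetical true reward parameters

def logging_policy(x):
    """Stationary logging policy p(. | x): softmax over linear scores."""
    scores = 0.5 * (theta @ x)
    w = np.exp(scores - scores.max())
    return w / w.sum()

log = []
for _ in range(n):
    x = rng.normal(size=d)                   # x ~ D
    p = logging_policy(x)
    a = rng.choice(k, p=p)                   # arm actually played by the logger
    r = float(rng.random() < 1.0 / (1.0 + np.exp(-theta[a] @ x)))  # only r_a is observed
    log.append((x, a, r, p[a]))              # keep the propensity of the logged arm
```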

Two Common Solutions I: Direct Method (DM)

Form an estimate $\hat\rho_{a}(x)$ of the expected reward conditioned on the context and action, and estimate the policy value by
$$\hat V^{\pi}_{DM} = \frac{1}{|S|} \sum_{x \in S} \hat\rho_{\pi(x)}(x). \qquad (1)$$

Notes:
- If $\hat\rho_{a}(x)$ is a good approximation of the true expected reward, defined as $\rho_{a}(x) = E_{(x, \vec r) \sim D}[r_a \mid x]$, then (1) is close to $V^{\pi}$.
- If $\hat\rho_{a}(x)$ is unbiased, $\hat V^{\pi}_{DM}$ is an unbiased estimate of $V^{\pi}$.
- A problem with DM: $\hat\rho$ is formed without knowledge of $\pi$, and the resulting estimate is typically biased.
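
A minimal sketch of estimator (1), assuming a log of $(x, a, r_a, \hat p)$ tuples like the one above and user-supplied callables `rho_hat(x, a)` and `pi(x)` (both hypothetical names):

```python
# Direct-method estimate (1): average the modelled reward of the target
# policy's action; the reward model rho_hat and policy pi are assumed inputs.
import numpy as np

def v_dm(log, rho_hat, pi):
    """V_DM = (1/|S|) * sum_{x in S} rho_hat_{pi(x)}(x)."""
    return float(np.mean([rho_hat(x, pi(x)) for (x, a, r, p_hat) in log]))
```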

Two Common Solutions II: Inverse Propensity Score (IPS)

Form an approximation $\hat p(a \mid x, h)$ of $p(a \mid x, h)$, and correct for the shift in action proportions between the old (data collection) policy and the new policy:
$$\hat V^{\pi}_{IPS} = \frac{1}{|S|} \sum_{(x, h, a, r_a) \in S} \frac{r_a \, I(\pi(x) = a)}{\hat p(a \mid x, h)}, \qquad (2)$$
where $I(\cdot)$ is the indicator function.

Notes:
- If $\hat p(a \mid x, h) \approx p(a \mid x, h)$, then (2) is approximately an unbiased estimate of $V^{\pi}$ (by importance sampling theory).
- Since the data collection policy $p(a \mid x, h)$ is typically well understood, it is often easier to obtain a good estimate $\hat p$ than a good reward model.
- However, IPS typically has a much larger variance, because the range of the random variable increases (especially when $p(a \mid x, h)$ is small).
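
A corresponding sketch of estimator (2), again assuming log records of the form $(x, a, r_a, \hat p(a \mid x))$:

```python
# IPS estimate (2): importance-weight the observed reward by the (estimated)
# propensity of the logged action; the log format and pi are assumed inputs.
import numpy as np

def v_ips(log, pi):
    """V_IPS = (1/|S|) * sum r_a * 1{pi(x) = a} / p_hat(a | x)."""
    return float(np.mean([r * (pi(x) == a) / p_hat for (x, a, r, p_hat) in log]))
```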

Doubly Robust (DR) Estimator

DR estimators take advantage of both the estimate of the expected reward $\hat\rho_{a}(x)$ and the estimate of the action probabilities $\hat p(a \mid x, h)$. DR estimators were first suggested by Cassel et al. (1976) for regression, but had not been studied for policy learning:
$$\hat V^{\pi}_{DR} = \frac{1}{|S|} \sum_{(x, h, a, r_a) \in S} \left[ \frac{\bigl(r_a - \hat\rho_{a}(x)\bigr)\, I(\pi(x) = a)}{\hat p(a \mid x, h)} + \hat\rho_{\pi(x)}(x) \right]. \qquad (3)$$

Informally, (3) uses $\hat\rho$ as a baseline and, where data are available, applies an importance-weighted correction. The estimator (3) is accurate if at least one of the two estimators, $\hat\rho$ and $\hat p$, is accurate.

A basic question: how does (3) perform as the estimates $\hat\rho$ and $\hat p$ deviate from the truth?
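
A minimal sketch of estimator (3) in the same assumed setting as the DM and IPS sketches:

```python
# Doubly robust estimate (3): reward-model baseline plus an importance-weighted
# correction on the logged action; rho_hat, pi and the log format are assumed.
import numpy as np

def v_dr(log, rho_hat, pi):
    """V_DR = (1/|S|) * sum [ (r_a - rho_hat_a(x)) * 1{pi(x)=a} / p_hat + rho_hat_{pi(x)}(x) ]."""
    terms = []
    for (x, a, r, p_hat) in log:
        correction = (r - rho_hat(x, a)) * (pi(x) == a) / p_hat
        terms.append(correction + rho_hat(x, pi(x)))
    return float(np.mean(terms))
```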

Bias Analysis

Denote by $\Delta$ the additive deviation of $\hat\rho$ from $\rho$ and by $\delta$ a multiplicative deviation of $\hat p$ from $p$:
$$\Delta(a, x) = \hat\rho_{a}(x) - \rho_{a}(x), \qquad \delta(a, x, h) = 1 - p(a \mid x, h)/\hat p(a \mid x, h). \qquad (4)$$

To remove clutter, use the shorthands $\rho_a$ for $\rho_a(x)$, $I$ for $I(\pi(x) = a)$, $p$ for $p(\pi(x) \mid x, h)$, $\hat p$ for $\hat p(\pi(x) \mid x, h)$, $\Delta$ for $\Delta(\pi(x), x)$, and $\delta$ for $\delta(\pi(x), x, h)$.

To evaluate $E[\hat V^{\pi}_{DR}]$, it suffices to focus on a single term of Eq. (3), conditioning on $h$:
$$\begin{aligned}
E_{(x, \vec r) \sim D,\ a \sim p(\cdot \mid x, h)}\!\left[ \frac{(r_a - \hat\rho_a) I}{\hat p} + \hat\rho_{\pi(x)} \right]
&= E_{x, \vec r, a \mid h}\!\left[ \frac{(r_a - \rho_a) I}{\hat p} + \rho_{\pi(x)} + \Delta \Bigl(1 - \frac{I}{\hat p}\Bigr) \right] \qquad (5) \\
&= E_{x, a \mid h}\!\left[ \Delta \Bigl(1 - \frac{I}{\hat p}\Bigr) \right] + E_x\!\left[\rho_{\pi(x)}\right] \\
&= E_{x \mid h}\!\left[ \Delta \,(1 - p/\hat p) \right] + V^{\pi} \\
&= E_{x \mid h}[\Delta \delta] + V^{\pi}.
\end{aligned}$$

Theorem (Bias). Let $\Delta$ and $\delta$ be defined as above. Then the bias of the doubly robust estimator is
$$E_S[\hat V^{\pi}_{DR}] - V^{\pi} = \frac{1}{|S|}\, E_S\!\left[ \sum_{(x, h) \in S} \Delta \delta \right].$$
If the past policy and the past policy estimate are stationary (i.e., independent of $h$), the expression simplifies to
$$E_S[\hat V^{\pi}_{DR}] - V^{\pi} = E_x[\Delta \delta].$$

In contrast (assuming stationarity):
$$E_S[\hat V^{\pi}_{DM}] - V^{\pi} = E_x[\Delta], \qquad E_S[\hat V^{\pi}_{IPS}] - V^{\pi} = E_x[\rho_{\pi(x)}\, \delta]. \qquad (6)$$

- IPS is a special case of DR with $\hat\rho_{a}(x) \equiv 0$.
- If either $\Delta \to 0$ or $\delta \to 0$, then $\hat V^{\pi}_{DR} \to V^{\pi}$, while DM requires $\Delta \to 0$ and IPS requires $\delta \to 0$.
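
The double-robustness property can be checked numerically. The following toy Monte Carlo experiment (a made-up two-arm problem, not from the paper) illustrates that $\hat V^{\pi}_{DR}$ is roughly unbiased whenever either the reward model ($\Delta = 0$) or the propensities ($\delta = 0$) are correct, and is biased by about $E_x[\Delta\delta]$ when both are wrong:

```python
# Toy check of the bias theorem on an invented two-arm, binary-context problem.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, size=n)                  # binary context
rho = np.array([[0.2, 0.8],                     # true E[r_a | x]; rows = x, cols = a
                [0.6, 0.4]])
p_log = np.where(x == 0, 0.9, 0.3)              # true P(a=1 | x) under the logger
a = (rng.random(n) < p_log).astype(int)
r = (rng.random(n) < rho[x, a]).astype(float)   # only the played arm's reward is seen

pi = 1 - x                                      # target policy: arm 1 on x=0, arm 0 on x=1
v_true = rho[x, pi].mean()

def dr(rho_hat, p_hat_of_a):
    return np.mean((r - rho_hat[x, a]) * (pi == a) / p_hat_of_a + rho_hat[x, pi])

good_rho, bad_rho = rho, rho + 0.15             # Delta != 0 for the bad reward model
good_p = np.where(a == 1, p_log, 1 - p_log)     # exact propensity of the logged arm
bad_p = np.full(n, 0.5)                         # delta != 0 for the bad propensities

print("true value        ", round(v_true, 3))
print("DR, rho ok, p bad ", round(dr(good_rho, bad_p), 3))   # ~ unbiased
print("DR, rho bad, p ok ", round(dr(bad_rho, good_p), 3))   # ~ unbiased
print("DR, both bad      ", round(dr(bad_rho, bad_p), 3))    # biased by ~ E[Delta*delta]
```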

Variance Analysis

Define $\epsilon = (r_a - \rho_a) I / \hat p$ and note that $E[\epsilon \mid x, a] = 0$. For a single term of Eq. (3) we have
$$\begin{aligned}
E_{x, \vec r, a}\!\left[ \Bigl( \frac{(r_a - \hat\rho_a) I}{\hat p} + \hat\rho_{\pi(x)} \Bigr)^{\!2} \right]
&= E_{x, \vec r, a}[\epsilon^2] + E_x[\rho_{\pi(x)}^2] + 2\,E_{x, a}\!\left[\rho_{\pi(x)} \Delta (1 - I/\hat p)\right] + E_{x, a}\!\left[\Delta^2 (1 - I/\hat p)^2\right] \\
&= E_{x, \vec r, a}[\epsilon^2] + E_x[\rho_{\pi(x)}^2] + 2\,E_x[\rho_{\pi(x)} \Delta \delta] + E_x\!\left[\Delta^2 \bigl(1 - 2p/\hat p + p/\hat p^2\bigr)\right] \\
&= E_{x, \vec r, a}[\epsilon^2] + E_x[\rho_{\pi(x)}^2] + 2\,E_x[\rho_{\pi(x)} \Delta \delta] + E_x\!\left[\Delta^2 \bigl(1 - 2p/\hat p + p^2/\hat p^2 + p(1-p)/\hat p^2\bigr)\right] \\
&= E_{x, \vec r, a}[\epsilon^2] + E_x\!\left[(\rho_{\pi(x)} + \Delta \delta)^2\right] + E_x\!\left[\Delta^2\, p(1-p)/\hat p^2\right] \\
&= E_{x, \vec r, a}[\epsilon^2] + E_x\!\left[(\rho_{\pi(x)} + \Delta \delta)^2\right] + E_x\!\left[\frac{1-p}{p}\, \Delta^2 (1 - \delta)^2\right].
\end{aligned}$$

Summing across all terms of Eq. (3) and combining with the bias theorem, the variance is obtained.

Theorem (Variance). Let $\Delta$, $\delta$ and $\epsilon$ be defined as above. If the past policy and the policy estimate are stationary, then the variance of the doubly robust estimator is
$$\mathrm{Var}[\hat V^{\pi}_{DR}] = \frac{1}{|S|}\left( E_{x, \vec r, a}[\epsilon^2] + \mathrm{Var}_x\!\left[\rho_{\pi(x)}(x) + \Delta \delta\right] + E_x\!\left[\frac{1-p}{p}\, \Delta^2 (1 - \delta)^2\right] \right).$$

- $E_{x, \vec r, a}[\epsilon^2]$ accounts for the randomness in the rewards.
- $\mathrm{Var}_x[\rho_{\pi(x)} + \Delta \delta]$ is the variance of the estimator due to the randomness in $x$.
- The last term can be viewed as the importance weighting penalty.

A similar expression can be derived for the IPS estimator:
$$\mathrm{Var}[\hat V^{\pi}_{IPS}] = \frac{1}{|S|}\left( E_{x, \vec r, a}[\epsilon^2] + \mathrm{Var}_x\!\left[\rho_{\pi(x)}(x) - \rho_{\pi(x)} \delta\right] + E_x\!\left[\frac{1-p}{p}\, \rho_{\pi(x)}^2 (1 - \delta)^2\right] \right).$$

The variance of the direct-method estimator is
$$\mathrm{Var}[\hat V^{\pi}_{DM}] = \frac{1}{|S|}\, \mathrm{Var}_x\!\left[\rho_{\pi(x)}(x) + \Delta\right],$$
which has no terms depending on either the past policy or the randomness in the rewards.
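
The importance-weighting penalty is easy to see numerically. In the toy single-context example below (made-up numbers, not from the paper), the logging policy plays the target arm with a small probability $p$, so the IPS estimate has a much larger standard deviation than DR with an accurate reward model, even though both are unbiased:

```python
# Toy variance comparison: small p inflates Var[V_IPS], while DR with a good
# reward model only importance-weights the residual (r - rho_hat).
import numpy as np

rng = np.random.default_rng(1)
p, rho_target, n, trials = 0.05, 0.9, 2_000, 500

ips_vals, dr_vals = [], []
for _ in range(trials):
    a_is_target = rng.random(n) < p                      # logger plays the target arm w.p. p
    r = (rng.random(n) < rho_target).astype(float)       # reward of the arm actually played
    ips_vals.append(np.mean(r * a_is_target / p))
    dr_vals.append(np.mean((r - rho_target) * a_is_target / p + rho_target))

print("target value      ", rho_target)
print("IPS  mean / std   ", round(np.mean(ips_vals), 3), round(np.std(ips_vals), 4))
print("DR   mean / std   ", round(np.mean(dr_vals), 3), round(np.std(dr_vals), 4))
```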

Data Setup

Assume data are drawn i.i.d. from a fixed distribution: $(x, c) \sim D$, where $x \in X$ is the feature vector and $c \in \{1, 2, \ldots, k\}$ is the class label. A typical goal is to find a classifier $\pi : X \to \{1, 2, \ldots, k\}$ minimizing the classification error
$$e(\pi) = E_{(x, c) \sim D}\!\left[ I(\pi(x) \neq c) \right].$$

A classifier $\pi$ may be interpreted as an action-selection policy: the data point $(x, c)$ can be turned into a cost-sensitive classification example $(x, l_1, \ldots, l_k)$, where $l_a = I(a \neq c)$ is the loss for predicting $a$. A partially labeled data set is constructed by randomly selecting a label $a \sim \mathrm{UNIF}(1, \ldots, k)$ and revealing only the component $l_a$. The final data are of the form $(x, a, l_a)$, and $p(a \mid x) = 1/k$ is assumed to be known.
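
A sketch (my reading of the setup above, not the authors' code) of the transformation from fully labeled multiclass data to partially labeled bandit data with known propensity $1/k$:

```python
# Turn (x, c) pairs into bandit records (x, a, l_a, p): reveal the loss of one
# uniformly chosen arm per example, so the propensity p(a | x) = 1/k is exact.
import numpy as np

def to_bandit(X, y, k, seed=0):
    """Return records (x, a, l_a, p) with a ~ UNIF over {0, ..., k-1} and l_a = 1{a != y}."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, k, size=len(y))
    loss = (a != np.asarray(y)).astype(float)
    prop = np.full(len(y), 1.0 / k)
    return list(zip(np.asarray(X), a, loss, prop))
```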

Policy Evaluation ($V^{\pi} = E_{(x, \vec r) \sim D}[r_{\pi(x)}]$)

For each data set:
1. Randomly split the data into training and test sets of roughly the same size.
2. Obtain a classifier $\pi$ by running the direct loss minimization (DLM) algorithm of McAllester et al. (2011) on the training set with fully revealed losses.
3. Compute the classification error on the fully observed test data (ground truth).
4. Obtain a partially labeled test set.

Both DM and DR require estimating the expected loss, denoted $\hat\ell(x, a)$, for a given $(x, a)$. Use a linear loss model $\hat\ell(x, a) = w_a \cdot x$, parameterized by $k$ weight vectors $\{w_a\}_{a \in \{1, \ldots, k\}}$, and fit the $w_a$ with least-squares ridge regression. Both IPS and DR are unbiased here, since the propensity $p(a \mid x, h) = 1/k$ is known exactly.
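
A sketch of the loss-model step, under my assumptions about the data layout (one ridge regressor per arm, fit only on records where that arm's loss was revealed); `Ridge` is scikit-learn's ridge regression:

```python
# Fit l_hat(x, a) = w_a . x by per-arm least-squares ridge regression on the
# partially labeled records (x, a, l_a, p); assumes every arm occurs in the log.
import numpy as np
from sklearn.linear_model import Ridge

def fit_loss_model(bandit_data, k, alpha=1.0):
    models = []
    for arm in range(k):
        rows = [(x, l) for (x, a, l, p) in bandit_data if a == arm]
        X = np.array([x for x, _ in rows])
        L = np.array([l for _, l in rows])
        models.append(Ridge(alpha=alpha, fit_intercept=False).fit(X, L))
    return lambda x, a: float(models[a].predict(np.asarray(x).reshape(1, -1))[0])
```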

Policy Optimization ($\pi^{*} = \arg\max_{\pi} V^{\pi}$)

Repeat the following steps 30 times:
1. Randomly split the data into training (70%) and test (30%) sets.
2. Obtain a partially labeled training data set.
3. Use the IPS and DR estimators to impute the unrevealed losses in the training data.
4. Learn a classifier from the losses completed by either IPS or DR, using two cost-sensitive multiclass classification algorithms: DLM (McAllester et al., 2011) and the Filter Tree reduction of Beygelzimer et al. (2008).
5. Evaluate the learned classifiers on the test data.
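
A sketch of step 3 as I read it (not the authors' code): each partially labeled example is completed into a full loss vector, which a cost-sensitive learner can then consume. Passing no loss model gives the IPS completion; passing a fitted `l_hat` gives the DR completion.

```python
# Complete the loss vector of each record (x, a, l_a, p): IPS puts l_a / p on
# the revealed arm and 0 elsewhere; DR starts from l_hat(x, .) and adds the
# importance-weighted residual on the revealed arm.
import numpy as np

def impute_losses(bandit_data, k, l_hat=None):
    completed = []
    for (x, a, l, p) in bandit_data:
        if l_hat is None:                                   # IPS completion
            losses = np.zeros(k)
            losses[a] = l / p
        else:                                               # DR completion
            losses = np.array([l_hat(x, j) for j in range(k)])
            losses[a] += (l - l_hat(x, a)) / p
        completed.append((x, losses))
    return completed
```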

Performance Comparisons

Estimating Average User Visits

The data set: $D = \{(b_i, x_i, v_i)\}_{i = 1, \ldots, N}$, where $N = 3{,}854{,}689$, $b_i$ is the $i$-th (unique) bcookie [1], $x_i \in \{0, 1\}^{5000}$ is the corresponding (sparse) binary feature vector, and $v_i$ is the number of visits.

A sample mean can be a biased estimator if uniform sampling is not ensured. This is known as covariate shift [2], a special case of the contextual bandit problem with $k = 2$ arms. The partially labeled data consist of tuples $(x_i, a_i, r_i)$, where $a_i \in \{0, 1\}$ indicates whether bcookie $b_i$ is sampled, $r_i = p_i v_i$ is the observed number of visits, and $p_i$ is the probability that $a_i = 1$.

The DR estimator requires building a reward model $\hat\rho(x) = \omega \cdot x$. The sampling probabilities $p_i$ of the $b_i$ are defined, following Gretton et al. (2008), as $p_i = \min\{n(x_i, \bar x), 1\}$, where $\bar x$ is the first principal component of all features $\{x_i\}$.

[1] A bcookie is a unique string that identifies a user.
[2] http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/node8.html
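
To illustrate the covariate-shift view on synthetic data (invented numbers and feature model, not the Yahoo! logs), the sketch below estimates an average-visits quantity from a non-uniform sample with IPS (Horvitz-Thompson) and with DR built on a linear reward model:

```python
# Toy covariate-shift example: estimate the population mean of v from a sample
# whose inclusion probability p depends on the features, using IPS and DR.
import numpy as np

rng = np.random.default_rng(0)
N, d = 50_000, 5
X = rng.normal(size=(N, d))
v = np.exp(0.5 * X[:, 0]) + rng.gamma(2.0, 2.0, size=N)    # visits, correlated with features
p = np.clip(0.05 + 0.9 / (1 + np.exp(-X[:, 0])), 0.05, 1)  # feature-dependent sampling prob.
a = rng.random(N) < p                                      # a_i = 1: unit i is sampled

true_mean = v.mean()
ips = np.mean(a * v / p)                                   # unbiased, may be high-variance

# DR: fit a linear reward model on the sampled rows, correct with sampled residuals.
w, *_ = np.linalg.lstsq(X[a], v[a], rcond=None)
rho_hat = X @ w
dr = np.mean(rho_hat + a * (v - rho_hat) / p)

print(f"truth {true_mean:.3f}  IPS {ips:.3f}  DR {dr:.3f}")
```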

Performance Comparisons

Figure 3: Comparison of IPS and DR: RMSE (left), bias (right). The ground truth value is 23.8.