Learning Bounds for Importance Weighting
Tamas Madarasz & Michael Rabadi
April 15, 2015


Introduction. Often the training distribution does not match the test distribution. We would like to use information about the test distribution to correct the bias, or discrepancy, between the training and test distributions.

Importance Weighting. We receive labeled training data drawn from a source distribution Q and unlabeled test data drawn from a target distribution P. The idea is to weight the cost of errors on training instances; the common choice of weight for a point x is w(x) = P(x)/Q(x).
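As a quick illustration (not from the slides), here is a minimal numerical sketch of the importance-weighted empirical risk R̂_w(h) = (1/m) Σ_i w(x_i) L_h(x_i); the densities, losses, and sample below are hypothetical placeholders.

    # Minimal sketch (assumed names): importance-weighted empirical risk with w = P/Q.
    import numpy as np

    def weighted_empirical_risk(losses, p_density, q_density, x):
        """losses[i] = L_h(x[i]) on a training sample x drawn from the source Q."""
        w = p_density(x) / q_density(x)     # importance weights w(x) = P(x)/Q(x)
        return np.mean(w * losses)

    # toy usage with hypothetical densities over a discrete domain {0, 1, 2}
    P = np.array([0.5, 0.3, 0.2]); Q = np.array([0.4, 0.4, 0.2])
    x = np.array([0, 1, 1, 2]); losses = np.array([1.0, 0.0, 1.0, 0.0])
    print(weighted_empirical_risk(losses, lambda x: P[x], lambda x: Q[x], x))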

Importance Weighting. This is a reasonable method, but it sometimes does not work. Can we give generalization bounds for it? When does domain adaptation work, and when does it fail? How should we weight the costs?

Overview: Preliminaries; learning guarantee when the loss is bounded; learning guarantee when the loss is unbounded but the second moment is bounded; algorithm.

Preliminaries: Rényi divergence. For α ≥ 0, the Rényi divergence D_α(P‖Q) between distributions P and Q is

D_α(P‖Q) = 1/(α−1) · log₂ Σ_x P(x) [P(x)/Q(x)]^{α−1},

and its exponential is

d_α(P‖Q) = 2^{D_α(P‖Q)} = [ Σ_x P^α(x) / Q^{α−1}(x) ]^{1/(α−1)}.

It measures the information lost when Q is used to approximate P, and D_α(P‖Q) = 0 iff P = Q.
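A small sketch of these definitions for finite discrete distributions represented as probability vectors (the example numbers are made up):

    # Sketch: Renyi divergence D_alpha(P||Q) and its exponential d_alpha(P||Q).
    import numpy as np

    def d_alpha(p, q, alpha):
        """d_alpha(P||Q) = [sum_x P(x)^alpha / Q(x)^(alpha-1)]^(1/(alpha-1)), alpha > 1."""
        return np.sum(p**alpha / q**(alpha - 1)) ** (1.0 / (alpha - 1))

    def D_alpha(p, q, alpha):
        """Renyi divergence in bits: D_alpha(P||Q) = log2 d_alpha(P||Q)."""
        return np.log2(d_alpha(p, q, alpha))

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(D_alpha(p, q, 2.0))   # > 0 since P != Q
    print(D_alpha(p, p, 2.0))   # 0 iff the two distributions coincide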

Preliminaries: Importance weights.

Lemma 1: E_{x∼Q}[w(x)] = 1, E_{x∼Q}[w²(x)] = d₂(P‖Q), and σ²(w) = d₂(P‖Q) − 1.

Proof: E_Q[w²] = Σ_{x∈X} w²(x) Q(x) = Σ_{x∈X} [P(x)/Q(x)]² Q(x) = Σ_{x∈X} P²(x)/Q(x) = d₂(P‖Q).

Lemma 2: For all α ≥ 1 and all h ∈ H, with the loss bounded by B = 1,

E_{x∼Q}[w²(x) L_h²(x)] ≤ d_{α+1}(P‖Q) R(h)^{1−1/α}.
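A Monte-Carlo check of Lemma 1 on a toy discrete example (the distributions are made up; this only illustrates the identities, it is not part of the proof):

    # Sketch: for x ~ Q, E[w(x)] = 1 and E[w(x)^2] = d_2(P||Q) = sum_x P(x)^2 / Q(x).
    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    x = rng.choice(len(q), size=200_000, p=q)   # sample from the source Q
    w = p[x] / q[x]                              # importance weights at the sample points

    print(w.mean(), 1.0)                         # E_Q[w] = 1
    print((w**2).mean(), np.sum(p**2 / q))       # E_Q[w^2] = d_2(P||Q)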

Preliminaries: Importance weights. Hölder's inequality (Jin, Wilson, and Nobel, 2014): let p, q > 1 with 1/p + 1/q = 1. Then, for non-negative sequences (a_x) and (b_x),

Σ_x a_x b_x ≤ ( Σ_x a_x^p )^{1/p} ( Σ_x b_x^q )^{1/q}.

Preliminaries: Importance weights. Proof of Lemma 2: Let the loss be bounded by B = 1. Applying Hölder's inequality with exponents α and α/(α−1),

E_{x∼Q}[w²(x) L_h²(x)] = Σ_x [P(x)/Q(x)]² L_h²(x) Q(x)
= Σ_x [ (P(x)/Q(x)) P(x)^{1/α} ] · [ P(x)^{1−1/α} L_h²(x) ]
≤ [ Σ_x (P(x)/Q(x))^α P(x) ]^{1/α} · [ Σ_x P(x) L_h^{2α/(α−1)}(x) ]^{(α−1)/α}
= d_{α+1}(P‖Q) [ Σ_x P(x) L_h(x) L_h^{(α+1)/(α−1)}(x) ]^{(α−1)/α}
≤ d_{α+1}(P‖Q) R(h)^{1−1/α} B^{1+1/α} = d_{α+1}(P‖Q) R(h)^{1−1/α}.
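A numerical sanity check of Lemma 2 on a random discrete example with losses in [0, 1] (purely illustrative; the random distributions and loss values are assumptions):

    # Sketch: check E_Q[w^2 L_h^2] <= d_{alpha+1}(P||Q) * R(h)^(1 - 1/alpha).
    import numpy as np

    rng = np.random.default_rng(1)
    q = rng.dirichlet(np.ones(10))
    p = rng.dirichlet(np.ones(10))
    L = rng.uniform(0.0, 1.0, size=10)           # a loss value L_h(x) for each point x

    w = p / q
    lhs = np.sum(q * w**2 * L**2)                # E_Q[w^2 L_h^2]
    R = np.sum(p * L)                            # R(h) = E_P[L_h]
    for alpha in (1.0, 2.0, 4.0):
        d = np.sum(p**(alpha + 1) / q**alpha) ** (1.0 / alpha)   # d_{alpha+1}(P||Q)
        print(alpha, lhs <= d * R**(1.0 - 1.0/alpha))            # expect True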

Learning Guarantees: Bounded case P(x) sup x w(x) = sup x Q(x) = d (P Q) = M. Let d (P Q) < +. Fix h H. Then, for any δ > 0, with probability at least 1 δ, R(h) ˆR w (h) M log 2 δ 2m M can be very large, so we naturally want a more favorable bound... 9

Overview: Preliminaries; learning guarantee when the loss is bounded; learning guarantee when the loss is unbounded but the second moment is bounded; algorithm.

Learning Guarantees: Bounded case. Theorem 1: Fix h ∈ H. For any α ≥ 1 and any δ > 0, with probability at least 1 − δ, the following bound holds:

R(h) ≤ R̂_w(h) + 2M log(1/δ) / (3m) + √( 2 [ d_{α+1}(P‖Q) R(h)^{1−1/α} − R(h)² ] log(1/δ) / m ).
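To see why the variance-based bound helps, here is a small comparison of the two slack terms with made-up values of M, d_{α+1}(P‖Q), R(h), and m (α = 1); none of these numbers come from the slides.

    # Sketch: Theorem 1's Bernstein-type slack vs. the sup-norm (Hoeffding-type) slack.
    import numpy as np

    def bernstein_slack(M, d_alpha_plus_1, R, m, alpha=1.0, delta=0.05):
        """2M log(1/d)/(3m) + sqrt(2 sigma^2 log(1/d)/m),
        with sigma^2 <= d_{alpha+1}(P||Q) R(h)^(1-1/alpha) - R(h)^2."""
        log_term = np.log(1.0 / delta)
        var_bound = d_alpha_plus_1 * R ** (1.0 - 1.0 / alpha) - R**2
        return 2 * M * log_term / (3 * m) + np.sqrt(2 * var_bound * log_term / m)

    def hoeffding_slack(M, m, delta=0.05):
        return M * np.sqrt(np.log(2.0 / delta) / (2.0 * m))

    M, d2, R, m = 100.0, 5.0, 0.1, 10_000        # hypothetical values
    print(bernstein_slack(M, d2, R, m))          # variance-based: much smaller
    print(hoeffding_slack(M, m))                 # sup-based: > 1, i.e. vacuous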

Learning Guarantees: Bounded case. Bernstein's inequality (Bernstein, 1946): let X₁, …, X_n be independent zero-mean random variables with X_i ≤ M and average variance σ². Then

Pr[ (1/n) Σ_{i=1}^n X_i ≥ ε ] ≤ exp( − nε² / (2σ² + 2Mε/3) ).

Learning Guarantees: Bounded case. Proof of Theorem 1: Let Z be the random variable w(x) L_h(x) − R(h), so that E_Q[Z] = 0 and Z ≤ M. By Lemma 2, the variance of Z can be bounded in terms of d_{α+1}(P‖Q):

σ²(Z) = E_Q[w²(x) L_h²(x)] − R(h)² ≤ d_{α+1}(P‖Q) R(h)^{1−1/α} − R(h)².

Bernstein's inequality applied to Z then gives

Pr[ R(h) − R̂_w(h) > ε ] ≤ exp( − (mε²/2) / (σ²(Z) + εM/3) ).

Learning Guarantees: Bounded case. Setting this upper bound equal to δ and solving the resulting quadratic for ε, with probability at least 1 − δ,

R(h) ≤ R̂_w(h) + M log(1/δ)/(3m) + √( M² log²(1/δ)/(9m²) + 2σ²(Z) log(1/δ)/m )
     ≤ R̂_w(h) + 2M log(1/δ)/(3m) + √( 2σ²(Z) log(1/δ)/m ),

where the last step uses √(a + b) ≤ √a + √b.

Learning Guarantees: Bounded case. Theorem 2: Let H be a finite hypothesis set. Then, for any δ > 0, with probability at least 1 − δ, the following holds uniformly over h ∈ H for the importance weighting method:

R(h) ≤ R̂_w(h) + 2M (log|H| + log(1/δ)) / (3m) + √( 2 d₂(P‖Q) (log|H| + log(1/δ)) / m ).

Learning Guarantees: Bounded case. Theorem 2 corresponds to the case α = 1, where Theorem 1 simplifies to

R(h) ≤ R̂_w(h) + 2M log(1/δ)/(3m) + √( 2 d₂(P‖Q) log(1/δ)/m ).

Theorem 2 then follows by a union bound over the |H| hypotheses, which introduces the log|H| term.

Learning Guarantees: Bounded case. Proposition 2 (lower bound): Assume M < ∞ and σ²(w)/M² ≥ 1/m, and assume there exists h₀ ∈ H such that L_{h₀}(x) = 1 for all x. Then there exists an absolute constant c, c = 2/41², such that

Pr[ sup_{h∈H} |R(h) − R̂_w(h)| ≥ c √( (d₂(P‖Q) − 1)/(4m) ) ] > 0.

The proof follows from Theorem 9 of Cortes, Mansour, and Mohri (2010).

Overview: Preliminaries; learning guarantee when the loss is bounded; learning guarantee when the loss is unbounded but the second moment is bounded; algorithm.

Learning Guarantees: Unbounded case. The condition d_∞(P‖Q) < +∞ does not always hold. Assume, for example, that P and Q are Gaussian with standard deviations σ_P and σ_Q and means μ′ and μ, respectively. Then

w(x) = P(x)/Q(x) = (σ_Q/σ_P) exp( [ σ_P²(x−μ)² − σ_Q²(x−μ′)² ] / (2σ_P²σ_Q²) ).

Thus, even if σ_P = σ_Q and μ ≠ μ′, d_∞(P‖Q) = sup_x P(x)/Q(x) = ∞, and Theorem 1 is not informative.

Learning Guarantees: Unbounded case. However, the second moment (and hence the variance) of the importance weights can still be bounded:

d₂(P‖Q) = E_Q[w²] = ( σ_Q / (√(2π) σ_P²) ) ∫_{−∞}^{+∞} exp( [ σ_P²(x−μ)² − 2σ_Q²(x−μ′)² ] / (2σ_P²σ_Q²) ) dx,

which is finite whenever σ_P² < 2σ_Q².
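A quick numerical illustration of this Gaussian example, with assumed parameters μ = μ′ = 0, σ_Q = 1, σ_P = 1.3 (so σ_P² < 2σ_Q²): the largest observed weights are huge relative to the typical weight, yet the Monte-Carlo estimate of d₂(P‖Q) stays small.

    # Sketch: sup_x w(x) is unbounded over the real line, but E_Q[w^2] is finite.
    import numpy as np

    def normal_pdf(x, mu, sigma):
        return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=1_000_000)               # sample from Q = N(0, 1)
    w = normal_pdf(x, 0.0, 1.3) / normal_pdf(x, 0.0, 1.0)  # w(x) = P(x)/Q(x), P = N(0, 1.3^2)

    print(w.mean())       # close to 1 (Lemma 1)
    print(w.max())        # a handful of extreme points receive very large weights
    print((w**2).mean())  # Monte-Carlo estimate of d_2(P||Q); finite since 1.3^2 < 2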

Learning Guarantees: Unbounded case. Intuition: if μ = μ′ and σ_P ≫ σ_Q, then Q provides useful information about P, but a sample from Q contains only a few points far from μ, and those few extreme points receive very large weights. Likewise, if σ_P = σ_Q but the means are far apart (μ′ ≫ μ), the weights of most sample points are negligible.

Learning Guarantees: Unbounded case. Theorem 3: Let H be a hypothesis set such that Pdim({L_h(x) : h ∈ H}) = p < ∞. Assume that d₂(P‖Q) < +∞ and w(x) ≠ 0 for all x. Then, for any δ > 0, with probability at least 1 − δ, the following holds:

R(h) ≤ R̂_w(h) + 2^{5/4} √(d₂(P‖Q)) [ ( p log(2me/p) + log(4/δ) ) / m ]^{3/8}.

Learning Guarantees: Unbounded case. Proof outline (the full proof is in Cortes, Mansour, and Mohri, 2010). The argument rests on relative-deviation bounds. For losses taking values in [0, 1], a classical Vapnik-type bound controls the relative deviation uniformly over H:

Pr[ sup_{h∈H} ( R(h) − R̂(h) ) / √(R(h)) > ε ] ≤ 4 Π_H(2m) exp( −mε²/4 ).

Applying this to the level sets {L_h > t}, t ∈ R, yields a bound for unbounded non-negative losses normalized by their second moment, with the growth function Π_H(2m) bounded via the pseudo-dimension p:

Pr[ sup_{h∈H} ( E[L_h(x)] − Ê[L_h(x)] ) / √(E[L_h²(x)]) > ε ] ≤ 4 exp( p log(2em/p) − mε^{8/3}/4^{5/3} ).

Setting the right-hand side equal to δ and solving for ε gives, with probability at least 1 − δ, for all h ∈ H,

| E[L_h(x)] − Ê[L_h(x)] | ≤ 2^{5/4} max{ √(E[L_h²(x)]), √(Ê[L_h²(x)]) } [ ( p log(2me/p) + log(8/δ) ) / m ]^{3/8}.

Learning Guarantees: Unbounded case. Thus, we can show the following:

Pr[ sup_{h∈H} ( R(h) − R̂_w(h) ) / √(d₂(P‖Q)) > ε ] ≤ 4 exp( p log(2em/p) − mε^{8/3}/4^{5/3} ),

where p = Pdim({L_h(x) : h ∈ H}), which is also the pseudo-dimension of H′ = {w(x)L_h(x) : h ∈ H}. Indeed, any set A = {x₁, …, x_k} shattered by H′ with witnesses (r₁, …, r_k) is also shattered by {L_h : h ∈ H} with witnesses s_i = r_i/w(x_i), since w(x_i) > 0, so the two classes shatter exactly the same sets.

Overview: Preliminaries; learning guarantee when the loss is bounded; learning guarantee when the loss is unbounded but the second moment is bounded; algorithm.

Alternative algorithms. We can generalize this analysis to an arbitrary function u : X → R with u > 0. Let R̂_u(h) = (1/m) Σ_{i=1}^m u(x_i) L_h(x_i), and let Q̂ denote the empirical distribution of the training sample.

Theorem 4: Let H be a hypothesis set such that Pdim({L_h(x) : h ∈ H}) = p < ∞. Assume that 0 < E_Q[u²(x)] < +∞ and u(x) ≠ 0 for all x. Then, for any δ > 0, with probability at least 1 − δ,

R(h) ≤ R̂_u(h) + | E_Q[ (w(x) − u(x)) L_h(x) ] | + 2^{5/4} max{ √(E_Q[u²(x) L_h²(x)]), √(E_Q̂[u²(x) L_h²(x)]) } [ ( p log(2me/p) + log(4/δ) ) / m ]^{3/8}.

Alternative algorithms. Functions u other than w can thus be used to reweight the cost of errors, and a good u should make the upper bound of Theorem 4 small. Since

max{ E_Q[u²], E_Q̂[u²] } ≤ E_Q[u²] ( 1 + O(1/√m) ),

and since the bias term |E_Q[(w − u)L_h]| ≤ E_Q[|w − u|] when the loss is bounded by one, u can be selected by solving

min_{u∈U}  E_Q[ |w(x) − u(x)| ] + γ √( E_Q[u²] ),

a trade-off between bias and variance minimization (see the sketch below).
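A sketch of this bias-variance trade-off, using one simple illustrative family U = {u_c(x) = min(w(x), c)} of capped weights and the Gaussian example from before; the capped family, γ, and all numbers are assumptions for illustration, not the algorithm from the slides.

    # Sketch: pick a cap c minimizing E_Q|w - u_c| + gamma * sqrt(E_Q[u_c^2]).
    import numpy as np

    def normal_pdf(x, mu, sigma):
        return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=100_000)                  # sample from Q = N(0, 1)
    w = normal_pdf(x, 0.0, 1.3) / normal_pdf(x, 0.0, 1.0)   # w = P/Q with P = N(0, 1.3^2)

    gamma = 0.1
    best = None
    for c in np.linspace(1.0, w.max(), 50):                 # candidate caps
        u = np.minimum(w, c)
        objective = np.mean(np.abs(w - u)) + gamma * np.sqrt(np.mean(u**2))
        if best is None or objective < best[0]:
            best = (objective, c)
    print(best)   # chosen cap trades bias E|w - u| against variance sqrt(E_Q[u^2])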
