Distributional robustness, regularizing variance, and adversaries
1 Distributional robustness, regularizing variance, and adversaries. John Duchi. Based on joint work with Hongseok Namkoong and Aman Sinha. Stanford University, November 2017.
2 Motivation
We do not want machine-learned systems to fail when they get into the real world.
3–4 Challenge one: Curly fries
Who doesn't like curly fries?
5–6 Challenge two: changes in environment
Learning to drive in California; driving in Ann Arbor.
7–8 Challenge three: adversaries [Goodfellow et al. 15]
Paraphrased quote: "We could put a transparent film on a stop sign, essentially imperceptible to a human, and a computer would see the stop sign as air" (Dan Boneh).
9–11 Stochastic optimization problems

$$\text{minimize } R(\theta) := \mathbb{E}_{P_0}[\ell(\theta; Z)] = \int \ell(\theta; z)\, dP_0(z) \quad \text{subject to } \theta \in \Theta.$$

- Data/randomness is $Z$
- Loss function $\theta \mapsto \ell(\theta; z)$
- Parameter space $\Theta$ is a nonempty closed (convex) set
- Observe data $Z_i \stackrel{\mathrm{iid}}{\sim} P_0$, $i = 1, \dots, n$

Empirical risk minimization: often, solve

$$\hat{\theta}_n = \mathop{\mathrm{argmin}}_{\theta \in \Theta} \hat{R}_n(\theta) := \frac{1}{n} \sum_{i=1}^n \ell(\theta; Z_i)$$
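To fix the notation in code, here is a minimal ERM sketch in Python. The logistic loss matches the Reuters experiment later in the talk; the step size and iteration count are illustrative assumptions, not part of the talk.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Per-example loss l(theta; (x_i, y_i)) = log(1 + exp(-y_i x_i^T theta))."""
    return np.log1p(np.exp(-y * (X @ theta)))

def erm(X, y, steps=500, lr=0.1):
    """Empirical risk minimization: gradient descent on R_n(theta) = mean loss."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ theta)
        # gradient of (1/n) sum_i log(1 + exp(-margins_i)) with respect to theta
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta
```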
12–16 Distributional robustness

Replace the risk $R(\theta) = \mathbb{E}_{P_0}[\ell(\theta; Z)]$ with the worst case over an uncertainty set:

$$R(\theta, \mathcal{P}) := \sup_{P \in \mathcal{P}} \mathbb{E}_P[\ell(\theta; Z)]$$

- Uncertainty set $\mathcal{P}$ is a set of possible distributions/worlds
- Different choices of uncertainty yield different behaviors
- Some sample-based uncertainty sets $\mathcal{P}$ certify future performance
- Much work in the optimization literature: [Delage & Ye 10, Ben-Tal et al. 13, Bertsimas et al. 14, Lam & Zhou 15, Gotoh et al. 15]

Rest of this talk: two vignettes showing some aspects of this approach.
17–20 Vignette one: regularization by variance

Any learning algorithm has bias (approximation error) and variance (estimation error). From the empirical Bernstein inequality, with probability at least $1 - \delta$,

$$R(\theta) \le \underbrace{\hat{R}_n(\theta)}_{\text{bias}} + \underbrace{\sqrt{\frac{2\, \mathrm{Var}_{\hat{P}_n}(\ell(\theta; X))}{n}}}_{\text{variance}} + \frac{C \log \frac{1}{\delta}}{n}$$

Goal: trade between these automatically and optimally by solving

$$\hat{\theta}_{\mathrm{var}} \in \mathop{\mathrm{argmin}}_{\theta \in \Theta} \left\{ \hat{R}_n(\theta) + \sqrt{\frac{2\, \mathrm{Var}_{\hat{P}_n}(\ell(\theta; X))}{n}} \right\}.$$
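A minimal sketch of evaluating this objective, dropping the lower-order $C \log(1/\delta)/n$ term; `losses` is assumed to be the vector of per-example losses at a fixed $\theta$.

```python
import numpy as np

def variance_regularized_risk(losses):
    """R_n(theta) + sqrt(2 Var_{P_n}(loss) / n) for per-example losses at a fixed theta."""
    n = len(losses)
    return losses.mean() + np.sqrt(2.0 * losses.var() / n)
```

As the next slide notes, minimizing this in $\theta$ directly is problematic: the variance term is non-convex even when the loss is convex.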
21–22 Optimizing for bias and variance
Good idea: directly minimize bias + variance, certify optimality!
Minor issue: the variance term is wildly non-convex in $\theta$.
Figure: variance of the loss $\ell(\theta; X)$ as a function of $\theta$.
23–27 Robust ERM

Goal: $\mathop{\mathrm{minimize}}_{\theta \in \Theta}\ R(\theta) = \mathbb{E}_{P_0}[\ell(\theta; X)]$

The standard approach solves the empirical risk (sample average) minimization problem

$$\mathop{\mathrm{minimize}}_{\theta \in \Theta}\ \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i)$$

Instead, solve the distributionally robust optimization (RO) problem

$$\mathop{\mathrm{minimize}}_{\theta \in \Theta}\ \sup_{p \in \mathcal{P}_{n,\rho}} \sum_{i=1}^n p_i \ell(\theta; X_i)$$

where $\mathcal{P}_{n,\rho}$ is some appropriately chosen set of probability vectors.

This bit of the talk: give a principled statistical approach to choosing $\mathcal{P}_{n,\rho}$ and give stochastic optimality certificates for RO.
28–30 Empirical likelihood and robustness

Idea: optimize over an uncertainty set of possible distributions,

$$\mathcal{P}_{n,\rho} := \left\{ \text{distributions } P \text{ such that } D(P \,\|\, \hat{P}_n) \le \frac{\rho}{n} \right\}$$

for some $\rho > 0$, where $D(P \,\|\, Q) = \int (p/q - 1)^2\, dq$.

Define (and optimize) the empirical likelihood upper confidence bound

$$\hat{R}_n(\theta, \mathcal{P}_{n,\rho}) := \max_{P \in \mathcal{P}_{n,\rho}} \mathbb{E}_P[\ell(\theta, X)] = \max_{p \in \mathcal{P}_{n,\rho}} \sum_{i=1}^n p_i \ell(\theta, X_i)$$

Nice properties:
- Convex optimization problem
- Efficient solution methods [D. & Namkoong NIPS 16]
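A minimal sketch of the inner maximization as a generic convex program, assuming cvxpy is available; the specialized solvers of [D. & Namkoong NIPS 16] are what you would use at scale.

```python
import cvxpy as cp
import numpy as np

def robust_risk(losses, rho):
    """Worst-case reweighting of per-example losses over the divergence ball
    {P : D(P || P_n) <= rho / n}, with D(P || Q) = sum_i (p_i/q_i - 1)^2 q_i."""
    losses = np.asarray(losses)
    n = len(losses)
    p = cp.Variable(n)
    divergence = cp.sum_squares(n * p - 1) / n  # equals D(P || P_n) when q_i = 1/n
    problem = cp.Problem(
        cp.Maximize(losses @ p),
        [cp.sum(p) == 1, p >= 0, divergence <= rho / n],
    )
    problem.solve()
    return problem.value, p.value
```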
31–32 Robust Optimization = Variance Regularization

Theorem (D. & Namkoong). Assume that $\ell$ is bounded over the space of decision vectors $\theta$. Then

$$\hat{R}_n(\theta; \mathcal{P}_{n,\rho}) = \hat{R}_n(\theta) + \sqrt{\frac{2\rho\, \mathrm{Var}_{\hat{P}_n}(\ell(\theta; X))}{n}} + O(\rho/n).$$

Choose $\hat{\theta}_{\mathrm{rob}}$ to minimize the robust empirical risk

$$\hat{R}_n(\theta, \mathcal{P}_{n,\rho}) := \max_{P \in \mathcal{P}_{n,\rho}} \mathbb{E}_P[\ell(\theta, X)] = \max_{p \in \mathcal{P}_{n,\rho}} \sum_{i=1}^n p_i \ell(\theta, X_i).$$
33–35 Optimal bias–variance tradeoff

Choose $\hat{\theta}_{\mathrm{rob}}$ to minimize the robust empirical risk

$$\hat{R}_n(\hat{\theta}_{\mathrm{rob}}, \mathcal{P}_{n,\rho}) = \min_{\theta \in \Theta} \max_P \left\{ \mathbb{E}_P[\ell(\theta; X)] : D_{\chi^2}(P \,\|\, \hat{P}_n) \le \frac{\rho}{n} \right\}.$$

Assume that $\Theta \subset \mathbb{R}^d$ is compact with radius $R$ and $\ell(\theta; X)$ is $M$-Lipschitz.

Theorem (D. & Namkoong 17). Let $\rho = \log\frac{1}{\delta} + d \log n$. Then with probability at least $1 - \delta$,

$$R(\hat{\theta}_{\mathrm{rob}}) \le \underbrace{\hat{R}_n(\hat{\theta}_{\mathrm{rob}}, \mathcal{P}_{n,\rho})}_{\text{optimality certificate}} + \frac{cMR\sqrt{\rho}}{n} \le \underbrace{\min_{\theta \in \Theta} \left\{ R(\theta) + \sqrt{\frac{2\rho\, \mathrm{Var}(\ell(\theta; \xi))}{n}} \right\}}_{\text{optimal tradeoff}} + \frac{cMR\sqrt{\rho}}{n}$$

for some universal constant $c > 0$.
36–40 Experiment: Reuters corpus (multi-label)

Problem: classify documents into a subset of the 4 categories {Corporate, Economics, Government, Markets}.

Data: pairs $(x, y)$, where $x \in \mathbb{R}^d$ represents a document and $y \in \{-1, 1\}^4$, with $y_j = 1$ indicating that $x$ belongs to the $j$-th category.

Loss: $\ell(\theta_j; (x, y)) = \log(1 + e^{-y_j x^\top \theta_j})$ for each $j = 1, \dots, 4$ (see the sketch below), with $\Theta$ a norm ball $\{\theta \in \mathbb{R}^d : \|\theta\| \le \cdot\}$ (the radius did not survive transcription).

$d = 47{,}236$, $n = 804{,}\ldots$, with cross-validation over folds.

Table: Reuters, number of examples per category — Corporate: 381,…; Economics: …; Government: …; Markets: …,820 (remaining counts lost in transcription).
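A sketch of the per-document loss, with `Theta` assumed to be a $d \times 4$ matrix whose columns are the per-category classifiers $\theta_j$.

```python
import numpy as np

def multilabel_logistic_loss(Theta, x, y):
    """Sum over j of the binary logistic losses l(theta_j; (x, y_j))."""
    margins = y * (x @ Theta)  # shape (4,): entries y_j * x^T theta_j
    return np.log1p(np.exp(-margins)).sum()
```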
41 Experiment: Reuters corpus (multi-label)
Table: precision and recall (%, train and test) per category, for ERM and for several values of ρ (numeric entries lost in transcription).
42 Experiment: Reuters corpus (multi-label)
Figure: recall on the rare category (Economics), train and test, for ERM and a range of ρ.
43 Experiment: Reuters corpus (multi-label)
Figure: average logistic risk and confidence bound — train risk, test risk, and robust train risk, for ERM and a range of ρ.
44–45 Vignette two: Wasserstein robustness
- We do not want machine-learned systems to fail when they get into the real world
- It is irresponsible to release systems into the world whose robustness we do not understand
46 Challenges
47–49 A type of robustness

Robust optimization: instead of $\ell$, look at the robust loss

$$\ell_\epsilon(\theta; z) := \sup_{\|\delta\| \le \epsilon} \ell(\theta; z + \delta)$$

- Adversarial attacks and defenses, with heuristics and more advanced ideas [Goodfellow et al. 15, Jia and Liang 17, Papernot et al. 16, Madry et al. 17]
- Minor issue: usually this is NP-hard
- Further issue: in a neural network, $f_\theta(x) = \theta_1^\top \sigma_{\mathrm{relu}}(\theta_2^\top \sigma_{\mathrm{relu}}(\cdots))$, and it is NP-hard to compute $\sup_{\|\delta\| \le \epsilon} \ell(f_\theta(x + \delta))$
50–51 Distributional robustness

Question: how can we figure out how to change the distribution in the right way to get robustness?

Let $c : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_+$ be some cost function, and define the Wasserstein distance

$$W_c(P, Q) := \inf_M \int c(z_1, z_2)\, dM(z_1, z_2) = \sup_{f :\, f(x) - f(z) \le c(x, z)} \int f(z)\,(dP(z) - dQ(z))$$

where $M$ ranges over couplings with $P$ and $Q$ as its marginal distributions.
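For intuition, in one dimension with cost $c(z_1, z_2) = |z_1 - z_2|$, SciPy ships a Wasserstein distance between empirical samples; this toy check is an illustration, not part of the talk (general costs need an optimal-transport solver).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
P_samples = rng.normal(loc=0.0, size=2000)
Q_samples = rng.normal(loc=0.5, size=2000)
# For two Gaussians differing only by a mean shift, W_1 equals the shift.
print(wasserstein_distance(P_samples, Q_samples))  # approximately 0.5
```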
52–54 Wasserstein robustness

Look at the distributionally robust risk, defined for $\rho \ge 0$:

$$R(\theta, \rho) := \sup_P \left\{ \mathbb{E}_P[\ell(\theta; Z)] \ \text{s.t.}\ W_c(P, P_0) \le \rho \right\}$$

- Allows changing support to harder distributions
- Studied in the robust optimization literature [Shafieezadeh-Abadeh et al. 15, Esfahani & Kuhn 15, Blanchet and Murthy 16]
- Minor issue: often still NP-hard
55–56 A first idea

(Simple) insight: if $\ell(\theta, z)$ is smooth in $\theta$ and $z$, then life gets a bit easier. The function

$$\ell_\lambda(\theta; z) := \sup_{z'} \left\{ \ell(\theta; z') - \lambda\, c(z', z) \right\}$$

is efficient to compute (and differentiable, etc.) for large enough $\lambda$.
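A worked toy case to see why large $\lambda$ helps: take $\ell(\theta; z) = (\theta - z)^2$ and the quadratic cost $c(z', z) = \frac{1}{2}\|z' - z\|_2^2$ used later in the talk. For $\lambda > 2$ the inner objective is strongly concave, and the supremum has a closed form (a sketch; the quadratic loss is an illustrative assumption).

```python
def l_lam(theta, z, lam):
    """l_lam(theta; z) = sup_{z'} { (theta - z')^2 - (lam/2)(z' - z)^2 } for lam > 2.
    First-order condition: 2(z' - theta) = lam (z' - z), so
    z' = (lam * z - 2 * theta) / (lam - 2), giving lam * (theta - z)^2 / (lam - 2)."""
    assert lam > 2, "inner problem is concave only for lam > 2"
    z_star = (lam * z - 2 * theta) / (lam - 2)
    return (theta - z_star) ** 2 - 0.5 * lam * (z_star - z) ** 2
```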
57–58 Duality and robustness

Theorem (D., Namkoong, Sinha). Let $P_0$ be any distribution on $\mathcal{Z}$ and $c : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_+$ be any function. Then

$$\sup_{W_c(P, P_0) \le \rho} \mathbb{E}_P[\ell(\theta; Z)] = \inf_{\lambda \ge 0} \left\{ \int \sup_{z'} \left\{ \ell(\theta; z') - \lambda c(z', z) \right\} dP_0(z) + \lambda\rho \right\} = \inf_{\lambda \ge 0} \left\{ \mathbb{E}_{P_0}[\ell_\lambda(\theta; Z)] + \lambda\rho \right\}.$$

Idea: ignore that infimum, pick a large enough $\lambda$, and solve

$$\mathop{\mathrm{minimize}}_\theta\ \mathbb{E}_{P_0}[\ell_\lambda(\theta; Z)]$$
59–60 Stochastic gradient algorithm

$$\mathop{\mathrm{minimize}}_\theta\ \mathbb{E}_{P_0}[\ell_\lambda(\theta; Z)] = \mathbb{E}_{P_0}\left[\sup_\delta \left\{ \ell(\theta; Z + \delta) - \frac{\lambda}{2}\|\delta\|_2^2 \right\}\right]$$

Repeat:
1. Draw $Z_k \stackrel{\mathrm{iid}}{\sim} P_0$
2. Compute an (approximate) maximizer $\hat{Z}_k \in \mathop{\mathrm{argmax}}_z \left\{ \ell(\theta; z) - \frac{\lambda}{2} \|z - Z_k\|_2^2 \right\}$
3. Update $\theta^{k+1} := \theta^k - \alpha_k \nabla_\theta \ell(\theta^k; \hat{Z}_k)$, where $\alpha_k$ is a stepsize

Theorem(ish): this converges, with all the typical convergence properties. (A sketch of the loop follows below.)
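A minimal PyTorch sketch of the loop, assuming the quadratic cost $c(z', z) = \frac{1}{2}\|z' - z\|_2^2$; the ascent step count and step sizes are illustrative assumptions.

```python
import torch

def inner_maximizer(model, loss_fn, z, target, lam, steps=15, lr=0.1):
    """Gradient ascent on z' -> loss(theta; z') - (lam/2) ||z' - z||_2^2.
    For lam above the smoothness of the loss in z, this objective is strongly
    concave, so ascent finds the (near-)global maximizer Z_hat."""
    z_hat = z.clone().detach().requires_grad_(True)
    for _ in range(steps):
        obj = loss_fn(model(z_hat), target) - 0.5 * lam * ((z_hat - z) ** 2).sum()
        grad, = torch.autograd.grad(obj, z_hat)
        with torch.no_grad():
            z_hat += lr * grad
    return z_hat.detach()

def sgd_robust_step(model, loss_fn, optimizer, z, target, lam):
    """One iteration: draw (z, target), attack it, then descend on theta."""
    z_hat = inner_maximizer(model, loss_fn, z, target, lam)
    optimizer.zero_grad()
    loss_fn(model(z_hat), target).backward()
    optimizer.step()
```

Here `optimizer` would be, for example, `torch.optim.SGD(model.parameters(), lr=alpha)`, matching the $\theta$ update in step 3.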
61–63 A certificate of robustness

A desideratum: we would like to certify that any learned $\theta$ has robustness properties.

Theorem (D., Namkoong, Sinha 17). With high probability, for all $\theta \in \Theta$ and uniformly in $\rho$,

$$\sup_{P:\, W_c(P, P_0) \le \rho} \mathbb{E}_P[\ell(\theta; Z)] \le \frac{1}{n} \sum_{i=1}^n \sup_\delta \left\{ \ell(\theta; Z_i + \delta) - \lambda c(Z_i + \delta, Z_i) \right\} + \lambda\rho + \frac{O(1)}{\sqrt{n}}$$

In particular, the bound holds with $\rho$ replaced by the plug-in radius $\hat{W}_n(\theta)$.

Empirical estimate: get an approximate divergence

$$\hat{W}_n(\theta) := \frac{1}{2n} \sum_{i=1}^n \left\| \hat{Z}_i(\theta) - Z_i \right\|_2^2, \quad \text{where } \hat{Z}_i = \mathop{\mathrm{argmax}}_z \left\{ \ell(\theta; z) - \frac{\lambda}{2} \|z - Z_i\|_2^2 \right\}$$
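A sketch of the plug-in estimate, reusing `inner_maximizer` from the sketch above (so again assuming the quadratic cost).

```python
def empirical_wasserstein(model, loss_fn, data, targets, lam):
    """W_hat_n(theta) = (1/(2n)) sum_i ||Z_hat_i - Z_i||_2^2 at the current theta."""
    total = 0.0
    for z, y in zip(data, targets):
        z_hat = inner_maximizer(model, loss_fn, z, y, lam)
        total += 0.5 * ((z_hat - z) ** 2).sum().item()
    return total / len(data)
```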
64–65 Digging into neural networks

Typically predict with $f_\theta(x) = \theta_1^\top \sigma_{\mathrm{relu}}(\theta_2^\top \sigma_{\mathrm{relu}}(\cdots))$, where $\sigma_{\mathrm{relu}}(t) = \min\{1, (t)_+\}$.

Replace $\sigma_{\mathrm{relu}}$ with the piecewise-quadratic smoothing

$$\sigma_{\mathrm{smooth}}(t) = \begin{cases} (t)_+^2 / (2\epsilon) & \text{if } t \le \epsilon \\ t + \epsilon/2 & \text{if } \epsilon \le t \le 1 - \epsilon \\ 1 - (1 - t)_+^2 / (2\epsilon) & \text{if } t \ge 1 - \epsilon \end{cases}$$
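A sketch of both activations; the boundary constants in `sigma_smooth` follow the reconstruction above, which is a best-effort reading of a garbled slide, so the original constants may differ slightly.

```python
import numpy as np

def sigma_relu(t):
    """Clipped ReLU: min{1, (t)_+}."""
    return np.clip(t, 0.0, 1.0)

def sigma_smooth(t, eps=0.1):
    """Quadratic patches of width ~eps around the kinks of the clipped ReLU."""
    t = np.asarray(t, dtype=float)
    lower = np.maximum(t, 0.0) ** 2 / (2.0 * eps)               # t <= eps
    middle = t + eps / 2.0                                      # eps <= t <= 1 - eps
    upper = 1.0 - np.maximum(1.0 - t, 0.0) ** 2 / (2.0 * eps)   # t >= 1 - eps
    return np.where(t <= eps, lower, np.where(t <= 1.0 - eps, middle, upper))
```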
66 Simple visualization: data with labels $y = \mathrm{sign}(\|x\|_2^2 - 2)$
67–68 Experimental results: adversarial classification
MNIST dataset with 3 convolutional layers and a fully connected softmax top layer.
69 Reading tea leaves
70 Reinforcement learning?
71 References
- H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29, 2016.
- H. Namkoong and J. C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30, 2017.
- A. Sinha, H. Namkoong, and J. C. Duchi. Certifiable distributional robustness with principled adversarial training. arXiv:1710.10571 [stat.ML], 2017.