ADANET: adaptive learning of neural networks



1 ADANET: adaptive learning of neural networks
Mehryar Mohri, Courant Institute & Google Research.
Joint work with Corinna Cortes (Google Research), Javier Gonzalo (Google Research), Vitaly Kuznetsov (Google Research), and Scott Yang (Courant Institute).

2 Motivation
Modeling challenges for neural networks:
- proper specification of the architecture.
- non-convex optimization difficulties.
- lack of sufficient theory.

3 Questions
- Can the architecture of a neural network be learned together with its weights?
- Can this be done efficiently and in a principled way?

4 Previous Work
Theoretical understanding of NNs:
- properties of the objective function (Choromanska et al., 2014; Sagun et al., 2014; Zhang et al., 2015; Livni et al., 2014; Kawaguchi, 2016).
- study of the black-box optimization algorithms used (Hardt et al., 2015; Lian et al., 2015).
- statistical and generalization properties (Bartlett, 1998; Zhang et al., 2016; Neyshabur et al., 2015; Sun et al., 2016).
- generative point of view (Arora et al., 2014; 2015).
- expressive power (Cohen et al., 2015; Eldan & Shamir, 2015; Telgarsky, 2016; Daniely et al., 2016).

5 Previous Work
Structure learning for NNs:
- grow-then-prune heuristics (Kwok & Yeung, 1997; Leung et al., 2003; Islam et al., 2003; Lehtokangas, 1999; Islam et al., 2009; Ma & Khorasani, 2003; Narasimha et al., 2008; Han & Qiao, 2013; Kotani et al., 1997; Alvarez & Salzmann, 2016).
- search-based approaches (Ha et al., 2016; Chen et al., 2015; Zoph & Le, 2016; Baker et al., 2016).
Generalization and training of two-layer NNs using tensor methods (Janzamin et al., 2015).

6 General Architecture
Special cases:
- multi-layer networks.
- more exotic architectures (He et al., 2015; Huang et al., 2016).

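7 Layer Hypothesis Sets
First layer units:
$$H_1 = \big\{ x \mapsto u \cdot \Psi(x) : u \in \mathbb{R}^{n_0}, \|u\|_p \le \Lambda_{1,0} \big\}.$$
Layer $k > 1$ units:
$$H_k = \Big\{ x \mapsto \sum_{s=1}^{k-1} u_s \cdot (\varphi_s \circ h_s)(x) : u_s \in \mathbb{R}^{n_s}, \|u_s\|_p \le \Lambda_{k,s}, h_s \in H_s^{n_s} \Big\}.$$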

8 Network Hypothesis Set
For a network with $l > 1$ layers,
$$F = \Big\{ \sum_{k=1}^{l} w_k \cdot h_k : h_k \in H_k^{n_k}, \sum_{k=1}^{l} \|w_k\|_1 = 1 \Big\}.$$
Define $\tilde H_k = H_k \cup (-H_k)$ and $H = \bigcup_{k=1}^{l} \tilde H_k$; then $F = \mathrm{conv}(H)$.

9 This Talk
- Theory.
- Algorithm.
- Model selection.
- Experiments.

10 Theory

11 Ensembles - Margin Bound (Koltchinskii and Panchenko, 2002)
Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f \in \mathrm{conv}(H)$:
$$R(f) \le \hat R_{S,\rho}(f) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
where $\hat R_{S,\rho}(f) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho}$.
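The right-hand side of the margin bound can be evaluated numerically once a value for the Rademacher complexity is available. The sketch below is illustrative only: the function names and the toy scores are assumptions, and `rademacher` is treated as a number supplied by the caller rather than something computed from a hypothesis set.

```python
import math

def empirical_margin_loss(scores, labels, rho):
    """Fraction of sample points whose margin y_i * f(x_i) is at most rho."""
    hits = sum(1 for s, y in zip(scores, labels) if y * s <= rho)
    return hits / len(scores)

def margin_bound(scores, labels, rho, rademacher, delta):
    """Right-hand side of the ensemble margin bound.

    `rademacher` is an assumed, caller-supplied value for R_m(H).
    """
    m = len(scores)
    return (empirical_margin_loss(scores, labels, rho)
            + 2.0 / rho * rademacher
            + math.sqrt(math.log(1.0 / delta) / (2.0 * m)))

# Toy example (made-up scores f(x_i) and labels y_i in {-1, +1}).
scores = [0.9, 0.2, -0.4, 0.7]
labels = [1, 1, -1, -1]
loss = empirical_margin_loss(scores, labels, rho=0.5)
bound = margin_bound(scores, labels, rho=0.5, rademacher=0.1, delta=0.05)
print(loss, bound)
```

With only four points the bound is vacuous (larger than 1); the point of the example is the role of each term, not the numerical value.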

12 Questions
Can we use a deep and complex family $H$?
- the generalization bound indicates a risk of overfitting.
- but a rich family $H$ may be needed for difficult tasks.

13 Ensemble Family
Ensemble functions in $F = \mathrm{conv}\big(\bigcup_{k=1}^{l} \tilde H_k\big)$:
$$f = \sum_{k=1}^{l} \sum_{j=1}^{n_k} w_{k,j} h_{k,j}.$$
[Figure: families $\tilde H_1$ through $\tilde H_5$.]

14 Ideas
Use hypotheses drawn from the $\tilde H_k$s with larger $k$, but allocate more weight to hypotheses drawn from those with smaller $k$.
- how can we determine quantitatively the amounts of mixture weight apportioned to the different families?
- can we provide learning guarantees guiding these choices?

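15 Learning Guarantee (Cortes, MM, and Syed, 2014)
Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{k=1}^{l} w_k \cdot h_k \in F$:
$$R(f) \le \hat R_{S,\rho}(f) + \frac{4}{\rho} \sum_{k=1}^{l} \|w_k\|_1\, \mathfrak{R}_m(\tilde H_k) + \widetilde{O}\Big( \frac{1}{\rho} \sqrt{\frac{\log l}{m}} \Big).$$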
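To see how the complexity term in this guarantee depends on where the mixture weight is placed, the following sketch evaluates the weighted sum of per-layer complexities for two allocations with the same total weight. The function name, the per-layer complexity values, and the weights are all made up for illustration.

```python
def weighted_complexity(weights_per_layer, rademacher_per_layer, rho):
    """(4 / rho) * sum_k ||w_k||_1 * R_m(H_k), with R_m values supplied by the caller."""
    total = 0.0
    for w_k, r_k in zip(weights_per_layer, rademacher_per_layer):
        total += sum(abs(w) for w in w_k) * r_k
    return 4.0 / rho * total

# Assumed complexities: the deeper family (index 1) is the more complex one.
rademacher = [0.05, 0.20]

# Same total weight (1.0), concentrated on the shallow or on the deep family.
mostly_shallow = [[0.8, 0.1], [0.1]]
mostly_deep = [[0.1], [0.8, 0.1]]

print(weighted_complexity(mostly_shallow, rademacher, rho=1.0))
print(weighted_complexity(mostly_deep, rademacher, rho=1.0))
```

Under these assumed values, shifting weight toward the deeper family increases the complexity term, which is the trade-off the bound makes explicit.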

16 Consequences
Complexity term with explicit dependency on mixture weights.
- quantitative guide for controlling the weights assigned to more complex sub-families.
- the bound can be used to directly define an ensemble algorithm.

17 Algorithms

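18 Algorithm
Objective function: for any $w \in \mathbb{R}^N$,
$$F(w) = \frac{1}{m} \sum_{i=1}^{m} \Phi\Big( 1 - y_i \sum_{j=1}^{N} w_j h_j(x_i) \Big) + \sum_{j=1}^{N} \Gamma_j |w_j|,$$
where:
- $x \mapsto \Phi(-x)$ is a non-increasing convex function upper bounding $x \mapsto 1_{x \le 0}$; e.g., $\Phi(x) = e^x$ or $\Phi(x) = \log(1 + e^x)$.
- $\Gamma_j = \lambda r_j + \beta$, where $r_j = \mathfrak{R}_m(\tilde H_{k_j})$ and $\lambda, \beta \ge 0$.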
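The objective above is straightforward to evaluate for a fixed set of base hypotheses. A minimal sketch, assuming the hypothesis outputs are stored in a matrix `H` with `H[i][j]` equal to $h_j(x_i)$ and that the per-coordinate penalties `gamma` are given; these names and the toy data are illustrative assumptions.

```python
import math

def exp_loss(x):
    """Phi(x) = e^x."""
    return math.exp(x)

def logistic_loss(x):
    """Phi(x) = log(1 + e^x)."""
    return math.log(1.0 + math.exp(x))

def objective(w, H, y, gamma, phi=exp_loss):
    """Surrogate loss on the sample plus a weighted L1 penalty on w.

    H[i][j] holds h_j(x_i); gamma[j] is the penalty for coordinate j.
    """
    m = len(y)
    loss = 0.0
    for i in range(m):
        score = sum(wj * hij for wj, hij in zip(w, H[i]))
        loss += phi(1.0 - y[i] * score)
    penalty = sum(g * abs(wj) for g, wj in zip(gamma, w))
    return loss / m + penalty

# Toy sample: two points, two base hypotheses with outputs in {-1, +1}.
H = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
gamma = [0.1, 0.2]
print(objective([0.0, 0.0], H, y, gamma))
print(objective([1.0, 0.0], H, y, gamma))
print(objective([1.0, 0.0], H, y, gamma, phi=logistic_loss))
```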

19 ADANET
Block coordinate descent applied to a convex objective:
- at each iteration, a base subnetwork is selected (direction).
- next, the best step is chosen by solving a convex optimization problem.
Convergence guarantees based on a weak-learning assumption: each network augmentation improves the objective by a constant amount (optimality condition) (Raetsch et al., 2001; Luo & Tseng, 1992).
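The direction-then-step pattern can be illustrated with a much simpler stand-in: greedy coordinate descent over a fixed, finite set of base hypotheses with the exponential loss. This is not the ADANET procedure itself (no subnetworks are generated or trained); the direction rule, the step range [-5, 5], the ternary line search, and the toy data are all assumptions made for the sketch.

```python
import math

def objective(w, H, y, gamma):
    """Exponential surrogate loss plus weighted L1 penalty; H[i][j] = h_j(x_i)."""
    m = len(y)
    loss = sum(math.exp(1.0 - y[i] * sum(wj * h for wj, h in zip(w, H[i])))
               for i in range(m)) / m
    return loss + sum(g * abs(wj) for g, wj in zip(gamma, w))

def line_search(f, lo=-5.0, hi=5.0, iters=100):
    """Ternary search for a minimizer of a convex 1-D function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if f(a) < f(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0

def coordinate_descent(H, y, gamma, rounds=20):
    """Each round: try every coordinate, keep the one whose best step helps most."""
    n = len(H[0])
    w = [0.0] * n
    for _ in range(rounds):
        best_value, best_j, best_eta = objective(w, H, y, gamma), None, 0.0
        for j in range(n):
            def along(eta, j=j):
                trial = list(w)
                trial[j] += eta
                return objective(trial, H, y, gamma)
            eta = line_search(along)   # step: 1-D convex problem
            value = along(eta)
            if value < best_value - 1e-12:
                best_value, best_j, best_eta = value, j, eta
        if best_j is None:             # no direction improves the objective
            break
        w[best_j] += best_eta
    return w

# Toy sample: hypothesis 0 agrees with every label, the others do not help.
H = [[1, 1, -1], [1, -1, -1], [-1, -1, -1], [-1, 1, -1]]
y = [1, 1, -1, -1]
gamma = [0.01, 0.01, 0.01]
w = coordinate_descent(H, y, gamma)
print(w, objective(w, H, y, gamma))
```

On this toy sample only the first coordinate ever receives weight, since steps along the other two cannot reduce the objective.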

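20 Bound on Rad. Complexity
Lemma: let $\Lambda_k = \prod_{s=1}^{k} 2 \Lambda_{s,s-1}$ and $N_k = \prod_{s=1}^{k} n_{s-1}$. Then, for any $k \ge 1$,
$$\hat{\mathfrak{R}}_S(H_k) \le r_\infty \Lambda_k N_k^{\frac{1}{q}} \sqrt{\frac{\log(2 n_0)}{2m}},$$
where $\frac{1}{p} + \frac{1}{q} = 1$.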
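The bound in the lemma is easy to evaluate for a given depth, which makes its growth with $k$ concrete. A minimal sketch, assuming `lambdas[s-1]` stores $\Lambda_{s,s-1}$ and `widths[s]` stores $n_s$; the function name and all numerical values are illustrative assumptions.

```python
import math

def rademacher_bound(k, lambdas, widths, r_inf, m, q):
    """Evaluate r_inf * Lambda_k * N_k^(1/q) * sqrt(log(2 n_0) / (2 m)).

    lambdas[s-1] is Lambda_{s,s-1} and widths[s] is n_s (widths[0] = n_0).
    """
    big_lambda = math.prod(2.0 * lambdas[s] for s in range(k))  # prod of 2 * Lambda_{s,s-1}
    big_n = math.prod(widths[s] for s in range(k))              # n_0 * ... * n_{k-1}
    return (r_inf * big_lambda * big_n ** (1.0 / q)
            * math.sqrt(math.log(2 * widths[0]) / (2.0 * m)))

# Assumed values: unit norm bounds, n_0 = 10 inputs, n_1 = 5 units, p = q = 2.
lambdas = [1.0, 1.0]
widths = [10, 5]
for depth in (1, 2):
    print(depth, rademacher_bound(depth, lambdas, widths, r_inf=1.0, m=1000, q=2))
```

With these values the bound for depth 2 is several times the bound for depth 1, since each extra layer multiplies in a factor $2\Lambda_{k,k-1}\, n_{k-1}^{1/q}$.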

21 Description
Fix $B$ (no. of units per layer of subnetwork). Let $l_{t-1}$ be the depth of the network before iteration $t$.
At iteration $t$:
- augment network with a subnetwork of the same depth, with connections to the layer below (within subnetwork and network).
- augment network with a subnetwork of depth $l_{t-1} + 1$, with connections to layer(s) below (within subnetwork and network).
[Figure: network of depth $l_{t-1}$.]
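The bookkeeping behind the two kinds of augmentation can be sketched by tracking only the number of units per layer. This ignores weights, connections, and training entirely; the function name and the list representation are assumptions made for illustration.

```python
def augment(units_per_layer, B, deeper):
    """Return the layer widths after adding a subnetwork with B units per layer.

    deeper=False: subnetwork has the current depth.
    deeper=True:  subnetwork has the current depth plus one.
    """
    grown = list(units_per_layer)
    if deeper:
        grown.append(0)          # open a new, initially empty layer
    return [n + B for n in grown]

current = [100]                  # network of depth 1 before the iteration
print(augment(current, B=100, deeper=False))
print(augment(current, B=100, deeper=True))
```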

22 Incremental Construction
[Figures: 2-layer subnetwork extension; 3-layer subnetwork extension.]

23 Experiments

24 Experiments - CIFAR-10
CIFAR-10: 60,000 images, 10 classes.

Label pair          AdaNet    LR        NN        NN-GP
deer-truck          ±         ±         ±         ± 0.69
deer-horse          ±         ±         ±         ± 1.81
automobile-truck    ±         ±         ±         ± 1.38
cat-dog             ±         ±         ±         ± 0.97
dog-horse           ±         ±         ±         ±

25 Parameters
- Same total no. of hyperparameter settings for ADANET and NNs.
- k s chosen to be fixed. =0, T = 30.
- B chosen in {100, 150, 250}.
- l in [1, 3], n in {100, 150, 512, 1024, 2048}.
- Parameters n and l for NNs, and B for ADANET.
- ReLU activation function for NNs and ADANET.
- Training using SGD with batch size of 100, 10K iterations.
- Gaussian process bandits (NN-GP) (Snoek et al., 2012): instead of a fixed grid, searches in a given range.
- 10-fold cross-validation.

26 Architecture - Comparison
Table columns: Label pair, AdaNet, NN, NN-GP, with sub-columns 1st layer and 2nd layer.
Table rows: deer-truck, deer-horse, automobile-truck, cat-dog, dog-horse.

27 Experiments - Criteo
Criteo Click Rate Prediction:
- training sample: 32,743,299 instances.
- test sample: 6,548,659 instances.
- SGD with mini-batch 512 and 100K iterations.
- same number of hyperparameter settings for both algorithms.
Table columns: Algorithm, Accuracy. Table rows: AdaNet, NN.

28 Model Selection

29 Model Selection
Problem: how to select the hypothesis set $H$?
- $H$ too complex: no generalization bound, overfitting.
- $H$ too simple: generalization bound, but underfitting.
- balance between estimation and approximation errors.
[Figure: hypothesis set $H$ with hypotheses $h$ and the Bayes hypothesis.]

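30 Structural Risk Minimization (Vapnik and Chervonenkis, 1974; Vapnik, 1995)
SRM: $H = \bigcup_{k=1}^{\infty} H_k$ with $H_1 \subset H_2 \subset \cdots \subset H_k \subset \cdots$
Solution:
$$f = \operatorname*{argmin}_{h \in H_k,\, k \ge 1} \hat R_S(h) + \mathrm{pen}(k, m).$$
[Figure: error versus complexity, with curves for training error, penalty, and training error + penalty.]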
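The SRM selection rule amounts to minimizing training error plus penalty over the index $k$. A minimal sketch, assuming the best training error within each $H_k$ and the penalty values are already available; the function name and the numbers are made up for illustration.

```python
def srm_select(training_errors, penalties):
    """Return (index, value) minimizing training error + penalty.

    training_errors[k] is the best empirical error within the k-th family;
    penalties[k] is pen(k, m) for that family. Indices are 0-based.
    """
    totals = [err + pen for err, pen in zip(training_errors, penalties)]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals[best]

# Nested families: training error decreases with k while the penalty grows.
training_errors = [0.30, 0.15, 0.08, 0.05]
penalties = [0.02, 0.05, 0.15, 0.30]
print(srm_select(training_errors, penalties))
```

Here the second family is selected: richer families fit the sample better, but their penalty outweighs the gain.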

31 Ideas: Voted Risk Minimization (MM, 2016)
- no selection of a specific $H_k$.
- instead, use all $H_k$s: $h = \sum_{k=1}^{p} \alpha_k h_k$, $h_k \in H_k$.
- hypothesis-dependent penalty: $\sum_{k=1}^{p} \alpha_k \mathfrak{R}_m(H_k)$.
Deep ensembles.

32 Other Related Algorithms
Structural Maxent models (Cortes, Kuznetsov, MM, and Syed, ICML 2015): feature functions chosen from a union of very complex families.

33 Other Related Algorithms
Deep Cascades (DeSalvo, MM, and Syed, ALT 2015): cascade of predictors with leaf predictors and node questions selected from very rich families.
[Figure: cascade with node question $q_1(x)$, leaf predictor $h_1(x)$, and branches labeled $\mu_k$ and $1 - \mu_k$ for $k = 1, 2, 3$.]

34 Conclusion
ADANET: algorithm for adaptively learning artificial NNs.
- balances the trade-off between model complexity and empirical error at each round.
- benefits from solid data-dependent theoretical guarantees.
- favorable large-scale experimental results compared to NNs learned via grid search.
- algorithm and theory extend to multi-class classification.
- general techniques applicable to other architectures such as CNNs and RNNs.
