Feature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang; Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager


1 Feature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang; Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager

2 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization, with applications to semi-supervised learning (joint work with Stefan Wager and Percy)
Part 2: Applications to structured prediction using CRFs, when the log-partition function cannot be easily computed (joint work with Mengqiu, Chris, Percy, and Stefan Wager)

3 The basics of dropout training
Introduced by Hinton et al. in "Improving neural networks by preventing co-adaptation of feature detectors".
For each example:
- randomly select features and zero them
- compute the gradient and make an update
- repeat
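To make the procedure concrete, here is a minimal sketch of dropout SGD for a softmax classifier (the function name, learning rate, and dropout rate are illustrative choices, not the exact setup from the talk):

```python
import numpy as np

def dropout_sgd(X, y, num_classes, epochs=10, lr=0.1, p_drop=0.5, seed=0):
    """Softmax regression trained with dropout noise on the input features.
    X is an (n, d) feature matrix, y an integer label vector."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, num_classes))
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Randomly zero features; rescale survivors so that E[x_tilde] = x.
            mask = rng.random(d) >= p_drop
            x_tilde = X[i] * mask / (1.0 - p_drop)
            scores = x_tilde @ W
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            # Gradient of the negative log-likelihood on the noised example.
            grad = np.outer(x_tilde, probs)
            grad[:, y[i]] -= x_tilde
            W -= lr * grad
    return W
```

Scaling the surviving features by 1/(1 - p_drop) keeps E[x_tilde] = x, matching the 2x convention used on the later slides for p = 0.5.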

4 Empirically successful
Dropout is important in some recent successes:
- won the ImageNet challenge [Krizhevsky et al., 2012]
- won the Merck challenge [Dahl et al., 2012]
Improved performance on standard datasets:
- images: MNIST, CIFAR, ImageNet, etc.
- document classification: Reuters, IMDB, Rotten Tomatoes, etc.
- speech: TIMIT, GlobalPhone, etc.

5 Lots of related work already
Variants:
- DropConnect [Wan et al., 2013]
- Maxout networks [Goodfellow et al., 2013]
Analytical integration:
- Fast Dropout [Wang and Manning, 2013]
- Marginalized Corrupted Features [van der Maaten et al., 2013]
Many other works report empirical gains.

6 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization with applications to semi-supervised learning
Part 2: Applications to structured prediction using CRFs when the log-partition function cannot be easily computed

7 Theoretical understanding?
Dropout as adaptive regularization: feature noising -> an interpretable penalty term
Loss(Dropout(data)) = Loss(data) + Regularizer(data)
Semi-supervised learning: a feature-dependent, label-independent regularizer, Regularizer(unlabeled data)

8 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$, with $\theta = [\theta_1, \theta_2, \ldots, \theta_K]$

9 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$, with $\theta = [\theta_1, \theta_2, \ldots, \theta_K]$
Dropout: $\tilde{x}_j = 2 x_j$ with probability 0.5 and $\tilde{x}_j = 0$ otherwise, so $E[\tilde{x}] = x$
Dropout objective:
$E[\log p(y \mid \tilde{x}; \theta)] = E[\tilde{x}^\top \theta_y] - E[A(\tilde{x}^\top \theta)]$

10 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$
Dropout: $\tilde{x}_j = 2 x_j$ with probability 0.5 and $\tilde{x}_j = 0$ otherwise, so $E[\tilde{x}] = x$
Dropout objective:
$E[\log p(y \mid \tilde{x}; \theta)] = E[\tilde{x}^\top \theta_y] - E[A(\tilde{x}^\top \theta)]$
The left-hand side is $-$Loss(Dropout(data)), and Loss(Dropout(data)) = Loss(data) + Regularizer(data).

11 Dropout for Log-linear Models
We can rewrite the dropout log-likelihood. Since $E[\tilde{x}] = x$,
$E[\log p(y \mid \tilde{x}; \theta)] = E[\tilde{x}^\top \theta_y] - E[A(\tilde{x}^\top \theta)]$
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$
so
$E[\log p(y \mid \tilde{x}; \theta)] = \log p(y \mid x; \theta) - \left( E[A(\tilde{x}^\top \theta)] - A(x^\top \theta) \right)$
where the left-hand side is $-$Loss(Dropout(data)), the first term on the right is $-$Loss(data), and the bracketed term is Regularizer(data).
Dropout reduces to a regularizer:
$R(\theta, x) = E[A(\tilde{x}^\top \theta)] - A(x^\top \theta)$
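The induced regularizer $R(\theta, x) = E[A(\tilde{x}^\top \theta)] - A(x^\top \theta)$ can be estimated numerically by averaging over dropout masks. A small sketch for the softmax case (the function name and use of scipy's logsumexp are my own choices):

```python
import numpy as np
from scipy.special import logsumexp

def dropout_regularizer_mc(x, Theta, num_samples=10000, seed=0):
    """Monte Carlo estimate of R(theta, x) = E[A(x_tilde^T Theta)] - A(x^T Theta),
    where A is the log-partition (log-sum-exp) of the K class scores and
    x_tilde_j = 2 * x_j with probability 0.5, else 0 (so E[x_tilde] = x)."""
    rng = np.random.default_rng(seed)
    A_clean = logsumexp(x @ Theta)
    masks = rng.integers(0, 2, size=(num_samples, x.shape[0]))  # fair coin per feature
    A_noised = logsumexp((2.0 * masks * x) @ Theta, axis=1)
    return A_noised.mean() - A_clean
```

Here x is a length-d feature vector and Theta a (d, K) weight matrix; by convexity of $A$ the returned value is always nonnegative.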

12 Second-order delta method
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$

13 Second-order delta method
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$
Substitute $s = \tilde{s} := \theta^\top \tilde{x}$ and $s_0 = E[\tilde{s}]$.
Take expectations to get the quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} E[(\tilde{s} - s)^\top \nabla^2 A(s) (\tilde{s} - s)] = \tfrac{1}{2} \operatorname{tr}(\nabla^2 A(s) \operatorname{Cov}(\tilde{s}))$
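For softmax scores, $\nabla^2 A(s) = \operatorname{diag}(\pi) - \pi \pi^\top$ with $\pi = \operatorname{softmax}(s)$, and for independent feature noise $\operatorname{Cov}(\tilde{s}) = \sum_j \operatorname{Var}[\tilde{x}_j]\, \theta_j \theta_j^\top$. A sketch of the resulting quadratic approximation, which should roughly agree with the Monte Carlo estimate above (names are again illustrative):

```python
import numpy as np

def dropout_regularizer_quadratic(x, Theta):
    """R_q(theta, x) = 0.5 * tr(Hessian(A)(s) @ Cov(s_tilde)) for softmax scores
    s = Theta^T x and dropout noise with Var[x_tilde_j] = x_j^2."""
    scores = x @ Theta
    p = np.exp(scores - scores.max())
    p /= p.sum()
    hessian = np.diag(p) - np.outer(p, p)          # Hessian of log-sum-exp at s
    # Cov(s_tilde) = sum_j Var[x_tilde_j] * theta_j theta_j^T, with theta_j = Theta[j].
    cov = sum((x[j] ** 2) * np.outer(Theta[j], Theta[j]) for j in range(x.shape[0]))
    return 0.5 * np.trace(hessian @ cov)
```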

14 Example: logistic regression
The quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} A''(x^\top \theta) \operatorname{Var}[\tilde{x}^\top \theta]$

15 Example: logistic regression
The quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} A''(x^\top \theta) \operatorname{Var}[\tilde{x}^\top \theta]$
$A''(x^\top \theta) = p(1 - p)$ represents uncertainty, where $p = p(y \mid x; \theta) = (1 + \exp(-y\, x^\top \theta))^{-1}$

16 Example: logistic regression
The quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} A''(x^\top \theta) \operatorname{Var}[\tilde{x}^\top \theta]$
$A''(x^\top \theta) = p(1 - p)$ represents uncertainty, where $p = p(y \mid x; \theta) = (1 + \exp(-y\, x^\top \theta))^{-1}$
$\operatorname{Var}[\tilde{x}^\top \theta] = \sum_j \theta_j^2 x_j^2$, which is $L_2$-regularization after normalizing the data

17 The regularizers
Dropout on linear regression: $R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i x_j^{(i)2}$
Dropout on logistic regression: $R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
Multiclass, CRFs: [Wang et al., 2013]
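A direct implementation of the logistic-regression regularizer above, as a sketch (the function name is mine; X is an (n, d) data matrix):

```python
import numpy as np

def rq_logistic(theta, X):
    """R_q(theta) = 1/2 * sum_j theta_j^2 * sum_i p_i (1 - p_i) x_ij^2 for binary
    logistic regression, with p_i = sigmoid(x_i . theta). For linear regression the
    p_i (1 - p_i) factor is simply dropped."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    weights = p * (1.0 - p)                     # per-example confidence term
    return 0.5 * np.sum(theta ** 2 * (weights @ X ** 2))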

18 Dropout intuition
$R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
- Regularizes rare features less, like AdaGrad; there is actually a more precise connection [Wager et al., 2013]
- Big weights are okay if they contribute only to confident predictions
- Normalizing by the diagonal Fisher information

19 Semi-supervised Learning
These regularizers are label-independent but can be data-adaptive in interesting ways.
Labeled dataset $D = \{x_1, x_2, \ldots, x_n\}$; unlabeled data $D_{\text{unlabeled}} = \{u_1, u_2, \ldots, u_m\}$
We can better estimate the regularizer:
$R^*(\theta, D, D_{\text{unlabeled}}) := \frac{n}{n + \alpha m} \left( \sum_{i=1}^{n} R(\theta, x_i) + \alpha \sum_{i=1}^{m} R(\theta, u_i) \right)$
for some tunable $\alpha$.
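Combining the labeled and unlabeled terms is then straightforward; a sketch assuming some per-example regularizer (for example a per-example version of the quadratic one above):

```python
def rq_semisupervised(theta, X_labeled, X_unlabeled, alpha, per_example_reg):
    """R*(theta) = n / (n + alpha*m) * (sum_i R(theta, x_i) + alpha * sum_i R(theta, u_i)),
    where per_example_reg(theta, x) is any label-independent per-example regularizer."""
    n, m = len(X_labeled), len(X_unlabeled)
    r_labeled = sum(per_example_reg(theta, x) for x in X_labeled)
    r_unlabeled = sum(per_example_reg(theta, u) for u in X_unlabeled)
    return n / (n + alpha * m) * (r_labeled + alpha * r_unlabeled)
```

Because the regularizer never looks at labels, the unlabeled examples contribute exactly the same kind of term as the labeled ones.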

20 Semi-supervised intuition
$R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
Like other semi-supervised methods:
- transductive SVMs [Joachims, 1999]
- entropy regularization [Grandvalet and Bengio, 2005]
- EM: guess a label [Nigam et al., 2000]
we want to make confident predictions on the unlabeled data.
Get a better estimate of the Fisher information.

21 IMDB dataset [Maas et al., 2011]
- 25k positive reviews and 25k negative reviews; half for training and half for testing
- 50k unlabeled reviews, also containing neutral reviews
- 300k sparse unigram features, ~5 million sparse bigram features

22 Experiments: semi-supervised
Adding more unlabeled data (10k labeled) improves performance.
[Plot: accuracy vs. size of unlabeled data, comparing dropout+unlabeled, dropout, and L2]

23 Experiments: semi-supervised
Adding more labeled data (40k unlabeled) improves performance.
[Plot: accuracy vs. size of labeled data, comparing dropout+unlabeled, dropout, and L2]

24 Quantitative results on IMDB
Method \ Settings: Supervised | Semi-sup.
- MNB - unigrams with SFE [Su et al., 2011]
- Vectors for sentiment analysis [Maas et al., 2011]
- This work: dropout + unigrams
- This work: dropout + bigrams

25 Experiments: other datasets
Dataset \ Settings: L2 | Drop | +Unlbl
- Subjectivity [Pang and Lee, 2004]
- Rotten Tomatoes [Pang and Lee, 2005]
- newsgroups
- CoNLL

26 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization with applications to semi-supervised learning (with Stefan and Percy)
Part 2: Applications to structured prediction using CRFs when the log-partition function cannot be easily computed (with Mengqiu, Chris, Percy, and Stefan)

27 Log-linear structured prediction
A vector of scores $s = (s_1, \ldots, s_{|\mathcal{Y}|})$ with $s_y = \theta^\top f(y, x)$
The likelihood is:
$p(y \mid x; \theta) = \exp\{s_y - A(s)\}$, where $A(s) = \log \sum_y \exp\{s_y\}$
$\mathcal{Y}$ might be really huge!

28 What about structured prediction?
Recall that in logistic regression:
$R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
What if we cannot easily compute the log-partition function $A$, or its second derivatives?

29 The original setup
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$
Substitute $s = \tilde{s} := \theta^\top \tilde{x}$ and $s_0 = E[\tilde{s}]$.
Take expectations to get the quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} E[(\tilde{s} - s)^\top \nabla^2 A(s) (\tilde{s} - s)] = \tfrac{1}{2} \operatorname{tr}(\nabla^2 A(s) \operatorname{Cov}(\tilde{s}))$

30 The structured prediction setup
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$
Substitute $s = \tilde{s}$ with $\tilde{s}_y := \theta^\top f(y, \tilde{x})$, and $s_0 = E[\tilde{s}]$.
Take expectations to get the quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} E[(\tilde{s} - s)^\top \nabla^2 A(s) (\tilde{s} - s)] = \tfrac{1}{2} \operatorname{tr}(\nabla^2 A(s) \operatorname{Cov}(\tilde{s}))$

31 Use the independence structure
Depends on the underlying graphical model; we assume we can do exact inference via message passing (e.g. a clique tree).
E.g. a linear-chain CRF:
$f(y, x) = \sum_{t=1}^{T} g_t(y_{t-1}, y_t, x)$
$A(s) = \log \sum_{y \in \mathcal{Y}} \exp\left\{ \sum_{t=1}^{T} s_{y_{t-1}, y_t, t} \right\}$
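For a linear chain, $A(s)$ itself is computed with the standard forward recursion. A sketch under my own shape conventions (scores[t, a, b] stands for $s_{a,b,t}$, and position 0 is taken to condition on a dummy START state at index 0):

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(scores):
    """A(s) = log sum_{y in Y} exp(sum_t s_{y_{t-1}, y_t, t}) for a linear-chain CRF,
    computed by the forward recursion in O(T K^2).
    scores: array of shape (T, K, K) with scores[t, a, b] = s_{a,b,t}."""
    T, K, _ = scores.shape
    log_alpha = scores[0, 0, :]                  # log-scores out of the START state
    for t in range(1, T):
        # log_alpha_t[b] = logsumexp_a(log_alpha_{t-1}[a] + s_{a,b,t})
        log_alpha = logsumexp(log_alpha[:, None] + scores[t], axis=0)
    return logsumexp(log_alpha)
```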

32 Local Noising
Global noising: $\tilde{s} := \theta^\top f(y, \tilde{x})$ (one noise draw for the whole sequence)
Local noising: $\tilde{s} := \sum_{t=1}^{T} \theta^\top g_t(y_{t-1}, y_t, \tilde{x})$ (noise drawn independently for each position)
Can try to justify this in retrospect.

33 The regularizer
The regularizer is:
$R_q(\theta, x) = \tfrac{1}{2} \sum_{a,b,t} \mu_{a,b,t} (1 - \mu_{a,b,t}) \operatorname{Var}[\tilde{s}_{a,b,t}]$
(the factors $\mu_{a,b,t}(1 - \mu_{a,b,t})$ are the diagonal elements of the Hessian $\nabla^2 A$)
For marginals: $\mu_{a,b,t} = p_\theta(y_{t-1} = a, y_t = b \mid x)$
And derivatives: $\nabla \mu_{a,b,t} = E_{p_\theta(y \mid x,\, y_{t-1}=a,\, y_t=b)}[f(y,x)] - E_{p_\theta(y \mid x)}[f(y,x)]$

34 Efficient computation
For every $a, b, t$ we need
$\nabla \mu_{a,b,t} = E_{p_\theta(y \mid x,\, y_{t-1}=a,\, y_t=b)}[f(y,x)] - E_{p_\theta(y \mid x)}[f(y,x)]$
Naïve computation is $O(K^4 T^2)$; this can be reduced to $O(K^3 T^2)$.
We provide a dynamic program that computes it in $O(K T^2)$, like normal forward-backward, except that it needs to be run for every feature.
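The pairwise marginals $\mu_{a,b,t}$ that enter the regularizer come from ordinary forward-backward; the per-feature conditional expectations then require the additional dynamic program described on the slide, which is not shown here. A sketch of just the marginal computation, with the same shape conventions as the log_partition sketch above:

```python
import numpy as np
from scipy.special import logsumexp

def pairwise_marginals(scores):
    """mu[t, a, b] = p(y_{t-1} = a, y_t = b | x) for a linear-chain CRF, computed by
    forward-backward; scores has the same (T, K, K) layout as in log_partition.
    mu[0] is left at zero because the t = 0 clique involves the dummy START state."""
    T, K, _ = scores.shape
    log_alpha = np.full((T, K), -np.inf)
    log_alpha[0] = scores[0, 0, :]
    for t in range(1, T):
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + scores[t], axis=0)
    log_beta = np.zeros((T, K))                  # log_beta[T-1] = 0
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(scores[t + 1] + log_beta[t + 1][None, :], axis=1)
    log_Z = logsumexp(log_alpha[-1])
    mu = np.zeros((T, K, K))
    for t in range(1, T):
        mu[t] = np.exp(log_alpha[t - 1][:, None] + scores[t]
                       + log_beta[t][None, :] - log_Z)
    return mu
```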

35 Feature group trick (Mengqiu)
$\nabla \mu_{a,b,t} = E_{p_\theta(y \mid x,\, y_{t-1}=a,\, y_t=b)}[f(y,x)] - E_{p_\theta(y \mid x)}[f(y,x)]$
Features that always appear in the same location all have the same conditional expectations.
Gives a 4x speedup; applicable to general CRFs.

36 CRF sequence tagging
CoNLL 2003 Named Entity Recognition: Stanford[ORG] is[O] near[O] Palo[LOC] Alto[LOC]
Dataset \ Settings: None | L2 | Drop
- CoNLL 2003 Dev
- CoNLL 2003 Test

37 CRF sequence tagging
Dropout helps more on precision than recall.
Per-tag results (precision, recall, $F_{\beta=1}$) on the CoNLL test set:
(e) with $L_2$ regularization: LOC 86.13%, MISC 79.30%, ORG 80.49%, PER 93.33%, Overall 86.08%
(f) with dropout regularization: LOC 87.74%, MISC 77.34%, ORG 81.89%, PER 92.68%, Overall 86.45%

38 SANCL POS Tagging
Test set differences are statistically significant for newsgroups and reviews.
$F_{\beta=1}$, Dataset \ Settings: None | L2 | Drop
- newsgroups: Dev, Test
- reviews: Dev, Test
- answers: Dev, Test

39 Summary
Part 0: Some background
Part 1: Dropout as adaptive regularization with applications to semi-supervised learning (joint work with Stefan Wager and Percy)
Part 2: Applications to structured prediction using CRFs when the log-partition function cannot be easily computed (joint work with Mengqiu, Chris, Percy, and Stefan Wager)

40 References
- Our arXiv paper [Wager et al., 2013] has more details, including the relation to AdaGrad.
- Our EMNLP paper [Wang et al., 2013] extends this framework to structured prediction.
- Our ICML paper [Wang and Manning, 2013] applies a related technique to neural networks and provides some negative examples.

41 Dropout vs. $L_2$
- Can be much better than all settings of $L_2$
- Part of the gain comes from normalization
[Plot: accuracy vs. $L_2$ regularization strength (λ), comparing L2 only, L2 + Gaussian dropout, and L2 + quadratic dropout]

42 Example: linear least squares
The loss function is $f(\tilde{x}) = \tfrac{1}{2}(\theta^\top \tilde{x} - y)^2$
Let $X = \theta^\top \tilde{x}$, where $\tilde{x}_j = 2 z_j x_j$ and $z_j \sim \text{Bernoulli}(0.5)$.
$E[f(X)] = f(E[X]) + \tfrac{1}{2} f''(E[X]) \operatorname{Var}[X] = \tfrac{1}{2}(\theta^\top x - y)^2 + \tfrac{1}{2} \sum_j \theta_j^2 x_j^2$
The total regularizer is $R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i x_j^{(i)2}$
This is just $L_2$ applied after data normalization.
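Because the squared loss is quadratic, the expansion above is exact rather than approximate, which is easy to verify numerically (the feature values, weights, and target below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])        # hypothetical example
theta = np.array([0.3, 0.1, -0.7])
y = 0.25

# Average the squared loss over many dropout draws (x_j -> 2*x_j or 0, each w.p. 1/2).
masks = rng.integers(0, 2, size=(200000, 3))
mc = np.mean(0.5 * ((2.0 * masks * x) @ theta - y) ** 2)

# Closed form from the slide: the original loss plus the L2-after-normalization penalty.
closed = 0.5 * (x @ theta - y) ** 2 + 0.5 * np.sum(theta ** 2 * x ** 2)
print(mc, closed)                      # should agree up to Monte Carlo error
```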

43 Quantitative results on IMDB
Method \ Settings: Supervised | Semi-sup.
- MNB - unigrams with SFE [Su et al., 2011]
- MNB - bigrams
- Vectors for sentiment analysis [Maas et al., 2011]
- NBSVM - bigrams [Wang and Manning, 2012]
- This work: dropout + unigrams
- This work: dropout + bigrams
