attention mechanisms and generative models

1 attention mechanisms and generative models. Master's Deep Learning. Sergey Nikolenko, Harbour Space University, Barcelona, Spain. November 20, 2017

2 attention in neural networks

3 attention You're paying attention to me, right? What does that mean? Images from the retina pass through our brain's CNNs, but then we somehow notice one part of the picture and disregard the rest. What does that mean? It proves to be a rather difficult question. A.R. Luria: attention, memory, and cortex activation.

4 attention Attention is about interacting with the working memory (Knudsen, 2007):

5 attention How do we implement it in a neural network? Especially «conscious» attention. But «unconscious» too; e.g., in image processing: we see very little at any given moment! The fovea (the small high-definition part of the retina):

6 attention The eye moves in saccades: And even this does not always help:

7 foveal glimpses We also want a neural network to consciously decide what to look at. One of the first works (Larochelle, Hinton, 2010): an attempt to model foveal fixations and construct a sequence of them with RBMs. A sequence means that...

8 recurrent visual attention The first modern attention mechanism was put forward in (Mnih et al., 2014), «Recurrent Models of Visual Attention»: from the previous state $h_{t-1}$ and the position $l_t$ of a new «glimpse», the function $f_g$ produces $g_t$, the input for step $t$; from $h_{t-1}$ and $g_t$, the function $f_h$ produces $h_t$; from $h_t$ we get the «action» $a_t = f_a(h_t)$ and the position of the next «glimpse» $l_{t+1} = f_l(h_t)$.
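The loop over glimpses can be sketched in a few lines of Python. This is only a toy illustration of the data flow, not the model from the paper: the functions `f_g`, `f_h`, `f_a`, `f_l` below are hypothetical, untrained stand-ins, the input is assumed to be a 1-D signal, and `f_g` is assumed to read a small window of the input around the current position.

```python
import math

def f_g(signal, loc, size=3):
    """Glimpse: crop a window of `size` values centred at `loc` (zero padded)."""
    half = size // 2
    return [signal[i] if 0 <= i < len(signal) else 0.0
            for i in range(loc - half, loc + half + 1)]

def f_h(h_prev, g):
    """Toy recurrent update: mixes the previous state with the new glimpse."""
    return [math.tanh(0.5 * h + g_i) for h, g_i in zip(h_prev, g)]

def f_a(h):
    """Toy action head: a binary decision read off the state."""
    return 1 if sum(h) > 0 else 0

def f_l(h, n):
    """Toy location head: squash the mean state to (0, 1), scale to an index."""
    mean = sum(h) / len(h)
    frac = 1.0 / (1.0 + math.exp(-mean))
    return min(n - 1, int(frac * n))

def run(signal, steps=4, size=3):
    h = [0.0] * size            # initial hidden state
    loc = len(signal) // 2      # assumed starting position: the middle
    trace = []
    for _ in range(steps):
        g = f_g(signal, loc, size)   # g_t from the input at position l_t
        h = f_h(h, g)                # h_t from h_{t-1} and g_t
        a = f_a(h)                   # a_t = f_a(h_t)
        trace.append((loc, a))
        loc = f_l(h, len(signal))    # l_{t+1} = f_l(h_t)
    return trace

trace = run([0.0, 0.0, 1.0, 2.0, 0.0, -1.0, 0.0, 0.0])
```

In a real model all four functions would be neural networks, and since choosing a location is a hard, non-differentiable decision, training typically needs something beyond plain backpropagation.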

9 recurrent visual attention The next work (Ba et al., 2015) extended this to a deep model: Trained by very traditional methods.

10 recurrent visual attention This is how it works when recognizing pairs of digits: Interestingly, addition looks different:

11 and beyond This gave rise to «Show, Attend, and Tell» (Xu et al., 2015):

12 and beyond One-dimensional attention in NLP: we learn the weights with which the inputs contribute to the current output (Olah, Carter, 2016):
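A minimal sketch of such soft attention, assuming dot-product scoring between a query and a set of keys (the slide does not specify the scoring function; `softmax`, `attend`, and the toy vectors are illustrative names and values):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values):
    """Return attention weights over the inputs and the weighted sum of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
weights, context = attend([2.0, 2.0], keys, values)
```

The weights are non-negative and sum to one, so the output is a convex combination of the inputs, and every step is differentiable.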

13 and beyond In machine translation: Or, e.g., in speech recognition (Chan et al., 2015):

14 neural turing machines ...but that is not all either! What if we add memory explicitly? Something like memory networks, but in an even more general form. How do we read from and write to memory in such a way that gradients can flow through?

15 neural turing machines With the same attention-like mechanism, really. Neural Turing machines (Graves et al., 2014):

16 neural turing machines And write in the same way: Where will all of these attention weights come from?
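Reading and writing with attention weights can be sketched as follows, assuming the memory is a list of rows and the weights `w` over rows are already given. The read is a weighted sum of rows and the write is an erase followed by an add, as in (Graves et al., 2014); the function names and toy values are illustrative.

```python
def read(memory, w):
    """Soft read: r = sum_i w[i] * memory[i]."""
    width = len(memory[0])
    return [sum(w_i * row[j] for w_i, row in zip(w, memory)) for j in range(width)]

def write(memory, w, erase, add):
    """Soft write: each row i becomes M_i * (1 - w[i] * erase) + w[i] * add."""
    return [[m * (1.0 - w_i * e) + w_i * a for m, e, a in zip(row, erase, add)]
            for w_i, row in zip(w, memory)]

memory = [[1.0, 2.0], [3.0, 4.0]]
r = read(memory, [0.5, 0.5])                               # blend of both rows
updated = write(memory, [1.0, 0.0], [1.0, 1.0], [9.0, 8.0])  # overwrite row 0 only
```

Both operations are smooth in `w`, so gradients can flow back to whatever produced the weights.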

17 neural turing machines Content-based and location-based attention.
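Both kinds of addressing can be sketched briefly. The assumptions here: content-based weights are a softmax over cosine similarities between a key and each memory row, sharpened by a strength `beta`, and location-based addressing is a circular shift of an existing weight vector. The names and the dictionary encoding of the shift distribution are illustrative choices.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def content_weights(memory, key, beta):
    """Content-based: softmax over beta * cosine(key, row)."""
    scores = [beta * cosine(key, row) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def shift_weights(w, shift):
    """Location-based: circular convolution of w with a shift distribution.

    `shift` maps an integer offset to its probability, e.g. {1: 1.0}
    moves all attention one cell to the right.
    """
    n = len(w)
    return [sum(w[(i - off) % n] * p for off, p in shift.items()) for i in range(n)]

memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_content = content_weights(memory, [0.9, 0.1], beta=5.0)
w_shifted = shift_weights([1.0, 0.0, 0.0, 0.0], {1: 1.0})
```

A larger `beta` makes the content-based weights closer to a hard lookup of the best-matching row.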

18 neural turing machines (Zaremba, Sutskever, 2016): train an NTM with reinforcement learning.

19 neural turing machines (Kaiser, Sutskever, 2016): Neural GPU: a sequence of highly parallel computations based on convolutional GRUs; the result is something like a two-dimensional cellular automaton that can be trained to do arithmetic.

20 neural turing machines (Neelakantan, Le, Sutskever, 2016): Neural Programmer: an RNN controller learns to perform a sequence of operations.

21 generative models

22 generative models We have already seen deep learning with supervision and without. But can we generate anything new? There are discriminative models, which model $p(y \mid x)$, and generative models, which model $p(x, y)$ so that you can sample from them. Of course, $p(x, y) = p(y \mid x)\,p(x)$, but there is an important difference. We follow (Goodfellow, 2016). First, why would we need that?
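The difference is easy to see on a tiny discrete example. The joint table below is made up for illustration: from $p(x, y)$ we can recover both $p(x)$ and $p(y \mid x)$ and sample new pairs, whereas a model of $p(y \mid x)$ alone says nothing about which $x$ are likely.

```python
import random

# Hypothetical joint distribution over a binary feature x and a label y.
joint = {(0, "neg"): 0.1, (0, "pos"): 0.3, (1, "neg"): 0.4, (1, "pos"): 0.2}

def marginal_x(joint):
    """p(x) = sum_y p(x, y)."""
    p = {}
    for (x, _), v in joint.items():
        p[x] = p.get(x, 0.0) + v
    return p

def conditional_y_given_x(joint):
    """p(y | x) = p(x, y) / p(x)."""
    px = marginal_x(joint)
    return {(x, y): v / px[x] for (x, y), v in joint.items()}

def sample_joint(joint, rng):
    """Ancestral sampling: draw x from p(x), then y from p(y | x)."""
    px = marginal_x(joint)
    cond = conditional_y_given_x(joint)
    xs = list(px)
    x = rng.choices(xs, weights=[px[v] for v in xs])[0]
    ys = [y for (xx, y) in cond if xx == x]
    y = rng.choices(ys, weights=[cond[(x, v)] for v in ys])[0]
    return x, y

pair = sample_joint(joint, random.Random(0))
```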

23 generative models Generative models: check how well we have understood the distribution; can be trained with little data and with unlabeled data, through semi-supervised learning (very important); can learn multimodal outputs when there are several correct answers (see below); can serve as environment models in reinforcement learning (later); and sometimes we really do need to generate something.

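24 generative models Usually generative models maximize the likelihood:

$$\theta^* = \arg\max_\theta \prod_{x \in D} p_{\text{model}}(x; \theta) = \arg\max_\theta \sum_{x \in D} \log p_{\text{model}}(x; \theta).$$

An important alternative view: this is the same as minimizing the KL divergence between the «data distribution» $p_{\text{data}}$ and $p_{\text{model}}$:

$$\theta^* = \arg\min_\theta \mathrm{KL}\left(p_{\text{data}} \,\|\, p_{\text{model}}\right) = \arg\min_\theta \left( -\int p_{\text{data}}(x) \ln \frac{p_{\text{model}}(x)}{p_{\text{data}}(x)} \, dx \right),$$

because $p_{\text{data}}$ is given through $D$, and this is basically a discrete uniform distribution.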
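The step connecting the two views can be spelled out as a short derivation. It is a sketch under the assumption that $p_{\text{data}}$ is the empirical distribution putting mass $1/|D|$ on each point of $D$, and that the term not involving $\theta$ is treated as a constant:

```latex
\begin{align*}
\mathrm{KL}\left(p_{\text{data}} \,\|\, p_{\text{model}}\right)
  &= \int p_{\text{data}}(x) \ln p_{\text{data}}(x)\, dx
   - \int p_{\text{data}}(x) \ln p_{\text{model}}(x; \theta)\, dx \\
  &= \text{const} - \frac{1}{|D|} \sum_{x \in D} \ln p_{\text{model}}(x; \theta), \\
\arg\min_\theta \mathrm{KL}\left(p_{\text{data}} \,\|\, p_{\text{model}}\right)
  &= \arg\max_\theta \sum_{x \in D} \ln p_{\text{model}}(x; \theta).
\end{align*}
```

The factor $1/|D|$ does not depend on $\theta$, so it drops out of the arg max.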

25 generative models In other words, data points pull up the distribution $p_{\text{model}}$: But this is harder in high dimensions and with more complicated distributions... There is a big difference depending on whether we presume that we can explicitly define and compute the density $p_{\text{model}}$.

26 generative models General taxonomy:

27 generative models If we can (explicit), we get something like FVBN (fully visible belief networks): $p_{\text{model}}(x) = \prod_{i=1}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})$. We assume that we can somehow manage one-dimensional distributions, e.g., by modeling them with neural networks. FVBNs appeared in the late 1990s (Frey et al., 1996; Frey, 1998).
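A minimal sketch of this factorization for binary vectors, assuming each one-dimensional conditional is a logistic regression on the preceding coordinates (a deliberately simple stand-in for a neural network; the weights `W`, biases `b`, and function names are made up for illustration):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditional(prefix, weights, bias):
    """p(x_i = 1 | x_1, ..., x_{i-1}) as a logistic function of the prefix."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, prefix)))

def log_prob(x, W, b):
    """Exact log-density: a sum of log-conditionals (the chain rule)."""
    total = 0.0
    for i, x_i in enumerate(x):
        p1 = conditional(x[:i], W[i], b[i])
        total += math.log(p1 if x_i == 1 else 1.0 - p1)
    return total

def sample(W, b, rng):
    """Ancestral sampling: draw the coordinates one at a time, in order."""
    x = []
    for i in range(len(b)):
        p1 = conditional(x, W[i], b[i])
        x.append(1 if rng.random() < p1 else 0)
    return x

# W[i] holds one weight per preceding coordinate, so it has length i.
W = [[], [1.5], [-2.0, 0.5]]
b = [0.0, -0.5, 1.0]
x = sample(W, b, random.Random(0))
lp = log_prob(x, W, b)
```

Both the density and sampling are exact here, which is what makes the model explicit; the price is that sampling is inherently sequential.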

28 generative models A modern example of an FVBN is WaveNet (Oord et al., 2016). We have not said much about sound so far; how do we generate, say, human speech? Basic idea: we model the conditional distribution $p(x \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, h)$. On sound data we can do one-dimensional convolutions, but they should not look ahead in time...

29 generative models ...so WaveNet uses causal convolutions: Dilated over time in the higher layers:
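A causal dilated convolution is short to write down. The sketch below is plain Python over a list of samples, with zeros assumed before the start of the signal; the function names and the doubling dilation schedule in the example are illustrative choices.

```python
def causal_conv(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation]; never reads x after t."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:            # samples before the start count as zero
                acc += w * x[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples one output of the stacked layers can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

y = causal_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2)
span = receptive_field(2, [1, 2, 4, 8])
```

With dilations that double from layer to layer, the receptive field grows exponentially with depth while each layer stays cheap.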

30 generative models And then we get an architecture with familiar tricks: residual connections, skip-layer connections... Pretty good results: But in general one has to approximate complicated distributions (variational methods, for example).

31 generative models For implicit models we usually model the sampling process from the distribution rather than the density itself. We can sample with a Markov chain; if we model it with a neural network we get a Generative Stochastic Network (Alain et al., 2015): But we will consider a different approach...

32 thank you! Thank you for your attention!
