TTIC 31230, Fundamentals of Deep Learning, Winter 2019
David McAllester
The Fundamental Equations of Deep Learning

Early History
1943: McCulloch and Pitts introduce the linear threshold neuron.
1962: Rosenblatt applies a Hebbian learning rule. Novikoff proves the perceptron convergence theorem.
1969: Minsky and Papert publish the book Perceptrons. The Perceptrons book greatly discourages work in artificial neural networks. Symbolic methods dominate AI research through the 1970s.

80s Renaissance
1980: Fukushima introduces the neocognitron (a form of CNN).
1984: Valiant defines PAC learnability and stimulates learning theory. He wins the Turing Award in 2010.
1985: Hinton and Sejnowski introduce the Boltzmann machine.
1986: Rumelhart, Hinton and Williams demonstrate empirical success with backpropagation (itself dating back to 1961).

90s and 00s: Research in the Shadows
1997: Schmidhuber et al. introduce LSTMs.
1998: LeCun introduces convolutional neural networks (CNNs) (LeNet).
2003: Bengio introduces neural language modeling.

Current Era
2012: AlexNet dominates the ImageNet computer vision challenge. Google speech recognition converts to deep learning. Both developments come out of Hinton's group in Toronto.
2013: Refinement of AlexNet continues to dramatically improve computer vision.
2014: Neural machine translation appears (Seq2Seq models). Variational auto-encoders (VAEs) appear. Graph networks for molecular property prediction appear. Dramatic improvement in computer vision and speech recognition continues.

Current Era
2015: Google converts to neural machine translation, leading to dramatic improvements. ResNet appears, making yet another dramatic improvement in computer vision. Generative Adversarial Networks (GANs) appear.
2016: AlphaGo defeats Lee Sedol.

Current Era
2017: AlphaZero learns both Go and chess at super-human levels in a matter of hours entirely from self-play and advances computer Go far beyond human abilities. Unsupervised machine translation is demonstrated. Progressive GANs appear.
2018: Unsupervised pre-training significantly improves a broad range of NLP tasks including question answering (but dialogue remains unsolved). AlphaFold revolutionizes protein structure prediction.

What is a Deep Network?
[Figure: VGG network (Zisserman, 2014), 138 million parameters. Image credit: Davi Frossard.]

What is a Deep Network?
We assume some set $X$ of possible inputs, some set $Y$ of possible outputs, and a parameter vector $\Phi \in \mathbb{R}^d$.
For $\Phi \in \mathbb{R}^d$, $x \in X$ and $y \in Y$, a deep network computes a probability $P_\Phi(y|x)$.

The Fundamental Equation of Deep Learning
We assume a population probability distribution $\mathrm{Pop}$ on pairs $(x, y)$.
$$\Phi^* = \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; -\ln P_\Phi(y|x)$$
This loss function $L(x, y, \Phi) = -\ln P_\Phi(y|x)$ is called cross entropy loss.
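In practice the expectation over Pop is approximated by an average over a finite sample. The following is a minimal sketch of that average, assuming a hypothetical `toy_model` that returns a dictionary of label probabilities; the names `cross_entropy_loss`, `empirical_risk` and `toy_model` are illustrative and not from the slides.

```python
import math

def cross_entropy_loss(prob_of_label):
    """L(x, y, Phi) = -ln P_Phi(y|x), given the probability assigned to the true label y."""
    return -math.log(prob_of_label)

def empirical_risk(model, sample):
    """Average loss over a finite sample, standing in for the expectation over Pop."""
    return sum(cross_entropy_loss(model(x)[y]) for x, y in sample) / len(sample)

def toy_model(x):
    # Hypothetical model: ignores x and always returns the same distribution.
    return {"cat": 0.7, "dog": 0.3}

sample = [("img1", "cat"), ("img2", "cat"), ("img3", "dog")]
risk = empirical_risk(toy_model, sample)
print(f"empirical cross entropy loss: {risk:.4f} nats")
```

Training would search over the parameters to lower this number; the sketch only evaluates it for one fixed model.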

A Second Fundamental Equation
Softmax: Converting Scores to Probabilities
We start from a score function $s_\Phi(y|x) \in \mathbb{R}$.
$$P_\Phi(y|x) = \frac{1}{Z} e^{s_\Phi(y|x)}; \quad Z = \sum_{y'} e^{s_\Phi(y'|x)}$$
$$P_\Phi(y|x) = \mathrm{softmax}_y \; s_\Phi(y|x)$$
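The softmax definition can be sketched in a few lines. This version subtracts the largest score before exponentiating, a common numerical-stability step that is an addition here and not part of the slide; it leaves the result unchanged because the same factor cancels in $Z$.

```python
import math

def softmax(scores):
    """Map a dict {label: score} to a dict {label: probability}."""
    top = max(scores.values())                      # shift so the largest exponent is 0
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())                          # the normalizer Z (after the shift)
    return {y: e / z for y, e in exps.items()}

probs = softmax({"cat": 2.0, "dog": 1.0, "bird": -1.0})
for label, p in probs.items():
    print(f"P({label}) = {p:.4f}")
```

A score difference of 1 between two labels corresponds to a probability ratio of $e$, whatever the other scores are.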

Note the Final Softmax Layer
[Figure not recovered. Image credit: Davi Frossard.]

How Many Possibilities
We have $y \in Y$ where $Y$ is some set of possibilities.
Binary: $Y = \{-1, 1\}$.
Multiclass: $Y = \{y_1, \ldots, y_k\}$ with $k$ manageable.
Structured: $y$ is a structured object like a sentence. Here $|Y|$ is unmanageable.

Binary Classification
We have a population distribution over $(x, y)$ with $y \in \{-1, 1\}$.
We compute a single score $s_\Phi(x)$ where
for $s_\Phi(x) \geq 0$ predict $y = 1$
for $s_\Phi(x) < 0$ predict $y = -1$

Softmax for Binary Classification
$$P_\Phi(y|x) = \frac{1}{Z} e^{y s(x)} = \frac{e^{y s(x)}}{e^{y s(x)} + e^{-y s(x)}} = \frac{1}{1 + e^{-2 y s(x)}} = \frac{1}{1 + e^{-m(y|x)}}$$
$m(y|x) = 2 y s(x)$ is the margin.
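As a quick numerical check of the identity above, the following sketch compares the two-way softmax with the sigmoid of the margin for a few scores. The function names are illustrative only.

```python
import math

def binary_softmax(y, s):
    """P(y|x) as a softmax over the two scores +s and -s, for y in {-1, +1}."""
    z = math.exp(s) + math.exp(-s)
    return math.exp(y * s) / z

def sigmoid_of_margin(y, s):
    """P(y|x) = 1 / (1 + exp(-m)) with margin m = 2*y*s."""
    return 1.0 / (1.0 + math.exp(-2.0 * y * s))

for s in (-3.0, -0.5, 0.0, 0.5, 3.0):
    for y in (-1, 1):
        a, b = binary_softmax(y, s), sigmoid_of_margin(y, s)
        print(f"s = {s:5.1f}  y = {y:2d}  softmax = {a:.6f}  sigmoid = {b:.6f}")
```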

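Logistic Regression for Binary Classification
$$\begin{aligned}
\Phi^* &= \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; L(x, y, \Phi) \\
&= \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; -\ln P_\Phi(y|x) \\
&= \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; \ln\left(1 + e^{-m(y|x)}\right)
\end{aligned}$$
$\ln\left(1 + e^{-m(y|x)}\right) \approx 0$ for $m(y|x) \gg 1$
$\ln\left(1 + e^{-m(y|x)}\right) \approx -m(y|x)$ for $-m(y|x) \gg 1$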

Log Loss vs. Hinge Loss (SVM loss)
[Figure: plot not recovered.]
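To compare the two losses numerically, the sketch below tabulates both as a function of a margin $m$. The log loss is $\ln(1 + e^{-m})$ as on the logistic regression slide. The hinge loss is taken to be the standard $\max(0, 1 - m)$, which is an assumption since the slide does not state its definition or which margin scaling it plots.

```python
import math

def log_loss(m):
    """ln(1 + exp(-m)), arranged so that exp never overflows."""
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

def hinge_loss(m):
    """Standard SVM hinge loss max(0, 1 - m); assumed definition, not from the slide."""
    return max(0.0, 1.0 - m)

margins = [-10.0, -2.0, 0.0, 2.0, 10.0]
for m in margins:
    print(f"m = {m:6.1f}   log loss = {log_loss(m):8.4f}   hinge loss = {hinge_loss(m):8.4f}")
```

Both losses grow linearly for large negative margins. The log loss approaches zero smoothly for large positive margins, while the hinge loss is exactly zero once $m \geq 1$.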

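Image Classification (Multiclass Classification)
We have a population distribution over $(x, y)$ with $y \in \{y_1, \ldots, y_k\}$.
$$P_\Phi(y|x) = \mathrm{softmax}_y \; s_\Phi(y|x)$$
$$\begin{aligned}
\Phi^* &= \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; L(x, y, \Phi) \\
&= \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; -\ln P_\Phi(y|x)
\end{aligned}$$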

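Machine Translation (Structured Labeling)
We have a population of translation pairs $(x, y)$ with $x \in V_x^*$ and $y \in V_y^*$, where $V_x$ and $V_y$ are the source and target vocabularies respectively.
$$P_\Phi(w_{t+1} | x, w_1, \ldots, w_t) = \mathrm{softmax}_{w \in V_y \cup \{\langle\mathrm{EOS}\rangle\}} \; s_\Phi(w | x, w_1, \ldots, w_t)$$
$$P_\Phi(y|x) = \prod_{t=0}^{|y|} P_\Phi(y_{t+1} | x, y_1, \ldots, y_t)$$
Here the final factor ($t = |y|$) is the probability of emitting <EOS> after the last word of $y$.
$$\begin{aligned}
\Phi^* &= \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; L(x, y, \Phi) \\
&= \operatorname*{argmin}_\Phi \; E_{(x,y) \sim \mathrm{Pop}} \; -\ln P_\Phi(y|x)
\end{aligned}$$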
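The product over positions becomes a sum of log probabilities, one softmax per target position. The sketch below illustrates this with a hypothetical score function `toy_score` that simply prefers to copy the source sentence; a real system would compute scores with a neural network. All names here are illustrative.

```python
import math

EOS = "<EOS>"

def log_softmax(scores):
    """Log of the softmax of a dict {word: score}."""
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {w: s - log_z for w, s in scores.items()}

def sequence_log_prob(score_fn, x, y, vocab):
    """ln P(y|x) = sum over t of ln P(y_{t+1} | x, y_1..y_t), ending with <EOS>."""
    targets = list(y) + [EOS]
    total = 0.0
    for t, target in enumerate(targets):
        prefix = targets[:t]
        scores = {w: score_fn(w, x, prefix) for w in list(vocab) + [EOS]}
        total += log_softmax(scores)[target]
    return total

def toy_score(w, x, prefix):
    # Hypothetical scorer: favors copying x word by word, then emitting <EOS>.
    wanted = x[len(prefix)] if len(prefix) < len(x) else EOS
    return 2.0 if w == wanted else 0.0

vocab = ["a", "b"]
log_p = sequence_log_prob(toy_score, ["a", "b"], ["a", "b"], vocab)
print(f"ln P(y|x) = {log_p:.4f}, loss = {-log_p:.4f}")
```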

Fundamental Equation: Unconditional Form
$$\Phi^* = \operatorname*{argmin}_\Phi \; E_{y \sim \mathrm{Pop}} \; -\ln P_\Phi(y)$$

Entropy of a Distribution
The entropy of a distribution $P$ is defined by
$$H(P) = E_{y \sim P} \; -\ln P(y) \quad \text{in units of nats}$$
$$H_2(P) = E_{y \sim P} \; -\log_2 P(y) \quad \text{in units of bits}$$
Example: Let $Q$ be a uniform distribution on 256 values.
$$E_{y \sim Q} \; -\log_2 Q(y) = -\log_2 \frac{1}{256} = \log_2 256 = 8 \text{ bits} = 1 \text{ byte}$$
$$1 \text{ nat} = \frac{1}{\ln 2} \text{ bits} \approx 1.44 \text{ bits}$$
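The uniform example can be reproduced directly. This is a small sketch in which a distribution is represented as a list of probabilities; the `entropy` helper and its `base` parameter are illustrative names.

```python
import math

def entropy(p, base=math.e):
    """H(P) = E_{y~P} -log P(y); outcomes with zero probability contribute nothing."""
    return -sum(q * math.log(q, base) for q in p if q > 0)

uniform_256 = [1 / 256] * 256
bits = entropy(uniform_256, base=2)   # entropy in bits
nats = entropy(uniform_256)           # entropy in nats
print(f"{bits:.4f} bits, {nats:.4f} nats, ratio {bits / nats:.4f}")
```

The ratio of bits to nats is $1 / \ln 2 \approx 1.44$ for any distribution, not just the uniform one.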

The Coding Interpretation of Entropy
We can interpret $H_2(Q)$ as the number of bits required on average to represent items drawn from distribution $Q$.
We want to use fewer bits for common items.
There exists a representation where, for all $y$, the number of bits used to represent $y$ is no larger than $-\log_2 Q(y) + 1$ (Shannon's source coding theorem).
$$H_2(Q) = \frac{1}{\ln 2} H(Q) \approx 1.44 \, H(Q)$$
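One way to see the bound is to assign each item a code length of $\lceil -\log_2 Q(y) \rceil$ bits. The sketch below computes these lengths for a small example distribution (chosen here for illustration) and checks the Kraft inequality, the standard condition under which a prefix code with those lengths exists. It does not construct the code words themselves.

```python
import math

def shannon_code_lengths(q):
    """Code length ceil(-log2 Q(y)) for each outcome y with Q(y) > 0."""
    return {y: math.ceil(-math.log2(p)) for y, p in q.items() if p > 0}

q = {"a": 0.6, "b": 0.3, "c": 0.1}          # illustrative distribution
lengths = shannon_code_lengths(q)
kraft_sum = sum(2.0 ** -n for n in lengths.values())
avg_bits = sum(q[y] * n for y, n in lengths.items())
h2 = -sum(p * math.log2(p) for p in q.values())

print("code lengths:", lengths)
print(f"Kraft sum = {kraft_sum:.4f} (a prefix code exists when this is <= 1)")
print(f"H_2(Q) = {h2:.4f} bits, average code length = {avg_bits:.4f} bits")
```

The average code length lands between $H_2(Q)$ and $H_2(Q) + 1$, with common items receiving shorter codes.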

Cross Entropy
Let $P$ and $Q$ be two distributions on the same set.
$$H(P, Q) = E_{y \sim P} \; -\ln Q(y)$$
$$\Phi^* = \operatorname*{argmin}_\Phi \; H(\mathrm{Pop}, P_\Phi)$$
$H(P, Q)$ also has a data compression interpretation: $1.44 \, H(P, Q)$ can be interpreted as the number of bits used to code draws from $P$ when using the imperfect code defined by $Q$.

Entropy, Cross Entropy and KL Divergence
Let $P$ and $Q$ be two distributions on the same set.
Entropy: $H(P) = E_{y \sim P} \; -\ln P(y)$
Cross Entropy: $H(P, Q) = E_{y \sim P} \; -\ln Q(y)$
KL Divergence: $KL(P, Q) = H(P, Q) - H(P) = E_{y \sim P} \; \ln \frac{P(y)}{Q(y)}$
We have $H(P, Q) \geq H(P)$ or equivalently $KL(P, Q) \geq 0$.
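The three quantities and the identity relating them can be checked on a small example. This sketch represents distributions as lists of probabilities and assumes $Q(y) > 0$ wherever $P(y) > 0$; the example distributions are made up for illustration.

```python
import math

def entropy(p):
    """H(P) = E_{y~P} -ln P(y)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = E_{y~P} -ln Q(y); assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P, Q) = E_{y~P} ln(P(y) / Q(y)); same assumption on q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(f"H(P) = {entropy(p):.4f}")
print(f"H(P, Q) = {cross_entropy(p, q):.4f}")
print(f"KL(P, Q) = {kl(p, q):.4f}")
```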

The Universality Assumption
$$\Phi^* = \operatorname*{argmin}_\Phi \; H(\mathrm{Pop}, P_\Phi) = \operatorname*{argmin}_\Phi \; H(\mathrm{Pop}) + KL(\mathrm{Pop}, P_\Phi)$$
Universality assumption: $P_\Phi$ can represent any distribution and can be fully optimized.
This is clearly false for deep networks. But it gives important insights like:
$$P_{\Phi^*} = \mathrm{Pop}$$
This is the motivation for the fundamental equation.
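The step from the decomposition to the conclusion can be spelled out as follows. This is a short sketch of the reasoning using only facts stated elsewhere in these slides (the entropy term does not depend on the parameters, and KL divergence is non-negative and zero when the two distributions agree).

```latex
\begin{aligned}
\Phi^* &= \operatorname*{argmin}_\Phi \; \big[ H(\mathrm{Pop}) + KL(\mathrm{Pop}, P_\Phi) \big] \\
       &= \operatorname*{argmin}_\Phi \; KL(\mathrm{Pop}, P_\Phi)
          \qquad \text{since } H(\mathrm{Pop}) \text{ does not depend on } \Phi \\
KL(\mathrm{Pop}, P_\Phi) &\geq 0, \qquad KL(\mathrm{Pop}, \mathrm{Pop}) = 0
\end{aligned}
```

Under universality some parameter setting achieves $P_\Phi = \mathrm{Pop}$, so the minimum value of zero is attained there.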

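Asymmetry of Cross Entropy
Consider
$$\Phi^* = \operatorname*{argmin}_\Phi \; H(P, Q_\Phi) \quad (1)$$
$$\Phi^* = \operatorname*{argmin}_\Phi \; H(Q_\Phi, P) \quad (2)$$
For (1) $Q_{\Phi^*}$ must cover all of the support of $P$.
For (2) $Q_{\Phi^*}$ concentrates all mass on the point maximizing $P$.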

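Asymmetry of KL Divergence
Consider
$$\Phi^* = \operatorname*{argmin}_\Phi \; KL(P, Q_\Phi) = \operatorname*{argmin}_\Phi \; H(P, Q_\Phi) \quad (1)$$
$$\Phi^* = \operatorname*{argmin}_\Phi \; KL(Q_\Phi, P) = \operatorname*{argmin}_\Phi \; H(Q_\Phi, P) - H(Q_\Phi) \quad (2)$$
If $Q_\Phi$ is not universally expressive we have that (1) still forces $Q_\Phi$ to cover all of $P$ (or else the KL divergence is infinite) while (2) allows $Q_\Phi$ to be restricted to a single mode of $P$ (a common outcome).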
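The mode-covering versus mode-seeking behavior can be illustrated with a toy example. Here $P$ is bimodal over three points and the model family is restricted to four hand-picked candidates, none of which equals $P$. The distributions and candidate names are invented for illustration.

```python
import math

def kl(p, q):
    """KL(P, Q) = sum_y P(y) ln(P(y)/Q(y)); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.495, 0.01, 0.495]                      # two modes with a gap between them

candidates = {                                # a restricted, non-universal family
    "left_mode":  [0.98, 0.01, 0.01],
    "middle":     [0.01, 0.98, 0.01],
    "right_mode": [0.01, 0.01, 0.98],
    "broad":      [1 / 3, 1 / 3, 1 / 3],
}

forward_best = min(candidates, key=lambda name: kl(p, candidates[name]))   # objective (1)
reverse_best = min(candidates, key=lambda name: kl(candidates[name], p))   # objective (2)

for name, q in candidates.items():
    print(f"{name:10s}  KL(P,Q) = {kl(p, q):.4f}   KL(Q,P) = {kl(q, p):.4f}")
print("argmin KL(P, Q):", forward_best)
print("argmin KL(Q, P):", reverse_best)
```

In this example objective (1) selects the broad candidate that covers both modes, while objective (2) selects a candidate concentrated on a single mode.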

Proving $KL(P, Q) \geq 0$: Jensen's Inequality
For $f$ convex (upward curving) we have
$$E[f(x)] \geq f(E[x])$$
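A small numerical check of Jensen's inequality, using the convex function $f(x) = -\ln x$ on a made-up three-point distribution. The values and probabilities below are arbitrary illustrations.

```python
import math

def expectation(values, probs):
    """E[v] for a finite distribution given as parallel lists."""
    return sum(p * v for v, p in zip(values, probs))

def f(x):
    return -math.log(x)        # convex on x > 0

xs = [0.5, 1.0, 4.0]           # arbitrary positive values
ps = [0.2, 0.5, 0.3]           # their probabilities

lhs = expectation([f(x) for x in xs], ps)   # E[f(x)]
rhs = f(expectation(xs, ps))                # f(E[x])
print(f"E[f(x)] = {lhs:.4f}  >=  f(E[x]) = {rhs:.4f}")
```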

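Proving $KL(P, Q) \geq 0$
$$\begin{aligned}
KL(P, Q) &= E_{y \sim P} \; -\log \frac{Q(y)}{P(y)} \\
&\geq -\log E_{y \sim P} \; \frac{Q(y)}{P(y)} \\
&= -\log \sum_y P(y) \frac{Q(y)}{P(y)} \\
&= -\log \sum_y Q(y) \\
&= 0
\end{aligned}$$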

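Summary
$$\Phi^* = \operatorname*{argmin}_\Phi \; H(\mathrm{Pop}, P_\Phi) \quad \text{(unconditional)}$$
$$\Phi^* = \operatorname*{argmin}_\Phi \; E_{x \sim \mathrm{Pop}} \; H(\mathrm{Pop}(y|x), P_\Phi(y|x)) \quad \text{(conditional)}$$
Entropy: $H(P) = E_{y \sim P} \; -\ln P(y)$
Cross Entropy: $H(P, Q) = E_{y \sim P} \; -\ln Q(y)$
KL Divergence: $KL(P, Q) = H(P, Q) - H(P) = E_{y \sim P} \; \ln \frac{P(y)}{Q(y)}$
$$H(P, Q) \geq H(P), \quad KL(P, Q) \geq 0, \quad \operatorname*{argmin}_Q \; H(P, Q) = P$$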

Appendix: The Rearrangement Trick
$$H(P, Q) = H(P) + KL(P, Q)$$
$$KL(P, Q) = H(P, Q) - H(P)$$
The rearrangement trick applies to any expression of the form
$$E_{x \sim P} \; \ln \prod_i A_i = E_{x \sim P} \; \sum_i \ln A_i$$
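As a worked instance of the trick, the first identity above can be derived by writing $Q(y)$ as a product of two factors and splitting the logarithm. This is a sketch of one such derivation using the definitions of entropy, cross entropy and KL divergence given in these slides.

```latex
\begin{aligned}
H(P, Q) &= E_{y \sim P} \; -\ln Q(y) \\
        &= E_{y \sim P} \; -\ln \left( P(y) \cdot \frac{Q(y)}{P(y)} \right) \\
        &= E_{y \sim P} \; -\ln P(y) \; + \; E_{y \sim P} \; \ln \frac{P(y)}{Q(y)} \\
        &= H(P) + KL(P, Q)
\end{aligned}
```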

