Do you want to be successful? Are you able to see the big picture?
Are you able to recognise a scientific gem?
How to recognise good work? Suggestions please:
- 1st of its kind
- Solves a problem
- The learning process: effort put in, value added
All these works resulted in Nobel Prizes, yet were initially rejected:
- Enrico Fermi's seminal paper on weak interaction, 1933: "It contained speculations too remote from reality to be of interest to the reader."
- Hans Krebs' paper on the citric acid cycle, a.k.a. the Krebs cycle, 1937: Nature had a huge publishing backlog.
- Murray Gell-Mann's work on classifying the elementary particles, 1953: he did not use a name the editor liked for his discovery.
- The invention of the radioimmunoassay, 1955: reviewers were skeptical that humans could make antibodies small enough to bind to things like insulin.
- The discovery of quasicrystals, 1984: rejected on the grounds that it would not interest physicists.
- The paper outlining nuclear magnetic resonance (NMR) spectroscopy, 1966: rejected twice by J. Chem. Phys., to be finally accepted and published in the Review of Scientific Instruments.
Is your idea so radical that your professor thinks you are out of your mind? Are you discovering nature? Are you trying to break conventional wisdom? If you are doing deep learning for healthcare, are you listening to good doctors? Engineers building a tennis racket must ask the tennis players.
Generative Adversarial Networks
"This (GAN) ... is the most interesting idea in the last 10 years in ML, in my opinion." - Yann LeCun
Generative
A Generator Network maps sampled noise input to a generated image.
[Figure: https://www.linkedin.com/pulse/unsupervised-representation-learning-deep-generative-diego]
Adversarial
When the generator network is not trained well, we get nonsense instead of a nice picture.
Training the generator via an adversary: the discriminator network.
The original GAN diagram: noise z goes into the Generator Network, which outputs fake data x = g(z); real data x and fake data both feed into the Discriminator Network, which outputs a prediction y.
Classical Supervised Learning: the passive learner. Positive examples x+ and negative examples x- are fed into a classifier (SVM, Random Forest, CNN, MLP, etc.) for training.
The passive learner:
- It eats what you give it
- No data = dead learner
- Bad data = bad learner
- Has totally no control over the data
- Must be retrained when the data change a little bit
Is this an intelligent machine?
Humans: the active learner
- Select what to pay attention to and what to ignore
- Creative, and perform thought experiments
- Generalise well; just a little bit of data is OK
- We perform experiments to generate data
- Learn by analogy: 举一反三
Passive learner vs. humans, the active learner: where does GAN sit? The active learner can select what to pay attention to and what to ignore, is creative and performs thought experiments, generalises well from just a little bit of data, and performs experiments to generate data.
Generative models and their problems
All generative models boil down to sampling data from a distribution, e.g.
$(x_1, x_2) \sim a_1 N(\mu_1, \sigma_1) + a_2 N(\mu_2, \sigma_2)$
Sampling black / white pixels:
$(x_1, x_2, \dots, x_N) \sim \frac{\exp(-E(x_1, \dots, x_N)/T)}{Z}, \qquad x_i = \pm 1$
$E(x_1, \dots, x_N) = -(\text{number of neighboring } x_i \text{ with the same sign})$
$Z = \sum_{x_1, x_2, \dots, x_N} \exp(-E(x_1, \dots, x_N)/T)$, a sum over all $2^N$ configurations
[Figure: a grid of pixels $x_1, \dots, x_{30}$]
How to sample is the key problem in generative models. This one is easy:
$(x_1, x_2) \sim a_1 N(\mu_1, \sigma_1) + a_2 N(\mu_2, \sigma_2)$
This one is harder, but can be done with advanced methods:
$(x_1, x_2, \dots, x_N) \sim \frac{\exp(-E(x_1, \dots, x_N)/T)}{Z}$
In general, if we can build a distribution (a theoretical or an empirical distribution will do), we can find a way to sample from it.
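Below is a minimal NumPy sketch of the "easy" case: drawing samples from a two-component Gaussian mixture $a_1 N(\mu_1, \sigma_1) + a_2 N(\mu_2, \sigma_2)$. The weights, means and standard deviations are made-up illustrative values, not from the slides.

```python
# Sample (x1, x2) from a two-component Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)

a = np.array([0.3, 0.7])                     # mixture weights a1, a2 (sum to 1)
mu = np.array([[0.0, 0.0], [4.0, 4.0]])      # component means mu1, mu2
sigma = np.array([1.0, 0.5])                 # isotropic std dev per component

def sample_mixture(n):
    """Pick a component per sample, then draw from that Gaussian."""
    comp = rng.choice(len(a), size=n, p=a)
    return mu[comp] + sigma[comp, None] * rng.standard_normal((n, 2))

samples = sample_mixture(10_000)
print(samples.mean(axis=0))   # roughly a1*mu1 + a2*mu2
```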
The difficulty is building the distribution in the first place. We don't have the slightest idea what the distribution looks like. How can a distribution be built in a high-dimensional space where data is very sparse? E.g. a one-megapixel greyscale image with pixel values in [0, 255] is embedded in a space of $256^{1{,}000{,}000}$ possible images.
Image data are certainly very sparse; it is believed that they lie on some very complex manifold. Many kinds of real-world data likewise occupy an infinitesimal volume of feature space and form very complex manifolds.
[Figure: the Swiss roll dataset, https://www.researchgate.net/figure/286513346_fig1_figure-10-the-swiss-roll-dataset-a-the-euclidean-distance-bluedashed-line-of-two]
Given a data set $\{x_i\}$, a distribution can be constructed by finding a function $f(x)$ that is large near the data points and small elsewhere, by minimising a cost function over $f(x)$:
$C = -\int_{x' \text{ near some } x_i} f(x')\,dx' + \int_{x' \text{ elsewhere}} f(x')\,dx'$
[Figure: a sketch of such an $f(x_1, x_2)$, peaked near the data points]
Minimising this cost function raises two difficulties:
- Defining "near" is a regularisation issue
- The integrals are high-dimensional, e.g. 1,000,000 dimensions for image data
Neural networks offer a pseudo-solution
Let's start by reviewing classical supervised learning on two classes. A positive example x+ goes into the classifier, which should output a high value, e.g. 1.0. A negative example x- goes into the classifier, which should output a low value, e.g. 0.0.
A classifier with weights $w$ maps an input $x_i$ to $f_w(x_i)$. Minimise a cost function with respect to $w$:
$C(w) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(x_j)$
An example:
$C(w) = -\left(\frac{1}{N_+}\sum_i \log D_w(x_i) + \frac{1}{N_-}\sum_j \log\bigl(1 - D_w(x_j)\bigr)\right), \qquad D_w(x_i) \in (0, 1)$
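As a concrete illustration, here is a small NumPy sketch of the example cost above, assuming `d_pos` and `d_neg` hold the classifier outputs $D_w$ on positive and negative examples; the numbers in the usage lines are made up.

```python
# Binary cross-entropy form of the classifier cost C(w).
import numpy as np

def classifier_cost(d_pos, d_neg, eps=1e-12):
    """C(w) = -( mean log D_w(x+) + mean log(1 - D_w(x-)) )."""
    d_pos = np.clip(d_pos, eps, 1 - eps)
    d_neg = np.clip(d_neg, eps, 1 - eps)
    return -(np.mean(np.log(d_pos)) + np.mean(np.log(1.0 - d_neg)))

# Near-perfect predictions give a cost near 0; constant 0.5 gives 2*log(2).
print(classifier_cost(np.array([0.99, 0.98]), np.array([0.01, 0.02])))
print(classifier_cost(np.full(4, 0.5), np.full(4, 0.5)))
```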
[Figures: surface plots of the learned classifier function $f(x_1, x_2)$, high over one class and low over the other]
Idea of GAN (Generative Adversarial Networks):
- Given a set of example real data
- Generate fake data that looks like real data
- The generator is a neural network parametrised by $\theta$
- Sample a set of random noise $z_i$, $i = 1, \dots, N$
- Generate a set of fake data from the sampled noise: $g_\theta(z_i)$, $i = 1, \dots, N$
The Generator Network maps noise samples $z_1, z_2, \dots, z_N$ to $g_\theta(z_1), g_\theta(z_2), \dots, g_\theta(z_N)$. If we adjust $\theta$ well, the $g_\theta(z_i)$ will look like the real examples.
Training the generator network using a discriminator network. The discriminator network needs positive and negative training examples: it maps $x_i$ to $f_w(x_i)$ and is trained with
$C(w) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(x_j)$
where real data examples go into the first sum and fake generated data into the second.
First train the discriminator network:
$C(w, \theta) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(g_\theta(z_j))$
(real data examples in the first sum, fake generated data in the second)
$w = \arg\min_w C(w, \theta)$, keeping $\theta$ constant.
The discriminator network is trained to grade generated data as negative: it criticises any imperfection of the generator.
Discriminator: x -> 10 -> 10 -> 10 -> out
Next train the generator network, using the same cost
$C(w, \theta) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(g_\theta(z_j))$
(real data examples in the first sum, fake generated data in the second):
$w = \arg\min_w C(w, \theta)$ keeping $\theta$ constant; $\theta = \arg\max_\theta C(w, \theta)$ keeping $w$ constant.
Generator: z -> 5 -> 5 -> 5 -> out
Cost function used in the original GAN paper:
$C(w, \theta) = -\left(\frac{1}{N_+}\sum_i \log D_w(x_i) + \frac{1}{N_-}\sum_j \log\bigl(1 - D_w(g_\theta(z_j))\bigr)\right)$
$w = \arg\min_w C(w, \theta)$ keeping $\theta$ constant; $\theta = \arg\max_\theta C(w, \theta)$ keeping $w$ constant.
1. Randomly initialise $w$ and $\theta$
2. Use $g_\theta(z_i)$ to generate fake data
3. Keep $\theta$ constant and optimise $w$ for some $n$ iterations
4. Keep $w$ constant and optimise $\theta$ for some $k$ iterations
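A minimal PyTorch sketch of this alternating procedure on a made-up 2-D toy dataset follows. The layer sizes match the slide's small networks (discriminator 10-10-10, generator 5-5-5); the optimiser, learning rates, batch size and data are illustrative choices, and the generator step uses the common non-saturating form (push $D$ towards 1 on fakes) rather than literally ascending $C$.

```python
# Alternate training of a tiny discriminator and generator on 2-D toy data.
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, data_dim = 2, 2

D = nn.Sequential(nn.Linear(data_dim, 10), nn.ReLU(),
                  nn.Linear(10, 10), nn.ReLU(),
                  nn.Linear(10, 10), nn.ReLU(),
                  nn.Linear(10, 1), nn.Sigmoid())          # D_w(x) in (0, 1)
G = nn.Sequential(nn.Linear(noise_dim, 5), nn.ReLU(),
                  nn.Linear(5, 5), nn.ReLU(),
                  nn.Linear(5, 5), nn.ReLU(),
                  nn.Linear(5, data_dim))                   # g_theta(z)

opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()
batch = 64

for step in range(2000):
    real = 4.0 + 0.5 * torch.randn(batch, data_dim)         # real data x_i
    fake = G(torch.randn(batch, noise_dim))                  # fake data g_theta(z_j)

    # Step 3: keep theta constant, optimise w (real -> 1, fake -> 0).
    opt_D.zero_grad()
    loss_D = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_D.backward()
    opt_D.step()

    # Step 4: keep w constant, optimise theta (make D call the fakes real).
    opt_G.zero_grad()
    loss_G = bce(D(fake), torch.ones(batch, 1))
    loss_G.backward()
    opt_G.step()

print(G(torch.randn(5, noise_dim)))   # should drift towards the real data cluster
```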
Discriminator: x -> 10 -> 10 -> 10 -> out
Generator: z -> 5 -> 5 -> 5 -> out
Shortcomings of GAN
When two adversaries of equal strength compete, both can improve.
Discriminator vs. Generator
Discriminator failure (Discriminator: x -> 10 -> 10 -> 10 -> out; Generator: z -> 5 -> 5 -> 5 -> out)
Generator failure due to zero gradient
Generator failure due to zero gradient (surrounded)
Mode collapse
Improved Techniques for Training GANs, T. Salimans, I. Goodfellow et al.
1. Feature matching for training the generator. The discriminator network exposes an intermediate feature vector $\vec f_w(x)$, and the cost function for the generator becomes
$C_g(w, \theta) = \left\|\frac{1}{N}\sum_i \vec f_w(x_i) - \frac{1}{M}\sum_j \vec f_w(g_\theta(z_j))\right\|_2^2, \qquad \theta = \arg\min_\theta C_g(w, \theta)$
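A minimal sketch of the feature-matching loss, assuming a function `features(x)` that returns an intermediate layer's activations from the discriminator (the commented usage lines reuse the toy networks from the earlier sketch):

```python
# Feature matching: match the mean discriminator features of real and fake data.
import torch

def feature_matching_loss(features, real, fake):
    """|| mean_i f_w(x_i) - mean_j f_w(g(z_j)) ||^2 over a minibatch."""
    real_mean = features(real).mean(dim=0)
    fake_mean = features(fake).mean(dim=0)
    return torch.sum((real_mean - fake_mean) ** 2)

# Example usage with the toy networks from the earlier GAN sketch:
#   features = lambda x: D[:4](x)       # activations after D's second layer
#   loss_G = feature_matching_loss(features, real, G(z))
```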
For comparison, the cost function used in the original GAN paper is
$C(w, \theta) = -\left(\frac{1}{N_+}\sum_i \log D_w(x_i) + \frac{1}{N_-}\sum_j \log\bigl(1 - D_w(g_\theta(z_j))\bigr)\right)$
with $\theta = \arg\max_\theta C(w, \theta)$ keeping $w$ constant.
2. Minibatch discrimination for training the discriminator. For each example, the discriminator's feature vector $\vec f(x_i)$ is combined, via a learned tensor $T$, into an extra vector $\vec o(x_i)$ (read the paper for details); $\vec o$ contains information on how this data point clusters with the other data points in the minibatch. Concatenate the features to create a longer feature $[\vec f(x_i), \vec o(x_i)]$ and put this new feature back into the subsequent layers of the discriminator.
3. Historical averaging. Add a term to the cost function:
$\left\|\theta - \frac{1}{t}\sum_i \theta_i\right\|^2$
where the $\theta_i$ are the parameter values at past steps. Effect: at every step, $\theta$ does not change too much in magnitude or direction.
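A minimal sketch of historical averaging: keep a running average of the parameters and add the squared distance from it to the cost. The class name and how the penalty is weighted into the total loss are illustrative choices, not from the paper.

```python
# Historical averaging penalty for any PyTorch model.
import torch

class HistoricalAverage:
    def __init__(self, model):
        self.model = model
        self.avg = [p.detach().clone() for p in model.parameters()]
        self.t = 1

    def penalty(self):
        """|| theta - (1/t) * sum of past thetas ||^2, to be added to the cost."""
        return sum(((p - a) ** 2).sum()
                   for p, a in zip(self.model.parameters(), self.avg))

    def update(self):
        """Fold the current parameters into the running average."""
        self.t += 1
        for p, a in zip(self.model.parameters(), self.avg):
            a.mul_((self.t - 1) / self.t).add_(p.detach() / self.t)
```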
4. One-sided label smoothing. Changes to discriminator training: instead of asking the discriminator to output 1 for real data, ask it to output a softer value like 0.9. Fake data still has label 0. This encourages a softer decision boundary, e.g. the sigmoid does not need to saturate and drive the gradient to zero.
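For example, a sketch of the corresponding discriminator loss, with the 0.9 real-data target from the slide and fake labels left at 0:

```python
# One-sided label smoothing: soften only the real-data targets.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def discriminator_loss(d_real, d_fake, smooth=0.9):
    real_targets = torch.full_like(d_real, smooth)   # softer target for real data
    fake_targets = torch.zeros_like(d_fake)          # fake label unchanged at 0
    return bce(d_real, real_targets) + bce(d_fake, fake_targets)
```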
5. Virtual batch normalisation. Instead of normalising differently for each minibatch, use a reference batch so that normalisation is consistent across minibatches. What I would do is use one set of statistics for normalisation across all data, e.g. compute z-scores using one mean and one variance. What the paper proposes is to recalculate the mean and variance for each data point, which increases computation time a lot.
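A sketch of the simpler variant described above: one fixed set of statistics computed from a reference batch and reused for every minibatch. The paper's full virtual batch normalisation additionally mixes the current example into the statistics, which is what makes it expensive.

```python
# Normalise every minibatch with statistics from one fixed reference batch.
import torch

class ReferenceBatchNorm:
    def __init__(self, reference_batch, eps=1e-5):
        self.mean = reference_batch.mean(dim=0)
        self.std = reference_batch.std(dim=0) + eps

    def __call__(self, x):
        return (x - self.mean) / self.std
```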
Wasserstein GAN
Earth Mover's Distance: the cost of turning one pile of earth (one distribution over bins 1-9) into another is the amount of earth moved multiplied by the distance it is moved.
[Figures: four examples of rearranging piles of earth over bins 1-9, with total costs 3, 14, 12 and 22]
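In one dimension with equal total mass, the earth mover's distance has a simple closed form: the sum of absolute differences of the cumulative histograms. A small NumPy sketch (the example configuration is made up, but moving one unit of earth three bins to the right costs 3, like the slide's first example):

```python
# 1-D earth mover's distance between two histograms with equal total mass.
import numpy as np

def emd_1d(source, target):
    """Both inputs are histograms over the same bins with the same total mass."""
    assert sum(source) == sum(target)
    return int(np.abs(np.cumsum(np.asarray(source) - np.asarray(target))).sum())

# One unit of earth moved three bins to the right: cost = 3.
print(emd_1d([0, 0, 1, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 1, 0, 0, 0]))
```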
Distribution of real and fake data: $P_r(x)$ is the distribution of real data; it is constant. $P_\theta(g_\theta(z))$ is the distribution of generated data; it changes when the generator changes. $W(P_r(x), P_\theta(g_\theta(z)))$ is the Earth Mover's Distance between real and generated data.
Idea of Wasserstein GAN: there is no discriminator (the paper uses the word "critic"). Given $P_r(x)$, initialise $g_\theta(z_i)$ and generate $P_\theta(g_\theta(z))$; adjust the generator such that $W(P_r(x), P_\theta(g_\theta(z)))$ is minimised.
Kantorovich-Rubinstein duality (I did not prove it):
$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \left(\frac{1}{N}\sum_i f(x_i) - \frac{1}{M}\sum_j f(g(z_j))\right)$
where the supremum is over 1-Lipschitz functions (loosely speaking, the gradient does not exceed one), the $x_i$ are sampled from $P_r(x)$ and the $g(z_j)$ from $P_\theta(g_\theta(z))$.
This is a constrained optimisation problem!
Representation of a 1-Lipschitz function: one way to parameterise a function is to use a neural network. By adjusting the weights, we create different functions. When the network is small, the class of functions obtained is limited; when the network is deep, we can represent more functions.
W (P r,p )= sup kfk<1 0 @ 1 N X i f(x i ) 1 M X j 1 f(g(z j )) A Represent f( ) by a neural network f w ( ) Use gradient ascend to find W (P r (x),p (g (z))) 68
Algorithm: initialise $w$ and $\theta$, and generate $P_\theta(g_\theta(z))$.
(1) Maximise by adjusting $w$:
$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \left(\frac{1}{N}\sum_i f_w(x_i) - \frac{1}{M}\sum_j f_w(g(z_j))\right)$
(2) Minimise the Earth Mover's distance: $\arg\min_\theta W(P_r, P_\theta)$
(3) Repeat (1) and (2)
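A minimal PyTorch sketch of this procedure on the same made-up 2-D toy data, with weight clipping as the original WGAN paper's crude way of keeping $f_w$ roughly 1-Lipschitz; the clip range, learning rates and number of critic steps per generator step are illustrative choices.

```python
# WGAN-style training: a critic f_w without a sigmoid, plus weight clipping.
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(2, 5), nn.ReLU(),
                  nn.Linear(5, 5), nn.ReLU(),
                  nn.Linear(5, 2))                          # toy generator g_theta(z)
critic = nn.Sequential(nn.Linear(2, 10), nn.ReLU(),
                       nn.Linear(10, 10), nn.ReLU(),
                       nn.Linear(10, 1))                    # f_w(x), unbounded output
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)

for step in range(1000):
    # (1) Maximise (1/N) sum f_w(x_i) - (1/M) sum f_w(g(z_j)) by adjusting w.
    for _ in range(5):                                      # several critic steps
        real = 4.0 + 0.5 * torch.randn(64, 2)
        fake = G(torch.randn(64, 2)).detach()
        loss_c = -(critic(real).mean() - critic(fake).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():                       # keep f_w roughly 1-Lipschitz
            p.data.clamp_(-0.01, 0.01)

    # (2) Minimise the estimated earth mover's distance by adjusting theta.
    fake = G(torch.randn(64, 2))
    loss_g = -critic(fake).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```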
Assignment #3: There is something strange about this algorithm. What is it?
Deep Convolutional GAN: A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning with Deep Convolutional GAN, ICLR 2016
[Figures: generated images after one epoch and after 5 epochs of training]
Generator
[Figures: generated digits shown every 10 iterations, and the final generated digits]
BiGAN: Bidirectional GAN
Recall the GAN setup: noise z goes into the Generator Network to give x = g(z), and the Discriminator Network maps x to a prediction y.
Given z, we can generate x: noise z goes into the Generator Network to give x = g(z). How about the reverse: given x, generate z? An Encoder Network maps x to z = E(x).
In BiGAN, an Encoder Network produces z = E(x) from real data and the Generator Network produces x = g(z) from noise. The Discriminator Network now receives (x, z) pairs, real pairs (x, E(x)) and fake pairs (g(z), z), and outputs a prediction y.
Cost function:
$C(w, \theta, \phi) = -\left(\frac{1}{N_+}\sum_i \log D_w\bigl(x_i, E_\phi(x_i)\bigr) + \frac{1}{N_-}\sum_j \log\bigl(1 - D_w(g_\theta(z_j), z_j)\bigr)\right)$
$\max_{\theta, \phi} \min_w C(w, \theta, \phi)$
Adjust the generator and the encoder such that all fake data looks like real data: $g_\theta(z_j) = x' \to x$ and $E_\phi(x_i) = z' \to z$. By this procedure, $g_\theta(E_\phi(x_i)) \to x_i$, i.e. $E_\phi \approx g_\theta^{-1}$.
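A minimal PyTorch sketch of one evaluation of this cost, with a discriminator that scores concatenated (x, z) pairs; all layer sizes and the toy data are illustrative placeholders.

```python
# BiGAN-style joint discriminator over (x, z) pairs.
import torch
import torch.nn as nn

torch.manual_seed(0)
data_dim, noise_dim = 2, 2
E = nn.Sequential(nn.Linear(data_dim, 5), nn.ReLU(), nn.Linear(5, noise_dim))   # encoder E_phi
G = nn.Sequential(nn.Linear(noise_dim, 5), nn.ReLU(), nn.Linear(5, data_dim))   # generator g_theta
D = nn.Sequential(nn.Linear(data_dim + noise_dim, 10), nn.ReLU(),
                  nn.Linear(10, 1), nn.Sigmoid())                               # D_w(x, z)

def joint_score(x, z):
    """Apply D_w to a concatenated (x, z) pair."""
    return D(torch.cat([x, z], dim=1))

x = 4.0 + 0.5 * torch.randn(64, data_dim)        # real data
z = torch.randn(64, noise_dim)                    # sampled noise
d_real = joint_score(x, E(x))                     # real pairs (x, E(x))
d_fake = joint_score(G(z), z)                     # fake pairs (g(z), z)
cost = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())
# The discriminator descends on `cost`; the encoder and generator ascend on it.
```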
[Figures: original images x alongside their reconstructions g(E(x))]
Assignment #3: we will announce it this weekend.