Do you like to be successful? Able to see the big picture


1 Do you like to be successful? Able to see the big picture 1

2 Are you able to recognise a scientific GEM?

3 How to recognise good work? Suggestions please:
item #1: first of its kind
item #2: solves a problem
item #3: learning process, effort put in, value added

4 All these works resulted in Nobel Prizes. Description, and reason for rejection:
Enrico Fermi's seminal paper on weak interaction, 1933: it contained speculations too remote from reality to be of interest to the reader.
Hans Krebs' paper on the citric acid cycle, AKA the Krebs cycle, 1937: Nature had a huge publishing backlog.
Murray Gell-Mann's work on classifying the elementary particles, 1953: he did not use a name the editor liked for his discovery.
The invention of the radioimmunoassay, 1955: reviewers were skeptical that humans could make antibodies small enough to bind to things like insulin.
The discovery of quasicrystals, 1984: rejected on the grounds that it would not interest physicists.
Paper outlining nuclear magnetic resonance (NMR) spectroscopy, 1966: rejected twice by the J. Chem. Phys. before finally being accepted and published in the Review of Scientific Instruments.

5 Is your idea so radical that your professor thinks you are out of your mind? Do you discover nature? Are you trying to break conventional wisdom? If you are doing deep learning for healthcare, are you listening to good doctors? Engineers building a tennis racket must ask the tennis players.

6 Generative Adversarial Networks: "This (GAN) ... is the most interesting idea in the last 10 years in ML, in my opinion." Yann LeCun

7 Generative 7

8 Generative: sampled noise input → Generator Network → generated image

9 Adversarial 9

10 When the generator network is not trained well, the Generator Network produces nonsense instead of a nice picture.

11 Training the generator via an adversary, the discriminator network: Discriminator Network vs Generator Network

12 Training the generator via an adversary, the discriminator network: Discriminator Network vs Generator Network

13 The original GAN diagram: noise z → Generator Network → x = g(z); real data x or generated x → Discriminator Network → prediction y

14 Classical Supervised Learning: the passive learner. Training data x+ and x- are fed to SVM, Random Forest, CNN, MLP, etc. for training.

15 Classical Supervised Learning: the passive learner. It eats whatever you give it. No data = dead learner. Bad data = bad learner. It has totally no control over the data, and must be retrained when the data change a little bit. Is this an intelligent machine?

16 Humans: the active learner. We select what to pay attention to and what to ignore. We are creative and perform thought experiments. We generalise well; just a little bit of data is OK. We perform experiments to generate data. We learn by analogy (举一反三, inferring three things from one example).

17 Passive Learner vs Humans, the Active Learner: where does GAN fit? The active learner selects what to pay attention to and what to ignore, is creative and performs thought experiments, generalises well from just a little bit of data, and performs experiments to generate data.

18 Generative models and their problems 18

19 All generative models boil down to sampling data from a distribution, e.g. $(x_1, x_2) \sim a_1 \mathcal{N}(\mu_1, \Sigma_1) + a_2 \mathcal{N}(\mu_2, \Sigma_2)$
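
To make the sampling concrete, here is a minimal NumPy sketch (not from the slides) that draws from a two-component Gaussian mixture; the weights, means and covariances are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters: weights a_k, means mu_k, covariances Sigma_k.
a = np.array([0.3, 0.7])
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

def sample_mixture(n):
    # Pick a component for each sample according to the weights a,
    # then draw from the corresponding Gaussian.
    k = rng.choice(len(a), size=n, p=a)
    return np.stack([rng.multivariate_normal(mu[ki], Sigma[ki]) for ki in k])

samples = sample_mixture(1000)   # array of shape (1000, 2)
```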

20 Sampling black/white pixels: $(x_1, x_2, \dots, x_N) \sim \exp(-E(x_1, \dots, x_N)/T)\,/\,Z$, with $x_i = \pm 1$, $Z = \sum_{x_1, \dots, x_N} \exp(-E(x_1, \dots, x_N)/T)$ (a sum over $2^N$ configurations), and $E(x_1, \dots, x_N) = -\,$(number of neighbouring $x_i$ with the same sign). (Figure: a grid of pixels $x_1, \dots, x_{30}$.)

21 Sampling black/white pixels: $(x_1, x_2, \dots, x_N) \sim \exp(-E(x_1, \dots, x_N)/T)\,/\,Z$, with $x_i = \pm 1$, $E(x_1, \dots, x_N) = -\,$(number of neighbouring $x_i$ with the same sign), and $Z = \sum_{x_1, \dots, x_N} \exp(-E(x_1, \dots, x_N)/T)$ (a sum over $2^N$ configurations).

22 How to sample is the key problem in generative models. This one is easy: $(x_1, x_2) \sim a_1 \mathcal{N}(\mu_1, \Sigma_1) + a_2 \mathcal{N}(\mu_2, \Sigma_2)$. This one is harder but can be done with advanced methods: $(x_1, x_2, \dots, x_N) \sim \exp(-E(x_1, \dots, x_N)/T)\,/\,Z$. In general, if we can build a distribution (theoretical or empirical will do), we can find a way to sample.
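
One of the "advanced methods" alluded to here is Markov chain Monte Carlo. Below is a minimal Metropolis sketch for the black/white pixel model above; the grid size and temperature are arbitrary choices, and this is an illustration rather than the slides' own code:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, T = 30, 30, 2.0                      # arbitrary grid size and temperature
x = rng.choice([-1, 1], size=(H, W))       # random initial black/white image

def same_sign_neighbours(x, i, j):
    # Count how many of pixel (i, j)'s neighbours have the same sign,
    # and how many neighbours it has in total.
    n_same, n_total = 0, 0
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        ni, nj = i + di, j + dj
        if 0 <= ni < H and 0 <= nj < W:
            n_total += 1
            n_same += int(x[i, j] == x[ni, nj])
    return n_same, n_total

for step in range(100_000):
    i, j = rng.integers(H), rng.integers(W)
    n_same, n_total = same_sign_neighbours(x, i, j)
    # E = -(number of same-sign neighbour pairs): flipping (i, j) turns its
    # n_same aligned neighbours into misaligned ones and vice versa.
    dE = 2 * n_same - n_total
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        x[i, j] *= -1                      # accept the flip
```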

23 The difficulty is building the distribution. We don't have the slightest idea what the distribution looks like. How can a distribution be built in a high-dimensional space where data are very sparse? E.g. one-megapixel greyscale images with values in [0, 255] are embedded in a volume of roughly $256^{1{,}000{,}000}$ possible images.

24 Image data are certainly always very, very sparse; they are believed to lie on some very complex manifold. Many other kinds of real-world data likewise occupy an infinitesimal volume of feature space and form very complex manifolds.

25 Given a data set $\{x_i\}$, a distribution can be constructed by finding a function $f(x)$ that is large 'near' data points and small elsewhere, by minimising a cost function over $f(x)$:
$C = -\int_{x' \text{ near some } x_i} f(x')\,dx' + \int_{x' \text{ everywhere else}} f(x')\,dx'$

26 $C = -\int_{x' \text{ near some } x_i} f(x')\,dx' + \int_{x' \text{ everywhere else}} f(x')\,dx'$ (Figure: $f(x_1, x_2)$ peaked near the data points.)

27 Given a data set $\{x_i\}$, a distribution can be constructed by finding a function $f(x)$ that is large 'near' data points and small elsewhere, by minimising a cost function over $f(x)$:
$C = -\int_{x' \text{ near some } x_i} f(x')\,dx' + \int_{x' \text{ everywhere else}} f(x')\,dx'$
Two difficulties here: defining 'near' (a regularisation issue), and high-dimensional integrals (1,000,000 dimensions for image data).

28 Neural networks offer a pseudo-solution 28

29 Let's start by reviewing classical supervised learning on two classes: x+ → Classifier → a high value (e.g. close to 1)

30 Let's start by reviewing classical supervised learning on two classes: x- → Classifier → a low value (e.g. close to 0)

31 $x_i$ → Classifier (weights $w$) → $f_w(x_i)$. Minimise a cost function with respect to $w$:
$C(w) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(x_j)$
(the first sum is over positive examples, the second over negative examples). An example:
$C(w) = -\left(\frac{1}{N_+}\sum_i \log(D_w(x_i)) + \frac{1}{N_-}\sum_j \log(1 - D_w(x_j))\right)$, with $D_w(x_i) \in (0, 1)$.
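
The example cost is just binary cross-entropy. A direct NumPy transcription, where `D_pos` and `D_neg` are hypothetical arrays of classifier outputs on the positive and negative examples:

```python
import numpy as np

def cross_entropy_cost(D_pos, D_neg, eps=1e-12):
    # C(w) = -( mean_i log D_w(x_i+) + mean_j log(1 - D_w(x_j-)) ), with D_w(x) in (0, 1).
    return -(np.mean(np.log(D_pos + eps)) + np.mean(np.log(1.0 - D_neg + eps)))
```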

32 (Figure: the learned $f(x_1, x_2)$ plotted as a surface over the $(x_1, x_2)$ plane.)

33 (Figure: the learned $f(x_1, x_2)$ surface and the corresponding regions in the $(x_1, x_2)$ plane.)

34 Idea of GAN (Generative Adversarial Networks): given a set of real example data, generate fake data that looks like real data. The generator is a neural network $g_\theta$ parametrised by $\theta$. Sample a set of random noise $z_i$, $i = 1, \dots, N$, and generate a set of fake data from the sampled noise: $g_\theta(z_i)$, $i = 1, \dots, N$.

35 Generator Network: $z_1, z_2, \dots, z_N$ → $g_\theta(z_1), g_\theta(z_2), \dots, g_\theta(z_N)$. If we adjust $\theta$ well, $g_\theta(z_i)$ will look like this (figure: realistic samples).

36 Training the Generator network using a Discriminator network. The Discriminator network needs positive and negative training examples: $x_i$ → Discriminator (weights $w$) → $f_w(x_i)$.
$C(w) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(x_j)$
(put real data examples in the first sum, fake generated data in the second).

37 First train the Discriminator network:
$C(w, \theta) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(g_\theta(z_j))$
(real data examples in the first sum, fake generated data in the second)
$w^* = \arg\min_w C(w, \theta)$, keeping $\theta$ constant.
The Discriminator network is trained to grade generated data as negative; it criticises any imperfection of the generator.

38 Discriminator : x -> 10 -> 10 -> 10 -> out 38

39 Next train the Generator network:
$C(w, \theta) = -\frac{1}{N_+}\sum_i f_w(x_i) + \frac{1}{N_-}\sum_j f_w(g_\theta(z_j))$
(real data examples in the first sum, fake generated data in the second)
$w^* = \arg\min_w C(w, \theta)$, keeping $\theta$ constant
$\theta^* = \arg\max_\theta C(w, \theta)$, keeping $w$ constant

40 Generator : x -> 5 -> 5 -> 5 -> out 40
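
In PyTorch, the small architectures on these slides could be written roughly as below; the 2-D input/output dimensions are assumptions, since the toy data is not specified:

```python
import torch.nn as nn

# Assumed 2-D toy data and 2-D noise; the hidden sizes follow the slides.
discriminator = nn.Sequential(
    nn.Linear(2, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 1), nn.Sigmoid(),    # D_w(x) in (0, 1)
)

generator = nn.Sequential(
    nn.Linear(2, 5), nn.ReLU(),
    nn.Linear(5, 5), nn.ReLU(),
    nn.Linear(5, 5), nn.ReLU(),
    nn.Linear(5, 2),                   # g_theta(z), same dimension as the data
)
```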

41 Cost function used in the original GAN paper:
$C(w, \theta) = -\left(\frac{1}{N_+}\sum_i \log(D_w(x_i)) + \frac{1}{N_-}\sum_j \log(1 - D_w(g_\theta(z_j)))\right)$
$w^* = \arg\min_w C(w, \theta)$, keeping $\theta$ constant
$\theta^* = \arg\max_\theta C(w, \theta)$, keeping $w$ constant
1. Randomly initialise $w$ and $\theta$
2. Use $g_\theta(z_i)$ to generate fake data
3. Keep $\theta$ constant and optimise $w$ for some n iterations
4. Keep $w$ constant and optimise $\theta$ for some k iterations
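
A minimal PyTorch sketch of the alternating procedure above; the networks, optimisers and noise dimension are hypothetical, and it uses the common non-saturating generator loss rather than maximising $C$ directly:

```python
import torch

bce = torch.nn.BCELoss()

def gan_step(discriminator, generator, real_batch, opt_d, opt_g,
             noise_dim=2, n_d=1, n_g=1):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Steps 2-3: keep theta fixed, optimise w (real -> 1, fake -> 0).
    for _ in range(n_d):
        fake = generator(torch.randn(n, noise_dim)).detach()   # no gradient into the generator
        loss_d = bce(discriminator(real_batch), ones) + bce(discriminator(fake), zeros)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Step 4: keep w fixed, optimise theta (make the discriminator grade fakes as real).
    for _ in range(n_g):
        loss_g = bce(discriminator(generator(torch.randn(n, noise_dim))), ones)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```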

42 Discriminator : x -> 10 -> 10 -> 10 -> out Generator : x -> 5 -> 5 -> 5 -> out 42

43 Discriminator : x -> 10 -> 10 -> 10 -> out Generator : x -> 5 -> 5 -> 5 -> out 43

44 Shortcomings of GAN 44

45 When two adversaries of equal strength compete, both can improve.

46 Discriminator Generator 46

47 Discriminator fails. Discriminator : x -> 10 -> 10 -> 10 -> out Generator : x -> 5 -> 5 -> 5 -> out

48 Generator fails due to zero gradient

49 Generator fails due to zero gradient (surrounded)

50 Mode collapse 50

51 Improved Techniques for Training GANs, T. Salimans, I. Goodfellow et al.

52 1. Feature matching for training the generator. The Discriminator Network provides an intermediate feature vector $\vec{f}_w(x)$. Cost function for the generator:
$C_g(w, \theta) = \left\|\frac{1}{N}\sum_i \vec{f}_w(x_i) - \frac{1}{M}\sum_j \vec{f}_w(g_\theta(z_j))\right\|_2^2$
$\theta^* = \arg\min_\theta C_g(w, \theta)$
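
A sketch of the feature-matching loss, assuming a hypothetical `features(x)` that returns the intermediate discriminator activations $\vec{f}_w(x)$ for a batch:

```python
import torch

def feature_matching_loss(features, real_batch, fake_batch):
    # || (1/N) sum_i f_w(x_i) - (1/M) sum_j f_w(g_theta(z_j)) ||_2^2
    real_mean = features(real_batch).mean(dim=0)
    fake_mean = features(fake_batch).mean(dim=0)
    return torch.sum((real_mean - fake_mean) ** 2)
```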

53 Discriminator cost function used in the original GAN paper:
$C(w, \theta) = -\left(\frac{1}{N_+}\sum_i \log(D_w(x_i)) + \frac{1}{N_-}\sum_j \log(1 - D_w(g_\theta(z_j)))\right)$
$\theta^* = \arg\max_\theta C(w, \theta)$, keeping $w$ constant

54 2. Minibatch discrimination for training the discriminator. The Discriminator Network produces a feature vector $\vec{f}(x_i)$; from it, compute $\vec{o}(x_i)$ via a learned tensor $T$ (read the paper for details). $\vec{o}$ contains information on how this data point clusters with the other data points in the minibatch. Concatenate the features to create a longer feature $[\vec{f}(x_i), \vec{o}(x_i)]$ and feed this new feature into the subsequent layers of the discriminator.

55 3. Historical averaging. Add a term to the cost function: $\left\|\theta - \frac{1}{t}\sum_{i=1}^{t} \theta_i\right\|^2$. Effect: at every step, $\theta$ does not change too much in magnitude and direction.
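
A sketch of how the historical-averaging penalty could be maintained with a running mean of the flattened parameters; the class name and update scheme are assumptions, not the paper's code:

```python
import torch

class HistoricalAverage:
    """Penalty || theta - (1/t) sum_{i<=t} theta_i ||^2 over the parameter history."""

    def __init__(self):
        self.avg, self.t = None, 0

    def penalty(self, params):
        theta = torch.cat([p.detach().flatten() for p in params])
        self.t += 1
        if self.avg is None:
            self.avg = theta.clone()
        else:
            self.avg += (theta - self.avg) / self.t     # running mean of past parameters
        current = torch.cat([p.flatten() for p in params])
        return torch.sum((current - self.avg) ** 2)     # add this term to the cost
```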

56 4. One-sided label smoothing. Change in discriminator training: instead of asking the discriminator to output 1 for real data, ask it to output a softer value like 0.9. Fake data still have label 0. This encourages a softer decision boundary, e.g. there is no need to saturate the sigmoid and make the gradient zero.
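
In the discriminator loss, the only change is the target used for real data; a minimal sketch:

```python
import torch

bce = torch.nn.BCELoss()

def smoothed_discriminator_loss(d_real, d_fake, real_label=0.9):
    # One-sided label smoothing: real examples target 0.9 instead of 1.0, fakes stay at 0.
    return bce(d_real, torch.full_like(d_real, real_label)) + \
           bce(d_fake, torch.zeros_like(d_fake))
```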

57 5. Virtual batch normalisation. Instead of normalising differently for each minibatch, use a reference batch so that normalisation is consistent across minibatches. What I would do is use one set of statistics for normalisation across all data, e.g. compute a z-score using one mean and one variance. What the paper proposes is to recalculate the mean and variance for each data point, which increases computation time a lot.

58 Wasserstein GAN 58

59 Earth Mover's Distance (figure: an example of moving piles of earth; cost = ...)

60 Earth Mover's Distance (figure: piles A(2), B(2), C(3), D(3), E(4) at positions A to E; transport cost = 14)

61 Earth Mover's Distance (figure: piles A(2), C(1), E(1), B(3), D(5) at positions A to E; transport cost = 12)

62 Earth Mover's Distance (figure: piles at positions A to I, with A(6), B(3), C(5), D(3), E(2), F(3), G(0), H(0), I(0); transport cost = 22)
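
For one-dimensional piles like these, the optimal transport cost can be checked numerically with SciPy. The positions and weights below are made up; `wasserstein_distance` returns the cost per unit of mass, so it is multiplied by the total mass to get a total cost:

```python
from scipy.stats import wasserstein_distance

# Hypothetical piles of earth at positions 0..4 (think A..E) and a target configuration.
positions = [0, 1, 2, 3, 4]
supply = [2, 2, 3, 3, 4]
demand = [4, 3, 3, 2, 2]

avg_cost = wasserstein_distance(positions, positions, supply, demand)
total_cost = avg_cost * sum(supply)   # total amount of earth moved times distance
print(total_cost)
```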

63 Distributions of real and fake data: $P_r(x)$ is the distribution of real data; it is constant. $P_\theta(g_\theta(z))$ is the distribution of generated data; it changes when the generator changes. $W(P_r(x), P_\theta(g_\theta(z)))$ is the Earth Mover's Distance between the real and generated data.

64 Idea of Wasserstein GAN: no discriminator (in the paper they use the word 'critic'). Given $P_r(x)$, initialize $g_\theta$ and generate $P_\theta(g_\theta(z))$, then adjust the generator such that $W(P_r(x), P_\theta(g_\theta(z)))$ is minimised.

65 Kantorovich-Rubinstein duality (I did not prove it):
$W(P_r, P_\theta) = \sup_{f} \left(\frac{1}{N}\sum_i f(x_i) - \frac{1}{M}\sum_j f(g_\theta(z_j))\right)$
over 1-Lipschitz functions $f$ (loosely speaking, the gradient does not exceed one). The first sum is over samples from $P_r(x)$, the second over samples from $P_\theta(g_\theta(z))$.

66 Kantorovich-Rubinstein duality (I did not prove it):
$W(P_r, P_\theta) = \sup_{f} \left(\frac{1}{N}\sum_i f(x_i) - \frac{1}{M}\sum_j f(g_\theta(z_j))\right)$
The first sum is over samples from $P_r(x)$, the second over samples from $P_\theta(g_\theta(z))$. This is a constrained optimisation problem!

67 Representation of a 1-Lipschitz function: one way to parameterise a function is to use a neural network. By adjusting the weights, we create different functions. When the network is small, the class of functions obtained is limited; when the network is deep, we can represent more functions.

68 $W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \left(\frac{1}{N}\sum_i f(x_i) - \frac{1}{M}\sum_j f(g_\theta(z_j))\right)$
Represent $f(\cdot)$ by a neural network $f_w(\cdot)$, and use gradient ascent to find $W(P_r(x), P_\theta(g_\theta(z)))$.

69 Algorithm: initialize $w$ and $\theta$, and generate $P_\theta(g_\theta(z))$.
(1) Maximize by adjusting $w$ (subject to $\|f_w\|_L \le 1$): $W(P_r, P_\theta) = \max_w \left(\frac{1}{N}\sum_i f_w(x_i) - \frac{1}{M}\sum_j f_w(g_\theta(z_j))\right)$
(2) Minimize the Earth Mover's distance: $\theta^* = \arg\min_\theta W(P_r, P_\theta)$
(3) Repeat (1) & (2)
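
A minimal PyTorch sketch of steps (1) to (3), using weight clipping to keep the critic approximately 1-Lipschitz as in the WGAN paper; the noise dimension, clip value and number of critic steps are assumptions:

```python
import torch

def wgan_step(critic, generator, real_batch, opt_c, opt_g, n_critic=5, clip=0.01):
    n = real_batch.size(0)

    # (1) Maximise the duality expression over w (implemented as minimising its negative).
    for _ in range(n_critic):
        fake = generator(torch.randn(n, 2)).detach()
        loss_c = -(critic(real_batch).mean() - critic(fake).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():           # weight clipping enforces the Lipschitz constraint
            p.data.clamp_(-clip, clip)

    # (2) Minimise the estimated Earth Mover's distance over theta.
    loss_g = -critic(generator(torch.randn(n, 2))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # (3) Repeat (1) and (2).
```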

70

71 Assignment #3: There is something strange about this algorithm. What is it?

72 Deep Convolutional GAN: A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning with Deep Convolutional GAN, ICLR 2016

73 one epoch 73

74 5 epochs

75 one epoch vs 5 epochs

76 Generator 76

77 77

78 Every 10 iterations Final generated digits 78

79 Every 10 iterations Final generated digits 79

80 Every 10 iterations Final generated digits 80

81 Every 10 iterations Final generated digits 81

82 BiGAN Bidirectional GAN 82

83 (Figure: noise z → Generator Network → x = g(z); x → Discriminator Network → prediction y.)

84 Given z, generate x: noise z → Generator Network → x = g(z). How about given x, generate z? x → Encoder Network → z = E(x).

85 (Figure: noise z → Generator Network → x = g(z); x → Discriminator Network → prediction y.)

86 (Figure: x → Encoder Network → z = E(x); noise z → Generator Network → x = g(z); both feed the Discriminator Network → prediction y.)

87 (Figure: the Discriminator Network sees joint pairs (x, z), either (x, E(x)) from the encoder or (g(z), z) from the generator, and outputs prediction y.)

88 Cost function:
$C(w, \theta, \phi) = -\left(\frac{1}{N_+}\sum_i \log\{D_w(x_i, E_\phi(x_i))\} + \frac{1}{N_-}\sum_j \log\{1 - D_w(g_\theta(z_j), z_j)\}\right)$
$\max_{\theta, \phi} \min_w C(w, \theta, \phi)$
Adjust the generator and the encoder such that all fake data look like real ones: $g_\theta(z_j) = x' \to x$ and $E_\phi(x_i) = z' \to z$. By this procedure, $g_\theta(E_\phi(x_i)) \to x_i$, i.e. $E_\phi \approx g_\theta^{-1}$.
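
A sketch of the BiGAN losses on joint pairs $(x, z)$, assuming a hypothetical `discriminator` that takes the concatenation of $x$ and $z$ as input:

```python
import torch

bce = torch.nn.BCELoss()

def bigan_losses(discriminator, generator, encoder, real_x, z):
    # Real pair (x, E(x)) is labelled 1, fake pair (g(z), z) is labelled 0.
    real_pair = torch.cat([real_x, encoder(real_x)], dim=1)
    fake_pair = torch.cat([generator(z), z], dim=1)
    d_real, d_fake = discriminator(real_pair), discriminator(fake_pair)
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    loss_d = bce(d_real, ones) + bce(d_fake, zeros)    # discriminator minimises this
    loss_ge = bce(d_real, zeros) + bce(d_fake, ones)   # generator and encoder minimise this
    return loss_d, loss_ge
```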

89 (Figure: pairs of original images x and their reconstructions g(E(x)).)

90 (Figure: more pairs of original images x and their reconstructions g(E(x)).)

91 Assignment #3 we will announce this weekend 91

92 92
