Some theoretical properties of GANs. Gérard Biau Toulouse, September 2018


1 Some theoretical properties of GANs Gérard Biau Toulouse, September 2018

2 Coauthors Benoît Cadre (ENS Rennes) Maxime Sangnier (Sorbonne University) Ugo Tanielian (Sorbonne University & Criteo) 1

3 video Source: Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of GANs for improved quality, stability, and variation, ICLR

4 Source: 3

5 video Source: 4

6 Original paper Generative Adversarial Nets Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio Département d'informatique et de recherche opérationnelle Université de Montréal Montréal, QC H3C 3J7 Abstract We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples. 1 Introduction The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties. 1 In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles. Ian Goodfellow is now a research scientist at Google, but did this work earlier as a UdeM student. Jean Pouget-Abadie did this work while visiting Université de Montréal from Ecole Polytechnique. Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi. Yoshua Bengio is a CIFAR Senior Fellow. 1 All code and hyperparameters available at 1 5

7 Source: 6

8 Outline 1. Mathematical context of GANs 2. Optimality properties 3. Approximation properties 4. Statistical analysis 7

9 Mathematical context of GANs

14 Generators Data: X_1, ..., X_n, i.i.d. according to some unknown density p on E ⊆ R^d. Generators: a parametric family of functions from R^{d'} to E (d' ≤ d). Notation: G = {G_θ}_{θ∈Θ}, Θ ⊆ R^p. Principle: Ω --Z--> R^{d'} --G_θ--> E. By definition, the law of G_θ(Z) is p_θ dµ. Associated family of densities: P = {p_θ}_{θ∈Θ}. Each density p_θ is a potential candidate to represent p. 8
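To make the sampling mechanism concrete, here is a minimal numerical sketch of a generator family: a latent variable Z is pushed through G_θ to produce samples whose density is p_θ. The latent law, the affine form of G_θ, and the parameter values are illustrative assumptions, not the families used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_generator(theta, n, d_latent=1):
    """Draw n samples from p_theta by pushing latent noise Z through G_theta.

    Illustrative choice (not from the slides): Z ~ N(0, I) in R^{d'} and
    G_theta(z) = theta[0] + theta[1] * z, so p_theta is N(theta[0], theta[1]^2).
    """
    z = rng.standard_normal((n, d_latent))   # latent variable Z
    return theta[0] + theta[1] * z           # G_theta(Z), whose law is p_theta dmu

fake = sample_generator(theta=np.array([0.0, 2.0]), n=5)
print(fake.shape)  # (5, 1): candidate samples meant to imitate the data density p
```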

20 A specific framework In GAN algorithms, G = {G_θ}_{θ∈Θ} is a neural network. Bad ideas: an exhaustive description of p by a classical parametric model; estimation by a traditional maximum likelihood approach; any strategy based on nonparametric density estimation techniques. It is not assumed that p belongs to P. Think parametric, but forget the classical rules. 9

25 Discriminators Discriminators: a family of functions from E to [0, 1]. Notation: D. Often (but not always): D = {D_α}_{α∈Λ}, Λ ⊆ R^q. In GAN algorithms, D is a neural network. The higher D(x), the higher the probability that x is drawn from p. Generators and discriminators have opposite objectives. Source: 10
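As a companion to the generator sketch above, the following toy discriminator shows the interface D_α : E → [0, 1]; the logistic form and the parameter values are hypothetical stand-ins for the neural networks used in practice.

```python
import numpy as np

def discriminator(alpha, x):
    """A toy parametric discriminator D_alpha : E -> [0, 1].

    Hypothetical choice: a logistic model D_alpha(x) = sigmoid(alpha_0 + alpha_1 * x);
    GAN implementations use a neural network instead, but the interface is the same.
    """
    return 1.0 / (1.0 + np.exp(-(alpha[0] + alpha[1] * x)))

x = np.array([-1.0, 0.0, 2.5])
print(discriminator(np.array([0.0, 1.0]), x))  # values in (0, 1): "probability that x is real"
```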

26 Source: 11

29 Adversarial principle Auxiliary random variables: Z_1, ..., Z_n, i.i.d. according to Z. Objective: solve in θ
inf_{θ∈Θ} sup_{D∈D} [ ∏_{i=1}^n D(X_i) ∏_{i=1}^n (1 − D∘G_θ(Z_i)) ].
Equivalently: find ˆθ ∈ Θ such that
sup_{D∈D} ˆL(ˆθ, D) ≤ sup_{D∈D} ˆL(θ, D), for all θ ∈ Θ,
where ˆL(θ, D) := ∑_{i=1}^n ln D(X_i) + ∑_{i=1}^n ln(1 − D∘G_θ(Z_i)). 12
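The empirical criterion ˆL(θ, D) can be written directly in code. The sketch below assumes placeholder choices of the data density, of G_θ, and of D just to exercise the formula; only the criterion itself comes from the slides.

```python
import numpy as np

def empirical_criterion(D, G_theta, X, Z, eps=1e-12):
    """Empirical GAN criterion from the slides:
    L_hat(theta, D) = sum_i ln D(X_i) + sum_i ln(1 - D(G_theta(Z_i))).

    D maps samples to (0, 1); G_theta maps latent draws Z_i into the sample space E.
    eps guards the logarithms (a numerical convenience, not part of the definition).
    """
    real_term = np.sum(np.log(D(X) + eps))
    fake_term = np.sum(np.log(1.0 - D(G_theta(Z)) + eps))
    return real_term + fake_term

# Toy usage with hypothetical choices for the data law, G_theta and D.
rng = np.random.default_rng(1)
X = rng.laplace(scale=1.5, size=200)               # data X_1, ..., X_n
Z = rng.standard_normal(200)                       # auxiliary Z_1, ..., Z_n
G = lambda z: 2.0 * z                              # one generator G_theta
D = lambda x: 1.0 / (1.0 + np.exp(-x**2 + 2.0))    # one discriminator
print(empirical_criterion(D, G, X, Z))
```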

36 Some comments The criterion ˆL(θ, D) is the original one. Many variants: MMD-GANs, f-GANs, Wasserstein-GANs, scattering transforms, etc. They are based on different criteria: a galaxy of GANs. Our objective today: the original criterion. Roadmap: basic properties of ˆL(θ, D)? Impact of the discriminators on the quality of the approximation? Statistical consistency? Rates of convergence? Play with simple examples. 13

37 Optimality properties

40 Reminders For P ≪ Q probability measures on E, D_KL(P ‖ Q) = ∫ ln(dP/dQ) dP. Properties: D_KL(P ‖ Q) ≥ 0 and D_KL(P ‖ Q) = 0 ⟺ P = Q. If p = dP/dµ and q = dQ/dµ, then D_KL(P ‖ Q) = ∫ p ln(p/q) dµ. For P and Q probability measures on E, D_JS(P, Q) = (1/2) D_KL(P ‖ (P+Q)/2) + (1/2) D_KL(Q ‖ (P+Q)/2). Properties: 0 ≤ D_JS(P, Q) ≤ ln 2 and √D_JS(P, Q) is a distance. 14
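A short numerical sketch of these definitions, approximating the integrals on a grid; the Gaussian example densities are illustrative.

```python
import numpy as np

def kl(p, q, dx):
    """D_KL(P || Q) = integral of p ln(p/q) dmu, approximated on a grid with spacing dx."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

def js(p, q, dx):
    """D_JS(P, Q) = 0.5 D_KL(P || M) + 0.5 D_KL(Q || M) with M = (P + Q)/2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m, dx) + 0.5 * kl(q, m, dx)

# Example: JS divergence between two Gaussian densities (values lie in [0, ln 2]).
x = np.linspace(-20, 20, 20001)
dx = x[1] - x[0]
gauss = lambda mu, s: np.exp(-(x - mu) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
print(js(gauss(0, 1), gauss(3, 1), dx), np.log(2))
```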

47 Role of the Jensen-Shannon divergence ˆL(θ, D) is the empirical version of
L(θ, D) := ∫ ln(D) p dµ + ∫ ln(1 − D) p_θ dµ.
Idealization: D = D_∞, the set of all functions from E to [0, 1]. In this case,
sup_{D∈D_∞} L(θ, D) = sup_{D∈D_∞} ∫ [ln(D) p + ln(1 − D) p_θ] dµ
≤ ∫ sup_{D∈D_∞} [ln(D) p + ln(1 − D) p_θ] dµ = L(θ, D*_θ),
where D*_θ := p/(p + p_θ) is the optimal discriminator.
Conclusion: sup_{D∈D_∞} L(θ, D) = L(θ, D*_θ) = 2 D_JS(p, p_θ) − ln 4. 15
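The identity sup_D L(θ, D) = L(θ, D*_θ) = 2 D_JS(p, p_θ) − ln 4 is easy to check numerically; the sketch below does so on a grid for two illustrative Gaussian densities standing in for p and p_θ.

```python
import numpy as np

x = np.linspace(-25, 25, 40001)
dx = x[1] - x[0]
gauss = lambda mu, s: np.exp(-(x - mu) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

p = gauss(0.0, 1.0)        # stand-in for the data density p
p_theta = gauss(1.0, 2.0)  # stand-in for a generator density p_theta

# Optimal discriminator D*_theta = p / (p + p_theta) and the value L(theta, D*_theta).
D_star = p / (p + p_theta)
L_star = np.sum(p * np.log(D_star) + p_theta * np.log(1.0 - D_star)) * dx

# Jensen-Shannon divergence computed directly, for comparison.
m = 0.5 * (p + p_theta)
js = 0.5 * np.sum(p * np.log(p / m)) * dx + 0.5 * np.sum(p_theta * np.log(p_theta / m)) * dx

print(L_star, 2 * js - np.log(4))   # the two numbers should agree (up to grid error)
```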

51 Role of the Jensen-Shannon divergence For all θ ∈ Θ, sup_{D∈D_∞} L(θ, D) = 2 D_JS(p, p_θ) − ln 4. A possible interpretation: minimize D_JS(p, p_θ) over Θ. Many GAN algorithms try to learn the optimal discriminator D*_θ. Theorem Let θ ∈ Θ be such that p_θ > 0 µ-almost everywhere. Then the function D*_θ is the unique discriminator that achieves the supremum of the functional D ↦ L(θ, D) over D_∞. 16

56 Oracle parameter By definition, for all θ ∈ Θ, sup_{D∈D_∞} L(θ, D) = 2 D_JS(p, p_θ) − ln 4. We let θ* ∈ Θ be defined by D_JS(p, p_θ*) ≤ D_JS(p, p_θ), for all θ ∈ Θ. θ* is the oracle parameter in terms of Jensen-Shannon divergence. Whenever p ∈ P, then p = p_θ*, D_JS(p, p_θ*) = 0, and D*_θ* = 1/2. This is, however, a very special case, which is of no interest. We need sufficient conditions for the existence and uniqueness of θ*. 17
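In a one-dimensional toy model the oracle parameter can be found by brute force: compute D_JS(p, p_θ) on a grid over Θ and take the minimizer. The Laplace target and Gaussian model below echo the experiments shown later in the talk, but the grid and parameter ranges are illustrative choices.

```python
import numpy as np

x = np.linspace(-30, 30, 60001)
dx = x[1] - x[0]

def js(p, q):
    """Jensen-Shannon divergence on the grid; masks guard against underflowed densities."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy setup: Laplace target p, Gaussian model p_theta with variance theta.
# The oracle theta* minimizes theta -> D_JS(p, p_theta) over the grid standing in for Theta.
b = 1.5
p = np.exp(-np.abs(x) / b) / (2 * b)
thetas = np.linspace(0.5, 10.0, 200)
divs = [js(p, np.exp(-x**2 / (2 * t)) / np.sqrt(2 * np.pi * t)) for t in thetas]
theta_star = thetas[int(np.argmin(divs))]
print(theta_star, min(divs))
```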

58 Oracle parameter Theorem Assume that the model {P_θ}_{θ∈Θ} is identifiable, convex, and compact for the metric δ. Assume, in addition, that there exist 0 < m ≤ M such that m ≤ p ≤ M and, for all θ ∈ Θ, p_θ ≤ M. Then there exists a unique θ* ∈ Θ such that {θ*} = argmin_{θ∈Θ} D_JS(p, p_θ). Compactness conditions (i) Θ is compact and {P_θ}_{θ∈Θ} is convex. (ii) For all x ∈ E, the function θ ↦ p_θ(x) is continuous on Θ. (iii) One has sup_{(θ,θ')∈Θ²} p_θ |ln p_θ'| ∈ L¹(µ). 18

59 Approximation properties

64 Parameterized discriminators The assumption D = D_∞ is comfortable but questionable. In practice, one always has D = {D_α}_{α∈Λ}, Λ ⊆ R^q. GANs are a likelihood-type problem involving two parametric families. It is logical to consider the parameter θ̄ ∈ Θ defined by sup_{D∈D} L(θ̄, D) ≤ sup_{D∈D} L(θ, D), for all θ ∈ Θ. The density p_θ̄ is the best candidate to imitate p_θ*. Objective: quantify the proximity between p_θ̄ and p_θ*. 19

67 Approximation theorem Assumption (H_0) There exists a positive constant t ∈ (0, 1/2] such that min(D*_θ, 1 − D*_θ) ≥ t, for all θ ∈ Θ. Assumption (H_ε) There exist ε ∈ (0, t) and D ∈ D such that ‖D − D*_θ̄‖_∞ ≤ ε. Theorem Under Assumptions (H_0) and (H_ε), there exists a positive constant c such that 0 ≤ D_JS(p, p_θ̄) − D_JS(p, p_θ*) ≤ c ε². 20

68 Statistical analysis

73 Context Parametric generators G = {G_θ}_{θ∈Θ} and discriminators D = {D_α}_{α∈Λ}. The data-dependent parameter ˆθ is such that sup_{D∈D} ˆL(ˆθ, D) ≤ sup_{D∈D} ˆL(θ, D), for all θ ∈ Θ. Questions: Can we say something about D_JS(p, pˆθ) − D_JS(p, p_θ*)? Convergence of ˆθ towards θ̄ as n → ∞? Asymptotic distribution of ˆθ − θ̄? 21

75 Jensen-Shannon convergence Assumptions (H_reg) Regularity conditions of order 1 on the models. Assumption (H'_ε) There exists ε ∈ (0, t) such that: for all θ ∈ Θ, there exists D ∈ D such that ‖D − D*_θ‖_∞ ≤ ε. Theorem Under Assumptions (H_0), (H_reg), and (H'_ε), one has E D_JS(p, pˆθ) − D_JS(p, p_θ*) = O(ε² + 1/√n). 22

76 Illustration Target: the logistic density p(x) = e^(−x/s) / (s (1 + e^(−x/s))²), with scale parameter s. G and D: fully connected neural networks. The sample size n is fixed and the number of layers varies. 23
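The experiment can be reproduced in spirit with a few lines of PyTorch: samples from a logistic target, a fully connected generator and discriminator, and the usual alternating gradient updates on the criterion. All concrete values (scale s, sample size, widths, depths, learning rates, number of steps) are assumptions for illustration, not the talk's exact settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mlp(in_dim, out_dim, depth, width=20, out_act=None):
    # Fully connected network; depth counts hidden layers (widths are illustrative).
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

s, n = 1.0, 2000                              # scale and sample size: assumed values
U = torch.rand(n, 1).clamp(1e-6, 1 - 1e-6)
X = s * torch.log(U / (1 - U))                # inverse-CDF sample from the logistic density

G = mlp(1, 1, depth=3)                        # generator (depth 3, as in one of the figures)
D = mlp(1, 1, depth=2, out_act=nn.Sigmoid())  # discriminator, output in [0, 1]
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(500):                       # alternating gradient steps (standard GAN recipe)
    Z = torch.randn(n, 1)
    opt_D.zero_grad()                         # discriminator step: D(X) toward 1, D(G(Z)) toward 0
    loss_D = bce(D(X), torch.ones(n, 1)) + bce(D(G(Z).detach()), torch.zeros(n, 1))
    loss_D.backward()
    opt_D.step()
    opt_G.zero_grad()                         # generator step: push D(G(Z)) toward 1
    loss_G = bce(D(G(Z)), torch.ones(n, 1))
    loss_G.backward()
    opt_G.step()

print(float(loss_D), float(loss_G))
```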

78 Discriminator depth = 2, generator depth = 3 24

79 Discriminator depth = 5, generator depth = 3 25

80 video 26

82 Convergence of ˆθ Assumptions (H'_reg) Regularity conditions of order 2 on the models. Assumption (H_1) The pair (θ̄, ᾱ) is unique and belongs to the interior of Θ × Λ. Theorem Under Assumptions (H'_reg) and (H_1), one has ˆθ → θ̄ almost surely and ˆα → ᾱ almost surely. 27

83 Targets p 28

84 Generators and discriminators In each case Λ = Θ × Θ and the discriminators are D_α(x) = 1 / (1 + (α_1/α_0) e^((x²/2)(α_1⁻² − α_0⁻²))), with α = (α_0, α_1).
Laplace-Gaussian: target p(x) = (1/(2b)) e^(−|x|/b) with b = 1.5; model p_θ(x) = (1/√(2πθ)) e^(−x²/(2θ)), Θ = [10⁻¹, 10³].
Claw-Gaussian: target p_claw(x); model p_θ(x) = (1/√(2πθ)) e^(−x²/(2θ)), Θ = [10⁻¹, 10³].
Exponential-Uniform: target λ e^(−λx) with λ = 1; model p_θ(x) = (1/θ) 1_[0,θ](x), Θ = [10⁻³, 10³]. 29
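The target densities and model families in this table are simple enough to write down directly. In the sketch below, the claw density is assumed to be the standard Marron-Wand claw mixture, and the discriminator formula follows the reconstruction of the table above; both are assumptions rather than verbatim definitions from the talk.

```python
import numpy as np

# Target densities (b = 1.5 and lambda = 1 as in the table; the claw density is assumed
# to be the Marron-Wand claw: 0.5 N(0,1) + sum over l of 0.1 N(l/2 - 1, 0.1^2)).
def laplace(x, b=1.5):
    return np.exp(-np.abs(x) / b) / (2 * b)

def claw(x):
    norm = lambda m, s: np.exp(-(x - m) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    return 0.5 * norm(0.0, 1.0) + sum(0.1 * norm(l / 2.0 - 1.0, 0.1) for l in range(5))

def exponential(x, lam=1.0):
    return np.where(x >= 0, lam * np.exp(-lam * x), 0.0)

# Model families: Gaussian with variance theta (first two rows) and Uniform[0, theta] (last row).
def gaussian_model(x, theta):
    return np.exp(-x**2 / (2 * theta)) / np.sqrt(2 * np.pi * theta)

def uniform_model(x, theta):
    return np.where((x >= 0) & (x <= theta), 1.0 / theta, 0.0)

# Discriminator family D_alpha (reconstruction of the table's formula; alpha = (alpha_0, alpha_1)).
def D_alpha(x, a0, a1):
    return 1.0 / (1.0 + (a1 / a0) * np.exp(0.5 * x**2 * (a1**-2 - a0**-2)))

x = np.linspace(-5, 5, 11)
print(laplace(x)[:3], gaussian_model(x, 2.0)[:3], D_alpha(x, 1.0, 2.0)[:3])
```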

85 Laplace-Gaussian 30

86 Laplace-Gaussian 31

87 Claw-Gaussian 32

88 Claw-Gaussian 33

89 Exponential-Uniform 34

90 Exponential-Uniform 35

92 Central limit theorem Assumptions (H_loc) Local smoothness conditions around (θ̄, ᾱ). Theorem Under Assumptions (H'_reg), (H_1), and (H_loc), one has √n (ˆθ − θ̄) → Z in distribution, where Z is a Gaussian random variable with mean 0 and variance V. 36
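One way to check such a statement empirically is to recompute ˆθ on many independent samples of size n and inspect the standardized values √n(ˆθ − θ̄). The helper below only performs that inspection; the replicates fed into it here are placeholder Gaussian draws, since actually recomputing the GAN estimator many times is beyond a short sketch.

```python
import numpy as np

def clt_diagnostics(theta_hats, theta_bar, n):
    """Standardize sqrt(n) * (theta_hat - theta_bar) over Monte Carlo replicates and report
    moments; under the central limit theorem the skewness should be near 0 and the
    kurtosis near 3, with variance estimating V."""
    z = np.sqrt(n) * (np.asarray(theta_hats, dtype=float) - theta_bar)
    V = z.var()
    s = z / np.sqrt(V)
    return {"V_hat": V, "skewness": (s**3).mean(), "kurtosis": (s**4).mean()}

# Placeholder replicates (in practice: re-run the GAN estimation on fresh samples of size n).
rng = np.random.default_rng(0)
fake_theta_hats = 2.0 + rng.normal(scale=0.3, size=500)   # stand-in values, for illustration only
print(clt_diagnostics(fake_theta_hats, theta_bar=2.0, n=100))
```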

93 Laplace-Gaussian 37

94 Claw-Gaussian 38

95 Exponential-Uniform 39
