Nishant Gurnani. GAN Reading Group. April 14th.
Why are these Papers Important?
- Recently a large number of GAN frameworks have been proposed: BGAN, LSGAN, DCGAN, DiscoGAN, ...
- Wasserstein GAN is yet another GAN training algorithm, but one backed by rigorous theory in addition to good empirical performance.
- WGAN removes the need to balance generator updates with discriminator updates, eliminating a key source of training instability in the original GAN.
- Empirical results show a correlation between the discriminator loss and perceptual quality, providing a rough measure of training progress.
- Goal: convince you that WGAN is the best GAN.
Outline
- Introduction
- Different Distances
- Standard GAN
- Wasserstein GAN
- Improved Wasserstein GAN
What does it mean to learn a probability distribution?
When learning generative models, we assume our data comes from some unknown distribution P_r. We want to learn a distribution P_θ that approximates P_r, where θ are the parameters of the distribution. There are two approaches for doing this:
1. Directly learn the probability density function P_θ and optimize it through maximum likelihood estimation.
2. Learn a function g_θ that transforms an existing distribution Z into P_θ. Here g_θ is a differentiable function, Z is a common distribution (usually uniform or Gaussian), and P_θ = g_θ(Z).
Maximum Likelihood approach is problematic
Recall that for continuous distributions P and Q the KL divergence is

    KL(P ‖ Q) = ∫ P(x) log (P(x) / Q(x)) dx

and given a density P_θ, the MLE objective is

    max_{θ ∈ R^d} (1/m) Σ_{i=1}^m log P_θ(x^(i))

In the limit (as m → ∞), samples appear according to the data distribution P_r, so

    lim_{m→∞} max_{θ ∈ R^d} (1/m) Σ_{i=1}^m log P_θ(x^(i))
        = max_{θ ∈ R^d} ∫ P_r(x) log P_θ(x) dx
        = min_{θ ∈ R^d} −∫ P_r(x) log P_θ(x) dx
        = min_{θ ∈ R^d} ∫ P_r(x) log (P_r(x) / P_θ(x)) dx   (adding ∫ P_r(x) log P_r(x) dx, a constant in θ)
        = min_{θ ∈ R^d} KL(P_r ‖ P_θ)

- Note: if P_θ = 0 at an x where P_r > 0, the KL divergence goes to +∞ (bad for the MLE if P_θ has low-dimensional support).
- The typical remedy is to add a noise term to the model distribution so that the density is defined everywhere.
- Unfortunately this introduces some error, and empirically people have needed to add a lot of random noise to make models train.
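The support-mismatch failure above can be seen in a tiny discrete sketch (the two distributions here are made up for illustration): KL blows up to infinity as soon as the model puts zero mass where the data has mass, and smoothing with noise restores finiteness at the cost of some modelling error.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.
    Returns infinity as soon as q is zero somewhere p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # 0 * log(0/q) is taken as 0
        if qi == 0.0:
            return math.inf     # P_theta = 0 where P_r > 0
        total += pi * math.log(pi / qi)
    return total

p_r     = [0.5, 0.5, 0.0]       # "data" distribution
p_theta = [0.5, 0.0, 0.5]       # model missing mass on the middle atom

print(kl(p_r, p_theta))         # inf: the MLE objective blows up

# Smoothing the model with a little noise makes KL finite again,
# at the cost of some modelling error:
eps = 1e-3
p_smooth = [(1 - eps) * q + eps / 3 for q in p_theta]
print(kl(p_r, p_smooth))        # finite
```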
Transforming an existing distribution
Shortcomings of the maximum likelihood approach motivate the second approach: learning a g_θ (a generator) to transform a known distribution Z. Advantages of this approach:
- Unlike densities, it can represent distributions confined to a low-dimensional manifold.
- It is very easy to generate samples: given a trained g_θ, simply sample random noise z ∼ Z and evaluate g_θ(z).
- VAEs and GANs are well-known examples of this approach.
  - VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms.
  - GANs offer much more flexibility, but their training is unstable.
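The "push noise through a function" idea can be sketched without any learning at all: here a fixed, hand-picked g (the inverse CDF of an exponential, an assumption for illustration, not a trained generator) transforms uniform noise into a target distribution, and sampling is just draw-and-apply.

```python
import math
import random

random.seed(0)

def g(z, lam=2.0):
    """Inverse-CDF transform: maps z ~ Uniform[0,1) to Exponential(lam)."""
    return -math.log(1.0 - z) / lam

# Sampling from P_theta = g(Z): draw noise, apply g. No density needed.
samples = [g(random.random()) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 3))  # should be near 1/lam = 0.5
```

A learned generator g_θ works the same way at sampling time; training just replaces the hand-picked formula with fitted parameters.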
Transforming an existing distribution
To train g_θ (and by extension P_θ), we need a measure of distance between distributions, i.e. d(P_r, P_θ).

Distance Properties
- A distance d' is weaker than a distance d if every sequence that converges under d converges under d'.
- Given a distance d, we can treat d(P_r, P_θ) as a loss function.
- We can minimize d(P_r, P_θ) with respect to θ as long as the mapping θ ↦ P_θ is continuous (true if g_θ is a neural network).
- How do we define d?
Different Distances
Distance definitions
Notation
- χ: a compact metric set (such as the space of images [0, 1]^d)
- Σ: the set of all Borel subsets of χ
- Prob(χ): the space of probability measures defined on χ

Total Variation (TV) distance

    δ(P_r, P_θ) = sup_{A ∈ Σ} |P_r(A) − P_θ(A)|

Kullback-Leibler (KL) divergence

    KL(P_r ‖ P_θ) = ∫ P_r(x) log (P_r(x) / P_θ(x)) dμ(x)

where both P_r and P_θ are assumed to be absolutely continuous with respect to the same measure μ defined on χ.
Distance Definitions
Jensen-Shannon (JS) divergence

    JS(P_r, P_θ) = (1/2) KL(P_r ‖ P_m) + (1/2) KL(P_θ ‖ P_m)

where P_m is the mixture (P_r + P_θ)/2.

Earth-Mover (EM) distance, or Wasserstein-1

    W(P_r, P_θ) = inf_{γ ∈ Π(P_r, P_θ)} E_{(x,y)∼γ}[‖x − y‖]

where Π(P_r, P_θ) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_θ.
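For discrete distributions these definitions reduce to short sums, which makes a quick sketch possible (using the convention with the ½ factors, so JS is bounded by log 2). The two example distributions are made up; note how JS saturates at log 2 the moment supports are disjoint.

```python
import math

def tv(p, q):
    """Total variation: sup_A |P(A) - Q(A)| = 0.5 * L1 distance (discrete case)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture always covers both supports
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.4, 0.6]
q = [0.5, 0.5]
print(tv(p, q))                 # 0.1
print(js(p, q))                 # small positive number

# Disjoint supports: JS saturates at log 2, no matter "how far apart" the atoms are.
a = [1.0, 0.0]
b = [0.0, 1.0]
print(js(a, b), math.log(2))
```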
Understanding the EM distance
Main Idea
- Probability distributions are defined by how much mass they put on each point.
- Imagine we started with distribution P_r and wanted to move mass around to change it into P_θ.
- Moving mass m by distance d costs m · d effort.
- The earth-mover distance is the minimal effort we need to spend.

Transport Plan
Each γ ∈ Π is a transport plan; to execute the plan, for all x, y, move γ(x, y) mass from x to y.
Understanding the EM distance
What properties does the plan need to satisfy to transform P_r into P_θ?
- The amount of mass that leaves x is ∫_y γ(x, y) dy. This must equal P_r(x), the amount of mass originally at x.
- The amount of mass that enters y is ∫_x γ(x, y) dx. This must equal P_θ(y), the amount of mass that ends up at y.
- This shows why the marginals of γ ∈ Π must be P_r and P_θ.

For scoring, the effort spent is

    ∫_x ∫_y γ(x, y) ‖x − y‖ dy dx = E_{(x,y)∼γ}[‖x − y‖]

Taking the infimum of this over all valid γ gives the earth-mover distance.
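In one dimension the infimum over transport plans has a known closed form, W1 = ∫ |F_P(t) − F_Q(t)| dt, which makes the "minimal effort" concrete without solving an optimization problem. A sketch for two distributions on a shared grid (the grid and masses below are made up):

```python
def w1_on_grid(p, q, xs):
    """Wasserstein-1 between two discrete distributions supported on the
    sorted grid xs, via the 1-D CDF formula W1 = sum |F_p - F_q| * dx."""
    fp = fq = 0.0
    total = 0.0
    for i in range(len(xs) - 1):
        fp += p[i]                            # running CDFs up to xs[i]
        fq += q[i]
        total += abs(fp - fq) * (xs[i + 1] - xs[i])
    return total

xs = [0.0, 1.0, 2.0, 3.0]
p  = [1.0, 0.0, 0.0, 0.0]    # all mass at x = 0
q  = [0.0, 0.0, 0.0, 1.0]    # all mass at x = 3
print(w1_on_grid(p, q, xs))  # 3.0: move 1 unit of mass a distance of 3
```

Moving half the mass one step instead costs half as much: `w1_on_grid(p, [0.5, 0.5, 0.0, 0.0], xs)` gives 0.5, exactly the "mass times distance" intuition above.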
Learning Parallel Lines Example
Let Z ∼ U[0, 1] and let P_0 be the distribution of (0, Z) ∈ R², uniform on a straight vertical line passing through the origin. Now let g_θ(z) = (θ, z), with θ a single real parameter.
We'd like our optimization algorithm to learn to move θ to 0. As θ → 0, the distance d(P_0, P_θ) should decrease.
Learning Parallel Lines Example
For many common distance functions, this doesn't happen:

    δ(P_0, P_θ)   = 1 if θ ≠ 0, 0 if θ = 0
    JS(P_0, P_θ)  = log 2 if θ ≠ 0, 0 if θ = 0
    KL(P_θ ‖ P_0) = KL(P_0 ‖ P_θ) = +∞ if θ ≠ 0, 0 if θ = 0
    W(P_0, P_θ)   = |θ|
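A discrete stand-in for the parallel-lines example (n atoms per line, a simplification of the continuous version) makes the table's punchline computable: JS is a constant log 2 for every θ ≠ 0, so it gives no gradient signal, while W shrinks smoothly with |θ|.

```python
import math

n = 5  # atoms per line: P_0 at (0, k/n), P_theta at (theta, k/n)

def js_two_lines(theta):
    """JS between the two discrete line distributions. For theta != 0 the
    supports are disjoint: the mixture halves each atom's mass, so each
    KL(P || P_m) term equals log 2 regardless of how far apart the lines are."""
    if theta == 0.0:
        return 0.0
    return 0.5 * math.log(2) + 0.5 * math.log(2)

def w_two_lines(theta):
    """Optimal plan matches (0, k/n) to (theta, k/n): each atom of mass 1/n
    moves a distance |theta|."""
    return sum((1 / n) * abs(theta) for _ in range(n))

print([round(js_two_lines(t), 3) for t in (1.0, 0.5, 0.01)])  # constant log 2
print([round(w_two_lines(t), 3) for t in (1.0, 0.5, 0.01)])   # shrinks with theta
```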
Theoretical Justification
Theorem (1). Let P_r be a fixed distribution over χ. Let Z be a random variable over another space Z. Let g : Z × R^d → χ be a function, denoted g_θ(z) with z the first coordinate and θ the second. Let P_θ denote the distribution of g_θ(Z). Then:
1. If g is continuous in θ, so is W(P_r, P_θ).
2. If g is locally Lipschitz and satisfies regularity assumption 1, then W(P_r, P_θ) is continuous everywhere and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence JS(P_r, P_θ) and all the KLs.
Theoretical Justification
Theorem (2). Let P be a distribution on a compact space χ and (P_n)_{n ∈ N} a sequence of distributions on χ. Then, considering all limits as n → ∞:
1. The following statements are equivalent: δ(P_n, P) → 0; JS(P_n, P) → 0.
2. The following statements are equivalent: W(P_n, P) → 0; P_n →_D P, where →_D represents convergence in distribution for random variables.
3. KL(P_n ‖ P) → 0 or KL(P ‖ P_n) → 0 imply the statements in (1).
4. The statements in (1) imply the statements in (2).
Standard GAN
Generative Adversarial Networks
Recall that the GAN training strategy is to define a game between two competing networks:
- The generator network G maps a source of noise to the input space.
- The discriminator network D receives either a generated sample or a true data sample and must distinguish between the two.
- The generator is trained to fool the discriminator.
Generative Adversarial Networks
Formally, we can express the game between the generator G and the discriminator D with the minimax objective:

    min_G max_D E_{x∼P_r}[log D(x)] + E_{x̃∼P_g}[log(1 − D(x̃))]

where P_r is the data distribution and P_g is the model distribution implicitly defined by x̃ = G(z), z ∼ p(z).
Generative Adversarial Networks
Remarks
- If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions.
- This is expensive and often leads to vanishing gradients as the discriminator saturates.
- In practice, this requirement is relaxed, and the generator and discriminator are updated simultaneously.
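The JS connection in the first remark can be checked numerically on toy discrete distributions (the two distributions below are made up): plugging the optimal discriminator D*(x) = P_r(x) / (P_r(x) + P_g(x)) into the value function gives exactly 2·JS(P_r, P_g) − 2 log 2, with JS defined using the usual ½ factors.

```python
import math

p_r = [0.2, 0.5, 0.3]   # toy data distribution
p_g = [0.4, 0.4, 0.2]   # toy model distribution (fixed generator)

# Optimal discriminator for this fixed generator
d_star = [pr / (pr + pg) for pr, pg in zip(p_r, p_g)]

# GAN value function at D*: E_Pr[log D*] + E_Pg[log(1 - D*)]
value = sum(pr * math.log(d) for pr, d in zip(p_r, d_star)) \
      + sum(pg * math.log(1 - d) for pg, d in zip(p_g, d_star))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_m = [(pr + pg) / 2 for pr, pg in zip(p_r, p_g)]
js = 0.5 * kl(p_r, p_m) + 0.5 * kl(p_g, p_m)

print(value, 2 * js - 2 * math.log(2))  # the two numbers agree
```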
Wasserstein GAN
Kantorovich-Rubinstein Duality
Unfortunately, computing the Wasserstein distance exactly is intractable:

    W(P_r, P_θ) = inf_{γ ∈ Π(P_r, P_θ)} E_{(x,y)∼γ}[‖x − y‖]

However, the Kantorovich-Rubinstein duality (Villani 2008) shows that W is equivalent to

    W(P_r, P_θ) = sup_{‖f‖_L ≤ 1} E_{x∼P_r}[f(x)] − E_{x∼P_θ}[f(x)]

where the supremum is taken over all 1-Lipschitz functions. Calculating this is still intractable, but now it is easier to approximate.
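The duality can be sanity-checked in the simplest possible case, a pure translation (the shift value below is arbitrary): the coupling (x, x + c) gives a primal upper bound of c, while the 1-Lipschitz function f(x) = x gives a dual lower bound of c, so the two sides pinch W to exactly c.

```python
import random

random.seed(1)
c = 1.7  # translation between the two distributions

xs = [random.gauss(0.0, 1.0) for _ in range(50_000)]
ys = [x + c for x in xs]  # P_theta is P_r shifted by c

# Primal side: the coupling (x, x + c) is feasible, so W <= mean |x - y| = c.
primal_upper = sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Dual side: f(x) = x is 1-Lipschitz, so W >= E_Pr[f] - E_Ptheta[f] in magnitude.
dual_lower = abs(sum(xs) / len(xs) - sum(ys) / len(ys))

print(primal_upper, dual_lower)  # both ~1.7, so W(P_r, P_theta) = 1.7
```

The sandwich argument (dual lower bound = primal upper bound) is exactly what makes the dual form useful: any family of Lipschitz functions gives a lower bound on W, and a rich enough family makes the bound tight.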
Wasserstein GAN Approximation
- Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, the supremum is K · W(P_r, P_θ) instead.
- Suppose we have a parametrized function family {f_w}_{w∈W}, where w are the weights and W is the set of all possible weights. Furthermore, suppose these functions are all K-Lipschitz for some K. Then we have

    max_{w∈W} E_{x∼P_r}[f_w(x)] − E_{x∼P_θ}[f_w(x)] ≤ sup_{‖f‖_L ≤ K} E_{x∼P_r}[f(x)] − E_{x∼P_θ}[f(x)] = K · W(P_r, P_θ)

- If {f_w}_{w∈W} contains the true supremum among K-Lipschitz functions, this gives the distance exactly. In practice this won't be true!
Wasserstein GAN Algorithm
Looping all this back to generative models, we'd like to train P_θ = g_θ(Z) to match P_r. Intuitively, given a fixed g_θ, we can compute the optimal f_w for the Wasserstein distance. We can then backpropagate through W(P_r, g_θ(Z)) to get the gradient for θ:

    ∇_θ W(P_r, P_θ) = ∇_θ (E_{x∼P_r}[f_w(x)] − E_{z∼Z}[f_w(g_θ(z))])
                    = −E_{z∼Z}[∇_θ f_w(g_θ(z))]
Wasserstein GAN Algorithm
The training process now has three steps:
1. For a fixed θ, compute an approximation of W(P_r, P_θ) by training f_w to convergence.
2. Once we have the optimal f_w, compute the θ gradient −E_{z∼Z}[∇_θ f_w(g_θ(z))] by sampling several z ∼ Z.
3. Update θ, and repeat the process.

Important Detail
The entire derivation only works when the function family {f_w}_{w∈W} is K-Lipschitz. To guarantee this, the authors use weight clamping: the weights w are constrained to lie within [−c, c] by clipping w after every update.
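The three steps plus weight clamping can be sketched in a deliberately tiny setting: a one-parameter generator g_θ(z) = z + θ and a one-parameter linear critic f_w(x) = w·x (both made up for illustration; this is a toy, not the paper's full algorithm). For this critic the inner maximization has a closed form, since the objective is linear in w and clipping confines w to [−c, c].

```python
import random

random.seed(0)
TRUE_SHIFT = 3.0          # real data: x = z + 3, z ~ N(0, 1)
theta = 0.0               # generator parameter: g_theta(z) = z + theta
clip_c, lr_g = 1.0, 0.05

for step in range(200):
    zs = [random.gauss(0.0, 1.0) for _ in range(1000)]
    real = [random.gauss(0.0, 1.0) + TRUE_SHIFT for _ in range(1000)]
    fake = [z + theta for z in zs]

    # Step 1: "train" the critic f_w(x) = w * x to convergence. Its objective
    # w * (E[real] - E[fake]) is linear in w, so under weight clipping the
    # optimum sits at a boundary of [-c, c]: w = c * sign(E[real] - E[fake]).
    diff = sum(real) / len(real) - sum(fake) / len(fake)
    w = clip_c if diff >= 0 else -clip_c

    # Step 2: the generator gradient -E_z[ d/dtheta f_w(z + theta) ] is -w.
    # Step 3: gradient-descent update on theta, then repeat.
    theta -= lr_g * (-w)

print(round(theta, 2))  # settles near the true shift, 3.0
```

Note how the critic's W estimate (here just `diff`) keeps giving the generator a useful direction even when the two distributions barely overlap, which is exactly the property the parallel-lines example motivated.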
Comparison with Standard GANs
- In GANs, the discriminator maximizes

    (1/m) Σ_{i=1}^m log D(x^(i)) + (1/m) Σ_{i=1}^m log(1 − D(g_θ(z^(i))))

  where we constrain D(x) to always be a probability p ∈ (0, 1).
- In WGANs, nothing requires f_w to output a probability, hence it is referred to as a critic instead of a discriminator.
- Although GANs are formulated as a minimax problem, in practice we never train D to convergence. Consequently, we're updating G against an objective that roughly aims toward the JS divergence, but doesn't go all the way.
- In contrast, because the Wasserstein distance is differentiable nearly everywhere, we can (and should) train f_w to convergence before each generator update, to get as accurate an estimate of W(P_r, P_θ) as possible.
Improved Wasserstein GAN
Properties of the optimal WGAN critic
- An open question is how to effectively enforce the Lipschitz constraint on the critic.
- Previously (Arjovsky et al. 2017) we've seen that you can clip the weights of the critic to lie within the compact space [−c, c].
- To understand why weight clipping is problematic, we need to understand the properties of the optimal WGAN critic.
Properties of the optimal WGAN critic
If the optimal critic D* under the Kantorovich-Rubinstein dual is differentiable, and x is a point from our generator distribution P_θ, then there is a point y from the true distribution P_r such that at every point x_t = (1 − t)x + ty on the straight line between x and y, the gradient of D* points from x toward y. In other words,

    ∇D*(x_t) = (y − x_t) / ‖y − x_t‖

This implies that the optimal WGAN critic has gradients with norm 1 almost everywhere under P_r and P_θ.
Gradient penalty
- We consider an alternative way to enforce the Lipschitz constraint through the training objective.
- A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere.
- This suggests directly constraining the gradient norm of the critic with respect to its input.
- Enforcing a soft version of this constraint, we get the objective

    L = E_{x̃∼P_θ}[D(x̃)] − E_{x∼P_r}[D(x)] + λ E_{x̂∼P_x̂}[(‖∇_x̂ D(x̂)‖₂ − 1)²]

where x̂ is sampled uniformly along straight lines between pairs of points drawn from P_r and P_θ.
WGAN with gradient penalty
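The penalty term can be seen in miniature with a linear critic f(x) = w·x, whose input gradient has norm |w| everywhere (a made-up one-parameter sketch, not the full WGAN-GP training loop): gradient descent on λ(|w| − 1)² drives the critic toward gradient norm 1, i.e. toward the boundary of the 1-Lipschitz constraint, without any hard clipping.

```python
# For a linear critic f(x) = w * x, the input-gradient norm ||grad_x f|| is |w|
# everywhere, so the penalty (||grad_x f|| - 1)^2 reduces to (|w| - 1)^2.
lam = 10.0   # penalty coefficient
w = 5.0      # start far from the Lipschitz-1 region
lr = 0.01

for _ in range(500):
    # d/dw [ lam * (|w| - 1)^2 ] = 2 * lam * (|w| - 1) * sign(w)
    sign = 1.0 if w >= 0 else -1.0
    grad = 2.0 * lam * (abs(w) - 1.0) * sign
    w -= lr * grad

print(round(w, 4))  # the penalty has pushed |w| to ~1
```

Because the constraint is soft, the critic is nudged rather than truncated, which avoids the pathologies of weight clipping (capacity loss and exploding or vanishing gradients) that motivated the gradient penalty in the first place.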
More informationGenerative Adversarial Networks
Generative Adversarial Networks Stefano Ermon, Aditya Grover Stanford University Lecture 10 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 10 1 / 17 Selected GANs https://github.com/hindupuravinash/the-gan-zoo
More informationEnergy-Based Generative Adversarial Network
Energy-Based Generative Adversarial Network Energy-Based Generative Adversarial Network J. Zhao, M. Mathieu and Y. LeCun Learning to Draw Samples: With Application to Amoritized MLE for Generalized Adversarial
More informationChapter 20. Deep Generative Models
Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationSome theoretical properties of GANs. Gérard Biau Toulouse, September 2018
Some theoretical properties of GANs Gérard Biau Toulouse, September 2018 Coauthors Benoît Cadre (ENS Rennes) Maxime Sangnier (Sorbonne University) Ugo Tanielian (Sorbonne University & Criteo) 1 video Source:
More informationTheory and Applications of Wasserstein Distance. Yunsheng Bai Mar 2, 2018
Theory and Applications of Wasserstein Distance Yunsheng Bai Mar 2, 2018 Roadmap 1. 2. 3. Why Study Wasserstein Distance? Elementary Distances Between Two Distributions Applications of Wasserstein Distance
More informationSeries 7, May 22, 2018 (EM Convergence)
Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18
More informationGenerative Adversarial Networks. Presented by Yi Zhang
Generative Adversarial Networks Presented by Yi Zhang Deep Generative Models N(O, I) Variational Auto-Encoders GANs Unreasonable Effectiveness of GANs GANs Discriminator tries to distinguish genuine data
More informationGenerative adversarial networks
14-1: Generative adversarial networks Prof. J.C. Kao, UCLA Generative adversarial networks Why GANs? GAN intuition GAN equilibrium GAN implementation Practical considerations Much of these notes are based
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationGradient descent GAN optimization is locally stable
Gradient descent GAN optimization is locally stable Advances in Neural Information Processing Systems, 2017 Vaishnavh Nagarajan J. Zico Kolter Carnegie Mellon University 05 January 2018 Presented by: Kevin
More informationInformation theoretic perspectives on learning algorithms
Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez
More informationarxiv: v3 [cs.lg] 2 Nov 2018
PacGAN: The power of two samples in generative adversarial networks Zinan Lin, Ashish Khetan, Giulia Fanti, Sewoong Oh Carnegie Mellon University, University of Illinois at Urbana-Champaign arxiv:72.486v3
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of
More informationRobustness and duality of maximum entropy and exponential family distributions
Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat
More informationarxiv: v3 [stat.ml] 6 Dec 2017
Wasserstein GAN arxiv:1701.07875v3 [stat.ml] 6 Dec 2017 Martin Arjovsky 1, Soumith Chintala 2, and Léon Bottou 1,2 1 Introduction 1 Courant Institute of Mathematical Sciences 2 Facebook AI Research The
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationGenerative Adversarial Networks (GANs) Ian Goodfellow, OpenAI Research Scientist Presentation at Berkeley Artificial Intelligence Lab,
Generative Adversarial Networks (GANs) Ian Goodfellow, OpenAI Research Scientist Presentation at Berkeley Artificial Intelligence Lab, 2016-08-31 Generative Modeling Density estimation Sample generation
More informationAuto-Encoding Variational Bayes
Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling June 18, 2018 Diederik P Kingma, Max Welling Auto-Encoding Variational Bayes June 18, 2018 1 / 39 Outline 1 Introduction 2 Variational Lower
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationExpectation Maximization
Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger
More informationOptimal Transport and Wasserstein Distance
Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationWasserstein Generative Adversarial Networks
Martin Arjovsky 1 Soumith Chintala 2 Léon Bottou 1 2 Abstract We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationNormed and Banach spaces
Normed and Banach spaces László Erdős Nov 11, 2006 1 Norms We recall that the norm is a function on a vectorspace V, : V R +, satisfying the following properties x + y x + y cx = c x x = 0 x = 0 We always
More informationTheory of Probability Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 8.75 Theory of Probability Fall 008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Section Prekopa-Leindler inequality,
More informationarxiv: v1 [stat.ml] 19 Jan 2018
Composite Functional Gradient Learning of Generative Adversarial Models arxiv:80.06309v [stat.ml] 9 Jan 208 Rie Johnson RJ Research Consulting Tarrytown, NY, USA riejohnson@gmail.com Abstract Tong Zhang
More informationarxiv: v1 [cs.lg] 20 Apr 2017
Softmax GAN Min Lin Qihoo 360 Technology co. ltd Beijing, China, 0087 mavenlin@gmail.com arxiv:704.069v [cs.lg] 0 Apr 07 Abstract Softmax GAN is a novel variant of Generative Adversarial Network (GAN).
More informationDiscrete transport problems and the concavity of entropy
Discrete transport problems and the concavity of entropy Bristol Probability and Statistics Seminar, March 2014 Funded by EPSRC Information Geometry of Graphs EP/I009450/1 Paper arxiv:1303.3381 Motivating
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationDeep Bayesian Inversion Computational uncertainty quantification for large scale inverse problems
Deep Bayesian Inversion Computational uncertainty quantification for large scale inverse problems arxiv:1811.05910v1 stat.ml] 14 Nov 018 Jonas Adler Department of Mathematics KTH - Royal institute of Technology
More informationEnergy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13
Energy Based Models Stefano Ermon, Aditya Grover Stanford University Lecture 13 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 13 1 / 21 Summary Story so far Representation: Latent
More informationDo you like to be successful? Able to see the big picture
Do you like to be successful? Able to see the big picture 1 Are you able to recognise a scientific GEM 2 How to recognise good work? suggestions please item#1 1st of its kind item#2 solve problem item#3
More informationWasserstein Training of Boltzmann Machines
Wasserstein Training of Boltzmann Machines Grégoire Montavon, Klaus-Rober Muller, Marco Cuturi Presenter: Shiyu Liang December 1, 2016 Coordinated Science Laboratory Department of Electrical and Computer
More informationDistirbutional robustness, regularizing variance, and adversaries
Distirbutional robustness, regularizing variance, and adversaries John Duchi Based on joint work with Hongseok Namkoong and Aman Sinha Stanford University November 2017 Motivation We do not want machine-learned
More informationThe power of two samples in generative adversarial networks
The power of two samples in generative adversarial networks Sewoong Oh Department of Industrial and Enterprise Systems Engineering University of Illinois at Urbana-Champaign / 2 Generative models learn
More informationLecture 5 Channel Coding over Continuous Channels
Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From
More informationLatent Variable Models
Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:
More informationNon-Convex Optimization. CS6787 Lecture 7 Fall 2017
Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper
More informationU Logo Use Guidelines
Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity
More informationQuantitative Biology II Lecture 4: Variational Methods
10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate
More informationApplications of Information Geometry to Hypothesis Testing and Signal Detection
CMCAA 2016 Applications of Information Geometry to Hypothesis Testing and Signal Detection Yongqiang Cheng National University of Defense Technology July 2016 Outline 1. Principles of Information Geometry
More informationSummary of A Few Recent Papers about Discrete Generative models
Summary of A Few Recent Papers about Discrete Generative models Presenter: Ji Gao Department of Computer Science, University of Virginia https://qdata.github.io/deep2read/ Outline SeqGAN BGAN: Boundary
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationOptimal Transport Methods in Operations Research and Statistics
Optimal Transport Methods in Operations Research and Statistics Jose Blanchet (based on work with F. He, Y. Kang, K. Murthy, F. Zhang). Stanford University (Management Science and Engineering), and Columbia
More informationSupport Vector Machines and Bayes Regression
Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering
More informationInformation geometry for bivariate distribution control
Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic
More informationEstimation of Dynamic Regression Models
University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationVariational Autoencoder
Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational
More informationGenerative MaxEnt Learning for Multiclass Classification
Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,
More informationReview and Motivation
Review and Motivation We can model and visualize multimodal datasets by using multiple unimodal (Gaussian-like) clusters. K-means gives us a way of partitioning points into N clusters. Once we know which
More informationIan Goodfellow, Staff Research Scientist, Google Brain. Seminar at CERN Geneva,
MedGAN ID-CGAN CoGAN LR-GAN CGAN IcGAN b-gan LS-GAN AffGAN LAPGAN DiscoGANMPM-GAN AdaGAN LSGAN InfoGAN CatGAN AMGAN igan IAN Open Challenges for Improving GANs McGAN Ian Goodfellow, Staff Research Scientist,
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms
More informationECO Class 6 Nonparametric Econometrics
ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationGANs. Machine Learning: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM GRAHAM NEUBIG
GANs Machine Learning: Jordan Boyd-Graber University of Maryland SLIDES ADAPTED FROM GRAHAM NEUBIG Machine Learning: Jordan Boyd-Graber UMD GANs 1 / 7 Problems with Generation Generative Models Ain t Perfect
More informationGaussian Mixture Models
Gaussian Mixture Models David Rosenberg, Brett Bernstein New York University April 26, 2017 David Rosenberg, Brett Bernstein (New York University) DS-GA 1003 April 26, 2017 1 / 42 Intro Question Intro
More informationGenerative Adversarial Networks
Generative Adversarial Networks SIBGRAPI 2017 Tutorial Everything you wanted to know about Deep Learning for Computer Vision but were afraid to ask Presentation content inspired by Ian Goodfellow s tutorial
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationGANs, GANs everywhere
GANs, GANs everywhere particularly, in High Energy Physics Maxim Borisyak Yandex, NRU Higher School of Economics Generative Generative models Given samples of a random variable X find X such as: P X P
More informationAlgorithms for Variational Learning of Mixture of Gaussians
Algorithms for Variational Learning of Mixture of Gaussians Instructors: Tapani Raiko and Antti Honkela Bayes Group Adaptive Informatics Research Center 28.08.2008 Variational Bayesian Inference Mixture
More informationA Few Notes on Fisher Information (WIP)
A Few Notes on Fisher Information (WIP) David Meyer dmm@{-4-5.net,uoregon.edu} Last update: April 30, 208 Definitions There are so many interesting things about Fisher Information and its theoretical properties
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationSinging Voice Separation using Generative Adversarial Networks
Singing Voice Separation using Generative Adversarial Networks Hyeong-seok Choi, Kyogu Lee Music and Audio Research Group Graduate School of Convergence Science and Technology Seoul National University
More informationGeometrical Insights for Implicit Generative Modeling
Geometrical Insights for Implicit Generative Modeling Leon Bottou a,b, Martin Arjovsky b, David Lopez-Paz a, Maxime Oquab a,c a Facebook AI Research, New York, Paris b New York University, New York c Inria,
More informationStability results for Logarithmic Sobolev inequality
Stability results for Logarithmic Sobolev inequality Daesung Kim (joint work with Emanuel Indrei) Department of Mathematics Purdue University September 20, 2017 Daesung Kim (Purdue) Stability for LSI Probability
More informationg-priors for Linear Regression
Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationWeek 3: The EM algorithm
Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent
More informationA Unified View of Deep Generative Models
SAILING LAB Laboratory for Statistical Artificial InteLigence & INtegreative Genomics A Unified View of Deep Generative Models Zhiting Hu and Eric Xing Petuum Inc. Carnegie Mellon University 1 Deep generative
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationAdvanced Machine Learning & Perception
Advanced Machine Learning & Perception Instructor: Tony Jebara Topic 6 Standard Kernels Unusual Input Spaces for Kernels String Kernels Probabilistic Kernels Fisher Kernels Probability Product Kernels
More informationGenerative Models and Optimal Transport
Generative Models and Optimal Transport Marco Cuturi Joint work / work in progress with G. Peyré, A. Genevay (ENS), F. Bach (INRIA), G. Montavon, K-R Müller (TU Berlin) Statistics 0.1 : Density Fitting
More informationLecture 2: Statistical Decision Theory (Part I)
Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical
More informationMachine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)
More informationOptimal Transport in Risk Analysis
Optimal Transport in Risk Analysis Jose Blanchet (based on work with Y. Kang and K. Murthy) Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationA new Hellinger-Kantorovich distance between positive measures and optimal Entropy-Transport problems
A new Hellinger-Kantorovich distance between positive measures and optimal Entropy-Transport problems Giuseppe Savaré http://www.imati.cnr.it/ savare Dipartimento di Matematica, Università di Pavia Nonlocal
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationSupplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization
Supplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization Sebastian Nowozin, Botond Cseke, Ryota Tomioka Machine Intelligence and Perception Group
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More information