Deep Learning for Computer Vision
1 Deep Learning for Computer Vision, Spring 2018. FB: DLCV Spring 2018. Yu-Chiang Frank Wang 王鈺強, Associate Professor, Dept. of Electrical Engineering, National Taiwan University. 2018/05/09
2 What's to Be Covered Today. Visualization of NNs: t-SNE, visualization of feature maps, neural style transfer. Understanding NNs: image translation, feature disentanglement. Recurrent Neural Networks: LSTM & GRU. Many slides from Fei-Fei Li, Yaser Sheikh, Simon Lucey, Kaiming He, J.-B. Huang, Yuying Yeh, and Hsuan-I Ho
3 Representation Disentanglement. Goal: interpretable deep feature representation; disentangle the attribute of interest c from the derived latent representation z. Unsupervised: InfoGAN (Chen et al., NIPS 16). Supervised: AC-GAN (Odena et al., ICML 17). Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets." NIPS 16. Odena et al., "Conditional image synthesis with auxiliary classifier GANs." ICML 17
4 AC-GAN Supervised Disentanglement Learning
Overall objective function: G* = arg min_G max_D L_GAN(G, D) + L_cls(G, D)
Adversarial loss: L_GAN(G, D) = E[log(1 − D(G(z, c)))] + E[log D(y)]
Disentanglement loss: L_cls(G, D) = E[log D_cls(c | y)] + E[log D_cls(c | G(z, c))] (real data w.r.t. its domain label; generated data w.r.t. the assigned label)
(Figure: generator G takes noise z and supervised attribute c; discriminator D sees G(z, c) and real data y)
Odena et al., "Conditional image synthesis with auxiliary classifier GANs." ICML 17
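To make the two AC-GAN terms concrete, here is a minimal sketch on scalar probabilities (not the real model: `acgan_losses` and its toy inputs are hypothetical, and real implementations work on batches with cross-entropy layers):

```python
import math

def acgan_losses(d_real, d_fake, p_cls_real, p_cls_fake):
    """Toy AC-GAN objective on scalar probabilities.

    d_real:     D's probability that a real sample y is real
    d_fake:     D's probability that a generated sample G(z, c) is real
    p_cls_real: D_cls's probability of the true label c given y
    p_cls_fake: D_cls's probability of the assigned label c given G(z, c)
    """
    # Adversarial loss: E[log(1 - D(G(z, c)))] + E[log D(y)]
    l_gan = math.log(1.0 - d_fake) + math.log(d_real)
    # Disentanglement loss: E[log D_cls(c | y)] + E[log D_cls(c | G(z, c))]
    l_cls = math.log(p_cls_real) + math.log(p_cls_fake)
    return l_gan, l_cls

l_gan, l_cls = acgan_losses(d_real=0.9, d_fake=0.2, p_cls_real=0.8, p_cls_fake=0.7)
```

The discriminator maximizes both terms, while the generator minimizes the adversarial term but still maximizes the classification term, which is what ties the attribute c to the generated output.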
5 AC-GAN Supervised Disentanglement. (Figure: varying the value of c in G(z, c) changes the corresponding attribute of the generated images.) Odena et al., "Conditional image synthesis with auxiliary classifier GANs." ICML 17
6 InfoGAN Unsupervised Disentanglement Learning
Overall objective function: G* = arg min_G max_D L_GAN(G, D) + L_cls(G, D)
Adversarial loss: L_GAN(G, D) = E[log(1 − D(G(z, c)))] + E[log D(y)]
Disentanglement loss: L_cls(G, D) = E[log D_cls(c | G(z, c))] (generated data w.r.t. the assigned label)
Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets." NIPS 16
7 InfoGAN Unsupervised Disentanglement. No guarantee of disentangling particular semantics. (Figure: varying c changes, e.g., the rotation angle or width of the generated digit; training loss curve over time.) Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets." NIPS 16
8 From Image Understanding to Image Manipulation Representation Disentanglement InfoGAN: Unsupervised representation disentanglement AC-GAN: Supervised representation disentanglement Image Translation Pix2pix (CVPR 17): Pairwise cross-domain training data CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data UNIT (NIPS 17): Learning cross-domain image representation (with unpaired training data) DTN (ICLR 17) : Learning cross-domain image representation (with unpaired training data) Joint Image Translation & Disentanglement StarGAN (CVPR 18) : Image translation via representation disentanglement CDRD (CVPR 18) : Cross-domain representation disentanglement and translation 8
9 Pix2pix Image-to-image translation with conditional adversarial networks (CVPR 17). Can be viewed as image style transfer (e.g., sketch → photo). Isola et al. "Image-to-image translation with conditional adversarial networks." CVPR 17
10 Pix2pix Goal / Problem Setting: image translation across two distinct domains (e.g., sketch vs. photo); pairwise training data.
Method: conditional GAN. Example: sketch to photo.
Generator (VAE): input sketch, output photo.
Discriminator: input is the concatenation of the input (sketch) and synthesized/real (photo) images; output is real or fake.
(Figure: training and testing phases)
Isola et al. "Image-to-image translation with conditional adversarial networks." CVPR 17
11 Pix2pix Learning the model
Overall objective function: G* = arg min_G max_D L_cGAN(G, D) + L_L1(G)
Conditional GAN loss (input and output concatenated for D): L_cGAN(G, D) = E_x[log(1 − D(x, G(x)))] + E_{x,y}[log D(x, y)]
Reconstruction loss: L_L1(G) = E_{x,y}[||y − G(x)||_1]
Isola et al. "Image-to-image translation with conditional adversarial networks." CVPR 17
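A minimal sketch of the generator-side pix2pix objective, treating images as flat lists of pixel values (the function names and toy inputs are illustrative; the paper weights the L1 term, commonly with λ = 100):

```python
import math

def l1(a, b):
    # ||y - G(x)||_1 over flattened "images"
    return sum(abs(p - q) for p, q in zip(a, b))

def pix2pix_generator_loss(d_fake, y, g_x, lam=100.0):
    """Toy generator loss: log(1 - D(x, G(x))) + lam * ||y - G(x)||_1.

    d_fake: D(x, G(x)), the discriminator's score for the generated pair
    y:      ground-truth target image (flat list)
    g_x:    generated image G(x) (flat list)
    """
    return math.log(1.0 - d_fake) + lam * l1(y, g_x)

loss = pix2pix_generator_loss(d_fake=0.3, y=[0.5, 0.5], g_x=[0.4, 0.6], lam=100.0)
```

The L1 term pulls G(x) toward the paired ground truth y, which is exactly why pix2pix needs pairwise training data.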
12 Pix2pix Experiment results. Demo page: Isola et al. "Image-to-image translation with conditional adversarial networks." CVPR 17
13 From Image Understanding to Image Manipulation Representation Disentanglement InfoGAN: Unsupervised representation disentanglement AC-GAN: Supervised representation disentanglement Image Translation Pix2pix (CVPR 17): Pairwise cross-domain training data CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data UNIT (NIPS 17): Learning cross-domain image representation (with unpaired training data) DTN (ICLR 17) : Learning cross-domain image representation (with unpaired training data) Joint Image Translation & Disentanglement StarGAN (CVPR 18) : Image translation via representation disentanglement CDRD (CVPR 18) : Cross-domain representation disentanglement and translation
14 CycleGAN/DiscoGAN/DualGAN. CycleGAN (ICCV 17): Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Paired data: 1-to-1 correspondence. Unpaired data: no correspondence, easier to collect training data, more practical. Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017.
15 CycleGAN Goal / Problem Setting: image translation across two distinct domains (e.g., photo and painting); unpaired training data. Idea: autoencoding-like image translation; cycle consistency between the two domains (Photo → Painting → Photo, and Painting → Photo → Painting). Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017.
16 CycleGAN Method (Example: Photo & Painting). Based on 2 GANs. First GAN (G1, D1): Photo to Painting. Second GAN (G2, D2): Painting to Photo. (Figure: G1 maps an input photo to a generated painting, which D1 judges real/fake against real paintings; G2 maps an input painting to a generated photo, which D2 judges against real photos.) Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017.
17 CycleGAN Method (Example: Photo vs. Painting). Based on 2 GANs. First GAN (G1, D1): Photo to Painting. Second GAN (G2, D2): Painting to Photo. Cycle consistency: photo consistency (Photo → G1 → G2 → Photo) and painting consistency (Painting → G2 → G1 → Painting). Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017.
18 CycleGAN Learning
Overall objective function: G1*, G2* = arg min_{G1,G2} max_{D1,D2} L_GAN(G1, D1) + L_GAN(G2, D2) + L_cyc(G1, G2)
Adversarial loss, first GAN (G1, D1): L_GAN(G1, D1) = E[log(1 − D1(G1(x)))] + E[log D1(y)]
Second GAN (G2, D2): L_GAN(G2, D2) = E[log(1 − D2(G2(y)))] + E[log D2(x)]
Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017.
19 CycleGAN Learning
Overall objective function: G1*, G2* = arg min_{G1,G2} max_{D1,D2} L_GAN(G1, D1) + L_GAN(G2, D2) + L_cyc(G1, G2)
Consistency loss (photo and painting consistency): L_cyc(G1, G2) = E[||G2(G1(x)) − x||_1] + E[||G1(G2(y)) − y||_1]
Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017.
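The cycle-consistency idea can be sketched in a few lines on toy vectors. The two "translators" below are hypothetical stand-ins for G1 and G2 (real CycleGAN generators are convolutional networks):

```python
def l1(a, b):
    # L1 distance between two flattened "images"
    return sum(abs(p - q) for p, q in zip(a, b))

def cycle_consistency_loss(g1, g2, x, y):
    """L_cyc(G1, G2) = ||G2(G1(x)) - x||_1 + ||G1(G2(y)) - y||_1 on toy data."""
    return l1(g2(g1(x)), x) + l1(g1(g2(y)), y)

# Hypothetical translators: G1 brightens, G2 darkens; they are near-inverses
g1 = lambda img: [p + 0.1 for p in img]
g2 = lambda img: [p - 0.1 for p in img]

loss = cycle_consistency_loss(g1, g2, x=[0.2, 0.4], y=[0.7, 0.9])
```

When the two generators are exact inverses of each other, the loss is (numerically) zero; any content that is lost in translation shows up as a nonzero cycle penalty, which is what substitutes for paired supervision.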
20 CycleGAN Example results. Project Page: Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017.
21 Image Translation Using Unpaired Training Data: CycleGAN, DiscoGAN, and DualGAN. CycleGAN ICCV 17; DiscoGAN ICML 17; DualGAN ICCV 17. Zhu et al. "Unpaired Image-To-Image Translation Using Cycle-Consistent Adversarial Networks." ICCV 2017. Kim et al. "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks." ICML 2017. Yi et al. "DualGAN: Unsupervised dual learning for image-to-image translation." ICCV 2017.
22 From Image Understanding to Image Manipulation Representation Disentanglement InfoGAN: Unsupervised representation disentanglement AC-GAN: Supervised representation disentanglement Image Translation Pix2pix (CVPR 17): Pairwise cross-domain training data CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data UNIT (NIPS 17): Learning cross-domain image representation (with unpaired training data) DTN (ICLR 17) : Learning cross-domain image representation (with unpaired training data) Joint Image Translation & Disentanglement StarGAN (CVPR 18) : Image translation via representation disentanglement CDRD (CVPR 18) : Cross-domain representation disentanglement and translation
23 UNIT Unsupervised Image-to-Image Translation Networks (NIPS 17). Image translation via learning a cross-domain joint representation. Stage 1: encode images x1 ∈ X1 and x2 ∈ X2 into the joint latent space Z. Stage 2: generate cross-domain images from the joint latent code z (e.g., day ↔ night). Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017
24 UNIT Goal / Problem Setting: image translation across two distinct domains; unpaired training image data. Idea: based on two parallel VAE-GAN models. Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017
25 UNIT (cont.): the two VAE-GANs learn a joint representation across image domains. Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017
26 UNIT (cont.): cross-domain images are generated from the joint representation. Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017
27 UNIT Learning
Overall objective function: G* = arg min_G max_D L_VAE(E1, G1, E2, G2) + L_GAN(G1, D1, G2, D2)
Variational autoencoder loss: L_VAE(E1, G1, E2, G2) = E[||G1(E1(x1)) − x1||] + E[KL(q1(z) || p(z))] + E[||G2(E2(x2)) − x2||] + E[KL(q2(z) || p(z))]
Adversarial loss: L_GAN(G1, D1, G2, D2) = E[log(1 − D1(G1(z)))] + E[log D1(y1)] + E[log(1 − D2(G2(z)))] + E[log D2(y2)]
(Figure: two parallel VAE-GAN branches (E1, G1, D1) and (E2, G2, D2) sharing the joint latent space)
28 UNIT Learning (cont.): the terms E[||G1(E1(x1)) − x1||] and E[||G2(E2(x2)) − x2||] in L_VAE are the reconstruction losses.
29 UNIT Learning (cont.): the terms E[KL(q1(z) || p(z))] and E[KL(q2(z) || p(z))] in L_VAE are the prior losses.
30 UNIT Learning (cont.): in L_GAN, G1(z) and G2(z) are the generated images scored by D1 and D2.
31 UNIT Learning (cont.): y1 and y2 are the real images of each domain scored by D1 and D2.
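The VAE half of each UNIT branch combines a reconstruction term with a KL prior term. A minimal numeric sketch for a diagonal-Gaussian posterior against a standard-normal prior (function names and toy inputs are illustrative, not the paper's implementation):

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL(q(z) || p(z)) for diagonal Gaussian q = N(mu, diag(sigma^2)), p = N(0, I):
    0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return sum(0.5 * (m * m + s * s - math.log(s * s) - 1.0)
               for m, s in zip(mu, sigma))

def vae_loss(x, x_recon, mu, sigma):
    # Reconstruction term ||G(E(x)) - x|| plus the prior (KL) term
    recon = sum(abs(p - q) for p, q in zip(x_recon, x))
    return recon + kl_to_standard_normal(mu, sigma)

# When q(z) already matches the prior and reconstruction is perfect, the loss is 0
loss = vae_loss(x=[0.5, 0.5], x_recon=[0.5, 0.5], mu=[0.0, 0.0], sigma=[1.0, 1.0])
```

The KL term is what pushes both encoders toward the same prior, which is one ingredient that makes the latent space joint across the two domains.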
32 UNIT Example results: Sunny ↔ Rainy translation on real and synthetic street-view images. Github Page: Liu et al., "Unsupervised image-to-image translation networks." NIPS 2017
33 From Image Understanding to Image Manipulation Representation Disentanglement InfoGAN: Unsupervised representation disentanglement AC-GAN: Supervised representation disentanglement Image Translation Pix2pix (CVPR 17): Pairwise cross-domain training data CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data UNIT (NIPS 17): Learning cross-domain image representation (with unpaired training data) DTN (ICLR 17) : Learning cross-domain image representation (with unpaired training data) Joint Image Translation & Disentanglement StarGAN (CVPR 18) : Image translation via representation disentanglement CDRD (CVPR 18) : Cross-domain representation disentanglement and translation
34 Domain Transfer Networks Unsupervised Cross-Domain Image Generation (ICLR 17)
Goal / Problem Setting: image translation across two domains; one-way translation only; unpaired training data.
Idea: apply a unified model to learn a joint representation across domains.
Taigman et al., "Unsupervised cross-domain image generation." ICLR 2017
35 Domain Transfer Networks Unsupervised Cross-Domain Image Generation (ICLR 17)
Goal / Problem Setting: image translation across two domains; one-way translation only; unpaired training data.
Idea: apply a unified model to learn a joint representation across domains; consistency is enforced in both the image and feature spaces (image consistency and feature consistency).
Taigman et al., "Unsupervised cross-domain image generation." ICLR 2017
36 Domain Transfer Networks Learning
Unified model G to translate across domains: G* = arg min_G max_D L_img(G) + L_feat(G) + L_GAN(G, D)
Image consistency: L_img(G) = E[||g(f(y)) − y||^2]
Feature consistency: L_feat(G) = E[||f(g(f(x))) − f(x)||^2]
Adversarial loss: L_GAN(G, D) = E[log(1 − D(G(x)))] + E[log(1 − D(G(y)))] + E[log D(y)]
37 Domain Transfer Networks Learning (cont.): G = {f, g}, where f extracts features and g generates images; L_img(G) = E[||g(f(y)) − y||^2] enforces image consistency on target-domain inputs y, and L_feat(G) = E[||f(g(f(x))) − f(x)||^2] enforces feature consistency on source-domain inputs x.
38 Domain Transfer Networks Learning (cont.): the discriminator D scores the translated images G(x) and G(y) against real target-domain images y.
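The two DTN consistency terms can be sketched on toy vectors with G = g ∘ f. The encoder `f` and generator `g` below are hypothetical stand-ins chosen to be exact inverses, so both terms vanish:

```python
def sq_dist(a, b):
    # Squared L2 distance ||a - b||^2 on flat vectors
    return sum((p - q) ** 2 for p, q in zip(a, b))

def dtn_consistency_losses(f, g, x, y):
    """DTN consistency terms on toy data, with G = g o f.

    L_img  = ||g(f(y)) - y||^2       (target-domain input should be reproduced)
    L_feat = ||f(g(f(x))) - f(x)||^2 (source features preserved after translation)
    """
    l_img = sq_dist(g(f(y)), y)
    l_feat = sq_dist(f(g(f(x))), f(x))
    return l_img, l_feat

# Hypothetical encoder/generator that invert each other on this toy data
f = lambda v: [2.0 * p for p in v]
g = lambda v: [0.5 * p for p in v]

l_img, l_feat = dtn_consistency_losses(f, g, x=[0.1, 0.2], y=[0.3, 0.4])
```

Note the asymmetry with CycleGAN: DTN constrains the feature space of a fixed encoder f on source inputs, rather than running a second generator back to the source domain, which is why it only translates one way.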
40 DTN Example results: SVHN → MNIST, Photo → Emoji. Taigman et al., "Unsupervised cross-domain image generation." ICLR 2017
41 From Image Understanding to Image Manipulation Representation Disentanglement InfoGAN: Unsupervised representation disentanglement AC-GAN: Supervised representation disentanglement Image Translation Pix2pix (CVPR 17): Pairwise cross-domain training data CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data UNIT (NIPS 17): Learning cross-domain image representation (with unpaired training data) DTN (ICLR 17) : Learning cross-domain image representation (with unpaired training data) Joint Image Translation & Disentanglement StarGAN (CVPR 18) : Image translation via representation disentanglement CDRD (CVPR 18) : Cross-domain representation disentanglement and translation
42 StarGAN Goal Unified GAN for multi-domain image-to-image translation Traditional Cross-Domain Models Unified Multi-Domain Model (StarGAN) Choi et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." CVPR 2018
43 StarGAN Goal: unified GAN for multi-domain image-to-image translation. Traditional cross-domain models require a separate generator per domain pair (G_12, G_13, G_14, G_23, G_24, G_34); StarGAN uses a single unified G. Choi et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." CVPR 2018
44 StarGAN Goal / Problem Setting Single image translation model across multiple domains Unpaired training data
45 StarGAN Goal / Problem Setting: single image translation model across multiple domains; unpaired training data. Idea: concatenate the image and target domain label as input of the generator; auxiliary domain classifier on the discriminator. (Figure: target domain label + image)
46 StarGAN (cont.): the auxiliary domain classifier is built into the discriminator.
47 StarGAN (cont.): cycle consistency is enforced across domains.
48 StarGAN Idea summary: auxiliary domain classifier on the discriminator; image concatenated with the target domain label as generator input; cycle consistency across domains.
49 StarGAN Learning. Overall objective function: G* = arg min_G max_D L_GAN(G, D) + L_cls(G, D) + L_cyc(G)
50 StarGAN Learning (cont.) Adversarial loss: L_GAN(G, D) = E[log(1 − D(G(x, c)))] + E[log D(y)]
51 StarGAN Learning (cont.) Domain classification loss (disentanglement): L_cls(G, D) = E[log D_cls(c | y)] + E[log D_cls(c | G(x, c))]
52 StarGAN Learning (cont.): the term E[log D_cls(c | y)] classifies real data w.r.t. its domain label.
53 StarGAN Learning (cont.): the term E[log D_cls(c | G(x, c))] classifies generated data w.r.t. the assigned label.
54 StarGAN Learning (cont.) Cycle consistency loss: L_cyc(G) = E[||G(G(x, c), c′) − x||_1], where c′ is the original domain label of x.
55 StarGAN Learning summary
Overall objective: G* = arg min_G max_D L_GAN(G, D) + L_cls(G, D) + L_cyc(G)
Adversarial loss: L_GAN(G, D) = E[log(1 − D(G(x, c)))] + E[log D(y)]
Domain classification loss: L_cls(G, D) = E[log D_cls(c | y)] + E[log D_cls(c | G(x, c))]
Cycle consistency loss: L_cyc(G) = E[||G(G(x, c), c′) − x||_1]
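The key mechanical trick is the conditioning: a single generator handles all domains because the target label rides along with the image. A toy sketch of that input construction (hypothetical helper names; the actual StarGAN tiles the label spatially over the image rather than appending a flat vector):

```python
def one_hot(domain, num_domains):
    # One-hot encoding of the target domain label c
    return [1.0 if i == domain else 0.0 for i in range(num_domains)]

def generator_input(image, target_domain, num_domains):
    """StarGAN-style conditioning: the (flattened) image is concatenated
    with a one-hot target-domain label before entering the generator."""
    return image + one_hot(target_domain, num_domains)

# Toy 2x2 "image" flattened, translated toward domain 2 of 4
inp = generator_input([0.1, 0.2, 0.3, 0.4], target_domain=2, num_domains=4)
# inp -> [0.1, 0.2, 0.3, 0.4, 0.0, 0.0, 1.0, 0.0]
```

Because the label is part of the input, the same G parameters serve every (source, target) pair; the cycle term reuses G with the original label c′ to map the result back.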
56 StarGAN Example results: translation across multiple domains. StarGAN can to some extent be viewed as a representation disentanglement model, rather than purely an image translation one. Github Page: Choi et al. "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation." CVPR 2018
57 From Image Understanding to Image Manipulation Representation Disentanglement InfoGAN: Unsupervised representation disentanglement AC-GAN: Supervised representation disentanglement Image Translation Pix2pix (CVPR 17): Pairwise cross-domain training data CycleGAN/DualGAN/DiscoGAN: Unpaired cross-domain training data UNIT (NIPS 17): Learning cross-domain image representation (with unpaired training data) DTN (ICLR 17) : Learning cross-domain image representation (with unpaired training data) Joint Image Translation & Disentanglement StarGAN (CVPR 18) : Image translation via representation disentanglement CDRD (CVPR 18) : Cross-domain representation disentanglement and translation 57
58 Cross-Domain Representation Disentanglement (CDRD)
Goal / Problem Setting: learning a cross-domain joint disentangled representation; supervision in a single domain only; cross-domain image translation with the attribute of interest.
Idea: bridge the domain gap across domains; auxiliary attribute classifier on the discriminator; semantics of the disentangled factor learned from labels in the source domain. (Figure: joint space)
Wang et al., Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation. CVPR 18
59 CDRD (cont.): the source domain is supervised (e.g., w/ or w/o eyeglasses); the target domain is unsupervised.
60 CDRD (cont.): (Figure: translated samples w/ and w/o eyeglasses in both domains)
61 CDRD (cont.): based on GAN; the generator G maps a latent code z to images X̃_S, and the discriminator D predicts real or fake.
62 CDRD (cont.): the discriminator also classifies the attribute w.r.t. the supervised source label l_S or the assigned latent attribute l.
63 CDRD (cont.): the domain gap is bridged by dividing the networks into shared high-level and domain-specific low-level layers.
64 CDRD (cont.): semantics of the disentangled factor are learned from label information in the source domain.
65 CDRD Learning. Overall objective function: G* = arg min_G max_D L_GAN(G, D) + L_cls(G, D)
66 CDRD Learning (cont.)
Adversarial loss: L_GAN(G, D) = L_GAN^S(G, D) + L_GAN^T(G, D), where
L_GAN^S(G, D) = E[log D_C(D_S(X_S))] + E[log(1 − D_C(D_S(X̃_S)))]
L_GAN^T(G, D) = E[log D_C(D_T(X_T))] + E[log(1 − D_C(D_T(X̃_T)))]
67 CDRD Learning (cont.): X_S and X_T denote real images of the source and target domains.
68 CDRD Learning (cont.): X̃_S and X̃_T denote generated images.
69 CDRD Learning (cont.)
Disentanglement loss: L_cls(G, D) = L_cls^S(G, D) + L_cls^T(G, D)
Source domain (supervised, as in AC-GAN): L_cls^S(G, D) = E[log P(l | X̃_S)] + E[log P(l_S | X_S)], i.e., generated data classified w.r.t. the assigned attribute l and real data w.r.t. its ground-truth label l_S.
70 CDRD Learning (cont.)
Target domain (unsupervised, as in InfoGAN): L_cls^T(G, D) = E[log P(l | X̃_T)], i.e., only generated data classified w.r.t. the assigned attribute l.
71 CDRD: adding an additional encoder lets the input be an image instead of Gaussian noise, enabling image translation with the attribute of interest.
72 CDRD Experiment results (no label supervision in the target domain; input images and their translated outputs). Wang et al., Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation. CVPR 18
73 CDRD Experiment results: cross-domain classification compared against prior methods (baselines from CVPR 13/14/17, PAMI 17, ICCV 15, ICML 15, NIPS 16/17, ECCV 16, arXiv 16) on Digits, Face, and Scene benchmarks. (Figure: t-SNE for digits.) Wang et al., Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation. CVPR 18
74 Comparisons
Cross-domain image translation (Unpaired training data / Multi-domains / Bi-direction / Joint representation):
Pix2pix: X / X / X / X
CycleGAN: O / X / O / X
StarGAN: O / O / O / X
UNIT: O / X / O / O
DTN: O / X / X / O
(these models cannot disentangle the representation)
Representation disentanglement (Unsupervised / Interpretability of disentangled factor):
InfoGAN: O / X
AC-GAN: X / O
(these models cannot translate images across domains)
CDRD (Ours): O / O / O / O for translation; Partially / O for disentanglement
75 What's to Be Covered Today. Visualization of NNs: t-SNE, visualization of feature maps, neural style transfer. Understanding NNs: image translation, feature disentanglement. Recurrent Neural Networks: LSTM & GRU. Many slides from Fei-Fei Li, Bhiksha Raj, Yuying Yeh, and Hsuan-I Ho
76 What We Have Learned CNN for Image Classification DOG CAT MONKEY 76
77 What We Have Learned Generative Models (e.g., GAN) for Image Synthesis 77
78 But Is it sufficiently effective and robust to perform visual analysis from a single image? E.g., Are the hands of this man tied? 78
79 But Is it sufficiently effective and robust to perform visual analysis from a single image? E.g., Are the hands of this man tied? 79
80 So What Are the Limitations of CNN? Can't easily model sequential data, where both input and output might be sequences (of scalars or vectors). Simply feed-forward processing: cannot memorize, no long-term feedback.
81 Example of (Visual) Sequential Data 81
82 More Applications in Vision: Image Captioning. Figure from Vinyals et al., Show and tell: A neural image caption generator, CVPR 15
83 More Applications in Vision: Visual Question Answering (VQA). (Figure: input image and question, output answer.) Figure from Zhu et al., Visual7W: Grounded Question Answering in Images, CVPR 16
84 How to Model Sequential Data? Deep learning for sequential data: recurrent neural networks (RNN) and 3D convolutional neural networks. (Figure: RNN vs. 3D convolution)
85 Recurrent Neural Networks Parameter sharing + unrolling Keeps the number of parameters fixed Allows sequential data with varying lengths Memory ability Capture and preserve information which has been extracted 85
86 Recurrence Formula. The same function and parameters are used at every time step: (h_t, y_t) = f_W(h_{t-1}, x_t), where h_t is the new state for time t, y_t the output vector at time t, h_{t-1} the state at time t-1, x_t the input vector at time t, and f_W a function with parameters W.
87 Recurrence Formula. The same function and parameters are used at every time step; for a vanilla RNN:
h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t
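The vanilla RNN recurrence above fits in a few lines. A scalar sketch (real implementations use weight matrices and vectors, but the update rule is identical):

```python
import math

def rnn_step(h_prev, x, W_hh, W_xh, W_hy):
    """One vanilla RNN step (scalar case for readability):
    h_t = tanh(W_hh * h_{t-1} + W_xh * x_t),  y_t = W_hy * h_t
    """
    h = math.tanh(W_hh * h_prev + W_xh * x)
    y = W_hy * h
    return h, y

h, y = rnn_step(h_prev=0.0, x=1.0, W_hh=0.5, W_xh=1.0, W_hy=2.0)
# h = tanh(1.0), y = 2 * tanh(1.0)
```

Note that the same (W_hh, W_xh, W_hy) triple is reused at every time step; only h and x change.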
88 Multiple Recurrent Layers 88
89 Multiple Recurrent Layers 89
90 (Figure: unrolling begins, h_0 → f_W → h_1 with input x_1)
91 (Figure: h_0 → h_1 → h_2 with inputs x_1, x_2)
92 (Figure: h_0 → h_1 → h_2 → h_3 → ... → h_T with inputs x_1, x_2, x_3, ...)
93 (Figure: the same unrolled RNN, with shared weights W_hh on the recurrent connections and W_xh on the inputs)
94 (Figure: outputs y_1, y_2, y_3, ..., y_T read out from each hidden state via W_hy)
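The unrolling shown in these slides is just a loop that reuses the same parameters at each step. A scalar sketch under the same assumptions as before:

```python
import math

def run_rnn(xs, W_hh, W_xh, W_hy, h0=0.0):
    """Unroll the same f_W over a whole sequence, sharing parameters,
    starting from h_0 and emitting one output per time step."""
    h, ys = h0, []
    for x in xs:                       # same W at every step
        h = math.tanh(W_hh * h + W_xh * x)
        ys.append(W_hy * h)
    return h, ys

h_T, ys = run_rnn([1.0, 0.0, -1.0], W_hh=0.5, W_xh=1.0, W_hy=1.0)
```

Parameter sharing is why the parameter count stays fixed no matter how long the input sequence is, which is the point made on the earlier RNN slide.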
95 (Figure: RNN input/output configurations, with example tasks: image captioning, action recognition, video prediction, video indexing)
96 Example: Image Captioning. Figure from Karpathy et al., Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 15
97 (Figure: CNN)
104 Example: Action Recognition Figure from Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features 104
105 (Figure: unrolled RNN producing outputs y_1, ..., y_T from inputs x_1, x_2, x_3, ...)
106 (Figure: the outputs pass through a softmax over action classes, e.g., golf, standing, exercise)
107 Backpropagation through time: a cross-entropy loss between the softmax outputs and the action label is backpropagated through the unrolled network.
108 Back Propagation Through Time (BPTT) 108
109 Back Propagation Through Time (BPTT). Pascanu et al., On the difficulty of training recurrent neural networks, ICML 13
110 Training RNNs via BPTT 110
111 Training RNNs: Forward Pass Forward pass each training instance: Pass the entire data sequence through the net Generate outputs 111
112 Training RNNs: Computing Gradients After forward passing each training instance: Backward pass: compute gradients via BP More specifically, BPTT 112
113 Training RNNs: BPTT. Let's focus on one training instance. The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs. Generally, this is not just the sum of the divergences at individual times.
114 Gradient Vanishing & Exploding. Computing the gradient involves many repeated factors of W. Exploding gradients: largest singular value > 1. Vanishing gradients: largest singular value < 1.
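Why the largest singular value matters can be seen with a scalar stand-in: backpropagating through T steps multiplies T copies of (roughly) the same Jacobian factor. The helper below is illustrative, not part of any library:

```python
def backprop_factor(w, steps):
    """Product of `steps` identical Jacobian factors w, a scalar stand-in
    for the largest singular value of W_hh in BPTT."""
    prod = 1.0
    for _ in range(steps):
        prod *= w
    return prod

exploding = backprop_factor(1.1, 100)   # factor > 1: grows without bound
vanishing = backprop_factor(0.9, 100)   # factor < 1: shrinks toward zero
```

Even a factor as mild as 1.1 or 0.9 compounds to roughly 10^4 or 10^-5 over 100 steps, which is why long-range gradients in a vanilla RNN either blow up or disappear.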
115 Solutions. Gradient clipping: rescale the gradient if its norm is too large. (Figure: standard gradient descent trajectory vs. trajectory with gradient clipping.) How about vanishing gradients?
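Clipping by global norm can be sketched directly (function name and threshold are illustrative; deep learning frameworks ship their own version of this utility):

```python
import math

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm;
    otherwise return it unchanged. Direction is preserved either way."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad

clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)   # norm 50 -> rescaled to norm 5
small = clip_by_norm([0.3, 0.4], max_norm=5.0)       # norm 0.5 -> unchanged
```

Note this only caps exploding gradients; it does nothing for vanishing gradients, which is what motivates the gated architectures (LSTM, GRU) on the following slides.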
116 Variants of RNN Long Short-term Memory (LSTM) [Hochreiter et al., 1997] Additional memory cell Input/Forget/Output Gates Handle gradient vanishing Learn long-term dependencies Gated Recurrent Unit (GRU) [Cho et al., EMNLP 2014] Similar to LSTM No additional memory cell Reset / Update Gates Fewer parameters than LSTM Comparable performance to LSTM [Chung et al., NIPS Workshop 2014] 116
117 Vanilla RNN vs. LSTM [Figure: vanilla RNN cell (input and output at time t) vs. LSTM memory cell with forget gate, input gate, input activation, output gate, cell state, and hidden state] 117
118 Long Short-Term Memory (LSTM) [Figure: LSTM memory cell with forget gate, input gate, input activation, output gate, cell state, and hidden state; input and output at time t] Image Credit: Hung-Yi Lee 118
119 Long Short-Term Memory (LSTM) [Figure: LSTM cell] Forget gate f_t = σ(W_f [h_{t-1}, x_t] + b_f); Input gate i_t = σ(W_i [h_{t-1}, x_t] + b_i); Input activation g_t = tanh(W_g [h_{t-1}, x_t] + b_g); Output gate o_t = σ(W_o [h_{t-1}, x_t] + b_o); Cell state c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t; Hidden state h_t = o_t ⊙ tanh(c_t) 119
120 Long Short-Term Memory (LSTM) Calculate forget gate: f_t = σ(W_f [h_{t-1}, x_t] + b_f) 120
121 Long Short-Term Memory (LSTM) Calculate input gate: i_t = σ(W_i [h_{t-1}, x_t] + b_i) 121
122 Long Short-Term Memory (LSTM) Calculate input activation: g_t = tanh(W_g [h_{t-1}, x_t] + b_g) 122
123 Long Short-Term Memory (LSTM) Calculate output gate: o_t = σ(W_o [h_{t-1}, x_t] + b_o) 123
124 Long Short-Term Memory (LSTM) Update memory cell: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t 124
125 Long Short-Term Memory (LSTM) Calculate output: h_t = o_t ⊙ tanh(c_t) 125
126 Long Short-Term Memory (LSTM) Prevents gradient vanishing if the forget gate is open (> 0). Remarks: the forget gate f provides a shortcut connection across time steps; c_t depends linearly on c_{t-1} (c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t), instead of the multiplicative relationship between h_t and h_{t-1} in a vanilla RNN. 126
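One full LSTM step, following the gates walked through on the preceding slides, can be sketched as below. This is a sketch only: the concatenated [h_{t-1}, x_t] parameterization and the weight/bias names are common conventions, assumed here rather than taken from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    z = np.concatenate([h_prev, x])        # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)               # forget gate
    i = sigmoid(Wi @ z + bi)               # input gate
    g = np.tanh(Wg @ z + bg)               # input activation
    o = sigmoid(Wo @ z + bo)               # output gate
    c = f * c_prev + i * g                 # cell state: linear in c_{t-1} (the shortcut)
    h = o * np.tanh(c)                     # hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
Ws = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), *Ws, *bs)
print(h.shape, c.shape)
```

The key line for the gradient argument is `c = f * c_prev + i * g`: the cell state update is element-wise and additive, so the gradient path through c avoids repeated multiplication by a full weight matrix.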
127 Variants of RNN Long Short-term Memory (LSTM) [Hochreiter et al., 1997] Additional memory cell Input / Forget / Output Gates Handle gradient vanishing Learn long-term dependencies Gated Recurrent Unit (GRU) [Cho et al., EMNLP 2014] Similar to LSTM No additional memory cell Reset/Update Gates Fewer parameters than LSTM Comparable performance to LSTM [Chung et al., NIPS Workshop 2014] 127
128 LSTM vs. GRU [Figure: the two cells side by side — LSTM: forget gate, input gate, input activation, output gate, cell state, hidden state; GRU: reset gate, update gate, input activation, hidden state] 128
129 Gated Recurrent Unit (GRU) [Figure: GRU cell — reset gate, update gate, input activation (tanh), hidden state] 129
130 Gated Recurrent Unit (GRU) Calculate reset gate: r_t = σ(W_r [h_{t-1}, x_t] + b_r) 130
131 Gated Recurrent Unit (GRU) Calculate update gate: z_t = σ(W_z [h_{t-1}, x_t] + b_z) 131
132 Gated Recurrent Unit (GRU) Calculate input activation: h̃_t = tanh(W_h [r_t ⊙ h_{t-1}, x_t] + b_h) 132
133 Gated Recurrent Unit (GRU) Calculate output: h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t 133
134 Gated Recurrent Unit (GRU) Prevents gradient vanishing if the update gate is open. Remarks: the update gate z provides a shortcut connection across time steps; h_t depends linearly on h_{t-1} (a convex blend of h_{t-1} and h̃_t), instead of the multiplicative relationship between h_t and h_{t-1} in a vanilla RNN. 134
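One full GRU step, following the reset/update gates above, can be sketched as below. As with the LSTM sketch, the concatenated-input parameterization and the blend h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t follow a common convention (conventions differ on which term z multiplies), so treat this as an assumed formulation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, Wr, Wz, Wh, br, bz, bh):
    z_in = np.concatenate([h_prev, x])                 # [h_{t-1}, x_t]
    r = sigmoid(Wr @ z_in + br)                        # reset gate
    z = sigmoid(Wz @ z_in + bz)                        # update gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)  # input activation
    h = (1 - z) * h_prev + z * h_tilde                 # linear blend: shortcut across time
    return h

rng = np.random.default_rng(0)
D, H = 3, 4
Ws = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(3)]
bs = [np.zeros(H) for _ in range(3)]
h = gru_step(rng.normal(size=D), np.zeros(H), *Ws, *bs)
print(h.shape)
```

Compared with the LSTM sketch there is no separate cell state and one fewer gate, which is where the GRU's parameter savings come from.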
135 Vanilla RNN, LSTM, & GRU [Figure: the three cell diagrams side by side — vanilla RNN with a single tanh, LSTM, and GRU; input x and output y at time t] 135
136 LSTM vs. GRU [Table: comparison of Vanilla RNN, LSTM, and GRU along cell state, number of gates, parameters, and gradient vanishing/exploding] 136
137 References: book; VT; CS231n; Jia-Bin Huang; MLDS; adl/doc/ _attention.pdf 137
138 Web Tutorial
139 What We've Learned Today Understanding NNs Image Translation Feature Disentanglement Recurrent Neural Networks LSTM & GRU HW4 is due 5/18 Fri 23:59! 139
Slide credit from Hung-Yi Lee & Richard Socher