Deep Learning for Computer Vision

Deep Learning for Computer Vision, Spring 2018
http://vllab.ee.ntu.edu.tw/dlcv.html (primary), https://ceiba.ntu.edu.tw/1062dlcv (grades, etc.), FB: DLCV Spring 2018
Yu-Chiang Frank Wang 王鈺強, Associate Professor
Dept. of Electrical Engineering, National Taiwan University
2018/05/09

What's to Be Covered Today
- Visualization of NNs: t-SNE, visualization of feature maps, neural style transfer
- Understanding NNs: image translation, feature disentanglement
- Recurrent Neural Networks: LSTM & GRU
Many slides from Fei-Fei Li, Yaser Sheikh, Simon Lucey, Kaiming He, J.-B. Huang, Yuying Yeh, and Hsuan-I Ho

Representation Disentanglement
- Goal: interpretable deep feature representations
- Disentangle an attribute of interest c from the derived latent representation z
- Unsupervised: InfoGAN; Supervised: AC-GAN
Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," NIPS 2016.
Odena et al., "Conditional image synthesis with auxiliary classifier GANs," ICML 2017.

AC-GAN: Supervised Disentanglement
Learning: overall objective function
$G^* = \arg\min_G \max_D L_{GAN}(G, D) + L_{cls}(G, D)$
Adversarial loss:
$L_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(z, c)))] + \mathbb{E}[\log D(y)]$
Disentanglement loss (real data w.r.t. its own label, generated data w.r.t. the assigned label):
$L_{cls}(G, D) = \mathbb{E}[\log D_{cls}(c \mid y)] + \mathbb{E}[\log D_{cls}(c \mid G(z, c))]$
Odena et al., "Conditional image synthesis with auxiliary classifier GANs," ICML 2017.
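As a rough sketch of how the two terms above are typically computed (not the authors' code), assuming a discriminator D that returns both a real/fake logit and class logits, and a generator G mapping (z, c) to an image:

```python
# Hypothetical AC-GAN loss sketch in PyTorch; D, G, and all tensors are
# placeholder assumptions, not the official implementation.
import torch
import torch.nn.functional as F

def acgan_losses(D, G, real_imgs, real_labels, z, fake_labels):
    """D returns (real/fake logit, class logits); G maps (z, c) to an image."""
    fake_imgs = G(z, fake_labels)

    # Adversarial term: E[log D(y)] + E[log(1 - D(G(z, c)))]
    real_logit, real_cls = D(real_imgs)
    fake_logit, fake_cls = D(fake_imgs)
    adv = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
           + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

    # Disentanglement term: classify real data by its true label and
    # generated data by its assigned label.
    cls = F.cross_entropy(real_cls, real_labels) + F.cross_entropy(fake_cls, fake_labels)
    return adv, cls
```

Dropping the real-data classification term (no labels needed) yields the unsupervised InfoGAN variant described next.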

AC-GAN: Supervised Disentanglement
(Figure: the generator G maps noise z and a supervised label c to G(z, c); the discriminator D judges real images y vs. generated ones; varying c changes the synthesized attribute.)
Odena et al., "Conditional image synthesis with auxiliary classifier GANs," ICML 2017.

InfoGAN: Unsupervised Disentanglement
Learning: overall objective function
$G^* = \arg\min_G \max_D L_{GAN}(G, D) + L_{cls}(G, D)$
Adversarial loss:
$L_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(z, c)))] + \mathbb{E}[\log D(y)]$
Disentanglement loss (generated data w.r.t. the assigned latent code only; no labels required):
$L_{cls}(G, D) = \mathbb{E}[\log D_{cls}(c \mid G(z, c))]$
Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," NIPS 2016.
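In the paper this term is derived as a variational lower bound on the mutual information I(c; G(z, c)). A minimal sketch under the common implementation, where an auxiliary head Q predicts the code c from the generated image (G, Q, and n_classes are illustrative assumptions):

```python
# Hypothetical InfoGAN mutual-information term in PyTorch.
import torch
import torch.nn.functional as F

def infogan_mi_loss(G, Q, z, n_classes=10):
    # Sample a categorical latent code c at random; no labels are used.
    c = torch.randint(0, n_classes, (z.shape[0],))
    fake = G(z, c)
    # Lower bound on I(c; G(z, c)): maximize E[log Q(c | G(z, c))],
    # i.e. minimize the cross-entropy of Q's prediction against c.
    return F.cross_entropy(Q(fake), c)
```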

InfoGAN: Unsupervised Disentanglement
- No guarantee of disentangling particular semantics: different latent codes c may end up controlling factors such as rotation angle or width.
(Figure: samples under varying c, and the training loss over time.)
Chen et al., "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," NIPS 2016.

From Image Understanding to Image Manipulation
- Representation Disentanglement
  - InfoGAN: unsupervised representation disentanglement
  - AC-GAN: supervised representation disentanglement
- Image Translation
  - Pix2pix (CVPR '17): pairwise cross-domain training data
  - CycleGAN/DualGAN/DiscoGAN: unpaired cross-domain training data
  - UNIT (NIPS '17): learning cross-domain image representation (with unpaired training data)
  - DTN (ICLR '17): learning cross-domain image representation (with unpaired training data)
- Joint Image Translation & Disentanglement
  - StarGAN (CVPR '18): image translation via representation disentanglement
  - CDRD (CVPR '18): cross-domain representation disentanglement and translation

Pix2pix
- "Image-to-Image Translation with Conditional Adversarial Networks" (CVPR '17)
- Can be viewed as image style transfer (e.g., sketch → photo)
Isola et al., "Image-to-image translation with conditional adversarial networks," CVPR 2017.

Pix2pix
Goal / problem setting:
- Image translation across two distinct domains (e.g., sketch vs. photo)
- Pairwise training data
Method: conditional GAN. Example: sketch → photo.
- Generator (encoder-decoder). Input: sketch; output: photo.
- Discriminator. Input: concatenation of the input (sketch) with the synthesized or real photo; output: real or fake.
(Figure: the testing phase runs the generator alone; in the training phase, the discriminator sees (input, generated) and (input, real) concatenated pairs.)
Isola et al., "Image-to-image translation with conditional adversarial networks," CVPR 2017.

Pix2pix: Learning the Model
Overall objective function:
$G^* = \arg\min_G \max_D L_{cGAN}(G, D) + L_{L1}(G)$
Conditional GAN loss (D sees the input concatenated with the real or generated output):
$L_{cGAN}(G, D) = \mathbb{E}_x[\log(1 - D(x, G(x)))] + \mathbb{E}_{x,y}[\log D(x, y)]$
Reconstruction loss:
$L_{L1}(G) = \mathbb{E}_{x,y}[\| y - G(x) \|_1]$
Isola et al., "Image-to-image translation with conditional adversarial networks," CVPR 2017.
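A hedged sketch of the generator-side objective above; G, D, and the L1 weight are stand-ins (the paper reports using a weight around 100, but treat the value here as an assumption):

```python
# Hypothetical pix2pix generator loss in PyTorch (conditional GAN + L1).
import torch
import torch.nn.functional as F

def pix2pix_g_loss(G, D, x, y, lambda_l1=100.0):
    """x: input image (e.g., sketch); y: paired target image (photo)."""
    fake = G(x)
    # Conditional adversarial term: D sees the (input, output) pair,
    # concatenated along the channel dimension.
    fake_logit = D(torch.cat([x, fake], dim=1))
    adv = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    # Reconstruction term: L1 distance to the paired ground truth.
    rec = F.l1_loss(fake, y)
    return adv + lambda_l1 * rec
```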

Pix2pix: Example Results
Demo page: https://affinelayer.com/pixsrv/
Isola et al., "Image-to-image translation with conditional adversarial networks," CVPR 2017.


CycleGAN / DiscoGAN / DualGAN
CycleGAN (ICCV '17): "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks"
- Paired data (as in Pix2pix): requires 1-to-1 correspondence
- Unpaired data: no correspondence needed; easier to collect training data, more practical
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.

CycleGAN
Goal / problem setting:
- Image translation across two distinct domains (e.g., photo and painting)
- Unpaired training data
Idea:
- Autoencoding-like image translation
- Cycle consistency between the two domains: photo → painting → photo, and painting → photo → painting
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.

CycleGAN
Method (example: photo & painting), based on two GANs:
- First GAN (G1, D1): photo → painting; D1 judges generated vs. real paintings.
- Second GAN (G2, D2): painting → photo; D2 judges generated vs. real photos.
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.

CycleGAN
Cycle consistency (example: photo vs. painting):
- Photo consistency: photo → G1 → painting → G2 → photo should reconstruct the input photo.
- Painting consistency: painting → G2 → photo → G1 → painting should reconstruct the input painting.
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.

Learning CycleGAN
Overall objective function:
$G_1^*, G_2^* = \arg\min_{G_1, G_2} \max_{D_1, D_2} L_{GAN}(G_1, D_1) + L_{GAN}(G_2, D_2) + L_{cyc}(G_1, G_2)$
Adversarial losses (x: photo, y: painting):
- First GAN (G1, D1): $L_{GAN}(G_1, D_1) = \mathbb{E}[\log(1 - D_1(G_1(x)))] + \mathbb{E}[\log D_1(y)]$
- Second GAN (G2, D2): $L_{GAN}(G_2, D_2) = \mathbb{E}[\log(1 - D_2(G_2(y)))] + \mathbb{E}[\log D_2(x)]$
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.

Learning CycleGAN
Consistency loss (photo and painting consistency):
$L_{cyc}(G_1, G_2) = \mathbb{E}[\| G_2(G_1(x)) - x \|_1] + \mathbb{E}[\| G_1(G_2(y)) - y \|_1]$
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.
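A minimal sketch of the cycle-consistency term, with G1 (photo → painting) and G2 (painting → photo) as assumed callables:

```python
# Hypothetical CycleGAN cycle-consistency loss in PyTorch.
import torch.nn.functional as F

def cycle_loss(G1, G2, x_photo, y_painting):
    # Photo consistency: photo -> painting -> photo should reconstruct x.
    photo_rec = F.l1_loss(G2(G1(x_photo)), x_photo)
    # Painting consistency: painting -> photo -> painting should reconstruct y.
    painting_rec = F.l1_loss(G1(G2(y_painting)), y_painting)
    return photo_rec + painting_rec
```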

CycleGAN: Example Results
Project page: https://junyanz.github.io/cyclegan/
Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.

Image Translation Using Unpaired Training Data
- CycleGAN (ICCV '17): Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," ICCV 2017.
- DiscoGAN (ICML '17): Kim et al., "Learning to discover cross-domain relations with generative adversarial networks," ICML 2017.
- DualGAN (ICCV '17): Yi et al., "DualGAN: Unsupervised dual learning for image-to-image translation," ICCV 2017.


UNIT
"Unsupervised Image-to-Image Translation Networks" (NIPS '17)
- Image translation via learning a cross-domain joint representation
- Stage 1: encode images from both domains (e.g., day images $x_1 \in X_1$ and night images $x_2 \in X_2$) into the joint latent space Z
- Stage 2: generate cross-domain images from the joint latent code z
Liu et al., "Unsupervised image-to-image translation networks," NIPS 2017.

UNIT
Goal / problem setting:
- Image translation across two distinct domains
- Unpaired training image data
Idea:
- Based on two parallel VAE-GAN models
- Learn a joint representation across image domains
- Generate cross-domain images from the joint representation
Liu et al., "Unsupervised image-to-image translation networks," NIPS 2017.

UNIT: Learning
Overall objective function:
$E^*, G^* = \arg\min_{E_1, E_2, G_1, G_2} \max_{D_1, D_2} L_{VAE}(E_1, G_1, E_2, G_2) + L_{GAN}(G_1, D_1, G_2, D_2)$
Variational autoencoder loss (a reconstruction term and a prior term for each domain):
$L_{VAE}(E_1, G_1, E_2, G_2) = \mathbb{E}[\| G_1(E_1(x_1)) - x_1 \|^2] + \mathbb{E}[\mathrm{KL}(q_1(z) \| p(z))] + \mathbb{E}[\| G_2(E_2(x_2)) - x_2 \|^2] + \mathbb{E}[\mathrm{KL}(q_2(z) \| p(z))]$
Adversarial loss (each generator must fool its domain's discriminator; $y_1$, $y_2$ are real images):
$L_{GAN}(G_1, D_1, G_2, D_2) = \mathbb{E}[\log(1 - D_1(G_1(z)))] + \mathbb{E}[\log D_1(y_1)] + \mathbb{E}[\log(1 - D_2(G_2(z)))] + \mathbb{E}[\log D_2(y_2)]$
Liu et al., "Unsupervised image-to-image translation networks," NIPS 2017.
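An illustrative sketch of one VAE branch of this objective (reconstruction plus KL toward the shared prior $p(z) = N(0, I)$), assuming, as a simplification, that the encoder outputs the mean of $q_1(z \mid x_1)$ with unit variance; all names and the KL weight are placeholders:

```python
# Hypothetical UNIT VAE-branch loss in PyTorch.
import torch
import torch.nn.functional as F

def unit_vae_loss(E1, G1, x1, kl_weight=0.1):
    z_mean = E1(x1)                          # encode into the joint latent space
    z = z_mean + torch.randn_like(z_mean)    # reparameterized sample, sigma = 1
    recon = F.mse_loss(G1(z), x1)            # E || G1(E1(x1)) - x1 ||^2
    # KL(N(mu, I) || N(0, I)) = 0.5 * ||mu||^2 (variance terms cancel).
    kl = 0.5 * (z_mean ** 2).sum(dim=1).mean()
    return recon + kl_weight * kl
```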

UNIT: Example Results
(Figure: sunny ↔ rainy translation; real ↔ synthetic street-view translation.)
GitHub page: https://github.com/mingyuliutw/unit
Liu et al., "Unsupervised image-to-image translation networks," NIPS 2017.


Domain Transfer Networks (DTN)
"Unsupervised Cross-Domain Image Generation" (ICLR '17)
Goal / problem setting:
- Image translation across two domains; one-way translation only
- Unpaired training data
Idea:
- Apply a unified model to learn a joint representation across domains
- Enforce consistency in both the image space and the feature space
Taigman et al., "Unsupervised cross-domain image generation," ICLR 2017.

Domain Transfer Networks: Learning
A unified model $G = \{f, g\}$ (feature extractor f, generator g) translates across domains:
$G^* = \arg\min_G \max_D L_{const}(G) + L_{feat}(G) + L_{GAN}(G, D)$
Consistency in the image and feature spaces:
- Image consistency (target-domain images y should be near fixed points of G): $L_{const}(G) = \mathbb{E}[\| g(f(y)) - y \|^2]$
- Feature consistency (translating a source image x should preserve its f-features): $L_{feat}(G) = \mathbb{E}[\| f(g(f(x))) - f(x) \|^2]$
Adversarial loss:
$L_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(x)))] + \mathbb{E}[\log(1 - D(G(y)))] + \mathbb{E}[\log D(y)]$
Taigman et al., "Unsupervised cross-domain image generation," ICLR 2017.
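A hedged sketch of the two consistency terms, treating f as a fixed feature extractor and g as the trainable generator (both placeholder names):

```python
# Hypothetical DTN consistency losses in PyTorch.
import torch.nn.functional as F

def dtn_consistency_losses(f, g, x_src, y_tgt):
    # Feature consistency: translation preserves f-features,
    # || f(g(f(x))) - f(x) ||^2
    fx = f(x_src)
    l_feat = F.mse_loss(f(g(fx)), fx)
    # Image consistency: target images are (near) fixed points of G,
    # || g(f(y)) - y ||^2
    l_img = F.mse_loss(g(f(y_tgt)), y_tgt)
    return l_feat, l_img
```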

DTN: Example Results
(Figure: SVHN → MNIST and Photo → Emoji translations.)
Taigman et al., "Unsupervised cross-domain image generation," ICLR 2017.


StarGAN
Goal: a unified GAN for multi-domain image-to-image translation.
(Figure: traditional cross-domain models need a separate generator $G_{ij}$ for every domain pair; StarGAN uses a single unified G for all domains.)
Choi et al., "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," CVPR 2018.

StarGAN
Goal / problem setting:
- A single image translation model across multiple domains
- Unpaired training data
Idea:
- Concatenate the image and the target domain label as the input of the generator
- Add an auxiliary domain classifier on the discriminator
- Enforce cycle consistency across domains
Choi et al., "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," CVPR 2018.

StarGAN: Learning
Overall objective function:
$G^* = \arg\min_G \max_D L_{GAN}(G, D) + L_{cls}(G, D) + L_{cyc}(G)$
Adversarial loss (x: input image, c: target domain label, y: real image):
$L_{GAN}(G, D) = \mathbb{E}[\log(1 - D(G(x, c)))] + \mathbb{E}[\log D(y)]$
Domain classification loss (disentanglement): classify real data w.r.t. its own domain label c' and generated data w.r.t. the assigned target label c:
$L_{cls}(G, D) = \mathbb{E}[\log D_{cls}(c' \mid y)] + \mathbb{E}[\log D_{cls}(c \mid G(x, c))]$
Cycle consistency loss (translate to the target domain c, then back to the original domain c'):
$L_{cyc}(G) = \mathbb{E}[\| G(G(x, c), c') - x \|_1]$
Choi et al., "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," CVPR 2018.
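An illustrative sketch assembling the generator-side objective from the three terms above; G, D (assumed to return an adversarial logit and domain logits), and the loss weights are assumptions, not the official implementation:

```python
# Hypothetical StarGAN generator loss in PyTorch.
import torch
import torch.nn.functional as F

def stargan_g_loss(G, D, x, c_src, c_tgt, lambda_cls=1.0, lambda_cyc=10.0):
    fake = G(x, c_tgt)                        # translate x to the target domain
    fake_logit, fake_domain = D(fake)
    # Adversarial: fool D on the translated image.
    adv = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    # Domain classification: the translated image should be classified as c_tgt.
    cls = F.cross_entropy(fake_domain, c_tgt)
    # Cycle consistency: translating back to the source domain recovers x.
    cyc = F.l1_loss(G(fake, c_src), x)
    return adv + lambda_cls * cls + lambda_cyc * cyc
```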

StarGAN: Example Results
StarGAN can to some extent be viewed as a representation disentanglement model, rather than merely an image translation one.
(Figure: translation results across multiple attribute domains.)
GitHub page: https://github.com/yunjey/stargan
Choi et al., "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," CVPR 2018.


Cross-Domain Representation Disentanglement (CDRD)
Goal / problem setting:
- Learn a cross-domain joint disentangled representation, with supervision in a single (source) domain only
- Cross-domain image translation with an attribute of interest
Idea:
- Bridge the domain gap across domains
- Add an auxiliary attribute classifier on the discriminator
- The semantics of the disentangled factor are learned from labels in the source domain
(Figure: a joint latent space shared by both domains.)
Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation," CVPR 2018.

CDRD
Goal / problem setting:
- Learn a cross-domain joint disentangled representation; supervision (e.g., w/ or w/o eyeglasses) is available only in the source domain, while the target domain is unsupervised
- Cross-domain image translation with the attribute of interest
Idea:
- Based on GAN: the generator G maps noise z and an attribute label l to images $\tilde{X}_S$, $\tilde{X}_T$ in both domains; the discriminator D predicts real or fake, and an auxiliary attribute classifier on D classifies w.r.t. the source label $l_S$ or the assigned label l
- Bridge the domain gap with a division of high-level (shared) and low-level (domain-specific) layers
- The semantics of the disentangled factor are learned from label information in the source domain
Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation," CVPR 2018.

CDRD: Learning
Overall objective function:
$G^* = \arg\min_G \max_D L_{GAN}(G, D) + L_{cls}(G, D)$
Adversarial loss, summed over the source (S) and target (T) domains ($X$: real images, $\tilde{X}$: generated images; $D_C$ is the shared real/fake head on top of the domain-specific layers $D_S$, $D_T$):
$L_{GAN}(G, D) = L_{GAN}^S(G, D) + L_{GAN}^T(G, D)$
$L_{GAN}^S(G, D) = \mathbb{E}[\log D_C(D_S(X_S))] + \mathbb{E}[\log(1 - D_C(D_S(\tilde{X}_S)))]$
$L_{GAN}^T(G, D) = \mathbb{E}[\log D_C(D_T(X_T))] + \mathbb{E}[\log(1 - D_C(D_T(\tilde{X}_T)))]$
Disentanglement loss:
$L_{cls}(G, D) = L_{cls}^S(G, D) + L_{cls}^T(G, D)$
$L_{cls}^S(G, D) = \mathbb{E}[\log P(l = \tilde{l} \mid \tilde{X}_S)] + \mathbb{E}[\log P(l = l_S \mid X_S)]$ (AC-GAN-style: labeled real source data and generated data)
$L_{cls}^T(G, D) = \mathbb{E}[\log P(l = \tilde{l} \mid \tilde{X}_T)]$ (InfoGAN-style: generated target data only, since the target domain is unlabeled)
Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation," CVPR 2018.

CDRD
- Add an additional encoder, so the input can be an image instead of Gaussian noise
- This enables image translation with the attribute of interest

CDRD: Experiment Results
(Figure: translation results given the input images, with no label supervision in the target domain.)
Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation," CVPR 2018.

CDRD: Experiment Results
(Figure: cross-domain classification comparisons against prior methods on digits, face, and scene benchmarks, plus a t-SNE visualization for digits.)
Wang et al., "Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation," CVPR 2018.

Comparisons (columns 1-4: cross-domain image translation; columns 5-6: representation disentanglement)

Method      | Unpaired training data | Multi-domain | Bi-directional | Joint representation | Unsupervised | Interpretable factor
Pix2pix     | X | X | X | X | (cannot disentangle representation)
CycleGAN    | O | X | O | X | (cannot disentangle representation)
StarGAN     | O | O | O | X | (cannot disentangle representation)
UNIT        | O | X | O | O | (cannot disentangle representation)
DTN         | O | X | X | O | (cannot disentangle representation)
InfoGAN     | (cannot translate images across domains) | O | X
AC-GAN      | (cannot translate images across domains) | X | O
CDRD (ours) | O | O | O | O | Partially | O

What's to Be Covered Today
- Visualization of NNs: t-SNE, visualization of feature maps, neural style transfer
- Understanding NNs: image translation, feature disentanglement
- Recurrent Neural Networks: LSTM & GRU
Many slides from Fei-Fei Li, Bhiksha Raj, Yuying Yeh, and Hsuan-I Ho

What We Have Learned
- CNN for image classification (e.g., predicting DOG / CAT / MONKEY)

What We Have Learned
- Generative models (e.g., GANs) for image synthesis

But...
- Is it sufficiently effective and robust to perform visual analysis from a single image? E.g., are the hands of this man tied?


So What Are the Limitations of CNNs?
- Cannot easily model sequential data; both input and output might be sequences (of scalars or vectors)
- Simply feed-forward processing: cannot memorize, no long-term feedback

Example of (Visual) Sequential Data
https://quickdraw.withgoogle.com/#

More Applications in Vision: Image Captioning
Figure from Vinyals et al., "Show and Tell: A Neural Image Caption Generator," CVPR 2015.

More Applications in Vision: Visual Question Answering (VQA)
(Figure: question as input, answer as output.) Figure from Zhu et al., "Visual7W: Grounded Question Answering in Images," CVPR 2016.

How to Model Sequential Data?
Deep learning for sequential data:
- Recurrent neural networks (RNNs)
- 3D convolutional neural networks

Recurrent Neural Networks
- Parameter sharing + unrolling: keeps the number of parameters fixed and allows sequential inputs of varying lengths
- Memory: captures and preserves information extracted from earlier time steps

Recurrence Formula
The same function and parameters are used at every time step:
$h_t, y_t = f_W(h_{t-1}, x_t)$
where $h_t$ is the new state at time t, $y_t$ the output vector at time t, $h_{t-1}$ the state at time t-1, $x_t$ the input vector at time t, and $f_W$ a function with parameters W.

Recurrence Formula
For a vanilla RNN:
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
$y_t = W_{hy} h_t$
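A minimal NumPy sketch of this recurrence (shapes and initialization are illustrative):

```python
# One step of a vanilla RNN, matching the equations above.
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t);  y_t = W_hy h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

# Parameter sharing + unrolling: the same weights are reused at every step.
# h = np.zeros(hidden_dim)
# for x_t in sequence:
#     h, y = rnn_step(x_t, h, W_hh, W_xh, W_hy)
```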

Multiple Recurrent Layers
(Figure: stacking several recurrent layers, where each layer's hidden states feed the layer above.)

(Figure: unrolling the RNN through time. The initial state $h_0$ and input $x_1$ pass through $f_W$ to give $h_1$; then $h_1$ and $x_2$ give $h_2$, and so on up to $h_T$. The same weights $W_{hh}$ (state-to-state) and $W_{xh}$ (input-to-state) are reused at every step, and each output is read out as $y_t = W_{hy} h_t$.)

(Figure: RNN input/output configurations, with example tasks: image captioning, action recognition, video prediction, and video indexing.)

Example: Image Captioning
Figure from Karpathy et al., "Deep Visual-Semantic Alignments for Generating Image Descriptions," CVPR 2015.

(Figure sequence: a CNN encodes the input image, and the RNN decoder then generates the caption word by word.)

Example: Action Recognition
Figure from "Action Recognition in Video Sequences Using Deep Bi-Directional LSTM with CNN Features."

(Figure: per-frame features $x_1, \dots, x_T$ feed the unrolled RNN; each output $y_t$ passes through a softmax over action classes such as golf, standing, and exercise. Training compares the softmax outputs against the action label with a cross-entropy loss, back-propagated through time.)

Back Propagation Through Time (BPTT)
Pascanu et al., "On the difficulty of training recurrent neural networks," ICML 2013.

Training RNNs via BPTT

Training RNNs: Forward Pass
- For each training instance, pass the entire data sequence through the net and generate the outputs.

Training RNNs: Computing Gradients
- After the forward pass of each training instance, run a backward pass to compute gradients via backpropagation; more specifically, BPTT.

Training RNNs: BPTT
- Let's focus on one training instance.
- The divergence is computed between the network's sequence of outputs and the desired output sequence.
- In general, this is not just the sum of the divergences at individual time steps.

Gradient Vanishing & Exploding
- Computing the gradient involves many repeated factors of W, one per time step.
- Exploding gradients: largest singular value of W > 1.
- Vanishing gradients: largest singular value of W < 1.

Solutions
- Gradient clipping: rescale gradients if they grow too large (a short sketch follows). (Figure: standard gradient descent trajectory vs. the trajectory with gradient clipping.)
- How about vanishing gradients?
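A minimal PyTorch sketch of gradient clipping inside a training step (all names are placeholders). Note that clipping tames exploding gradients only; the vanishing case motivates the gated cells below:

```python
# Hypothetical training step with gradient clipping.
import torch

def train_step(model, optimizer, x, y, loss_fn, max_norm=5.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale all gradients so their global L2 norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```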

Variants of RNN
- Long Short-Term Memory (LSTM) [Hochreiter & Schmidhuber, 1997]
  - Additional memory cell; input/forget/output gates
  - Handles gradient vanishing; learns long-term dependencies
- Gated Recurrent Unit (GRU) [Cho et al., EMNLP 2014]
  - Similar to LSTM, but with no additional memory cell; reset/update gates
  - Fewer parameters than LSTM, with comparable performance [Chung et al., NIPS Workshop 2014]

Vanilla RNN vs. LSTM
(Figure: a vanilla RNN cell passes a single hidden state between time steps; an LSTM adds a memory cell with forget, input, and output gates, an input activation, and separate cell and hidden states.)

Long Short-Term Memory (LSTM)
(Figure: LSTM memory cell with forget gate, input gate, input activation, output gate, cell state, and hidden state; input and output at time t.) Image credit: Hung-Yi Lee.

Long Short-Term Memory (LSTM)
The cell state and hidden state are updated through four components: forget gate, input gate, input activation, and output gate.

Long Short-Term Memory (LSTM)
Calculate the forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

Long Short-Term Memory (LSTM)
Calculate the input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

Long Short-Term Memory (LSTM)
Calculate the input activation: $g_t = \tanh(W_g \cdot [h_{t-1}, x_t] + b_g)$

Long Short-Term Memory (LSTM)
Calculate the output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

Long Short-Term Memory (LSTM)
Update the memory cell: $c_t = f_t \odot c_{t-1} + i_t \odot g_t$

Long Short-Term Memory (LSTM)
Calculate the output: $h_t = o_t \odot \tanh(c_t)$

Long Short-Term Memory (LSTM)
LSTM prevents gradient vanishing when the forget gate is open (> 0).
Remarks:
- The forget gate f provides a shortcut connection across time steps.
- The relationship between $c_t$ and $c_{t-1}$ is linear (gated addition), instead of the multiplicative relationship between $h_t$ and $h_{t-1}$ in a vanilla RNN.
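Putting the gate equations together, a NumPy sketch of one LSTM step (weights are assumed pre-initialized over the concatenated $[h_{t-1}, x_t]$):

```python
# One LSTM step, matching the standard gate equations above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ hx + bf)      # forget gate
    i = sigmoid(Wi @ hx + bi)      # input gate
    g = np.tanh(Wg @ hx + bg)      # input activation
    o = sigmoid(Wo @ hx + bo)      # output gate
    c_t = f * c_prev + i * g       # cell state: linear (gated) in c_prev
    h_t = o * np.tanh(c_t)         # hidden state
    return h_t, c_t
```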


LSTM vs. GRU
- LSTM: forget gate, input gate, input activation, output gate; cell state and hidden state.
- GRU: reset gate, update gate, input activation; hidden state only.

Gated Recurrent Unit (GRU)
(Figure: GRU cell; the hidden state is updated through reset and update gates and a tanh input activation.)

Gated Recurrent Unit (GRU)
Calculate the reset gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$

Gated Recurrent Unit (GRU)
Calculate the update gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$

Gated Recurrent Unit (GRU)
Calculate the input activation: $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$

Gated Recurrent Unit (GRU)
Calculate the output: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

Gated Recurrent Unit (GRU)
GRU prevents gradient vanishing when the update gate is open.
Remarks:
- The update gate z provides a shortcut connection across time steps.
- The relationship between $h_t$ and $h_{t-1}$ is linear (gated interpolation), instead of the multiplicative relationship in a vanilla RNN.
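Analogously to the LSTM sketch above, one GRU step in NumPy (weights assumed pre-initialized over the concatenated inputs):

```python
# One GRU step, matching the equations above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ hx)                                        # reset gate
    z = sigmoid(Wz @ hx)                                        # update gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))   # input activation
    h_t = (1 - z) * h_prev + z * h_tilde                        # linear in h_prev
    return h_t
```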

Vanilla RNN, LSTM, & GRU
(Figure: side-by-side cell diagrams. The vanilla RNN applies a single tanh to $[h_{t-1}, x_t]$; the LSTM and GRU add the gating structures described above.)

LSTM vs. GRU

            | Cell state | Number of gates | Parameters      | Gradient vanishing / exploding
Vanilla RNN | No         | 0               | Fewest          | Suffers from both
LSTM        | Yes        | 3               | Most            | Mitigates vanishing (open forget gate)
GRU         | No         | 2               | Fewer than LSTM | Mitigates vanishing (open update gate)

References
- Book: http://www.deeplearningbook.org/contents/rnn.html
- Virginia Tech: https://computing.ece.vt.edu/~f15ece6504/slides/l14_rnns.pptx.pdf and https://computing.ece.vt.edu/~f15ece6504/slides/l15_lstms.pptx.pdf
- CS231n: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
- Jia-Bin Huang: https://filebox.ece.vt.edu/~jbhuang/teaching/ece5554-4554/fa17/lectures/lecture_23_actionrec.pdf
- MLDS: https://www.csie.ntu.edu.tw/~yvchen/f106-adl/doc/171030+171102_attention.pdf

Web Tutorials
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html

What We've Learned Today
- Understanding NNs: image translation, feature disentanglement
- Recurrent Neural Networks: LSTM & GRU
HW4 is due 5/18 (Fri.) 23:59!