arxiv: v3 [cs.cv] 24 Jul 2018


Ranking CGANs: Subjective Control over Semantic Image Attributes

Yassir Saquil, Kwang In Kim, and Peter Hall
University of Bath

Abstract

In this paper, we investigate the use of generative adversarial networks in the task of image generation according to subjective measures of semantic attributes. Unlike the standard conditional GAN (CGAN), which generates images from discrete categorical labels, our architecture handles both continuous and discrete scales. Given pairwise comparisons of images, our model, called RankCGAN, performs two tasks: it learns to rank images using a subjective measure, and it learns a generative model that can be controlled by that measure. RankCGAN associates each subjective measure of interest with a distinct dimension of some latent space. We perform experiments on the UT-Zap50K, PubFig and OSR datasets and demonstrate that the model is expressive and diverse enough to conduct two-attribute exploration and image editing.

1 Introduction

Humans routinely order objects subjectively on the basis of semantic attributes. For example, people agree on what is "sporty" or "stylish", at least to a sufficient extent that meaningful communication is possible. Hence one can imagine applications such as a shopping assistant in which a user can browse a collection using subjective terms: "show me something more elegant than the dress I am looking at now". However, such subjective concepts are ill-defined in a mathematical sense, making the building of a computational model of them a significant open challenge.

In this paper, we propose a neural architecture that addresses the problem of synthesizing natural images by controlling subjective measures of semantic attributes. For instance, we want a system that produces rank-ordered images of shoes according to the subjective degree to which they exhibit a semantic attribute such as "sporty", or pairs of such attributes, "comfortable" and "light" for example. The underlying mechanism is a Conditional Generative Adversarial Network (CGAN) [33].
CGANs provide a low-dimensional latent space for realistic image generation, where images are synthesised by sampling from the latent space. However, controlling a CGAN to generate an image with a specific feature is challenging, because the input to the CGAN is an n-dimensional noise vector. Our work shows it is possible to decompose this latent space into subspaces: one is a random space, as usual in CGAN; the other is occupied by control variables, in particular subjectively meaningful control variables.

In recent work, semantic (subjective) attributes were defined as categorical labels [7, 6] indicating the presence or absence of some attributes. A conditional generative model can be trained, under supervision, where the labels are assigned to a latent variable. This results in some control over the generated images, but that control is limited to switching the desired attribute on or off: it is binary. In contrast, we provide a mapping of semantic attributes onto a continuous subjective scale.

Training is a particular issue for ranked subjective attributes. Although humans can order elements in an object class, there is typically no scale associated with the ordering: people can say one pair of shoes is more sporty than another pair, but not by how much. To address this problem we train using an annotated set of image pairs where each pair is ordered under human supervision. A ranking function is learned per attribute, which can predict the rank of a novel image.

The key characteristic of our approach is the addition of a ranking unit that operates alongside the usual discriminator unit. Both the ranker and the discriminator receive inputs from a generator. As usual, the role of the discriminator is to distinguish between real and fake images; the role of the ranker is to infer a subjective ranking according to the semantic attributes. We call our architecture RankCGAN. We evaluated RankCGAN on three datasets: shoes (UT-Zap50K) [51], faces (PubFig) [21] and scenes (OSR) [35].
Results in Section 4 show that our model can disentangle multiple attributes and can keep a correct continuous variation of the attribute's strength with respect to a ranking score, which is not guaranteed with a standard CGAN.

Contributions: In sum, our contributions are: (1) a solution to the problem of the subjective rank ordering of semantic attributes; (2) a novel conditional generative model called RankCGAN, which can generate images under semantic attributes that are subject to a global subjective ranking; (3) a training scheme that requires only pairs of images to be subjectively ranked. There is no need for a global ranking or to annotate the whole dataset. The experimental section, Section 4, provides evidence for these claims. We show quantitative and qualitative results, as well as applications in attribute-based image generation, editing and transfer tasks.

2 Related work

Deep Generative Models: There is substantial literature on learning with deep generative models. Early studies were based on unsupervised learning using restricted Boltzmann machines and denoising auto-encoders [10, 11, 48, 44]. More recently, deterministic networks [5, 50, 40] have been proposed as architectures for image synthesis. In comparison, stochastic networks rely on a probabilistic formulation of the problem. The Variational Autoencoder (VAE) [19, 41] maximizes a lower bound on the log-likelihood of the training data. Autoregressive models, i.e. PixelRNN [34], directly represent the conditional distribution over the pixel space. Generative Adversarial Networks (GANs) [8] have the ability to generate sharp image samples on datasets with higher resolution.

Several studies have investigated conditional image generation settings. Most of the methods use a supervised approach, such as text, attribute and class label conditioning, to learn a desired transformation [33, 47, 30, 39]. Additionally, there are works on image-conditioned models, such as style transfer [15, 55], super-resolution [42, 24] and cross-domain image translation [13, 31, 17, 25].

In contrast, fewer works have focused on decomposing the generative control variable into meaningful components. InfoGAN [4] uses an unsupervised approach to learn semantic features by maximizing the mutual information between the latent code and the generated observation.
In the spectrum of supervised approaches, VAE [19] was used in DC-IGN [20] to learn latent codes representing the rendering process of 3D objects, and similarly in Attribute2Image [49] to generate images by separately generating the background and the foreground. Attempts to incorporate adversarial training objectives were made in VAE [32] and autoencoder [29] settings. The Fader network [22] is an end-to-end architecture where the attribute is incorporated in the encoded image. Trained using categorical labels, it controls the presence of an attribute while preserving the naturalness of the image, which requires careful parametrization during training since the decoder is updated on both the reconstruction and adversarial objectives. Independently, CFGAN [16] proposes an extension of CGAN where the attribute value is not fed directly to the model but associated with the latent vector using a filtering architecture, which enables more variation of the attribute and hence more control options, e.g. radio buttons or sliders.

To the best of our knowledge, we are the first to incorporate a pairwise ranker in a GAN for the image generation task. In contrast, a recent work [26] proposes a method called RankGAN, which substitutes the discriminator of a GAN with a ranker for generating high-quality natural language descriptions. There, the ranker's objective is to order human-written sentences higher than machine-written sentences with respect to reference sentences, and the generator's objective is to produce a synthetic sentence that receives a higher ranking score than those drawn from the real data. If this method were transposed to image generation, the ranker would be modelled to order real images higher than generated ones. In our work, the ranker plays a different role: it orders the images according to the semantic attributes they exhibit rather than the quality of the images.

Image Editing: Image editing methods have a long history in the research community.
Recently, CNN-based approaches have shown promising results in image editing tasks such as image colorization [12, 23, 53], filtering [28] and inpainting [36]. These methods use an unsupervised training protocol based on the reconstruction of the image, which may not capture important semantic content. A handful of works use deep generative models for image editing tasks. iGAN [54] allows the user to impose color and shape constraints on an input image, then reformulates them as an optimization problem to find the best latent code of the GAN satisfying these constraints. The motion and color flow of the generated image are transferred to the real image during the interpolation. Similarly, the Neural photo editor [2] proposes an interface for portrait editing using a hybrid VAE/GAN architecture.

Aiming for image editing at a high semantic level, the Invertible Conditional GAN [37] separately trains a CGAN on the attributes of interest and an encoder that maps the input image to the latent space of the CGAN, so that the image can be modified using the attribute latent variables. In the same context, CFGAN [16] relies on iGAN's [54] approach to estimate the latent variables of an input image in order to edit its attributes, while the Fader network [22] can directly manipulate the attributes of the input image since the encoder is incorporated in the training process.

Relative Attributes: Visual concepts can be represented or described by semantic attributes. In early studies, binary attributes describing the presence of an attribute showed excellent performance in object recognition [45] and action recognition [27]. However, quantifying the strength of an attribute calls for a richer representation. Parikh et al. [35] use relative attributes to learn a global ranking function on images using constraints describing the relative emphasis of attributes (e.g. pairwise comparisons of images). This approach is regarded as solving a learning-to-rank problem where a linear function is learned based on RankSVM [14]. Similarly, RankNet [3] uses a neural network to model the ranking function, trained with gradient descent methods. Also, Tompkin et al.'s Criteria Sliders [46] applies semi-supervised learning to learn user-defined ranking functions. These relative attribute algorithms focus on predicting attributes on existing data entries. Our algorithm can be regarded as an extension that enables new data to be synthesized continuously.

3 Approach

In this section, we describe our contribution: the ranking conditional GAN (RankCGAN), which allows image synthesis and image editing to be controlled by semantic attributes using subjectively specified scales. We first outline the generative CGAN [33]; then we outline the discriminative RankNet [3]; and last we show how to combine these distinct architectures to build RankCGAN.

3.1 Generation by CGAN

The CGAN [33] is an extension of the GAN [8] to a conditional setting. The generative adversarial network (GAN) is a generative model trained using a two-player minimax game. It consists of two networks: a generator G, which outputs a generated image $G(z)$ given a latent variable $z \sim p_z(z)$, and a discriminator D, which is a binary classifier that outputs the probability $D(x)$ of the input image x being real, that is, sampled from the true data distribution $x \sim p_{data}(x)$. The minimax objective is formulated as follows:

\[ \min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{1} \]

In this model, there is no control over the latent space: the prior $p_z(\cdot)$ on the multidimensional latent variable z is often chosen as a multivariate normal $\mathcal{N}(0, I)$ or a multidimensional uniform $\mathcal{U}(-1, 1)$ distribution. Thus sampling generates images at random. CGAN [33] affords some control over the generation process via an additional latent variable, r, that is passed to both the generator and the discriminator. The objective of CGAN can be expressed as:

\[ \min_G \max_D \; \mathbb{E}_{r, x \sim p_{data}(r, x)}[\log D(r, x)] + \mathbb{E}_{z \sim p_z(z),\, r \sim p_r(r)}[\log(1 - D(r, G(r, z)))]. \tag{2} \]

We can now consider the total latent space to be partitioned into two parts: a random subspace for z, which operates exactly as in a standard GAN, and an attribute subspace containing r, over which the user has some control. The novel question we answer in this paper is how to provide subjective control over this subspace. Our answer depends on an ability to discriminatively rank data.

3.2 Discrimination by RankNet

RankNet [3] is a discriminative architecture in the sense that it classifies a pair of inputs, $x_i$ and $x_j$, according to their rank order: $x_i \succ x_j$ or $x_j \succ x_i$. As proposed in [43], the architecture comprises two parts: (i) a mapping of any input x to a feature vector $v(x)$; and (ii) a ranking function R over these feature vectors. These components are modelled by convolutional layers. We use the shorthand $x_i \succ x_j$ or $R(x_i) > R(x_j)$ in place of $R(v(x_i)) > R(v(x_j))$, meaning that input $x_i$ is ranked higher than input $x_j$. The convolutional layers in the first part can come pre-trained, or can be trained in situ; either way, the focus here is on setting the convolutional layer parameters defining the ranking function R.

The training is supervised by a set of triplets $\{(x_i^{(1)}, x_i^{(2)}, y_i)\}_{i=1}^{P}$, where P is the dataset size, $(x_i^{(1)}, x_i^{(2)})$ is a pair of images and $y_i \in \{0, 1\}$ is a binary label indicating whether the image $x_i^{(1)}$ exhibits more of some attribute than $x_i^{(2)}$, or not.
The loss function for a pair of images $(x_i^{(1)}, x_i^{(2)})$ along with the target label $y_i$ is defined as a binary cross-entropy:

\[ L_R(x_i^{(1)}, x_i^{(2)}, y_i) = -y_i \log(p_i) - (1 - y_i) \log(1 - p_i), \tag{3} \]

where the posterior probabilities $p_i = P(x_i^{(1)} \succ x_i^{(2)})$ make use of the estimated ranking scores:

\[ p_i = \mathrm{sig}\big(R(x_i^{(1)}) - R(x_i^{(2)})\big) := \frac{1}{1 + e^{-(R(x_i^{(1)}) - R(x_i^{(2)}))}}. \tag{4} \]

In testing, the ranker provides a ranking score for the input image, which can be used to infer a global attribute ordering.
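The pairwise loss of Equations (3) and (4), and the use of ranking scores at test time, can be illustrated with a minimal sketch. It assumes the ranking scores R(x) have already been computed by some network; the function names `pairwise_rank_loss` and `global_order` are illustrative and do not come from the paper or from any library.

```python
import math


def log_sigmoid(t: float) -> float:
    """Numerically stable log(sig(t)), with sig(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def pairwise_rank_loss(score_1: float, score_2: float, label: int) -> float:
    """Binary cross-entropy of Equation (3) for one ordered pair.

    score_1, score_2: ranking scores R(x1), R(x2), assumed precomputed.
    label: 1 if x1 exhibits more of the attribute than x2, else 0.
    """
    diff = score_1 - score_2
    log_p = log_sigmoid(diff)              # log p, with p from Equation (4)
    log_one_minus_p = log_sigmoid(-diff)   # log(1 - p), since 1 - sig(t) = sig(-t)
    return -(label * log_p + (1 - label) * log_one_minus_p)


def global_order(scores: list) -> list:
    """Indices of the images sorted from highest to lowest ranking score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

With equal scores the ranker is undecided (p = 0.5) and the loss is log 2 for either label; the loss falls as the score difference agrees more strongly with the label.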

Figure 1: The difference between CGAN and RankCGAN architectures. The dark grey represents the latent variables, light grey represents the observable images, the light blue indicates the output of the discriminator and the ranker, and the dark blue indicates their loss functions.

3.3 Ranking Conditional Generative Adversarial Networks

Recall our aim: to generate images of a particular object class, controlled by one or more subjective attributes. With CGAN and RankNet available, we are now in a position to construct an architecture for this aim, a Ranking Conditional Generative Adversarial Network, which we call RankCGAN.

The key novelty of RankCGAN is the addition of an explicit ranking component, trained end-to-end with the generator and the discriminator. This training scheme is intended to impose semantic ordering constraints on the generation process with respect to the input latent variables. A fundamental difference between RankCGAN and CGAN is that CGAN uses the matching-aware discriminator [39], which incorporates the label in the discriminator so that a real image is defined as one from the dataset with the correct label. RankCGAN does not use this setting because it is trained on continuous controlling variables rather than binary labels.

As illustrated in Figure 1, the architecture of RankCGAN is composed of three parts: the generator G, the discriminator D, and the ranker R. This architecture is versatile enough to rank over a line using one semantic attribute, to specify points in a plane using two semantic attributes, and in principle could operate in three or more dimensions. For simplicity's sake, we open our description with the one-dimensional case, followed by a generalization to n dimensions.
Finally, we augment RankCGAN with an encoder that allows users to specify the subjective degree of semantic attributes, and in that way control image editing.

The one-attribute case

The generator G takes two inputs: $z \sim \mathcal{N}(0, I)$, the unconditional latent vector, and $r \sim \mathcal{U}(-1, 1)$, the latent variable controlling the attribute. The generator outputs an image $x = G(r, z)$. This image is input not just to a discriminating unit, $D(x)$, as in CGAN, but also to a ranking unit $R(x)$, as seen in Figure 1. Consequently, the architecture has three sets of parameters $\Theta_G$, $\Theta_D$ and $\Theta_R$ for the generator, discriminator and ranker, respectively. The values of these parameters are determined by a training procedure that involves three loss functions, one for each unit.

The training is supervised in nature. As for RankNet, let $\{(x_i^{(1)}, x_i^{(2)}, y_i)\}_{i=1}^{P}$ be pairwise comparisons and $\{x_i\}_{i=1}^{N}$ an image dataset of size N. These datasets are used in a mini-batch training scheme, with batches of size B. We define three loss functions, one for each task: generation ($L_G$), discrimination ($L_D$), and ranking ($L_R$). For generation:

\[ L_G(I) = -\frac{1}{B} \sum_{i=1}^{B} \log[D(I_i)], \tag{5} \]

such that $\{I_i\}_{i=1}^{B} = \{G(r_i, z_i)\}_{i=1}^{B}$. For discrimination:

\[ L_D(I) = -\frac{1}{B} \sum_{i=1}^{B} \Big( t \log[D(I_i)] + (1 - t) \log[1 - D(I_i)] \Big), \tag{6} \]

where $t = 1$ if $\{I_i\}_{i=1}^{B} = \{x_i\}_{i=1}^{B}$ and $t = 0$ if $\{I_i\}_{i=1}^{B} = \{G(r_i, z_i)\}_{i=1}^{B}$. For ranking:

\[ L_R(I^{(1)}, I^{(2)}, l) = -\frac{2}{B} \sum_{i=1}^{B/2} \Big( l_i \log\big[\mathrm{sig}\big(R(I_i^{(1)}) - R(I_i^{(2)})\big)\big] + (1 - l_i) \log\big[1 - \mathrm{sig}\big(R(I_i^{(1)}) - R(I_i^{(2)})\big)\big] \Big), \tag{7} \]

with

\[ \big\{(I_i^{(1)}, I_i^{(2)}, l_i)\big\}_{i=1}^{B/2} = \begin{cases} \big\{(x_i^{(1)}, x_i^{(2)}, y_i)\big\}_{i=1}^{B/2}, & \text{for real images;} \\ \big\{\big(G(r_i^{(1)}, z_i^{(1)}),\, G(r_i^{(2)}, z_i^{(2)}),\, f(r_i^{(1)}, r_i^{(2)})\big)\big\}_{i=1}^{B/2}, & \text{for synthesised images;} \end{cases} \]

\[ f(r_i^{(1)}, r_i^{(2)}) = \begin{cases} 1 & \text{if } r_i^{(1)} > r_i^{(2)}, \\ 0 & \text{else.} \end{cases} \]

Based on these loss functions, the training algorithm for RankCGAN is defined by the adversarial training procedure presented as Algorithm 1. The hyperparameter λ controls the contribution of the ranker, relative to the discriminator, during the updates of the generator.

Algorithm 1 RankCGAN
  Set the learning rate η, the batch size B, and the number of training iterations S
  Initialize the network parameters $\Theta_D$, $\Theta_R$, $\Theta_G$
  Data: image set $\{x_i\}_{i=1}^{N}$, pair set $\{(x_i^{(1)}, x_i^{(2)}, y_i)\}_{i=1}^{P}$
  for n = 1 to S do
    Get real image mini-batches: $x_{real} = \{x_i\}_{i=1}^{B}$, $x_{real}^{(pair)} = \{(x_i^{(1)}, x_i^{(2)}, y_i)\}_{i=1}^{B/2}$
    Get fake image mini-batches: $x_{fake} = \{G(r_i, z_i)\}_{i=1}^{B}$, $x_{fake}^{(pair)} = \{(G(r_i^{(1)}, z_i^{(1)}), G(r_i^{(2)}, z_i^{(2)}), f(r_i^{(1)}, r_i^{(2)}))\}_{i=1}^{B/2}$
    Update the discriminator D: $\Theta_D \leftarrow \Theta_D - \eta \left( \frac{\partial L_D(x_{real})}{\partial \Theta_D} + \frac{\partial L_D(x_{fake})}{\partial \Theta_D} \right)$
    Update the ranker R: $\Theta_R \leftarrow \Theta_R - \eta \frac{\partial L_R(x_{real}^{(pair)})}{\partial \Theta_R}$
    Update the generator G: $\Theta_G \leftarrow \Theta_G - \eta \left( \frac{\partial L_G(x_{fake})}{\partial \Theta_G} + \lambda \frac{\partial L_R(x_{fake}^{(pair)})}{\partial \Theta_G} \right)$
  end for

Multiple semantic attributes

An interesting feature of RankCGAN is the ability to extend the model to multiple attributes. The architecture design and the training procedure remain intact; the only differences are the incorporation of new latent variables that control the additional attributes, and the structure of the ranker, which outputs a vector of ranking scores, one per attribute. There are two different ways of designing a multi-attribute ranker: either using a separate ranking layer for each attribute, or a single ranking layer shared between all the attributes.

Let $\{(x_i^{(1)}, x_i^{(2)}, y_i)\}_{i=1}^{P}$ be pairwise comparisons where $y_i$ is a vector of binary labels indicating whether $x_i^{(1)} \succ x_i^{(2)}$ or not with respect to each of the S attributes. We define the ranking loss as:

\[ L_R(I^{(1)}, I^{(2)}, l) = \sum_{j=1}^{S} L_{R_j}(I^{(1)}, I^{(2)}, l_j), \tag{9} \]

where $L_{R_j}(I^{(1)}, I^{(2)}, l_j)$ is the ranking loss with respect to attribute j, defined similarly to Equation (7), with $R_j$ the ranker's score for attribute j:

\[ L_{R_j}(I^{(1)}, I^{(2)}, l_j) = -\frac{2}{B} \sum_{i=1}^{B/2} \Big( l_{ij} \log\big[\mathrm{sig}\big(R_j(I_i^{(1)}) - R_j(I_i^{(2)})\big)\big] + (1 - l_{ij}) \log\big[1 - \mathrm{sig}\big(R_j(I_i^{(1)}) - R_j(I_i^{(2)})\big)\big] \Big), \tag{10} \]

such that

\[ \big\{(I_i^{(1)}, I_i^{(2)}, l_{ij})\big\}_{i=1}^{B/2} = \begin{cases} \big\{(x_i^{(1)}, x_i^{(2)}, y_{ij})\big\}_{i=1}^{B/2}, & \text{for real images;} \\ \big\{\big(G(r_{ij}^{(1)}, z_i^{(1)}),\, G(r_{ij}^{(2)}, z_i^{(2)}),\, f(r_{ij}^{(1)}, r_{ij}^{(2)})\big)\big\}_{i=1}^{B/2}, & \text{for synthesised images;} \end{cases} \]

\[ f(r_{ij}^{(1)}, r_{ij}^{(2)}) = \begin{cases} 1 & \text{if } r_{ij}^{(1)} > r_{ij}^{(2)}, \\ 0 & \text{else,} \end{cases} \]

with $y_{ij}$ the j-th element of $y_i$, and $r_{ij}$ the i-th input in the mini-batch associated with the j-th attribute latent variable.

3.4 An Encoder for Image Editing

Once trained, RankCGAN can be used for image synthesis. Since a GAN lacks an inference mechanism, we estimate the latent variables of an input image for editing tasks. Following previous works [54, 37], image editing means describing an image x by a latent variable r and a random vector z, and generating the desired image by manipulating r. The approach consists of creating a dataset of size M from generated images and their latent variables, $\{r_i, z_i, G(r_i, z_i)\}_{i=1}^{M}$, and training on this dataset the encoders $E_r$ and $E_z$, which encode an image to r and z respectively. Their loss functions are defined in the mini-batch setting as follows:

\[ L_{E_z} = \frac{1}{B} \sum_{i=1}^{B} \big\| z_i - E_z(G(r_i, z_i)) \big\|_2^2, \tag{12} \]

and

\[ L_{E_r} = \frac{1}{B} \sum_{i=1}^{B} \big\| r_i - E_r(G(r_i, z_i)) \big\|_2^2. \tag{13} \]

To reach a better estimate of z and r, we can use the manifold projection method proposed in [54], which consists of optimizing the following function:

\[ r^*, z^* = \arg\min_{r, z} \big\| x - G(r, z) \big\|_2^2. \tag{14} \]

Unfortunately, this problem is non-convex, so the estimates obtained for r and z are strongly contingent upon the initial values of r and z; good initial values are provided by the encoders $E_r$, $E_z$. Therefore, we use manifold projection as a polishing step.

4 Empirical Results

In this section, we describe our experimental setup and our quantitative and qualitative experiments, and outline applications in image generation, editing and semantic attribute transfer.

4.1 Datasets

We used three datasets that provide relative attributes: the UT-Zap50K dataset [51], the PubFig dataset [21], and the Outdoor Scene Recognition dataset (OSR) [35].

UT-Zap50K [51] consists of 50,025 shoe images from Zappos.com. The shoes are shown in catalog style on white backgrounds, are mostly of [...] pixels and have the same orientation. We rescaled them to [...] for GAN training purposes. The annotations for pairwise comparisons comprise 4 attributes (sporty, pointy, open, comfortable), with two collections: coarse and fine-grained pairs, where the coarse pairs are easier to discern visually than the fine-grained pairs. For this reason, we relied on the first collection in our experiments.
It contains 1,500 to 1,800 ordered pairs per attribute. Additionally, we defined the "black" attribute by comparing the color histograms of images.

We also conducted experiments on the PubFig dataset [21], consisting of 15,738 successfully downloaded face images of different sizes, rescaled to [...]. A subset of the PubFig dataset is used to build a relative attributes dataset [1] containing 900 facial images of 60 categories and 29 attributes. The ordering of samples in this dataset is annotated at category level, where all images in a specific category may be ranked higher than, equal to, or lower than all images in another category, with respect to an attribute. We used 50,000 pairwise image comparisons from the ordered categories to train the ranker on a specific attribute.

Finally, we evaluated our approach on the Outdoor Scene Recognition dataset (OSR) [35], consisting of 2,688 images from 8 scene categories and 6 attributes. As for the PubFig dataset, the attributes are defined at category level, which enables us to create 50,000 pairwise image comparisons to train the ranker on the desired attribute.

Figure 2: Mean FID (solid line) surrounded by a shaded area bounded by the maximum and the minimum over 5 runs for GAN/RankCGAN on PubFig, OSR and UT-Zap50K datasets. Top left: PubFig with two RankCGANs trained on "masculine" and "smiling" attributes, starting at mini-batch update 3k for better visualisation. Top right: UT-Zap50K with two RankCGANs trained on "sporty" and "black" attributes, starting at mini-batch update 3k for better visualisation. Bottom: OSR with one RankCGAN trained on the "natural" attribute.

Figure 3: Examples of generated shoe images with respect to the "sporty" attribute. The abscissa values represent the ranking score of each image.

4.2 Implementation

Our RankCGAN is built on top of the DCGAN [38] implementation, with the same hyperparameter settings and the same structure for the discriminator D and the generator G. In addition, we modelled the encoders $E_r$, $E_z$ and the ranker R with the same architecture as the discriminator D, except for the last sigmoid layer. We set the hyperparameter λ = 1, the learning rate to [...], the mini-batch size to 64, and trained the networks using mini-batch stochastic gradient descent (SGD) with the Adam optimizer [18]. We trained the networks on the UT-Zap50K and PubFig datasets for 300 epochs and on the OSR dataset for 1,500 epochs.
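To make the role of λ concrete, the sketch below evaluates the quantity that the generator update of Algorithm 1 descends: the generation loss of Equation (5) plus λ times the ranking loss of Equation (7) on synthesised pairs, whose labels come from f(r1, r2). It is only an illustration under stated assumptions: the discriminator outputs and ranker scores are taken as precomputed numbers, no gradients are computed, and all function names are hypothetical rather than part of the authors' implementation.

```python
import math


def log_sigmoid(t: float) -> float:
    """Numerically stable log(sig(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def generator_loss(d_fake: list) -> float:
    """Equation (5): -(1/B) * sum_i log D(I_i) over a fake mini-batch.

    d_fake: discriminator probabilities D(G(r_i, z_i)), each in (0, 1].
    """
    return -sum(math.log(d) for d in d_fake) / len(d_fake)


def order_label(r1: float, r2: float) -> int:
    """f(r1, r2): 1 if r1 > r2, else 0."""
    return 1 if r1 > r2 else 0


def fake_pair_rank_loss(r_pairs: list, score_pairs: list) -> float:
    """Equation (7) on synthesised pairs.

    r_pairs: (r1, r2) latent attribute values used to generate each pair.
    score_pairs: (R(G(r1, z1)), R(G(r2, z2))) ranker scores for each pair.
    Dividing by the number of pairs (B/2) matches the 2/B factor.
    """
    total = 0.0
    for (r1, r2), (s1, s2) in zip(r_pairs, score_pairs):
        label = order_label(r1, r2)
        diff = s1 - s2
        total += label * log_sigmoid(diff) + (1 - label) * log_sigmoid(-diff)
    return -total / len(r_pairs)


def generator_objective(d_fake, r_pairs, score_pairs, lam=1.0):
    """L_G + lambda * L_R on fake pairs, as in the generator update."""
    return generator_loss(d_fake) + lam * fake_pair_rank_loss(r_pairs, score_pairs)
```

The ranking term is small when the ranker's scores for generated images follow the order of the latent values r that produced them, which is how the ordering constraint reaches the generator.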


Figure 4: Examples of the interpolation of similar images with respect to the "sporty" attribute using RankCGAN (rows 1, 3) and CGAN (rows 2, 4).

Figure 5: Examples of generated face images with respect to the "masculine" attribute. The abscissa values represent the ranking score of each image.

4.3 Quantitative Results

A recently proposed evaluation method for GAN models is the Fréchet Inception Distance (FID) [9]. To quantify the quality of generated images, they are embedded in a feature space given by the pool3 layer of the Inception network, of dimension 2048. The embedding is then treated as a continuous multivariate Gaussian distribution, whose mean and covariance are estimated for both real and generated images. Finally, the Fréchet distance between these two Gaussians is used to quantify the quality of the samples.

In Figure 2, we compare our proposed RankCGAN with the original GAN on all datasets to see whether the incorporation of the ranker has an impact on the quality of generated images. We calculate FID every 1,000 mini-batch iterations, using sampled generated images of the same size as the real images in the dataset. We notice that the standard GAN is a little faster at the beginning, but eventually RankCGAN achieves slightly better or equal performance. The FID scores of both RankCGAN and the standard GAN are higher on the OSR dataset, whose images are difficult to generate, and lower on the PubFig dataset. This shows that the generative capability and quality of our model are tied to those of the standard GAN.

4.4 Qualitative Results

We would like an experiment that tests the following hypothesis: incorporating a ranker brings added value to the system, which a CGAN is unable to achieve with its training procedure. Unfortunately, conducting an experiment to test the hypothesis is not straightforward.
The problem is that RankCGAN requires ordered pairs for training on a semantic attribute, whereas CGAN requires binary categorical labels that indicate its presence or absence. To enable a comparison, we follow the method of [46] in mapping pairwise orderings to binary labels. We use a ranker with the same internal architecture as in RankCGAN, but independently trained on the ordered pairs, to map each image in the dataset to a real-valued ranking score. Then, all images with a negative score are said to not have the semantic property, while those with a positive score are said to possess it. The proportion of images with positive and negative scores in all the datasets is balanced around zero, which is used as a threshold. This method enables us to train CGAN on all datasets.

Figure 6: Example of two-attribute interpolation on shoe (left) and face (right) images using ("sporty", "black") and ("masculine", "smiling") attributes.

Figure 7: Examples of generated scene images with respect to the "natural" attribute. The abscissa values represent the ranking score of each image.

Qualitative results are shown in Figure 4 for both RankCGAN and CGAN, in which the subjective scale runs from "not sporty" to "sporty". To produce Figure 4 with the trained networks, we used our encoders $E_r$, $E_z$ to estimate the latent variables r and z of a given real image. Then, we interpolated the image with respect to r to each edge of the interval [−1, 1].

The results suggest that RankCGAN is capable of spanning a wider subjective interval than CGAN. Evidence for this is seen at the extreme ends of each interval. The top line of CGAN (second row) fails to reach shoes that can reasonably be called sporty, while the bottom line (fourth row) fails to include a high heel at the "not sporty" extremity. In contrast, RankCGAN reaches a desirable shoe across the scale in both cases (first and third rows): we see a dress shoe at one end and a sporty shoe at the other.

There is another, more subtle difference: the images produced by RankCGAN seem to change in a smoother way than those produced by CGAN. For example, the bottom line of CGAN shows many boots, with the change to sporty shoes coming late and occurring over a small subjective interval. This suggests that RankCGAN parametrizes the subjective space more uniformly than the interpolation required when using CGAN.
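The two data-preparation steps used for this comparison, thresholding ranking scores at zero to obtain binary labels for CGAN and sweeping r from an estimated value to each edge of [−1, 1], can be sketched as follows. This is a minimal illustration: the function names, the number of interpolation steps and the treatment of a score exactly equal to zero are assumptions made here, not details given in the paper.

```python
def scores_to_binary_labels(scores, threshold=0.0):
    """Map real-valued ranking scores to binary attribute labels.

    Scores above the threshold are labelled 1 (attribute present),
    the rest 0. Assigning a score exactly at the threshold to 0 is an
    arbitrary choice made for this sketch.
    """
    return [1 if s > threshold else 0 for s in scores]


def interpolation_grid(r_hat, steps=5, low=-1.0, high=1.0):
    """Attribute values from `low`, through the estimate r_hat, to `high`.

    r_hat: attribute value estimated for a real image (e.g. by an encoder).
    steps: number of intervals on each side of r_hat (illustrative).
    """
    left = [low + (r_hat - low) * k / steps for k in range(steps)]
    right = [r_hat + (high - r_hat) * k / steps for k in range(1, steps + 1)]
    return left + [r_hat] + right
```

Each value in the grid would be passed to the generator together with the fixed estimate of z to produce one image of an interpolation row.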


Figure 8: Examples of 128 × 128 generated face and shoe images with respect to "masculine" and "sporty" attributes.

Figure 9: Examples of the image editing task, where the latent variables are estimated for an input image to perform the editing with respect to the "sporty" attribute (top) and the "masculine" attribute (bottom).

4.5 Application: Image Generation

Image generation is a basic expectation of any GAN. To demonstrate this for RankCGAN, we choose single semantic attributes that span a line, and pairs of semantic attributes that span a plane. In all cases, the noise vector z input to the generator is held constant: all variation in the generated images arises from changes in the semantic attributes alone.

Figures 3, 5, and 7 show how the generated image varies with respect to the value of some subjective variable. These are "sporty", "masculine", and "natural", respectively; three examples are shown in each case. We note that plausible shoe and face images are generated at every point. Natural scenes are more difficult to generate for a standard GAN, but our images are reasonable and progress from urban to natural in a pleasing manner. In all cases, the semantic attribute changes smoothly over the subjective scale.

Additionally, Figure 8 demonstrates results on generating images for the face and shoe datasets using the StackGAN method [52]. The resolution was chosen because of our GPU memory limitation and the original size of the images.

Concerning the strategy used for training RankCGAN on multiple attributes, we conducted experiments on both proposed strategies and noticed that the results are similar. The only reason for choosing one over the other is a technical limitation: a single ranking layer shared between all attributes would require pairwise comparison labels with respect to every attribute for the same pair of images, which is not always available. For instance, in the UT-Zap50K dataset, each semantic attribute is annotated on a different set of image pairs.
In this case the only resort is to train a separate ranking layer for each attribute. In PubFig and OSR, the ordering is at category level, so we could use both training strategies. We opted for the separate ranking layers strategy for all datasets.

Figure 6 illustrates image generation using two attributes. We used "sporty" and "black" for the shoe dataset, and "masculine" and "smiling" for the face dataset. In both cases plausible images are generated, and the progression in both directions on both planes adheres to the subjective attributes in question. As with the one-dimensional cases, the semantic attribute changes smoothly in each direction. This is evidence that RankCGAN decorrelates multiple semantic attributes.

4.6 Application: Image Editing

Image editing is an application that could be built into a system that supports user browsing. For example, a user might ask the system: "show me shoes that are less sporty than the one I see now". The core of image editing is to map a given image onto the subjective scale by estimating the latent variables r, z that produce a reconstructed image, and then to move along the scale, one way or the other. The image editing task was used to create Figure 4 in the qualitative results (Section 4.4).

In Figure 9, we show image editing on shoes and on faces, using the "sporty" and "masculine" attributes respectively. The reconstructed images are framed in red; the images generated with subjectively less of the chosen attribute are shown on the top line, while the images generated to have subjectively more are on the bottom line. In both cases the noise vector z was held constant, and only the semantic attribute r varies across the learned subjective scale.

4.7 Application: Semantic Attribute Transfer

Our final application is semantic attribute transfer. The idea is to extract the subjective measure of a semantic attribute from one image, and apply that measure to another. This is possible because the encoder $E_r$ is able to quantify the semantic strength of an attribute in the image. Indeed, this ability is used in all the results so far, generation excepted. To transfer an attribute, we quantify the conditional variable r of the source image using the encoder $E_r$, then edit the target image with the new semantic value.

Figure 10 shows some examples of this task. The reference images have been ordered, left to right, by increasing subjective level of "smiling". The corresponding semantic value is then used in conjunction with each of the target images to generate a new expression for the person in the picture. We note that the target pictures do not have to show a neutral expression for this to work.
Indeed, some target images show people smiling, while others do not. It is at best very difficult to see how to implement an application like this without a system that is able to model subjective scales, such as RankCGAN.

Figure 10: Examples of the smiling attribute transfer task. The latent variable r of the reference images is estimated and then transferred to the target image in order to express the same strength of attribute.

5 Discussion and Conclusion

We introduced RankCGAN, a novel GAN architecture that synthesises images with respect to semantic attributes defined relatively using pairwise comparison annotations. We showed through experiments that the design and training scheme of RankCGAN enable latent semantic variables to control the attribute strength in the generated images using a subjective scale. Our proposed model is generic in the sense that it can be integrated into any extended CGAN model. It follows that our model's generative power is tied to that of the GAN in terms of the quality of generated images, the diversity of the model and the evaluation methods. Possible extensions to this study consist of incorporating a filtering architecture, CFGAN [16], to enhance the controllability of RankCGAN, and of incorporating an encoder in RankCGAN to perform end-to-end training in order to improve the image editing and attribute transfer tasks.

Acknowledgements

We would like to thank James Tompkin for initial discussions about the main ideas of the paper. Yassir Saquil thanks the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No and the UK's EPSRC Centre for Doctoral Training in Digital Entertainment (CDE), EP/L016540/1. Kwang In Kim thanks EPSRC EP/M00533X/2 and RCUK EP/M023281/1.

References

[1] A. Biswas and D. Parikh. Simultaneous active learning of classifiers & attributes via relative feedback. In CVPR.
[2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. In ICLR.
[3] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender. Learning to rank using gradient descent. In ICML.
[4] X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS.
[5] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR.
[6] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR.
[7] V. Ferrari and A. Zisserman. Learning visual attributes. In NIPS.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS.
[9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.
[10] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:2006.
[11] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786).
[12] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. In SIGGRAPH.
[13] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR.
[14] T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD.
[15] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV.
[16] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative attribute controller with conditional filtered generative adversarial networks. In CVPR.
[17] T. Kim, B. Kim, M. Cha, and J. Kim. Unsupervised visual attribute transfer with reconfigurable generative adversarial networks. CoRR.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR.
[19] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR.
[20] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In NIPS.
[21] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV.
[22] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato. Fader networks: Manipulating images by sliding attributes. CoRR.
[23] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV.
[24] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
[25] X. Liang, H. Zhang, and E. P. Xing. Generative semantic manipulation with contrasting GAN. CoRR.
[26] K. Lin, D. Li, X. He, Z. Zhang, and M. ting Sun. Adversarial ranking for language generation. In NIPS.
[27] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR.
[28] S. Liu, J. Pan, and M. Yang. Learning recursive filters for low-level vision via a hybrid neural network. In ECCV.
[29] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow. Adversarial autoencoders. In ICLR.
[30] E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR.
[31] X. Mao, Q. Li, and H. Xie. AlignGAN: Learning to align cross-domain images with conditional generative adversarial networks. CoRR.
[32] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS.
[33] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR.
[34] A. V. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML.
[35] D. Parikh and K. Grauman. Relative attributes. In ICCV.
[36] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR.
[37] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. In NIPS Workshop on Adversarial Training.
[38] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.
[39] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML.
[40] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In NIPS.
[41] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML.
[42] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. In ICLR.
[43] Y. Souri, E. Noury, and E. Adeli. Deep relative attributes. In ACCV.
[44] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15.
[45] R. Tao, A. W. M. Smeulders, and S. Chang. Attributes and categories for generic instance search from one example. In CVPR.
[46] J. Tompkin, K. I. Kim, H. Pfister, and C. Theobalt. Criteria sliders: Learning continuous database criteria via interactive ranking. In BMVC.
[47] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. In NIPS.
[48] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML.
[49] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV.
[50] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS.
[51] A. Yu and K. Grauman. Fine-Grained Visual Comparisons with Local Learning. In CVPR.
[52] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV.
[53] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV.
[54] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV.
[55] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
