Supplementary Material: HCP: A Flexible CNN Framework for Multi-label Image Classification


Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, Senior Member, IEEE, and Shuicheng Yan, Senior Member, IEEE

This document provides detailed justifications of the model components used for the results in the paper. All experimental results are based on VOC 2007.

1 JUSTIFICATIONS OF IMAGE-FINE-TUNING AND HYPOTHESES-FINE-TUNING

This justification is based on 10 selected hypotheses generated by BING [2] for each training image. First, to demonstrate the necessity of image-fine-tuning (I-FT), we compare against a variant that does not use image-fine-tuning to initialize the last fully-connected layer. As shown in Table 1, without image-fine-tuning the performance drops by 4%. The reason may be explained as follows. Because of the max pooling operation, our method requires a reasonable prediction for each hypothesis at the beginning, so that an advisable hypothesis can be selected to optimize the parameters during back-propagation. Without image-fine-tuning, however, the randomly initialized parameters of the last layer may produce irrational predictions for the hypotheses at the beginning. Then, through the cross-patch max pooling, incorrect hypotheses may be selected to optimize the shared CNN, driving the optimization into a local optimum.

TABLE 1
Comparison of HCP performance with and without I-FT on VOC 2007, with Alex Net [5] as the shared CNN (in %). [Numeric entries omitted.]

Second, to demonstrate the necessity of hypotheses-fine-tuning (H-FT), we compare the cross-patch max-pooling result based on the image-fine-tuned model with that based on the hypotheses-fine-tuned model. From Table 2, we can see that the improvement of the I-FT model from a single image to 500 hypotheses is limited, while the classification performance is significantly boosted with HCP. Therefore, hypotheses-fine-tuning further improves the classification accuracy.

TABLE 2
Comparison of the I-FT model and the HCP model, evaluated on a single image and on 500 hypotheses, on VOC 2007 with Alex Net as the shared CNN (in %). [Numeric entries omitted.]
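For concreteness, the following is a minimal numpy sketch (our illustration, not the authors' released code) of the cross-patch max pooling on one image: per-hypothesis class scores are max-pooled across hypotheses and compared with the normalized ground-truth vector of Section 2.1.1 under the squared loss. All shapes and variable names are assumptions.

import numpy as np

def softmax(a):
    # Row-wise softmax over the c class activations of each hypothesis.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def hcp_image_loss(hypo_activations, y):
    """hypo_activations: (m, c) shared-CNN outputs for the m hypotheses of
    one image; y: (c,) binary ground-truth label vector."""
    scores = softmax(hypo_activations)
    pooled = scores.max(axis=0)       # cross-patch max pooling over hypotheses
    winners = scores.argmax(axis=0)   # per class, only this hypothesis gets gradient
    y_hat = y / y.sum()               # normalized ground truth (Section 2.1.1)
    return np.square(pooled - y_hat).sum(), winners

Since each class's gradient flows only through its arg-max hypothesis, a reasonable initial prediction, which is exactly what I-FT provides, determines whether sensible patches are selected early in H-FT.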

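The I-FT initialization itself amounts to copying the image-level classifier parameters into the hypothesis-level network before H-FT starts. Below is a hedged PyTorch sketch under assumed layer shapes; the supplementary does not publish this code.

import torch
import torch.nn as nn

num_labels, feat_dim = 20, 4096            # VOC classes; assumed Alex Net fc7 width

ift_fc = nn.Linear(feat_dim, num_labels)   # last layer trained during I-FT
hft_fc = nn.Linear(feat_dim, num_labels)   # last layer to be trained during H-FT

# Start H-FT from the I-FT parameters rather than from random ones, so that
# every hypothesis already receives a reasonable prediction at the beginning.
with torch.no_grad():
    hft_fc.weight.copy_(ift_fc.weight)
    hft_fc.bias.copy_(ift_fc.bias)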
2 JUSTIFICATION OF LOSS FUNCTION

To validate the effectiveness of the squared loss function utilized in the HCP framework, we compare it with other multi-label loss functions. Four loss functions, i.e., the cross-entropy loss [3], the hinge loss [1], the label-wise cross-entropy loss and the pairwise ranking loss [3], are employed for the comparison. Details of the loss functions are given below.

2.1 Details of Loss Functions

Denote X = {x_i | i = 1, ..., n} as the image training set. Similar to the notation defined in [3], we denote the deep CNN by f(\cdot), where all the layers filter the given image x_i. Then f(\cdot) produces a c-dimensional output of activations, where c is the number of predefined labels.

2.1.1 Cross-entropy Loss

The softmax function is used to compute the posterior probability of a given image x_i belonging to class j, i.e.,

    p_{ij} = P(j \mid x_i) = \frac{\exp(f_j(x_i))}{\sum_{k=1}^{c} \exp(f_k(x_i))},    (1)

where f_j(x_i) is the j-th activation value for the image x_i. Let y_i = [y_{i1}, y_{i2}, \dots, y_{ic}] be the label vector of the i-th image, where y_{ij} = 1 (j = 1, \dots, c) if the image is annotated with class j, and y_{ij} = 0 otherwise. The ground-truth probability vector of the i-th image is defined as \hat{p}_i = y_i / \|y_i\|_1. Given the network prediction in Eq. (1), the loss function is defined as

    J = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} \hat{p}_{ij} \log(p_{ij}) = -\frac{1}{n} \sum_{i=1}^{n} \frac{1}{c_{i,+}} \sum_{j \in C_+} \log(p_{ij}),    (2)

where c_{i,+} is the number of ground-truth labels of the i-th image, and C_+ denotes the index set of the labeled classes.

2.1.2 Pairwise Ranking Loss

This loss function is usually utilized for the annotation problem. Specifically, its target is to rank the positive labels so that they receive higher scores than the negative labels. The loss function is defined as

    J = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in C_+} \sum_{k \in C_-} \max(0, 1 - f_j(x_i) + f_k(x_i)),    (3)

where C_+ and C_- denote the index sets of labeled and non-labeled classes, respectively.

2.1.3 Hinge Loss

The hinge loss is used for maximum-margin classification, most notably in support vector machines (SVMs). As indicated by [1], the loss function is defined as

    J = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} \max(0, 1 - y_{ij} f_j(x_i))^2,    (4)

where y_{ij} = +1 if x_i is annotated with the j-th label, and y_{ij} = -1 otherwise.

2.1.4 Label-wise Cross-entropy Loss

We present a new loss function for multi-label classification, called the label-wise cross-entropy (LWCE) loss. The LWCE loss splits the traditional cross-entropy loss into multiple one-vs.-all cross-entropy losses, one per class. Specifically, we use the deep CNN as the prediction mapping and change the output dimension of the last fully-connected layer of the network from c to 2c. Every two consecutive dimensions are then fed to the softmax function to compute the cross-entropy loss for the corresponding class. Denote \hat{f}_j(\cdot) \in R^2 as the 2-dimensional output of activations for the j-th (j = 1, \dots, c) label. We formulate the posterior probability of the given image x_i belonging to the j-th label as

    p_{ij} = P(j \mid x_i) = \frac{\exp(\hat{f}_{j1}(x_i))}{\sum_{k=1}^{2} \exp(\hat{f}_{jk}(x_i))},    (5)

where \hat{f}_{jk} is the k-th dimension of \hat{f}_j for k \in {1, 2}.

Suppose y_{ij} = [y_{ij1}, y_{ij2}] is the 2-dimensional label vector of the j-th class, where y_{ij1} = 1, y_{ij2} = 0 if the image is annotated with class j, and otherwise y_{ij1} = 0, y_{ij2} = 1. Then, the loss function to be minimized can be written as

    J = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} \sum_{k=1}^{2} y_{ijk} \log(p_{ijk}) = -\frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j \in C_+} \log(p_{ij1}) + \sum_{j \in C_-} \log(p_{ij2}) \Big),    (6)

where C_+ and C_- denote the index sets of labeled and non-labeled classes, respectively.
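As a reference for Eqs. (2)-(4), the following is a minimal numpy sketch of the three established losses on a batch of activations; it is our illustration under the notation above, not code from [1] or [3].

import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))   # Eq. (1), numerically stabilized
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(f, y):
    """f: (n, c) activations f(x_i); y: (n, c) binary label matrix. Eq. (2)."""
    p = softmax(f)
    p_hat = y / y.sum(axis=1, keepdims=True)       # ground-truth distribution
    return -(p_hat * np.log(p + 1e-12)).sum(axis=1).mean()

def pairwise_ranking_loss(f, y):
    """Eq. (3): hinge on every (positive, negative) label pair per image."""
    total = 0.0
    for fi, yi in zip(f, y):
        pos, neg = fi[yi == 1], fi[yi == 0]
        margins = 1.0 - pos[:, None] + neg[None, :]
        total += np.maximum(0.0, margins).sum()
    return total / len(f)

def hinge_loss(f, y):
    """Eq. (4): squared hinge with labels mapped to {+1, -1}."""
    s = 2.0 * y - 1.0
    return np.square(np.maximum(0.0, 1.0 - s * f)).sum(axis=1).mean()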

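The LWCE loss of Eqs. (5)-(6) differs only in that the network emits 2c activations, a two-way softmax per class. A corresponding hedged numpy sketch:

import numpy as np

def lwce_loss(f_hat, y):
    """f_hat: (n, c, 2) activations, one (positive, negative) pair per class;
    y: (n, c) binary label matrix."""
    e = np.exp(f_hat - f_hat.max(axis=2, keepdims=True))
    p = e / e.sum(axis=2, keepdims=True)           # Eq. (5), per-class softmax
    # Eq. (6): log p_ij1 on labeled classes, log p_ij2 on the others.
    ll = y * np.log(p[..., 0] + 1e-12) + (1 - y) * np.log(p[..., 1] + 1e-12)
    return -ll.sum(axis=1).mean()

In practice the (n, c, 2) tensor is just the 2c-dimensional last layer reshaped, so the change relative to Section 2.1.1 is confined to the output head.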
2.2 Results

First, from the experimental results shown in Table 3, it can be observed that the squared loss and the cross-entropy loss achieve better results than the other loss functions. The underlying reason may be explained as follows. Through the softmax operation across all labels, the context relationship among labels (co-occurrent or exclusive) is implicitly utilized by the squared loss and the cross-entropy loss, which is valuable for multi-label image classification on the VOC datasets. For both the hinge loss and the label-wise cross-entropy loss, by contrast, the optimization for each class is independent, and the context information of the other labels is not taken into consideration.

Second, the experimental result based on the pairwise ranking loss shows that it is less effective than the other loss functions. The target of this loss function is to rank the positive labels with higher predicted scores than the negative labels within a particular sample. It cannot, however, guarantee that positive samples are ranked higher than negative samples for a specific class. Therefore, its performance may be worse.

Third, the experiments show that the squared loss is slightly better than the cross-entropy loss, and the underlying reason may be as follows. As can be seen from the definition in Eq. (2), the cross-entropy loss only aims to promote the predicted scores on the ground-truth labels and does not constrain the predicted scores on the negative labels; it is therefore possible that a particular negative label also obtains a relatively large value, which may degrade the multi-label classification performance (the confidence on each positive label is already not very large when the number of positive labels exceeds one). In contrast, the squared loss attempts to suppress the predicted scores of all the negative labels to zero. Moreover, minimizing the squared loss mathematically encourages the squared losses of the individual negative labels of an image to be similar to each other, namely balanced, which may benefit the classification accuracy since no negative label then obtains a large confidence. All these experiments are based on 10 selected hypotheses (generated by BING) for training and 500 hypotheses for testing.

TABLE 3
Comparison of HCP performance under different loss functions on VOC 2007 with Alex Net as the shared CNN (in %).

Loss Function                     mAP
Squared Loss                      81.5
Cross-entropy Loss [3]            80.2
Hinge Loss [1]                    79.9
Label-wise Cross-entropy Loss     78.9
Pairwise Ranking Loss [3]

3 JUSTIFICATION OF HYPOTHESIS GENERATORS

To validate the effectiveness of using an object proposal method, two other schemes, i.e., random crops and uniform crops, are introduced for comparison. For the random crops, we estimate a multivariate Gaussian distribution over the bounding box center position, square root area, and log aspect ratio. After calculating the mean and covariance on the training set, we sample proposals from this distribution. For the uniform crops, we uniformly sample the bounding box center position, square root area, and log aspect ratio. These two cropping schemes are consistent with those proposed in [4], and we utilize the code released by [4] to generate the corresponding bounding boxes for each image. Since no objectness confidence score is assigned to these bounding boxes, we randomly select one hypothesis from each cluster center for training.
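The two baseline croppers can be summarized as follows. This is a hedged numpy sketch of the scheme described above (the code released by [4] is the authoritative version); box_to_stats and the (x1, y1, x2, y2) box convention are our assumptions.

import numpy as np

def box_to_stats(boxes):
    """(n, 4) boxes as (x1, y1, x2, y2) -> (n, 4) crop statistics:
    center x, center y, sqrt(area), log(aspect ratio)."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + w / 2.0
    cy = boxes[:, 1] + h / 2.0
    return np.stack([cx, cy, np.sqrt(w * h), np.log(w / h)], axis=1)

def sample_random_crops(train_boxes, k, rng):
    # Fit a multivariate Gaussian to the training statistics, then sample.
    stats = box_to_stats(train_boxes)
    mean = stats.mean(axis=0)
    cov = np.cov(stats, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=k)

def sample_uniform_crops(train_boxes, k, rng):
    # Sample each statistic uniformly over its observed range.
    stats = box_to_stats(train_boxes)
    return rng.uniform(stats.min(axis=0), stats.max(axis=0), size=(k, 4))

The sampled statistics are then inverted back to corner coordinates and clipped to the image boundary, which we omit here.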

In addition, to evaluate how other, better yet more complex, proposal generation methods affect the performance compared with BING, we add new experiments based on another proposal generation method, EdgeBoxes [7], chosen for its acceptable speed and high proposal quality. All these experiments are based on 10 selected hypotheses for the training of HCP. The top-500 hypotheses generated by the corresponding generators (i.e., Uniform, Random, BING and EdgeBoxes) are used for testing. Table 4 shows the comparison results on the VOC 2007 test set. It can be observed that our HCP framework still works well with uniform/random crops. However, the results of these two schemes are worse than those of the proposal-generation schemes, i.e., BING and EdgeBoxes. The underlying reason is that a proposal generator can cover, or recall, hypotheses with more complete and diverse objects than random/uniform crops can.

TABLE 4
Comparison of HCP performance under different hypothesis selection strategies on VOC 2007 with Alex Net as the shared CNN (in %).

Hypothesis Selection    mAP
Uniform                 80.6
Random                  80.9
BING                    81.5
EdgeBoxes               82.3

TABLE 5
Comparison of HCP performance for different numbers of clusters (5, 10, 15) and proposal methods (BING, EdgeBoxes) on VOC 2007, with Alex Net and VGG Net as the shared CNN (in %). [Numeric entries omitted.]

4 JUSTIFICATION OF THE NUMBER OF CLUSTERS

We vary the number of clusters during the HCP training stage to observe its influence based on two proposal methods, i.e., BING [2] and EdgeBoxes [7]. The experiments use n-clusters (n = 5, 10, 15), where n is the number of centers in the clustering operation. The comparison results are shown in Table 5, and all testing results are based on the top-500 hypotheses for each image. From Table 5, it can be observed that the performance of HCP is boosted as the cluster number increases. This is because more clusters bring a higher coverage rate of the objects in the training images. In addition, training as well as testing with hypotheses produced by EdgeBoxes performs better than with BING. The main reason is the quality of the hypotheses: as indicated in [4], hypotheses generated by EdgeBoxes are better than those produced by BING.

It takes about 1 second to cluster the top-500 bounding boxes generated by BING or EdgeBoxes for training. Based on the top-500 hypotheses, the testing time with Alex Net and VGG Net [6] is about 3 s/image and 10 s/image, respectively, on a GPU, including the time of proposal generation (BING: 0.2 s/image, EdgeBoxes: 0.25 s/image). Details of all experimental justifications on VOC 2007 are given in Table 6.
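A hedged sketch of the per-image clustering step follows: group the top-500 proposals into n centers and keep one hypothesis per cluster. k-means over raw box coordinates is an illustrative stand-in, since this supplementary does not specify the clustering algorithm.

import numpy as np
from sklearn.cluster import KMeans

def select_hypotheses(boxes, n_clusters, rng):
    """boxes: (500, 4) top-scoring proposals of one image ->
    (n_clusters, 4) hypotheses kept for training."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(boxes)
    picks = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # For uniform/random crops no objectness score exists, so a cluster
        # member is chosen at random (Section 3); with BING or EdgeBoxes the
        # highest-scoring member could be kept instead.
        picks.append(boxes[rng.choice(members)])
    return np.stack(picks)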

TABLE 6
Details of experimental results on VOC 2007. Per-class scores (in %) over the 20 VOC classes (plane, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, motor, person, plant, sheep, sofa, train, tv) are reported for: the justification of loss functions (training: 10 clusters, testing: top-500 hypotheses, proposal method: BING, shared CNN: Alex Net) with the Squared, Cross-entropy, Hinge, Label-wise and Pairwise Ranking losses; uniform/random cropping vs. proposal methods (training: 10 clusters, testing: top-500 hypotheses, shared CNN: Alex Net) with Uniform, Random, BING and EdgeBoxes; and the justification of proposal methods based on 5, 10 and 15 clusters (testing: top-500 hypotheses) with Alex+BING, Alex+EdgeBoxes, VGG+BING and VGG+EdgeBoxes. [Per-class values omitted.]

REFERENCES

[1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
[2] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In Computer Vision and Pattern Recognition, 2014.
[3] Y. Gong, Y. Jia, T. K. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multi-label image annotation. In International Conference on Learning Representations, 2014.
[4] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv preprint arXiv:1502.05082, 2015.
[5] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems, pages 1097-1105, 2012.
[6] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[7] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391-405. Springer, 2014.
