Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis, Tom Goldstein; Hao Li, Soham De, Zheng Xu, Hanan Samet


1 Multiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis, Tom Goldstein; Hao Li, Soham De, Zheng Xu, Hanan Samet

2 A TALK IN TWO ACTS. Part I: Stacked U-nets. The globalization problem for segmentation, and simple ways to solve it. Part II: Quantized neural nets. Putting neural nets on portable devices; binarized training methods, and why they work.

3 PROBLEM SETTING PASCAL Visual Object Classes (VOC) Cityscapes

4 SIMPLE APPROACHES. Conventional classification networks: designed for the classification task; parameter heavy; low-resolution output; minimal exchange of context. Dilated classification networks.

5 WHAT MAKES SEGMENTATION HARD? Field of view problem High-resolution outputs Fine scale localization required Small data

6 TV IN 2D
Discrete gradient: $(\nabla x)_{ij} = (x_{i+1,j} - x_{i,j},\; x_{i,j+1} - x_{i,j})$
Anisotropic TV: $|(\nabla x)_{ij}| = |x_{i+1,j} - x_{i,j}| + |x_{i,j+1} - x_{i,j}|$
Isotropic TV: $\|(\nabla x)_{ij}\| = \sqrt{(x_{i+1,j} - x_{ij})^2 + (x_{i,j+1} - x_{ij})^2}$
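To make the two definitions concrete, here is a minimal NumPy sketch (my illustration, not from the talk) of both variants:

```python
import numpy as np

def total_variation(x, isotropic=True):
    """Discrete TV of a 2D array x, using forward differences."""
    dx = x[1:, :-1] - x[:-1, :-1]   # x_{i+1,j} - x_{i,j}
    dy = x[:-1, 1:] - x[:-1, :-1]   # x_{i,j+1} - x_{i,j}
    if isotropic:
        return np.sqrt(dx**2 + dy**2).sum()
    return (np.abs(dx) + np.abs(dy)).sum()
```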

7 CONVEX SEGMENTATION: FINE SCALE. Solve speed is limited by the CFL condition: in $\min_u\, \mu\,\|\nabla u\| + \|u - f\|$, the solver's stable step size (and hence its speed) depends on the fine-grid discretization of $\nabla u$.

8 CONVEX SEGMENTATION: COARSE SCALE. TV changes the CFL condition: in $\min_u\, \mu\,\|\nabla u\| + \|u - f\|$, a coarser grid admits a larger stable step size.

9 SCALE AFFECTS SPEED! Segmentation using TV: 15 iterations, 280 iterations, 4500 iterations at successively finer scales. Solution: use multi-grid!
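The multi-grid fix: solve cheaply on a coarse grid, then use the result to warm-start the fine-scale solve. A hedged NumPy sketch of the idea, where `solve_tv(f, mu, init)` is a hypothetical stand-in for any iterative TV solver:

```python
import numpy as np

def downsample(f):
    # average 2x2 blocks (assumes even dimensions)
    return 0.25 * (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2])

def upsample(u):
    # nearest-neighbor prolongation back to the fine grid
    return np.kron(u, np.ones((2, 2)))

def multigrid_tv(f, mu, solve_tv, levels=3):
    # coarse grids converge fast, so recurse down before solving fine
    if levels == 0 or min(f.shape) < 8:
        return solve_tv(f, mu, init=f)
    u_coarse = multigrid_tv(downsample(f), mu, solve_tv, levels - 1)
    return solve_tv(f, mu, init=upsample(u_coarse))  # warm-started fine solve
```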

10 NEURAL NETS HAVE THE SAME PROBLEMS: far-away points need to talk to each other.

11 CLASSIFICATION NET. Globalize via pooling: the output label vector (ball, car, hat, grumpy cat) contains global info.

12 CLASSIFICATION NET. Globalize via convolution: with a high-res output, the CFL condition becomes the field-of-view problem.

13 FIELD OF VIEW BREAKDOWN (building vs. skyscraper). Example from Zhao et al., Pyramid Scene Parsing Network.

14 GLOBALIZATION METHODS THINGS GET COMPLICATED

15 APPROACHES TO GLOBALIZATION
Dilated/de-conv modules: Chen et al., Semantic image segmentation, 2014; Yu & Koltun, Multi-scale context aggregation, 2015.
Multi-scale feature ensembling: Long et al., Fully convolutional networks for semantic segmentation, 2015; Chen et al., Attention to scale, 2016; Xia et al., Zoom better to see clearer, 2016; Hariharan et al., Hypercolumns for object segmentation, 2015.
Hand-crafted features: Lazebnik, Schmid & Ponce, Beyond bags of features, 2006; Lucchi et al., Are spatial constraints necessary for segmentation, 2011.
Chen et al., DeepLab; Zhao et al., PSPNet.

16 SEGMENTATION NET: standard net + de-conv units. High-resolution output, low localization. L.-C. Chen et al., DeepLab.

17 DEEPLAB: CRF or graph-cuts segmentation on top. L.-C. Chen et al., DeepLab.

18 PSPNET: high memory requirements. Hengshuang Zhao et al., Pyramid Scene Parsing Network.

19 BACK TO BASICS A NO-FRILLS APPROACH

20 MULTI-GRID SOLVERS

21 THE U-NET: a V-cycle for neural nets (pool, pool, ..., deconv, deconv). Ronneberger et al., 2015.
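A minimal PyTorch sketch of that V-cycle (my illustration, not the paper's code; the channel count `c` and the single down/up level are placeholders):

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, c=16):
        super().__init__()
        self.down = conv_block(c, c)
        self.pool = nn.MaxPool2d(2)                      # coarsen
        self.bottom = conv_block(c, c)
        self.up = nn.ConvTranspose2d(c, c, 2, stride=2)  # "deconv" back up
        self.out = conv_block(2 * c, c)

    def forward(self, x):
        skip = self.down(x)
        y = self.up(self.bottom(self.pool(skip)))
        return self.out(torch.cat([skip, y], dim=1))     # fuse fine + coarse
```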

22 MEDICAL IMAGING: resolution is high, class complexity is low. Ronneberger et al., 2015.

23 NEURAL NET BUILDING BLOCKS: input image → Conv → ReLU → Conv → ReLU

24 RESNET BLOCK: input image → Conv → ReLU → Conv → ReLU → + (identity skip adds the input back)
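As a sketch, the block above in PyTorch (my illustration):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.body(x)   # the "+" in the slide: identity skip
```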

25 MODIFIED U-NET BLOCK: BN → ReLU → Conv units arranged in a V-cycle over scales 1, 2, 4, 8 and back down.

26 BIG IDEA: STACKED U-NETS. Combine the information globalization of U-nets with the power of ResNets. No frills!

27 TRAINING ON REAL DATASETS

28 LIMITED TRAINING DATA. ImageNet: 1,000,000 images; MS-COCO: 90,000; PASCAL VOC: 1,464. Solution: pre-train on big datasets, fine-tune on small ones.

29 STANDARD APPROACH Phase I: Train a classification net ball car hat grumpy cat

30 STANDARD APPROACH Phase II: add fancy stuff on top of the standard net. Roughly doubles the parameters!

31 DILATED SUNETS. Convert a conventional classification network into a dilated one: remove pooling layers and increase the dilation factor of the convolutions. Exact same parameters! See the sketch below.
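A hedged PyTorch sketch of that swap, assuming 3x3 kernels; the helper name `dilate` is mine:

```python
import torch.nn as nn

def dilate(conv: nn.Conv2d, rate: int) -> nn.Conv2d:
    # For a 3x3 kernel, padding = dilation preserves spatial size.
    new = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                    stride=1, padding=rate, dilation=rate,
                    bias=conv.bias is not None)
    new.load_state_dict(conv.state_dict())  # exact same parameters!
    return new
```

Only the stride and dilation change; the weights are copied over verbatim, which is why the parameter count is untouched.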

32 RESULTS

33 METRICS. Classification: Top-1 accuracy, Top-5 accuracy. Segmentation: intersection over union (IoU).
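For concreteness, a minimal sketch of mean IoU over integer label maps (my illustration):

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return np.mean(ious)
```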

34 RESULTS
Classification performance: ResNet, DenseNet, and SUNet variants compared by Top-1/Top-5 accuracy, depth, and parameter count (numeric table entries did not survive transcription).
Segmentation performance (mIoU): ResNet-101 [218] vs. SUNet variants.
Takeaways: +4.5% mIoU with 7x fewer parameters; higher mIoU without sacrificing classification performance.
Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein, Stacked U-Nets: A No-Frills Approach to Natural Image Segmentation

35 SEGMENTATION RESULTS
Training on train+val (2x data), validation mIoU: ResNet+ASPP, Xception+ASPP+Decoder, SUNet (scores not recovered in transcription).
Performance comparison on PASCAL VOC 2012 validation set, training on train-set (mIoU):
Piecewise (VGG16) [224]: 78.0
LRR+CRF [209]: 77.3
DeepLabv2+CRF [210]: 79.7
Large-Kernel+CRF [220]: 82.2
Deep Layer Cascade [254]: 82.7
Understanding Conv [221]: 83.1
RefineNet [205]: 82.4
RefineNet-ResNet152 [205]: 83.4
PSPNet [217]: 85.4
SUNet: (score not recovered)
Advantages: 1/2 the weights, 1/5 the RAM

36 SAMPLE RESULTS

37 SAMPLE RESULTS

38 SAMPLE RESULTS

39 SAMPLE RESULTS

40 FAILURE CASE

41 FAILURE CASE

42 SUMMARY OF PART I: the CFL condition for convex segmentation; the FOV problem for neural nets; U-net blocks globalize information better; we can compete with SOTA using simple architectures; SUNets use fewer parameters and train fast.

43 QUESTIONS?

44 QUANTIZED NETS

45 DEEP NETS ARE BIG: too big for low-power devices. Canziani et al., An Analysis of Deep Neural Network Models for Practical Applications, arXiv 2016.

46 SOLUTION: QUANTIZED/LOW-PRECISION NETWORKS (weights in ±1). BinaryConnect [Courbariaux, NIPS 15]; BinaryNet [Hubara, NIPS 16]; XNOR-Net [Rastegari, ECCV 16]; DoReFa-Net [Zhou, arXiv 16]; Deep Compression [Han, ICLR 16]. Advantages: FAST and hardware friendly (no multiplications); low storage costs; low power consumption.

47 HOW TO TRAIN QUANTIZED NETS? Minimize $f(w)$.
Non-quantized: stochastic gradient descent, $w^{k+1} = w^k - \alpha\,\nabla f(w^k)$.
Fully quantized: stochastic rounding [Gupta, ICML 15], $w^{k+1} = w^k - Q\big[\alpha\,\nabla f(w^k)\big]$.
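A hedged PyTorch sketch of the two ingredients: an unbiased stochastic-rounding quantizer Q and the fully quantized step (the grid spacing `delta` and the function names are my assumptions):

```python
import torch

def stochastic_round(x, delta=1.0):
    # Unbiased rounding to a grid of spacing delta: E[Q(x)] = x.
    lo = torch.floor(x / delta) * delta
    p = (x - lo) / delta                   # probability of rounding up
    return lo + delta * torch.bernoulli(p)

def sr_step(w, grad, alpha, delta=1.0):
    # Fully quantized update: w never leaves the grid.
    return w - stochastic_round(alpha * grad, delta)
```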

48 EXPERIMENT. Train CNNs (VGG-Net, ResNets, Wide-ResNet) with binary weights on CIFAR-10. (Plot legend: Deterministic Rounding, Stochastic Rounding, Full-Precision.)

49 HOW TO TRAIN QUANTIZED NETS? Minimize $f(w)$.
Fully quantized: stochastic rounding [Gupta, ICML 15], $w^{k+1} = w^k - Q\big[\alpha\,\nabla f(w^k)\big]$.
Semi-quantized: BinaryConnect [Courbariaux, NIPS 15], $w^{k+1} = w^k - \alpha\,\nabla f\big(Q[w^k]\big)$, where $w^k$ is kept floating point.
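By contrast, a hedged sketch of the BinaryConnect step; `grad_fn` is a hypothetical stand-in for backprop through the network at given weights:

```python
import torch

def bc_step(w_float, grad_fn, alpha):
    w_bin = torch.sign(w_float)       # Q[w]: binarize for the forward pass
    g = grad_fn(w_bin)                # gradient evaluated at the binary point
    return w_float - alpha * g        # small updates accumulate in float
```

The design point: quantization happens only inside the gradient evaluation, so tiny gradient steps are not rounded away between iterations.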

50 EXPERIMENT. Train CNNs (VGG-Net, ResNets, Wide-ResNet) with binary weights on CIFAR-10. (Plot legend: Deterministic Rounding, Stochastic Rounding, BinaryConnect, Full-Precision.) The SR method cannot beat BC. Why?

51 BINARIZED NETS GO MAINSTREAM. Hubara et al., Binarized neural nets; Rastegari et al., XNOR-Net; Zhou et al., DoReFa-Net: training low precision; Lin et al., Nets with few multiplications; Lin, Zhao, Pan, Towards accurate binary; Shayer et al., Learning discrete weights; Park, Ahn, Yoo, Weighted entropy quantization; Mishra et al., WRPN: wide reduced-precision nets; Zhu et al., Trained ternary quantization; Umuroglu et al., FINN: a framework; Kim & Smaragdis, Bitwise nets; and tons more.

52 CONVERGENCE UNDER CONVEXITY ASSUMPTIONS

53 CONVERGENCE THEORY FOR STOCHASTIC ROUNDING (fully quantized)
Theorem 2. Assume that $F$ is $\mu$-strongly convex, the learning rates decrease as $\alpha_k \sim 1/(\mu k)$, and $G$ bounds the gradient magnitude. Then
$$\mathbb{E}[F(\bar w_T) - F(w^\star)] \;\le\; \frac{(1+\log(T+1))\,G^2}{2\mu T} \;+\; \frac{\sqrt{d}}{2}\,\Delta\,G,$$
where $\Delta$ is the quantization step. The first term decays with the iteration count $T$; the second is a noise floor that does not.

55 CONVERGENCE THEORY FOR BINARYCONNECT (semi-quantized)
Theorem 1. Assume that $F$ is $\mu$-strongly convex, the learning rates decrease as above, and $G$ bounds the gradient magnitude. Then BinaryConnect obeys the same decaying term, with a noise floor that depends on $L_2$, a Lipschitz constant for the Hessian.

56 But this can't be the whole story. What can we say about these methods on NON-CONVEX problems? The answer has to do with exploration vs. exploitation.

57-60 FLOATING POINT (animation of successive gradient-descent steps). Learning rate = 1; quantized scalar weight; gradient $\nabla f(w)$.

61-63 FLOATING POINT (animation). Shrinking learning rate; histogram of the iterates.

64-66 FULLY QUANTIZED (animation). Shrinking learning rate; histogram of the iterates.

67 WHAT'S WRONG? For floating point and BinaryConnect, shrinking the learning rate moves training from exploration to exploitation. For stochastic rounding, shrinking the learning rate leaves it in exploration: it never starts exploiting.

68 MARKOV CHAIN INTERPRETATION. In weight space, the update $w^{k+1} = w^k - Q\big[\alpha\,\nabla f(w^k)\big]$ hops randomly between discrete states $w_1, w_2, \ldots$

69 MARKOV CHAIN INTERPRETATION. $w^{k+1} = w^k - Q\big[\alpha\,\nabla f(w^k)\big]$. Long-term dynamics are governed by the equilibrium distribution of this chain.

70 MARKOV CHAIN INTERPRETATION equilibrium distribution
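A toy 1D simulation (entirely my illustration) of this Markov chain view, using $f(w) = w^2/2$ with noisy gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x):
    lo = np.floor(x)
    return lo + (rng.random() < (x - lo))   # round to the integer grid

w, alpha = 3.0, 0.1
samples = []
for _ in range(100_000):
    g = w + rng.normal()                    # noisy gradient of f(w) = w^2/2
    w = w - stochastic_round(alpha * g)     # fully quantized SR update
    samples.append(w)
# A histogram of `samples` approximates the equilibrium distribution; it
# stays spread out rather than collapsing onto the minimizer w = 0.
```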

71 WHAT ABOUT STOCHASTIC ROUNDING? Fully discrete stochastic rounding does not concentrate.
Theorem: if the noise satisfies
(i) tails aren't too fat: $\int_1^\infty p_{x,k}(z)\,dz < C\,\alpha^2$,
(ii) the noise is not concentrated at 0, and
(iii) the SGD iterates stay bounded,
then the iterates of fully quantized SGD do not concentrate on minimizers. But BinaryConnect's iterates do! The assumptions are weak enough for neural nets.

72 ANNEALING PROPERTIES MAKE A DIFFERENCE. Train CNNs (VGG-Net, ResNets, Wide-ResNet) with binary weights on CIFAR-10/100. (Plot legend: concentration/annealing, BinaryConnect, Full-Precision.)

73 SUMMARY OF PART II. Quantized nets reduce net size and energy. Training requires specialized rounding methods. We still don't understand the non-convexities of neural nets!

74 THANKS! Stacked U-Nets: A No-Frills Approach to Natural Image Segmentation. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein. Towards a Deeper Understanding of Training Quantized Networks. Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, Tom Goldstein.
