Weight Quantization for Multi-layer Perceptrons Using Soft Weight Sharing


Weight Quantization for Multi-layer Perceptrons Using Soft Weight Sharing

Fatih Köksal¹, Ethem Alpaydın¹, and Günhan Dündar²
¹ Department of Computer Engineering, ² Department of Electrical and Electronics Engineering, Boğaziçi University, Istanbul, Turkey

Abstract. We propose a novel approach for quantizing the weights of a multi-layer perceptron (MLP) for efficient VLSI implementation. Our approach uses soft weight sharing, previously proposed for improved generalization, and considers the weights not as constant numbers but as random variables drawn from a Gaussian mixture distribution, which includes as its special cases k-means clustering and uniform quantization. This approach couples the training of the weights for reduced error with their quantization. Simulations on synthetic and real regression and classification data sets compare various quantization schemes and demonstrate the advantage of the coupled training of the distribution parameters.

1 Introduction

Since VLSI circuits must be produced in large amounts for economy of scale, it is necessary to keep the storage capacity as low as possible to come up with cheaper products. In an artificial neural network, the parameters are the connection weights, and if they can be stored using fewer bits, the storage need is reduced and we gain memory. In this work, we try to find a good quantization scheme that achieves a reasonable compression ratio for the parameters of the network without significantly degrading accuracy.

Our proposed method partitions the weights of a neural network into a number of clusters so that a single value is used for each cluster of weights. Thus, the actual memory that stores the real-numbered values is small, and the weights themselves are pointers into this memory: all weights in a cluster point to the same location. For example, given an MLP with 1,000 weights that can be grouped into 32 clusters, only five bits are needed per weight.

An analogy is the color map. To get 16 million colors, one requires 24 bits per pixel. Graphics adapters have color maps, e.g., of size 256, where each entry is 24 bits and encodes one of the 16 million colors. Then, using eight bits per pixel, we can index one entry of the color map, so the storage requirement for an image is one third. Although this means that an image can contain only 256 of the 16 million possible colors, if quantization [1] is done well, there will not be a degradation of quality. Our aim is to do a similar quantization of the weights of a large MLP.

G. Dorffner, H. Bischof, and K. Hornik (Eds.): ICANN 2001, LNCS 2130, pp. 211-216, 2001. © Springer-Verlag Berlin Heidelberg 2001
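To make the codebook idea concrete, here is a minimal Python/NumPy sketch of the pointer scheme described above; it is our illustration, not the authors' code, and the helper names (kmeans_1d, quantize_weights) are hypothetical. Trained weights are clustered, each weight is replaced by the index of its nearest cluster center, and only the small codebook is stored at full precision.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Plain 1-D k-means: returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=k, replace=False)
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        assign = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of the values assigned to it.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = values[assign == j].mean()
    return centers, assign

def quantize_weights(weights, n_bits):
    """Replace each weight by a pointer into a 2**n_bits-entry codebook."""
    k = 2 ** n_bits
    codebook, index = kmeans_1d(weights, k)
    return codebook, index  # storage: k full-precision values + n_bits per weight

# Example: 1,000 random stand-in weights quantized with a 32-entry (5-bit) codebook.
w = np.random.randn(1000)
codebook, idx = quantize_weights(w, n_bits=5)
w_hat = codebook[idx]  # de-referenced (quantized) weights
print("mean squared quantization error:", np.mean((w - w_hat) ** 2))
```

As in the color-map analogy, arithmetic can still be done at full precision on the de-referenced values; only the stored representation of each weight shrinks to its index.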

The benefit is obvious for digital storage in reducing the memory size. For analog storage of weights, capacitors are perhaps the most popular method, and in such circuits the area of the capacitors generally dominates the total area. Thus, it is very important to reduce the number of stored weights.

There is a great amount of research on the quantization and efficient hardware implementation of neural networks, specifically of MLPs. The aim is to get an MLP whose weights are represented by fewer bits and which therefore consumes less memory, at the expense of a loss of precision; example studies are given in [2,3,4,5,6]. Our approach is novel in that we do not reduce the precision: the operations are still carried out at full precision. We decrease storage by clustering the weights.

The organization of the paper is as follows: In Section 2, we discuss the possible methods for quantization and divide them into two groups, after-training methods and during-training methods. Section 3 contains the experimental design and the results of these methods, and in Section 4 we conclude and discuss future work.

2 Soft Weight Sharing

We can generally classify the applicable methods for weight quantization into two groups. The simplest method is to train the neural network normally and then directly apply quantization to the weights of the trained network; we call these after-training methods. However, applying quantization without considering its effect during training leads to large error, and therefore combining quantization and training is a better alternative; we call these during-training methods [5].

The methodology we use is soft weight sharing [7], where it is assumed that the weights of an MLP are not constants but random variables drawn from a mixture of Gaussians

    p(w) = Σ_{j=1}^{M} α_j φ_j(w)                                  (1)

where w is a weight, the α_j are the mixing coefficients (prior probabilities), and the component densities φ_j(w) are Gaussians, i.e., φ_j(w) ~ N(μ_j, σ_j²). The main reason for choosing this type of distribution for the weights is its generality and analytical simplicity. There are three types of parameters in the mixture distribution, namely the prior probabilities α_j, the means μ_j, and the variances σ_j². Assuming that the weights are independent, the likelihood of the sample of W weights is

    L = Π_{i=1}^{W} p(w_i)                                         (2)

In after-training methods, once the training of the MLP is complete, the Expectation-Maximization (EM) algorithm can be used to determine these parameters [8]. It is known that when all priors and variances are equal, the well-known quantization method k-means clustering is equivalent to EM.
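As an illustration of the after-training route, the sketch below (our own assumption of how this could look, not code from the paper) fits the mixture of Eq. (1) to an already-trained weight vector with a few EM iterations; constraining all priors and variances to be equal essentially recovers k-means, as noted above. The function name fit_weight_mixture is hypothetical.

```python
import numpy as np

def fit_weight_mixture(w, M, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture over the trained weights w (Eqs. 1-2)."""
    rng = np.random.default_rng(seed)
    alpha = np.full(M, 1.0 / M)                # mixing coefficients (priors)
    mu = rng.choice(w, size=M, replace=False)  # component means
    var = np.full(M, w.var())                  # component variances
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = alpha_j phi_j(w_i) / p(w_i)
        dens = np.exp(-0.5 * (w[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = alpha * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, and variances from the responsibilities
        Nj = r.sum(axis=0) + 1e-12
        alpha = Nj / len(w)
        mu = (r * w[:, None]).sum(axis=0) / Nj
        var = (r * (w[:, None] - mu) ** 2).sum(axis=0) / Nj + 1e-8
    return alpha, mu, var

# Quantize by snapping each weight to the mean of its most responsible component.
w = np.random.randn(500)                        # stand-in for trained MLP weights
alpha, mu, var = fit_weight_mixture(w, M=8)     # 8 components = a 3-bit codebook
dens = np.exp(-0.5 * (w[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
w_quantized = mu[np.argmax(alpha * dens, axis=1)]
```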

One can even view uniform quantization in this framework, where in addition to all priors and variances being equal, the means are equally spaced and fixed.

In during-training methods, to couple the training of the weights with their quantization, we take the negative log-likelihood and convert it into an error function to be minimized,

    Ω = −Σ_i ln Σ_{j=1}^{M} α_j φ_j(w_i)                           (3)

This is then added as a penalty term to the usual error function E (mean square error in regression, cross-entropy in classification),

    E′ = E + υΩ                                                    (4)

The augmented error function E′ is minimized, e.g., by gradient descent, to learn both the weights w_i and the parameters of their distribution, i.e., μ_j, σ_j, and α_j. We do not give the update equations here due to lack of space; they can be found in [7].

3 Experimental Results

We have used one synthetic regression dataset for visualization and two real datasets, for speech phoneme and handwritten digit recognition. All datasets are divided into training and test sets. For each problem, we first train MLPs to find the optimum number of hidden units and thus the number of parameters; these are then quantized with different numbers of bits. We run each model ten times, starting gradient descent from different random initial weights, and report the average and standard deviation of the error on the test set.

The MLP for the regression problem has one input, one output, and four hidden units, and thus uses 13 weights; without any quantization, we therefore need four bits per weight index. We quantize with two, four, and eight clusters (Gaussians), corresponding to one, two, and three bits. An example of the regression function as quantized by soft weight sharing is given in Fig. 1, where the effect of quantization on the regression function is easily observed. The means and standard deviations of the error for the three methods are given in Table 1. K-means is an after-training method and works better than uniform quantization. Better still is the result with soft weight sharing (SWS), a during-training method, which clearly demonstrates the advantage of coupled training.

[Fig. 1. The graphs of the regression function, unquantized and quantized by soft weight sharing (SWS) with one, two, and three bits.]

Table 1. Test set error values (average ± standard deviation) on the regression dataset for several quantization levels. The unquantized MLP has an error of 4.±.6.

            1 Bit           2 Bits         3 Bits
  Uniform   245.33±37.6     4.87±6.33      4.64±23.
  K-means   73.34±87.59     57.6±2.97      6.85±3.7
  SWS       8.85±2.27       9.99±4.9       4.6±
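To make the during-training coupling of Eqs. (3)-(4) concrete, here is a minimal NumPy sketch (our illustration; the name soft_sharing_penalty and the numeric settings are assumptions) that evaluates the penalty Ω and its gradient with respect to the weights. In full training this gradient is scaled by υ and added to the backpropagated gradient of the task error E, and the mixture parameters α_j, μ_j, σ_j are updated alongside the weights, as in [7].

```python
import numpy as np

def soft_sharing_penalty(w, alpha, mu, var):
    """Omega = -sum_i log sum_j alpha_j N(w_i; mu_j, var_j), and dOmega/dw."""
    # Component densities phi_j(w_i), shape (n_weights, M).
    dens = np.exp(-0.5 * (w[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    mix = dens @ alpha                       # p(w_i), shape (n_weights,)
    omega = -np.sum(np.log(mix))
    # Responsibilities r[i, j] = alpha_j phi_j(w_i) / p(w_i).
    resp = (alpha * dens) / mix[:, None]
    # dOmega/dw_i = sum_j r[i, j] * (w_i - mu_j) / var_j  (pulls w_i toward centers)
    grad_w = np.sum(resp * (w[:, None] - mu) / var, axis=1)
    return omega, grad_w

# One augmented gradient step on the weights alone (Eq. 4), with assumed settings.
w = np.random.randn(100)                 # stand-in for the current MLP weights
alpha = np.array([0.5, 0.25, 0.25])      # current mixture parameters
mu = np.array([0.0, -1.0, 1.0])
var = np.array([0.1, 0.1, 0.1])
grad_E = np.zeros_like(w)                # placeholder for the task-error gradient
omega, grad_omega = soft_sharing_penalty(w, alpha, mu, var)
lr, upsilon = 0.1, 0.01                  # learning rate and penalty weight
w -= lr * (grad_E + upsilon * grad_omega)
```

The penalty gradient pulls each weight toward the component means in proportion to the responsibilities, which is what gradually groups the weights into quantizable clusters during training.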

The speech phoneme data is represented as a 2-dimensional input; the dataset contains 1,200 samples from six distinct classes (six speech phonemes) and is equally divided into training and test sets. The MLP has ten hidden units, giving 96 parameters. We have used one, two, three, four, and five bits for our performance evaluation; Table 2 reports the simulation results.

Table 2. On the speech phoneme recognition problem, the average ± standard deviation of the number of misclassifications out of 600 on the test set. For comparison, the unquantized MLP has a misclassification error of 4.7±3.79.

            1 Bit          2 Bits        3 Bits        4 Bits       5 Bits
  Uniform   49.29±8.36     376.5±77.4    29.89±55.35   4.9±33.69    55.5±8.96
  K-means   365.±82.93     248.±69.3     23.4±48.74    59±8.66      45.5±5.93
  SWS       387.5±74.      29.39±37.55   89.±45.4      52.±9.85     44.79±5.92

[Fig. 2. The distribution of the weights of the MLP used for speech phoneme recognition and the Gaussian mixture probability distribution fitted by soft weight sharing, for 1 bit (2 Gaussians), 2 bits (4 Gaussians), 3 bits (8 Gaussians), 4 bits (16 Gaussians), and 5 bits (32 Gaussians). Vertical bars are the weights; circles are the centers of the Gaussians.]

In Figure 2, we draw the scatter of the weights and the probability distribution fitted by soft weight sharing. Note that the probability distribution of the weights is clearly a Gaussian centered near zero. Although some papers (e.g., [3]) claim or assume that the weights of an MLP are distributed uniformly, we observe a zero-mean normal distribution. This should not be surprising: there are many parameters, they are initialized randomly to values close to zero, and a large majority are not updated much afterwards. Note that if we were using a pruning strategy, the connections belonging to a cluster centered very close to zero would be the ones to prune. The significant weights are those distant from zero, since they have large error gradients and are updated much more than the other parameters.
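The pruning remark above suggests a simple follow-up, sketched here under our own assumptions (the paper leaves pruning as future work, and prune_near_zero_cluster is a hypothetical helper): after the mixture is fitted, weights whose most responsible component is the one centered nearest zero are set to zero, while the remaining weights snap to their cluster centers.

```python
import numpy as np

def prune_near_zero_cluster(w, alpha, mu, var):
    """Quantize to cluster centers and zero out the cluster centered nearest 0."""
    dens = np.exp(-0.5 * (w[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    assign = np.argmax(alpha * dens, axis=1)   # hard assignment per weight
    zero_cluster = np.argmin(np.abs(mu))       # component whose mean is closest to 0
    w_pruned = mu[assign]                      # quantized weights...
    w_pruned[assign == zero_cluster] = 0.0     # ...with the near-zero cluster pruned
    return w_pruned, np.mean(assign == zero_cluster)

# Example with a hypothetical 3-component fit (values are illustrative only).
w = np.random.randn(200) * 0.3
alpha = np.array([0.7, 0.15, 0.15])
mu = np.array([0.02, -0.8, 0.8])
var = np.array([0.05, 0.1, 0.1])
w_pruned, frac_pruned = prune_near_zero_cluster(w, alpha, mu, var)
print(f"{frac_pruned:.0%} of the connections would be pruned")
```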

The handwritten digit dataset contains 1,200 samples, each of which is a 16 × 16 bitmap. The MLP has ten hidden units, and we have 2,680 parameters for quantization. The distribution of the weight values after the training phase of the MLP is again a normal distribution, as with the speech data set. Table 3 contains the quantization results.

Table 3. On the handwritten digit recognition problem, the average ± standard deviation of the number of misclassifications of the MLP out of 600 on the test set. For comparison, the unquantized MLP has a misclassification error of 23.5±3.58.

            1 Bit          2 Bits         3 Bits        4 Bits      5 Bits
  Uniform   526.7±2.75     52.7±42.37     44.7±7.2      95.9±37.8   29.29±5.67
  K-means   448.39±4.4     263.29±75.2    58.79±27.43   27.7±5.38   24.5±2.8
  SWS       397.2±39.4     244.9±53.66    64.9±37.24    3.7±7.28    27.9±6.47

We see in both classification datasets that when the number of bits is large, even uniform quantization is good; the advantage of soft weight sharing, i.e., of coupled training, becomes apparent with a small number of clusters.

4 Conclusions

We propose to use soft weight sharing, previously proposed for improved generalization, to quantize the weights of an MLP for efficient VLSI implementation, and we compare it with the previously proposed methods of uniform quantization and k-means clustering. Our results indicate that soft weight sharing, because it couples the training of the MLP with quantization, leads to more accurate networks at the same level of quantization. Once quantization is done, the results also indicate the saliency of the weights, which can be used further to prune unnecessary connections; we leave this as future work.

Acknowledgment

This work is supported by Grant AD from Boğaziçi University Research Funds.

References

1. Gersho, A. and R. Gray, Vector Quantization and Signal Compression, Norwell, MA: Kluwer, 1992.
2. Choi, J. Y. and C. H. Choi, Sensitivity Analysis of Multilayer Perceptron with Differentiable Activation Functions, IEEE Transactions on Neural Networks, Vol. 3, pp. 101-107, 1992.
3. Xie, Y. and M. A. Jabri, Analysis of the Effects of Quantization in Multi-Layer Neural Networks Using a Statistical Model, IEEE Transactions on Neural Networks, Vol. 3, pp. 334-338, 1992.
4. Sakaue, S., T. Kohda, H. Yamamoto, S. Maruno, and Y. Shimeki, Reduction of Required Precision Bits for Back Propagation Applied to Pattern Recognition, IEEE Transactions on Neural Networks, Vol. 4, pp. 270-275, 1993.
5. Dündar, G. and K. Rose, The Effects of Quantization on Multilayer Neural Networks, IEEE Transactions on Neural Networks, Vol. 6, pp. 1446-1451, 1995.
6. Anguita, D., S. Ridella, and S. Rovetta, Worst Case Analysis of Weight Inaccuracy Effects in Multilayer Perceptrons, IEEE Transactions on Neural Networks, Vol. 10, pp. 415-418, 1999.
7. Nowlan, S. J. and G. E. Hinton, Simplifying Neural Networks by Soft Weight Sharing, Neural Computation, Vol. 4, pp. 473-493, 1992.
8. Alpaydın, E., Soft Vector Quantization and the EM Algorithm, Neural Networks, Vol. 11, pp. 467-477, 1998.