MINIMUM PRECISION REQUIREMENTS FOR THE SVM-SGD LEARNING ALGORITHM. Charbel Sakr, Ameya Patil, Sai Zhang, Yongjune Kim, and Naresh Shanbhag

Size: px

Start display at page:

Download "MINIMUM PRECISION REQUIREMENTS FOR THE SVM-SGD LEARNING ALGORITHM. Charbel Sakr, Ameya Patil, Sai Zhang, Yongjune Kim, and Naresh Shanbhag"

Diana Nichols
6 years ago
Views:

1 MINIMUM PRECISION REQUIREMENTS FOR THE SVM-SGD LEARNING ALGORITHM Charbel Sakr, Amea Patil, Sai Zhang, Yongjune Kim, and Naresh Shanbhag Dept. of Electrical and Computer Engineering, Universit of Illinois at Urbana Champaign ABSTRACT It is well-known that the precision of data, weight vector, and internal representations emploed in learning sstems directl impacts their energ, throughput, and latenc. The precision requirements for the training algorithm are also important for sstems that learn on-the-fl. In this paper, we present analtical lower bounds on the precision requirements for the commonl emploed stochastic gradient descent (SGD) on-line learning algorithm in the specific context of a support vector machine (SVM). These bounds are obtained subject to desired sstem performance. These bounds are validated using the UCI breast cancer dataset. Additionall, the impact of these precisions on the energ consumption of a fixed-point SVM with on-line training is studied. Simulation results in 45 nm CMOS process show that operating at the minimum precision as dictated b our bounds improves energ consumption b a factor of 5.3 as compared to conventional precision assignments with no observable loss in accurac. Index Terms machine learning, fixed point, precision, energ, accurac. INTRODUCTION Precision of data, weight vector, and internal signal representations in a machine learning implementation has a deep impact on its energ consumption and throughput. Recent works have empiricall studied the effect of moderate [] and heav [2, 3, 4, 5] quantization on the performance of learning sstems. A more analtical approach has recentl emerged where signal-to-quantization noise ratio (SQNR) is used as a metric to attempt establishing precision to accurac trade-offs for the feedforward path [6] and training [7] of deep neural networks. Our work addresses the problem of a sstematic wa of assigning precision to a fixed-point learning sstem while providing accurac guarantees. We stud the specific case of a support vector machine (SVM) [8] being trained using the stochastic gradient descent (SGD) algorithm, i.e., the SVM- SGD algorithm. Our approach is analtical in contrast to the trial-and-error approach emploed toda. Energ consumption is also brought in as a metric for consideration in the design of learning sstems. This work was supported in part b Sstems on Nanoscale Information fabrics (SONIC), one of the six SRC STARnet Centers, sponsored b MARCO and DARPA... Contributions Our contributions in this paper are the following: we provide analtical lower bounds on the precision of the data, weight vector, and training parameters for the SVM-SGD algorithm subject to requirements on classification accurac. We also stud the impact of precision reduction on the energ consumption of a baseline SVM-SGD architecture b emploing architectural level energ and dela models in a 45 nm CMOS process. We show that up to 5.3 reduction in energ is achieved when precisions are assigned based on the lower bounds as compared to a 6 b baseline implementation. The rest of this paper is organized as follows. Section 2 presents necessar background on which we build up our work. Section 3 proposes our analtical bounds on precision for fixed-point implementations. Experimental results are included in Section 4. We conclude our paper in Section BACKGROUND 2.. Online Learning SVM (SVM-SGD) SGD is an efficient training method for machine learning algorithms [9]. At each iteration, the weight vector of the algorithm is updated as follows: w n+ = w n γ w Q(z n, w n ) () where Q( ) is the loss function to be minimized, γ is the stepsize, w n R D is the weight vector to be learned, z n is the streamed sample which usuall consists of an input data vector x n and a corresponding label n, and D is the dimensionalit of the input data. SVM [8] is a simple and popular supervised learning method for classification. SVM operates b determining a maximum margin separating hperplane in the feature space. For regularization, some feature vectors ma be allowed to lie inside the margin making it a soft margin. The SVM predicts the label ŷ n {±} given a feature vector (i.e., input data vector) x n as follows: w T x n + b ŷ n= ŷ n= (2) where w is the weight vector and b is the bias term. The classification error for the SVM is defined as p e = P {Y

2 x B X B F T wx >> D Classifier Weight Update >> D >> >> w B W B F b B W ŷ ( ) Fig. : SVM-SGD algorithm s architecture showing the precision per dimension. Ŷ }. In the rest of this paper, we emplo capital letters to denote random variables. The optimum weight vector w in (2) maximizes the margin or minimizes the average of following loss function: Q(z n, w) = λ( w 2 +b 2 )+max{, n (w T x n +b)} (3) It can be shown that, when applied to the SVM algorithm, specificall the loss function defined in (3), the SGD update equation is [9]: { if n (wn T x n + b) >, w n+ = ( γ λ)w n + γ n x n otherwise. { if n (wn T x n + b) >, b n+ = ( γλ)b n + γ n otherwise. (4) 2.2. Architectural Energ and Dela Models Fig. shows the architecture of the SVM-SGD algorithm and indicates the precision assignments (per dimension) for the input B X, the classifier B F, and the weight update B W. The critical path of the architecture is an path between two latch nodes having the maximum dela, assuming inputs and outputs are also latched. The main building block of such computational sstem is the b full-adder (FA). Hence the critical path is the one having maximum number of FAs. The maximum throughput at which the architecture can operate is the reciprocal of the critical path dela as follows []: f max = I ON βl F A C F A V dd (5) where L F A is the number of FAs in the critical path of the architecture, C F A is the load capacitance of one FA, β is an empirical fitting parameter, I ON is the ON current of a transistor, and V dd is the suppl voltage at which the architecture is being operated at. The energ consumption of the architecture operating at f max is given b []: E = αn F A C F A V 2 dd + βn F A L F A C F A V 2 dd V dd S (6) where N F A is the total number of FAs in the sstem, α is the activit factor, and S is the subthreshold slope. It is found [] that both N F A and L F A are linear in the precisions B X, B F, and B W. Since E is proportional to the product of N F A and L F A, E is a quadratic function of these precisions. This clearl indicates the importance of minimizing B X, B F, and B W. Note, this analsis assumes standard implementation using ripple carr adders and Baugh-Woole multipliers. 3. ANALYTICAL BOUNDS ON PRECISION The impact of precisions of data (B X ), weight vector (B F ), and weight update block (B W ) on the overall SQNR is well established for finite-impulse response (FIR) filters and least mean-squared (LMS) adaptive filters [2]. In this section, we analticall predict the accurac of the SVM-SGD algorithm as a function of B X, B F, and B W. All details concerning the derivation and proofs of the upcoming results can be found in []. 3.. Classification Block We assume that the input data is quantized to B X bits and is hence corrupted b quantization noise that is independentl uniforml distributed on [ x 2, x 2 ] per dimension [3]. Note that x = 2 (B X ) is the input quantization step. Similarl, the weights are quantized to B F bits and are hence corrupted b quantization noise that is independentl uniforml distributed on [ f 2, f 2 ] per dimension where f = 2 (B F ). Our first result is a lower bound derived based on the geometric propert of the SVM. This lower bound on B X ensures that an datapoint ling outside the margin is classified correctl even after being perturbed b quantization noise. Theorem 3. (Geometric Bound). Given D, B F, and w, a feature vector x ling outside the margin will be classified correctl if ( ) D w B X > log 2 ( +. (7) D x )2 B F Proof. The main idea is to make sure that the worst case quantization noise is less than the functional margin of the classifier. For details, please see [].

3 The geometric bound reveals the following: larger SVM margin (i.e., smaller w ) allows greater reduction of B X ; there is a trade-off relation between B X and B F ; higher dimension D and larger x require more input precision B X. Note that Theorem 3. is specific to a single feature vector x. The following is a simple corollar that applies to all datapoints in the dataset ling outside the margin. Corollar 3... Given D, B F, and w, an feature vector in the dataset X ling outside the margin will be classified correctl if D w B X > log 2 ( + D max x X x )2 B F. (8) It is worth noting that max x X x D because of normalization. Let Ŷfx be the output of the fixed-point classifier and Ŷfl be the output of the floating-point classifier. Our next result is an upper bound on the mismatch probabilit p m = Pr{Ŷfx Ŷ fl } as a function of precision. Theorem 3.2 (Probabilistic Bound). Given B X, B F, w, and b, the upper bound on the mismatch probabilit p m is given b: p m 24 ( 2 x w 2 E + 2 f E [ X 2 + w T X + b 2 [ ] w T X + b 2 ]) where X denotes the random variable of the dataset. Note that the statistics (i.e., the expected values in (9)) are calculated empiricall. Proof. The main idea is to consider the mismatch event for one datapoint and upper bound its probabilit using Chebshev s inequalit. The law of total probabilit is then used to obtain the upper bound over the whole dataset. For details, please see []. Theorem 3.2 can be reformulated to obtain the lower bound on precision to achieve a target worst-case mismatch rate. It can also be extended to obtain the upper bound on the actual classification accurac of the fixed-point realization as follows: Theorem 3.3. The fixed-point probabilit of error is upper bounded as follows: (9) p e + min(p fl, p m ) max(p fl, p m ). () where p fl = Pr{Ŷfl = Y } denotes the probabilit of detection in floating-point. Proof. Basic laws of probabilit theor were used. For details, please see [] Weight Update Block Analsis In the preceding discussion, w is the converged weight vector of an infinite-precision algorithm. In practice, it is standard to first implement a floating-point algorithm with desirable convergence properties and then to quantize so as to keep the convergence behavior unaltered. The upcoming setup assumes without loss of generalit that the floating-point convergence of SGD in the context of SVM (4) is obtained for λ = and some small value of γ (tpicall a negative power of 2) Our next result ensures that the non-zero update term in (4) is represented correctl during training in spite of weight update quantization. Theorem 3.4. The lower bound on the weight update precision B W in order to ensure full convergence is given b: B W B X log 2 (γ). () Proof. The idea is to prevent the updates from adding additional quantization. For details, please see []. The setup of Theorem 3.4 is quite conservative. In fact, it is possible to sometimes break the bound in () and still obtain satisfactor convergence behavior []. Two particular cases are to be noted. If B W = log 2 (γ), then we obtain a sign-sgd behavior onl for negative updates. This is because onl the sign bit of the input data is retrieved after truncation at the end of the update. Similarl, if B W = log 2 (γ), then we still obtain a sign-sgd behavior, but the effective step-size doubles due to truncation. 4. EXPERIMENTAL RESULTS 4.. Classification Accurac and Convergence In this section, we validate our analtical results based on the UCI Breast Cancer dataset [4]. For fixed-point classification with B F = 6, Fig. 2 (a) shows that B X 5 is sufficient to achieve a target classification probabilit of p e.6. This is consistent with the analtical values determined b the geometric bound (8) and the probabilistic bound (9): B X = 4 and B X = 5, respectivel. Fig. 2 (b) shows the convergence curve of the average loss function of SVM as described in (3) for various weight update block precisions B W. The minimum precision given b () is B W = when γ = 2 5, and B X = 6 chosen to be consistent with the bounds shown in Fig. 2(a). We note that the fixed-point convergence curve tracks the floating-point curve ver closel. We also show convergence curves for precisions B W =, 7, 6, and 5. The curves are similar to the floatingpoint curve though with an observable loss in the accurac for lower precisions. Further analsis explaining this loss in

4 Loss Function E(J) Loss Function E(J) p e B X Bits V dd Sample (V) (a) Index (V) V dd Fig. 3: Energ savings using minimum precision (B F = 6, B X = 5, B W = ) for p m <., compared to 2 b higher precision (B F = 8, B X = 7, B W = 2) and uniform precision assignment of 6 b (B F = 6, B X = 6, B W = 6). Sample Index (b) Fig. 2: Experimental results for the Breast Cancer dataset: (a) fixed-point classifier simulation (FX sim), probabilistic upper bound (UB), and geometric bound (GB) for B F = 6, and (b) convergence curves for fixed-point (B F = 6 and B X = 6) and floating-point (FL sim) SGD training (γ = 2 5 and λ = ). accurac can be found in []. Note: the initial faster convergence but eventual loss in accurac when B W = 5 is because it corresponds to B W = log 2 (γ) meaning a sign-sgd behavior with double the step size as discussed in Section Energ Consumption We emploed the energ-dela models in Section 2.2 to estimate the energ consumption of the SVM-SGD architecture (Fig. ) when processing the Breast Cancer dataset. Our methodolog is the same as the one in [] but it uses a 45 nm semiconductor process parameters. This methodolog is useful for estimating the benefits of precision reduction without a time-consuming and laborious ASIC design process. The model parameters obtained through circuit simulations of a full adder and their values can be found in []. Based on these parameters, we evaluate the energ consumption in (6). Fig. 3 shows the energ consumption of the architecture as a function of suppl voltage for different precisions in order to achieve p m <.. An architecture that emplos 6 b for all variables is chosen as reference as recent works on fixedpoint learning have emploed commercial 6 b digital signal processors []. As discussed in [], the trade-off between leakage and dnamic energ (the two terms in (6)) leads to the well known minimum energ operating point (MEOP) for each curve. As shown in Fig. 3, the precision assignment using (9)-() results in 5.3 energ savings at the MEOP over a 6 b fixed-point implementation. 5. CONCLUSION In this paper, we presented analtical bounds on the precisions of data, weight vector, and weight update block for the SVM-SGD algorithm. These bounds are closel related to performance metrics such as the classification accurac and mismatch probabilit. Moreover, we quantified the impact of precision on the energ cost of inference in a commercial semiconductor process technolog using circuit analsis and models. We observed significant energ savings when assigning precision at the lower bounds as compared to prior work. Using our results, designers of fixed-point realization can efficientl determine precision requirements analticall. Future work includes developing a similar fixed-point analsis of the popular Deep Neural Networks (DNN). Such networks tend to be highl complex. A sstematic stud of precision requirements will be extremel valuable.

5 References [] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Naraanan, Deep learning with limited numerical precision, in Proceedings of The 32nd International Conference on Machine Learning, 25, pp [3] K. Parhi, VLSI digital signal processing sstems: design and implementation, John Wile & Sons, 27. [4] A. Asuncion and D. Newman, UCI machine learning repositor, edu/ml/, 27, [online]. [2] M. Kim and P. Smaragdis, Bitwise Neural Networks, arxiv preprint arxiv:6.67, 26. [3] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, Binarconnect: Training deep neural networks with binar weights during propagations, in Advances in Neural Information Processing Sstems, 25, pp [4] Matthieu Courbariaux and Yoshua Bengio, Binarnet: Training deep neural networks with weights and activations constrained to + or -, arxiv preprint arxiv:62.283, 26. [5] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, XNOR-Net: ImageNet classification csing binar convolutional neural networks, arxiv preprint arxiv: , 26. [6] Darrl D Lin, Sachin S Talathi, and V Sreekanth Annapuredd, Fixed point quantization of deep convolutional networks, arxiv preprint arxiv:5.6393, 25. [7] Darrl D Lin and Sachin S Talathi, Overcoming challenges in fixed point training of deep convolutional networks, arxiv preprint arxiv:67.224, 26. [8] C. Cortes and V. Vapnik, Support-vector networks, Machine learning, vol. 2, no. 3, pp , 995. [9] L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMP- STAT 2, pp Springer, 2. [] R. Abdallah and N. Shanbhag, Reducing energ at the minimum energ operating point via statistical error compensation, Ver Large Scale Integration (VLSI) Sstems, IEEE Transactions on, vol. 22, no. 6, pp , 24. [] Charbel Sakr, Amea Patil, Sai Zhang, and Naresh Shanbhag, Understanding the energ and precision requirements for online learning, arxiv preprint arxiv:67.669, 26. [2] M. Goel and N. Shanbhag, Finite-precision analsis of the pipelined strength-reduced adaptive filter, Signal Processing, IEEE Transactions on, vol. 46, no. 6, pp , 998.

arxiv: v3 [stat.ml] 26 Aug 2016

arxiv: v3 [stat.ml] 26 Aug 2016 Understanding the Energy and Precision Requirements for Online Learning arxiv:67.669v3 [stat.ml] 26 Aug 26 Charbel Sakr Ameya Patil Sai Zhang Yongjune Kim Naresh Shanbhag Department of Electrical and Computer