MINIMUM PRECISION REQUIREMENTS FOR THE SVM-SGD LEARNING ALGORITHM. Charbel Sakr, Ameya Patil, Sai Zhang, Yongjune Kim, and Naresh Shanbhag
|
|
- Diana Nichols
- 6 years ago
- Views:
Transcription
1 MINIMUM PRECISION REQUIREMENTS FOR THE SVM-SGD LEARNING ALGORITHM Charbel Sakr, Amea Patil, Sai Zhang, Yongjune Kim, and Naresh Shanbhag Dept. of Electrical and Computer Engineering, Universit of Illinois at Urbana Champaign ABSTRACT It is well-known that the precision of data, weight vector, and internal representations emploed in learning sstems directl impacts their energ, throughput, and latenc. The precision requirements for the training algorithm are also important for sstems that learn on-the-fl. In this paper, we present analtical lower bounds on the precision requirements for the commonl emploed stochastic gradient descent (SGD) on-line learning algorithm in the specific context of a support vector machine (SVM). These bounds are obtained subject to desired sstem performance. These bounds are validated using the UCI breast cancer dataset. Additionall, the impact of these precisions on the energ consumption of a fixed-point SVM with on-line training is studied. Simulation results in 45 nm CMOS process show that operating at the minimum precision as dictated b our bounds improves energ consumption b a factor of 5.3 as compared to conventional precision assignments with no observable loss in accurac. Index Terms machine learning, fixed point, precision, energ, accurac. INTRODUCTION Precision of data, weight vector, and internal signal representations in a machine learning implementation has a deep impact on its energ consumption and throughput. Recent works have empiricall studied the effect of moderate [] and heav [2, 3, 4, 5] quantization on the performance of learning sstems. A more analtical approach has recentl emerged where signal-to-quantization noise ratio (SQNR) is used as a metric to attempt establishing precision to accurac trade-offs for the feedforward path [6] and training [7] of deep neural networks. Our work addresses the problem of a sstematic wa of assigning precision to a fixed-point learning sstem while providing accurac guarantees. We stud the specific case of a support vector machine (SVM) [8] being trained using the stochastic gradient descent (SGD) algorithm, i.e., the SVM- SGD algorithm. Our approach is analtical in contrast to the trial-and-error approach emploed toda. Energ consumption is also brought in as a metric for consideration in the design of learning sstems. This work was supported in part b Sstems on Nanoscale Information fabrics (SONIC), one of the six SRC STARnet Centers, sponsored b MARCO and DARPA... Contributions Our contributions in this paper are the following: we provide analtical lower bounds on the precision of the data, weight vector, and training parameters for the SVM-SGD algorithm subject to requirements on classification accurac. We also stud the impact of precision reduction on the energ consumption of a baseline SVM-SGD architecture b emploing architectural level energ and dela models in a 45 nm CMOS process. We show that up to 5.3 reduction in energ is achieved when precisions are assigned based on the lower bounds as compared to a 6 b baseline implementation. The rest of this paper is organized as follows. Section 2 presents necessar background on which we build up our work. Section 3 proposes our analtical bounds on precision for fixed-point implementations. Experimental results are included in Section 4. We conclude our paper in Section BACKGROUND 2.. Online Learning SVM (SVM-SGD) SGD is an efficient training method for machine learning algorithms [9]. At each iteration, the weight vector of the algorithm is updated as follows: w n+ = w n γ w Q(z n, w n ) () where Q( ) is the loss function to be minimized, γ is the stepsize, w n R D is the weight vector to be learned, z n is the streamed sample which usuall consists of an input data vector x n and a corresponding label n, and D is the dimensionalit of the input data. SVM [8] is a simple and popular supervised learning method for classification. SVM operates b determining a maximum margin separating hperplane in the feature space. For regularization, some feature vectors ma be allowed to lie inside the margin making it a soft margin. The SVM predicts the label ŷ n {±} given a feature vector (i.e., input data vector) x n as follows: w T x n + b ŷ n= ŷ n= (2) where w is the weight vector and b is the bias term. The classification error for the SVM is defined as p e = P {Y
2 x B X B F T wx >> D Classifier Weight Update >> D >> >> w B W B F b B W ŷ ( ) Fig. : SVM-SGD algorithm s architecture showing the precision per dimension. Ŷ }. In the rest of this paper, we emplo capital letters to denote random variables. The optimum weight vector w in (2) maximizes the margin or minimizes the average of following loss function: Q(z n, w) = λ( w 2 +b 2 )+max{, n (w T x n +b)} (3) It can be shown that, when applied to the SVM algorithm, specificall the loss function defined in (3), the SGD update equation is [9]: { if n (wn T x n + b) >, w n+ = ( γ λ)w n + γ n x n otherwise. { if n (wn T x n + b) >, b n+ = ( γλ)b n + γ n otherwise. (4) 2.2. Architectural Energ and Dela Models Fig. shows the architecture of the SVM-SGD algorithm and indicates the precision assignments (per dimension) for the input B X, the classifier B F, and the weight update B W. The critical path of the architecture is an path between two latch nodes having the maximum dela, assuming inputs and outputs are also latched. The main building block of such computational sstem is the b full-adder (FA). Hence the critical path is the one having maximum number of FAs. The maximum throughput at which the architecture can operate is the reciprocal of the critical path dela as follows []: f max = I ON βl F A C F A V dd (5) where L F A is the number of FAs in the critical path of the architecture, C F A is the load capacitance of one FA, β is an empirical fitting parameter, I ON is the ON current of a transistor, and V dd is the suppl voltage at which the architecture is being operated at. The energ consumption of the architecture operating at f max is given b []: E = αn F A C F A V 2 dd + βn F A L F A C F A V 2 dd V dd S (6) where N F A is the total number of FAs in the sstem, α is the activit factor, and S is the subthreshold slope. It is found [] that both N F A and L F A are linear in the precisions B X, B F, and B W. Since E is proportional to the product of N F A and L F A, E is a quadratic function of these precisions. This clearl indicates the importance of minimizing B X, B F, and B W. Note, this analsis assumes standard implementation using ripple carr adders and Baugh-Woole multipliers. 3. ANALYTICAL BOUNDS ON PRECISION The impact of precisions of data (B X ), weight vector (B F ), and weight update block (B W ) on the overall SQNR is well established for finite-impulse response (FIR) filters and least mean-squared (LMS) adaptive filters [2]. In this section, we analticall predict the accurac of the SVM-SGD algorithm as a function of B X, B F, and B W. All details concerning the derivation and proofs of the upcoming results can be found in []. 3.. Classification Block We assume that the input data is quantized to B X bits and is hence corrupted b quantization noise that is independentl uniforml distributed on [ x 2, x 2 ] per dimension [3]. Note that x = 2 (B X ) is the input quantization step. Similarl, the weights are quantized to B F bits and are hence corrupted b quantization noise that is independentl uniforml distributed on [ f 2, f 2 ] per dimension where f = 2 (B F ). Our first result is a lower bound derived based on the geometric propert of the SVM. This lower bound on B X ensures that an datapoint ling outside the margin is classified correctl even after being perturbed b quantization noise. Theorem 3. (Geometric Bound). Given D, B F, and w, a feature vector x ling outside the margin will be classified correctl if ( ) D w B X > log 2 ( +. (7) D x )2 B F Proof. The main idea is to make sure that the worst case quantization noise is less than the functional margin of the classifier. For details, please see [].
3 The geometric bound reveals the following: larger SVM margin (i.e., smaller w ) allows greater reduction of B X ; there is a trade-off relation between B X and B F ; higher dimension D and larger x require more input precision B X. Note that Theorem 3. is specific to a single feature vector x. The following is a simple corollar that applies to all datapoints in the dataset ling outside the margin. Corollar 3... Given D, B F, and w, an feature vector in the dataset X ling outside the margin will be classified correctl if D w B X > log 2 ( + D max x X x )2 B F. (8) It is worth noting that max x X x D because of normalization. Let Ŷfx be the output of the fixed-point classifier and Ŷfl be the output of the floating-point classifier. Our next result is an upper bound on the mismatch probabilit p m = Pr{Ŷfx Ŷ fl } as a function of precision. Theorem 3.2 (Probabilistic Bound). Given B X, B F, w, and b, the upper bound on the mismatch probabilit p m is given b: p m 24 ( 2 x w 2 E + 2 f E [ X 2 + w T X + b 2 [ ] w T X + b 2 ]) where X denotes the random variable of the dataset. Note that the statistics (i.e., the expected values in (9)) are calculated empiricall. Proof. The main idea is to consider the mismatch event for one datapoint and upper bound its probabilit using Chebshev s inequalit. The law of total probabilit is then used to obtain the upper bound over the whole dataset. For details, please see []. Theorem 3.2 can be reformulated to obtain the lower bound on precision to achieve a target worst-case mismatch rate. It can also be extended to obtain the upper bound on the actual classification accurac of the fixed-point realization as follows: Theorem 3.3. The fixed-point probabilit of error is upper bounded as follows: (9) p e + min(p fl, p m ) max(p fl, p m ). () where p fl = Pr{Ŷfl = Y } denotes the probabilit of detection in floating-point. Proof. Basic laws of probabilit theor were used. For details, please see [] Weight Update Block Analsis In the preceding discussion, w is the converged weight vector of an infinite-precision algorithm. In practice, it is standard to first implement a floating-point algorithm with desirable convergence properties and then to quantize so as to keep the convergence behavior unaltered. The upcoming setup assumes without loss of generalit that the floating-point convergence of SGD in the context of SVM (4) is obtained for λ = and some small value of γ (tpicall a negative power of 2) Our next result ensures that the non-zero update term in (4) is represented correctl during training in spite of weight update quantization. Theorem 3.4. The lower bound on the weight update precision B W in order to ensure full convergence is given b: B W B X log 2 (γ). () Proof. The idea is to prevent the updates from adding additional quantization. For details, please see []. The setup of Theorem 3.4 is quite conservative. In fact, it is possible to sometimes break the bound in () and still obtain satisfactor convergence behavior []. Two particular cases are to be noted. If B W = log 2 (γ), then we obtain a sign-sgd behavior onl for negative updates. This is because onl the sign bit of the input data is retrieved after truncation at the end of the update. Similarl, if B W = log 2 (γ), then we still obtain a sign-sgd behavior, but the effective step-size doubles due to truncation. 4. EXPERIMENTAL RESULTS 4.. Classification Accurac and Convergence In this section, we validate our analtical results based on the UCI Breast Cancer dataset [4]. For fixed-point classification with B F = 6, Fig. 2 (a) shows that B X 5 is sufficient to achieve a target classification probabilit of p e.6. This is consistent with the analtical values determined b the geometric bound (8) and the probabilistic bound (9): B X = 4 and B X = 5, respectivel. Fig. 2 (b) shows the convergence curve of the average loss function of SVM as described in (3) for various weight update block precisions B W. The minimum precision given b () is B W = when γ = 2 5, and B X = 6 chosen to be consistent with the bounds shown in Fig. 2(a). We note that the fixed-point convergence curve tracks the floating-point curve ver closel. We also show convergence curves for precisions B W =, 7, 6, and 5. The curves are similar to the floatingpoint curve though with an observable loss in the accurac for lower precisions. Further analsis explaining this loss in
4 Loss Function E(J) Loss Function E(J) p e B X Bits V dd Sample (V) (a) Index (V) V dd Fig. 3: Energ savings using minimum precision (B F = 6, B X = 5, B W = ) for p m <., compared to 2 b higher precision (B F = 8, B X = 7, B W = 2) and uniform precision assignment of 6 b (B F = 6, B X = 6, B W = 6). Sample Index (b) Fig. 2: Experimental results for the Breast Cancer dataset: (a) fixed-point classifier simulation (FX sim), probabilistic upper bound (UB), and geometric bound (GB) for B F = 6, and (b) convergence curves for fixed-point (B F = 6 and B X = 6) and floating-point (FL sim) SGD training (γ = 2 5 and λ = ). accurac can be found in []. Note: the initial faster convergence but eventual loss in accurac when B W = 5 is because it corresponds to B W = log 2 (γ) meaning a sign-sgd behavior with double the step size as discussed in Section Energ Consumption We emploed the energ-dela models in Section 2.2 to estimate the energ consumption of the SVM-SGD architecture (Fig. ) when processing the Breast Cancer dataset. Our methodolog is the same as the one in [] but it uses a 45 nm semiconductor process parameters. This methodolog is useful for estimating the benefits of precision reduction without a time-consuming and laborious ASIC design process. The model parameters obtained through circuit simulations of a full adder and their values can be found in []. Based on these parameters, we evaluate the energ consumption in (6). Fig. 3 shows the energ consumption of the architecture as a function of suppl voltage for different precisions in order to achieve p m <.. An architecture that emplos 6 b for all variables is chosen as reference as recent works on fixedpoint learning have emploed commercial 6 b digital signal processors []. As discussed in [], the trade-off between leakage and dnamic energ (the two terms in (6)) leads to the well known minimum energ operating point (MEOP) for each curve. As shown in Fig. 3, the precision assignment using (9)-() results in 5.3 energ savings at the MEOP over a 6 b fixed-point implementation. 5. CONCLUSION In this paper, we presented analtical bounds on the precisions of data, weight vector, and weight update block for the SVM-SGD algorithm. These bounds are closel related to performance metrics such as the classification accurac and mismatch probabilit. Moreover, we quantified the impact of precision on the energ cost of inference in a commercial semiconductor process technolog using circuit analsis and models. We observed significant energ savings when assigning precision at the lower bounds as compared to prior work. Using our results, designers of fixed-point realization can efficientl determine precision requirements analticall. Future work includes developing a similar fixed-point analsis of the popular Deep Neural Networks (DNN). Such networks tend to be highl complex. A sstematic stud of precision requirements will be extremel valuable.
5 References [] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Naraanan, Deep learning with limited numerical precision, in Proceedings of The 32nd International Conference on Machine Learning, 25, pp [3] K. Parhi, VLSI digital signal processing sstems: design and implementation, John Wile & Sons, 27. [4] A. Asuncion and D. Newman, UCI machine learning repositor, edu/ml/, 27, [online]. [2] M. Kim and P. Smaragdis, Bitwise Neural Networks, arxiv preprint arxiv:6.67, 26. [3] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, Binarconnect: Training deep neural networks with binar weights during propagations, in Advances in Neural Information Processing Sstems, 25, pp [4] Matthieu Courbariaux and Yoshua Bengio, Binarnet: Training deep neural networks with weights and activations constrained to + or -, arxiv preprint arxiv:62.283, 26. [5] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, XNOR-Net: ImageNet classification csing binar convolutional neural networks, arxiv preprint arxiv: , 26. [6] Darrl D Lin, Sachin S Talathi, and V Sreekanth Annapuredd, Fixed point quantization of deep convolutional networks, arxiv preprint arxiv:5.6393, 25. [7] Darrl D Lin and Sachin S Talathi, Overcoming challenges in fixed point training of deep convolutional networks, arxiv preprint arxiv:67.224, 26. [8] C. Cortes and V. Vapnik, Support-vector networks, Machine learning, vol. 2, no. 3, pp , 995. [9] L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMP- STAT 2, pp Springer, 2. [] R. Abdallah and N. Shanbhag, Reducing energ at the minimum energ operating point via statistical error compensation, Ver Large Scale Integration (VLSI) Sstems, IEEE Transactions on, vol. 22, no. 6, pp , 24. [] Charbel Sakr, Amea Patil, Sai Zhang, and Naresh Shanbhag, Understanding the energ and precision requirements for online learning, arxiv preprint arxiv:67.669, 26. [2] M. Goel and N. Shanbhag, Finite-precision analsis of the pipelined strength-reduced adaptive filter, Signal Processing, IEEE Transactions on, vol. 46, no. 6, pp , 998.
arxiv: v3 [stat.ml] 26 Aug 2016
Understanding the Energy and Precision Requirements for Online Learning arxiv:67.669v3 [stat.ml] 26 Aug 26 Charbel Sakr Ameya Patil Sai Zhang Yongjune Kim Naresh Shanbhag Department of Electrical and Computer
More informationLayer 1. Layer L. difference between the two precisions (B A and B W ) is chosen to balance the sum in (1), as follows: B A B W = round log 2 (2)
AN ANALYTICAL METHOD TO DETERMINE MINIMUM PER-LAYER PRECISION OF DEEP NEURAL NETWORKS Charbel Sakr Naresh Shanbhag Dept. of Electrical Computer Engineering, University of Illinois at Urbana Champaign ABSTRACT
More informationAn Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks
An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks Charbel Sakr, Naresh Shanbhag Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign
More informationAnalytical Guarantees on Numerical Precision of Deep Neural Networks
Charbel Sakr Yongjune Kim Naresh Shanbhag Abstract The acclaimed successes of neural networks often overshadow their tremendous complexity. We focus on numerical precision - a key parameter defining the
More informationAnalytical Guarantees on Numerical Precision of Deep Neural Networks
Charbel Sakr Yongjune Kim Naresh Shanbhag Abstract The acclaimed successes of neural networks often overshadow their tremendous complexity. We focus on numerical precision - a key parameter defining the
More informationσ(x) = clip( x m + α 2, 0, α) = min(max( x m + α 2, 0), α) (1)
TRUE GRADIENT-BASED TRAINING OF DEEP BINARY ACTIVATED NEURAL NETWORKS VIA CONTINUOUS BINARIZATION Charbel Sakr, Jungwook Choi, Zhuo Wang, Kailash Gopalakrishnan, Naresh Shanbhag Dept. of Electrical and
More informationBinary Deep Learning. Presented by Roey Nagar and Kostya Berestizshevsky
Binary Deep Learning Presented by Roey Nagar and Kostya Berestizshevsky Deep Learning Seminar, School of Electrical Engineering, Tel Aviv University January 22 nd 2017 Lecture Outline Motivation and existing
More informationTYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad
TYPES OF MODEL COMPRESSION Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad 1. Pruning 2. Quantization 3. Architectural Modifications PRUNING WHY PRUNING? Deep Neural Networks have redundant parameters.
More informationOn the Complexity of Neural-Network-Learned Functions
On the Complexity of Neural-Network-Learned Functions Caren Marzban 1, Raju Viswanathan 2 1 Applied Physics Lab, and Dept of Statistics, University of Washington, Seattle, WA 98195 2 Cyberon LLC, 3073
More informationStochastic Computing: A Design Sciences Approach to Moore s Law
Stochastic Computing: A Design Sciences Approach to Moore s Law Naresh Shanbhag Department of Electrical and Computer Engineering Coordinated Science Laboratory University of Illinois at Urbana Champaign
More informationPERFECT ERROR COMPENSATION VIA ALGORITHMIC ERROR CANCELLATION
PERFECT ERROR COMPENSATION VIA ALGORITHMIC ERROR CANCELLATION Sujan K. Gonugondla, Byonghyo Shim and Naresh R. Shanbhag Dept. of Electrical and Computer Engineering, University of Illinois at Urbana Champaign
More informationDifferentiable Fine-grained Quantization for Deep Neural Network Compression
Differentiable Fine-grained Quantization for Deep Neural Network Compression Hsin-Pai Cheng hc218@duke.edu Yuanjun Huang University of Science and Technology of China Anhui, China yjhuang@mail.ustc.edu.cn
More informationTowards Accurate Binary Convolutional Neural Network
Paper: #261 Poster: Pacific Ballroom #101 Towards Accurate Binary Convolutional Neural Network Xiaofan Lin, Cong Zhao and Wei Pan* firstname.lastname@dji.com Photos and videos are either original work
More informationONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM. Abhishek Singh, Narendra Ahuja and Pierre Moulin
22 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 23 26, 22, SANTANDER, SPAIN ONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM Abhishek Singh, Narendra Ahuja
More informationDeep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści
Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?
More informationCOMPARISON STUDY OF SENSITIVITY DEFINITIONS OF NEURAL NETWORKS
COMPARISON SUDY OF SENSIIVIY DEFINIIONS OF NEURAL NEORKS CHUN-GUO LI 1, HAI-FENG LI 1 Machine Learning Center, Facult of Mathematics and Computer Science, Hebei Universit, Baoding 07100, China Department
More informationA Novel Low Power 1-bit Full Adder with CMOS Transmission-gate Architecture for Portable Applications
A Novel Low Power 1-bit Full Adder with CMOS Transmission-gate Architecture for Portable Applications M. C. Parameshwara 1,K.S.Shashidhara 2 and H. C. Srinivasaiah 3 1 Department of Electronics and Communication
More informationarxiv: v1 [cs.lg] 10 Nov 2017
Quantized Memory-Augmented Neural Networks Seongsik Park, Seijoon Kim, Seil Lee, Ho Bae 2, and Sungroh Yoon,2 Department of Electrical and Computer Engineering, Seoul National University, Seoul 8826, Korea
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationCredit Assignment: Beyond Backpropagation
Credit Assignment: Beyond Backpropagation Yoshua Bengio 11 December 2016 AutoDiff NIPS 2016 Workshop oo b s res P IT g, M e n i arn nlin Le ain o p ee em : D will r G PLU ters p cha k t, u o is Deep Learning
More informationSajid Anwar, Kyuyeon Hwang and Wonyong Sung
Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationNeural Network Approximation. Low rank, Sparsity, and Quantization Oct. 2017
Neural Network Approximation Low rank, Sparsity, and Quantization zsc@megvii.com Oct. 2017 Motivation Faster Inference Faster Training Latency critical scenarios VR/AR, UGV/UAV Saves time and energy Higher
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationCS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning
CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network
More information<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)
Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation
More informationLarge Scale Semi-supervised Linear SVM with Stochastic Gradient Descent
Journal of Computational Information Systems 9: 15 (2013) 6251 6258 Available at http://www.jofcis.com Large Scale Semi-supervised Linear SVM with Stochastic Gradient Descent Xin ZHOU, Conghui ZHU, Sheng
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationSGD and Deep Learning
SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients
More informationOn the design of Incremental ΣΔ Converters
M. Belloni, C. Della Fiore, F. Maloberti, M. Garcia Andrade: "On the design of Incremental ΣΔ Converters"; IEEE Northeast Workshop on Circuits and Sstems, NEWCAS 27, Montreal, 5-8 August 27, pp. 376-379.
More informationA short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie
A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab
More informationSupport Vector Machine via Nonlinear Rescaling Method
Manuscript Click here to download Manuscript: svm-nrm_3.tex Support Vector Machine via Nonlinear Rescaling Method Roman Polyak Department of SEOR and Department of Mathematical Sciences George Mason University
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationOther examples of neurons can be found in:
4 Concept neurons Introduction to artificial neural networks 41 Tpical neuron/nerve cell: from Kandel, Schwartz and Jessel, Principles of Neural Science The cell od or soma contains the nucleus, the storehouse
More informationarxiv: v3 [cs.lg] 2 Jun 2016
Darryl D. Lin Qualcomm Research, San Diego, CA 922, USA Sachin S. Talathi Qualcomm Research, San Diego, CA 922, USA DARRYL.DLIN@GMAIL.COM TALATHI@GMAIL.COM arxiv:5.06393v3 [cs.lg] 2 Jun 206 V. Sreekanth
More informationSimple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017
Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient
More informationDesign for Manufacturability and Power Estimation. Physical issues verification (DSM)
Design for Manufacturability and Power Estimation Lecture 25 Alessandra Nardi Thanks to Prof. Jan Rabaey and Prof. K. Keutzer Physical issues verification (DSM) Interconnects Signal Integrity P/G integrity
More informationCSC242: Intro to AI. Lecture 21
CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages
More informationCSE 546 Midterm Exam, Fall 2014
CSE 546 Midterm Eam, Fall 2014 1. Personal info: Name: UW NetID: Student ID: 2. There should be 14 numbered pages in this eam (including this cover sheet). 3. You can use an material ou brought: an book,
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationQUADRATIC AND CONVEX MINIMAX CLASSIFICATION PROBLEMS
Journal of the Operations Research Societ of Japan 008, Vol. 51, No., 191-01 QUADRATIC AND CONVEX MINIMAX CLASSIFICATION PROBLEMS Tomonari Kitahara Shinji Mizuno Kazuhide Nakata Toko Institute of Technolog
More informationSupport Vector Machines. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Support Vector Machines CAP 5610: Machine Learning Instructor: Guo-Jun QI 1 Linear Classifier Naive Bayes Assume each attribute is drawn from Gaussian distribution with the same variance Generative model:
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationSerial Parallel Multiplier Design in Quantum-dot Cellular Automata
Serial Parallel Multiplier Design in Quantum-dot Cellular Automata Heumpil Cho and Earl E. Swartzlander, Jr. Application Specific Processor Group Department of Electrical and Computer Engineering The University
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationMachine Learning Support Vector Machines. Prof. Matteo Matteucci
Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way
More informationOPTIMIZATION METHODS IN DEEP LEARNING
Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate
More informationClassification of Hand-Written Digits Using Scattering Convolutional Network
Mid-year Progress Report Classification of Hand-Written Digits Using Scattering Convolutional Network Dongmian Zou Advisor: Professor Radu Balan Co-Advisor: Dr. Maneesh Singh (SRI) Background Overview
More informationSupervised Learning Part I
Supervised Learning Part I http://www.lps.ens.fr/~nadal/cours/mva Jean-Pierre Nadal CNRS & EHESS Laboratoire de Physique Statistique (LPS, UMR 8550 CNRS - ENS UPMC Univ. Paris Diderot) Ecole Normale Supérieure
More informationUVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 9: Classifica8on with Support Vector Machine (cont.
UVA CS 4501-001 / 6501 007 Introduc8on to Machine Learning and Data Mining Lecture 9: Classifica8on with Support Vector Machine (cont.) Yanjun Qi / Jane University of Virginia Department of Computer Science
More informationLeast Mean Squares Regression
Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method
More informationLeast-Squares Independence Test
IEICE Transactions on Information and Sstems, vol.e94-d, no.6, pp.333-336,. Least-Squares Independence Test Masashi Sugiama Toko Institute of Technolog, Japan PRESTO, Japan Science and Technolog Agenc,
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationNeural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17
3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural
More informationMultiscale methods for neural image processing. Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet
Multiscale methods for neural image processing Sohil Shah, Pallabi Ghosh, Larry S. Davis and Tom Goldstein Hao Li, Soham De, Zheng Xu, Hanan Samet A TALK IN TWO ACTS Part I: Stacked U-nets The globalization
More informationRepresentational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber
Representational Power of Restricted Boltzmann Machines and Deep Belief Networks Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Introduction Representational abilities of functions with some
More informationLarge-Scale Feature Learning with Spike-and-Slab Sparse Coding
Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab
More informationRegular Physics - Notes Ch. 1
Regular Phsics - Notes Ch. 1 What is Phsics? the stud of matter and energ and their relationships; the stud of the basic phsical laws of nature which are often stated in simple mathematical equations.
More informationStrain Transformation and Rosette Gage Theory
Strain Transformation and Rosette Gage Theor It is often desired to measure the full state of strain on the surface of a part, that is to measure not onl the two etensional strains, and, but also the shear
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationLogistic Regression & Neural Networks
Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability
More informationClassification of Higgs Boson Tau-Tau decays using GPU accelerated Neural Networks
Classification of Higgs Boson Tau-Tau decays using GPU accelerated Neural Networks Mohit Shridhar Stanford University mohits@stanford.edu, mohit@u.nus.edu Abstract In particle physics, Higgs Boson to tau-tau
More informationConvolutional Neural Networks II. Slides from Dr. Vlad Morariu
Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate
More informationReducing Delay Uncertainty in Deeply Scaled Integrated Circuits Using Interdependent Timing Constraints
Reducing Delay Uncertainty in Deeply Scaled Integrated Circuits Using Interdependent Timing Constraints Emre Salman and Eby G. Friedman Department of Electrical and Computer Engineering University of Rochester
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationComputation of Csiszár s Mutual Information of Order α
Computation of Csiszár s Mutual Information of Order Damianos Karakos, Sanjeev Khudanpur and Care E. Priebe Department of Electrical and Computer Engineering and Center for Language and Speech Processing
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationConvolutional Neural Networks
Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»
More informationFast Fir Algorithm Based Area- Efficient Parallel Fir Digital Filter Structures
Fast Fir Algorithm Based Area- Efficient Parallel Fir Digital Filter Structures Ms. P.THENMOZHI 1, Ms. C.THAMILARASI 2 and Mr. V.VENGATESHWARAN 3 Assistant Professor, Dept. of ECE, J.K.K.College of Technology,
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationRegML 2018 Class 8 Deep learning
RegML 2018 Class 8 Deep learning Lorenzo Rosasco UNIGE-MIT-IIT June 18, 2018 Supervised vs unsupervised learning? So far we have been thinking of learning schemes made in two steps f(x) = w, Φ(x) F, x
More informationWHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,
WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu
More informationCSCI567 Machine Learning (Fall 2018)
CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due
More informationTHE UNIVERSITY OF MICHIGAN. Timing Analysis of Domino Logic
Timing Analsis of Domino Logic b David Van Campenhout, Trevor Mudge and Karem Sakallah CSE-T-296-96 THE UNIVESITY O MICHIGAN Computer Science and Engineering Division Department of Electrical Engineering
More informationNeural networks and optimization
Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationAn Information Theory For Preferences
An Information Theor For Preferences Ali E. Abbas Department of Management Science and Engineering, Stanford Universit, Stanford, Ca, 94305 Abstract. Recent literature in the last Maimum Entrop workshop
More informationNeural Networks: Backpropagation
Neural Networks: Backpropagation Machine Learning Fall 2017 Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
More informationSparse Regularized Deep Neural Networks For Efficient Embedded Learning
Sparse Regularized Deep Neural Networks For Efficient Embedded Learning Anonymous authors Paper under double-blind review Abstract Deep learning is becoming more widespread in its application due to its
More informationStability of Limit Cycles in Hybrid Systems
Proceedings of the 34th Hawaii International Conference on Sstem Sciences - 21 Stabilit of Limit Ccles in Hbrid Sstems Ian A. Hiskens Department of Electrical and Computer Engineering Universit of Illinois
More informationChange point method: an exact line search method for SVMs
Erasmus University Rotterdam Bachelor Thesis Econometrics & Operations Research Change point method: an exact line search method for SVMs Author: Yegor Troyan Student number: 386332 Supervisor: Dr. P.J.F.
More informationSimulation of Acoustic and Vibro-Acoustic Problems in LS-DYNA using Boundary Element Method
10 th International LS-DYNA Users Conference Simulation Technolog (2) Simulation of Acoustic and Vibro-Acoustic Problems in LS-DYNA using Boundar Element Method Yun Huang Livermore Software Technolog Corporation
More informationABC-LogitBoost for Multi-Class Classification
Ping Li, Cornell University ABC-Boost BTRY 6520 Fall 2012 1 ABC-LogitBoost for Multi-Class Classification Ping Li Department of Statistical Science Cornell University 2 4 6 8 10 12 14 16 2 4 6 8 10 12
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationNormalization Techniques in Training of Deep Neural Networks
Normalization Techniques in Training of Deep Neural Networks Lei Huang ( 黄雷 ) State Key Laboratory of Software Development Environment, Beihang University Mail:huanglei@nlsde.buaa.edu.cn August 17 th,
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationFAST FIR ALGORITHM BASED AREA-EFFICIENT PARALLEL FIR DIGITAL FILTER STRUCTURES
FAST FIR ALGORITHM BASED AREA-EFFICIENT PARALLEL FIR DIGITAL FILTER STRUCTURES R.P.MEENAAKSHI SUNDHARI 1, Dr.R.ANITA 2 1 Department of ECE, Sasurie College of Engineering, Vijayamangalam, Tamilnadu, India.
More informationESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.
On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained
More informationIdentification of Nonlinear Dynamic Systems with Multiple Inputs and Single Output using discrete-time Volterra Type Equations
Identification of Nonlinear Dnamic Sstems with Multiple Inputs and Single Output using discrete-time Volterra Tpe Equations Thomas Treichl, Stefan Hofmann, Dierk Schröder Institute for Electrical Drive
More informationAn overview of deep learning methods for genomics
An overview of deep learning methods for genomics Matthew Ploenzke STAT115/215/BIO/BIST282 Harvard University April 19, 218 1 Snapshot 1. Brief introduction to convolutional neural networks What is deep
More informationDEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo
More informationVLSI Signal Processing
VLSI Signal Processing Lecture 1 Pipelining & Retiming ADSP Lecture1 - Pipelining & Retiming (cwliu@twins.ee.nctu.edu.tw) 1-1 Introduction DSP System Real time requirement Data driven synchronized by data
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationSupport Vector Machine (continued)
Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need
More informationOnline Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016
Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationTunable Floating-Point for Energy Efficient Accelerators
Tunable Floating-Point for Energy Efficient Accelerators Alberto Nannarelli DTU Compute, Technical University of Denmark 25 th IEEE Symposium on Computer Arithmetic A. Nannarelli (DTU Compute) Tunable
More informationSupervised Learning. George Konidaris
Supervised Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom Mitchell,
More information