TIME: A Training-in-memory Architecture for RRAM-based Deep Neural Networks


Ming Cheng, Lixue Xia, Student Member, IEEE, Zhenhua Zhu, Yi Cai, Yuan Xie, Fellow, IEEE, Yu Wang, Senior Member, IEEE, Huazhong Yang, Senior Member, IEEE

Abstract—The training of a neural network (NN) is usually time-consuming and resource intensive. The emerging metal-oxide resistive random-access memory (RRAM) device has shown potential for NN computation: the RRAM crossbar structure and its multi-bit characteristic can perform the matrix-vector product, the most common operation in NN, with high energy efficiency. However, two challenges remain in realizing NN training on RRAM. First, current RRAM-based architectures only support the inference phase of training and cannot perform the backpropagation (BP) and weight update phases. Second, training an NN requires an enormous number of iterations that constantly update the weights until convergence, and the weight update leads to large energy consumption because of the non-ideal factors of RRAM. In this work, we propose TIME (Training-In-MEmory), an RRAM-based architecture with the peripheral circuit design needed to enable NN training on RRAM. TIME supports BP and weight update while maximizing the reuse of the peripheral circuits used for inference on RRAM. Meanwhile, a set of optimization strategies targeting the non-ideal factors is designed to reduce the cost of tuning RRAM. We explore the performance of both supervised learning (SL) and deep reinforcement learning (DRL) on TIME, and a specific mapping method for DRL is introduced to further improve energy efficiency. Simulation results show that, in SL, TIME achieves 5.3x higher energy efficiency on average than DaDianNao, an application-specific integrated circuit (ASIC) in CMOS technology [1]. In DRL, TIME achieves on average 126x higher energy efficiency than GPU. If the cost of tuning RRAM can be further reduced, TIME has the potential to boost energy efficiency by two orders of magnitude compared with the ASIC.

Index Terms—RRAM, neural network, training neural network, deep reinforcement learning

This work was supported by the National Natural Science Foundation of China (No. , ), the National Key R&D Program of China 2017YFA , the Joint fund of Equipment pre-research and Ministry of Education (No. 6141A ), and the Beijing National Research Center for Information Science and Technology (BNRist). M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Wang, and H. Yang are with the Department of Electronic Engineering, Tsinghua University, Beijing, China, and the Beijing National Research Center for Information Science and Technology (BNRist) (e-mail: yu-wang@tsinghua.edu.cn). Y. Xie is with the Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA 93106, USA (e-mail: yuanxie@gmail.com). Manuscript received November 2nd, 2017; revised February 4th, 2018; accepted March 6th, 2018; date of current version March 23rd, 2018. Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

TABLE I: Neural network parameters and training iterations
Network Name      Parameter number    Training Iterations
VGG16 [7]         138M                [7]
ResNet-50 [6]     25.6M               [6]
ResNet-152 [6]    60.2M               [6]

I. INTRODUCTION

Recently, neural networks (NN) and deep learning have constantly made breakthroughs in many fields [2].
Supervised learning (SL) and deep reinforcement learning (DRL) based on NN have been shown to be powerful tools for classification [3] and for learning policies [4]. They even outperform human experts in some fields, such as the classification of objects in ImageNet [5] [6] and playing Atari games [2]. Data movement between the processing units (PUs) and the memory has become the most critical performance bottleneck in various computing systems [8]. For example, the data transfer between CPUs and off-chip memory consumes two orders of magnitude more energy than a floating-point operation [9]. State-of-the-art NNs require a large memory capacity to store the weights and the intermediate data between layers; the number of parameters in VGG16 is more than 138M [10]. To address this challenge, some application-specific integrated circuits (ASICs) adopt large on-chip memory to store the synaptic weights. For example, DaDianNao [1] places large on-chip eDRAMs, 36MB per node (sixteen 2MB eDRAMs and one 4MB central eDRAM), for both high bandwidth and data locality. However, the capacity of a single node is not enough for the model sizes in TABLE I. The emerging metal-oxide resistive random-access memory (RRAM) device provides a promising alternative to boost the performance [11] [12]. RRAM devices can achieve a large number of signal connections within a small footprint thanks to their high integration density [13]. More importantly, RRAM devices can be used to build a crossbar structure (shown in Fig. 1 (c)) to perform the matrix-vector product. Recent works have demonstrated the potential of RRAM to realize NN operations with 100x to 1000x energy efficiency improvement compared with CPU or GPU solutions [8], [11], [14], [15]. However, current RRAM-based architectures, such as PRIME [8] and ISAAC [16], cannot support NN training efficiently. As Fig. 1 shows, training an NN contains three phases: inference, backpropagation (BP), and weight update. These three phases are iterated many times until the NN converges [17].

In addition, the basic operations of inference, which are supported by PRIME and ISAAC, are different from those of BP and weight update for both SL and DRL (shown in Fig. 1 (c)). Thus, an architecture that supports training is also different from one that only supports inference. Another challenge lies in the huge energy required to tune RRAM accurately during training. There are two reasons for the high tuning energy. The first is that an enormous number of training iterations is required and the weights are updated frequently, so there is a large number of tuning operations, each of which changes the resistance of an RRAM cell to a target resistance. According to TABLE I, all of the models have more than 1M parameters, and training them to convergence on GPU requires a large number of iterations, so all of these parameters need to be updated that many times. Besides, tuning RRAM in the weight update phase is costly because of the non-ideal factors. The non-ideal factors in Fig. 1 (d) include: (1) the complex transformation between the tuning voltage and the change of conductance of RRAM, (2) the asymmetry between the SET and RESET operations, (3) resistance saturation, (4) variability when tuning RRAM, (5) the endurance of RRAM, and (6) the footprint mismatch between RRAM and the CMOS-based peripheral circuits. These factors increase the number of tuning operations in the weight update phase and thus enlarge the energy consumption of training.

In this paper, we propose an architecture, TIME, to implement NN training on RRAM. Besides, for all of the non-ideal factors of RRAM, we propose corresponding methods and hardware implementations to reduce the cost they cause.

1) We propose TIME (Training-In-MEmory), with modifications of the peripheral circuits in PRIME, to further support the basic operations of backpropagation and weight update. Simulation results show that, for CNNs in supervised learning, TIME improves energy efficiency by 19.6x on average and speed by 1.3x on average compared with DaDianNao, an ASIC solution based on CMOS [1]. For multi-layer perceptrons (MLP) in supervised learning, TIME improves energy efficiency by 6.74x on average and speed by 1.5x on average compared with DaDianNao.

2) A lookup table policy is proposed to reduce the cost caused by the complex transformation between the tuning voltage pulse and the change of conductance of RRAM. The results show that this policy improves energy efficiency by up to 3.79x.

3) We propose a one-direction writing scheme and the corresponding circuit to overcome the asymmetry of writing RRAM. Besides, a smart refresh strategy is proposed to prevent resistance saturation when tuning RRAM. The energy efficiency of TIME is enhanced by up to 2x and the speed of TIME is boosted by up to 1.3x compared with the method in [18].

4) We propose a variability-free tuning scheme to reduce the frequency of writing and reading RRAM. The results show that the variability-free tuning scheme improves energy efficiency by up to 2.29x and boosts speed by up to 1.26x compared with [19]. The energy efficiency can be improved by two orders of magnitude compared with the ASIC solution [1] if the energy of tuning RRAM can be reduced.

5) A specific mapping method is proposed to boost energy efficiency in the training of DRL by avoiding the copy operation. TIME improves energy efficiency by 126x compared with the GPU, and the specific mapping method provides 1.2x higher energy efficiency than the direct mapping method.

II. PRELIMINARY

A. RRAM Device Basics
An RRAM device is a passive two-port element with variable resistance states [22]. The RRAM crossbar consists of many RRAM devices [22]. The relationship between the input voltage vector V and the output current I can be represented as Eq. (1) [22], as shown in Fig. 1 (c). The crossbar can therefore perform the matrix-vector product, which occupies a major portion of the computation of NN.

i_{out,k} = \sum_{j=1}^{N} g_{k,j} v_{in,j}    (1)

where v_{in,j} is the j-th element of the input voltage vector V, g_{k,j} is the conductance of the RRAM device in the k-th column and the j-th row of the crossbar, and i_{out,k} is the k-th element of the output current I. Two crossbars are required to represent a matrix with both positive and negative weights [23] because g_{k,j} can only be positive. Theoretically, the resistance of an RRAM device can be tuned to an arbitrary state by changing the tunneling gap to a specific length [24]. Recent work has shown that an RRAM device in a crossbar can represent 5-bit data [25]. Owing to the high density of the RRAM crossbar, RRAM has been considered as a next-generation main memory that could replace DRAM [26]. The read latency of RRAM is close to that of DRAM, and with optimization the write latency overhead is less than 10% compared with DRAM [26].

B. Non-ideal Factors in RRAM Tuning

Tuning RRAM is costly because of its non-ideal factors. The complex transformation between the tuning voltage and the change of conductance is shown in Fig. 1 (d) and Eqs. (2)-(4):

V_w = \sinh^{-1}(\Delta g / v) \cdot V_0    (2)
v = v_0 \exp(-E_a / kT)    (3)
V_0 = kT / (\gamma \, a_0 q / L)    (4)

where \Delta g is the change of conductance of the RRAM cell and V_w is the voltage pulse value. E_a is the activation energy (0.6 eV), a_0 is the atom spacing (0.25 nm), L is the oxide thickness (12 nm), q is the electron charge, k is the Boltzmann constant, v_0 is a fitting parameter (10 nm/ns), and \gamma is the local field acceleration parameter [27]. To obtain the tuning voltage, the complex functions of Eqs. (2)-(4) need to be computed according to [27]. A processing unit can be used to obtain the result, but this causes a huge data transfer because of the huge number of parameters. Otherwise, a specific circuit can be adopted to compute the functions, but this leads to a large area overhead because of their complexity.
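As a rough numerical illustration of this transformation, the sketch below evaluates Eqs. (2)-(4) with the constants quoted above. The temperature, the value of \gamma, and the example \Delta g values are placeholder assumptions that the text does not specify; units follow the fit in [27], with kT expressed in eV so that the electron charge q drops out.

```python
import math

def tuning_voltage(delta_g, temperature=300.0, gamma=16.5):
    """Map a desired conductance change to a write-pulse voltage, Eqs. (2)-(4).
    gamma and temperature are placeholder assumptions, not values from the paper."""
    k_B = 8.617e-5                                   # Boltzmann constant, eV/K
    E_a, a_0, L, v_0 = 0.6, 0.25, 12.0, 10.0         # eV, nm, nm, nm/ns (quoted constants)
    v = v_0 * math.exp(-E_a / (k_B * temperature))   # Eq. (3)
    V_0 = k_B * temperature * L / (gamma * a_0)      # Eq. (4); kT in eV, so V_0 is in volts
    return math.asinh(delta_g / v) * V_0             # Eq. (2)

if __name__ == "__main__":
    for dg in (1e-6, 1e-5, 1e-4):                    # example conductance changes (illustrative)
        print(f"delta_g = {dg:.0e} -> V_w = {tuning_voltage(dg):.3f} V")
```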

Fig. 1: The training flow of a neural network; the basic operations of supervised learning; the basic operations of deep reinforcement learning (M-V product denotes the matrix-vector product); (c) the RRAM-based implementation of the M-V product and weight update; (d) challenges of tuning RRAM: (1) the complex transformation function [20], (2) the asymmetry of SET and RESET [21], (3) resistance saturation [18], (4) the variability of tuning [21].

Asymmetry of SET and RESET is shown in Fig. 1 (d). Tuning RRAM includes two operations: SET and RESET. The SET operation tunes the resistance of the RRAM from the high resistance state (HRS) to the low resistance state (LRS); the RESET operation is the reverse. According to [28], the SET operation changes the resistance abruptly, while the RESET operation can tune it gradually.
Resistance saturation means that the resistance of the RRAM may reach its boundary and cannot be changed any further [18], as shown in Fig. 1 (d). It limits the parameter values of the NN to a range and thus damages the performance of the NN.
Variability of tuning is shown in Fig. 1 (d): both the SET and the RESET operation exhibit variability [20]. Tuning RRAM to a target resistance value therefore depends on iterative RESET and SET operations [19], but this obviously increases the cost of tuning.
Endurance can also be a challenge for training NN, because training requires a large number of weight update operations, so the RRAM needs to be tuned many times. However, the endurance of RRAM, about 10^{12} cycles according to [29] [30], is lower than that of DRAM and SRAM, up to about 10^{16} cycles according to [31], and it might become a challenge for training NN on RRAM.
Footprint mismatch between the RRAM array and its CMOS peripheral circuits might lead to inaccurate input voltages, because the RRAM device is very small while the CMOS peripheral circuits are large. The footprint mismatch might decrease the performance of the NN.

C. Supervised Learning

The training of SL is shown in Fig. 1 and contains three phases: inference, BP, and update [32].
Inference of SL: Fig. 1 shows that the inference [33] includes three operations: matrix-matrix multiplication, which can be composed of many matrix-vector products; the nonlinear function, represented as Eq. (5); and max pooling, represented as Eq. (6).

y = f(Wx + b)    (5)

where x and y are the input and the output of the NN, respectively, W is the weight matrix of the NN, and b is the bias vector [19].
x' = \max(x_1, x_2, \dots, x_n)    (6)

where x_1, x_2, \dots, x_n are the inputs of the max pooling function and x' is its output; the number of inputs n is usually four.
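For reference, a minimal NumPy sketch of these inference operations is given below: the matrix-vector product of Eq. (5) computed as the difference between a positive and a negative crossbar (cf. Eq. (1)), the sigmoid nonlinearity, and 2x2 max pooling as in Eq. (6). The layer sizes and the simple positive/negative weight split are illustrative assumptions, not the mapping used by TIME itself.

```python
import numpy as np

def split_weights(W):
    """Represent a signed weight matrix by two non-negative conductance matrices,
    so that W = G_pos - G_neg (one pair of crossbars per weight matrix)."""
    return np.maximum(W, 0.0), np.maximum(-W, 0.0)

def crossbar_mv(G_pos, G_neg, v_in):
    """Eq. (1) per crossbar; the column multiplexer subtracts the two currents."""
    return G_pos.T @ v_in - G_neg.T @ v_in

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_pool_2x2(fmap):
    """Eq. (6) with n = 4: maximum over non-overlapping 2x2 windows."""
    h, w = fmap.shape
    trimmed = fmap[: h - h % 2, : w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(8, 4))      # toy layer: 8 inputs, 4 outputs
b = rng.normal(scale=0.1, size=4)
x = rng.normal(size=8)

G_pos, G_neg = split_weights(W)
y = sigmoid(crossbar_mv(G_pos, G_neg, x) + b)   # Eq. (5): y = f(Wx + b)
pooled = max_pool_2x2(rng.normal(size=(4, 4)))  # Eq. (6) on a toy feature map
print(y, pooled, sep="\n")
```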

Fig. 2: The process of reinforcement learning. It can be represented as a Markov Decision Process with cells (s_t, a_t, s_{t+1}, r_t).

Fig. 3: The training process of deep reinforcement learning. The copy operation is implemented by writing the resistance of the crossbars of network A into those of the crossbars of network B.

Backpropagation of SL: Fig. 1 shows the basic operations of BP; it includes three basic operations: matrix-matrix multiplication, which can be composed of many matrix-vector products, multiplication, and subtraction [33]. Thus, the BP process is expressed in sequence as Eqs. (7), (8), and (9):

\Delta y_{(l)} = y^*_{(l)} - y_{(l)}    (7)
\delta_{(l)} = \Delta y_{(l)} \odot y'_{(l)}    (8)
\Delta y_{(l-1)} = W^T_{(l)} \delta_{(l)}    (9)

where y_{(l)}, y^*_{(l)}, and W_{(l)} represent the output of the NN, the target output, and the weight matrix of the l-th layer, respectively. \Delta y_{(l)} is the difference between the target output y^*_{(l)} and the output y_{(l)} in the l-th layer, y'_{(l)} is the derivative of y_{(l)}, and \delta_{(l)} is the error of BP. Eq. (7) is performed in the final layer; then Eq. (8) and Eq. (9) are applied in sequence, layer by layer.
Update of SL: The update includes three operations [33]: reading the weights, changing the weights, and a matrix-matrix multiplication, which can be composed of many matrix-vector products. The matrix-matrix multiplication is expressed as Eq. (10):

\Delta W^T_{(l)} = (x_{(l)} \delta^T_{(l)}) \cdot \alpha    (10)

where \Delta W_{(l)} is the gradient of the weight matrix in the l-th layer, x_{(l)} is the input of the l-th layer, and \alpha is the learning rate, a hyper-parameter determined before training. After Eq. (8) is computed, we can compute Eq. (10) and use \Delta W_{(l)} to change the weights of the NN.
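The following NumPy sketch ties Eqs. (5) and (7)-(10) together as a single SL training step for a small stack of sigmoid layers; it is only the software reference for what TIME maps onto crossbars. The layer sizes, the use of the sigmoid derivative y(1-y), and the bias update are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(weights, biases, x, y_star, alpha=0.1):
    """One SL iteration following Eqs. (5), (7)-(10); weights[l] has shape (n_in, n_out)."""
    activations = [x]                                    # inference, Eq. (5); keep every
    for W, b in zip(weights, biases):                    # layer input, since Eq. (10) reuses it
        activations.append(sigmoid(activations[-1] @ W + b))

    delta_y = y_star - activations[-1]                   # Eq. (7), final-layer diff
    for l in reversed(range(len(weights))):
        y_l = activations[l + 1]
        delta = delta_y * y_l * (1.0 - y_l)              # Eq. (8), sigmoid derivative
        delta_y = delta @ weights[l].T                   # Eq. (9), error for the previous layer
        grad_W = np.outer(activations[l], delta) * alpha # Eq. (10)
        weights[l] += grad_W                             # weight update (mapped to RRAM tuning)
        biases[l] += alpha * delta
    return weights, biases

# Toy MNIST-like shapes (illustrative): 784 -> 64 -> 10
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(784, 64)), rng.normal(scale=0.1, size=(64, 10))]
bs = [np.zeros(64), np.zeros(10)]
train_step(Ws, bs, rng.random(784), np.eye(10)[3])
```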
D. Deep Reinforcement Learning

The basic process of reinforcement learning is shown in Fig. 2 and can be regarded as a Markov Decision Process [2]. At timestep t, the environment sends its state s_t to the agent. The agent takes an action a_t, which changes the state of the environment from s_t to s_{t+1} and produces a reward signal r_t. The agent's goal is to maximize the sum of rewards, and this goal directs the agent's choice of action. Thus, the process of reinforcement learning is a Markov Decision Process, as shown in Fig. 2 [34]. When a CNN is adopted as the agent, the input of the CNN is the state of the environment, such as an image of the environment. The j-th output of the CNN, Q(s, a_j), is the predicted sum of rewards from now into the future if action a_j is taken in the current state s. The action with the maximum output is always adopted to change the environment [34]. Fig. 1 shows that DRL also consists of three phases: inference, BP, and update. The basic operations of these phases are the same as those of SL except for the copy operation in the update phase [2].
Inference of DRL: Fig. 1 shows five basic operations in the inference: matrix-matrix multiplication, which can be composed of many matrix-vector products, the nonlinear function, the max function, write, and read. According to [34], there are two networks in DRL, which we call network A and network B, as shown in Fig. 3. In the inference of DRL, both network A and network B perform the same inference as in SL. s and s' are the inputs of A and B, respectively, and a is the action we decide to take. Q and Q' are the outputs of A and B, respectively.
Backpropagation of DRL: Depending on the outputs of network A and network B, backpropagation is performed according to the cost function in Eq. (11):

\Delta y = r + \gamma \max_{a'} Q'(s', a'; W_B) - Q(s, a; W_A)    (11)

where W_A and W_B are the weights of network A and network B, \Delta y is as in Eq. (7), and \gamma is a fixed parameter we set.
Update of DRL: The weight update phase is the same as that in SL except for the copy operation. Fig. 3 shows the copy operation in the update of DRL, which is performed after operating BP on network A 64 times, an empirical value that gives the best performance according to [2]. The copy operation copies the weights of network A into those of network B, which is expressed as W_B := W_A.
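A compact sketch of this procedure is given below. The Q-function, its shapes, the discount factor, and the learning rate are placeholders (the paper fixes \gamma but does not state its value); only the target of Eq. (11) and the copy W_B := W_A every 64 updates follow the text.

```python
import numpy as np

COPY_INTERVAL = 64        # empirical value from [2]
GAMMA = 0.99              # discount factor (placeholder; the paper does not give its value)

def q_forward(weights, state):
    """Stand-in for network inference on TIME; returns Q(state, a) for every action a."""
    return state @ weights                      # illustrative linear Q-function

def drl_step(w_A, w_B, transition, step, alpha=0.01):
    s, a, r, s_next, done = transition
    q_next = q_forward(w_B, s_next)             # inference of network B
    target = r + (0.0 if done else GAMMA * np.max(q_next))
    delta_y = target - q_forward(w_A, s)[a]     # Eq. (11)
    w_A = w_A.copy()                            # BP and update of network A; here a
    w_A[:, a] += alpha * delta_y * s            # one-layer Q-function for brevity
    if (step + 1) % COPY_INTERVAL == 0:
        w_B = w_A.copy()                        # copy operation: W_B := W_A
    return w_A, w_B

rng = np.random.default_rng(0)
n_state, n_actions = 8, 4
w_A = rng.normal(size=(n_state, n_actions)); w_B = w_A.copy()
for step in range(128):
    s, s_next = rng.random(n_state), rng.random(n_state)
    transition = (s, int(rng.integers(n_actions)), float(rng.random()), s_next, False)
    w_A, w_B = drl_step(w_A, w_B, transition, step)
```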

III. TIME ARCHITECTURE

TIME, proposed for Training-In-MEmory, supports not only the inference but also the BP and the weight update of the training process, for both SL and DRL. TIME consists of three kinds of subarrays: full-function (FF) subarrays, buffer subarrays, and memory (Mem) subarrays. FF subarrays have both data storage and computation capabilities, including inference, BP, and weight update. Mem subarrays only store data. The buffer subarrays are the Mem subarrays closest to the FF subarrays, which can make use of the high bandwidth to store the intermediate data.

Fig. 4: The TIME architecture. Left: bank structure. Right: functional blocks modified/added in TIME. (a) wordline driver with multi-level voltage sources; (b) column multiplexer with analog subtraction and sigmoid circuit; (c) preprocess for implementing the derivative of the activation; (d) the final output diff for calculating the diff in the final layer; (e) connection between the FF and the buffer subarray; (f) TIME controller; (g) reconfigurable SA with counters for multi-level outputs, and added ReLU, max function, and update-weight circuit. (WDD: wordline driver and decoder; SA: sense amplifier; SW: switches; AMP: amplifier; Vol: voltage sources.)

A. Design of Peripheral Circuit

To enable training NN in the FF subarray, as Fig. 4 shows, we add the preprocess and the final output diff to the FF subarray, and we modify the wordline decoder and driver (WDD), the column multiplexers (col mux), the sense amplifiers (SA), and the controllers.
WDD: Fig. 4 shows that the WDD prepares the input data of the NN and provides the voltages for writing and reading. The WDD needs to accept data from either the wordline (WL) or the bitline (BL) of the crossbar so that it can be reused for inference and BP. Thus, to isolate BP from inference, two transmission gates (TG) are connected to the wordline and the bitline, respectively. With this low-cost design, inference and BP reuse the same WDD. Other parts of the WDD are the same as in PRIME. The multi-level voltage sources produce the multi-level voltages. The switching circuit (SW) ensures that all the input data can be fed into the WLs of the crossbar simultaneously. Besides, the amplifier drives the WL and the write driver modifies the resistance value of the RRAM device.
Col mux: The column multiplexer is shown in Fig. 4. Similar to the WDD, two transmission gates are added to its input to isolate the inference and the BP of the NN so that it can be reused. Other parts are the same as in PRIME. The sigmoid circuit realizes the sigmoid function of the NN, and the subtraction between two crossbars computes the matrix-vector product, because a pair of crossbars represents the positive and negative weights. With the MUX, we can choose the output among the sigmoid, the subtraction, and the crossbar itself, which outputs the resistance value of the RRAM.
SA: SA is short for the sense amplifier. The SA implements the analog-digital conversion, precision control, the ReLU function, the max function, and the selection of the crossbar in weight update, as shown in Fig. 4 (g). Both Eq. (11) of DRL and Eq. (6) of SL implement a max function. The pooling in SL, which is calculated in Eq. (6), usually finds the maximum of four values, but DRL usually finds the maximum of more than four values, such as 18 values in [2].
Therefore, we add more registers and a corresponding MUX to switch the max functions in the pooling and max function in DRL. This change only appears in one because the max function of DRL is only computed in the final layer of NN. The updated weight of is designed for choosing which crossbar needs to be updated according to the sign of the gradient of weight W. The details of this circuit are discussed in Section 4. The sensing circuit is for realizing the analog-digital conversion and the precision control is for overcoming the difficulty of precision. These two parts are the same as PRIME [8]. Final Output Diff Fig. 4 (d) shows the design of the final output diff and it is used for computing subtractions of Eq. (7) and Eq. (11). Therefore, a subtractor is used for the computing the final diff between the target output and the output of NN. Besides, there exists an adder to calculate the addition between Q and r in Eq. (11) of DRL, which is for producing the target output in DRL. What we should note is that the final output diff only exists once so it causes very low cost of energy and area. Compared with PRIME, the final output diff is a new circuit. As Fig. 4 (c) shows, the preprocess is used to compute the derivative in Eq. (8). According to Eq. (8) and Eq. (10), the W depends on δ, which is the result of the multiplication between y and y. For generating y of sigmoid and ReLU, the preprocess is designed. According to [33], the comparator (Comp) can calculate the derivative of the ReLU function. As for sigmoid function [33], the subtractor and multiplier can calculate the derivative of the sigmoid function. Besides, we reuse the multiplier in Fig. 4 (c) to implement the multiplication of y and y. is a

6 6 new circuit compared with PRIME. Controller can be seen in Fig. 4 (f). It is mainly used to provide the control signals of FF subarray. The key function of the controller is to configure the FF subarrays among different modes. Compared with PRIME of two modes, which are memory and inference, the controller in TIME adds three modes: BP, update, DRL. Connection As Fig. 4 (e) shows, private connection between FF subarrays and the buffer subarrays, the intermediate data produced by FF subarrays can be stored in buffer subarrays with a high bandwidth and low cost of data movements. B. Training Implementation of SL 1) Inference: Fig. 5 shows the first layer of NN. At first, the blue line and arrow represent that training data (x, y ) is fed into the Mem subarrays of TIME. Then, (x, y ) is sent from Mem subarrays into the buffer subarrays. Next, the input x of the first layer is transferred into the FF subarrays through buffer subarrays. Then, x is sent into crossbar to implement the matrix-vector product in Eq. (5). At the same time, as the pink line shows, x is written into another crossbar ahead of implementing the operations in Eq. (10). Other layers are in similar condition. The input is sent into two crossbars for implementing inference of NN and it is stored into another FF subarrays at the same time. These two parallel operations can accelerate the training. Then, the results of two crossbars implement subtraction and nonlinear function in. As the brown line in Fig. 5 shows, the data are divided into two parts after Eq. (5). One part is sent into crossbars of the next layer of NN to continue the inference of NN, while the other part is fed into preprocess to compute y in Eq. (8) before the results come out. These data flows also increase the speed of training. 2) Backpropagation: Fig. 5 shows the final layer of NN. At first, the final output diff calculates the subtraction in Eq. (7). Then, the result y is sent into the preprocess part to perform the multiplication in Eq. (8) as the yellow line shows. After that, the results are sent to two places at the same time. One is sent to the crossbars that store weights of this layer to calculate Eq. (9) as the green line shows, and the other is to the crossbars that stores the input of layer for calculating Eq. (10) of the weight update as the emerald green line shows. Thus, we parallel the BP and the weight update of NN to boost the speed. 3) Update: Fig. 5 (c) shows the weight update of training. After realizing the operation of pink line in Fig. 5 and the operation of emerald green line in Fig. 5, the Eq. (10) can be implemented as shown in the emerald green line in Fig. 5 (c). Then, the result is transferred into the corresponding crossbar to tune the. 4) Parallel in Training Process: There exist two parallel operations in training process. Parallel In Inference : The time of inference can be composed of matrix-vector multiplication on array, called as Tinf mv, and storing the results of matrix-vector multiplication on for computing Eq. (10), called as Tinf s. T inf mv is proportional to the batch size of the network, which is smaller than 256. The batch size is 256 at most and TIME can operate one batch in each cycle. Thus, 256 cycles are always enough to operate matrix-vector multiplication of a single layer in inference. Tinf s is proportional to the size of x in Eq. (10). We store data of different batches in different rows. Besides, the length of the data is always larger than. 
Thus, storing data on RRAM always consumes more cycles than performing the matrix-vector product. These two operations can be performed in parallel, and T^{s}_{inf} is always larger than T^{mv}_{inf}. Thus, the time of the inference can be represented as max(T^{mv}_{inf}, T^{s}_{inf}) = T^{s}_{inf}.
Parallel Between BP and Weight Update: The time of backpropagation mainly comes from the matrix-vector multiplication on the RRAM array in each layer, denoted T^{mv}_{BP}. The time of weight update includes the matrix-vector multiplication on the RRAM array in each layer, denoted T^{mv}_{up}, and the time of changing the weights on RRAM, denoted T^{c}_{up}; thus, the time of weight update is T^{mv}_{up} + T^{c}_{up}. The backpropagation of the j-th layer of the neural network can be performed in parallel with the weight update of the (j+1)-th and subsequent layers. Like T^{mv}_{inf}, T^{mv}_{BP} is proportional to the batch size of the network, which is 256 at most. T^{c}_{up} is proportional to the length of \delta in Eq. (10) for every batch and is always larger than the maximum batch size of 256, so T^{c}_{up} is always larger than T^{mv}_{BP}. Therefore, the time of backpropagation and weight update together is max(T^{c}_{up} + T^{mv}_{up}, T^{mv}_{BP}) = T^{c}_{up} + T^{mv}_{up}.
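The small sketch below restates this per-layer latency model; the cycle counts passed in are symbolic examples, not measured values from TIME.

```python
def inference_time(t_inf_mv, t_inf_s):
    """Matrix-vector product and input storage overlap; the storage term dominates."""
    return max(t_inf_mv, t_inf_s)

def bp_update_time(t_bp_mv, t_up_mv, t_up_c):
    """BP of layer j overlaps the weight update of deeper layers,
    so the tuning-dominated update path sets the latency."""
    return max(t_up_c + t_up_mv, t_bp_mv)

# Symbolic cycle counts for one layer (batch size 256 at most)
print(inference_time(t_inf_mv=256, t_inf_s=1024))           # -> 1024
print(bp_update_time(t_bp_mv=256, t_up_mv=256, t_up_c=2048)) # -> 2304
```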

Fig. 5: (a) Inference of NN. The blue line represents the flow of training data, the brown line represents the matrix-vector multiplication on RRAM, and the pink line represents storing the input of every layer of NN; (b) Backpropagation of NN. The yellow line represents the data flow from the final output diff and the buffer subarray to the preprocess, the green line represents the backward matrix-vector multiplication, and the emerald green line represents preparing for the update; (c) Weight update of NN. The deep green line represents the update calculation and the tuning of the resistance value of RRAM.

Fig. 6: Lookup table policy on RRAM. RRAM cells store the tuning voltage values to avoid computing the complex transformation functions.

IV. NON-IDEAL FACTORS OPTIMIZATION

In this section, we focus on the non-ideal factors of RRAM when training NN, as shown in Fig. 1 (d). To tune the RRAM cells in the training phase, we need to obtain the tuning voltage pulse according to the desired change of conductance. The first challenge is the complex function between the tuning voltage pulse value and the change of resistance [20]. Then, when this voltage pulse is used to change the resistance of the RRAM, we face the asymmetry between the SET and RESET operations [23], which makes the SET operation risky during tuning. Besides, we face resistance saturation [18] when training NN on RRAM, which means the resistance has been tuned to its boundary and cannot be further increased or decreased. Next, there is variability [20] in the resistance after a voltage pulse is applied, which can drive the resistance to a wrong value. Finally, we discuss the endurance of RRAM for NN training and the footprint mismatch between RRAM and the CMOS peripheral circuits.

A. Lookup Table Policy

According to [20] and Eqs. (2)-(4), the transformation functions between the voltage pulse value and the change of conductance are very complex. Thus, the tuning voltage pulse is obtained by a processing unit or a specific circuit, as shown in Fig. 1. However, both methods have disadvantages. When a processing unit is adopted to compute the tuning voltage, the large number of parameters in TABLE I must be transferred between the memory and the processing unit, and the energy consumption caused by this data transfer is huge [9]. Besides, using a processing unit during training means that TIME cannot handle training by itself. If a specific circuit is used instead, the number of such circuits must be small; otherwise, the area overhead of TIME becomes very large because of the complexity of the circuit [20].
To reduce the cost of data transfer for computing the complex function in Fig. 1 (d), we propose a lookup table policy that obtains the tuning voltage using the RRAM crossbars in the FF subarrays. Exploiting the multi-level property of RRAM in the FF subarrays, we can store the precomputed tuning voltage pulses on the RRAM of the FF subarrays. The method is shown in Fig. 6: the crossbars in the FF subarrays store the voltage pulse values. If the precision of RRAM is P bits, the number of conductance levels is 2^P, so the number of distinct voltage pulse levels is up to \binom{2^P}{2} = 2^P(2^P-1)/2. We can store all the different voltage pulse values on a crossbar as long as the number of cells in the crossbar is larger than 2^P(2^P-1)/2, as shown in Fig. 6. For example, the precision of RRAM is usually 6 bits, so the number of voltage pulse levels is \binom{64}{2} = 2016. When the crossbar size is N x N, an RRAM crossbar can store N^2 voltage pulses of different levels; N is usually larger than 64, so the capacity of the crossbar is enough to store all the different voltage pulse levels. Thus, by storing the voltage pulse levels in the FF subarrays, the lookup table policy not only avoids the data transfer required to obtain the voltage pulse, but also makes full use of the circuit design of the FF subarrays.
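The sketch below builds such a table offline: one precomputed pulse per unordered pair of conductance levels of a 6-bit cell, giving the 2016 entries counted above. The conductance range assumes the Pt/TiO2-x/Pt device with Ron/Roff = 1k/20k used in Section VI, and the inlined tuning_voltage helper (Eqs. (2)-(4)) uses placeholder values of \gamma and temperature; in hardware the entries would be written once into a crossbar of the FF subarray, while a Python dict stands in here.

```python
import math

def tuning_voltage(delta_g, T=300.0, gamma=16.5):
    """Eqs. (2)-(4) with placeholder gamma and temperature (see Section II-B sketch)."""
    v = 10.0 * math.exp(-0.6 / (8.617e-5 * T))                      # Eq. (3)
    return math.asinh(delta_g / v) * 8.617e-5 * T * 12.0 / (gamma * 0.25)  # Eqs. (2), (4)

P = 6                                     # cell precision in bits -> 2**P = 64 levels
G_MIN, G_MAX = 1.0 / 20e3, 1.0 / 1e3      # conductance range for Ron/Roff = 1k/20k
step = (G_MAX - G_MIN) / (2 ** P - 1)
levels = [G_MIN + i * step for i in range(2 ** P)]

table = {}                                # one entry per unordered pair: C(64, 2) = 2016
for i in range(2 ** P):
    for j in range(i + 1, 2 ** P):
        table[(i, j)] = tuning_voltage(levels[j] - levels[i])   # computed once, offline

print(len(table))                         # 2016 <= N*N for any crossbar with N >= 64

def lookup(cur_level, tgt_level):
    """Read the stored pulse instead of evaluating Eqs. (2)-(4) at training time;
    only the magnitude is tabulated, the write direction is handled separately."""
    lo, hi = sorted((cur_level, tgt_level))
    return table[(lo, hi)]
```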
B. One-Direction Writing Scheme

According to [20], there is asymmetry between the RESET and SET operations, as shown in Fig. 1 (d). The SET operation is abrupt while the RESET process is gradual, so SETting the RRAM is risky and can make the weight update inaccurate. A weight of the NN is expressed by two conductances [23], as shown in Eq. (12):

I_o = (G^+ - G^-) \cdot V_i    (12)

where G^+ and G^- are the conductances of the positive and the negative crossbar, respectively, V_i is the input of the two crossbars, and I_o is their output. Thus, we use Eq. (13) to avoid the SET operation during writing: we only RESET the positive RRAM when reducing a weight, and only RESET the negative RRAM when increasing it. This method prevents the risk of the SET operation.

if \Delta W < 0: only RESET the positive crossbar;
else if \Delta W > 0: only RESET the negative crossbar;
else: continue.    (13)

Besides, we propose a hardware design of this scheme, called the update-weight circuit, shown in Fig. 4 (g), so that we only RESET the RRAM and avoid the SET operation in the computation of NN. According to the most significant bit (MSB) of \Delta W, the crossbar that needs to be tuned is chosen.
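A sketch of the selection rule of Eq. (13) for a single differential pair is shown below; the conductance units, the clamp at a minimum conductance, and the example values are illustrative assumptions.

```python
def one_direction_write(delta_w, g_pos, g_neg, g_min=0.0):
    """Eq. (13): the weight is G+ - G-, and only RESET pulses (conductance decreases)
    are ever issued, selected by the sign (MSB) of delta_w."""
    if delta_w < 0:
        g_pos = max(g_pos + delta_w, g_min)   # reduce the weight: RESET the positive crossbar
    elif delta_w > 0:
        g_neg = max(g_neg - delta_w, g_min)   # increase the weight: RESET the negative crossbar
    # delta_w == 0: continue, no pulse is issued
    return g_pos, g_neg

# Example: weight 5e-4 S reduced by 2e-4 S using a single RESET on the positive cell
g_pos, g_neg = one_direction_write(-2e-4, g_pos=8e-4, g_neg=3e-4)
print(g_pos - g_neg)   # 3e-4
```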

Fig. 7: Accuracy of NN at different refreshing intervals; the x-axis is the refreshing interval and the y-axis is the accuracy.

C. Smart Refresh Strategy

Only performing RESET operations means that the value of the RRAM cell more easily reaches its boundary, which is called resistance saturation [18], as shown in Fig. 1 (d). According to [18], accuracy is clearly damaged by resistance saturation if it is not handled. [18] adopts a refreshing strategy at a fixed frequency to reduce the effect of resistance saturation. We used this method to train an NN on the MNIST dataset, and the result is shown in Fig. 7. According to Fig. 7, accuracy is maintained only when the weights are refreshed after every weight update. However, tuning the resistance is energy- and time-consuming, and the large number of NN parameters magnifies this cost. To reduce the writing cost caused by resistance saturation, we propose a smart refresh strategy, as shown in Algorithm 1. We do not adjust the resistance every time; we refresh only once the resistance reaches the boundary. Before tuning the RRAM according to the training algorithm, we first read the current resistance. Then we decide whether the current resistance can reach the target resistance given the boundary limitation. If the RRAM can be tuned to the target resistance, we do not adjust it; otherwise, we refresh the RRAM. With the smart refresh strategy, we can cancel some unnecessary write operations, because the method in [18] reads and writes the resistance in every training iteration.

Algorithm 1: Smart refresh strategy
Input: current resistance of the RRAM R_current, positive crossbar resistance R_pos, negative crossbar resistance R_neg, range of resistance [R_a, R_b]
Output: R_current
1  if \Delta W < 0 then
2    if R_neg + \Delta R > R_a then
3      Record \Delta R_current = R_pos - R_neg
4      Lower R_neg and R_pos by the same value
5      Keep R_pos - R_neg = \Delta R_current
6    end
7    R_neg = R_neg + \Delta R
8  end
9  else if \Delta W > 0 then
10   if R_pos + \Delta R > R_a then
11     Record \Delta R_current = R_pos - R_neg
12     Lower R_neg and R_pos by the same value
13     Keep R_pos - R_neg = \Delta R_current
14   end
15   R_pos = R_pos + \Delta R
16 end
17 return R_pos, R_neg
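One possible Python reading of this refresh-on-demand idea for a single differential pair is sketched below. Algorithm 1 leaves some conventions implicit, so the variable names, the choice of r_max as the saturation boundary, the resistance range (taken from the Ron/Roff = 1k/20k device of Section VI), and the final clamp are assumptions.

```python
def smart_refresh_write(r_cell, r_other, delta_r, r_min=1e3, r_max=20e3):
    """Before a RESET that would push the selected cell past the boundary, shift both
    cells of the pair by the same amount so their difference (the stored weight) is
    preserved, then apply the requested change (a sketch of Algorithm 1)."""
    if r_cell + delta_r > r_max:                               # target unreachable: refresh first
        shift = (r_cell + delta_r) - r_max
        shift = min(shift, r_cell - r_min, r_other - r_min)    # keep both cells in range
        r_cell -= shift                                        # lower both resistances ...
        r_other -= shift                                       # ... by the same value
    return min(r_cell + delta_r, r_max), r_other               # clamp if headroom is exhausted

# Example: a RESET of 2 kOhm on a nearly saturated cell triggers one refresh
print(smart_refresh_write(r_cell=19e3, r_other=5e3, delta_r=2e3))   # -> (20000.0, 4000.0)
```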
For reducing the number of unavoidable initialization, we propose a variability-free tuning scheme in Algorithm 2 to reduce the energy of tuning. The method depends on not only the ideal distance, which is the mean value in Eq. (14), but also the variance in Eq. (14). The voltage pulse is still produced by the Eq. (14) while the resistance is based on R 0.09R 3. According to 3σ principle in [35], the probability that the resistance is larger than target range is 0.26% and it means that the initialization can almost be avoided. Besides, NN has shown the power to resist noises. Some references [36] [37] show that the noises can improve the performance of NN. Thus, we attempt to make use of this ability to train NN based on the random model of Eq. (14). However, NN cannot converge in the recognition task even in a small dataset, such as MNIST [38]. The reason might be the random model of is not same as the noise model in [36] [37]. Thus, we propose the variability-free tuning method to update weight accurately. E. Discussion 1) Endurance: According to [29] [30], the endurance of is 2 at most while the endurance of DRAM and the endurance of SRAM is up to 6 [31]. We list the number of training iteration of popular models of NN in TABLE I and the task of the model is ImageNet recognition. According to TABLE I, a training model in 128 batch size needs 10 6 training

9 9 Algorithm 2: Varialibity-free tuning scheme Input: Current resistance of R current, target resistance range R target, target resistance range ɛ, change of resistance δr Output: R current 1 while ɛ > abs(r current R target) do 2 R real N( R accurate, 0.09R) 3 Produce the R real according to N( R accurate 3σ, 0.09R) 4 Obtain V pulse by lookup table based on 5 Use V pulse to change the resistance of 6 obtain the R current 7 R accurate = R accurate R target 8 end 9 return R current A inference Q(s,a:w A ) B inferenceq(s`,a:w B ) Count =+ 1 A backpropagation and update Count > 64? Copy Count= 1 Y Inference Store Q(s,a:w A ) Count =+ 1 Count > 64? Y inference Q(s`,a:w B ) backpropagation update Count -= 1 Count > 64? N N iterations at most, so the endurance of is enough for training NNs. Besides, the number of training iterations decreases as the batch size increases and the larger batch size means a lower number of writing. A threshold-training method in [39] is proposed to extend the endurance of in training NN. Thus, the endurance of does not affect training NN on. 2) Footprint Mismatch Between And CMOS Periphery: The mismatch of footprint between array and its CMOS periphery circuit might lead to inaccurate input voltage, which decreases the performance of the system. For this problem, our system can adopt the fault-tolerant method in [40] [39] to reduce the negative effects. Besides, NN has shown the power in resisting noises [24]. We can make use of this ability to reduce the cost caused by mismatch problem. Finally, the fabrication of can be improved in the future. Depending on the methods above, the mismatch problem can be relieved. V. DRL ON TIME As Fig. 3 shows, there is a copy operation in DRL, which copies the weights of A network into that of B network. The copy operation in the hardware is writing the resistance of the crossbar of A network into that of the crossbar of B network. However, tuning is costly because of the non-ideal factors and the large number of parameters in DRL so that the copy operation in DRL consumes huge energy and time. Thus, the specific mapping method, as shown in Fig. 8, is proposed to avoid the writing operation in hardware caused by the copy operation. According to Fig. 3, B network is used to produce Q, which is used to calculate the y in Eq. (11), and its weights are the same as the weights of A network of µ iterations before. In other words, the weights of B network are derived from A network and we use the output Q and r from the same data set to train A network. It means that we only need Q and r for training network and do not need to store the weights. Thus, by adjusting the order of operation, we can use one network of crossbar to implement the two networks of DRL as shown in Fig. 8. We operate the inference 64 times in advance and store these 64 Q results. Then, depending on these results, we can implement inference and backpropagation according to Eq. (7) 64 times. Next, weight update phase can be operated without implementing copy operation 64 times. Then, we repeat this process without the writing operation in hardware. Thus, with Fig. 8: By adjusting the order of operation of DRL, specific mapping method of DRL can cancel the tuning of caused by copy of update. 
TABLE II: Benchmark Name The structure of network MLP MLP MLP CNN-1 conv5x5-pool CNN-2 conv7x10-pool conv11x96-pool3-conv5x256-pool3- DoReFa Net conv3x384-conv3x384-conv3x256- pool DRL net conv8x16-conv4x the specific mapping method, we cancel the tuning operation caused by the copy operation of DRL. Besides, the scale of the network in hardware is reduced to its half with this specific mapping method and the area of TIME benefits from it. VI. SIMULATION RESULTS In this section, we evaluate our TIME design. We first describe the experiment setup, and then present the speed and energy results and estimate the area overhead. A. Experiment Setup The benchmarks we use comprise seven NN designs for deep learning applications, as listed in Table II. CNN-1 and CNN-2 are two CNNs, and MLP-1/2/3 are three multilayer perceptrons (MLPs) with different network scales: small, medium, and large. Those five NNs are evaluated on the widely used MNIST database of handwritten digits [38]. The sixth NN, DoReFa net [41], is a well known low bits neural network for ImageNet ILSVRC [42]. It is an extremely large CNN, containing 16 weight layers and synapses. In DRL, we adopt DRL net [2] as the benchmark. TIME contains 64 banks in one chip and there are two FF subarrays and one buffer subarray per bank (totally 64 subarrays). In FF subarray, each mat includes one crossbar of cells and eight 6-bit s. For the cell, according to the study we know, the weight of NN is

10 10 floating points in BP. Thus, we set cell 5-bit during the inference and assume cell ideally consistent during BP for computation. We write row-by-row according to methodology in [43]. The cell is 1-bit in memory mode. The counterparts of TIME are GPU and DaDianNao. The NVIDIA TITAN X is adopted as the GPU counterpart and the data of DaDianNao is based on [1]. We measure the result of K20 and we evaluate the performance of DaDianNao according to the improvement compared with NVIDIA GPU K20 in [1]. The simulation library is 65nm CMOS library. The detailed parameters are based on previous designs and models, such as write driver [44], the sigmoid and [45], model is in [20]. Other settings, such as basic peripheral circuit parameters, etc, are based on PRIME. We use H- simulation program with integrated circuit emphasis (HSPICE) to simulate these circuits performance. We model the above TIME designs according to 65nm TSMC CMOS library. We also model main memory and our TIME system with modified NVSim [46], CACTI- 3DD [47], and CACTI-IO [48]. We adopt Pt/TiO2-x/Pt devices [49] with Ron/Roff = 1k/20k and SET/RESET voltage is based on [50]. The FF subarray is modeled by heavily modified NVSim, according to the peripheral circuit modifications. B. Non-ideal Factors Optimization For administrating the advantage of our method, we compare the TIME using the non-ideal factors optimization and TIME not using it. 1) Lookup table policy: For illustrating the effects of lookup table policy, we use NVIDIA TITAN X as a counterpart of lookup table policy to compute the complex function in [20]. Fig. 9 shows the comparison in different NNs between TIME using lookup table policy and TIME using GPU, NVIDIA TITAN X, and the batch size of each NN is 32. According to Fig. 9, the energy efficiency of TIME with lookup table policy is 2.79x on average, 3.79x at most, 1.73x at least better than TIME with GPU. Besides, Fig. 9 shows that the speed of TIME with lookup table policy is 0.27x on average, 0.07x at least, 0.49x at most compared with TIME with GPU. Thus, there exists a tradeoff between energy efficiency and speed when using lookup table policy. 2) Variability-free method: In order to administrate the effects of the variability-free method, we compare the performance between TIME with the method in [19] and TIME with the variability-free method. Fig. 10 shows the performance results in different NNs and batch size of NNs is 32. The energy efficiency results are shown in Fig. 10 and the results show that the energy efficiency is improved 2.28x on average, 2.29x at most compared with the method in [19]. Besides, according to Fig. 10, the speed of TIME with the variability-free method can be boosted 1.68x at most, 1.26x on average compared with that of TIME with method in [19]. C. SL Result 1) Smart refresh strategy: For illustrating the advantage of smart refresh strategy, the method in [18] is selected as GOP/J GOP/s Energy Efficiency TIME with GPU TIME with lookup table policy Speed TIME with GPU TIME with lookup table policy Fig. 9: Energy efficiency of lookup table policy. The vertical axis represents the GOP/J and the horizontal axis represents different neural networks; Speed of lookup table policy. The vertical axis represents the GOP/s and the horizontal axis represents different neural networks. GOP/J GOP/s Energy Efficiency TIME without variability free method TIME with variability free method Speed TIME without variability free method TIME with variability free method Fig. 
10: Energy efficiency of variability-free method. The vertical axis represents the GOP/J and the horizontal axis represents different neural networks; Speed of variabilityfree method. The vertical axis represents the GOP/s and the horizontal axis represents different neural networks. a counterpart. Fig. 11 shows the energy efficiency comparison between TIME with the smart refresh strategy and TIME with the method in [18]. The results in Fig. 11 are in different NNs and batch size of NNs is 32. According to Fig. 11, the energy efficiency with smart refresh strategy is 1.88x on average, 2x at most, 1.3x at least compared with TIME with the method in [18]. For the speed, TIME with smart refresh strategy is 1.1x on average, 1.3x at most compared with TIME using the method in [18]. 2) Energy Result: Fig. 12 shows the detailed energy efficiency results of DaDianNao, TIME with the optimization and TIME without optimization in different NNs. Besides,

11 11 GOP/J Energy Efficiency TIME with smart refreshing strategy TIME with refreshing every time Speed TIME with smart refreshing strategy TIME with refreshing every time 2) Speed Result: Fig. 13 shows the detailed speed results of GPU, TIME with the specific mapping method and TIME without the specific mapping method with different batch sizes. The GOP/s of TIME increases at most 35x and at least 7.5x compared with GPU in DRL net. Besides, TIME without the copy and TIME with the copy are compared. The results in Fig. 13 show that the speed of the former is about 1.2x to 1.4x faster than the latter, while the energy efficiency of the former is about 1.2x lower than the latter. Thus, there exists a trade-off between energy efficiency and speed when we realize DRL in TIME. GOP/s Fig. 11: Energy efficiency of smart refresh strategy. The vertical axis represents the GOP/J and the horizontal axis represents the batch size; Speed of smart refresh strategy. The vertical axis represents the GOP/s and the horizontal axis represents the batch size. the experiments are under different batch size, 32, 64, 128 and 256. In CNN, the results show that the energy efficiency of TIME with optimization increases by 2.5x at least, 67x at most and 19.6x on average compared with DaDianNao. In MLP, the energy efficiency of TIME with optimization increases by 2.3x at least, 10.3x at most and 6.74x on average compared with DaDianNao. The comparison between TIME with optimization and TIME without optimization shows that the energy efficiency is boosted 3.1x on average. These results show that the proposed optimization is necessary to reduce the cost of tuning. 3) Speed Result: Fig. 12 shows the detailed speed results of DaDianNao, TIME with the non-ideal optimization and TIME without the non-ideal optimization with different batch sizes. The speed of TIME with the non-ideal optimization increases at most 1.92x and on average 1.3x compared with DaDianNao in CNN. For MLP, the speed of DaDianNao is faster than TIME, such as MLP3, but it is slower than TIME in CNN. And in the large scale NN, DoReFa net, TIME is 1.5x faster than DaDianNao. According to Fig. 12, the speed of TIME with the nonideal optimization is almost the same as TIME without the non-ideal optimization. Thus, the non-ideal optimization does not accelerate training NN in TIME. E. Energy Distribution and Area Overhead 1) Energy Distribution: Fig. 14 shows the energy distribution in TIME. The tuning energy dominates 87% of the total energy in training MLP and 97% in training CNN. Thus, the tuning energy is dominant because the non-ideal factors cause too much number of writing and reading which need AD/DA operation. The writing model in this paper is adopted from [50] and the write energy is 60.9 pj while the current write energy in [20] can be reduced to about 1 pj. If we use 1 pj as the write energy, TIME will improve the energy efficiency more than 50x, and boost energy efficiency by two orders of magnitude higher compared with DaDianNao. Thus, TIME has potential to improve energy efficiency again once we decrease the cost of tuning. 2) Area Overhead: Fig. 15 shows the area overhead of TIME. Our design only incurs 6.92% area overhead compared with the whole memory according to [46]. The area overhead of is the largest part because we add ReLU function, max and weights update circuit in it. The design of and COL MUX does not lead to significant increase in the area overhead. 
The controller, which increases only 8% overhead in PRIME, increases 12% overhead in TIME because TIME can support the training of NN. The final output diff and preprocess occupy 14%. For analyzing the effects of, speed and area overhead of TIME in DoReFa Net under different number of are shown in TABLE III. According to TABLE III, increasing the number of in each FF subarray can improve the speed but it leads to more area consumption. Thus, there exists a tradeoff between speed and area. Besides, it seems that we can further increase the number of for higher speed of system. However, it enlarges the mismatch between and the periphery circuits so that decreases the accuracy of system. Thus, we set eight s in each FF subarray for balancing the speed, area and accuracy of system. D. DRL Result 1) Energy Result: Fig. 13 shows the detailed energy efficiency results of GPU, TIME with the specific mapping method and TIME without the specific mapping method in DRL net under different batch sizes. The results show that the energy efficiency of TIME is at least 114x, at most 156x, and on average 126x compared with GPU. TIME with the specific mapping method can boost 1.2x energy efficiency on average compared with TIME without the specific mapping method. VII. CONCLUSION In this paper, we propose an architecture based on, TIME, and peripheral circuits design to enable training NN on. TIME substantially improves the speed and energy efficiency for NN applications. Besides, we propose several methods to reduce the cost caused by the non-ideal factors of in weight update phase of training NN. We firstly propose lookup table policy to reduce the cost of computing complex functions between tuning voltage and the change

12 12 Dadinanaobatch=32 Dadinanao batch=64 Dadinanao batch=128 Dadinanao batch=256 TIME with optimazation batch=32 TIME with optimazation batch=64 TIME with optimization batch=128 TIME with optimization batch=256 GOP/J TIME without optimazation batch=32 TIME without optimazationbatch=64 TIME without optimization batch=128 Energy Efficiency TIME without optimization batch=256 GOP/s Speed Fig. 12: The energy efficiency of SL. The vertical axis represents the GOP/J and horizontal axis represents different batch size in different NN; The speed of SL. The vertical axis represents the GOP/s and horizontal axis represents different batch size. Energy Efficiency GPU TIME with resistance copy TIME without resistance copy 7% Others 14% Control 19% GOP/J 13% GOP/s batch size Speed GPU TIME without resistance copy TIME with resistance copy batch size Fig. 13: Energy Efficiency of DRL. The vertical axis represents the GOP/J and the horizontal axis represents the batch size; Speed of DRL. The vertical axis represents the GOP/s and the horizontal axis represents the batch size. 34% 12% Final output diff <1% Fig. 15: Area overhead of TIME. dominates the largest part of area overhead. TABLE III: Tradeoff between speed and area under different number of Number of Speed(op/s) Area overhead 4 600M 4.58% 8 1.1T 6.92% T 9.26% T 11.8% T 14.0% others 13% 87% others 3% 97% write write energy energy MLP CNN Fig. 14: Energy distribution of TIME. Tuning energy is dominant in energy distribution. of conductance. Secondly, one-direction writing scheme is proposed and corresponding circuits are designed for avoiding the asymmetry and reducing the cost of writing. Thirdly, we design a smart refresh strategy to enhance the performance of TIME and avoid the resistance saturation. Fourthly, a variability-free writing scheme is proposed for reducing the energy consumption caused by variability of. Finally, we discuss the endurance of and the footprint mismatch between and CMOS peripheral circuits. We illustrate that the current endurance of is suitable for training NN on. We analyze that footprint mismatch does not affect the performance of TIME. Besides,

13 13 a specific mapping method is proposed to improve the energy efficiency of DRL. In addition, we evaluate the energy distribution of TIME and find that the energy consumption mainly comes from tuning. Once we reduce the cost of tuning, the energy efficiency and speed are boosted obviously. TIME incurs an insignificant area overhead compared with original memory chips. The simulation results show that TIME can achieve a high speedup and significant energy saving for various NN applications. REFERENCES [1] Y. Chen et al., Dadiannao: A machine-learning supercomputer, in IEEE/ACM ISM, 2015, pp [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing atari with deep reinforcement learning, Computer Science, [3] C. Szegedy and others., Going deeper with convolutions, pp. 1 9, [4] D. Silver et al., Mastering the game of go with deep neural networks and tree search, Nature, vol. 529, no. 7587, pp , [5] K. He et al., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp [6] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Computer Vision and Pattern Recognition, 2016, pp [7] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Computer Science, [8] P. Chi et al., PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory, in ISCA, 2016, pp [9] S. W. Keckler et al., GPUs and the Future of Parallel Computing, IEEE Micro, vol. 31, no. 5, pp. 7 17, [10] J. Qiu et al., Going deeper with embedded fpga platform for convolutional neural network, in FPGA. ACM, 2016, pp [11] B. Li et al., Training itself: Mixed-signal training acceleration for memristor-based neural network, in ASPDAC, 2014, pp [12] Training low bitwidth convolutional neural networks on rram, in ASPDAC, [13] Y. Wang, L. Xia, T. Tang, B. Li, S. Yao, M. Cheng, and H. Yang, Low power convolutional neural networks on a chip, in IEEE International Symposium on Circuits and Systems, 2016, pp [14] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, Binary convolutional neural network on rram, in Design Automation Conference, 2017, pp [15] L. Xia et al., Switched by input: power efficient structure for rrambased convolutional neural network, in DAC, [16] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars, in ACM/IEEE International Symposium on Computer Architecture, 2016, pp [17] W. Huangfu, L. Xia, M. Cheng, X. Yin, T. Tang, B. Li, K. Chakrabarty, Y. Xie, Y. Wang, and H. Yang, Computation-oriented fault-tolerance schemes for rram computing systems, in Design Automation Conference, 2017, pp [18] G. W. Burr et al., Experimental demonstration and tolerancing of a large-scale neural network ( synapses) using phase-change memory as the synaptic weight element, IEEE TED, vol. 62, no. 11, pp , [19] B. Li et al., -Based Analog Approximate Computing, TCAD, vol. 34, no. 12, pp. 1 1, [20] S. Yu et al., A neuromorphic visual system using synaptic devices with Sub-pJ energy and tolerance to variability: Experimental characterization and large-scale modeling, IEDM, vol. 2012, pp , [21] B. Li, L. Xia, P. Gu, Y. Wang, and H. 

Ming Cheng received his B.S. degree in 2015 from Huazhong University of Science and Technology, Wuhan. He is currently pursuing the M.S. degree with the Department of Electronic Engineering, Tsinghua University, Beijing. His current research interests include energy-efficient hardware architecture design based on emerging non-volatile devices.

Lixue Xia received his B.S. degree in electronic engineering from Tsinghua University, Beijing. He is currently pursuing his Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing. His research mainly focuses on energy-efficient hardware computing system design and neuromorphic computing systems based on emerging non-volatile devices.

Zhenhua Zhu is currently pursuing his B.S. degree in electronic engineering at Tsinghua University, Beijing. His research interests include non-volatile memory, power-efficient system design, hardware acceleration, and computing with emerging devices.

Yu Wang (S'05-M'07-SM'14) received his B.S. degree in 2002 and Ph.D. degree (with honor) in 2007 from Tsinghua University, Beijing. He is currently a tenured Associate Professor with the Department of Electronic Engineering, Tsinghua University. He has published more than 50 journal papers (38 in IEEE/ACM journals) and more than 150 conference papers (12 DAC, 13 DATE, 4 ICCAD, 24 ASPDAC, 9 FPGA) in the areas of EDA, FPGA, VLSI design, and embedded systems, with a focus on brain-inspired computing, application-specific heterogeneous computing, parallel circuit analysis, and power- and reliability-aware system design methodology. He has graduated 4 Ph.D. students and 11 master students, and is currently advising 9 doctoral students and 6 master students. He has served as PI/Co-PI on over 20 research grants administered by Chinese government agencies (including NSFC, the National Key Technology Program, 863, etc.) and 18 research grants from industry (including Microsoft, IBM, Huawei, MHI, etc.), with a total amount of 45.1 million RMB and a personal share of 28.4 million RMB. These projects have led to new CAD tools, optimization methods, and heterogeneous computing systems based on CPU/FPGA/GPU/emerging memory technologies. He received the Best Paper Award at FPGA 2017, NVM 2017, and ISVLSI 2012, and the Best Poster Award at HEART 2012, with 8 Best Paper Nominations (DAC 2017, ASPDAC 2016, ASPDAC 2014, ASPDAC 2012, 2 in ASPDAC 2010, ISLPED 2009, CODES 2009). He is a recipient of the IBM X10 Faculty Award in 2010 (one of 30 worldwide). Yu Wang also received the Natural Science Fund for Outstanding Youth in 2016, and is a co-founder of Deephi Tech (valued at over 150M USD), a leading deep learning processing platform provider. He has been an active volunteer at design automation, VLSI, and FPGA conferences.
He served as TPC Chair for ISVLSI 2018 and ICFPT 2011, Finance Chair of ISLPED, and Track Chair for DATE and GLSVLSI 2018, and has served as a program committee member for leading conferences in these areas, including top EDA conferences such as DAC, DATE, ICCAD, and ASP-DAC, and top FPGA conferences such as FPGA and FPT. He currently serves as Co-Editor-in-Chief of ACM SIGDA E-News, Associate Editor for IEEE Transactions on CAD, IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), and the Journal of Circuits, Systems, and Computers, and Special Issue Editor for the Microelectronics Journal. He also serves as a guest editor for Integration, the VLSI Journal, and IEEE Transactions on Multi-Scale Computing Systems. He has given 70 invited talks and 2 tutorials in industry and academia. He is an ACM/IEEE Senior Member.

Yi Cai received his B.S. degree in electronic engineering from Tsinghua University, Beijing. He is currently pursuing his Ph.D. degree in the Department of Electronic Engineering, Tsinghua University, Beijing. His research mainly focuses on on-chip neural network systems and emerging non-volatile memory technology.

Yuan Xie (F'15) was with IBM, Armonk, NY, USA, from 2002 to 2003, and with AMD Research China Lab, Beijing, China, starting in 2012. He was previously a Professor with Pennsylvania State University, State College, PA, USA. He is currently a Professor with the Electrical and Computer Engineering Department, University of California at Santa Barbara, Santa Barbara, CA, USA. His current research interests include computer architecture, electronic design automation, and VLSI design.

Huazhong Yang (M'97-SM'00) was born in Ziyang, Sichuan Province, P.R. China, on August 18. He received the B.S. degree in microelectronics in 1989 and the M.S. and Ph.D. degrees in electronic engineering in 1993 and 1998, respectively, all from Tsinghua University, Beijing. In 1993, he joined the Department of Electronic Engineering, Tsinghua University, Beijing, where he is now a Full Professor. Dr. Yang was awarded the Distinguished Young Researcher by NSFC in 2000 and Cheung Kong Scholar by the Ministry of Education of China. He has been in charge of several projects, including projects sponsored by the National Science and Technology Major Project, the 863 Program, NSFC, the 9th Five-Year National Program, and several international research projects. Dr. Yang has authored and co-authored over 400 technical papers, 7 books, and over 100 granted Chinese patents. His current research interests include wireless sensor networks, data converters, energy-harvesting circuits, non-volatile processors, and brain-inspired computing. Dr. Yang has also served as the chair of the Northern China ACM SIGDA Chapter.
