Improving Bit Flip Reduction for Biased and Random Data

Seyed Mohammad Seyedzadeh, Rakan Maddah, Donald Kline Jr, Student Member, IEEE, Alex K. Jones, Senior Member, IEEE, Rami Melhem, Fellow, IEEE

Abstract: Nonvolatile memory technologies such as Spin-Transfer Torque Random Access Memory (STT-RAM) and Phase Change Memory (PCM) are emerging as promising replacements for DRAM. Before STT-RAM and PCM can be deployed in functional systems, a number of challenges must still be addressed. Specifically, both require relatively high write energy, STT-RAM suffers from high bit error rates, and PCM suffers from low endurance. A common solution to these challenges is to minimize the number of bits changed per write. In this paper, we propose and evaluate the hybrid coset encoder to efficiently improve and balance bit flip reduction for biased and unbiased data. The core of the coset encoder consists of biased and unbiased vectors that map the input data to a larger set of data vectors. Subsequently, the intermediate data vector that yields the fewest differences when compared to the currently stored data is selected. Our evaluation shows that the hybrid coset encoder reduces bit flips by up to 25% over a baseline differential writing scheme. Further, our proposed scheme reduces bit flips by up to 20% over the leading bit-flip minimization scheme for biased data, while achieving very low decoding overhead similar to the Flip-N-Write scheme.

Index Terms: Non-Volatile Memory, Coset Coding, Reliability.

1 INTRODUCTION

As the number of cores per chip continues to increase, the memory system is becoming more than ever a defining component for the performance of computer systems. A large memory capacity operating under stringent quality-of-service requirements is required to respond to memory access requests of executing cores within acceptable latencies. Unfortunately, DRAM, which currently forms the building block of the memory system, is becoming limited by power and scalability challenges, thus endangering the evolution of the memory system. This has turned the attention of architects and researchers to alternative memory technologies [1-4]. Among several candidates, both Phase Change Memory (PCM) and Spin-Transfer Torque Random Access Memory (STT-RAM) are receiving considerable attention as potential replacements for DRAM. Assessments and evaluations of PCM [5, 6] show that it can compete with DRAM in terms of performance while providing improved scalability and power efficiency. Multi-level cell techniques for STT-RAM [7] suggest the potential for near-DRAM densities while retaining near-SRAM performance (potentially faster than DRAM). Yet, both technologies suffer from a number of challenges that must be resolved to become viable for high volume manufacturing. Specifically, PCM suffers from low endurance [8] (10^6 to 10^8 writes on average) and STT-RAM suffers from high write bit error rates [9-11].

(Author affiliations: S.M. Seyedzadeh and R. Melhem are with the Department of Computer Science, University of Pittsburgh, Pittsburgh, PA; seyedzadeh@cs.pitt.edu, melhem@cs.pitt.edu. Rakan Maddah is with Intel Corporation; this work was completed while the author was a Ph.D. student at the University of Pittsburgh; rakan.maddah@intel.com. D. Kline Jr and A.K. Jones are with the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA; dek61@pitt.edu, akjones@ece.pitt.edu.)
To address the challenges faced by PCM and STT-RAM, servicing write requests while minimizing the actual number of bits written to memory is a promising approach. This reduces the effective wear-out rate of PCM cells and lowers the effective bit error rate of STT-RAM. In its simplest form, minimizing the actual number of bit flips can be achieved through a concept called differential write [12]: a bit-by-bit comparison between the new data to be written and the data currently stored within the memory block, followed by writing only to the cells whose stored bit values differ from their new bit values. Clearly, the higher the similarity of the new data to the currently stored data, the lower the number of required bit changes. In this realm, coset coding techniques [13-17] have been used to encode the new data into a form that exhibits high similarity to the currently stored data. The encoding process consists of mapping the new data vector into several other candidate vectors and subsequently picking the candidate that minimizes the bit flips. Therefore, the type of the encoding can play a significant role in minimizing bit flips. Consequently, in this paper, we consider two fundamental problems: (1) finding efficient encodings that reduce not only the number of bit flips but also the encoding hardware overhead, and (2) finding encodings that apply to both unbiased and biased data.

We advance the notion of coset coding to deliver solutions to these problems and provide major enhancements to the understanding of coset coding and how it may be implemented in a simple fashion with reduced computational complexity. We make the following contributions to data encoding for bit flip minimization:

- We observe that increasing the randomness of encoded vector candidates decreases the number of bit flips required for unbiased (i.e., apparently random) data sets.
- We demonstrate mathematically and by simulation that random encoding outperforms the leading bit-minimization encoding approach based on partially inverting bits.
- We introduce a new type of base coset called the hybrid coset, which includes unbiased and biased vectors suitable for unbiased and biased data, respectively.
- We propose low-overhead encoders and decoders that only use bitwise exclusive-OR operations.

The remainder of this paper is organized as follows. Section 2 presents fundamentals on bit flip reduction techniques. Section 3 describes write minimization using a randomly generated coset. Section 4 discusses two different ways to construct random cosets. Section 5 illustrates a pseudo-random encoding scheme [18] suitable for unbiased data. Sections 6 and 7 explain the low-overhead and flexible encoders/decoders that can be applied to biased and unbiased data. Section 8 presents an experimental evaluation of hybrid coset coding for biased and unbiased data. Section 9 compares the encoding and decoding overheads of the schemes. Finally, Section 10 concludes our work.

2 BACKGROUND: WRITE MINIMIZATION USING COSET THEORY

As mentioned in the previous section, differential write [12] is a technique to minimize bits written by only writing bits that change value during the write operation. Consider an n-bit data block B that is to be written to an n-cell memory block that already stores data D. A traditional differential write first reads D, compares it with B and only writes the bits of B that are different from their corresponding bits in D. Thus, for entirely random data, the distribution of zeros and ones in both B and D is random, and hence, on average, n/2 bits are written to memory instead of n bits.

Coset theory [19] attempts to increase the number of bits that are identical in B and D by exploring a number K of possible encodings of B, where K = 2^k and k is an integer. The overall number of written bits can be reduced by selecting the encoding that has the minimum Hamming distance to D (the fewest differing bits). To recover the original data B, it is necessary to have a unique decoding path. To accomplish this, one coset approach defines a fixed set of K n-bit vectors, C_0, ..., C_{K-1}, and uses these vectors to generate K different encodings of B through bitwise XOR. We denote these K different encodings of B as W_0, ..., W_{K-1}, where each W_i, 0 <= i <= K-1, is an (n+k)-bit vector whose first n bits equal C_i ⊕ B and whose last k bits record the binary representation of the index i. Any W_i among the K possible encodings can be decoded back to B by recovering i from the last k bits of W_i and XORing C_i with the first n bits of W_i, i.e., (C_i ⊕ B) ⊕ C_i = B. To minimize the number of bits written, as previously mentioned, the encoding W_i that has the minimum Hamming distance from D (the data already stored in memory) is used. However, each W_i and D contains n+k bits, while the data block, B, contains only n bits.
The additional k bits, which store the index i, represent the coding overhead needed to reduce the Hamming distance and minimize the number of bits written during a differential write. We denote these overhead bits as h_0, ..., h_{k-1}. In the rest of this paper, we refer to the set C_0, ..., C_{K-1} as the base coset.

Flip-N-Write (FNW) proposes a method to reduce the number of bits written using differential write by selectively inverting blocks of the data to be written [16]. In general, for any number of k bits, Flip-N-Write divides B into k equal partitions and writes each partition directly or inverted, whichever minimizes the number of written bits. It uses k overhead bits h_0, ..., h_{k-1} to track which partitions are inverted, allowing retrieval of the original data. FNW is a special case of the coset approach in which each n-bit vector C_i in the base coset is constructed from k sub-vectors, C_{i,j}, j = 0, ..., k-1, each consisting of n/k bits. Specifically, if the overhead bits h_0, ..., h_{k-1} store the binary representation of the index i, then the n/k bits of C_{i,j} are all zeroes if h_j = 0 and all ones if h_j = 1. For example, if k = 1 (that is, K = 2), the base coset for Flip-N-Write contains the two vectors 0...0 and 1...1, which means that the data vector B can either be written as is (B ⊕ 0...0) or written inverted (B ⊕ 1...1), with one overhead bit, h_0, indicating which of the two options is used. For k = 2 (that is, K = 4), the base coset for Flip-N-Write contains the 4 vectors 0...00...0, 0...01...1, 1...10...0 and 1...11...1, which means that B is divided into two halves and each half can either be inverted or not, with one overhead bit, h_0, indicating whether the first half is inverted and another overhead bit, h_1, indicating whether the second half is inverted.

Another approach for write minimization, FlipMin, proposes using a special instance of coset theory to encode B using the dual of the Hamming (72,64) code [20] to obtain W_0, ..., W_{K-1} and to decode any W_i back to B [17]. FlipMin performs a one-to-many mapping through its base coset from each dataword to a coset of vectors and then picks for writing the vector that minimizes the number of bit flips.
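To make the mechanics concrete, the following minimal Python sketch (our illustration, not code from the paper) implements the generic base-coset encode/decode of this section, instantiated with the k = 1 FNW base coset. The block values and the placement of the index in the low-order bits are illustrative assumptions.

    # Minimal sketch (ours) of base-coset encoding with differential write.
    # Words are modeled as Python ints: n data bits plus k index bits.

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def coset_encode(B, D, base_coset, k):
        # Candidates W_i = (C_i XOR B) with the index i in the low k bits;
        # keep the candidate closest to the currently stored word D.
        best = None
        for i, C in enumerate(base_coset):
            W = ((C ^ B) << k) | i
            if best is None or hamming(W, D) < hamming(best, D):
                best = W
        return best

    def coset_decode(W, base_coset, k):
        i = W & ((1 << k) - 1)            # recover index i from the k index bits
        return (W >> k) ^ base_coset[i]   # (C_i XOR B) XOR C_i = B

    # k = 1 FNW base coset for n = 8: the all-zeroes and all-ones vectors.
    fnw_coset = [0b00000000, 0b11111111]
    B, D = 0b10111110, 0b000000001        # hypothetical new data and stored word
    W = coset_encode(B, D, fnw_coset, k=1)
    assert coset_decode(W, fnw_coset, k=1) == B

The number of cells actually programmed by the differential write is exactly hamming(W, D), which is what the candidate selection minimizes.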

The key difference between FNW and FlipMin can be conceptually described in terms of the base coset used to map the dataword. Specifically, FNW uses simple patterns such as 0...0 and 1...1 to construct the base coset, while FlipMin utilizes the generators of linear codes as the base coset. In the next section, we propose a third option for the base coset.

3 WRITE MINIMIZATION USING A RANDOMLY GENERATED COSET

In this section, we take a different approach to building the base coset by randomly generating each vector C_0, ..., C_{K-1}. We mathematically demonstrate that a randomly generated coset can complete write requests while inducing fewer bit flips than FNW. To derive a formula for the number of written bits when the base coset consists of K randomly generated vectors, C_0, ..., C_{K-1}, we compute the Hamming distance between each encoded vector C_i ⊕ B and the currently stored data, D. We then estimate the expected value of the minimum of this distance over i = 0, ..., K-1. The Hamming distance between C_i ⊕ B and D is equal to the Hamming weight of C_i ⊕ B ⊕ D. Moreover, because B, D and C_i are random, C_i ⊕ B ⊕ D is also random. This implies that the number of written bits (NWB) is equivalent to the expected value of the minimum Hamming weight of K random vectors.

The random vector C_i can be any element of the set of 2^n possible strings of zeroes and ones. We divide the set of 2^n possible elements into 2^{n-k} distinct groups so that each group has 2^k elements. The number of groups can range from one group (k = n) to 2^n groups (k = 0). The weight of a word is the number of ones it contains, denoted by the parameter w, where 0 <= w <= n. The total number of elements with weight w in the set of 2^n elements is $\binom{n}{w}$. Since the number of elements in each group is 2^k, the number of elements with weight w in a group of 2^k elements ranges over 1 <= l <= 2^k. In the set of 2^n elements, there are $\binom{2^n}{2^k}$ different combinations forming a group of 2^k elements. To obtain the average minimum weight of groups, we follow steps 1-3 as follows:

Step 1. Find the number of combinations with minimum weight w (NC_w), i.e., combinations that have at least one and at most 2^k elements with weight w, with all remaining elements having weight greater than w:

$$NC_w = \sum_{l=1}^{2^k} \binom{\binom{n}{w}}{l} \binom{\sum_{j=w+1}^{n} \binom{n}{j}}{2^k - l} \quad (1)$$

Step 2. Multiply each NC_w by the corresponding weight and sum:

$$NC = \sum_{w=0}^{n} w \cdot NC_w \quad (2)$$

Step 3. Calculate the average number of bits updated per write (denoted NWB_{Random}) as:

$$NWB_{Random} = \frac{NC}{\binom{2^n}{2^k}} \quad (3)$$

where 0 <= w <= n, 1 <= l <= 2^k and 0 <= k <= n.

For comparison with FNW [16], we note that given a memory block of size n and k auxiliary bits, the number of bits written by FNW can be expressed as:

$$NWB_{FNW} = k\left(\frac{1}{2^{n/k-1}} \sum_{i=0}^{\frac{n}{2k}-1} i\binom{n/k}{i} + \frac{n}{2k} \cdot \frac{\binom{n/k}{n/2k}}{2^{n/k}} + \frac{1}{2}\right) \quad (4)$$

Figure 1 shows the bit flip reduction over differential write (n/2 bit flips) achieved by randomly generated cosets and FNW cosets, derived through Eq. (3) and Eq. (4). The randomly generated coset achieves considerably higher flip reduction rates than an FNW coset across the different block sizes.

[Fig. 1: Weight average difference of random cosets and FNW with 4 auxiliary bits and block sizes of 32 (12.5%), 64 (6.25%), 128 (3.125%), 256 (1.5625%) and 512 (0.78125%) bits. The value in parentheses represents space overhead.]
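Eqs. (3) and (4) can be cross-checked by direct Monte Carlo estimation; the sketch below (ours, with an arbitrary seed and trial count) draws random B, D and base cosets and averages the minimum Hamming distance over the K candidates.

    import random

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def avg_written_bits_random_coset(n, k, trials=20000, seed=1):
        # Expected cells written per update when K = 2^k random base-coset
        # vectors are tried and the candidate closest to D is selected.
        rng = random.Random(seed)
        K, total = 1 << k, 0
        for _ in range(trials):
            B = rng.getrandbits(n)
            D = rng.getrandbits(n + k)    # stored word includes the k index bits
            coset = [rng.getrandbits(n) for _ in range(K)]
            total += min(hamming(((C ^ B) << k) | i, D)
                         for i, C in enumerate(coset))
        return total / trials

    # For n = 32 and k = 4 the estimate lands well below the (n + k)/2 = 18
    # expected flips of rewriting a random 36-bit word without encoding.
    print(avg_written_bits_random_coset(32, 4))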
Unfortunately, it is infeasible to determine an analytical model that estimates the number of bit flips required for FlipMin to complete a write request. Thus, we rely on Monte Carlo simulations to compare random cosets against FlipMin. Our analysis from these simulations indicates that random cosets can achieve significantly fewer bit flips than FlipMin for unbiased data, as reported in Section 8.1.

4 MECHANISMS OF THE RANDOM COSET GENERATION

In Section 2, we introduced cosets that can encode a value B into a value W_i that reduces the bit transitions when writing to a memory block containing a value D using differential write. We can also refer to the encoded value W_i as a codeword. The analysis provided in Section 3 demonstrates that using a randomly generated coset for unbiased data provides a higher reduction in bit flips than the leading existing flip minimization schemes. However, to decode a codeword encoded with a random coset, we need to know the random vector C_i that was used to encode the data. Since the inverse generator matrix concept used in FlipMin is not a valid option

for a random coset, we are left with the following two options:

- Use a random number generator at encoding time to generate the random coset element, C_i, from B. The hardware overhead of generating these random elements with traditional random generators is high, and the generation is typically irreversible at decoding time. To address these difficulties, in the next section we propose a Pseudo-Random Encoding Scheme, PRES, to generate pseudo-random vectors directly from the value B. PRES codewords are quickly recoverable using an efficient decoding mechanism that is less complex than FlipMin.
- Generate the random coset elements in advance and store them in both the encoder and the decoder. This option leads to a low-overhead encoder and decoder that is more flexible and efficient than the first option. Details are explained in Sections 6 and 7.

5 COSET GENERATION BASED ON PRES

PRES uses a tree-structured pseudo-random encoding model to generate pseudo-random cosets. We define conditions for our proposed tree-structure model to guarantee that the generated pseudo-random cosets are demonstrably close to true random under several standard tests. We describe the PRES scheme in detail in the following sections.

5.1 PREM: A Pseudo-Random Encoding Model

We first define a pseudo-random encoding model (PREM) to decorrelate a data block B as:

    P_i = P_{n-1} ⊕ B_i    for i = 0
    P_i = B_{i-1} ⊕ B_i    for i = 1                (5)
    P_i = P_{i-1} ⊕ B_i    for i = 2, ..., n-1

where B_i and P_i represent the i-th element of the data block and the pseudo-random vector, respectively, and the parameter n is the number of memory cells to be encoded. As expressed in Eq. (5), P_i for 1 <= i <= n-1 is generated first, followed by P_0, which is produced from P_{n-1} of the previous step and B_0. Eq. (5) is designed so that every B_i has the potential to be updated by each encoding process: since the probability of a cell holding a 0 or a 1 is 1/2, the probability of the cell being updated is also 1/2. That is, the probability that corresponding cells hold different values is 1/2, because two of the four possible combinations (01 and 10, out of 00, 01, 10 and 11) differ. Therefore, Eq. (5) produces vectors that are pseudo-random with high probability. The corresponding decoding algorithm for Eq. (5) can be expressed as:

    B_i = P_{n-1} ⊕ P_i    for i = 0
    B_i = B_{i-1} ⊕ P_i    for i = 1                (6)
    B_i = P_{i-1} ⊕ P_i    for i = 2, ..., n-1

[Fig. 2: Overview of PREM for encoding (B_0, ..., B_{n-1} to P_0, ..., P_{n-1}) and decoding.]

Figure 2 shows the feedback path from the left side to the right side that causes the P_i to be produced serially in the encoder. As shown in Figure 2, there is no such feedback path between P_i and B_i in the decoder; the advantage of this configuration is that all cells in the decoder can be decoded simultaneously. Thus, read accesses, which are typically on the critical path for processor performance and require decoding, can be streamlined. Table 1 illustrates the encoding and the decoding for n = 4.

TABLE 1: The encoding and the decoding for n = 4

    Input:    B = (B_0, B_1, B_2, B_3) = (0, 1, 0, 0)
    Encoding: P_1 = B_1 ⊕ B_0 = 1 ⊕ 0 = 1
              P_2 = B_2 ⊕ P_1 = 0 ⊕ 1 = 1
              P_3 = B_3 ⊕ P_2 = 0 ⊕ 1 = 1
              P_0 = B_0 ⊕ P_3 = 0 ⊕ 1 = 1
    Decoding: B_0 = P_0 ⊕ P_3 = 1 ⊕ 1 = 0
              B_1 = P_1 ⊕ B_0 = 1 ⊕ 0 = 1
              B_2 = P_2 ⊕ P_1 = 1 ⊕ 1 = 0
              B_3 = P_3 ⊕ P_2 = 1 ⊕ 1 = 0

Further, several pseudo-random encodings can be obtained by applying PREM in different bit orderings to expand the number of candidate encodings. For example, if the feedback path in Figure 2 is used from the right side to the left side (i.e., reversed), a different pseudo-random vector, P', can be generated. Thus, one bit pattern can be used in two different directions to build P and P'.
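A compact Python rendering of Eqs. (5) and (6), our sketch with cells modeled as a Python bit list, makes the serial encode and the parallelizable decode explicit:

    def prem_encode(B):
        # Eq. (5): P_1 first, then P_2..P_{n-1} serially, and P_0 last
        # from P_{n-1}, so the encoder produces its bits in series.
        n = len(B)
        P = [0] * n
        P[1] = B[0] ^ B[1]
        for i in range(2, n):
            P[i] = P[i - 1] ^ B[i]
        P[0] = P[n - 1] ^ B[0]
        return P

    def prem_decode(P):
        # Eq. (6): every bit except B_1 depends only on P, so in hardware
        # all cells can be recovered simultaneously.
        n = len(P)
        B = [0] * n
        B[0] = P[n - 1] ^ P[0]
        B[1] = B[0] ^ P[1]
        for i in range(2, n):
            B[i] = P[i - 1] ^ P[i]
        return B

    assert prem_decode(prem_encode([0, 1, 0, 0])) == [0, 1, 0, 0]  # Table 1

Note that in prem_decode only B_1 depends on another decoded bit (B_0), and B_0 itself depends only on P, so the whole block decodes in two XOR levels.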
To produce additional pseudo-random codewords, the encoding process can be applied in different patterns, such as different block-interleaved orderings. We describe this process in the following section and use it to build an indexable set of pseudo-random vectors for write-minimized encoding.

5.2 PRES: A Pseudo-Random Encoding Scheme

We can create p different patterns by subdividing B into sub-blocks conceptually represented by the rows (or columns) of a two-dimensional matrix. Thus, PRES can simultaneously and independently encode these sub-blocks using PREM in two opposite directions to generate two different pseudo-random codewords. By constructing p different matrices we can generate 2p codewords. For a partition-based pseudo-random generator model to be functional, there are certain requirements on how each matrix is partitioned and encoded. Each particular partitioning corresponds to a particular pattern and results in a unique pseudo-random encoding. However, partitions should group bits into sub-blocks such that each partition groups unique bits together and does not repeat bits grouped together in the sub-blocks of other partitions. If overlapping occurs, those bits are guaranteed to have the same values in the two different codewords, which decreases the randomness of the encoding candidates. Ensuring that the patterns have minimal overlap makes it possible to build a partition-based pseudo-random generator model that closely mimics an ideal random encoding generator. One way to do this is to use a single matrix and to apply PREM along different dimensions using different orderings, such as along rows, then columns, diagonals, etc.

In PRES, we propose a two-phase tree structure that generates 2p codewords in the first phase and from each of these generates 2p - 1 further codewords in the second phase, resulting in a total of 2p + 2p(2p - 1) = 4p^2 codewords. We assume that each matrix partitions the n bits of B into a square m x m matrix, such that each row (or column) of the matrix can be considered a sub-block containing m bits. (It is relatively straightforward to extend this idea to support non-square matrices through repartitioning.) The encoding process is as follows:

Step 1. The encoder uses the p given patterns in the tree structure to partition B into equal sub-blocks and generate p new matrices. PREM is independently applied to the sub-blocks of the p new matrices in two different directions to generate 2p pseudo-random codewords. The generated codewords differ in the encoding direction or the pattern used. Then, each codeword can be re-partitioned into p new matrices to produce 2p - 1 further codewords by PREM. Note that the direction of the original PREM used to generate a first-phase codeword should not be reused in the second phase, as it would essentially reproduce several bits of the original block B.

Step 2. The encoder utilizes k-bit indices (k = log_2(4p^2)) to label the 2^k codewords (C_0, ..., C_{2^k - 1}) generated in Step 1 and compares each codeword, concatenated with its corresponding index (i.e., W_i), with D. The W_i that has the minimum Hamming distance to D is selected.

Step 3. W_i is written to memory.

To retrieve B, the decoder uses the index i in memory to find the corresponding patterns (matrix partitionings) and encoding directions used for the codeword written in memory. Finally, using Eq. (6), the decoder restores the original data block by reversing the two phases of the encoding.

To clarify the process, Figure 3 provides a detailed example that generates 16 pseudo-random codewords from a new 16-bit data block storing B = 0xBC07 and then minimizes the number of bits flipped during the write to the memory that currently stores the value D = 0x00000. Recall that D is the concatenation of a previously stored codeword (16 bits) and index (4 bits), requiring 20 bits of storage for a 16-bit value.

[Fig. 3: PRES 16-bit example: (a) the 4x4 block, (b) the generation of parents, (c) the generation of children.]

Let us assume that the bits of B are arranged in a 4 x 4 matrix [Figure 3(a)]. In the first phase of PRES, PREM is applied to the matrix from Left-to-Right (LR), Right-to-Left (RL), Top-to-Bottom (TB), and Bottom-to-Top (BT). Note that TB is equivalent to applying LR to the transposed matrix. Each function generates one codeword that we call a parent [Figure 3(b)]. In the next phase, each parent uses the three other functions to generate three additional codewords, or children. For instance, the parent generated by LR uses the RL, BT and TB functions to create three children [Figure 3(c)]. Each parent and child is assigned a unique 4-bit index. PRES then compares the 16 generated codewords concatenated with their indices (i.e., W_0, ..., W_15) with D (i.e., 0x00000) and selects the W_i with the minimum Hamming weight. According to Figure 3, the codeword with the minimum Hamming weight is at index i = 4, so W_4 is written to memory. It is important to note that PRES actually offers 17 codeword candidates, because the original data B can itself be a codeword.
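Reusing the PREM sketch of Section 5.1, a possible rendering of the first PRES phase on a 4 x 4 matrix is sketched below. The exact bit ordering within rows and the treatment of the reversed directions are our assumptions, since the paper specifies only the four directions LR, RL, TB and BT.

    def prem_encode(B):
        # PREM, Eq. (5): P_1 first, P_2..P_{n-1} serially, P_0 last.
        n = len(B)
        P = [0] * n
        P[1] = B[0] ^ B[1]
        for i in range(2, n):
            P[i] = P[i - 1] ^ B[i]
        P[0] = P[n - 1] ^ B[0]
        return P

    def encode_rows(M, reverse=False):
        # Apply PREM to each row, scanning left-to-right or right-to-left.
        out = []
        for row in M:
            r = row[::-1] if reverse else row
            p = prem_encode(r)
            out.append(p[::-1] if reverse else p)
        return out

    def transpose(M):
        return [list(col) for col in zip(*M)]

    def pres_parents(M):
        # First-phase parents: LR, RL, and TB/BT as LR/RL on the transpose.
        return [
            encode_rows(M),
            encode_rows(M, reverse=True),
            transpose(encode_rows(transpose(M))),
            transpose(encode_rows(transpose(M), reverse=True)),
        ]

    # B = 0xBC07 laid out row-major in a 4x4 matrix, as in Figure 3(a).
    bits = [(0xBC07 >> (15 - i)) & 1 for i in range(16)]
    M = [bits[r * 4:(r + 1) * 4] for r in range(4)]
    parents = pres_parents(M)

Each parent would then be re-encoded with the three remaining direction functions to produce its children, giving the 16 indexed candidates W_0, ..., W_15 that are compared against D.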

If there is a desire to use the original data as one of the codeword candidates, rather than using an additional index bit, we can systematically replace one of the generated codewords with B. In this example, only p = 2 patterns (horizontal and vertical) were used to generate 16 pseudo-random codewords. With additional index bits (e.g., 6 bits), PRES can utilize other patterns, such as southwest-to-northeast and southeast-to-northwest diagonals, to increase the number of generated pseudo-random codewords from 16 to 64. In this case, although the computational complexity of the decoder does not change, the computational complexity of the encoder increases significantly. In the next section, we propose an encoder and decoder based on storing base coset elements, which has lower computational complexity than PRES.

6 RANDOMLY GENERATED COSET (RGC)

In this section, we use pure random data to build a base coset from which other cosets are derived. The base coset is stored in both the encoder and the decoder. Without loss of generality, we assume that a base coset has 2^k unique elements. Each (n+k)-bit element consists of two parts: a k-bit unique index, i, and n-bit random data, C_i. We provide a simple example to clarify the coset generation using random data. Assume we have a base coset as shown in Figure 4.a.1, consisting of 2^k (n+k)-bit elements, where k = 2 and n = 4. For each row in the base coset, although the value C_i is randomly selected, a unique index i is assigned.

[Fig. 4: The generation of 16 cosets using the base coset (4.a.1). The elements of the first coset, coset_0000, containing the vectors C_0 = 1100, C_1 = 1001, C_2 = 0110, and C_3 = 0001, are randomly selected from the 16 possible 4-bit vectors.]

The 4-bit input data, B, can be any element of the set of 2^4 possible strings of zeros and ones. XORing the bits of C_i with B, 2^4 unique cosets (coset_B = C_i ⊕ B; i = 0, ..., 3) are generated. For example, all elements of the base coset are mapped from Figure 4.a.1 to Figure 4.a.2 by XORing C_i with B = 0001 to create coset_0001. Moreover, the set of all possible bit vectors is split into 2^4 cosets indexed by all possible values of B, so that each coset contains 2^2 unique elements indexable by i. (The base coset is sometimes referred to as the zero coset because coset_0000 is identical to the base coset.)

6.1 Coset Encoder

We now propose an architecture for minimizing writes using the randomly generated coset. As shown in Figure 5, the main core of the encoder and the decoder is the base coset. The base coset includes K = 2^k blocks that store n-bit random data. The role of the base coset in the encoder is to map datawords from the set of 2^n possibilities to a larger set of 2^{n+k} possible (n+k)-bit codewords. The larger set of codewords gives the encoder much more flexibility to select the codeword W_i to be written to the memory block such that the Hamming distance to the codeword D already stored in the memory block is minimized. The middle part of the encoder, i.e., the algorithm part, finds the codeword with the minimum Hamming weight.
[Fig. 5: Coset Encoder and Coset Decoder.]

The decoder utilizes the base coset, along with the index stored in the codeword, to find the base coset element used in the encoder and retrieve the original data block. To encode the dataword B, the encoder maps B to the suitable coset (i.e., coset_B) from an arbitrary base coset containing unique random coset elements.

Let us present an algorithm to encode B, assuming that the random base coset is stored in both the encoder and decoder. Given the base coset with 2^k blocks, 2^k n-bit random vectors are stored, where C_{i,j} denotes the j-th bit of the vector C_i. B is an n-bit data block that is to be written to an (n+k)-cell memory block that already stores data D, such that the first n cells of the block contain data and the last k cells store the corresponding codeword index. To select the codeword W_i encoded from B to store in the memory block, Algorithm 1 is applied. In each iteration i of the algorithm (line 2), C_i (the i-th element of the base coset) is selected to be XORed with B. The results of the n XOR operations (lines 3 to 4) form the first n bits of W_i. The last k bits of W_i record the binary representation of the index i (lines 5 to 7), where % and / denote the modulo and integer division operations. Thus, W is coset_B of the base coset. To minimize the number of bits flipped between the new write and the old contents, the (n+k)-cell memory block D is XORed with each W_i (lines 8 to 10) to form the two-dimensional array X_{i,j}, which records the bit differences between the old data, D, and the elements of the new coset, W_i. Then, the encoder computes the Hamming weight of each X_i (lines 11 and 12) and finds the X_i with the minimum Hamming weight (line 13). Finally, the (n+k)-bit codeword W_i is stored into memory (line 14). Sometimes more than one i is discovered at line 13, meaning that multiple coset elements minimize the number of bits flipped per write. In this case, Algorithm 1 selects the smallest such i.

Algorithm 1: Coset Encoder
 1  begin
 2    for i <- 0 to 2^k - 1 do
 3      for j <- 0 to n - 1 do
 4        W_{i,j} <- C_{i,j} ⊕ B_j
 5      i' <- i; for j <- n to n + k - 1 do
 6        W_{i,j} <- i' % 2
 7        i' <- i' / 2
 8    for i <- 0 to 2^k - 1 do
 9      for j <- 0 to n + k - 1 do
10        X_{i,j} <- W_{i,j} ⊕ D_j
11    for i <- 0 to 2^k - 1 do
12      S_i <- sum(X_{i,j}, j = 0, ..., n + k - 1)
13    if S_i <= S_{i'} for all i' != i then
14      Out <- W_i

6.2 Coset Decoder

To retrieve the dataword, Algorithm 2 is applied. To determine which coset element was used to form the codeword, lines 1 to 3 retrieve the index i used in the encoder from the last k bits. Then, XORing the corresponding bits of W_{i,j} and C_{i,j} recovers the original dataword, B (lines 4 to 5).

Algorithm 2: Coset Decoder
 1  begin
 2    for j <- n to n + k - 1 do
 3      i <- i + W_{i,j} * 2^{j-n}
 4    for j <- 0 to n - 1 do
 5      B_j <- W_{i,j} ⊕ C_{i,j}

Scrutinizing the coset encoder in Algorithm 1, we observe that the main part of the encoder can be implemented in parallel because the base coset elements, C_i, are independent and there is no feedback between neighboring bits of C_i. Accordingly, lines 2 to 7 can be merged with lines 8 to 10 to produce the first n bits of W_{i,j} directly. Likewise, while finding the index of the base coset element used in the encoder, Algorithm 2 can simultaneously retrieve the original bits of the dataword.
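A direct, unoptimized Python transcription of Algorithms 1 and 2 (our sketch, with bits stored as Python lists and ties broken toward the smallest i, as the text prescribes):

    def rgc_encode(B, D, coset):
        # Algorithm 1: try every base coset element, append the index bits,
        # and keep the candidate with the fewest differences from D.
        K = len(coset)                  # assumed to be a power of two (2^k)
        k = K.bit_length() - 1
        best, best_flips = None, None
        for i in range(K):
            W = [c ^ b for c, b in zip(coset[i], B)]     # lines 3-4
            W += [(i >> j) & 1 for j in range(k)]        # lines 5-7: index bits
            flips = sum(w ^ d for w, d in zip(W, D))     # lines 8-12
            if best_flips is None or flips < best_flips: # line 13, smallest i wins
                best, best_flips = W, flips
        return best

    def rgc_decode(W, coset, n):
        k = len(W) - n
        i = sum(W[n + j] << j for j in range(k))         # lines 2-3: recover index
        return [w ^ c for w, c in zip(W[:n], coset[i])]  # lines 4-5

    # Base coset of Fig. 4: C_0=1100, C_1=1001, C_2=0110, C_3=0001 (k=2, n=4).
    coset = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1]]
    B, D = [1, 0, 1, 1], [0, 0, 0, 0, 0, 0]
    W = rgc_encode(B, D, coset)
    assert rgc_decode(W, coset, 4) == B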
7 HYBRID COSET (HC)

Recall that write minimization results depend on the input dataset. While FNW greatly outperforms FlipMin on biased data, FlipMin has been shown to do better with unbiased data [17]. Moreover, we will show in Section 8.1 that the random coset does better than FNW with unbiased data. This discrepancy reveals the lack of tunability and flexibility of existing base cosets across different types of input datasets. This motivates a new type of base coset, the hybrid coset, which combines the randomly generated coset and the FNW coset. Our hypothesis is that by replacing a percentage of randomly generated coset elements with FNW coset elements, we can significantly improve the bit flip reduction for biased data while suffering only a negligible degradation of bit flip reduction for unbiased data compared to an entirely randomly generated coset. As explained in Section 2, each FNW coset element is constructed from k sub-vectors, C_{i,j}, j = 0, ..., k-1, each consisting of n/k bits: given the overhead bits h_0, ..., h_{k-1}, the n/k bits of C_{i,j} are all zeroes if h_j = 0 and all ones if h_j = 1. For example, if k = 4 (that is, K = 16), the FNW base coset contains 16 vectors, identifiable by the corresponding overhead-bit patterns 0000, 0001, 0010, ..., 1111, respectively. We choose 2^alpha FNW coset elements and 2^k - 2^alpha randomly generated coset elements from the 2^k base coset elements.

[Fig. 6: Access frequency distribution of the 16 FNW coset elements (k = 4) across astar, gcc, leslie3d, mcf, soplex, zeusmp, canneal, lbm, libquantum, omnetpp, wrf and milc.]

Therefore, the proportion of FNW coset elements to randomly generated coset elements can play a significant role in the results for random and biased inputs. To find the 2^alpha FNW coset elements, we ranked the FNW coset elements based on the access frequency distribution [21] for biased data and selected the elements with the highest access frequency. Figure 6 shows the access frequency distribution for the 64-bit FNW coset elements with k = 4 on 12 different applications. The first four FNW patterns from the left cover roughly up to 71.50% of the access frequency distribution. Also, Figure 7 shows that the first 64 FNW coset elements achieve up to 80.25% of the access frequency distribution when k = 8 (the number of coset elements increases from 16 to 256). Note that each bit in Figures 6 and 7 represents the content of two bytes and one byte, respectively. Based on Figures 6 and 7, we conclude that a hybrid coset can be tuned with 25% FNW coset elements (4 of 16 patterns for k = 4 and 64 of 256 patterns for k = 8), with the remaining 75% being randomly generated coset elements, in order to cover at least 71% of the best codewords from the entire FNW coset for biased data while still providing good coverage of the unbiased data space from the randomly generated coset. In the next section, we show that the hybrid coset with 75% randomly generated coset elements and 25% FNW coset elements obtains competitive performance for both biased and unbiased datasets. Note that this ratio is sensitive to the datasets used. While selecting more RGC-based coset elements improves the performance (bit flip reduction) for datasets with random data patterns, it sacrifices performance for datasets with biased patterns. Similarly, more FNW-based coset elements improve the performance for datasets with biased patterns.

8 EXPERIMENTAL RESULTS

In this section, we evaluate our proposed schemes, specifically PRES, RGC, and HC, against the leading bit minimization techniques of FlipMin and FNW for unbiased and biased datasets, respectively. Our experiments span different budgets for space overhead. The results show that HC is the best candidate for systems that process both unbiased and biased datasets. Moreover, we measure the degree of randomness of the data vectors generated by the encoder of each scheme and consider the implications of these results.

8.1 Bit Flip Reduction for Unbiased Data

To compare the reduction in bit flips achieved by each scheme with respect to differential write as a baseline for unbiased data, we generated 100 million random inputs. We assign 4 auxiliary bits (k = 4) to RGC, i.e., the base coset encompasses 16 different elements. In addition, we consider p = 2 for PRES to generate 16 pseudo-random codewords from the input data. Finally, we compared with HC. Recall from Section 7 that HC can be tuned through the combination of randomly generated coset elements and FNW coset elements and that 25% from FNW and 75% from RGC provided good coverage. Thus, four FNW coset elements were combined with 12 randomly generated coset elements for k = 4.
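This 4 + 12 combination is easy to express in code. The sketch below is ours; the four FNW patterns passed in are hypothetical stand-ins for the four most frequent patterns read off Figure 6, and the RNG seed is arbitrary.

    import random

    def fnw_element(pattern, n, k):
        # Expand a k-bit inversion pattern into an n-bit FNW coset element:
        # sub-vector j is all ones iff bit j (MSB first) of the pattern is 1.
        m = n // k
        bits = []
        for j in range(k):
            bits += [(pattern >> (k - 1 - j)) & 1] * m
        return bits

    def hybrid_coset(n, k, fnw_patterns, seed=7):
        # 2^alpha FNW elements plus 2^k - 2^alpha random elements,
        # e.g. 4 + 12 for k = 4.
        rng = random.Random(seed)
        coset = [fnw_element(p, n, k) for p in fnw_patterns]
        while len(coset) < (1 << k):
            coset.append([rng.getrandbits(1) for _ in range(n)])
        return coset

    # Hypothetical choice of the four most frequent inversion patterns;
    # the RGC encoder of Section 6 is reused on this base coset as is.
    coset = hybrid_coset(64, 4, [0b0000, 0b1111, 0b0001, 0b1000])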
Finally, we vary the data block size from 32 to 512 bits, covering an encoding overhead range from 12.5% to less than 1%. Figure 8 shows that RGC, HC and PRES require flipping fewer bits and outperform both FlipMin and FNW across the different block sizes and space overheads. RGC requires 15-25% fewer bit flips than FlipMin and FNW, depending on the block size configuration. Since both RGC and PRES generate random codewords, they have roughly the same bit flip reduction across the different overheads. Moreover, Figure 8 shows that replacing four randomly generated coset elements with four FNW coset elements still maintains the positive bit flip minimization effect on the random dataset.

To achieve higher flip reduction rates, the number of elements of the base coset can be increased to allow more encoding candidates to be considered. Accordingly, we extend the number of auxiliary bits to eight (256 elements for RGC) and consider HC with the combination of the 64 FNW coset elements shown in Figure 7 and 192 randomly generated coset elements; we plot our findings in Figure 9. The results show that for the same space overhead, the flip reduction capabilities of all five schemes improve significantly. For a space overhead of 12.5%, the flip reduction capability of RGC increased from 21% to 25%, which amounts to a 20% improvement, and the flip reduction capability of HC reached 24.2%, an 18% improvement. For a space overhead of 0.78125%, while RGC and HC require up to 6% fewer bit flips than PRES, they obtain up to 30% fewer bit flips than either FlipMin or FNW.

[Fig. 7: The selection of the 64 FNW coset elements with the highest access frequency from the 256 possible FNW coset elements (k = 8).]

[Fig. 8: Bit flip reduction over differential write for unbiased data with 4 auxiliary bits and block sizes of 32 (12.5%), 64 (6.25%), 128 (3.125%), 256 (1.5625%) and 512 (0.78125%) bits. The value in parentheses represents space overhead.]

[Fig. 9: Bit flip reduction over differential write for unbiased data with 8 auxiliary bits and block sizes of 64 (12.5%), 128 (6.25%), 256 (3.125%), 512 (1.5625%) and 1024 (0.78125%) bits. The value in parentheses represents space overhead.]

We note that with a larger base coset, FlipMin gets closer to the capability of RGC and HC for blocks with higher space overhead. However, the performance overhead of FlipMin is significantly larger than that of RGC, HC, PRES and FNW, as we discuss in Section 9.

8.2 Bit Flip Reduction for Biased Data

We evaluated the performance of RGC, HC, FlipMin and FNW for biased inputs in Figures 10 and 11. (We use twelve write-intensive benchmarks from SPEC CPU2006 and SPEC JBB2005 and the only write-intensive benchmark, canneal, from the PARSEC suite.) Since both RGC and PRES generate essentially random codewords regardless of the type of input data, i.e., biased or unbiased, both have a similar effect on biased data. On average, RGC produces a 33.8% bit flip reduction over the baseline (differential write) for blocks with four auxiliary bits, while FlipMin and FNW produce 47.2% and 56.4% bit flip reductions over the baseline, respectively. Compared to RGC, HC on average achieves up to 55.92% bit flip reduction over the baseline for a space overhead of 6.25%. For eight auxiliary bits, the flip reduction capability of the randomly generated coset increased from 33.8% to 38.0%, while the flip reduction capabilities of FlipMin and FNW increased from 47.4% and 57.0% to 48.8% and 59.6%, respectively. Figure 11 shows that the performance of HC improves from 56.0% to 58.7% on average. On all benchmarks, FNW and HC outperform RGC and FlipMin, even though RGC outperforms FNW on unbiased inputs. This discrepancy highlights how write minimization results depend on the inputs. According to Figures 10 and 11, using HC is much more effective than using random elements alone when biased inputs are employed. Therefore, employing the combination of 75% randomly generated coset elements and 25% FNW coset elements in HC generally improves performance for biased inputs over purely random encoding strategies. This is supported by the results demonstrating that the HC scheme achieves the best bit flip reduction for biased inputs in comparison to FlipMin and RGC (56.4% for a space overhead of 6.25% and 58.7% for a space overhead of 12.5%) and almost the same bit flip reduction as FNW. We conclude that using the combination of randomly generated coset elements and FNW coset elements in the base coset is better than either coset alone in terms of bit flip reduction for a system that handles both biased and unbiased inputs.
Compression was proposed as a method to reduce the overhead of writing to nonvolatile memory [22]. It is well known that compression increases the randomness of the data and hence will benefit our proposed schemes.
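The compression scheme used in the experiments below is Base-Delta-Immediate (BDI) [22]. The toy sketch that follows captures only its core base-plus-narrow-deltas idea; it is our simplification, since real BDI tries several base sizes and delta widths and handles all-zero and repeated-value blocks specially.

    def bdi_compress(values, delta_bytes=1):
        # Toy base + delta: keep the first word as the base and store each
        # value as a narrow signed delta from it; bail out if any delta
        # does not fit (the block is then stored uncompressed).
        base = values[0]
        deltas = [v - base for v in values]
        limit = 1 << (8 * delta_bytes - 1)
        if all(-limit <= d < limit for d in deltas):
            return base, deltas        # ~4 + len(values) * delta_bytes bytes
        return None                    # incompressible block

    assert bdi_compress([100, 101, 98, 103]) == (100, [0, 1, -2, 3])

The compressed output is then fed to the encoders exactly like an uncompressed block.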

[Fig. 10: Comparison of write minimization schemes (RGC, FlipMin, HC, FNW) using 4 auxiliary bits for SPEC CPU2006 and SPEC JBB2005 inputs (number of bit flips; less is better).]

[Fig. 11: Comparison of write minimization schemes (RGC, FlipMin, HC, FNW) using 8 auxiliary bits for SPEC CPU2006 and SPEC JBB2005 inputs (number of bit flips; less is better).]

To verify this hypothesis, we compressed 32-byte blocks of the different applications using Base-Delta-Immediate (BDI) compression [22] and then encoded the compressed applications using the different encoding schemes. Since compressing data increases randomness, the experimental results show that the combination of random coset elements and FNW-based coset elements on average significantly minimizes the number of bit flips compared to the other schemes. Note that some applications compress very well while others cannot be compressed by the compression algorithm. Figures 12 and 13 show the average number of bit flips for the different schemes when data is compressed before encoding with 4 and 8 auxiliary bits, respectively. According to the experimental results, compressing data before encoding improves HC performance relative to the other schemes. On average, HC, FNW, FlipMin and RGC with 4 auxiliary bits achieve bit flip reductions of about 77%, 56%, 63% and 52% against the differential write scheme, respectively. With 8 auxiliary bits, HC, FNW, FlipMin and RGC improve the bit flip reduction by up to 79%, 60%, 65% and 56% against the differential write scheme, respectively.

8.3 Randomness Measurement

Figures 8 and 9 showed that RGC outperforms both FlipMin and FNW in bit flip reduction capability for unbiased data. We attribute those results to the fact that RGC generates a base coset with elements that are more random than the elements of the base cosets of FlipMin and FNW. The NIST SP 800-22 tests [23] are well-known quantitative tests that measure the randomness of a set of data vectors. Accordingly, we have used these tests to measure the randomness of the codewords produced by the encoders of RGC, HC, PRES, FlipMin and FNW. Our measurements reveal that the data vectors generated by RGC, HC, PRES, FlipMin and FNW pass 100%, 95%, 99%, 96% and 89% of the tests, respectively. These findings support our rationale that the higher the randomness of the base coset elements, the higher the bit flip reduction that can be achieved for unbiased inputs. To achieve satisfactory results when both unbiased and biased inputs are employed, a small percentage of random coset elements can be replaced by targeted FNW coset elements at the cost of slightly reducing the randomness of the base coset elements. However, high randomness is not necessarily valuable for biased data. FlipMin also attempts to address general datasets through coding theory and has a randomness similar to HC; however, the results from Sections 8.1 and 8.2 indicate that HC is significantly more effective for actual bit flip reduction due to the better targeted nature of its base coset elements.
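As a flavor of what the NIST SP 800-22 suite measures, its simplest member, the frequency (monobit) test, fits in a few lines. This is our sketch of that one standard test, not the full suite used for the measurements above.

    import math

    def monobit_pvalue(bits):
        # NIST SP 800-22 frequency (monobit) test: the normalized excess of
        # ones over zeros should look Gaussian for a random sequence.
        n = len(bits)
        s = sum(2 * b - 1 for b in bits)   # +1 for each one, -1 for each zero
        return math.erfc(abs(s) / math.sqrt(2 * n))

    # Example: a bit stream of concatenated encoder outputs; the sequence
    # passes at the usual threshold p >= 0.01.
    assert monobit_pvalue([0, 1] * 500) > 0.01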
9 CODING COMPLEXITY

We applied Synopsys Design Compiler to synthesize the encoders and decoders of the different schemes using a 45nm FreePDK cell library. To store the coset elements of FlipMin and the Hybrid Coset (HC), we used CACTI [24], modified to model SRAM cells and estimate the area and delay of the storage. We assume that FlipMin, HC and PRES take 64-bit inputs and utilize 4 auxiliary bits for encoding the input data, and that their decoders retrieve 64-bit original datawords. Note that the overhead of FlipMin is similar to that of HC, since they leverage the same mechanisms for encoding and decoding when the coset elements are stored in SRAM.

[Fig. 12: Comparison of write minimization schemes using 4 auxiliary bits after compressing SPEC CPU2006 and SPEC JBB2005 inputs (number of bit flips; less is better).]

[Fig. 13: Comparison of write minimization schemes using 8 auxiliary bits after compressing SPEC CPU2006 and SPEC JBB2005 inputs (number of bit flips; less is better).]

[TABLE 2: The encoder and decoder overheads (delay in ns, area in µm^2, number of cells, and number of cells in the critical path) of the PRES (16 counters / 1 counter), HC/FlipMin (16 counters / 1 counter) and FNW encoders, and of the PRES, HC/FlipMin and FNW decoders.]

To compare against the overhead of FNW, which is 16 bits wide, we multiply the area overhead of FNW by 4. Table 2 shows the overheads of the different schemes. The dominant portion of the encoding cost (area and latency) for FlipMin, PRES, and HC is in the ones counter that determines the best codeword to write. Thus, we implemented an entirely parallel version (16 encoders and ones counters) and a sequential version (an individual encoder and ones counter that is pipelined). The number of cells used in the PRES encoder for the parallel version is about the same as the number of cells used in HC/FlipMin, and the number of cells on the critical path of PRES is 23.5% smaller than in HC/FlipMin because, in PRES, it is possible to count the number of ones in the codeword while the bits are being computed by the XOR operations (see Figure 2). Accordingly, the PRES encoder reduces the delay by up to 20.9% compared to HC/FlipMin. However, the most critical latency is that of decoding, since this is typically the limiting factor in performance; encoding happens during writing, which is typically not on the critical path due to memory buffers. According to Table 2, although the HC/FlipMin encoder incurs a higher overhead than the FNW encoder, the HC/FlipMin decoder achieves an overhead similar to FNW. Meanwhile, the HC/FlipMin decoder decreases the delay by up to 89.18% compared to PRES, and the number of cells used in the HC/FlipMin decoder is 47 times smaller than in the PRES decoder. In the sequential version, HC/FlipMin has performance identical to PRES, since we pipelined the design of both encoders with the ones counter in the critical stage of the pipeline; the encoding and the ones counting occur in separate cycles that dominate the delay. Extending the number of auxiliary bits provides higher flip reduction because the number of coset elements in HC/FlipMin and of pseudo-random codewords in PRES can be increased to allow the currently stored data to be compared against more data vectors.

This improvement comes at a higher computational overhead, as a larger codeword set requires more elements to compare against. For k = 8, the storage space and gate counts of HC/FlipMin increase by up to 16 times compared to k = 4, since the number of coset elements increases from 16 to 256. Meanwhile, PRES requires an increase in the number of codewords by up to 16 times, which amounts to 16 times the space overhead. On the other hand, using one counter or 16 counters during the encoding process does not change much the space occupied by each counter in comparison to the 4 auxiliary bit case.

10 CONCLUSION

The relatively high write energy is one of the major weaknesses of emerging non-volatile memories. Accordingly, bit change reduction schemes are a particularly successful approach to reduce the impact of this overhead through the minimization of the number of bits changed per write. In this paper, we show that the effectiveness of coset-based write minimization techniques is directly correlated with how well the biased and unbiased codewords in the encoder match the biased and unbiased nature of the dataset to be encoded. We further show that random elements forming the base coset work effectively on unbiased datasets, while FNW approaches work effectively for biased datasets. Finally, we find that replacing a number of randomly generated coset elements with selected FNW coset elements can dramatically improve the randomly generated coset approach for biased data while maintaining the positive effect of the random coset elements on random datasets. Our analyses and experimental results showed that the codewords generated by the hybrid coset encoder lead to fewer bit flips than those generated by the other schemes. Also, the HC decoder not only decreases the delay in the critical path by up to 89.18% compared to PRES, but also achieves a very low overhead similar to FNW. Overall, the hybrid coset encoder contributes to overcoming the dynamic energy and reliability challenges of emerging non-volatile memories, advancing their eventual deployment in commercial systems.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees for their valuable comments and suggestions to improve this paper.

REFERENCES

[1] M. K. Qureshi, S. Gurumurthi, and B. Rajendran, "Phase change memory: From devices to systems," Synthesis Lectures on Computer Architecture, vol. 6, no. 4.
[2] E. Chen, D. Apalkov, Z. Diao, A. Driskill-Smith, D. Druist, D. Lottis, V. Nikitin, X. Tang, S. Watts, S. Wang et al., "Advances and future prospects of spin-transfer torque random access memory," IEEE Transactions on Magnetics, vol. 46, no. 6.
[3] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, "An energy efficient cache design using spin torque transfer (STT) RAM," in Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design, 2010.
[4] X. Guo, E. Ipek, and T. Soyata, "Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing," ACM SIGARCH Computer Architecture News, vol. 38, no. 3.
[5] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," ACM SIGARCH Computer Architecture News, vol. 37, no. 3, pp. 2-13.
[6] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," ACM SIGARCH Computer Architecture News, vol. 37, no. 3.
[7] Y. Zhang, L. Zhang, W. Wen, G. Sun, and Y. Chen, "Multi-level cell STT-RAM: Is it realistic or just a dream?" in IEEE/ACM International Conference on Computer-Aided Design, 2012.
[8] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M. Shelby, M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung et al., "Phase-change random access memory: A scalable technology," IBM Journal of Research and Development, vol. 52, no. 4.5.
[9] W. Wen, Y. Zhang, Y. Chen, Y. Wang, and Y. Xie, "PS3-RAM: A fast portable and scalable statistical STT-RAM reliability analysis method," in Proceedings of the 49th Annual IEEE Design Automation Conference, 2012.
[10] Y. Zhang, W. Wen, and Y. Chen, "The prospect of STT-RAM scaling from readability perspective," IEEE Transactions on Magnetics, vol. 48, no. 11.
[11] R. Maddah, S. M. Seyedzadeh, and R. Melhem, "CAFO: Cost aware flip optimization for asymmetric memories," in 21st IEEE International Symposium on High Performance Computer Architecture, 2015.
[12] B. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger, "Phase-change technology and the future of main memory," IEEE Micro.
[13] B.-D. Yang, J.-E. Lee, J.-S. Kim, J. Cho, S.-Y. Lee, and B.-G. Yu, "A low power phase-change random access memory using a data-comparison write scheme," in IEEE International Symposium on Circuits and Systems, 2007.
[14] A. N. Jacobvitz, R. Calderbank, and D. J. Sorin, "Writing cosets of a convolutional code to increase the lifetime of flash memory," in 50th Annual Allerton Conference on Communication, Control, and Computing, 2012.
[15] J. Li and K. Mohanram, "Write-once-memory-code phase change memory," in Design, Automation and Test in Europe Conference and Exhibition, 2014.
[16] S. Cho and H. Lee, "Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance," in 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.
[17] A. N. Jacobvitz, R. Calderbank, and D. J. Sorin, "Coset coding to extend the lifetime of memory," in 19th IEEE International Symposium on High Performance Computer Architecture, 2013.
[18] S. M. Seyedzadeh, R. Maddah, A. Jones, and R. Melhem, "PRES: Pseudo-random encoding scheme to increase the bit flip reduction in the memory," in 52nd ACM/EDAC/IEEE Design Automation Conference, 2015.
[19] G. D. Forney Jr, "Coset codes. I. Introduction and geometrical classification," IEEE Transactions on Information Theory, vol. 34, no. 5.
[20] I. Reed, "A class of multiple-error-correcting codes and the decoding scheme," Transactions of the IRE Professional Group on Information Theory, vol. 4, no. 4.
[21] Y. Du, M. Zhou, B. R. Childers, D. Mossé, and R. Melhem, "Bit mapping for balanced PCM cell programming," in ACM SIGARCH Computer Architecture News, vol. 41, no. 3, 2013.
[22] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "Base-delta-immediate compression: practical data compression for on-chip caches," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, 2012.
[23] A. Rukhin et al., "NIST Special Publication 800-22: A statistical test suite for random and pseudorandom number generators for cryptographic applications."

[24] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to understand large caches," University of Utah and Hewlett-Packard Laboratories, Tech. Rep.

Seyed Mohammad Seyedzadeh received a B.S. degree in Electrical Engineering from Shiraz University of Technology in 2007 and an M.S. degree in Electrical Engineering from Iran University of Science and Technology. He is currently a Ph.D. student in computer engineering at the University of Pittsburgh. His main research interests include computer architecture, fault tolerance, and coding theory.

Rakan Maddah received his B.S. and M.S. degrees in Computer Science from the Lebanese American University in 2007 and 2009, respectively. He joined the Computer Science Department at the University of Pittsburgh as a Ph.D. student in 2010 and earned his degree there. He is now a senior engineer with Intel's Non-Volatile Solutions Group, working on memory and storage products. His research interests are in computer architecture, systems and fault tolerance.

Donald Kline, Jr received his Bachelor of Science in Computer Engineering from the University of Pittsburgh in the spring of 2015, and is currently pursuing his Ph.D. in Electrical and Computer Engineering at the University of Pittsburgh under the guidance of Dr. Alex Jones and Dr. Rami Melhem. His research interests currently include computer architecture, memories, compilers, and machine learning.

Alex K. Jones received the BS degree in 1998 in physics from the College of William and Mary in Williamsburg, Virginia, and the MS and PhD degrees in 2000 and 2002, respectively, in electrical and computer engineering from Northwestern University. He is currently the Director of Computer Engineering and an Associate Professor of Electrical and Computer Engineering and Computer Science at the University of Pittsburgh, Pennsylvania. He is a Walter P. Murphy Fellow of Northwestern University and a senior member of the IEEE and ACM. Dr. Jones' research interests include compilation techniques for configurable systems and architectures, behavioral and low-power synthesis, parallel architectures and networks, radio frequency identification (RFID) and sensor networks, sustainable computing, and embedded computing for medical instruments. He is the author of more than 100 publications in these areas. His research is funded by the U.S. National Science Foundation, DARPA, CCC, and industry. Dr. Jones' contributions have received several awards, including the 2010 ACM/SIGDA Distinguished Service Award and recognition of a top 25 paper from the first 20 years of FCCM. Recently, Dr. Jones led a visioning effort for the electronic design automation community funded by the Computing Community Consortium (CCC). Dr. Jones is also actively involved in efforts to improve the scientific method for experiments in computer science and engineering, to develop methods for reproducible research, and to build a centralized hub for computer architecture simulators, emulators, benchmarks and experiments.

Rami Melhem received a B.E. in Electrical Engineering from Cairo University in 1976, an M.A. degree in Mathematics and an M.S. degree in Computer Science from the University of Pittsburgh in 1981, and a Ph.D.
degree in Computer Science from the University of Pittsburgh in He was an Assistant Professor at Purdue University prior to joining the faculty of The University of Pittsburgh in 1986, where he is currently a Professor in the Computer Science Department which he chaired from 2000 to His research interests include Power Management, Parallel Computer Architectures, Fault-Tolerant Systems, Optical Networks and High Performance Computing. Dr. Melhem served and is serving on program committees of numerous conferences and workshops and on the editorial boards of the IEEE Transactions on Computers, the IEEE Transactions on Parallel and Distributed systems, the Computer Architecture Letters, the Journal of Parallel and Distributed Computing and the Journal of Sustainable Computing, Informatics and Systems. Dr. Melhem is a fellow of IEEE and a member of the ACM.
