ABSTRACT. Efficient detectors for LTE uplink systems: From small to large systems. Michael Wu


ABSTRACT

Efficient detectors for LTE uplink systems: From small to large systems

by Michael Wu

3GPP Long Term Evolution (LTE) is currently the most popular cellular wireless communication standard. Future releases of the 3GPP specifications consider large-scale (or massive) multiple-input multiple-output (MIMO), an emerging technology where the base station (BS) is equipped with hundreds of antennas. Although large-scale MIMO improves the spectral efficiency, link reliability, and coverage over that of conventional (small-scale) MIMO systems, the dimensionality of large-scale systems increases the computational complexity of uplink data detection significantly. In this thesis, we present efficient data detection algorithms for the LTE uplink and analyze the performance-complexity trade-off for small to large-scale MIMO systems. We first propose an iterative detection and decoding (IDD) scheme which combines frequency-domain minimum mean-square error (FD-MMSE) equalization with parallel interference cancellation (PIC), and we show that this scheme achieves near-optimal detection performance if the number of BS antennas exceeds the number of users by roughly 2×. For (symmetric) small-scale MIMO systems, IDD significantly reduces the frame error rate (FER), while the gains with large-scale MIMO are comparably smaller, which suggests linear FD-MMSE detection

is sufficient for large-scale MIMO systems. Linear FD-MMSE detection, however, still requires a computationally complex matrix inversion. For systems with very large ratios between the number of BS and user antennas, matrix inversion is performed on a strongly diagonally dominant matrix. We investigate a variety of exact and approximate equalization schemes that solve the system of linear equations either explicitly (requiring the computation of a matrix inverse) or implicitly (by directly computing the solution vector), and we analyze the associated performance/complexity trade-offs. We show that for small base-station (BS)-to-user-antenna ratios, exact and implicit data detection using Cholesky decomposition achieves near-optimal performance at low complexity; for large BS-to-user-antenna ratios, implicit data detection using approximate equalization methods results in the best trade-off. Finally, by combining the advantages of exact, approximate, implicit, and explicit matrix inversion, we propose a new frequency-adaptive equalizer (FADE), which outperforms existing linear data-detection methods in terms of performance and complexity and scales from small-scale MIMO systems to large-scale MIMO systems.

Contents

Abstract
List of Illustrations
List of Tables

1 Introduction

2 MIMO Detection for Small Scale Systems
  Nonlinear Tree-Search Detection
  MIMO OFDM System Model
  Modified Real-Valued Decomposition
  Soft-Output MIMO Detection
  1-Way MIMO Detection
  N-Way Parallel MIMO Detection
  Performance
  Iterative Detection and Decoding
  MIMO SC-FDMA system model
  Soft-input Soft-output FD-MMSE-PIC
  Low-Complexity Adaptive MMSE Equalization
  Simulation Results

3 MIMO Detection for Large Scale Systems
  Linear MMSE Detection
  Explicit vs. Implicit Equalization
  Explicit Approximate MMSE Detection
  Exact Inversion via the Cholesky Decomposition
  Exact Inversion using Series Expansions
  Approximate Inversion using Truncated Series Expansions
  LLR Computation for Explicit Inversion Methods
  Implicit Approximate MMSE Detection
  Exact Inversion using Implicit Cholesky Decomposition
  Implicit Accelerated Neumann Recursion
  Existing Approximate Implicit Equalization Methods
  LLR Approximation for Implicit Inversion Methods
  Initialization Matrices for Approximation MMSE Detection
  Relevance of the Initialization Matrix
  Existing Initialization Matrices
  D^{-1}, A Simple Initialization Matrix
  Two New Initialization Matrices
  Comparison of Empirical Convergence Behavior
  Frequency-Adaptive Equalizer (FADE)
  Exploiting Correlation in Multipath Wireless Channels
  Frequency Adaptive Equalizer (FADE)
  Computational Complexity
  Initialization Methods
  Explicit Series Expansions and Exact Inversion
  Implicit Series Expansions and Exact Inversion
  Complexity of FADE
  Performance and Complexity Trade-off

4 MIMO Detector Implementations
  N-Way MIMO Detector for Small-Scale Systems
  QR Decomposition Kernel
  Candidate Search Kernel
  0-Hypothesis and 1-Hypothesis Generation Kernel
  N-Way Parallel MIMO Detection Kernel
  Implementation Performance
  Linear MIMO detectors for Large-Scale Systems
  Architecture Overview
  Cholesky-based Inversion Unit
  Approximate Inversion
  Matched filter and Equalization
  SINR computation unit
  IFFT and LLR Computation Units
  Implementation Results and Trade-offs

5 Turbo Decoder
  Overview of Turbo Decoding
  Algorithm outline
  Branch-metric computation
  Forward and backward state metric computation
  LLR computation
  Turbo Decoder Implementations
  SIMD data types
  Memory allocation
  Multi-mode interleaver lookup table
  Max operation
  Branch-metric computation
  Forward and backward traversal
  LLR computation
  Multi-codeword decoding
  Implementation Results
  FER performance
  Peak decoding throughput

6 Conclusion

Bibliography

Illustrations

1.1 The evolution of the 3GPP LTE standard toward 5G
2.1 An example of the MIMO detection search process for a QAM MIMO system
2.2 N-way MIMO detector for a 2 × 2 MIMO system
2.3 BER performance of soft-output 4 × 4 N-way MIMO detector in Rayleigh fading channels
Principle of iterative detection and decoding (IDD) in SC-FDMA-based MIMO wireless systems
FER performance comparison for a SC-FDMA-based MIMO wireless system (U = 4)
FER performance comparison for a SC-FDMA-based MIMO wireless system (U = 8)
Average relative error (RE) comparison for different antenna configurations, algorithms, and initialization methods
Illustration of the frame structure of a wideband system. FADE only computes explicit matrix inverses at the base points (in black); equalization at adjacent (in time and frequency) subcarriers is performed using one iteration of the implicit accelerated Neumann series recursion
3.3 The number of real-valued multiplications required for Neumann series expansions depends on the number of users U
Error rate performance vs. complexity trade-off. The complexity is defined as the number of real-valued multiplications and the performance as the SNR operation point, which is the minimum SNR that is required to achieve a BLER of 10%
Effect of optimizations on 4 × 4, 64-QAM MIMO detectors, N = 1, 8192 subcarriers
Throughput of 4 × 4 N-way MIMO detector on GPU vs. workload size
High-level VLSI architecture of the large-scale MIMO detection engine for 3GPP LTE-A
High-level architecture of the systolic array for B =
Block error-rate (BLER) performance comparison for (a) U = 4, (b) U = 8, and (c) U = 12 single-antenna users where M = 64 and MCS = 28; FP designates the performance of a fixed-point implementation
Performance/complexity trade-off. Hardware complexity is defined as the number of DSP48E1 slices required to achieve the LTE-A uplink 75 Mb/s per-user peak throughput
High-level structure of a rate-1/3 3GPP turbo decoder
Structure of the 3GPP trellis. There are 8 states per trellis step and one step per transmitted information bit. The vector s_k consists of all state metrics at trellis step k. The values u_k and p_k are the k-th information bit and the parity bit (both ±1), respectively
5.3 Vectorized implementations of turbo decoder operations: (a) Vectorized computation of α_{k+1} for the 3GPP turbo code. The block vmax implements the vectorized element-wise max operator. (b) Vectorized computation of β_{k−1} for the 3GPP turbo code. (c) Vectorized LLR computation for the 3GPP turbo code. The block hmax reduces 8 elements in the input vector to one element using the max operator
FER performance of (a) log-MAP decoding and (b) max-log-MAP decoding for K = 6144 and 6 decoding iterations on CPU and GPU

Tables

3.1 Complexity of different initialization methods
Complexity of exact and approximate explicit matrix inversion methods for K_max
Complexity of implicit matrix inversion methods
Break-even point for implicit inversion. The break-even point is the smallest U such that the method exhibits lower complexity than the implicit Cholesky decomposition
MIMO detection kernel time for 8192 MIMO symbols
QR decomposition GPU kernel time for 8192 MIMO symbols
Total runtime for 8192 MIMO symbols including data transport time
Throughput comparison of the MIMO detection kernel with other GPU MIMO detectors
Implementation results on a Xilinx Virtex-7 XC7VX980T FPGA
GPU peak throughput for the K = 6144, rate-1/3 LTE turbo code


Chapter 1 Introduction

Multiple-input multiple-output (MIMO) is a key technology in most modern wireless communication standards, such as 3GPP LTE and LTE-Advanced [1] or IEEE 802.11n [2]. For example, Figure 1.1 shows the progression of the 3GPP LTE standard. To meet the constantly increasing demands for higher data rates, the 3GPP LTE standard has evolved from small-scale MIMO systems with a few antennas (e.g., 2 to 8) to large-scale systems with tens to hundreds of antennas for future 5G systems.

Figure 1.1: The evolution of the 3GPP LTE standard toward 5G

To achieve good performance with reasonable complexity, current wireless systems

have a small number of antennas (e.g., 2 to 8). IEEE 802.11n and the LTE downlink combine OFDM with MIMO to increase bandwidth utilization efficiency. For these small-scale systems, soft-output maximum a-posteriori (MAP) detection is the optimal detection scheme, which significantly outperforms linear detection schemes [3]. The optimal detection scheme, however, results in prohibitive complexity even for these small-scale systems. To reduce computational complexity, a number of suboptimal tree-search-based algorithms have been proposed [3–11], but these cannot be efficiently parallelized on programmable multicore processors such as GPUs. In addition, current cellular systems such as LTE and LTE-Advanced rely on single-carrier frequency division multiple access (SC-FDMA) for the uplink to enable the use of cheaper radio-frequency circuitry. Unfortunately, SC-FDMA increases the dimensionality (and hence the complexity) of the underlying detection problem even for these small-scale MIMO systems, rendering tree-search-based algorithms infeasible. As a result, to achieve near-optimal performance in SC-FDMA systems, the best-known receivers rely on iterative detection and decoding (IDD) [12–14] based on linear detection schemes. However, due to the constantly increasing demands for higher data rates, these systems are already approaching their throughput limits. Hence, new wireless transmission technologies are required to provide higher throughput in cellular multi-user systems without further increasing the communication bandwidth. Large-scale MIMO is believed to be the key technology to meet the ever-growing demands for higher

spectral efficiency in the near future [15–18]. The idea of massive MIMO is to equip the base station (BS) with a large number of antennas (e.g., tens to hundreds), while serving a not-so-large number of users concurrently and in the same frequency band. Consequently, future releases of the 3GPP LTE specifications consider large-scale MIMO as a way to improve performance of future 5G cellular systems [1]. Although large-scale MIMO improves spectral efficiency, link reliability, and coverage over conventional (small-scale) MIMO systems, the dimensionality of large-scale systems increases the computational complexity of uplink data detection significantly. As a result, conventional detection algorithms for small-scale MIMO do not scale well to large-scale MIMO systems. In addition, the complexity of linear detection methods can become prohibitive as the dimensionality of large-scale systems increases. Consequently, matched-filter (MF) detection is typically considered for low-complexity detection in large-scale MIMO systems [15]. Unfortunately, MF detection requires extremely large BS-antenna-to-user-antenna ratios. In this thesis, we aim to bridge this gap. In particular, we present efficient data detection algorithms and analyze the performance-complexity trade-off for small to large-scale MIMO systems. We show that the choice of algorithm strongly depends on the ratio between the number of base-station antennas and the number of users. We show that iterative detection and decoding significantly reduces the frame error rate (FER) of symmetric small-scale MIMO systems and that linear MMSE detection is sufficient for large-scale MIMO systems.

We investigate a variety of exact and approximate equalization schemes that solve the system of linear equations either explicitly (requiring the computation of a matrix inverse) or implicitly (by directly computing the solution vector). We analyze the associated performance/complexity trade-offs, and we show that for small base-station (BS)-to-user-antenna ratios, exact and implicit data detection using the Cholesky decomposition achieves near-optimal performance at low complexity; for large BS-to-user-antenna ratios, implicit data detection using approximate equalization methods results in the best trade-off. By combining the advantages of exact, approximate, implicit, and explicit matrix inversion, we develop a new frequency-adaptive equalizer (FADE), which outperforms existing data-detection methods in terms of performance and complexity for wideband massive MU-MIMO systems.

Contributions

We propose efficient algorithms and evaluate their performance-complexity trade-off for small to large-scale MIMO systems.

1. We propose a flexible MIMO detector for small-scale OFDM systems that achieves a wide range of trade-offs between throughput and error-rate performance, and is able to approach the error-rate performance of the optimal soft-output MAP detector [19].
2. We build on the IDD algorithm proposed in [20] for OFDM systems and propose an IDD algorithm for SC-FDMA systems [21].

   (a) We show how the SINR can be obtained in an efficient manner.
   (b) We reduce the complexity of the IDD algorithm for large-scale MIMO using rank-one updates.
3. We propose novel low-complexity linear detection methods [22–24].
   (a) We propose accelerated explicit matrix inversion methods building on the Neumann series expansion.
   (b) We propose corresponding implicit equalization methods, which avoid the computation of a matrix inverse.
   (c) We propose a method to approximate the post-equalization signal-to-interference-and-noise ratio (SINR) values that enable soft-output data detection with implicit equalizers.
   (d) We propose low-complexity initialization schemes that improve the convergence of our iterative algorithms.
   (e) We propose a hybrid explicit/implicit frequency-adaptive equalizer (FADE) that exploits frequency correlation in wideband MIMO wireless systems.

As case studies, we implemented MIMO detectors for both small-scale and large-scale MIMO systems.

1. We provide a GPU design for an N-way MIMO detector that achieves excellent error-rate performance and high throughput on GPUs [25].

   (a) Our implementation achieves a wide range of trade-offs between throughput and error-rate performance, and is able to approach the error-rate performance of the optimal soft-output MAP detector within 0.25 dB.
   (b) Our implementation on Nvidia GPUs achieves a substantially higher throughput than existing soft-output MIMO detectors implemented on GPUs.
2. We provide FPGA designs for approximate and exact MMSE detection and for various antenna configurations [26].
   (a) The resulting FPGA designs are, to the best of our knowledge, the first data detection engines for massive MIMO systems reported in the open literature.
   (b) Both designs achieve a peak uplink throughput exceeding the 300 Mb/s specified in 3GPP LTE-Advanced operating at 20 MHz bandwidth.

In addition, we implemented a turbo decoder on two different high-performance programmable processors, namely a quad-core Intel i7-3770K (Ivy Bridge) and an Nvidia GeForce GTX 680 (Kepler GK104) [27]. These implementations are used extensively in our performance simulations and can be used to enable fast software-defined radio testbeds.

Chapter 2 MIMO Detection for Small Scale Systems

This chapter investigates and proposes MIMO detection algorithms for small-scale MIMO systems. We first investigate detection algorithms for small-scale MIMO OFDM systems. We then propose a flexible tree-search-based MIMO detector that achieves excellent error-rate performance and high throughput on GPUs. We show that the proposed design achieves a wide range of trade-offs between throughput and error-rate performance, and is able to approach the error-rate performance of the optimal soft-output MAP detector within 0.25 dB. However, sphere detection cannot be readily applied to single-carrier frequency division multiple access (SC-FDMA)-based massive MIMO systems, such as 3GPP LTE-based systems. We therefore develop a novel, low-complexity iterative detection and decoding algorithm for these SC-FDMA-based MIMO systems. The proposed algorithm combines a novel frequency-domain minimum mean-square error (FD-MMSE) equalization method with parallel interference cancellation (PIC). The proposed algorithm has low computational complexity and achieves near-optimal error-rate performance in realistic 3GPP-LTE-based massive MIMO systems having roughly 2× more base-station antennas than users.

2.1 Nonlinear Tree-Search Detection

Multiple-input multiple-output (MIMO) wireless is a key technology used in many modern communication standards, such as 3GPP-LTE, WiMAX, and IEEE 802.11n. The use of multiple antennas at both ends of the wireless link, particularly combined with OFDM, enables significant improvements (compared to single-antenna systems) in terms of spectral efficiency. Among different MIMO detection schemes, soft-output maximum a-posteriori (MAP) detection is the optimal detection scheme in coded systems. This nonlinear detector requires an exhaustive search over all candidate vectors, which results in prohibitive computational complexity, even for small-scale MIMO systems transmitting a few spatial streams. Consequently, exact MAP detection is impractical, as wireless systems typically have stringent hardware constraints (silicon area and power consumption) as well as challenging throughput and latency requirements. As a result, most practical solutions to MIMO detection rely on suboptimal low-complexity algorithms [3–11, 28]. Suboptimal detection algorithms can generally be categorized into linear detection algorithms and non-linear tree-search-based methods. Although the complexity of linear detection methods for small-scale MIMO systems is very low, the associated error-rate performance is rather poor [7, 8, 28]. Non-linear tree-search-based detection methods, however, are capable of achieving excellent error-rate performance at low computational complexity for small-scale MIMO systems. Such detectors can be

divided into two categories: (i) depth-first tree-search algorithms, such as sphere decoding [3, 10], and (ii) breadth-first search algorithms, such as the K-best algorithm [9]. For both approaches, the search space, i.e., the set of all possible transmit vectors, can be represented as a tree. To reduce the computational complexity of data detection, these detectors use heuristics to eliminate useless branch extensions during the tree-search process. Depth-first sphere detectors traverse the tree from top to bottom recursively and prune branches with large partial distances during backtracking. As the traversal path is not deterministic, parallel implementations of depth-first search require load balancing to achieve high efficiency [29]; this requires global synchronization, which is inefficient on programmable processors such as GPUs. In addition to having a non-deterministic runtime, depth-first search algorithms usually evaluate a large number of tree branches at low SNR and a small number at high SNR. Breadth-first search algorithms, such as the K-best algorithm, reduce the complexity by pruning branches level by level. Although the K-best algorithm has a deterministic runtime, the algorithm requires a global sort at each tree level to find the best K nodes, which is the main bottleneck for corresponding software implementations [30]. More recently, the authors in [11] proposed the selective spanning with fast enumeration (SSFE) MIMO detection algorithm, which can be viewed as a sort-free approximation to the K-best algorithm; related MIMO detection algorithms were also proposed in [6, 31]. The SSFE method, however, results in a substantial error-rate

performance loss. In order to recover part of this performance loss, one can run a small number of SSFE instances in parallel, where each instance operates with a different permuted detection order [19, 32]. Since these instances perform the same set of operations but work on different input data, this improved algorithm maps very well onto data-parallel processors such as GPUs [33, 34].

MIMO OFDM System Model

The considered MIMO system transmits $N_t$ independent data streams, and the destination receives signals on $N_t$ antennas. At the transmit side, given a binary-valued vector $x = [x_0, \ldots, x_{L-1}]^T$ with $L = N_t \log_2 M$, the modulation function maps the vector $x$ to $s = [s_0, \ldots, s_{N_t-1}]^T$, where $s_i$ is a complex number in a finite constellation alphabet $\Omega$ with cardinality $M$. For example, the constellation alphabet for QPSK is $\{-1-j, -1+j, 1-j, 1+j\}$ with $M = 4$. The source then transmits the modulated signal vector $s$ over $N_t$ antennas. The received symbols can be modeled as

$$y = Hs + n, \tag{2.1}$$

where $y = [y_0, \ldots, y_{N_t-1}]^T$ is the received symbol vector and $H = [h_0, \ldots, h_{N_t-1}]$ is the $N_t \times N_t$ channel matrix. We consider a Rayleigh fading channel model, where each entry of $H$, denoted by $h_{ij}$, is modeled as an i.i.d. zero-mean circularly symmetric complex Gaussian (ZMCSCG) random variable with variance $\sigma_h^2$ per complex dimension. Each element of the additive noise vector, $n = [n_0, \ldots, n_{N_t-1}]^T$, is assumed to be i.i.d.

ZMCSCG with variance $N_0$ per complex dimension.

Modified Real-Valued Decomposition

To perform MIMO detection in the real domain instead of the complex domain, we first perform a real-valued decomposition of the input-output relation (2.1). Specifically, we rewrite (2.1) as

$$\begin{bmatrix} \Re(y) \\ \Im(y) \end{bmatrix} = \begin{bmatrix} \Re(H) & -\Im(H) \\ \Im(H) & \Re(H) \end{bmatrix} \begin{bmatrix} \Re(s) \\ \Im(s) \end{bmatrix} + \begin{bmatrix} \Re(n) \\ \Im(n) \end{bmatrix}, \tag{2.2}$$

where $\Re(x)$ and $\Im(x)$ denote the real and imaginary part of the complex variable $x$, respectively. In order to improve the error-rate performance of the proposed detector, we deploy the modified real-valued decomposition (MRVD) put forward in [35]. In particular, we permute the vector and matrix elements such that the real and imaginary parts of the same complex entry are adjacent to each other. With this, the resulting

input-output relation is given by

$$\begin{bmatrix} \Re(y_0) \\ \Im(y_0) \\ \Re(y_1) \\ \Im(y_1) \\ \vdots \\ \Re(y_{N_t-1}) \\ \Im(y_{N_t-1}) \end{bmatrix} = \tilde{H} \begin{bmatrix} \Re(s_0) \\ \Im(s_0) \\ \Re(s_1) \\ \Im(s_1) \\ \vdots \\ \Re(s_{N_t-1}) \\ \Im(s_{N_t-1}) \end{bmatrix} + \begin{bmatrix} \Re(n_0) \\ \Im(n_0) \\ \Re(n_1) \\ \Im(n_1) \\ \vdots \\ \Re(n_{N_t-1}) \\ \Im(n_{N_t-1}) \end{bmatrix},$$

which we abbreviate by $\tilde{y} = \tilde{H}\tilde{s} + \tilde{n}$. Compared to the original (complex-valued) system model in (2.1), MRVD doubles the number of elements in each vector and quadruples the number of entries of the channel matrix. Furthermore, each element of $\tilde{s}$ is drawn from a smaller (real-valued) alphabet, $\tilde{\Omega}$, which has cardinality $Q = \sqrt{M}$. For example, with QPSK, the MRVD-equivalent constellation alphabet is $\{-1, +1\}$ with $Q = 2$.

Soft-Output MIMO Detection

Given the received vector after performing the MRVD, $\tilde{y}$, and the MRVD-equivalent channel matrix $\tilde{H}$, the soft-output MIMO detector at the receiver computes the a-posteriori probability log-likelihood (log-APP) ratio, $L_k^D$, for each bit. Assuming equally likely transmitted bits, the log-likelihood ratio (LLR) of the $k$-th transmitted

bit can be approximated via the max-log approximation [36]:

$$L_k^D \approx \frac{1}{2N_0}\left( \min_{x \in \mathcal{X}_{k,0}} \|\tilde{y} - \tilde{H}\tilde{s}\|^2 - \min_{x \in \mathcal{X}_{k,1}} \|\tilde{y} - \tilde{H}\tilde{s}\|^2 \right). \tag{2.3}$$

Here, $\mathcal{X}_{k,0}$ is the set of all binary-valued vectors with the $k$-th bit equal to 0, and $\mathcal{X}_{k,1}$ is the set of all binary-valued vectors with the $k$-th bit equal to 1. For the sake of brevity, the vector $\tilde{s}$ corresponds to the modulated binary vector $x$. The number of binary vectors in $\mathcal{X}_{k,0}$ and $\mathcal{X}_{k,1}$ scales exponentially with $N_t$. As a result, the computational complexity of evaluating (2.3) scales exponentially with $N_t$. To reduce the complexity of (2.3), $L_k^D$ can be approximated using a reduced set of transmit vectors, or candidate list, $\mathcal{L}$. This candidate list is generated by excluding transmit vectors with large Euclidean distances. We then split $\mathcal{L}$ into two sublists, $\mathcal{L}_{k,0}$ and $\mathcal{L}_{k,1}$. The list $\mathcal{L}_{k,0}$ contains the candidates with the $k$-th bit equal to 0, while the list $\mathcal{L}_{k,1}$ contains the candidates with the $k$-th bit equal to 1. We then approximate $L_k^D$ via (2.4), where the list $\mathcal{L}_{k,0}$ is used for computing the 0-hypothesis part of $L_k^D$, while the list $\mathcal{L}_{k,1}$ is used for computing the 1-hypothesis part of $L_k^D$:

$$L_k^D \approx \frac{1}{2N_0}\Big( \underbrace{\min_{x \in \mathcal{L}_{k,0}} \|\tilde{y} - \tilde{H}\tilde{s}\|^2}_{\text{0-hypothesis}} - \underbrace{\min_{x \in \mathcal{L}_{k,1}} \|\tilde{y} - \tilde{H}\tilde{s}\|^2}_{\text{1-hypothesis}} \Big). \tag{2.4}$$
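As a rough illustration of how (2.4) turns a candidate list into LLRs, the following Python sketch evaluates the two hypothesis minima per bit over an explicit list. The function names (`modulate`, `sq_dist`, `max_log_llr`), the bit-to-symbol mapping (0 to −1, 1 to +1), the example channel, and the use of a clipping value when one sublist is empty are assumptions made for this example; they are not taken from the thesis implementation.

```python
import itertools
import math


def modulate(bits):
    """Map bits to real-valued symbols in {-1, +1} (assumed mapping: 0 -> -1, 1 -> +1)."""
    return [1.0 if b else -1.0 for b in bits]


def sq_dist(y, H, s):
    """Squared Euclidean distance ||y - H s||^2 for real-valued y, H, and s."""
    return sum((yi - sum(h * sj for h, sj in zip(row, s))) ** 2
               for yi, row in zip(y, H))


def max_log_llr(y, H, candidates, n0, clip=8.0):
    """Approximate per-bit LLRs from a candidate list of bit tuples, following (2.4)."""
    scored = [(x, sq_dist(y, H, modulate(x))) for x in candidates]
    llrs = []
    for k in range(len(candidates[0])):
        # Smallest distance among candidates whose k-th bit is 0 (0-hypothesis).
        d0 = min((d for x, d in scored if x[k] == 0), default=math.inf)
        # Smallest distance among candidates whose k-th bit is 1 (1-hypothesis).
        d1 = min((d for x, d in scored if x[k] == 1), default=math.inf)
        llr = (d0 - d1) / (2.0 * n0)
        # Clipping also handles an empty sublist (an assumption of this sketch).
        llrs.append(max(-clip, min(clip, llr)))
    return llrs


if __name__ == "__main__":
    # Hypothetical 4x4 real-valued channel (a 2x2 complex system after MRVD).
    H = [[2.0, 1.0, 0.0, 0.0],
         [0.0, 2.0, 1.0, 0.0],
         [0.0, 0.0, 2.0, 1.0],
         [1.0, 0.0, 0.0, 2.0]]
    x_true = (1, 0, 0, 1)
    y = [sum(h * s for h, s in zip(row, modulate(x_true))) for row in H]
    full_list = list(itertools.product((0, 1), repeat=4))
    print(max_log_llr(y, H, full_list, n0=0.5))
```

When the candidate list contains every bit vector, the function reduces to an exhaustive evaluation of the max-log approximation; with shorter lists, any bit whose 0- or 1-hypothesis sublist is empty simply receives the clipping value.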

Figure 2.1: An example of the MIMO detection search process for a QAM MIMO system

QR Decomposition

To reduce the complexity of the candidate search, we first perform a QR decomposition of $\tilde{H}$. With this, we can rewrite (2.4) as follows:

$$L_k^D \approx \frac{1}{2N_0}\Big( \underbrace{\min_{x \in \mathcal{L}_{k,0}} \|\hat{y} - R\tilde{s}\|^2}_{\text{0-hypothesis}} - \underbrace{\min_{x \in \mathcal{L}_{k,1}} \|\hat{y} - R\tilde{s}\|^2}_{\text{1-hypothesis}} \Big). \tag{2.5}$$

Here, $R$ is the upper-triangular matrix obtained from the QR decomposition applied to $\tilde{H}$, and $\hat{y} = Q^T\tilde{y}$ is the effective received vector.

1-Way MIMO Detection

Given $\hat{y}$ and $R$, the MIMO detector computes LLR values in two steps. The first step finds candidate vectors with small distances. The second step computes an LLR value for each transmitted bit using the candidate list by applying (2.5).
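The step from (2.4) to (2.5) relies on the fact that, for a square full-rank channel matrix, multiplying by $Q^T$ preserves Euclidean distances. The sketch below checks this numerically in plain Python. Modified Gram-Schmidt is used here only to keep the example free of external dependencies; it is not necessarily the QR method used in the thesis, and the helper names (`qr_gram_schmidt`, `effective_received`, `sq_dist`) are hypothetical.

```python
import math


def qr_gram_schmidt(H):
    """QR factorization of a square, full-rank real matrix (list of rows)
    via modified Gram-Schmidt. Returns (Q, R) with H = Q R, R upper triangular."""
    n = len(H)
    cols = [[H[i][j] for i in range(n)] for j in range(n)]  # columns of H
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            # Project out the component along each previously found direction.
            R[i][j] = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - R[i][j] * qi for vi, qi in zip(v, q)]
        R[j][j] = math.sqrt(sum(vi * vi for vi in v))
        q_cols.append([vi / R[j][j] for vi in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R


def effective_received(Q, y):
    """Compute y_hat = Q^T y."""
    n = len(y)
    return [sum(Q[i][j] * y[i] for i in range(n)) for j in range(n)]


def sq_dist(y, A, s):
    """Squared Euclidean distance ||y - A s||^2."""
    return sum((yi - sum(a * sj for a, sj in zip(row, s))) ** 2
               for yi, row in zip(y, A))


if __name__ == "__main__":
    # Hypothetical real-valued (MRVD-style) channel and received vector.
    H = [[2.0, 1.0, 0.0, 0.0],
         [0.0, 2.0, 1.0, 0.0],
         [0.0, 0.0, 2.0, 1.0],
         [1.0, 0.0, 0.0, 2.0]]
    y = [0.3, -1.2, 2.5, 0.7]
    Q, R = qr_gram_schmidt(H)
    y_hat = effective_received(Q, y)
    s = [1.0, -1.0, 1.0, -1.0]
    # The two distances agree up to rounding, so (2.4) and (2.5) rank candidates identically.
    print(sq_dist(y, H, s), sq_dist(y_hat, R, s))
```

Because $R$ is upper triangular, the distance $\|\hat{y} - R\tilde{s}\|^2$ can be accumulated one row at a time starting from the last row, which is what makes a level-by-level tree search possible.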

The search algorithm is essentially an SSFE MIMO detector [6, 11, 31] operating in the real domain. The search algorithm attempts to find candidate vectors with small distances in a greedy fashion. As $R$ is an upper-triangular matrix, the search algorithm evaluates transmitted symbols in reverse order, from antenna $N_t - 1$ to antenna 0. The procedure is equivalent to a tree traversal. For example, a complete tree search for a 2 × 2 MIMO system using 16-QAM is shown in Figure 2.1. This search keeps all branches of antenna 1 by fully expanding the first two levels of the tree. For the subsequent tree levels, the branches of the tree are pruned by keeping only the best outgoing path. All surviving paths at the end of the procedure form our candidate list, which in this example contains sixteen candidates. Let the $k$-th path be $p^k = [p^k_{2N_t-1}, \ldots, p^k_t]$, the set of nodes along the path from the root node to $p^k_t$, a node on level $t$. The best outgoing path can be found using Schnorr-Euchner enumeration [37]. The partial distance $w^{t-1}_{k,i}$, i.e., the distance from $p^k_t$ to the $i$-th node on level $t-1$, can be computed as

$$w^{t-1}_{k,i} = \Big| \Big( \hat{y}_{t-1} - \sum_{j=t}^{2N_t-1} r_{t-1,j}\, p^k_j \Big) - r_{t-1,t-1}\, s_i \Big|^2 \tag{2.6}$$

$$\phantom{w^{t-1}_{k,i}} = \big| b^k_{t-1} - r_{t-1,t-1}\, s_i \big|^2, \tag{2.7}$$

where $r_{m,n}$ is the entry in row $m$ and column $n$ of $R$, and $\hat{y}_t$ is the $t$-th row of $\hat{y}$. To expand this path, the best node in level $t-1$ that minimizes $w^{t-1}_{k,i}$ is simply the closest constellation point in the real-valued alphabet $\tilde{\Omega}$ to $\gamma = (r_{t-1,t-1})^{-1} b^k_{t-1}$, the zero-forcing solution. If node $i$ is the best node found at level $t-1$, the $k$-th distance can be updated by adding the

Figure 2.2: N-way MIMO detector for a 2 × 2 MIMO system.

partial distance, $w^{t-1}_{k,i}$, to the distance from the previous level $t$ as follows:

$$d_k = d_k + w^{t-1}_{k,i}. \tag{2.8}$$

N-Way Parallel MIMO Detection

As described in the previous section, the soft-output SSFE detector consists of a single QR decomposition, a single candidate search, and a single LLR generator. However, the error-rate performance of soft-output SSFE detection is significantly worse than that of soft-output max-log-MAP detection. We now describe a simple algorithm which improves upon the error-rate performance of the SSFE detector. We perform several tree searches with different antenna detection orders in parallel to improve the error-rate performance of the detector. This improves performance as multiple tree searches generate a larger candidate list which

results in more reliable LLRs. In our design, we run $N$ parallel candidate searches, where $1 \le N \le N_t$, to generate $N$ parallel candidate lists, each with $M$ candidates. We then generate LLR values from the combined candidate list, which consists of $MN$ candidates. For example, the proposed algorithm for a 2 × 2 MIMO 16-QAM system is shown in Figure 2.2. The example consists of two parallel detectors. The inputs consist of the received vector $y$ and the channel matrix $H$. A different antenna detection order can be obtained by a simple circular rotation of the columns of $H$. Each detector performs a QR decomposition followed by a candidate search to generate a candidate list. The results from the two detectors, two candidate lists, are then used to generate LLR values.

Performance

We compared the BER performance of the N-way parallel MIMO detector against several other soft-output detectors, including the soft-output trellis-based MIMO detector [38], the fully parallel fixed-complexity sphere detector (FPFSD) [34], and the soft-output max-log-MAP detector. The soft-output max-log-MAP detector computes LLR values using the set of all possible transmit vectors (i.e., a direct implementation of (2.4)) and serves as the performance bound. In our BER simulation, we first generate a random binary information vector, which is then encoded by a rate-1/3 3GPP LTE turbo encoder with block length K. We then modulate the coded binary vector onto MIMO symbols. The symbols are transmitted

Figure 2.3: BER performance of the soft-output 4 × 4 N-way MIMO detector in Rayleigh fading channels: (a) 4 × 4, 16-QAM; (b) 4 × 4, 64-QAM.

through a Rayleigh fading channel with additive white Gaussian noise. The detector performs QR decomposition on the channel matrix and then performs soft-output

detection once to generate LLRs. The soft output of the detector is then fed to a 3GPP turbo decoder [39], which performs up to 8 turbo decoding iterations. The detectors use an LLR clipping value of 8 for all detector configurations with the exception of N = 4, where LLR clipping is not required. Figure 2.3 compares the BER performance of the detectors for 16-QAM and 64-QAM. The trends are similar in both plots. The N-way parallel MIMO detector is equivalent to SSFE when N = 1. For N = 1, the BER performance of the N-way MIMO detector is worse than that of the other soft-output detectors. As we increase N, the performance of the detector improves, as a larger candidate list increases the probability of finding the smallest 0-hypothesis and the smallest 1-hypothesis for each transmitted bit. For N = 4, the detector's performance is within 0.25 dB of the soft-output max-log-MAP detector. We note that the computational complexity difference between these two cases is significant: the number of leaf nodes visited is $NM$ for the proposed algorithm compared to $M^N$ for the soft-output max-log-MAP detector. For $N \ge 3$, the N-way MIMO detector outperforms the soft-output trellis-based MIMO detector. The FPFSD MIMO detector is similar to the N = 4 case except that the FPFSD MIMO detector performs detection in the complex domain. In addition, the FPFSD detector uses column-norm reordering preprocessing to improve performance. Nevertheless, we found that the FPFSD detector performs similarly to the N = 4 N-way detector despite their differences. The column-norm reordering preprocessing, however, is

an effective way of improving the $N = 1$ case. Nevertheless, the BER performance of the $N = 1$ case with column-norm reordering preprocessing is still worse than that of the $N = 2$ case without column-norm reordering.

2.2 Iterative Detection and Decoding

Current cellular systems, such as LTE and LTE-Advanced, employ single-carrier frequency-division multiple access (SC-FDMA) for the uplink to reduce the linearity requirements of the corresponding radio-frequency (RF) circuitry [40]. To achieve near-optimal performance in SC-FDMA-based systems, the best-known receivers rely on iterative detection and decoding (IDD) [12-14], which exchanges reliability information on the coded bits between the (sub-optimal) data detector and the channel decoder (e.g., a turbo decoder) [20, 41, 42]. Unfortunately, the corresponding optimal or near-optimal algorithms for the SC-FDMA uplink (users transmit to the BS) exhibit high computational complexity, even for small-scale MIMO systems. We are aware of other detection algorithms for massive MIMO that do not consider SC-FDMA. However, their adaptation to SC-FDMA is not straightforward and hence, these algorithms are not in the scope of this thesis. The algorithm in [12] performs frequency-domain (FD) equalization followed by sphere decoding, which is known to be significantly more complex than linear methods [43]. The algorithms in [13, 14] avoid the use of sphere decoding, but their FD equalization methods incur high complexity for systems with a large number of BS antennas. Hence, for massive MIMO

systems where the number of BS antennas is on the order of tens or hundreds, the SC-FDMA detection algorithms in [12-14] result in excessive complexity.

We propose a novel soft-input soft-output detection algorithm for SC-FDMA-based massive MIMO systems using IDD. The proposed detection algorithm, detailed later in this section, builds upon the small-scale MIMO detector (designed for OFDM systems) in [20] and combines a novel, low-complexity FD minimum mean-square error (FD-MMSE) equalizer (see Section 2.2.3) with parallel interference cancellation (PIC). Our simulation results (shown in Section 2.2.4) demonstrate that we achieve near-optimal detection performance for realistic 3GPP LTE-based massive MIMO systems if the number of BS antennas exceeds the number of users by roughly $2\times$.

MIMO SC-FDMA system model

In this section, we introduce the MIMO SC-FDMA system model, which is used to model the LTE uplink. We consider the multi-user MIMO LTE uplink where $B$ antennas at the base station (BS) communicate with $U \le B$ single-antenna users. The $U$ users first encode their own bit stream $\mathbf{b}^{(i)}$ using an LTE turbo encoder and then map the coded bit stream to constellation points in the finite alphabet $\mathcal{O}$ with cardinality $M = |\mathcal{O}|$, average transmit power $E_s$ per symbol, and $Q = \log_2(M)$ bits per symbol. The $L$ time-domain constellation points for the $i$th user are subsumed in the vector $\mathbf{x}^{(i)} = [x_1^{(i)} \cdots x_L^{(i)}]^T$, where $x_j^{(i)}$ is associated with a binary-valued vector $\mathbf{b}_j^{(i)} = [b_{j,1}^{(i)} \cdots b_{j,Q}^{(i)}]$. Since the LTE uplink employs SC-FDMA [1], an $L$-point discrete

Fourier transform (DFT) matrix $\mathbf{F}_L$ is used to modulate the time-domain symbols onto orthogonal frequency bands. The output of the DFT, i.e., the frequency-domain symbols or SC-FDMA symbols, is $\mathbf{s}^{(i)} = [s_1^{(i)} \cdots s_L^{(i)}]^T = \mathbf{F}_L \mathbf{x}^{(i)}$. For each user, the frequency-domain symbols are mapped onto data-carrying subcarriers and transformed back to the time domain using an inverse DFT. After prepending the cyclic prefix, all $U$ users transmit their time-domain signals simultaneously over the wireless channel.

For data detection, the time-domain signals received at each BS antenna are first transformed back into the frequency domain using a DFT, followed by extraction of the data-carrying symbols. Assuming a sufficiently long cyclic prefix (i.e., at least as long as the delay spread of the channel's impulse response), the received frequency-domain symbols can be modeled using the standard input-output relation $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$, with the following definitions:

$$
\mathbf{y} = \begin{bmatrix} \mathbf{y}^{(1)} \\ \vdots \\ \mathbf{y}^{(B)} \end{bmatrix}, \quad
\mathbf{s} = \begin{bmatrix} \mathbf{s}^{(1)} \\ \vdots \\ \mathbf{s}^{(U)} \end{bmatrix}, \quad
\mathbf{H} = \begin{bmatrix} \mathbf{H}^{(1,1)} & \cdots & \mathbf{H}^{(1,U)} \\ \vdots & \ddots & \vdots \\ \mathbf{H}^{(B,1)} & \cdots & \mathbf{H}^{(B,U)} \end{bmatrix}, \quad \text{and} \quad
\mathbf{n} = \begin{bmatrix} \mathbf{n}^{(1)} \\ \vdots \\ \mathbf{n}^{(B)} \end{bmatrix}.
$$

Here, the vector $\mathbf{y}^{(i)} = [y_1^{(i)}, \ldots, y_L^{(i)}]^T$ contains the received symbols on the $i$th antenna in the frequency domain, where $y_w^{(i)}$ is the symbol received on the $w$th subcarrier of

the $i$th antenna. The $L \times L$ diagonal matrix $\mathbf{H}^{(i,j)} = \operatorname{diag}(h_1^{(i,j)}, \ldots, h_L^{(i,j)})$ contains, on its main diagonal, the length-$L$ frequency response of the channel between the $i$th receive antenna and the $j$th transmit antenna, and $\mathbf{n}^{(i)} = [n_1^{(i)}, \ldots, n_L^{(i)}]^T$ models thermal noise at the $i$th receive antenna in the frequency domain. The entries of the vector $\mathbf{n}^{(i)}$ are assumed to be i.i.d. zero-mean Gaussian with variance $N_0$ per complex entry.

Since every block $\mathbf{H}^{(i,j)}$ of $\mathbf{H}$ is diagonal, we can also decompose the received signal into $L$ independent parallel problems: the received frequency symbols on the $w$th subcarrier are modeled as $\mathbf{y}_w = \mathbf{H}_w \mathbf{s}_w + \mathbf{n}_w$ with

$$
\mathbf{y}_w = \begin{bmatrix} y_w^{(1)} \\ \vdots \\ y_w^{(B)} \end{bmatrix}, \quad
\mathbf{H}_w = \begin{bmatrix} h_w^{(1,1)} & \cdots & h_w^{(1,U)} \\ \vdots & \ddots & \vdots \\ h_w^{(B,1)} & \cdots & h_w^{(B,U)} \end{bmatrix}, \quad
\mathbf{s}_w = [s_w^{(1)} \cdots s_w^{(U)}]^T, \quad \text{and} \quad
\mathbf{n}_w = [n_w^{(1)} \cdots n_w^{(B)}]^T.
$$

Here, $y_w^{(i)}$ is the frequency symbol received on the $w$th subcarrier for antenna $i$, and $h_w^{(i,j)}$ is the frequency gain/attenuation on the $w$th subcarrier between the $i$th receive antenna and the $j$th user. The scalar $s_w^{(j)}$ denotes the symbol transmitted by the $j$th user on the $w$th subcarrier; the scalar $n_w^{(i)}$ represents complex i.i.d. zero-mean Gaussian noise with variance $N_0$.

Notation

Lowercase boldface letters stand for column vectors; uppercase boldface letters designate matrices. For a matrix $\mathbf{A}$, we denote its transpose and conjugate transpose

by $\mathbf{A}^T$ and $\mathbf{A}^H$, respectively. The entry in the $k$th row and $l$th column of a matrix $\mathbf{A}$ is denoted by $A_{k,l}$; the $k$th entry of a vector $\mathbf{a}$ is designated by $a_k$. The Frobenius norm of a matrix $\mathbf{A}$ and the $\ell_2$-norm of a vector $\mathbf{a}$ are denoted by $\|\mathbf{A}\|_F$ and $\|\mathbf{a}\|_2$, respectively. The $M \times M$ identity matrix is denoted by $\mathbf{I}_M$, and $\mathbf{F}_M$ refers to the $M \times M$ discrete Fourier transform (DFT) matrix, normalized as $\mathbf{F}_M^H \mathbf{F}_M = \mathbf{I}_M$. In order to simplify notation, we make frequent use of the superscript $(\cdot)^{(i,j)}$ to indicate the $i$th base-station antenna and $j$th user; the subscript $(\cdot)_w$ designates the SC-FDMA subcarrier index.

Figure 2.4: Principle of iterative detection and decoding (IDD) in SC-FDMA-based MIMO wireless systems. (Block labels: inputs $\mathbf{y}$, $\mathbf{H}$; soft-symbol modulator, PIC, and MMSE detector forming the soft-input soft-output MIMO detector; turbo decoder as channel decoder; exchanged LLRs $L_{\text{in}}$ and $L_{\text{out}}$.)

Soft-input Soft-output FD-MMSE-PIC

We now detail our soft-input soft-output detection algorithm for iterative SC-FDMA systems. Fig. 2.4 illustrates the main principle of iterative detection and decoding (IDD) in SC-FDMA. The soft-input soft-output MIMO detector generates log-likelihood ratio (LLR) values (indicated with $L_{\text{out}}$ in Fig. 2.4) for each transmitted bit using the received symbols and a-priori LLRs from the channel decoder (indicated with $L_{\text{in}}$ in

Fig. 2.4). Then, the channel decoder generates new a-priori LLRs using the LLRs from the MIMO detector. The detector and decoder exchange LLR values until a maximum number of iterations $I$ is reached.

To reduce the complexity of IDD, we build our soft-input soft-output MIMO detector on the algorithm in [20] and adapt it to SC-FDMA. Our algorithm can be summarized as follows: (i) a soft-symbol modulator constructs FD soft-symbol estimates of the transmitted symbols using a-priori LLRs; (ii) FD parallel interference cancellation (PIC) removes interference in the received signal using the soft-symbol estimates on a per-user basis; (iii) adaptive MMSE equalization estimates the transmitted symbols; (iv) LLR computation then computes the LLR values. The following paragraphs detail these steps.

Soft symbol modulator

The soft-symbol modulator generates soft-symbol estimates of the transmitted symbols given a-priori LLRs of the transmitted bits from the channel decoder. The procedure follows that in [20], except that an additional DFT is required to obtain the soft-symbol estimates in the frequency domain. First, the a-priori LLRs are converted into probability values using

$$
\Pr(b_{j,k}^{(i)} = 1) = \frac{1}{2}\left(1 + \tanh\left(\frac{1}{2} L_{j,k}^{(i)}\right)\right)
\quad \text{and} \quad
\Pr(b_{j,k}^{(i)} = 0) = 1 - \Pr(b_{j,k}^{(i)} = 1),
$$

where $L_{j,k}^{(i)}$ is the a-priori LLR corresponding to

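the transmit bit $b_{j,k}^{(i)}$. Second, the vector $\tilde{\mathbf{x}}^{(i)} = [\tilde{x}_1^{(i)} \cdots \tilde{x}_L^{(i)}]^T$ consists of the time-domain soft symbols for the $i$th user, where the soft symbol $\tilde{x}_j^{(i)}$ can be computed as

$$
\tilde{x}_j^{(i)} = \sum_{a \in \mathcal{O}} \Pr(x_j^{(i)} = a)\, a.
$$

The symbol probability $\Pr(x_j^{(i)} = a)$ can be computed as $\Pr(x_j^{(i)} = a) = \prod_k \Pr(b_{j,k}^{(i)} = z_k)$, where $z_k \in \{0, 1\}$ is the $k$th bit associated with the constellation symbol $a$. Finally, as the DFT is linear, the FD soft-symbol estimates can be computed as $\tilde{\mathbf{s}}^{(i)} = [\tilde{s}_1^{(i)} \cdots \tilde{s}_L^{(i)}]^T = \mathbf{F}_L \tilde{\mathbf{x}}^{(i)}$.

Variance of frequency domain soft-symbol

The variance of the FD soft symbol $\tilde{s}_w^{(i)}$ is

$$
\begin{aligned}
\operatorname{Var}[\tilde{s}_w^{(i)}]
&= E\big[(\mathbf{f}_w \mathbf{x}^{(i)} - \mathbf{f}_w \tilde{\mathbf{x}}^{(i)})(\mathbf{f}_w \mathbf{x}^{(i)} - \mathbf{f}_w \tilde{\mathbf{x}}^{(i)})^H\big] \\
&= \mathbf{f}_w E\big[\mathbf{x}^{(i)} (\mathbf{x}^{(i)})^H\big] \mathbf{f}_w^H - \mathbf{f}_w \tilde{\mathbf{x}}^{(i)} (\tilde{\mathbf{x}}^{(i)})^H \mathbf{f}_w^H \\
&= \mathbf{f}_w \big( E\big[\mathbf{x}^{(i)} (\mathbf{x}^{(i)})^H\big] - \tilde{\mathbf{x}}^{(i)} (\tilde{\mathbf{x}}^{(i)})^H \big) \mathbf{f}_w^H
= \mathbf{f}_w \boldsymbol{\Sigma}^{(i)} \mathbf{f}_w^H,
\end{aligned}
$$

where $\mathbf{f}_w$ corresponds to the $w$th row of the DFT matrix. The off-diagonal terms of $E[\mathbf{x}^{(i)} (\mathbf{x}^{(i)})^H]$ and $\tilde{\mathbf{x}}^{(i)} (\tilde{\mathbf{x}}^{(i)})^H$ are the same. Thus, their difference, the matrix $\boldsymbol{\Sigma}^{(i)}$, is diagonal, where the $j$th entry on the main diagonal is $E[|x_j^{(i)}|^2] - |\tilde{x}_j^{(i)}|^2$. Thus, we have

$$
\operatorname{Var}[\tilde{s}_w^{(i)}] = \frac{1}{L} \sum_{k=1}^{L} E\big[|x_k^{(i)}|^2\big] - \frac{1}{L} \sum_{k=1}^{L} |\tilde{x}_k^{(i)}|^2
$$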

with $E[|x_k^{(i)}|^2] = \sum_{a \in \mathcal{O}} \Pr(x_k^{(i)} = a)\, |a|^2$.

Parallel interference cancellation (PIC)

PIC removes interference in the received signal on a per-user basis. The procedure follows that in [20]. Let $\tilde{s}_w^{(j)}$ be the soft estimate of the symbol transmitted by the $j$th user on the $w$th subcarrier. The result, $\hat{\mathbf{y}}_{w|i}$, i.e., the frequency symbols received on the $w$th subcarrier post-cancellation for the $i$th user, is

$$
\hat{\mathbf{y}}_{w|i} = \mathbf{y}_w - \sum_{j \ne i} \mathbf{h}_{j,w} \tilde{s}_w^{(j)} = \mathbf{H}_w \mathbf{z}_{w|i} + \mathbf{n}_w,
$$

where $\mathbf{h}_{j,w}$ is the $j$th column of $\mathbf{H}_w$ and the elements of $\mathbf{z}_{w|i} = [z_{w|i}^{(1)}, \ldots, z_{w|i}^{(U)}]^T$ are defined as follows:

$$
z_{w|i}^{(j)} =
\begin{cases}
s_w^{(i)}, & \text{if } j = i, \\
s_w^{(j)} - \tilde{s}_w^{(j)}, & \text{if } j \ne i.
\end{cases}
$$

Adaptive MMSE Equalization

The equalized receive symbols on the $w$th subcarrier of the $i$th user (in the frequency domain) are given by

$$
\hat{s}_w^{(i)} = \mathbf{w}_{w|i}^H \hat{\mathbf{y}}_{w|i} = E_s \mathbf{h}_{i,w}^H \mathbf{A}_{w|i}^{-1} \hat{\mathbf{y}}_{w|i}, \tag{2.9}
$$

where $\mathbf{A}_{w|i} = \mathbf{H}_w \boldsymbol{\Lambda}_{w|i} \mathbf{H}_w^H + N_0 \mathbf{I}_{B \times B}$. The matrix $\boldsymbol{\Lambda}_{w|i} = E[\mathbf{z}_{w|i} (\mathbf{z}_{w|i})^H]$ is diagonal with entries

$$
\lambda_{w|i}^{(j,j)} =
\begin{cases}
E_s, & \text{if } j = i, \\
\operatorname{Var}[\tilde{s}_w^{(j)}], & \text{if } j \ne i,
\end{cases} \tag{2.10}
$$

where $\operatorname{Var}[\tilde{s}_w^{(j)}] = \frac{1}{L} \sum_{k=1}^{L} \big( E[|x_k^{(j)}|^2] - |\tilde{x}_k^{(j)}|^2 \big)$ and $E[|x_k^{(j)}|^2] = \sum_{a \in \mathcal{O}} \Pr(x_k^{(j)} = a)\, |a|^2$, as shown above. Let $\hat{\mathbf{s}}^{(i)} = [\hat{s}_1^{(i)}, \ldots, \hat{s}_L^{(i)}]^T$ be the frequency-domain estimates for the $i$th user. The vector $\hat{\mathbf{x}}^{(i)} = \mathbf{F}_L^H \hat{\mathbf{s}}^{(i)} = [\hat{x}_1^{(i)}, \ldots, \hat{x}_L^{(i)}]^T$ then contains the time-domain estimates for the $i$th user, where $\mathbf{F}_L^H$ is the IDFT matrix.

LLR Computation

Given the time-domain estimates, the detector can compute the corresponding LLR values. As a consequence of SC-FDMA, this deviates substantially from the LLR computation approach used in the MIMO-OFDM detector [20]. To compute the extrinsic LLRs from the time-domain symbol estimates, we model the $t$th time-domain symbol estimate of the $i$th user as a Gaussian random variable

$$
\hat{x}_t^{(i)} = \mu_i x_t^{(i)} + e_t^{(i)},
$$

where $\mu_i$ is the effective channel gain and $e_t^{(i)}$ is the post-equalization noise-plus-interference (NPI). Let $\nu_i^2$ be the variance of $e_t^{(i)}$ and $k$ be the bit index of the LLR associated with the $t$th symbol transmitted by the $i$th user. The LLRs of the

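transmitted bits can be approximated as

$$
L_{t,k}^{(i)} = \frac{1}{\nu_i^2} \left( \min_{a \in \mathcal{O}_k^0} \big| \hat{x}_t^{(i)} - \mu_i a \big|^2 - \min_{a \in \mathcal{O}_k^1} \big| \hat{x}_t^{(i)} - \mu_i a \big|^2 \right),
$$

where $\mathcal{O}_k^0$ and $\mathcal{O}_k^1$ are the sets of transmit constellation symbols for which the $k$th bit equals 0 and 1, respectively.

Noise-plus-interference (NPI) computation

We now show that $\mu_i = L^{-1} \sum_{w=1}^{L} \mathbf{w}_{w|i}^H \mathbf{h}_{i,w}$ and $\nu_i^2 = E_s \mu_i - E_s |\mu_i|^2$. We first write the equalized symbols of the $i$th user, $\hat{\mathbf{s}}^{(i)} = [\hat{s}_1^{(i)}, \ldots, \hat{s}_L^{(i)}]^T$, as follows:

$$
\hat{\mathbf{s}}^{(i)} = \mathbf{W}_{|i}^{(i,:)} \hat{\mathbf{y}}_{|i} = E_s (\mathbf{H}^{(:,i)})^H \big( \mathbf{H} \boldsymbol{\Lambda}_{|i} \mathbf{H}^H + N_0 \mathbf{I}_{LB \times LB} \big)^{-1} \hat{\mathbf{y}}_{|i}
$$

with the following definitions:

$$
\hat{\mathbf{y}}_{|i} = [\hat{\mathbf{y}}_{|i}^{(1)} \cdots \hat{\mathbf{y}}_{|i}^{(B)}]^T, \quad
\boldsymbol{\Lambda}_{|i} = \begin{bmatrix} \boldsymbol{\Lambda}_{|i}^{(1,1)} & & \\ & \ddots & \\ & & \boldsymbol{\Lambda}_{|i}^{(U,U)} \end{bmatrix}, \quad
\mathbf{H} = \begin{bmatrix} \mathbf{H}^{(1,1)} & \cdots & \mathbf{H}^{(1,U)} \\ \vdots & \ddots & \vdots \\ \mathbf{H}^{(B,1)} & \cdots & \mathbf{H}^{(B,U)} \end{bmatrix},
$$

and $\mathbf{H}^{(:,i)} = [\mathbf{H}^{(1,i)} \cdots \mathbf{H}^{(B,i)}]^T$. The matrix $\mathbf{H}^{(:,i)}$ is the horizontal concatenation of the $i$th block column of (diagonal) submatrices of $\mathbf{H}$, consisting of the FD channel responses between the receive antennas and the transmit antenna associated with the $i$th user. The $L \times L$ diagonal matrix $\boldsymbol{\Lambda}_{|i}^{(j,j)} = \operatorname{diag}(\lambda_{1|i}^{(j,j)}, \ldots, \lambda_{L|i}^{(j,j)})$ has entries $\lambda_{w|i}^{(j,j)}$ as defined in (2.10).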

The vector $\hat{\mathbf{y}}_{|i}^{(j)} = [\hat{y}_{1|i}^{(j)}, \ldots, \hat{y}_{L|i}^{(j)}]^T$ contains the post-cancellation symbols for the $j$th receive antenna for all subcarriers.

In order to obtain an explicit formulation of the effective channel gain $\mu_i$ as well as the NPI variance $\nu_i^2$, we can write the $t$th symbol estimate of the $i$th user as follows:

$$
\hat{x}_t^{(i)} = \mathbf{f}_t^H \hat{\mathbf{s}}^{(i)} = \mathbf{f}_t^H \mathbf{W}^{(i,:)} \mathbf{y}.
$$

The row vector $\mathbf{f}_t^H$ corresponds to the $t$th row of the IDFT matrix $\mathbf{F}_L^H$. Let $\mathbf{H}^{(:,j)} = [\mathbf{H}^{(1,j)}, \ldots, \mathbf{H}^{(B,j)}]^T$ be the horizontal concatenation of the $j$th block column of (diagonal) submatrices of $\mathbf{H}$, consisting of the frequency-domain channel responses between the receive antennas and the transmit antenna associated with the $j$th user. We first compute the effective channel gain as

$$
\mu_i x_t^{(i)} = E\big[ \mathbf{f}_t^H \mathbf{W}^{(i,:)} \mathbf{y} \,\big|\, x_t^{(i)} \big] = L^{-1} \operatorname{tr}\big( \mathbf{W}^{(i,:)} \mathbf{H}^{(:,i)} \big)\, x_t^{(i)}.
$$

Since $\mathbf{W}^{(i,j)}$ and $\mathbf{H}^{(i,j)}$ are diagonal matrices, we can write $\mu_i$ as a sum of per-subcarrier operations. Let $\mathbf{h}_{i,w}$ be the $i$th column of $\mathbf{H}_w$. Then, we get

$$
\mu_i = L^{-1} \sum_{w=1}^{L} \mathbf{w}_{w|i}^H \mathbf{h}_{i,w}.
$$

We next compute the post-equalization NPI variance $\nu_i^2$ of the residual noise

plus interference as

$$
\nu_i^2 = E\big[ |\hat{x}_t^{(i)}|^2 \big] - E\big[ |\mu_i x_t^{(i)}|^2 \big] = E_s \mathbf{f}_t^H (\mathbf{H}^{(:,i)})^H (\mathbf{W}^{(i,:)})^H \mathbf{f}_t - E_s |\mu_i|^2,
$$

which allows us to compute the post-equalization NPI using the following simple formula: $\nu_i^2 = E_s \mu_i - E_s |\mu_i|^2$.

2.2.3 Low-Complexity Adaptive MMSE Equalization

The computational complexity of the proposed soft-input soft-output detector is dominated by the $U$ matrix inverses $\mathbf{A}_{w|i}^{-1}$, $i = 1, \ldots, U$, in (2.9) that need to be computed for each subcarrier and in each iteration. We first review existing inversion methods and then propose a new, improved scheme that is particularly suitable for massive MIMO systems.

Existing equalization methods

The method in [44] reduces the computational complexity of the inverses $\mathbf{A}_{w|i}^{-1}$ in (2.9) with rank-1 updates. First, this method computes $\tilde{\mathbf{A}}_w = \mathbf{H}_w \boldsymbol{\Lambda}_w \mathbf{H}_w^H + N_0 \mathbf{I}_{B \times B}$ and its inverse $\tilde{\mathbf{A}}_w^{-1}$, where $\boldsymbol{\Lambda}_w$ is diagonal with the $i$th diagonal entry being $\lambda_i = \operatorname{Var}[\tilde{s}_w^{(i)}]$. Second, [44] performs the following rank-1 updates to obtain the desired inverses:

$$
\mathbf{A}_{w|i}^{-1} = \tilde{\mathbf{A}}_w^{-1} - \frac{(1 - \lambda_i)\, \tilde{\mathbf{A}}_w^{-1} \mathbf{h}_{i,w} \mathbf{h}_{i,w}^H \tilde{\mathbf{A}}_w^{-1}}{1 + (1 - \lambda_i)\, \mathbf{h}_{i,w}^H \tilde{\mathbf{A}}_w^{-1} \mathbf{h}_{i,w}},
$$

where $\mathbf{h}_{i,w}$ is the $i$th column of $\mathbf{H}_w$. The complexity of this method for each iteration is dominated by the initial matrix multiplication $\mathbf{H}_w \boldsymbol{\Lambda}_w \mathbf{H}_w^H$ and the subsequent $B \times B$ matrix inversion $\tilde{\mathbf{A}}_w^{-1}$, requiring roughly $IUB^2 + IB^3$ operations per subcarrier (we ignore all constants). We note that the methods in [13, 14] compute similar $B \times B$ inverses, which is the reason for their prohibitive complexity in massive MIMO systems with a large number of BS antennas $B$.

The inversion algorithm proposed in [45] reduces the computational complexity by expressing (2.9) as

$$
\hat{s}_w^{(i)} = \mathbf{w}_{w|i}^H \hat{\mathbf{y}}_{w|i} = E_s \mathbf{e}_i^H \mathbf{B}_{w|i}^{-1} \hat{\mathbf{y}}_{w|i}^{\text{MF}}, \tag{2.11}
$$

where $\mathbf{B}_{w|i} = \mathbf{G}_w \boldsymbol{\Lambda}_{w|i} + N_0 \mathbf{I}_{U \times U}$, $\hat{\mathbf{y}}_{w|i}^{\text{MF}} = \mathbf{H}_w^H \hat{\mathbf{y}}_{w|i}$, $\mathbf{G}_w = \mathbf{H}_w^H \mathbf{H}_w$, and $\mathbf{e}_i$ is a unit vector with a single 1 at the $i$th position and 0 elsewhere. The inverse $\mathbf{B}_{w|i}^{-1}$ can be computed at low complexity with a preprocessing step followed by iterative updates. In the preprocessing step, the algorithm computes the Gram matrix $\mathbf{G}_w = \mathbf{H}_w^H \mathbf{H}_w$, which requires $BU^2$ operations per subcarrier. Each iterative update then computes the equalized symbols of an iteration using (2.11) on a per-subcarrier basis for each user. The complexity of each iteration for each user is dominated by $\mathbf{B}_{w|i}^{-1}$, which requires a $U \times U$ matrix inverse. Consequently, the complexity scales roughly with $BU^2 + IU^4$ operations per subcarrier.

We finally note that the method proposed in [20] requires only $IU^3$ operations per OFDM tone; unfortunately, this inversion approach cannot be applied to SC-FDMA

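systems.

Low-complexity equalization

We now present a novel approach that reduces the computational complexity compared to the methods in [44] and [45]. As in [45], we obtain $\mathbf{B}_{w|i}^{-1}$ by first computing the Gram matrix $\mathbf{G}_w$. However, instead of evaluating (2.11) directly, we perform the following steps per iteration and per subcarrier. We compute $\tilde{\mathbf{B}}_w^{-1} = (\mathbf{G}_w \boldsymbol{\Lambda}_w + N_0 \mathbf{I}_{U \times U})^{-1}$, requiring roughly $U^3$ operations per subcarrier. For each user, we then apply a rank-1 update to $\tilde{\mathbf{B}}_w^{-1}$ to obtain $\mathbf{e}_i^H \mathbf{B}_{w|i}^{-1}$. Let $\mathbf{e}_i^H \mathbf{B}_{w|i}^{-1} = \mathbf{e}_i^H ( \tilde{\mathbf{B}}_w + \mathbf{G}_w \boldsymbol{\Delta}_i )^{-1}$, where $\boldsymbol{\Delta}_i$ is an all-zeros matrix except for the $i$th entry on the diagonal, which is $\delta_i = E_s - \operatorname{Var}[\tilde{s}_w^{(i)}]$. Let $\mathbf{g}_{i,w}$ be the $i$th column of $\mathbf{G}_w$ and $\tilde{\mathbf{b}}_{i,w}^H$ be the $i$th row of $\tilde{\mathbf{B}}_w^{-1}$. Since $\mathbf{G}_w \boldsymbol{\Delta}_i = \delta_i \mathbf{g}_{i,w} \mathbf{e}_i^H$, we apply a rank-1 update as

$$
\mathbf{e}_i^H \mathbf{B}_{w|i}^{-1} = \mathbf{e}_i^H \big( \tilde{\mathbf{B}}_w + \delta_i \mathbf{g}_{i,w} \mathbf{e}_i^H \big)^{-1} = \tilde{\mathbf{b}}_{i,w}^H - \frac{\delta_i\, \tilde{\mathbf{b}}_{i,w}^H \mathbf{g}_{i,w}\, \tilde{\mathbf{b}}_{i,w}^H}{1 + \delta_i\, \tilde{\mathbf{b}}_{i,w}^H \mathbf{g}_{i,w}},
$$

which consists of vector operations only. As a result, the complexity of each iteration is dominated by the initial matrix inverse $\tilde{\mathbf{B}}_w^{-1}$. The complexity of our inversion approach is only on the order of $BU^2 + IU^3$ per subcarrier, which is lower than that of [44] and [45], especially for massive MIMO systems.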
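The rank-1 update above can be checked numerically. The following is a minimal sketch, not the thesis implementation: the function names, the random test channel, and the values of $E_s$, $N_0$, and the soft-symbol variances are all assumptions made for illustration. It computes the rows $\mathbf{e}_i^H \mathbf{B}_{w|i}^{-1}$ for one subcarrier once with a single $U \times U$ inverse plus per-user vector operations, and once with $U$ separate inverses as a reference.

```python
import numpy as np

def equalizer_rows_rank1(G, var_soft, Es, N0):
    """Rows e_i^H B_{w|i}^{-1} for all users: one inverse, then rank-1 updates."""
    U = G.shape[0]
    # B~ = G * Lambda + N0 * I, with Lambda = diag(Var[s~^(1)], ..., Var[s~^(U)])
    B_tilde_inv = np.linalg.inv(G @ np.diag(var_soft) + N0 * np.eye(U))
    rows = np.empty((U, U), dtype=complex)
    for i in range(U):
        delta = Es - var_soft[i]        # delta_i = Es - Var[s~^(i)]
        b_row = B_tilde_inv[i, :]       # i-th row of B~^{-1}
        g_col = G[:, i]                 # i-th column of the Gram matrix
        # b - delta*(b g) b / (1 + delta*(b g)) simplifies to b / (1 + delta*(b g))
        rows[i] = b_row / (1.0 + delta * (b_row @ g_col))
    return rows

def equalizer_rows_direct(G, var_soft, Es, N0):
    """Reference: one explicit U x U inverse per user (the costly approach)."""
    U = G.shape[0]
    rows = np.empty((U, U), dtype=complex)
    for i in range(U):
        lam = np.array(var_soft, dtype=float)
        lam[i] = Es                     # Lambda_{w|i}: Es on the i-th diagonal entry
        rows[i] = np.linalg.inv(G @ np.diag(lam) + N0 * np.eye(U))[i, :]
    return rows

# Hypothetical single-subcarrier example: B = 32 BS antennas, U = 4 users.
rng = np.random.default_rng(0)
H_w = (rng.standard_normal((32, 4)) + 1j * rng.standard_normal((32, 4))) / np.sqrt(2)
G_w = H_w.conj().T @ H_w
variances = rng.uniform(0.0, 1.0, 4)    # assumed soft-symbol variances in [0, Es]
fast = equalizer_rows_rank1(G_w, variances, Es=1.0, N0=0.1)
slow = equalizer_rows_direct(G_w, variances, Es=1.0, N0=0.1)
print("max deviation:", np.max(np.abs(fast - slow)))
```

The sketch only illustrates why the per-user cost drops from a matrix inverse to a few vector operations; it makes no claim about fixed-point behavior or about how the thesis schedules these operations in hardware.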

2.2.4 Simulation Results

To evaluate the performance of the proposed iterative SC-FDMA detector, we consider a 3GPP LTE uplink system [1] with $B$ antennas at the BS and $U \le B$ single-antenna users. All simulations are carried out for the most challenging scenario (from an error-rate performance perspective), i.e., we consider a 20 MHz bandwidth with 1200 subcarriers and the highest-rate modulation and coding scheme (i.e., MCS 28) as specified in [1]. The system parameters correspond to 64-QAM and a rate-[...] 3GPP turbo code. To reflect a potential real-world scenario, we use the WINNER-Phase-2 model [46] to generate the channel matrices (at the time of writing this thesis, we are unaware of any massive MIMO channel models) and assume a linear antenna array with an antenna spacing of 6 cm. All users are randomly placed within a circular area of 1 km radius (the chosen parameters resemble those of the measurement campaign in [47]).

In Fig. 2.5 and Fig. 2.6, we assess the frame error rate (FER) performance of the iterative detection and decoding schemes for $U = 4$ and $U = 8$ users, respectively. For each case, we vary the number of BS antennas, i.e., from a conventional (small-scale, symmetric) to realistic massive MIMO configurations. At the BS, we perform IDD as described above with a log-MAP LTE turbo decoder performing 8 decoder iterations per IDD iteration. For a single iteration (i.e., $I = 1$, which implies that no feedback from the channel decoder is used), our detector algorithm corresponds to the standard soft-output MMSE detector. We also show the FER performance of two ($I = 2$) and four ($I = 4$) IDD iterations using our method. In addition, we

show the FER performance of our algorithm in so-called self-iteration mode (denoted by SI = 2). A self iteration corresponds to the case where we directly feed back the posterior LLRs from the MIMO detector to its input. This mode has the advantage of significantly reducing the latency compared to full iterations through the channel decoder, at the cost of worse FER performance. As a reference, we include the single-input multiple-output (SIMO) bound, which corresponds to the (idealistic) case where no inter-user interference is present.

For all considered antenna configurations, we see that IDD (often significantly) improves the FER performance. The performance in self-iteration mode is better than that of soft-output MMSE detection ($I = 1$), especially for symmetric systems, i.e., where $B = U$, but worse than IDD with $I \ge 2$. IDD in combination with our detection algorithm enables us to approach the SIMO bound to within about 0.3 dB (or less) for massive MIMO systems where the number of BS antennas exceeds the number of users by a factor of two. However, we also see that the performance improvement due to IDD depends on the ratio between BS antennas and users. For (symmetric) small-scale MIMO systems, IDD significantly reduces the FER, while the gains with large-scale MIMO are comparably smaller, which suggests that linear detection is sufficient for large-scale MIMO. Note that we only perform two self iterations, as carrying out more self iterations shows no further FER performance gains (over non-iterative detection).

Figure 2.5: FER performance comparison for an SC-FDMA-based MIMO wireless system ($U = 4$).

Figure 2.6: FER performance comparison for an SC-FDMA-based MIMO wireless system ($U = 8$).

Chapter 3

MIMO Detection for Large Scale Systems

In the previous chapter, we showed that linear MMSE detection can achieve good performance for large-scale MIMO. Although the complexity of the linear MMSE detection algorithm is significantly smaller than that of the iterative FD-MMSE-PIC algorithm, the presence of hundreds of antennas at the BS and a large number of users increases the computational complexity of MIMO detection by orders of magnitude compared to small-scale MIMO systems.

In this chapter, we propose novel low-complexity approximate soft-output data detection methods for wideband massive MU-MIMO systems that reduce the complexity compared to that of an exact matrix inversion method. We analyze the implementation trade-offs associated with approximate and exact linear data detection in the large-scale MIMO uplink, and we show that the performance of the approximation methods depends on the ratio between BS antennas and users. In particular, we show that for small BS-to-user antenna ratios, exact and implicit Cholesky decomposition-based equalization methods achieve the best trade-off; for large BS-to-user antenna ratios (i.e., if the number of BS antennas is roughly $2\times$ larger than the number of user antennas), approximate and implicit methods such as Conjugate-Gradient [48], Gauss-Seidel [49], and accelerated implicit Neumann series approximations in combination with our post-equalization SINR approximation

enable further reductions in computational complexity at virtually no performance loss. Finally, we show that by combining the advantages of exact explicit and approximate implicit equalization, we can exploit frequency (and time) correlation in wideband massive MU-MIMO systems to perform near-optimal data detection at more than $2\times$ lower computational complexity than competitive methods.

3.1 Linear MMSE Detection

Without a-priori LLRs from the turbo decoder, the complexity of linear MMSE detection reduces significantly. The equalized frequency-domain symbols are $\hat{\mathbf{s}} = \mathbf{W}\mathbf{y}$ with the MMSE equalization matrix defined as follows [50]:

$$
\mathbf{W} = \big( \mathbf{H}^H \mathbf{H} + N_0 E_s^{-1} \mathbf{I}_{LU} \big)^{-1} \mathbf{H}^H.
$$

Since the effective channel matrix $\mathbf{H}$ is built from diagonal $L \times L$ submatrices, we can apply MMSE equalization on a per-subcarrier basis. Specifically, the equalized symbols on the $w$th subcarrier are given by $\hat{\mathbf{s}}_w = \mathbf{W}_w \mathbf{y}_w$, with the per-subcarrier MMSE equalization matrix defined as

$$
\mathbf{W}_w = \big( \mathbf{H}_w^H \mathbf{H}_w + N_0 E_s^{-1} \mathbf{I}_U \big)^{-1} \mathbf{H}_w^H = \mathbf{A}_w^{-1} \mathbf{H}_w^H. \tag{3.1}
$$

In particular, our methods build upon the MMSE data detector in [20] initially developed for traditional, small-scale MIMO-OFDM systems. This algorithm performs

data detection in two phases: (i) estimates of the transmitted FD symbols in SC-FDMA systems are obtained using MMSE equalization on a per-subcarrier basis; (ii) log-likelihood ratio (LLR) values are computed in the time domain.

(i) Equalization: To perform MMSE equalization in the FD, we compute the Gram matrix $\mathbf{G}_w = \mathbf{H}_w^H \mathbf{H}_w$ and the matched-filter vector $\mathbf{y}_w^{\text{MF}} = \mathbf{H}_w^H \mathbf{y}_w$ for each subcarrier $w$. We then compute the regularized Gram matrix $\mathbf{A}_w = \mathbf{G}_w + N_0 E_s^{-1} \mathbf{I}_U$, which enables us to compute the equalized FD symbols as

$$
\tilde{\mathbf{y}}_w = \mathbf{A}_w^{-1} \mathbf{y}_w^{\text{MF}}. \tag{3.2}
$$

These equalized FD symbols are then used to compute the LLR values required for soft-output data detection [20, 26].

(ii) Soft-output Data Detection: Since the LTE uplink utilizes SC-FDMA, we first perform an IDFT on $\tilde{\mathbf{y}}^{(i)} = [\tilde{y}_1^{(i)}, \ldots, \tilde{y}_L^{(i)}]^T$ to obtain the TD estimate $\hat{\mathbf{x}}^{(i)} = [\hat{x}_1^{(i)}, \ldots, \hat{x}_L^{(i)}]^T$. To extract LLRs from the time-domain symbol estimates, we approximate each estimate as an independent Gaussian random variable. The so-called max-log LLR value of the $j$th bit of the $t$th symbol, $L^{(i)}(t,j)$, is then computed as [20]

$$
L^{(i)}(t,j) = \rho^{(i)} \left( \min_{a \in \mathcal{O}_j^0} \left| \frac{\hat{x}_t^{(i)}}{\mu^{(i)}} - a \right|^2 - \min_{a \in \mathcal{O}_j^1} \left| \frac{\hat{x}_t^{(i)}}{\mu^{(i)}} - a \right|^2 \right). \tag{3.3}
$$

Here, $\mathcal{O}_j^0$ and $\mathcal{O}_j^1$ are the constellation subsets for which the $j$th bit is 0 and 1, respectively.
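As an illustration of the max-log LLR rule in (3.3), the following minimal sketch evaluates it for a single time-domain symbol estimate. The function name, the QPSK bit labeling, and the numerical values of $\mu^{(i)}$ and $\nu_i^2$ are assumptions chosen for the example (LTE defines its own constellation mappings); only the formula itself is taken from the text.

```python
import math

# Hypothetical QPSK labeling (bit pair -> unit-energy symbol); not the LTE mapping.
QPSK = {
    (0, 0): complex(1, 1) / math.sqrt(2),
    (0, 1): complex(1, -1) / math.sqrt(2),
    (1, 0): complex(-1, 1) / math.sqrt(2),
    (1, 1): complex(-1, -1) / math.sqrt(2),
}

def max_log_llrs(x_hat, mu, nu2, constellation):
    """Max-log LLRs of one symbol estimate following (3.3); positive favors bit = 1."""
    rho = mu ** 2 / nu2                  # post-equalization SINR
    z = x_hat / mu                       # remove the effective channel gain
    num_bits = len(next(iter(constellation)))
    llrs = []
    for j in range(num_bits):
        # Smallest squared distance to symbols whose j-th bit is 0 and 1, respectively.
        d0 = min(abs(z - a) ** 2 for bits, a in constellation.items() if bits[j] == 0)
        d1 = min(abs(z - a) ** 2 for bits, a in constellation.items() if bits[j] == 1)
        llrs.append(rho * (d0 - d1))
    return llrs

# Example values (assumed): Es = 1, mu = 0.8, so nu^2 = Es*mu - Es*mu^2.
Es, mu = 1.0, 0.8
nu2 = Es * mu - Es * mu ** 2
noiseless_estimate = mu * QPSK[(1, 0)]   # estimate of the symbol labeled (1, 0)
print(max_log_llrs(noiseless_estimate, mu, nu2, QPSK))
```

With this sign convention, the first LLR comes out positive and the second negative for the noiseless estimate of the symbol labeled (1, 0), matching the definition of $\mathcal{O}_j^0$ and $\mathcal{O}_j^1$ above.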

The post-equalization signal-to-interference-plus-noise ratio (SINR) is $\rho^{(i)} = (\mu^{(i)})^2 / \nu_i^2$, where $\nu_i^2 = E_s \mu^{(i)} - E_s |\mu^{(i)}|^2$ for SC-FDMA-based systems. The effective channel gain is computed as $\mu^{(i)} = L^{-1} \sum_{w=1}^{L} \mathbf{a}_{i,w}^H \mathbf{g}_{i,w}$, where $\mathbf{a}_{i,w}^H$ is the $i$th row of $\mathbf{A}_w^{-1}$ and $\mathbf{g}_{i,w}$ is the $i$th column of $\mathbf{G}_w$. See [26] for more details.

3.2 Explicit vs. Implicit Equalization

There exist two distinct classes of equalization methods to compute (3.2), namely explicit and implicit methods. Explicit methods first compute the matrix inverse $\mathbf{A}_w^{-1}$ (or an approximation thereof) and then use the matrix inverse to compute the equalized FD symbol as in (3.2). Implicit methods solve the system of linear equations $\mathbf{A}_w \tilde{\mathbf{y}}_w = \mathbf{y}_w^{\text{MF}}$ either exactly or approximately to compute the equalized FD symbol $\tilde{\mathbf{y}}_w$; this approach avoids an explicit computation of $\mathbf{A}_w^{-1}$.

The key advantage of implicit equalization methods is that they require (often significantly) lower complexity than explicit methods. In contrast, explicit equalization methods have the following advantages: (i) Massive MU-MIMO systems are expected to operate as time-division duplexing (TDD) systems [16], where the BS estimates the channel during the uplink phase. As a result, the matrix inverse obtained during the uplink transmission can be re-used to perform MU precoding in the downlink. (ii) For slow-fading channels and/or channels with low delay spread, the inverse can be re-used for consecutive symbols and/or adjacent subcarriers, respectively. (iii) Computation of the post-equalization SINR $\rho^{(i)}$ used in (3.3) is greatly facilitated

by the explicit inverse $\mathbf{A}_w^{-1}$ (cf. Section 3.1). In the following Sections 3.3 and 3.4, we discuss explicit and implicit equalization schemes, respectively.

3.3 Explicit Approximate MMSE Detection

We start by discussing explicit MMSE equalization, i.e., where we obtain the equalized symbol $\tilde{\mathbf{y}}_w$ by first computing or approximating the inverse matrix $\mathbf{A}_w^{-1}$, followed by computing (3.2). Unsurprisingly, the computation of all per-subcarrier inverses $\mathbf{A}_w^{-1}$, $\forall w$, in (3.1) is responsible for the main computational complexity of linear MMSE detection in SC-FDMA-based large-scale MIMO systems. For a conventional small-scale LTE uplink scenario, i.e., where the number of receive antennas $B$ and users $U$ is small (on the order of $U, B \le 6$), existing VLSI designs for linear detection, such as [28, 51, 52], compute the exact inverse explicitly. For large-scale MIMO systems with a large number of users $U$, however, the computation of the inverse $\mathbf{A}_w^{-1}$ can quickly result in excessive complexity. Hence, practical solutions for large-scale MIMO detection in LTE necessitate low-complexity matrix inversion methods.

In this section, we provide an overview of exact, explicit inversion methods based on the Cholesky decomposition and proceed by discussing existing and novel iterative methods that approximate the inverse $\mathbf{A}_w^{-1}$ at low complexity. For the sake of simplicity, we will omit the subcarrier index $w$.

Exact Inversion via the Cholesky Decomposition

A large number of exact methods to compute $\mathbf{A}^{-1}$ exist in the literature; see the references [28, 53, 54] for an overview. One of the most efficient methods (in terms of arithmetic operations) that can be implemented in VLSI at low complexity relies on the Cholesky decomposition [26, 55-57]. This approach first factorizes the regularized Gram matrix as $\mathbf{A} = \mathbf{L}\mathbf{L}^H$, where $\mathbf{L}$ is a lower-triangular matrix with non-negative entries on the main diagonal. To obtain $\mathbf{A}^{-1}$, the approach then solves $\mathbf{L}\mathbf{X} = \mathbf{I}_U$ for $\mathbf{X}$ using forward substitution; one can then solve $\mathbf{L}^H \mathbf{A}^{-1} = \mathbf{X}$ for $\mathbf{A}^{-1}$ using back substitution.

The complexity that is required to explicitly compute $\mathbf{A}^{-1}$ using the Cholesky decomposition can become prohibitive as $U$ grows large (see Section 3.7 for a discussion). Furthermore, computing the Cholesky decomposition, as well as performing forward or backward substitution, exhibits stringent data dependencies, which prevent highly parallel hardware designs. To reduce the computational complexity of matrix inversion for the high-dimensional systems anticipated in massive MU-MIMO systems and to enable massively parallel hardware designs, we next propose novel, low-complexity methods that explicitly compute approximate versions of $\mathbf{A}^{-1}$.
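The Cholesky-based procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the VLSI-oriented implementation referenced in the text: the helper names, the random channel, and the regularization value $N_0 E_s^{-1} = 0.1$ are hypothetical, and NumPy's `np.linalg.cholesky` is used for the factorization itself. The same two substitution routines cover both the explicit route (solve for the full inverse) and the implicit route of Section 3.2 (solve directly for $\tilde{\mathbf{y}}$).

```python
import numpy as np

def forward_substitution(L, B):
    """Solve L X = B for lower-triangular L, row by row."""
    X = np.zeros_like(B, dtype=complex)
    for r in range(L.shape[0]):
        # Row r depends on all previously computed rows (the data dependency).
        X[r] = (B[r] - L[r, :r] @ X[:r]) / L[r, r]
    return X

def back_substitution(R, B):
    """Solve R X = B for upper-triangular R, from the last row upwards."""
    n = R.shape[0]
    X = np.zeros_like(B, dtype=complex)
    for r in range(n - 1, -1, -1):
        X[r] = (B[r] - R[r, r + 1:] @ X[r + 1:]) / R[r, r]
    return X

def cholesky_inverse(A):
    """Explicit inverse: A = L L^H, then L X = I and L^H A^{-1} = X."""
    L = np.linalg.cholesky(A)
    X = forward_substitution(L, np.eye(A.shape[0], dtype=complex))
    return back_substitution(L.conj().T, X)

def cholesky_solve(A, y_mf):
    """Implicit equalization: solve A y = y_mf without forming A^{-1}."""
    L = np.linalg.cholesky(A)
    return back_substitution(L.conj().T, forward_substitution(L, y_mf))

# Hypothetical example: B = 32 BS antennas, U = 8 users, N0/Es = 0.1.
rng = np.random.default_rng(0)
H = (rng.standard_normal((32, 8)) + 1j * rng.standard_normal((32, 8))) / np.sqrt(2)
A = H.conj().T @ H + 0.1 * np.eye(8)          # regularized Gram matrix
y_mf = H.conj().T @ (rng.standard_normal(32) + 1j * rng.standard_normal(32))
A_inv = cholesky_inverse(A)
y_tilde = cholesky_solve(A, y_mf)
print(np.allclose(A_inv @ y_mf, y_tilde))
```

The row-by-row loops make the data dependencies mentioned above visible: each row of the substitution needs the results of all rows processed before it.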

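Exact Inversion using Series Expansions

Accelerated Neumann Series Expansion

We now propose a general, accelerated version of the classical Neumann series, which enables the design of approximate data detectors that achieve superior error-rate performance at low complexity.

Lemma 1 (Accelerated Neumann Series). Let $\tilde{\mathbf{A}}_0^{-1} \in \mathbb{C}^{U \times U}$ be a so-called initialization matrix with full rank. Suppose that

$$
\lim_{k \to \infty} \big( \mathbf{I}_U - \tilde{\mathbf{A}}_0^{-1} \mathbf{A} \big)^k = \mathbf{0}_{U \times U}. \tag{3.4}
$$

Then, we have the following accelerated Neumann series:

$$
\mathbf{A}^{-1} = \sum_{k=0}^{\infty} \big( \mathbf{I}_U - \tilde{\mathbf{A}}_0^{-1} \mathbf{A} \big)^k \tilde{\mathbf{A}}_0^{-1}. \tag{3.5}
$$

Proof. We use [58, Thm. 4.20], which establishes that for a given matrix $\mathbf{P} \in \mathbb{C}^{U \times U}$ for which $\lim_{k \to \infty} \mathbf{P}^k = \mathbf{0}_{U \times U}$, we have $(\mathbf{I}_U - \mathbf{P})^{-1} = \sum_{k=0}^{\infty} \mathbf{P}^k$ and $\mathbf{I}_U - \mathbf{P}$ is invertible. As a consequence, by defining $\tilde{\mathbf{A}}_0^{-1} \mathbf{A} = \mathbf{I}_U - \mathbf{P}$ and assuming that $\lim_{k \to \infty} (\mathbf{I}_U - \tilde{\mathbf{A}}_0^{-1} \mathbf{A})^k = \mathbf{0}_{U \times U}$, we have

$$
\mathbf{A}^{-1} \tilde{\mathbf{A}}_0 = \big( \tilde{\mathbf{A}}_0^{-1} \mathbf{A} \big)^{-1} = \sum_{k=0}^{\infty} \big( \mathbf{I}_U - \tilde{\mathbf{A}}_0^{-1} \mathbf{A} \big)^k,
$$

which can be rewritten to the accelerated Neumann series in (3.5), since $\tilde{\mathbf{A}}_0^{-1}$ was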

assumed to be full rank.

In Section 3.5, we will develop computationally efficient ways of computing initialization matrices $\tilde{\mathbf{A}}_0^{-1}$, which enable accurate approximations of $\mathbf{A}^{-1}$ with only a few terms of the accelerated Neumann series (3.5). In order to design such matrices, we will make use of the following convergence condition; the proof directly follows from [58, Thm. 4.20].

Lemma 2 (Convergence Condition). A sufficient condition for (3.4) to hold is that

$$
\big\| \mathbf{I}_U - \tilde{\mathbf{A}}_0^{-1} \mathbf{A} \big\| < 1 \tag{3.6}
$$

for any consistent matrix norm.

Accelerated Neumann Series Recursion

As it will be important for the implicit, approximate inversion methods discussed in Section 3.4, it is key to realize that (3.5) can alternatively be formulated using the following recursion for the iterations $k = 1, 2, \ldots$ given by

$$
\tilde{\mathbf{A}}_k^{-1} = \tilde{\mathbf{A}}_0^{-1} + \big( \mathbf{I}_U - \tilde{\mathbf{A}}_0^{-1} \mathbf{A} \big) \tilde{\mathbf{A}}_{k-1}^{-1}, \tag{3.7}
$$

which we initialize with $\tilde{\mathbf{A}}_0^{-1}$ (hence the name initialization matrix). Given that (3.6) holds, the recursion satisfies $\lim_{k \to \infty} \tilde{\mathbf{A}}_k^{-1} = \mathbf{A}^{-1}$. The recursion can be derived from the right-hand side (RHS) in (3.5) by successively factoring $\mathbf{I}_U - \tilde{\mathbf{A}}_0^{-1} \mathbf{A}$ from the infinite

sum. We note that recurrent operations in (3.7) can be avoided by precomputing $\tilde{\mathbf{A}}_0^{-1} \mathbf{A}$.

Schulz Recursion

To obtain faster convergence rates than the accelerated Neumann series recursion (3.7), one may use higher-order recursions. One prominent method is the Schulz recursion [59], which has been proposed for small-scale MIMO systems in [60]. As for (3.7), if (3.6) holds, then the inverse $\mathbf{A}^{-1}$ can be computed recursively for $k = 1, 2, \ldots$ as [59]

$$
\tilde{\mathbf{A}}_k^{-1} = \big( 2 \mathbf{I}_U - \tilde{\mathbf{A}}_{k-1}^{-1} \mathbf{A} \big) \tilde{\mathbf{A}}_{k-1}^{-1} \tag{3.8}
$$

with the initialization matrix $\tilde{\mathbf{A}}_0^{-1}$. This recursion generates $2^k$ Neumann series terms for $k$ iterations, whereas the accelerated Neumann recursion (3.7) only generates $k + 1$ terms for $k$ iterations. The Schulz method (3.8), however, requires two matrix multiplications per iteration, whereas the accelerated Neumann recursion (3.7) requires only one. Hence, for a small number of iterations, i.e., for $k \le 2$, the accelerated Neumann series recursion is computationally more efficient (see, e.g., a later section or [26] for a detailed complexity comparison).
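To make the relation between the two recursions concrete, the following minimal sketch implements (3.7) and (3.8) side by side. The function names and the small test matrix are assumptions for illustration; the test matrix is strictly diagonally dominant, so the diagonal initialization used here satisfies (3.6) in the infinity norm. Since $k$ Schulz iterations generate $2^k$ series terms and $m$ Neumann iterations generate $m + 1$ terms, three Schulz iterations should match seven Neumann iterations up to rounding.

```python
import numpy as np

def neumann_recursion(A, X0, num_iters):
    """Accelerated Neumann recursion (3.7): X_k = X0 + (I - X0 A) X_{k-1}."""
    P = np.eye(A.shape[0]) - X0 @ A      # precomputed once, reused every iteration
    X = X0.copy()
    for _ in range(num_iters):
        X = X0 + P @ X                   # one matrix multiplication per iteration
    return X

def schulz_recursion(A, X0, num_iters):
    """Schulz recursion (3.8): X_k = (2I - X_{k-1} A) X_{k-1}."""
    two_eye = 2.0 * np.eye(A.shape[0])
    X = X0.copy()
    for _ in range(num_iters):
        X = (two_eye - X @ A) @ X        # two matrix multiplications per iteration
    return X

# Small, strictly diagonally dominant example matrix (chosen for illustration).
A_demo = np.array([[4.0, 1.0, 0.0],
                   [1.0, 5.0, 1.0],
                   [0.0, 1.0, 6.0]])
X0_demo = np.diag(1.0 / np.diag(A_demo)) # diagonal initialization matrix
X_schulz = schulz_recursion(A_demo, X0_demo, 3)
X_neumann = neumann_recursion(A_demo, X0_demo, 7)
residual = np.linalg.norm(np.eye(3) - X_schulz @ A_demo, np.inf)
print("Schulz(3) equals Neumann(7):", np.allclose(X_schulz, X_neumann))
print("residual infinity norm:", residual)
```

In this example, the Schulz result costs six matrix multiplications and the equivalent Neumann result costs seven (plus the precomputation), which is consistent with the statement that the Neumann recursion only pays off for very few iterations.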

Higher-Order Recursions

The literature describes other recursive inversion methods [61, 62], which converge even faster than the Schulz recursion (3.8). For example, if (3.6) holds, then the inverse $A^{-1}$ can be computed recursively for $k = 1, 2, \ldots$ as [62]

$$\tilde{A}_k^{-1} = \tilde{A}_{k-1}^{-1} \left( 3 I_U - A \tilde{A}_{k-1}^{-1} \left( 3 I_U - A \tilde{A}_{k-1}^{-1} \right) \right) \tag{3.9}$$

with the initialization matrix $\tilde{A}_0^{-1}$. This third-order matrix inverse approximation requires three matrix multiplications per iteration (if one precomputes $A \tilde{A}_{k-1}^{-1}$) and generates $3^k$ Neumann series terms in $k$ iterations. Note that (3.9) is computationally only more efficient than the Schulz recursion for k

Approximate Inversion using Truncated Series Expansions

For a large number of iterations, computing the accelerated Neumann series recursion (3.7), as well as the recursions in (3.8) and (3.9), is impractical and entails higher complexity than the Cholesky-based approach in Section. However, if we restrict ourselves to a small number $K_{\max}$ of iterations, which is equivalent to truncating the series (3.5) to $K_{\max}$ terms, one can accurately approximate $A^{-1}$ at low complexity. We next discuss existing and new variations of this idea.
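The term counts quoted for the recursions (3.7), (3.8), and (3.9) can be checked by tracking how the residual of the approximate inverse evolves. The short derivation below is a sketch; the residual symbols $R_k$ and $S_k$ are introduced here only for illustration and are not notation from the text.

```latex
% R_k and S_k are illustrative symbols, not notation from the text.
\begin{align*}
R_k &:= I_U - \tilde{A}_k^{-1} A, \qquad S_k := I_U - A \tilde{A}_k^{-1}, \\
\text{(3.7):}\quad R_k &= R_0 - R_0 (I_U - R_{k-1}) = R_0 R_{k-1}
  &&\Rightarrow\; R_k = R_0^{\,k+1}, \\
\text{(3.8):}\quad R_k &= I_U - (I_U + R_{k-1})(I_U - R_{k-1}) = R_{k-1}^2
  &&\Rightarrow\; R_k = R_0^{\,2^k}, \\
\text{(3.9):}\quad S_k &= I_U - (I_U - S_{k-1})(I_U + S_{k-1} + S_{k-1}^2) = S_{k-1}^3
  &&\Rightarrow\; S_k = S_0^{\,3^k}.
\end{align*}
% With A^{-1} = \sum_{j \ge 0} R_0^j \tilde{A}_0^{-1} from (3.5), a residual
% R_0^n (or S_0^n) means that exactly the first n series terms are present:
\[
\tilde{A}_k^{-1} = (I_U - R_0^{\,n}) A^{-1} = \sum_{j=0}^{n-1} R_0^{\,j} \tilde{A}_0^{-1},
\qquad n \in \{\, k+1,\; 2^k,\; 3^k \,\}.
\]
```

For the third-order recursion, the same sum is obtained because $\tilde{A}_0^{-1} S_0^{\,j} = R_0^{\,j} \tilde{A}_0^{-1}$.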

Truncated Neumann Series

The idea of the approximate, explicit inversion approach is to evaluate only $K_{\max}$ terms in (3.5):

$$\tilde{A}_{K_{\max}}^{-1} = \sum_{k=0}^{K_{\max}} \left( I_U - \tilde{A}_0^{-1} A \right)^k \tilde{A}_0^{-1}. \tag{3.10}$$

The initialization matrix $\tilde{A}_0^{-1}$ strongly influences the accuracy of the truncated Neumann series. We presented a simple initialization matrix in [22-24, 26]. We first decompose the matrix $A$ into its main diagonal part $D$ and the off-diagonal part $E = A - D$. Then, by using $\tilde{A}_0^{-1} = D^{-1}$ as the initialization matrix, we obtain the following truncated Neumann series [22-24, 26]

$$\tilde{A}_{K_{\max}}^{-1} = \sum_{k=0}^{K_{\max}} \left( -D^{-1} E \right)^k D^{-1},$$

which accurately approximates $A^{-1}$ for very small values of $K_{\max}$ in systems with large BS-to-user-antenna ratios. In Section 3.5, we will provide a detailed analysis of the initialization matrix $D^{-1}$ and present several other initialization matrices that achieve different performance/complexity trade-offs.
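The diagonally initialized truncated series above can be sketched in a few lines. This is a minimal illustration that assumes a small real-valued test matrix; the function name `truncated_neumann_diag`, the helpers, and the test values are hypothetical and not part of the thesis.

```python
# Pure-Python sketch; matrices are lists of rows. All names are illustrative.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def truncated_neumann_diag(A, k_max):
    """Sum_{k=0}^{k_max} (-D^{-1} E)^k D^{-1}, with D = diag(A) and E = A - D."""
    n = len(A)
    d_inv = [1.0 / A[i][i] for i in range(n)]
    # T = -D^{-1} E has a zero diagonal and scaled off-diagonal entries of A.
    T = [[0.0 if i == j else -d_inv[i] * A[i][j] for j in range(n)]
         for i in range(n)]
    term = [[d_inv[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]          # k = 0 term: D^{-1}
    for _ in range(k_max):
        term = matmul(T, term)                # next term: T^k D^{-1}
        total = [[total[i][j] + term[i][j] for j in range(n)]
                 for i in range(n)]
    return total

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Assumed test case: a diagonally dominant 2x2 matrix and its exact inverse.
A = [[4.0, 1.0], [1.0, 3.0]]
A_inv = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]

approx_1 = truncated_neumann_diag(A, 1)       # D^{-1} - D^{-1} E D^{-1}
errors = [max_abs_diff(truncated_neumann_diag(A, k), A_inv) for k in range(4)]
```

In this example, the entry-wise error shrinks with every additional term, which is the behavior one would expect when the off-diagonal part is small relative to the diagonal.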

Higher-Order Series Expansions

Evidently, the above truncation approach can also be used in combination with the Schulz recursion (3.8) or other higher-order recursions, such as the one in (3.9). In Sections 3.7 and we will analyze the associated performance/complexity trade-offs.

LLR Computation for Explicit Inversion Methods

With the above methods for computing the inverse $A_w^{-1}$, we can calculate the LLR values for soft-output data detection using (3.3). Specifically, for the Cholesky decomposition in Section and the exact series expansions in Section 3.3.2, we first compute the equalized symbols (3.2) and then generate the TD estimates $x_w$ with an IDFT. The LLR values (3.3) are obtained directly from $A_w^{-1}$ and $x_w$, where the quantities $\mu_w^{(i)}$ and $\rho_w^{(i)}$ are computed as discussed in Section 3.1.

For the approximate matrix inversion methods described in Section 3.3.3, we compute the max-log LLRs by simply replacing the exact inverse $A_w^{-1}$ by the approximation $\tilde{A}_{w|K}^{-1}$ to perform MMSE equalization. For this approximation, the effective channel gain $\mu_K^{(i)}$ and the post-equalization noise-plus-interference (NPI) variance $\nu_{i|K}^2$ now depend on $K$. In order to compute the effective channel gain $\mu_K^{(i)}$, we first construct the $LU \times LB$ matrix $\tilde{W}_K$ from the subcarrier equalization matrices $\tilde{W}_{w|K}$, $\forall w$, as explained in Section 3.1. With this, we have $\mu_K^{(i)} x_t^{(i)} = E\big[ f_t^H \tilde{W}_K^{(i,:)} y \mid x_t^{(i)} \big]$, which can be rewritten

by replacing $W^{(i,:)}$ with $\tilde{W}_K^{(i,:)}$. Consequently, the effective channel gain is given by

$$\mu_K^{(i)} = L^{-1} \sum_{w=1}^{L} \tilde{w}_{i,w|K}^H h_{i,w},$$

where $\tilde{w}_{i,w|K}^H$ is the $i$th row of $\tilde{W}_{w|K}$ and $h_{i,w}$ is the $i$th column of $H_w$.

In order to compute the post-equalization noise-plus-interference (NPI) variance $\nu_{i|K}^2$, one might assume that it simply corresponds to $E_s \mu_K^{(i)} - E_s |\mu_K^{(i)}|^2$ as in the exact inverse case. Unfortunately, this expression no longer holds because $\tilde{W}_K (E_s H H^H + N_0 I_{LB}) \neq E_s H^H$. Furthermore, the above NPI variance expression is not guaranteed to be nonnegative and hence, using it to compute LLR values inevitably results in poor error-rate performance. As a consequence, an alternative expression for $\nu_{i|K}^2$ is required when using the approximate matrix inverse for data detection. Following the steps of the derivation of $\nu_i^2$ in Section and replacing $W^{(i,:)}$ with $\tilde{W}_K^{(i,:)}$, the exact post-equalization NPI variance can be expressed as

$$\nu_{i|K}^2 = f_t^H \tilde{W}_K^{(i,:)} \left( E_s H H^H + N_0 I_{LB} \right) \big( \tilde{W}_K^{(i,:)} \big)^H f_t - E_s |\mu_K^{(i)}|^2. \tag{3.11}$$

Since $H^H (E_s H H^H + N_0 I_{LB}) = (E_s H^H H + N_0 I_{LU}) H^H$, we have

$$\nu_{i|K}^2 = E_s f_t^H (\tilde{A}_K^{-1})^{(i,:)} A G \big( (\tilde{A}_K^{-1})^{(i,:)} \big)^H f_t - E_s |\mu_K^{(i)}|^2.$$

As $(\tilde{A}_K^{-1})^{(i,i)}$ is diagonal, we can decompose the computation into a sum of per-subcarrier operations. To this end, let $\tilde{a}_{i,w|K}^H$ be the $i$th row of $\tilde{A}_{w|K}^{-1}$; then

$$\nu_{i|K}^2 = \frac{E_s}{L} \sum_{w=1}^{L} \tilde{a}_{i,w|K}^H A_w G_w \tilde{a}_{i,w|K} - E_s |\mu_K^{(i)}|^2. \tag{3.12}$$

This expression, however, is computationally intensive, as it involves $L$ matrix multiplications, each requiring $O(U^3)$ operations. In order to reduce the complexity of computing $\nu_{i|K}^2$, we can use the $K = 1$ term NPI approximation

$$\nu_{i|1}^2 = \frac{E_s}{L} \sum_{w=1}^{L} \big( d_w^{(i,i)} \big)^{-2} a_{i,w}^H g_{i,w} - E_s |\mu_1^{(i)}|^2 \tag{3.13}$$

as a substitute for $\nu_{i|K}^2$. Here, $d_w^{(i,i)}$ is the $i$th diagonal entry of $D_w$, $a_{i,w}^H$ is the $i$th row of $A_w$, and $g_{i,w}$ is the $i$th column of $G_w$. This approximation has low computational complexity, as it involves only $L$ inner products, each requiring $U$ operations. In addition, the larger $K$ is, the closer the approximate inversion in (3.18) is to the exact inverse (assuming the Neumann series converges). Hence, for $K > 1$, the exact NPI variance would be lower than $\nu_{i|1}^2$, which reveals that (3.13) is a pessimistic approximation.

We emphasize that we can further reduce the computational complexity of the NPI approximation in (3.13). In particular, let $a_w^{(i,j)}$ be the $i$th entry of the vector $a_{j,w}$ and $g_w^{(i,j)}$ be the $i$th entry of the vector $g_{j,w}$. Since $A_w = G_w + N_0 E_s^{-1} I_U$, we have

the following identity:

$$\big( d_w^{(i,i)} \big)^{-2} a_{i,w}^H g_{i,w} = \big( d_w^{(i,i)} \big)^{-2} \Big( a_w^{(i,i)} g_w^{(i,i)} + \sum_{j, j \neq i} \big( a_w^{(i,j)} \big)^H g_w^{(i,j)} \Big).$$

Since (i) $a_w^{(i,j)} = g_w^{(i,j)}$ for $i \neq j$, (ii) $d_w^{(i,i)} = a_w^{(i,i)}$, and (iii) $d_w^{(i,i)} \gg |a_w^{(i,j)}|$ in the case where $U \ll B$, we can use the approximation $\big( d_w^{(i,i)} \big)^{-2} a_{i,w}^H g_{i,w} \approx \big( d_w^{(i,i)} \big)^{-1} g_w^{(i,i)}$. Hence, we propose the following low-complexity NPI approximation:

$$\nu_i^2 \approx \frac{E_s}{L} \sum_{w=1}^{L} \big( d_w^{(i,i)} \big)^{-1} g_w^{(i,i)} - E_s |\mu_1^{(i)}|^2. \tag{3.14}$$

Our simulations show that the low-complexity NPI approximation (3.14) performs well compared to the exact NPI variance (3.12).

3.4 Implicit Approximate MMSE Detection

We now discuss existing and novel implicit MMSE equalization algorithms. The idea of these methods is to obtain the equalized symbol $\tilde{y}_w$ (or a corresponding approximation) directly, without ever computing the inverse $A_w^{-1}$. As discussed in Section 3.2, the complexity of implicit methods is, in general, lower than that of explicit methods. Nevertheless, exact computation of the post-equalization SINR, as required for LLR computation (3.3), is computationally expensive. To enable soft-output data detection with implicit equalization methods, we propose a low-complexity SINR approximation. For the sake of simplicity, we will omit the subcarrier index $w$.
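The idea of implicit equalization described above, obtaining the equalized vector without ever forming an inverse, can be pictured with a simple fixed-point iteration that uses only matrix-vector products. The sketch below assumes a Jacobi-style update with the inverse main diagonal as preconditioner, chosen here purely for illustration; the function name `implicit_diag_iteration` and the test values are hypothetical.

```python
# Pure-Python sketch; A is a list of rows, vectors are lists. Names are illustrative.

def implicit_diag_iteration(A, y_mf, iters):
    """Approximate the solution of A y = y_mf without forming A^{-1}.

    Assumed update: y_k = y_0 + (I - D^{-1} A) y_{k-1}, with y_0 = D^{-1} y_mf
    and D the main diagonal of A. Each pass costs one matrix-vector product.
    """
    n = len(A)
    d_inv = [1.0 / A[i][i] for i in range(n)]
    y0 = [d_inv[i] * y_mf[i] for i in range(n)]
    y = list(y0)
    for _ in range(iters):
        Ay = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]
        y = [y0[i] + y[i] - d_inv[i] * Ay[i] for i in range(n)]
    return y

# Assumed test case: the exact solution of A y = y_mf is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
y_mf = [1.0, 2.0]
y_hat = implicit_diag_iteration(A, y_mf, 20)
```

Because only vectors are updated, the per-iteration cost in this sketch grows with the square of the matrix dimension rather than with its cube. Python's arithmetic also accepts complex entries, so the same function can be applied to complex-valued inputs.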

Exact Inversion using Implicit Cholesky Decomposition

Implicit equalization methods solve for $\tilde{y}$ directly without computing $A^{-1}$ explicitly. One hardware-friendly approach for implicit equalization is to first perform the Cholesky decomposition to obtain $A = L L^H$. Then, one first solves $L x = y^{\text{MF}}$ for $x$ and subsequently solves $L^H \tilde{y} = x$, where $\tilde{y}$ corresponds to the equalized vector (see, e.g., [57]).

Implicit Accelerated Neumann Recursion

To reduce the computational complexity of implicit equalization, we can perform the following implicit, accelerated Neumann recursion; the proof immediately follows from right-multiplying both sides of (3.7) by $y^{\text{MF}}$.

Lemma 3 (Implicit Accelerated Neumann Recursion). Let $\tilde{y}_0 = \tilde{A}_0^{-1} y^{\text{MF}}$, where the initialization matrix $\tilde{A}_0^{-1}$ satisfies (3.6). Then, with the recursion

$$\tilde{y}_k = \tilde{y}_0 + \left( I_U - \tilde{A}_0^{-1} A \right) \tilde{y}_{k-1} \tag{3.15}$$

for the iterations $k = 1, 2, \ldots$, we recursively obtain $\tilde{y}_k = \tilde{A}_k^{-1} y^{\text{MF}}$ with $\tilde{A}_k^{-1}$ defined in (3.7).

Evidently, the recursion (3.15) can be terminated after $K_{\max}$ iterations to obtain an approximation to (3.2) at low complexity. Furthermore, $\tilde{A}_0^{-1} A$ can be precomputed to avoid recurrent calculations. We note that the recursion in (3.15) is a generalization of the equalization algorithm proposed in [16], which uses $\tilde{A}_0^{-1} = I_U$. Note that this

particular choice only performs well for suitably normalized channel matrices and massive MU-MIMO systems with large BS-to-user-antenna ratios.

Unfortunately, the Schulz recursion and higher-order recursions do not, to the best of our knowledge, have efficient implicit forms. In fact, if we right-multiply both sides of (3.8) or (3.9) by $y^{\text{MF}}$, we see that one needs to keep track of the matrix $\tilde{A}_k^{-1}$ in order to compute $\tilde{y}_k$; this prevents a computationally efficient, implicit recursion with these methods.

Existing Approximate Implicit Equalization Methods

A variety of low-complexity, implicit equalization methods for data detection in massive MU-MIMO systems have been proposed recently [48, 49, 63]. The Richardson method proposed in [63] can be rewritten as

$$\tilde{y}_k = \gamma y^{\text{MF}} + \left( I_U - \gamma A \right) \tilde{y}_{k-1},$$

which corresponds to a special case of the accelerated implicit Neumann series recursion in (3.15) with $\tilde{A}_0^{-1} = \gamma I_U$; the quantity $\gamma$ is an algorithm parameter. We note that the parameter choice in [63] yields $\gamma = (B + U)^{-1}$. Other implicit methods, such as the conjugate gradient (CG) method [48] and the Gauss-Seidel (GS) algorithm [49], are iterative methods that solve systems of linear equations for the positive semidefinite matrix $A$. Both CG and GS converge to the exact solution for a sufficiently large number of iterations. GS is initialized by $\tilde{y}_0 = D^{-1} y^{\text{MF}}$; for CG, we define $\tilde{y}_0$ as the output of the first iteration since the initial guess is an all-zero vector. CG and GS enable approximate equalization at (often) lower complexity [48, 49, 63] than other explicit and implicit algorithms. Sections 3.7 and

compare the computational complexity and performance of these equalization methods, respectively.

LLR Approximation for Implicit Inversion Methods

We can evaluate (3.3) to obtain LLR values for the transmitted bits. Since the proposed implicit methods do not compute the matrix inverse $A_w^{-1}$ (or a corresponding approximation $\tilde{A}_{w|K_{\max}}^{-1}$), computing the quantities $\mu^{(i)}$ and $\rho^{(i)}$ seems difficult. To enable soft-output data detection with implicit equalization algorithms, we propose a novel approximation for $\mu^{(i)}$ and $\rho^{(i)}$ that needs neither the explicit inverse $A_w^{-1}$ nor a corresponding approximation $\tilde{A}_{w|K_{\max}}^{-1}$. We propose to use the effective channel gain for the 0th-term Neumann series approximation (i.e., $K = 0$)

$$\mu^{(i)} \approx \mu_0^{(i)} = L^{-1} \sum_{w=1}^{L} \big( d_w^{(i)} \big)^{-1} g_w^{(i)}, \tag{3.16}$$

where $d_w^{(i)}$ is the $i$th diagonal element of $A_w$ and $g_w^{(i)}$ is the $i$th diagonal element of $G_w$. Analogously, we propose to use the SINR for the 0th-term Neumann series approximation

$$\rho^{(i)} \approx \rho_0^{(i)} = \big( \mu_0^{(i)} \big)^2 / \big( \nu_0^{(i)} \big)^2 = \big( \mu_0^{(i)} \big)^2 \left( E_s \mu_0^{(i)} - E_s |\mu_0^{(i)}|^2 \right)^{-1}.$$

As will be demonstrated in Section 2.2.4, the resulting LLR approximation enables implicit equalizers that achieve near-optimal performance at low computational complexity.

3.5 Initialization Matrices for Approximate MMSE Detection

The proposed series-based explicit and implicit methods in Sections 3.3 and 3.4 require a suitable initialization matrix $\tilde{A}_0^{-1}$ that not only improves the probability of convergence, i.e., the probability that the initialization matrix satisfies (3.6), but also improves the accuracy of the approximated matrix inverse. We next discuss existing choices for $\tilde{A}_0^{-1}$ and propose novel methods that lead to improved error-rate performance.

Relevance of the Initialization Matrix

We first show that the choice of the initialization matrix $\tilde{A}_0^{-1}$ directly affects the performance of (explicit and implicit) approximate equalizers that use truncated series expansions.

Lemma 4 (Residual Estimation Error). Let $\tilde{y}_{K_{\max}} = \tilde{A}_{K_{\max}}^{-1} y^{\text{MF}}$ be the result of an approximate equalizer using the truncated series expansions. Define the residual estimation error as $e_{K_{\max}} = \tilde{y}_{K_{\max}} - A^{-1} y^{\text{MF}}$.

Then, we have the following upper bound on the residual estimation error:

$$\| e_{K_{\max}} \| \le \| I_U - \tilde{A}_0^{-1} A \|^{K_{\max}+1} \, \| \tilde{y} \|,$$

where $\tilde{y} = A^{-1} y^{\text{MF}}$ is the estimate obtained through the exact equalizer and $\| \cdot \|$ is a consistent matrix norm.

Proof. We start by rewriting the residual error term as a function of $\tilde{A}_0^{-1}$. In particular, we have the following identities:

$$e_{K_{\max}} = \tilde{y}_{K_{\max}} - A^{-1} y^{\text{MF}} = \big( \tilde{A}_{K_{\max}}^{-1} - A^{-1} \big) y^{\text{MF}} = -\Big( \sum_{k=K_{\max}+1}^{\infty} \big( I_U - \tilde{A}_0^{-1} A \big)^k \tilde{A}_0^{-1} \Big) y^{\text{MF}} = -\big( I_U - \tilde{A}_0^{-1} A \big)^{K_{\max}+1} A^{-1} y^{\text{MF}}.$$

By definition of induced norms, we get the following inequality:

$$\| e_{K_{\max}} \| \le \| I_U - \tilde{A}_0^{-1} A \|^{K_{\max}+1} \, \| \tilde{y} \|.$$

It is evident that by reducing $\| I_U - \tilde{A}_0^{-1} A \|$, we reduce the residual estimation error. Furthermore, if (3.6) holds, then increasing the number of accelerated Neumann series terms $k$ forces the residual estimation error to zero, i.e., the series expansion becomes exact in the limit. Hence, it is of utmost importance to choose an initialization matrix $\tilde{A}_0^{-1}$ that

minimizes $\| I_U - \tilde{A}_0^{-1} A \|$ in order to minimize the residual error and, consequently, the error rate of approximate linear equalization.

Existing Initialization Matrices

The most common initialization matrices that satisfy (3.6) are of the form $\tilde{A}_0^{-1} = \alpha A^H$, where $\alpha > 0$ is a carefully chosen scalar. For example, reference [62] postulates the use of $\alpha^{-1} = (\lambda_{\max} + \lambda_{\min})/2$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of $A^H A$, respectively. This choice minimizes the left-hand side of (3.6) when assuming the spectral norm. Unfortunately, the complexity required to compute the largest and smallest eigenvalues of $A^H A$ is, in general, larger than that of computing the inverse $A^{-1}$ itself, which renders this method unattractive in practice. Related approaches that ensure (3.6) while requiring lower complexity are, for example, $\alpha^{-1} = \| A A^H \| / 2$ or $\alpha^{-1} = \| A \|_1 \| A \|_\infty$ [61, 64]. Reference [63] proposes $\tilde{A}_0^{-1} = (U + B)^{-1} I_U$, which only converges in the large-antenna limit for i.i.d. circularly symmetric complex Gaussian channel matrices $H$ and does, in general, perform better than $\tilde{A}_0^{-1} = D^{-1}$. This method, however, may still diverge for not-so-massive systems with small BS-to-user-antenna ratios (see Section 3.5.5).

$D^{-1}$, A Simple Initialization Matrix

For large-scale MIMO systems, where the number of receive antennas is larger than the number of single-antenna users, i.e., for $U \ll B$, the Gram matrices $G$, and,

consequently $A$, become diagonally dominant [16]. In fact, for i.i.d. Gaussian channel matrices $H$ (with properly normalized entries) and in the large-antenna limit, [15] shows that $G \to I_U$. Inspired by this central property of large-scale MIMO, one can derive a low-complexity approximation of the inverse. In particular, let $A \approx D$, where $D$ is the main diagonal of $A$. As a result, the inverse $A^{-1}$ can be approximated by $D^{-1}$, i.e., let $\tilde{A}_0^{-1} = D^{-1}$, which evidently requires much lower complexity than the exact inverse. Unfortunately, for realistic antenna/user configurations, such a crude approximation would cause a significant performance loss.

We start by rewriting the inverse $A^{-1}$ by expanding (3.5) using $\tilde{A}_0^{-1} = D^{-1}$. The Neumann series then reads

$$A_w^{-1} = \sum_{n=0}^{\infty} \left( -D_w^{-1} E_w \right)^n D_w^{-1}, \tag{3.17}$$

which is guaranteed to converge if $\lim_{n \to \infty} (I_U - D_w^{-1} A_w)^n = 0_{U \times U}$ (or, equivalently, $\lim_{n \to \infty} (-D_w^{-1} E_w)^n = 0_{U \times U}$) is satisfied. Then, by keeping only the first $K_{\max}$ terms of the Neumann series (3.17), we obtain a truncated Neumann series approximation as follows:

$$\tilde{A}_{K_{\max}}^{-1} = \sum_{n=0}^{K_{\max}} \left( -D^{-1} E \right)^n D^{-1}, \tag{3.18}$$

which can be computed at low computational complexity for approximations consisting of only a few Neumann series terms, i.e., for small values of $K_{\max}$.
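The diagonal dominance that motivates the choice $D^{-1}$ can be checked numerically. The sketch below draws an i.i.d. circularly symmetric complex Gaussian channel matrix with unit-variance entries, forms the Gram matrix, and measures the Frobenius norm of the diagonally normalized off-diagonal part. The regularization term is omitted (an assumption made for brevity), and the function name `offdiag_ratio`, the seed, and the antenna/user counts are illustrative choices.

```python
import math
import random

def offdiag_ratio(B, U, rng):
    """Return ||D^{-1} E||_F for G = H^H H, with H an i.i.d. complex Gaussian B x U matrix.

    Real and imaginary parts each have variance 1/2, so entries have unit variance.
    The regularization term is left out, i.e., A = G in this sketch.
    """
    s = math.sqrt(0.5)
    H = [[complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for _ in range(U)]
         for _ in range(B)]
    G = [[sum(H[k][i].conjugate() * H[k][j] for k in range(B))
          for j in range(U)] for i in range(U)]
    total = 0.0
    for i in range(U):
        for j in range(U):
            if i != j:
                total += abs(G[i][j] / G[i][i]) ** 2  # row-wise scaling by D^{-1}
    return math.sqrt(total)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
ratio_large = offdiag_ratio(256, 4, rng)  # large BS-to-user-antenna ratio
ratio_small = offdiag_ratio(8, 4, rng)    # small BS-to-user-antenna ratio
```

With many more BS antennas than users, the measured norm is expected to be well below one, which is the regime in which a few series terms already give a useful approximation; for small ratios no such guarantee applies.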

With this approximation, the resulting approximate MMSE equalization matrix is given by $\tilde{W}_{K_{\max}} = \tilde{A}_{K_{\max}}^{-1} H^H$. For $K_{\max} = 0$, we obtain $\tilde{A}_0^{-1} = D^{-1}$, which is simply a scaled version of the MF detector, as $\tilde{W}_0 = D^{-1} H^H$. We emphasize that the row-wise scaling induced by $D^{-1}$ does not affect the detection process, as long as $D^{-1}$ exists. Hence, the proposed approximation (3.18) simply coincides with the MF detector for $K_{\max} = 0$. For $K_{\max} = 1$, we obtain $\tilde{A}_1^{-1} = D^{-1} - D^{-1} E D^{-1}$, whose computational complexity only scales with $O(U^2)$ operations; this is in contrast to the $O(U^3)$ complexity scaling required by computing an exact inverse. Hence, this two-term Neumann series approximation can be obtained at lower computational complexity. For $K_{\max} = 2$, we obtain

$$\tilde{A}_2^{-1} = D^{-1} - D^{-1} E D^{-1} + D^{-1} E D^{-1} E D^{-1}, \tag{3.19}$$

whose complexity scales with $O(U^3)$, which is equivalent to that of an exact inverse. Nevertheless, evaluating (3.19) requires fewer arithmetic operations than an explicit evaluation of $A^{-1}$. Note that for $K \ge 3$, computing the exact inverse can be of lower complexity than the proposed approximation, e.g., when using a Cholesky factorization.

Analysis of the Approximation Error

We next analytically characterize the error induced by the approximate inverse (3.18) for MMSE estimation. To this end, we define the approximation error as $\Delta_{K_{\max}} =$

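$A^{-1} - \tilde{A}_{K_{\max}}^{-1}$, which is equivalent to

$$\Delta_{K_{\max}} = \sum_{n=K_{\max}}^{\infty} \left( -D^{-1} E \right)^n D^{-1} = \left( -D^{-1} E \right)^{K_{\max}} \sum_{n=0}^{\infty} \left( -D^{-1} E \right)^n D^{-1} = \left( -D^{-1} E \right)^{K_{\max}} A^{-1}.$$

Now, consider the situation of using the approximation $\tilde{A}_{K_{\max}}^{-1}$ in place of $A^{-1}$ to compute the equalized frequency-domain symbols, i.e.,

$$\hat{s}_{K_{\max}} = \tilde{A}_{K_{\max}}^{-1} H^H y = A^{-1} y^{\text{MF}} - \Delta_{K_{\max}} y^{\text{MF}}$$

with $y^{\text{MF}} = H^H y$ and $\hat{s} = A^{-1} y^{\text{MF}}$ being the exact estimate. We can bound the $\ell_2$-norm of the residual estimation error resulting from this approximate equalization by

$$\| \Delta_{K_{\max}} y^{\text{MF}} \|_2 = \| ( -D^{-1} E )^{K_{\max}} A^{-1} y^{\text{MF}} \|_2 \le \| ( -D^{-1} E )^{K_{\max}} \|_F \, \| A^{-1} y^{\text{MF}} \|_2 \le \| D^{-1} E \|_F^{K_{\max}} \, \| \hat{s} \|_2. \tag{3.20}$$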

From (3.20), we see that if the condition

$$\| D^{-1} E \|_F < 1 \tag{3.21}$$

is satisfied, then the approximation error approaches zero exponentially fast as $K \to \infty$. Moreover, one can show that (3.21) is a sufficient condition for (3.17) to converge.

We now show that the condition $\| D^{-1} E \|_F < 1$ is satisfied with high probability for large-scale MIMO systems with a larger number of BS antennas $B$ than users $U$, if the entries of $H \in \mathbb{C}^{B \times U}$ are assumed to be i.i.d. circularly symmetric complex Gaussian with unit variance. More specifically, we arrive at a condition that only depends on $U$ and $B$ for (i) the proposed Neumann series to converge and (ii) the residual approximation error (3.20) to be small.

Lemma 5. Let the scalars $x^{(k)}$ and $y^{(k)}$ for $k = 1, \ldots, B$ be i.i.d. circularly symmetric complex Gaussian with unit variance. Then, $E\big[ \big| \sum_{k=1}^{B} x^{(k)} y^{(k)} \big|^4 \big] = 2B(B + 1)$.

Proof. We have

$$E\Big[ \Big| \sum_{k=1}^{B} x^{(k)} y^{(k)} \Big|^4 \Big] = E\Big[ \Big( \sum_{k=1}^{B} x^{(k)} y^{(k)} \Big)^2 \Big( \sum_{k=1}^{B} \big( x^{(k)} y^{(k)} \big)^* \Big)^2 \Big] = 4 \binom{B}{2} \Big( E\big[ |x^{(k)}|^2 |y^{(k)}|^2 \big] \Big)^2 + B \, E\big[ |x^{(k)}|^4 |y^{(k)}|^4 \big] = 2B(B - 1) + 4B = 2B^2 + 2B.$$

The above steps can be summarized as follows. After expanding the quadratic

expression, the non-zero terms are either of the form $|x^{(k)}|^4 |y^{(k)}|^4$ or products of two factors of the form $|x^{(k)}|^2 |y^{(k)}|^2$ with distinct indices, where $k = 1, \ldots, B$. There are $B$ terms of the first kind and $4 \binom{B}{2} = 2B(B - 1)$ terms of the second kind. The facts that $E\big[ |x^{(k)}|^4 \big] = E\big[ |y^{(k)}|^4 \big] = 2$ and $E\big[ |x^{(k)}|^2 \big] = E\big[ |y^{(k)}|^2 \big] = 1$ conclude the proof.

Lemma 6. Let $B > 4$ and $x^{(k)}$, $k = 1, \ldots, B$, be i.i.d. circularly symmetric complex Gaussian with unit variance and $g = \sum_{k=1}^{B} |x^{(k)}|^2$. Then,

$$E\big[ |g^{-1}|^4 \big] = \big( (B - 1)(B - 2)(B - 3)(B - 4) \big)^{-1}. \tag{3.22}$$

Proof. We first rewrite $g$ as $2^{-1} \sum_{k=1}^{2B} |s^{(k)}|^2$, where $s^{(k)}$, $k = 1, \ldots, 2B$, are i.i.d. zero-mean real-valued Gaussian with unit variance. Then, $(2g)^{-1}$ is an inverse chi-square random variable with $2B$ degrees of freedom. The inverse chi-square distribution with $2B$ degrees of freedom corresponds to an inverse-gamma distribution. The 4th moment of this inverse chi-square distribution is given by $\frac{1}{16} \cdot \frac{1}{(B - 1)(B - 2)(B - 3)(B - 4)}$ [65] and, hence, we obtain (3.22).

Lemma 7. Let $B > 4$ and the entries of $H \in \mathbb{C}^{B \times U}$ be i.i.d. circularly symmetric complex Gaussian with unit variance. Then, we have

$$E\big[ \| D^{-1} E \|_F^2 \big] \le (U^2 - U) \sqrt{ \frac{2B(B + 1)}{(B - 1)(B - 2)(B - 3)(B - 4)} }.$$

Proof. The regularized Gram matrix corresponds to $A = D + E = G + N_0 E_s^{-1} I_U$.

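Thus, the element in the $i$th row and $j$th column of $A$, denoted by $a^{(i,j)}$, can be written as

$$a^{(i,j)} = \begin{cases} g^{(i,j)} = \sum_{k=1}^{B} \big( h^{(k,i)} \big)^* h^{(k,j)}, & i \neq j, \\ g^{(i,i)} + N_0 E_s^{-1} = \sum_{k=1}^{B} |h^{(k,i)}|^2 + N_0 E_s^{-1}, & i = j, \end{cases}$$

with $g^{(i,j)}$ corresponding to the entry in the $i$th row and $j$th column of the Gram matrix $G$. We now have the following inequality:

$$E\big[ \| D^{-1} E \|_F^2 \big] = E\Big[ \sum_{i=1}^{U} \sum_{j=1, j \neq i}^{U} \Big| \frac{g^{(i,j)}}{a^{(i,i)}} \Big|^2 \Big] \le \sum_{i=1}^{U} \sum_{j=1, j \neq i}^{U} E\Big[ \Big| \frac{g^{(i,j)}}{g^{(i,i)}} \Big|^2 \Big],$$

which is obtained by omitting the non-negative regularization term $N_0 E_s^{-1}$. By applying the Cauchy-Schwarz inequality, we can bound $E\big[ \| D^{-1} E \|_F^2 \big]$ from above as

$$E\big[ \| D^{-1} E \|_F^2 \big] \le \sum_{i=1}^{U} \sum_{j=1, j \neq i}^{U} \sqrt{ E\big[ |g^{(i,j)}|^4 \big] \, E\big[ \big| (g^{(i,i)})^{-1} \big|^4 \big] }.$$

Applying Lemmata 5 and 6 to the first and second expected values, respectively, we obtain

$$E\big[ \| D^{-1} E \|_F^2 \big] \le \sum_{i=1}^{U} \sum_{j=1, j \neq i}^{U} \sqrt{ \frac{2B(B + 1)}{(B - 1)(B - 2)(B - 3)(B - 4)} } = (U^2 - U) \sqrt{ \frac{2B(B + 1)}{(B - 1)(B - 2)(B - 3)(B - 4)} }.$$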
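To get a feeling for the bound of Lemma 7, it can be evaluated for a few antenna and user counts. The following sketch simply codes the right-hand side of the lemma; the function name `lemma7_bound` and the example configurations are illustrative choices, not values reported in the thesis.

```python
import math

def lemma7_bound(B, U):
    """Right-hand side of Lemma 7: an upper bound on E[ ||D^{-1} E||_F^2 ]."""
    if B <= 4:
        raise ValueError("Lemma 7 assumes B > 4")
    ratio = 2.0 * B * (B + 1) / ((B - 1) * (B - 2) * (B - 3) * (B - 4))
    return (U * U - U) * math.sqrt(ratio)

# Hypothetical configurations: U = 8 users with B = 32 or B = 128 BS antennas.
bound_32_8 = lemma7_bound(32, 8)
bound_128_8 = lemma7_bound(128, 8)
```

For the eight-user example, the bound exceeds one at 32 antennas and falls below one at 128 antennas, which illustrates how the bound tightens as the BS-to-user-antenna ratio grows.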

Theorem 8. Let $B > 4$ and the entries of $H \in \mathbb{C}^{B \times U}$ be i.i.d. circularly symmetric complex Gaussian with unit variance. Then, we have

$$\Pr\big\{ \| D^{-1} E \|_F^{K_{\max}} < \alpha \big\} \ge 1 - \frac{U^2 - U}{\alpha^{2 / K_{\max}}} \sqrt{ \frac{2B(B + 1)}{(B - 1)(B - 2)(B - 3)(B - 4)} }. \tag{3.23}$$

Proof. To prove Theorem 8, we start by using Markov's inequality to obtain the following straightforward inequality:

$$\Pr\big\{ \| D^{-1} E \|_F^{K_{\max}} \ge \alpha \big\} = \Pr\big\{ \| D^{-1} E \|_F^2 \ge \alpha^{2 / K_{\max}} \big\} \le \alpha^{-2 / K_{\max}} \, E\big[ \| D^{-1} E \|_F^2 \big].$$

With $\Pr\{ \| D^{-1} E \|_F^{K_{\max}} < \alpha \} = 1 - \Pr\{ \| D^{-1} E \|_F^{K_{\max}} \ge \alpha \}$ and by using the upper bound for $E[ \| D^{-1} E \|_F^2 ]$ from Lemma 7, we finally obtain (3.23).

The result in (3.23) also holds for the case where the regularization term $N_0 E_s^{-1}$ vanishes, which corresponds to ZF detection. As a consequence, the condition (3.23) is rather pessimistic and is likely to be sub-optimal, especially for $N_0 E_s^{-1} > 0$. The derivation of a tighter condition is left for future work.

We emphasize that this theorem provides conditions for which the Neumann series converges with a certain probability; this can be seen by setting $\alpha = 1$ and $K_{\max} = 1$ and by inspecting the convergence condition (3.21). Furthermore, Theorem 8 provides conditions for which the residual estimation error (3.20) is small. In both cases, we can see from Theorem 8 that increasing the ratio between the number of

BS antennas $B$ and the number of users $U$ increases the probability of convergence. Moreover, for $\alpha < 1$, increasing $K_{\max}$ also increases the probability that the residual estimation error caused by a $K_{\max}$-term approximation in (3.20) is smaller than $\alpha$.

We note that Theorem 8 also provides insight into the behavior in the large-antenna limit, i.e., for $B \to \infty$ while $U$ is held constant. In this case, we have $\Pr\{ \| D^{-1} E \|_F^{K_{\max}} < \alpha \} \to 1$ for $\alpha \in (0, 1]$, which implies (i) that the Neumann series converges with probability 1 and (ii) that the approximation error for any $K_{\max}$ approximation is arbitrarily small, which includes the MF detector (corresponding to $K_{\max} = 0$). We note that this behavior is in accordance with existing results for MF detection in large-scale MIMO systems [15, 66].

Two New Initialization Matrices

We next propose two new initialization matrices that can be computed at low complexity and result in small approximation errors even for a few iterations $K_{\max}$. In addition, as we will show in Section 3.5.5, the proposed initializers outperform the methods discussed in Section. We emphasize that the initialization matrices proposed next are suitable for any matrix inversion method that uses (truncated) series expansions and hence, for applications beyond data detection.

The first initialization method requires low complexity; see Section for a discussion. We derive this initialization method from condition (3.6).

Initialization 1. Let $D$ contain the main diagonal of $A$. Then, the initialization

$\tilde{A}_0^{-1} = \alpha^{\text{opt}} D^{-1}$ with $\alpha^{\text{opt}} = U / \| D^{-1} A \|_F^2$ minimizes (3.6) and hence, increases the probability of convergence of the (truncated) series expansions.

Proof. We start by noting that squaring both sides in (3.6) results in the equivalent sufficient condition $\| I_U - \tilde{A}_0^{-1} A \|^2 < 1$. Furthermore, by assuming the spectral norm (which is a consistent norm), we have $\| I_U - \tilde{A}_0^{-1} A \|_2^2 \le \| I_U - \tilde{A}_0^{-1} A \|_F^2$, which enables us to obtain a more restrictive sufficient condition that allows the design of efficient initializers:

$$\| I_U - \tilde{A}_0^{-1} A \|_F^2 < 1. \tag{3.24}$$

The initialization method developed next is of the form $\tilde{A}_0^{-1} = S D^{-1}$, where $S$ is a diagonal scaling matrix that is designed to meet condition (3.24). The proposed initialization scheme holds even if we replace $D^{-1}$ with an arbitrary matrix $X$ that is close to the exact inverse $A^{-1}$. Let $W$ contain the diagonal part of $D^{-1} A$ and $Q$ the off-diagonal part. We define

$$f = \| I_U - S D^{-1} A \|_F^2 = \| I_U - S (W + Q) \|_F^2 = \| I_U - S W \|_F^2 + \| S Q \|_F^2, \tag{3.25}$$

and seek a diagonal scaling matrix $S$ that minimizes $f$. We define the diagonal scaling matrix to have the form $S = \alpha I_U$, which leads to $f = \sum_{i=1}^{U} |1 - \alpha W_{i,i}|^2 + |\alpha|^2 \| Q \|_F^2$. We now find the optimum scaling parameter $\alpha^{\text{opt}}$

by computing $\partial f / \partial \alpha = 0$ and solving for $\alpha$. Standard manipulations yield

$$\alpha^{\text{opt}} = \frac{\sum_{i=1}^{U} W_{i,i}^*}{\| D^{-1} A \|_F^2},$$

where $W_{i,i}^*$ are the complex conjugates of the diagonal entries of $W$. Since $W = I_U$, we get $\alpha^{\text{opt}} = U / \| D^{-1} A \|_F^2$. Consequently, the first initialization matrix is $\tilde{A}_0^{-1} = \alpha^{\text{opt}} D^{-1}$.

The second initialization scheme refines Initialization 1 at slightly higher computational complexity; see Section for a discussion.

Initialization 2. Let $D$ be the main diagonal of $A$. Furthermore, let $Q$ contain the off-diagonal part of $D^{-1} A$. Then, the initialization $\tilde{A}_0^{-1} = \mathrm{diag}(\alpha_1^{\text{opt}}, \ldots, \alpha_U^{\text{opt}}) D^{-1}$ with $\alpha_i^{\text{opt}} = 1 / (1 + \| r_i \|_2^2)$, where $r_i$ is the $i$th row of $Q$, minimizes (3.6) and hence, increases the probability of convergence of the (truncated) series expansions.

Proof. For the second initialization method, we derive a more general diagonal scaling matrix of the form $S = \mathrm{diag}(\alpha_1, \ldots, \alpha_U)$. We obtain

$$f = \sum_{i=1}^{U} \left( |1 - \alpha_i W_{i,i}|^2 + |\alpha_i|^2 \| r_i \|_2^2 \right),$$

where $r_i$ corresponds to the $i$th row of $Q$. To find the optimal scaling parameters $\alpha_i$,

$i = 1, \ldots, U$, we set $\partial f / \partial \alpha_i = 0$ and solve for $\alpha_i$. Standard manipulations yield

$$\alpha_i^{\text{opt}} = \frac{W_{i,i}^*}{|W_{i,i}|^2 + \| r_i \|_2^2}, \quad i = 1, \ldots, U,$$

and we use the fact that $W_{i,i} = 1$, $\forall i$. Consequently, the second initialization matrix is $\tilde{A}_0^{-1} = \mathrm{diag}(\alpha_1^{\text{opt}}, \ldots, \alpha_U^{\text{opt}}) D^{-1}$.

We conclude by noting that both of these initialization schemes do not, in general, guarantee convergence according to (3.6). Nevertheless, they exhibit (often significantly) faster convergence than the ones discussed in Section and converge with high probability, even for small BS-to-user-antenna ratios. We next discuss the empirical convergence behavior of all discussed initialization matrices.

Comparison of Empirical Convergence Behavior

To assess the convergence behavior of approximate equalizers using the truncated series expansions for different initialization matrices, we generate random $B \times U$ matrices $H$ and $U$-dimensional vectors $x$, where the entries are i.i.d. circularly symmetric complex Gaussian with unit variance. For each matrix and vector pair, we compute $y = Hx$. Given $H$ and $y$, we first compute the Gram matrix $A = H^H H$ and perform recursive matrix inversion using the accelerated Neumann recursion (3.7), the Schulz recursion (3.8), and the third-order recursion (3.9). We then use the approximate inverse $\tilde{A}_k^{-1}$ to obtain $x_k = \tilde{A}_k^{-1} H^H y$. In addition, we also estimate $x_k$ using CG [48] and GS [49], which are both implicit methods.
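The two diagonal scalings derived above (Initializations 1 and 2) depend only on the rows of the diagonally normalized matrix, so they can be computed in a single pass over the matrix. The sketch below is a minimal illustration under that reading; the function names `diag_scaled_initializers` and `residual_fro_sq` and the 2x2 test matrix are hypothetical.

```python
# Pure-Python sketch; A is a list of rows. All names are illustrative.

def diag_scaled_initializers(A):
    """Return (alpha_opt, [alpha_opt_1, ..., alpha_opt_U]) for Initializations 1 and 2."""
    U = len(A)
    # ||r_i||_2^2: squared norm of the off-diagonal part of row i of D^{-1} A.
    row_off = [sum(abs(A[i][j] / A[i][i]) ** 2 for j in range(U) if j != i)
               for i in range(U)]
    fro_sq = U + sum(row_off)  # ||D^{-1} A||_F^2, using the unit diagonal W = I_U
    alpha = U / fro_sq
    alphas = [1.0 / (1.0 + r) for r in row_off]
    return alpha, alphas

def residual_fro_sq(A, scales):
    """||I_U - S D^{-1} A||_F^2 for the diagonal scaling S = diag(scales)."""
    U = len(A)
    total = 0.0
    for i in range(U):
        for j in range(U):
            target = 1.0 if i == j else 0.0
            total += abs(target - scales[i] * A[i][j] / A[i][i]) ** 2
    return total

# Assumed test case: a diagonally dominant 2x2 matrix.
A = [[4.0, 1.0], [1.0, 3.0]]
alpha, alphas = diag_scaled_initializers(A)
f_plain = residual_fro_sq(A, [1.0, 1.0])     # unscaled D^{-1}
f_init1 = residual_fro_sq(A, [alpha, alpha])  # Initialization 1
f_init2 = residual_fro_sq(A, alphas)          # Initialization 2
```

In this example, both scalings reduce the squared Frobenius norm in (3.24) relative to the unscaled inverse diagonal, and the per-row scaling of Initialization 2 reduces it slightly further.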

Figure 3.1 compares the relative error (RE), defined as

$$\mathrm{RE}(k) = \frac{\| x - x_k \|_2}{\| x \|_2},$$

at iteration $k$ between the exact and the approximate solution for the Neumann recursion (Figure 3.1(a)), the Schulz recursion (Figure 3.1(b)), and the third-order recursion (Figure 3.1(c)). We report the average RE over 10,000 Monte-Carlo trials. By comparing the convergence behavior of the Neumann recursion (Figure 3.1(a)) against the Schulz recursion (Figure 3.1(b)) and the third-order recursion (Figure 3.1(c)), we see that the average RE decreases faster for higher-order recursions. Although CG and GS outperform the methods that use a truncated Neumann series, both are inherently implicit and do not compute an approximate matrix inverse.

Figure 3.1 also shows the performance of different initialization matrices. Traditional initializers, such as $\tilde{A}_0^{-1} = \alpha A^H$ with (i) $\alpha = 2 (\lambda_{\max} + \lambda_{\min})^{-1}$, (ii) $\alpha = 2 ( \| A A^H \| )^{-1}$, and (iii) $\alpha = ( \| A \|_1 \| A \|_\infty )^{-1}$, always converge, with (i) leading to the fastest convergence among these methods. The initialization matrix $D^{-1}$ proposed specifically for massive MU-MIMO leads to faster convergence than the traditional initialization matrices for large BS-to-user-antenna ratios but diverges as $k \to \infty$ for small ratios. The Richardson method [63] uses $\tilde{A}_0^{-1} = (B + U)^{-1} I_U$, exhibits convergence similar to that of $D^{-1}$, and also diverges for small BS-to-user-antenna ratios. The proposed Initializers 1 and 2 enable faster convergence than all other initialization matrices for all BS-to-user-antenna ratios. Since both of the proposed initializers

exhibit similar convergence behavior, we prefer Initialization 1 as it is slightly less complex (see Section for a discussion).

3.6 Frequency-Adaptive Equalizer (FADE)

We now propose the frequency-adaptive equalizer (FADE), which combines the advantages of implicit, explicit, exact, and approximate equalization methods, and achieves near-exact performance at very low computational complexity.

Exploiting Correlation in Multipath Wireless Channels

Practical multipath channels in wideband communication systems typically exhibit correlation across time and frequency. In fact, under a wide-sense stationary uncorrelated scattering (WSSUS) channel model, the TD correlation between symbols depends on the Doppler spread and the FD correlation between subcarriers depends on the delay spread [67]. Existing wideband systems that rely on OFDM and SC-FDMA already exploit time and frequency correlation to estimate the channel coefficients. For example, 3GPP LTE-A [40] embeds pilot symbols across frequency and time in the transmitted signal, which allows the receiver to estimate the channel coefficients by means of interpolation.

FD correlation has also been exploited to reduce the complexity of linear data detection in traditional small-scale MIMO systems [68, 69]. Reference [68] proposes to compute an explicit inverse only at a given number of subcarriers (so-called base points), whereas the inverses at the remaining subcarriers are computed through interpolation. Given a sufficiently large number of base points (depending on the delay spread), this method was shown to be exact. The drawback of such exact, interpolation-based matrix inversion methods for massive MU-MIMO is the high computational complexity caused by rather long interpolation filters [69]. Inspired by these algorithms, we next propose an approximate interpolation-based equalization method that achieves excellent performance at low complexity.

Frequency Adaptive Equalizer (FADE)

The key idea of FADE is to exploit correlation across frequency (and possibly time) and to take advantage of explicit and implicit equalization schemes. As in [68, 69], we first compute an explicit matrix inverse at a given (small) set $\Omega$ of subcarriers (base points). This set of base points can either be preassigned or varied on the fly depending on the channel conditions for better performance. Given the inverse matrix $A_w^{-1}$ at base point $w \in \Omega$, we can approximate the matrix inverse at nearby (adjacent) subcarriers $w'$ by using one of the accelerated recursions in Section with $A_w^{-1}$ as the initialization matrix (i.e., $\tilde{A}_{w'|0}^{-1} = A_w^{-1}$). For example, the matrix inverse $A_{w'}^{-1}$ at a neighboring subcarrier $w' = w + 1$ can be approximated by computing one explicit recursion of the Neumann series in (3.15)

84 73 using Ã 1 w+1 0 = A 1 w as follows: Ã 1 w+1 1 = A 1 w + ( ) I U A 1 w A w+1 A 1 w = 2A 1 w A 1 w A w+1 A 1 w. (3.26) Unfortunately, the operation count of (3.26) is dominated by two matrix multiplications, which is higher than that of a Cholesky decomposition. To reduce the complexity, we perform implicit equalization on neighboring subcarriers instead, i.e., ỹ w+1 = 2A 1 w y MF w+1 A 1 w A w+1 A 1 w y MF w+1, (3.27) where y MF w+1 = H H w+1y w+1. Besides computationally-efficient matrix vector products, this implicit, approximate equalizer still requires computation of the regularized Gram matrix A w+1. It is, however, crucial to realize that the Gram matrix does not need to be computed when rewriting (3.27) as ỹ w+1 = 2A 1 w y MF w+1 (3.28) ( A 1 w H H w+1 H w+1 A 1 w yw+1 MF + N 0 Es 1 A 1 w yw+1) MF. By precomputing A 1 w y MF w+1, all subsequent operations in (3.28) consist of matrixvector multiplications; this is the main reason why FADE exhibits low complexity. Since FADE avoids computation of a matrix inverse A 1 w for adjacent subcarriers

85 74 w / Ω, we compute the quantities µ (i) and ρ (i) using the approximations outlined in Section for implicit methods to compute approximate LLR values. Note that these approximations requires the computation of D 1, which can be obtained at low complexity from reciprocals of the squared column norms of the channel matrix H. We expect this method to require fewer base-points to achieve good performance as B grows large. As pointed out by [70], the Gram matrices G w between neighboring subcarriers become similar in magnitude (or flat) as B due to channel hardening. As a result, by decreasing the number of base points as B grows large, we can maintain good performance while reducing the number of matrix inverse required. Although FADE as discussed above only exploits FD correlation, it can be extended to exploit correlation in the time domain as well. For example, the inverse of the w th subcarrier of the (t 1) th symbol can be used as the initial estimate for the w th subcarrier of t th symbol. Fig. 3.2 illustrates how FADE can exploit FD and TD correlations. In the following, we will show that FADE is not only computationally extremely efficient but also achieves near-exact performance, even for small BS-touser-antenna ratios. 3.7 Computational Complexity We now compare the computational complexity of existing and proposed (approximate) inversion methods in terms of real-valued multiplications. For each complex-valued multiplication, we assume four real-valued multiplications and two real-valued ad-

86 75 Table 3.1 : Complexity of different initialization methods. 2 A H 1 A H (U + B) 1 I U D 1 Init. 1 Init. 2 2U U 2 + U + 1 2U 2 + U ditions. We also exploit symmetries (e.g., the fact that A is Hermitian) and avoid multiplications with zeros and ones Initialization Methods Table 3.1 compares the complexity for all initialization methods discussed in Section that can be implemented without an eigenvalue decomposition. Note that these complexity results ignore the computation of the regularized Gram matrix A. Evidently, computing D 1 as in [26] and (U + B) 1 I U as in [63] does not require any multiplications. The complexity of the traditional initializer 2 A H 1 A H is dominated by the term A H 1 and requires a total of 2U real-valued multiplications. The complexity of the proposed initialization methods, Initialization 1 and 2, is 2U 2 +U +1 and 2U 2 + U, respectively. Both of them are dominated by the computation of the entry-wise norm of D 1 A. While the multiplication count for both initializers are very similar, Initialization 1 requires only one reciprocal operation whereas Initialization 2 requires U such operations. In what follows, we exclusively focus on Initialization 1, since Initialization 2 provides only slightly better performance (cf. Section 3.5.5).

Table 3.2: Complexity of exact and approximate explicit matrix inversion methods for $K_{\max} \geq 1$.
Method | $\alpha D^{-1}$ | $\alpha A^H$
Neumann recursion | $2BU^2 + (K_{\max} - 1)(2U^3) + (2U^2 - 2U)$ | $2BU^2 + (K_{\max} + 1)(2U^3)$
Schulz recursion | $2BU^2 + (K_{\max} - 1)(6U^3) + (2U^2 - 2U)$ | $2BU^2 + K_{\max}\,4U^3$
3rd order recursion | $2BU^2 + K_{\max}\,10U^3$ | $2BU^2 + K_{\max}\,6U^3$
Cholesky decomposition | $2BU^2 + [\ldots]$

Explicit Series Expansions and Exact Inversion

Table 3.2 compares the complexity of exact inversion via the Cholesky factorization and of the various explicit series expansions discussed in Section 3.3. All results in this table include the complexity of computing $A$, which is necessary for all considered explicit methods. The complexity of Cholesky-based exact inversion scales with $U^3$ and is lower than the complexity of a standard matrix multiplication. We note that one could use more efficient matrix-multiplication algorithms, such as the Strassen algorithm, which scales with $U^{\log_2 7} \approx U^{2.81}$ [71]. The irregularity of such algorithms, however, renders efficient hardware designs difficult. The complexity of the proposed explicit series expansions depends on two factors: the initialization matrix and the iteration count $K_{\max}$.

Impact of the Initialization Matrix. The initialization matrix $\alpha D^{-1}$, for example, causes the intermediate terms $\tilde{A}_K^{-1} A$ to be non-Hermitian in general. In contrast, using the traditional $A^H$-based initializer ensures that $\tilde{A}_K^{-1} A$ is Hermitian, which leads to different operation counts for these two initialization matrices.

Impact of the Iteration Count. The case $K_{\max} = 0$ corresponds to the computation of the initialization matrix as summarized in Table 3.1. For $K_{\max} = 1$, the Neumann series expansion with the initialization matrix $\alpha_{\text{opt}} D^{-1}$ leads to

$$\tilde{A}_1^{-1} = 2\alpha_{\text{opt}} D^{-1} - \alpha_{\text{opt}}^2 D^{-1} A D^{-1},$$

which requires only column and row scaling of $A$. Hence, the associated complexity scales only with $U^2$. In this case, the truncated Neumann series approximation exhibits lower complexity than the explicit Cholesky-based inverse and is an attractive method for explicit equalization in massive MU-MIMO systems [22-24, 26]. For $K_{\max} = 2$, the associated complexity scales with $U^3$ due to a matrix multiplication. However, the output of this matrix multiplication is symmetric. As a result, the complexity for $K_{\max} = 2$ is still lower than that of the Cholesky-based inverse, as shown in Fig. 3.3. As expected, $K_{\max} \geq 3$ results in higher complexity than a Cholesky-based exact inversion.

For $K_{\max} = 1$ and the traditional $A^H$-based initialization matrix, however, the complexity of the truncated Neumann series expansion is larger than that of the explicit Cholesky-based inverse, because of the matrix multiplication required for the term $\tilde{A}_0^{-1} A$.

For $K_{\max} = 1$ and $\alpha_{\text{opt}} D^{-1}$, the Neumann recursion coincides with the Schulz recursion. For $K_{\max} > 1$, however, the Schulz recursion requires two matrix-matrix multiplications per iteration, resulting in substantially higher complexity. Similarly, the 3rd order recursion requires three such operations per iteration. As a result, both of these explicit higher-order recursions are unattractive in terms of complexity, despite the fact that they enable very fast convergence (cf. Section 3.5.5).

Implicit Series Expansions and Exact Inversion

Table 3.3 compares the complexity of various implicit equalization schemes, including Cholesky-based exact inversion, various approximate series expansions, and iterative methods.

Table 3.3: Complexity of implicit matrix inversion methods.
Method | $\alpha D^{-1}$ or $(U + B)^{-1} I$ | Traditional $A^H$-based initializer
Neumann | $2BU^2 + K_{\max}(4U^2 + 2U) + 2U$ | $2BU^2 + K_{\max}\,8U^2 + 4U^2$
CG | $2BU^2 + (K_{\max} + 1)(4U^2 + [\ldots]U)$
GS | $2BU^2 + K_{\max}\,4U^2 + 2U$
Cholesky | $2BU^2 + \frac{2}{3}U^3 + 4U^2 - \frac{2}{3}U$

As expected, the complexity of implicit methods is significantly lower than that of explicit methods (cf. Table 3.2). Furthermore, the complexity of Cholesky-based implicit equalization scales cubically with $U$, whereas all approximate methods scale only quadratically with $U$. As for the explicit approximate methods, the complexity of the implicit Neumann recursion depends on the initialization matrix. The choice $\tilde{A}_0^{-1} = \alpha D^{-1}$ requires one matrix-vector multiplication per iteration; the traditional $A^H$-based choice requires two such operations and, hence, is less attractive (also from a convergence point of view; see Section 3.5.5). Table 3.3 also includes the complexity of CG [48] and GS [48], which scales quadratically with $U$. As for the implicit Neumann recursion, GS must be initialized by computing $\tilde{y}_0 = D^{-1} y^{\mathrm{MF}}$, whereas CG uses an all-zero vector as the initial guess.

We emphasize that the method with the lowest complexity is not immediately clear from Table 3.3. As it turns out, the implicit Cholesky decomposition often has the lowest complexity, depending on $U$ and $K_{\max}$. Table 3.4 provides an overview of this rather surprising behavior by listing the break-even points, i.e., the smallest value of $U$ such that the complexity of a given approximate implicit method is lower than that of the exact, implicit Cholesky decomposition. To ensure a fair comparison, we take into account the complexity required to compute the necessary initialization terms ($D^{-1}$, $\alpha_{\text{opt}} D^{-1}$, and the traditional $A^H$-based initializer).

Table 3.4: Break-even point for implicit inversion. The break-even point is the smallest $U$ such that the method exhibits lower complexity than the implicit Cholesky decomposition.
$K_{\max}$ | $D^{-1}$ and $(U + B)^{-1} I$ | $\alpha_{\text{opt}} D^{-1}$ | Traditional $A^H$-based initializer | CG | GS

We observe that the Neumann recursion, CG, and GS are only competitive with the exact, implicit Cholesky decomposition for a very small number of iterations $K_{\max}$. In these cases, GS exhibits the lowest complexity among all considered implicit equalization methods.
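As a rough illustration of the implicit iterative methods compared in Table 3.3, the following Python sketch runs Gauss-Seidel (GS) sweeps on the system $A\tilde{y} = y^{\mathrm{MF}}$, starting from $\tilde{y}_0 = D^{-1} y^{\mathrm{MF}}$ as described above. The function name `gauss_seidel`, the list-of-lists matrix layout, and the toy numbers are illustrative assumptions; this is not the implementation evaluated in this thesis.

```python
def gauss_seidel(A, y_mf, k_max):
    """Approximately solve A x = y_mf with k_max Gauss-Seidel sweeps.

    A is a U x U Hermitian, diagonally dominant matrix (list of lists of
    complex); y_mf is the matched-filter output (list of complex).
    """
    U = len(A)
    # Initial guess from the text: x_0 = D^{-1} y_mf, with D = diag(A).
    x = [y_mf[i] / A[i][i] for i in range(U)]
    for _ in range(k_max):
        for i in range(U):
            # Entries x[j] with j < i were already updated in this sweep.
            s = sum(A[i][j] * x[j] for j in range(U) if j != i)
            x[i] = (y_mf[i] - s) / A[i][i]
    return x


# Toy example (hypothetical numbers): a strongly diagonally dominant A.
A = [[10 + 0j, 1 + 1j, 0.5j],
     [1 - 1j, 12 + 0j, 1 + 0j],
     [-0.5j, 1 + 0j, 9 + 0j]]
y_mf = [1 + 2j, -1 + 0j, 0.5 - 1j]
x_hat = gauss_seidel(A, y_mf, k_max=3)
# Largest residual entry |A x_hat - y_mf| after three sweeps.
residual = max(abs(sum(A[i][j] * x_hat[j] for j in range(3)) - y_mf[i])
               for i in range(3))
```

Each sweep costs one pass over the entries of $A$, which is consistent with the per-iteration cost that is quadratic in $U$ in Table 3.3; no matrix inverse is ever formed.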

Complexity of FADE

We now assess the complexity of the proposed frequency-adaptive equalizer (FADE). This method combines two methods: (i) an exact, explicit inversion using the Cholesky decomposition at each base-point and (ii) an approximate, implicit Neumann recursion update on adjacent subcarriers as in (3.28). The total complexity of FADE is therefore a weighted average of the two. The complexity of explicit inversion using the Cholesky decomposition is shown in Table 3.2; the complexity of the implicit Neumann recursion update in (3.28) is given by $8BU + 8U^2 + 4U$. Let $p \in [0, 1]$ and $\bar{p} = 1 - p \in [0, 1]$ be the fraction of base-point subcarriers and the fraction of adjacent subcarriers, respectively. Then, the average number of real-valued multiplications required per subcarrier is simply the weighted sum of the two phases,

$$p\,C_{\text{Chol}} + \bar{p}\,(8BU + 8U^2 + 4U), \tag{3.29}$$

where $C_{\text{Chol}}$ denotes the complexity of the Cholesky-based explicit inversion listed in Table 3.2. The parameter $p$ controls a performance/complexity trade-off: large values of $p$ perform more explicit matrix inversions, which results in high complexity but delivers excellent error-rate performance; small values of $p$ perform fewer explicit inversions, which reduces the complexity at the cost of error-rate performance. This trade-off is studied next.
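The implicit update in (3.28) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the helper names (`matvec`, `herm_matvec`, `fade_update`), the list-of-lists layout, and the toy channel are hypothetical, and `reg` stands for the regularization term $N_0 E_s^{-1}$.

```python
def matvec(M, v):
    """Return M v for a list-of-lists matrix M and a list vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def herm_matvec(M, v):
    """Return M^H v without forming the conjugate transpose explicitly."""
    return [sum(M[i][j].conjugate() * v[i] for i in range(len(M)))
            for j in range(len(M[0]))]


def fade_update(A_w_inv, H_next, y_next, reg):
    """One implicit Neumann update as in (3.28) for subcarrier w + 1.

    A_w_inv : U x U inverse computed at the base point w
    H_next  : B x U channel matrix at subcarrier w + 1
    y_next  : length-B receive vector at subcarrier w + 1
    reg     : regularization N0 / Es (real scalar)
    """
    y_mf = herm_matvec(H_next, y_next)           # y^MF = H^H y
    z = matvec(A_w_inv, y_mf)                    # precomputed A_w^{-1} y^MF
    g = herm_matvec(H_next, matvec(H_next, z))   # H^H (H z), no Gram matrix
    t = matvec(A_w_inv, [g_i + reg * z_i for g_i, z_i in zip(g, z)])
    return [2 * z_i - t_i for z_i, t_i in zip(z, t)]


# Toy check (hypothetical numbers): if the channel does not change between
# subcarriers, the update returns the exact solution A_w^{-1} y^MF.
H = [[1, 0], [0, 1], [1, 0], [0, 1]]
reg = 0.5
A_w_inv = [[0.4, 0.0], [0.0, 0.4]]   # (H^H H + reg I)^{-1} = (2.5 I)^{-1}
y = [1 + 1j, 2, -1j, 0.5]
y_tilde = fade_update(A_w_inv, H, y, reg)
```

Only matrix-vector products appear in `fade_update`: two with $H_{w+1}$ and two with $A_w^{-1}$. The $U \times U$ Gram matrix of subcarrier $w + 1$ is never formed.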

Performance and Complexity Trade-off

To evaluate the error-rate performance of the proposed soft-output data detectors, we consider a 3GPP LTE uplink system [1] with $U = 8$ single-antenna user terminals and $B \in \{32, 64, 128, 256\}$ BS antennas. For all simulations, we consider a 20 MHz bandwidth with 1200 subcarriers, and we use 64-QAM with a high-rate 3GPP turbo code. To account for frequency and spatial correlation, we use a WINNER Phase 2 channel model [46] with 8.9 cm antenna spacing; the maximum delay spread for this model is 6 taps. All simulations assume perfect channel-state knowledge.

To assess the error-rate performance, we use the so-called SNR operation point [42], which is defined as the minimum SNR required to achieve 10% block error-rate (BLER) for $U = 8$ users. The BLER is obtained via Monte-Carlo simulations averaged over 2000 frames. To assess the computational complexity, we use the real-valued multiplication counts in Section 3.7. For the initialization matrix $\alpha_{\text{opt}} D^{-1}$, we use $\alpha_{\text{opt}} = U / \|D^{-1}A\|_F^2$.

We now investigate the performance/complexity trade-offs associated with the proposed equalizers and existing solutions. Figure 3.4 shows the performance/complexity trade-offs for all considered exact, approximate, explicit, and implicit methods, as well as FADE. First, we emphasize that the matched filter equalizer (which is equivalent to $K_{\max} = 0$) achieves the lowest complexity but is unable to achieve 10% BLER for all considered antenna configurations. In contrast, the explicit matrix inversion using the Cholesky decomposition leads to exact MMSE equalization performance while requiring the highest complexity. The implicit Cholesky decomposition achieves a BLER close to that of the exact MMSE detector for all considered antenna configurations at slightly lower complexity; the performance loss of this implicit data detector comes from the SINR approximation used for implicit methods.

For $B = 32$, we see that only Cholesky-based exact inversion and FADE are able to achieve (near-optimal) performance. Furthermore, FADE with $p = 4\%$ exhibits the same performance at 2× lower complexity. For $B = 64$, CG and GS are the only approximate, implicit methods that achieve 10% BLER; these methods, however, exhibit a complexity similar to that of the implicit Cholesky decomposition at about 0.5 dB performance loss. Again, FADE outperforms all methods in terms of performance and complexity. By increasing the number of BS antennas to $B \geq 128$, explicit as well as implicit Neumann series approximations start to approach the performance of the exact MMSE equalizer. However, only the implicit Neumann recursion enables lower complexity than the implicit Cholesky decomposition (cf. Table 3.4), which makes it attractive for massive MU-MIMO systems with large BS-to-user-antenna ratios. FADE reduces the complexity by more than 2× at near-optimal performance with only $p = 1\%$ base-points.

In summary, by exploiting frequency correlation, FADE significantly reduces the complexity compared to all other methods by eliminating the need to compute the regularized Gram matrix $A$ at all subcarriers. We also observe that the complexity advantage of FADE becomes more pronounced for larger BS antenna arrays, where the computation of the Gram matrix becomes the dominating operation.

Figure 3.1: Average relative error (RE) comparison for different antenna configurations, algorithms, and initialization methods. Panels: (a) average RE of the truncated Neumann series recursion; (b) average RE of the truncated Schulz recursion; (c) average RE of the truncated 3rd order recursion. Each panel plots the relative error against the iteration count $K$ for three choices of the scaling $\alpha$ (including $\alpha^{-1} = (\lambda_{\max} + \lambda_{\min})/2$), $(B + U)^{-1}$, $D^{-1}$, Init. 1, Init. 2, GS, and CG.

Figure 3.2: Illustration of the frame structure of a wideband system (axes: frequency $w$ and time $t$; markers: base-points and adjacent subcarriers). FADE only computes explicit matrix inverses at the base-points (in black); equalization at adjacent (in time and frequency) subcarriers is performed using one iteration of the implicit accelerated Neumann series recursion.
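To make the base-point idea concrete, the following Python sketch picks evenly spaced base-points for a given fraction $p$ and maps every subcarrier to its nearest base-point. Both the uniform spacing and the nearest-neighbor assignment are assumptions made for illustration only (the text above merely states that the set can be pre-assigned or adapted on the fly), and the function name `assign_base_points` is hypothetical.

```python
def assign_base_points(num_subcarriers, p):
    """Pick evenly spaced base-points and map each subcarrier to the nearest one.

    num_subcarriers : total number of subcarriers
    p               : fraction of subcarriers that act as base-points
    """
    num_base = max(1, round(p * num_subcarriers))
    spacing = num_subcarriers / num_base
    # Place one base-point in the middle of each group of subcarriers.
    base = [int(spacing * (k + 0.5)) for k in range(num_base)]
    # Every subcarrier reuses the explicit inverse of its closest base-point.
    nearest = [min(base, key=lambda b: abs(b - w))
               for w in range(num_subcarriers)]
    return base, nearest


# 1200 subcarriers and p = 4% as in the simulations of this chapter.
base, nearest = assign_base_points(1200, 0.04)
```

With these numbers, an explicit inverse is computed at 48 subcarriers, and all remaining subcarriers would be handled by the implicit update.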

Figure 3.3: The number of real-valued multiplications required for Neumann series expansions depends on the number of users $U$. The plot shows the number of real multiplications against the number of users $U$ for several iteration counts (including $K = 1$, $K = 2$, and $K = 3$) and for the exact inverse.

Figure 3.4: Error-rate performance vs. complexity trade-off for the 32 × 8, 64 × 8, 128 × 8, and 256 × 8 antenna configurations. The complexity is defined as the number of real-valued multiplications and the performance as the SNR operation point, which is the minimum SNR required to achieve a BLER of 10%. Each panel plots the complexity against the SNR operation point [dB] for Exp. Chol., Imp. Chol., Exp. $D^{-1}$, Exp. $\alpha_{\text{opt}} D^{-1}$, Imp. $D^{-1}$, Imp. $\alpha_{\text{opt}} D^{-1}$, CG, GS, and FADE; individual markers are annotated with the iteration count $K$, the base-point percentage $p$, or MRC.

Chapter 4

MIMO Detector Implementations

In this chapter, we present implementations of two different MIMO detectors, one design for small-scale MIMO and another design for large-scale MIMO. We first implemented the proposed N-way MIMO detector on both NVIDIA Fermi and Kepler GPUs for small-scale MIMO systems. The massive amount of computational power provided by GPUs is particularly useful for platforms in need of great flexibility, such as software-defined radios, or to speed up MIMO system simulations. We show that by changing the number of parallel candidate searches, our MIMO detector provides a wide range of detection options: from a design that provides excellent error-rate performance (within 0.25 dB of the soft-output max-log-MAP detector) to a design that achieves several hundred Mb/s to Gb/s detection throughput.

We also designed two reference VLSI MIMO detection architectures for a large-scale system, one relying on the approximate inverse and the other on an exact Cholesky-based matrix inversion. Both architectures have been successfully implemented on a state-of-the-art Xilinx Virtex-7 FPGA and achieve more than 600 Mb/s, exceeding the peak data rates specified in 3GPP LTE-A for 20 MHz bandwidth. We show that the approximate matrix inversion significantly reduces the hardware implementation complexity (compared to that of the exact inversion) with only a slight error-rate performance degradation for systems with very large ratios between BS antennas and users.

4.1 N-Way MIMO Detector for Small-Scale Systems

We first describe our N-way MIMO detector implementation for small-scale MIMO systems (as described in Section 2.1) on NVIDIA GPUs. Currently, most wireless standards employ orthogonal frequency division multiplexing (OFDM), which simplifies equalization by dividing the available bandwidth into multiple orthogonal subcarriers. With OFDM, each subcarrier corresponds to an independent MIMO detection problem. As a result, the receiver needs to perform MIMO detection on every subcarrier. Since many wireless standards use a large number of subcarriers and GPUs consist of many independent processing cores, GPUs are well suited to this application: hundreds of independent MIMO detection problems can run in parallel, which allows one to achieve high throughput.

Since NVIDIA GPUs can be viewed as multi-core SIMD processors, a suitable algorithm needs to be data parallel to maximize the use of the available execution units. In addition, the device memory latency is high on GPUs. To hide this latency, a small amount of on-chip resources, such as registers and shared memory, can be used. As a result, a good algorithm needs to have a memory footprint small enough to fit into the available on-chip memory to reduce the number of expensive device-memory accesses.

We implemented the soft-output N-way MIMO detector on the GPU using the CUDA C programming language. (We assume the reader is familiar with CUDA. A detailed description and explanation can be found in [72].) In this programming model, the programmer specifies a kernel, i.e., a set of computations. At runtime, threads execute the same set of computations specified by the kernel on different input data to perform the task. The proposed MIMO detector implementation consists of two kernels. The first kernel performs a QR decomposition on the channel matrix. The second kernel searches for likely transmit vectors to generate a candidate list and then uses this candidate list to compute soft-output information for each transmitted bit. In the following two sections, these two kernels are described in detail.

QR Decomposition Kernel

To achieve low computational complexity and good numerical stability, we implemented the modified Gram-Schmidt QR factorization [73] on the GPU. Algorithm 1 shows the pseudo-code for the modified Gram-Schmidt CUDA kernel function, which corresponds to a parallel implementation of the modified Gram-Schmidt algorithm. At runtime, $2N_t$ threads execute the set of instructions defined in Algorithm 1 to perform one QR decomposition.

The kernel function can be summarized as follows. In the initialization part, $2N_t$ threads fetch the complex-valued inputs, the received signal $y$ and the channel $H$, from device memory. To perform MRVD, the threads use the input data to construct a real-valued extended matrix, $V = [H\ \tilde{y}] = [v_0, \ldots, v_{2N_t}]$, in shared memory. The vector $v_i$ is the $i$th column of $V$ and the scalar $v_{k,i}$ is the $k$th entry of $v_i$. Next, the $2N_t$ threads perform a QR decomposition on the extended matrix $V$. The result is an extended upper triangular matrix $E = [R\ \hat{y}]$, which is stored in device memory, where the scalar $e_{k,i}$ is the $k$th element of the $i$th column of $E$.

The whole process takes $2N_t$ iterations. Each iteration consists of a set of serial operations and a set of parallel operations. The serial operations are summarized in Lines 4-7 of Algorithm 1, where we compute $\|v_i\|^2$, the squared $\ell_2$-norm of the $i$th column of $V$, and the corresponding scaling factor $s$. These serial computations are handled by one thread; in our implementation, they are handled by the $i$th thread. (The serial computations can be handled by any thread. For example, it is possible to always pick the first thread to compute the squared $\ell_2$-norm.) Subsequent computations are done in parallel. In Line 9, $2N_t$ threads compute the $i$th orthogonal projection, $v_i$, in parallel. In Lines 11-14, we assign one thread to each column of $V$, and the number of columns of $V$ that are updated decreases by one after each iteration. For the $i$th iteration, only threads $k \geq i$ update the columns of $V$, which leads to the condition $k \geq i$ in Line 11. Line 12 constructs the remaining elements in the $i$th row of $E$ in parallel. In Line 13, the matrix $V$ is updated for subsequent iterations. In this instance, the $k$th thread updates $v_{k+1}$ by subtracting the projection of $v_{k+1}$ onto $v_i$ from $v_{k+1}$.
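The structure of this kernel can be illustrated with a serial Python sketch of modified Gram-Schmidt on an extended real-valued matrix. Algorithm 1 itself is not reproduced here, so the code below is only an approximation of it: the function name `mgs_qr_extended` and the row-list input format are assumptions, and the inner loop over `k` stands in for the work that the GPU threads would perform in parallel.

```python
import math


def mgs_qr_extended(V):
    """Modified Gram-Schmidt on an extended real matrix V = [H y].

    V is a list of m rows with n + 1 entries each; the last column holds the
    receive vector. Returns E = [R y_hat] as n rows with n + 1 entries each.
    """
    m, n = len(V), len(V[0]) - 1
    # Column-major working copy, mirroring the shared-memory layout.
    cols = [[V[r][c] for r in range(m)] for c in range(n + 1)]
    E = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        # Serial part: squared l2-norm of column i and the scaling factor s.
        norm_sq = sum(x * x for x in cols[i])
        s = 1.0 / math.sqrt(norm_sq)
        E[i][i] = math.sqrt(norm_sq)
        q = [x * s for x in cols[i]]          # i-th orthonormal direction
        # Parallel part on the GPU: one thread per remaining column k > i.
        for k in range(i + 1, n + 1):
            E[i][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - E[i][k] * a for a, b in zip(q, cols[k])]
    return E


# Toy 4 x 2 real channel with the receive vector appended as last column.
E = mgs_qr_extended([[2.0, 1.0, 1.0],
                     [0.0, 3.0, 2.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 1.0]])
```

Because the receive vector is carried along as an extra column, the last column of `E` already contains $\hat{y} = Q^T y$, so no separate multiplication with $Q^T$ is needed.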

The matrix $V$ is stored in shared memory because its elements are accessed repeatedly, and shared memory accesses are much faster than device memory accesses. Storing the entire matrix in shared memory is possible because the number of elements in $V$ is small for typical MIMO systems. Furthermore, the memory access pattern of this kernel is very regular: a column-major layout for $V$ results in bank-conflict-free shared memory accesses.

Candidate Search Kernel

The search algorithm is implemented with one kernel. The corresponding CUDA kernel function is shown in Algorithm 2. At runtime, $M$ threads execute the set of instructions defined in Algorithm 2 in parallel to perform the candidate search. Each thread is assigned to one modulation point, where the $k$th thread is assigned to the $k$th modulation point in the finite alphabet $\Omega$. The first two levels of the tree are fully expanded, as shown in Lines 3-4. For the subsequent levels, level $2N_t - 3$ to level 0, the algorithm first computes partial distances and then prunes outgoing branches by keeping the best outgoing paths. The partial distance for the $k$th path is computed in Lines 6-9. Line 10 computes $\gamma_k$, which is used to find the best node at level $i$ in Lines 11-12. The best node is selected with a simple round function followed by a threshold function on $\gamma_k$. With the best node found, Line 13 updates the $k$th distance by adding the partial distance of the best node. At the end of the loop, the path $p_k$ is a candidate in our candidate list.
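The round-and-threshold step can be sketched as follows in Python. Algorithm 2 is not reproduced here, so this is a minimal sketch under an assumption: each real dimension is taken to use the unnormalized odd-integer alphabet $\{-(L-1), \ldots, -1, 1, \ldots, L-1\}$ with $L = \sqrt{M}$. The function name `best_pam_node` is hypothetical.

```python
import math


def best_pam_node(gamma, levels):
    """Return the closest point of the alphabet {-(L-1), ..., -1, 1, ..., L-1}.

    gamma  : unconstrained real-valued estimate for the current tree level
    levels : L, the number of points per real dimension (e.g. 4 for 16-QAM)
    """
    # Round: map gamma onto the index grid 0 .. L-1 (points are 2 apart).
    idx = math.floor((gamma + (levels - 1)) / 2 + 0.5)
    # Threshold: clip indices that fall outside the alphabet.
    idx = min(max(idx, 0), levels - 1)
    return 2 * idx - (levels - 1)


# Estimates inside and outside the 4-point alphabet {-3, -1, 1, 3}.
inside = best_pam_node(0.3, 4)
outside = best_pam_node(-5.0, 4)
```

The cost of this selection is independent of the alphabet size, in contrast to an exhaustive comparison against all points of the alphabet.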

0-Hypothesis and 1-Hypothesis Generation Kernel

With the candidate list, the 0-hypothesis and 1-hypothesis for each transmitted bit can be computed. The computations for the $k$th thread are summarized in Algorithm 3. As shown in Line 1, the $k$th path, $p_k$, is first demodulated into a binary vector $b_k$, which is then stored in shared memory. The corresponding distance for the $k$th path, $d_k$, is also stored in shared memory. We use $N_t \log_2(M)$ threads, one thread per bit, to compute the 0-hypothesis and the 1-hypothesis for each bit. In Lines 5-11, the $k$th thread scans the binary vectors one by one to find the 0-hypothesis, $h_k^0$, and the 1-hypothesis, $h_k^1$, for the $k$th bit. Finally, the LLR for the $k$th bit is the difference between $h_k^0$ and $h_k^1$.

Optimizations

Beyond the above procedures, we also improve throughput in several ways.

1. We unroll the loops in both algorithms to reduce the total number of instructions.

2. As there are no data dependencies between the threads in the search process, it is possible to store the path history $P$ in registers instead of shared memory, which eliminates shared memory load/store instructions and memory address computations. Storing the path history in registers also reduces the number of device memory accesses, as the computation is carried out with on-chip resources.

3. On NVIDIA GPUs, the SIMD instructions (or warp instructions) are 1024 bits wide, where each instruction operates on 32 elements. At runtime, 32 threads share the same instruction. As a result, multiple MIMO detections are packed into one thread block to ensure that each thread block consists of an integer multiple of 32 threads, which improves efficiency. For example, for a 4 × 4 MIMO 16-QAM system, QR decomposition uses 8 threads and MIMO detection uses 16 threads. We pack at least four QR decompositions into one thread block and at least two 16-QAM detectors into one thread block to ensure that there are at least 32 threads within a thread block.

N-Way Parallel MIMO Detection Kernel

In our implementation, we spawn $2NN_t$ threads for an input pair $H$ and $y$. Each set of $2N_t$ threads constructs a permuted version of $H$ in shared memory by reading the input data in a different order. Each set of $2N_t$ threads then performs a QR decomposition on its permuted channel matrix. We then spawn $MN$ threads to perform $N$ parallel MIMO detections and LLR generation. To reduce communication among cores on the GPU, all threads corresponding to an instance of a MIMO detection problem reside within the same thread block.

N-Way Parallel LLR Computation Kernel

The threads generate LLRs using the larger combined candidate list, which consists of $MN$ candidates. We could use one thread to scan through all $MN$ candidates to find the 0-hypothesis, $h_k^0$, and the 1-hypothesis, $h_k^1$, for the $k$th bit. In this case, we would use $N_t \log_2(M)$ threads, one thread per transmitted bit, to compute the LLRs for all transmitted bits.

Since our thread block consists of $MN$ threads, we can improve the efficiency of the LLR generator by parallelizing the workload further. Instead of one thread per transmitted bit, we split the workload of finding $h_k^0$ and $h_k^1$ among $N$ threads. Since there are $N$ lists, each consisting of $M$ candidates, we assign one list to each thread. Using the assigned list, each thread attempts to find the 0-hypothesis and the 1-hypothesis. As a result, there is one set of $N$ 0-hypotheses and one set of $N$ 1-hypotheses per bit. To find the final 0-hypothesis and 1-hypothesis, we use two threads to scan through the $N$ 0-hypotheses and the $N$ 1-hypotheses to find the minimum of each set. The two minima correspond to the 0-hypothesis and the 1-hypothesis, respectively. The difference between these values is $L_k^D$, the LLR for the $k$th bit. In total, we use $NN_t \log_2(M)$ threads to compute the LLRs for all transmitted bits.

The complexity of the LLR generator is directly proportional to the number of candidates in the candidate list, which is $MN$. The parallel searches do not necessarily generate unique candidates. As a result, there may be duplicates in the candidate list. Since (2.4) consists of $\min(\cdot)$ operators, duplicates do not affect the result. Although it is possible to reduce the complexity of the LLR generator by eliminating duplicates in the candidate list, the number of unique candidates is not fixed, which leads to a non-deterministic runtime. As a result, we do not eliminate duplicate candidates in the list.
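The two-stage hypothesis search described above can be sketched in Python as follows. The function names and the data layout (one pair of bit vectors and distances per parallel search) are assumptions for illustration, and the sign convention $h_k^0 - h_k^1$ is one possible reading of "the difference" in the text. Clipping of missing hypotheses, which a practical detector would need, is omitted.

```python
import math


def list_hypotheses(bits, dists, k):
    """Scan one candidate list; return (h0, h1) for bit position k."""
    h0 = h1 = math.inf
    for b, d in zip(bits, dists):
        if b[k] == 0:
            h0 = min(h0, d)   # best distance among candidates with bit k = 0
        else:
            h1 = min(h1, d)   # best distance among candidates with bit k = 1
    return h0, h1


def nway_llr(lists, k):
    """Combine the per-list hypotheses of N candidate lists into one LLR.

    lists is a sequence of (bits, dists) pairs, one pair per parallel search.
    """
    per_list = [list_hypotheses(bits, dists, k) for bits, dists in lists]
    h0 = min(h[0] for h in per_list)   # minimum over the N 0-hypotheses
    h1 = min(h[1] for h in per_list)   # minimum over the N 1-hypotheses
    return h0 - h1


# Two toy lists (N = 2) with two 2-bit candidates each; [1, 1] is a duplicate.
list_a = ([[0, 1], [1, 1]], [1.0, 2.5])
list_b = ([[0, 0], [1, 1]], [3.0, 2.5])
llr_bit0 = nway_llr([list_a, list_b], 0)
llr_bit1 = nway_llr([list_a, list_b], 1)
```

Because both stages only take minima, the duplicated candidate `[1, 1]` has no effect on the result, which is why duplicates can safely be left in the combined list.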

Implementation Performance

In this section, we investigate and analyze the detector's throughput performance on Fermi and Kepler graphics cards for various configurations. We then compare our detector with other GPU-based soft-output MIMO detector implementations and show its advantages.

To measure the throughput performance of our implementation, we used two types of graphics cards. We first used an NVIDIA GeForce GTX 470 graphics card (Fermi) with 448 shaders running at 1215 MHz and with 1280 MB of GDDR5 with a peak bandwidth of … GB/s. In addition, we used one GPU on an NVIDIA GeForce GTX 690 graphics card (Kepler), which has 1536 shaders running at 915 MHz and has 2 GB of GDDR5 with a peak bandwidth of … GB/s. This setup is equivalent to a GTX 680 graphics card downclocked by 10%. In our benchmark, the reported execution time is averaged over 1000 runs.

We first analyze the runtime of N-way detection without considering transport time, i.e., the time required to copy data from host memory to the GPU and vice versa. We then look at the runtime of the QR decomposition without considering transport time. Finally, we report the runtime of the entire design including transport time.

MIMO Detection Kernel

Effect of optimizations on kernel performance: We first present the effect of different optimization techniques on the performance of the detector. The nested loop within Algorithm 2 depends on $N_t$ (the number of antennas), while the loop within Algorithm 3 depends on $N_t$ as well as $M$ (the modulation order). Unrolling these loops reduces the number of instructions, which increases the throughput of the MIMO detector. Similarly, the locality of the data also affects the performance of the detector. We tried different unrolling techniques in conjunction with different data placements. As an illustrative example, we consider the peak throughput of a 4 × 4 64-QAM detector where $N = 1$. We present the results on the Fermi GPU, as the trend is similar for Kepler. The throughput of the different cases is presented in Figure 4.1.

Figure 4.1: Effect of optimizations on 4 × 4, 64-QAM MIMO detectors, $N = 1$, 8192 subcarriers. The plot shows the throughput in Mb/s for four cases: parameterized (shared), hand unrolled (shared), hand unrolled (local), and modified template unroll (local).

We now describe and explain each case in detail:

1. For the initial case (labeled as parameterized (shared) in Figure 4.1), we put the path history in shared memory and attempt to use NVIDIA unroll directives to unroll these loops. Since both $N_t$ and $M$ are input parameters, the MIMO detector kernel is designed as a template function where $N_t$ and $M$ are template parameters. This enables the compiler to generate different instances of the function for different combinations of $N_t$ and $M$. For the nested loop within Algorithm 2, we notice that the compiler only unrolls the inner loop and does not unroll the outer loop, even though unroll directives are used on both loop levels.

2. For the second case (labeled as hand unrolled (shared)), we unrolled the loops by hand, which results in a significant improvement.

3. For the third case (labeled as hand unrolled (local)), the array access pattern is completely deterministic because the loops are completely unrolled. As a result, we put the path history into local memory instead of shared memory. Since the memory access patterns are known at compile time, the compiler stores the path vectors in registers (instead of device memory), which eliminates shared memory load/store instructions and increases throughput.

4. For the fourth case (labeled as modified template unroll (local)), we note that manual unrolling is not practical, as we would need to manually generate one instance of the MIMO detector function for every combination of $N_t$ and $M$, i.e., one instance per MIMO configuration. We therefore use C++ templates to perform compile-time transformations that force the compiler to unroll these loops. Although the number of instructions is higher than with manual unrolling, this version is significantly faster than our initial cases and does not require multiple copies of the kernel code. Therefore, we use this configuration in the subsequent performance results.

MIMO detection kernel performance: Table 4.1 shows the throughput of the N-way detector for different MIMO configurations in a system with 8192 subcarriers. We packed up to 8 different detection problems per thread block for 16-QAM MIMO configurations. Since the number of threads required for the MIMO detector scales linearly with $N$, the runtime of the N-way detector is directly proportional to $N$ when $M$ and $N_t$ are fixed. In the case where $N$ and $M$ are fixed, the computation required to generate LLR values, as shown in Algorithm 3, is a constant overhead that depends on $M$ but does not depend on $N_t$. As a result, the runtime is not directly proportional to $N_t$. Consequently, the runtime of a 2 × 2 MIMO detector is not half that of a 4 × 4 MIMO detector with the same modulation order. The detector achieved higher throughput on Kepler in the majority of the cases. For the computationally intensive cases, such as the QAM configuration where $N = 4$, Kepler achieved a speedup of 1.7× over Fermi.

We used a large number of subcarriers to report peak throughput. However, an extremely large number of subcarriers is not needed to achieve throughput close to the peak performance. Figure 4.2 shows the throughput as a function of the number of subcarriers. We see that in all of these cases, the throughput starts to plateau after 2048 subcarriers.

QR Decomposition Kernel

An N-way parallel MIMO detection requires $N$ QR decompositions on the inputs $\mathbf{y}$ and $\mathbf{H}$ to generate inputs for the detector. Table 4.2 shows the results for $2 \times 2$ and $4 \times 4$ MIMO configurations with different values of $N$. We packed up to 8 different QR decomposition problems in one thread block. The results show that the QR decomposition kernel is not the bottleneck, as its runtime is much smaller than that of MIMO detection. Nevertheless, we also used C++ templates to fully unroll the loops in Algorithm 1 to reduce runtime.

Performance of the Complete Design

Table 4.3 shows the total runtime of the complete design measured with the CPU timer. The total runtime behaves similarly to the runtime of the detector kernel. For example, in the case where $N_t$ and $M$ are fixed, runtime increases as $N$ increases. However, due to constant overhead such as transport time, runtime is no longer linearly proportional to $N$. We note that the total runtime results are pessimistic. Using CUDA streams, data transfers can be overlapped with computation to further increase the throughput.

Comparison with Existing Work

We compare our kernel time of the MIMO detector with the kernel time of other soft-output MIMO detection implementations.

We compared the $N = 4$ case against the soft-output trellis-based MIMO detector [38] and the fully parallel fixed-complexity sphere detector (FPFSD) [34]. As shown in Section 2.1.6, the error-rate performance of the $N = 4$ case is better than the BER performance of the soft-output trellis MIMO detector and is similar to the BER performance of FPFSD. To compare these detectors, we also provide throughput normalized by core count and core frequency.

Compared to N-way MIMO detection, trellis-based MIMO detection [38] requires more instructions. For trellis-based MIMO detection, the number of instructions required to find the best path out of $M$ possible paths scales with the modulation order $M$. By comparison, N-way detection uses a round-and-threshold function to find the best outgoing path. As a result, the number of instructions required to find the best outgoing path is constant and does not depend on $M$. Consequently, particularly for higher modulation orders, the N-way MIMO detector achieves higher normalized throughput than the trellis-based MIMO detector.

The work that is most similar to ours is the FPFSD in [34], which uses a similar detector on the same GPU. We emphasize, however, that the throughput of FPFSD is lower than that of our work. In our design, we perform detection in the real domain through RVD, while FPFSD performs MIMO detection in the complex domain. We do not expect this to cause a large throughput difference, as RVD does not reduce the computational complexity. We believe there are two key differences between our detector and the FPFSD. First, in our design, we store the candidate list in registers to reduce

the number of device memory accesses. For FPFSD, the candidate list is stored in device memory, which increases the number of slow device memory accesses. Second, we reduce the number of instructions by aggressively unrolling loops in our kernels. This is not straightforward; we used C++ templates to automatically unroll loops for different MIMO detector configurations (see Section 4.1.5). These optimizations were not done for the FPFSD detector. As a result, we achieved higher normalized throughput than that of the FPFSD detector.

4.2 Linear MIMO detectors for Large-Scale Systems

We now detail two VLSI architectures suitable for large-scale MIMO detection in 3GPP LTE-A. The first design implements the proposed approximate inversion approach via Neumann series expansion (see Section 3.3.3) and the second design implements an exact inverse (see Section 3.3.1); this enables us to perform a fair hardware complexity vs. error-rate performance comparison (see Section for the comparison).

Architecture Overview

The proposed general architecture is depicted in Figure 4.3 and consists of the following parts. The preprocessing unit performs the matched filter computation, i.e., computes $\mathbf{y}^{\mathrm{MF}}_w = \mathbf{H}^H_w \mathbf{y}_w$, the regularized Gram matrix, and the (approximate) inverse. Note that for the approximate inversion unit, we also output $\mathbf{D}^{-1}_w$ and $\mathbf{G}_w$, which are needed to compute the SINR. To achieve the peak throughput specified in LTE-A [1],

while being able to handle the (worst) case where the channel estimates change from subcarrier to subcarrier and from SC-FDMA symbol to SC-FDMA symbol (see, e.g., [74]), we use multiple instances of the preprocessing unit. (Footnote: In many practical scenarios, the channel estimates may change only slowly. Hence, one does not need to compute the inverse for every SC-FDMA symbol. This fact could be exploited either to reduce the power consumption or to increase the achievable throughput of our detector designs.) The matched filter output, the (approximate) inverse, and the regularized Gram matrix are then passed to the subcarrier processing unit. This unit performs equalization, i.e., computes $\hat{\mathbf{s}}_w = \mathbf{A}^{-1}_w \mathbf{y}^{\mathrm{MF}}_w$ and the post-equalization SINR (detailed in Section for the exact inverse and in Section for the Neumann series approximation). To perform per-user data detection, a buffer is required that aggregates all equalized symbols and SINR values, which are computed on a per-subcarrier basis. The architecture then performs an IFFT, which transforms the equalized symbols from the subcarrier domain into the user domain (or time domain). The LLR computation unit finally computes, together with the buffered post-equalization NPI values, soft-output information in the form of max-log LLRs. We next provide the details for the key blocks of the proposed detector architecture.

Cholesky-based Inversion Unit

In order to enable a fair performance/complexity assessment of the proposed approximate matrix inversion unit, we also implemented a reference unit that performs an exact matrix inversion. This unit simply replaces the approximate inverse unit detailed in Section. We next summarize the Cholesky-based inversion algorithm

and then outline the corresponding VLSI architecture.

Inversion algorithm

In the proposed exact inversion unit, we compute $\mathbf{A}^{-1}_w$ in three steps: (i) we form the regularized Gram matrix $\mathbf{A}_w = \mathbf{G}_w + N_0 E_s^{-1} \mathbf{I}_U$; (ii) we perform a Cholesky decomposition according to $\mathbf{A}_w = \mathbf{L}_w \mathbf{L}^H_w$, where $\mathbf{L}_w$ is a lower-triangular matrix with real values on the main diagonal [75]; (iii) we compute the inverse $\mathbf{A}^{-1}_w$ using an efficient forward/backward substitution procedure proposed in [20]. Specifically, we first solve $\mathbf{L}_w \mathbf{u}_i = \mathbf{e}_i$ for $\mathbf{u}_i$, $i = 1, \ldots, U$, where $\mathbf{e}_i$ is the $i$-th unit vector, via forward substitution. We then solve $\mathbf{L}^H_w \mathbf{v}_i = \mathbf{u}_i$ for $\mathbf{v}_i$, $i = 1, \ldots, U$, via back substitution, which leads to the desired inverse $\mathbf{A}^{-1}_w = [\mathbf{v}_1 \cdots \mathbf{v}_U]$. Note that this approach avoids a costly matrix-by-matrix multiplication, which would be needed when directly computing $\mathbf{A}^{-1}_w = (\mathbf{L}^H_w)^{-1} \mathbf{L}^{-1}_w$.

Cholesky decomposition architecture

The VLSI architecture for the Cholesky-based inverse differs from the one in Section. In particular, we deploy three separate units: (i) a unit that computes the regularized Gram matrix, (ii) a unit that performs the Cholesky decomposition, and (iii) a forward/backward substitution unit that computes the inverse $\mathbf{A}^{-1}_w$. All units are detailed next and separated by pipeline stages.

The regularized Gram matrix is computed as a sum of outer products, i.e., as $\mathbf{G}_w = \sum_{i=1}^{B} \mathbf{r}_i \mathbf{r}^H_i$, where $\mathbf{r}_i$ designates the $i$-th row of $\mathbf{H}_w$. Since the Gram matrix is Hermitian (conjugate symmetric), it can be computed efficiently with a

triangular systolic array of multiply-and-accumulate (MAC) units. The architecture is shown in Fig. 4.4 for $B = 3$. Each processing element (PE, denoted by $P_{kl}$ in Fig. 4.4) in the systolic array consists of a MAC unit. The transposed input matrix $\mathbf{H}^T$ is shifted one column at a time into the systolic array. Each processing element performs a MAC operation with both input operands. To ensure that each PE processes the correct set of operands, the values in the $i$-th row of $\mathbf{H}^T$ are delayed by $i - 1$ cycles. Once an input value reaches a diagonal PE, the value is conjugated and passed to the lower part of the systolic array.

There are two PE variants in the architecture. A PE on the main diagonal requires two multipliers to compute the squared absolute value of a complex number, $|a + bj|^2 = a^2 + b^2$. A PE on the off-diagonal computes the product of two complex numbers, i.e., $(a + bj)(c - dj)$. To reduce the number of multipliers, we deploy strength reduction, which requires three multipliers and five additions [76]. Consequently, the number of multipliers required in this unit is $(3B^2 + B)/2$. To minimize the critical path, each MAC unit is pipelined and has a throughput of one MAC operation per clock cycle. The Gram computation unit reads one row of $\mathbf{H}_w$ at a time and is able to output a Gram matrix every $B$-th clock cycle. To obtain the regularized Gram matrix $\mathbf{A}_w$, we add $N_0 E_s^{-1}$ to the diagonal of $\mathbf{G}_w$ in the final clock cycle.

We then perform the Cholesky decomposition of $\mathbf{A}_w$ with a lower-triangular systolic array to obtain the lower-triangular matrix $\mathbf{L}_w$. The systolic array consists of two distinct processing elements (PEs): (i) the PEs on the main diagonal and (ii) the PEs

on the off-diagonal. The data flow is similar to the linear systolic array (the "obvious case") proposed in [77]. The difference is that our design processes an incoming column of $\mathbf{A}_w$ with multiple PEs, whereas an incoming column is processed with a single PE in [77]. As a result, our design is able to achieve the peak throughput requirements of LTE-A. In our design, the pipeline of one column of PEs is 16 stages deep and streams out one column of $\mathbf{L}_w$ every clock cycle (after a latency of $16(U - 1)$ clock cycles). Consequently, the achieved throughput corresponds to one Cholesky decomposition every $U$ clock cycles.

Forward/backward-substitution architecture

The forward/backward substitution unit (FBSU) receives a lower-triangular matrix $\mathbf{L}_w$ as input and computes $\mathbf{A}^{-1}_w = (\mathbf{L}^H_w)^{-1} \mathbf{L}^{-1}_w$ as outlined in Section. The FBSU consists of three major components: (i) a forward substitution unit (FSU), which solves $\mathbf{L}_w \mathbf{u}_i = \mathbf{e}_i$, (ii) a backward substitution unit (BSU), which solves $\mathbf{L}^H_w \mathbf{v}_i = \mathbf{u}_i$, and (iii) a Hermitian transpose unit, which computes $\mathbf{L}^H_w$. Since the computations for the FSU and the BSU are symmetric, we implement the forward substitution architecture and re-use it for the backward substitution by reversing the order of the columns of the matrix $\mathbf{L}^H_w$ and of the vector $\mathbf{u}_i$ before reading them into the BSU. To simplify notation, we assume that the equation to be solved by the forward substitution corresponds to $\mathbf{L}\mathbf{x} = \mathbf{b}$ for some $\mathbf{x}$ and $\mathbf{b}$. Since the forward substitutions that solve $\mathbf{L}\mathbf{x}_i = \mathbf{b}_i$ for each $\mathbf{b}_i$ ($i = 1, \ldots, U$) are independent, we use $U$

processor elements (PEs) to solve for all $\mathbf{x}_i$ in parallel. Each PE is implemented using a fully pipelined architecture, which consists of $U$ stages of computation logic. Each stage contains two multiplexers, a complex-valued multiplier, and a complex-valued subtraction. In each stage, either the partial residual $b_i - \sum_j L_{i,j} x_j$ or its quotient by $L_{i,i}$ is computed according to the control signals. Therefore, for an input matrix $\mathbf{L}_w$ of dimension $U$, the FSU uses $U^2$ complex-valued multipliers; the entire FBSU utilizes $2U^2$ complex-valued multipliers. The matrix conjugate unit is implemented using multiplexers and $U$ FIFOs (realized by on-chip B-RAMs in the FPGA). The conjugate matrix $\mathbf{L}^H_w$ is also reordered based on the pattern of the input sequence of the BSU.

Approximate Inversion

We used a reconfigurable multiplication systolic array that computes the regularized Gram matrix and the approximate inverse. The array can compute truncated Neumann series expansions for an arbitrary value of $K$, i.e., the number of Neumann series terms can be selected at run-time. The array first computes the regularized Gram matrix $\mathbf{A}_w = \mathbf{D}_w + \mathbf{E}_w$. The array also computes the reciprocals of the diagonal entries of $\mathbf{A}_w$ to obtain $\mathbf{D}^{-1}_w$. The systolic array then computes $\mathbf{D}^{-1}_w \mathbf{E}_w$, using the matrices $\mathbf{D}^{-1}_w$ and $\mathbf{E}_w$ computed in the first phase. Next, the systolic array computes the $K = 2$ term Neumann series approximation $\tilde{\mathbf{A}}^{-1}_{w|2} = \mathbf{D}^{-1}_w - \mathbf{D}^{-1}_w \mathbf{E}_w \mathbf{D}^{-1}_w$. Finally, the array computes the $K$-term Neumann series approximation with the intermediate results $\mathbf{D}^{-1}_w \mathbf{E}_w$ and $\mathbf{D}^{-1}_w$. In

particular, given $\tilde{\mathbf{A}}^{-1}_{w|k-1}$, we can compute $\tilde{\mathbf{A}}^{-1}_{w|k}$ as

$$\tilde{\mathbf{A}}^{-1}_{w|k} = \mathbf{D}^{-1}_w - \mathbf{D}^{-1}_w \mathbf{E}_w \tilde{\mathbf{A}}^{-1}_{w|k-1},$$

which requires one matrix multiplication and one matrix addition. The last step can be repeated for a configurable number of iterations to compute an arbitrary $K$-term approximation.

Matched filter and Equalization

Matched filter computation

The matched filter (MF) unit consists of a linear array of $U$ PEs. Each PE is associated with one row of the Hermitian-transposed matrix $\mathbf{H}^H_w$ and contains a single multiply-accumulate (MAC) unit. The MF unit reads a new entry of $\mathbf{y}_w$ every clock cycle, multiplies it with the corresponding entries of $\mathbf{H}^H_w$ in each PE, and adds the product to the previous results. This unit computes an MF output every $B$ clock cycles.

Equalization unit

The equalization unit consists of a linear array of $U$ MAC units and reads the normalized approximate inverse $\tilde{\mathbf{A}}^{-1}_{w|K} B$ and $\mathbf{y}^{\mathrm{MF}}_w / B$ from the matched filter unit. In each clock cycle, this unit takes one column of $\tilde{\mathbf{A}}^{-1}_{w|K} B$, multiplies it with one element of $\mathbf{y}^{\mathrm{MF}}_w / B$, and adds the scaled column to the previous results. The unit outputs an equalized symbol vector $\hat{\mathbf{s}}_w$ every $U$ clock cycles.
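The two inversion strategies described in this section (the exact inverse via Cholesky decomposition with forward/backward substitution, and the $K$-term Neumann recursion) can be compared numerically with a small software sketch. The following is not the VLSI design; it is a plain, sequential Python illustration under stated assumptions: all function names are hypothetical, the channel is an i.i.d. complex Gaussian matrix generated for illustration only, and `reg` stands in for $N_0 E_s^{-1}$.

```python
import random

def random_channel(B, U, seed=0):
    """Hypothetical i.i.d. complex Gaussian B x U channel (illustration only)."""
    rng = random.Random(seed)
    s = 0.5 ** 0.5
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(U)] for _ in range(B)]

def conj_transpose(X):
    return [[X[r][c].conjugate() for r in range(len(X))] for c in range(len(X[0]))]

def matmul(X, Y):
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y))) for c in range(len(Y[0]))]
            for r in range(len(X))]

def regularized_gram(H, reg):
    """A = H^H H + reg * I, where reg plays the role of N0 / Es."""
    A = matmul(conj_transpose(H), H)
    for i in range(len(A)):
        A[i][i] += reg
    return A

def cholesky(A):
    """Lower-triangular L with real diagonal such that A = L L^H."""
    U = len(A)
    L = [[0j] * U for _ in range(U)]
    for j in range(U):
        d = A[j][j].real - sum(abs(L[j][k]) ** 2 for k in range(j))
        L[j][j] = complex(d ** 0.5)
        for i in range(j + 1, U):
            s = A[i][j] - sum(L[i][k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = s / L[j][j]
    return L

def inverse_via_cholesky(A):
    """Solve L u_i = e_i, then L^H v_i = u_i; the v_i are the columns of A^{-1}."""
    U = len(A)
    L = cholesky(A)
    inv = [[0j] * U for _ in range(U)]
    for i in range(U):
        u = [0j] * U
        for r in range(U):                       # forward substitution
            rhs = (1.0 if r == i else 0.0) - sum(L[r][c] * u[c] for c in range(r))
            u[r] = rhs / L[r][r]
        v = [0j] * U
        for r in reversed(range(U)):             # back substitution with L^H
            rhs = u[r] - sum(L[c][r].conjugate() * v[c] for c in range(r + 1, U))
            v[r] = rhs / L[r][r].conjugate()
        for r in range(U):
            inv[r][i] = v[r]
    return inv

def neumann_inverse(A, K):
    """K-term Neumann approximation via A_k = D^{-1} - D^{-1} E A_{k-1}, A_1 = D^{-1}."""
    U = len(A)
    d_inv = [1.0 / A[i][i] for i in range(U)]
    dinv_e = [[0j if r == c else d_inv[r] * A[r][c] for c in range(U)] for r in range(U)]
    approx = [[d_inv[r] if r == c else 0j for c in range(U)] for r in range(U)]
    for _ in range(K - 1):
        prod = matmul(dinv_e, approx)
        approx = [[(d_inv[r] if r == c else 0j) - prod[r][c] for c in range(U)]
                  for r in range(U)]
    return approx

def frob_diff(X, Y):
    """Frobenius norm of X - Y."""
    return sum(abs(x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)) ** 0.5

if __name__ == "__main__":
    A = regularized_gram(random_channel(128, 4), reg=0.1)
    exact = inverse_via_cholesky(A)
    for K in (1, 2, 3):
        print(K, frob_diff(neumann_inverse(A, K), exact))
```

For a large ratio $B/U$ the matrix $\mathbf{A}_w$ is strongly diagonally dominant, so the printed approximation error should shrink quickly with $K$; for small ratios the series converges slowly or not at all, which is the regime where the exact inverse is preferable.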

SINR computation unit

The SINR computation unit simply consists of $U$ MAC units that sequentially compute the approximate effective channel gain $\mu^{(i)}_K$. This unit furthermore computes the approximate NPI (3.14) using a single MAC unit. Subsequently, the unit multiplies $\mu^{(i)}_K$ with the reciprocal of the approximate NPI $\nu_i^2$ to obtain the post-equalization SINR $\rho_i^2$. The same unit computes the reciprocal of $\mu^{(i)}_K$, which is used in the LLR computation unit detailed next.

IFFT and LLR Computation Units

IFFT unit

In order to transform the per-subcarrier data into the user (or time) domain, we deploy a single Xilinx Discrete Fourier Transform IP LogiCORE unit (see [78] for the specifications). This unit supports all forward and inverse DFT modes specified in 3GPP LTE [79], but we only make use of its IDFT capabilities. The IFFT unit reads and outputs data in a serial manner. For an IFFT transform size of 1200 subcarriers, the core can process a new set of data every 3779 clock cycles. This FFT unit achieves more than 317 MHz on a Virtex-7 XC7VX980T FPGA and hence achieves a throughput beyond 600 Mb/s for 8 users, 64-QAM, and 20 MHz bandwidth.
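The quoted IFFT throughput can be reproduced with a short back-of-the-envelope calculation. The sketch below is one plausible reading of the numbers, not a statement of how the figure was derived in the original work: it assumes that the serial core delivers 1200 equalized symbols per 3779 clock cycles and that every symbol carries 6 coded bits (64-QAM). The function name is hypothetical.

```python
def ifft_throughput_mbps(n_subcarriers, bits_per_symbol, cycles_per_transform, clock_hz):
    """Bits delivered per second by a serial (I)DFT core, in Mb/s (assumed model)."""
    bits_per_transform = n_subcarriers * bits_per_symbol
    transforms_per_second = clock_hz / cycles_per_transform
    return bits_per_transform * transforms_per_second / 1e6

if __name__ == "__main__":
    total = ifft_throughput_mbps(1200, 6, 3779, 317e6)
    print(f"aggregate: {total:.1f} Mb/s, per user (8 users): {total / 8:.1f} Mb/s")
```

Under these assumptions the aggregate rate lands slightly above 600 Mb/s, i.e., a little more than 75 Mb/s for each of 8 users, which is consistent with the requirement stated in the text.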

LLR computation unit

The LLR computation unit (LCU) generates max-log soft-output values given the effective channel gains $\mu^{(i)}$ from the IFFT block and the post-equalization SINR values $\rho_i^2$ obtained from the SINR block. Since LTE specifies Gray mappings for all modulation schemes (BPSK, QPSK, 16-QAM, and 64-QAM), one can simplify the computation of the max-log LLR values by rewriting $L^{(i)}_t(b) = \rho_i^2 \lambda_b(\hat{x}^{(i)}_t)$ and realizing that $\lambda_b(\cdot)$ is a piecewise linear function that depends on the bit index (see [20] for the details). To this end, the LCU first scales the real and imaginary parts of the equalized time-domain symbol with the reciprocal of the effective channel gain $1/\mu^{(i)}$. Then, it evaluates the piecewise linear function $\lambda_b(\hat{x}^{(i)}_t)$ and scales the result with the post-equalization SINR $\rho_i^2$. The resulting max-log LLR value is then delivered to the output of the unit. In order to minimize the circuit area, the proposed architecture evaluates each piecewise linear function with logical shifts and additions only. The reciprocals are computed with a lookup table that is stored in B-RAM units (see [22] for architectural details). A single instance of the resulting LCU is able to process one symbol every clock cycle, resulting in a peak throughput of 1.89 Gb/s for 64-QAM at 317 MHz.

Implementation Results and Trade-offs

The approximate detection engine for 3GPP LTE and the exact Cholesky-based detector have been implemented on a Xilinx Virtex-7 XC7VX980T FPGA. The fixed-point

parameters, FPGA implementation results, and the associated performance/complexity trade-offs are presented next.

Fixed-Point Design Parameters

In order to minimize the hardware complexity, fixed-point arithmetic is used in the entire design. The associated fixed-point parameters were determined via extensive simulations. In the following, the word-lengths refer to the real or imaginary part of a complex-valued number. The channel matrices $\mathbf{H}_w$, the receive vectors $\mathbf{y}_w$, and the noise variance $N_0 E_s^{-1}$ are all quantized to 15 bit. The word-lengths of the outputs of the Gram matrix and inversion units are also set to 15 bit; likewise, the matched filter unit has 15 bit at the input and output. For both matrix inversion circuits, all multiplications have been mapped onto Xilinx DSP48 slices. In order to achieve sufficient precision at minimum implementation complexity, the MAC registers within the DSP48 units are set to 22 bit. The LUT in the reciprocal unit consists of 1024 addresses with 12 bit outputs. Hence, it can be implemented efficiently using a single block RAM (B-RAM) available on the FPGA. The equalizer module uses a 15 bit input and its output, which is stored in the data buffer, is quantized to 12 bit. The buffer stores (complex-valued) data for 1200 subcarriers and $U$ users. The SINR computation module has a 15 bit input and a 12 bit output. The input and output of the IFFT unit are 12 bit; the precision of the internal multipliers is set to 18 bit. The inputs of the LLR computation are quantized

to 12 bit and the computed LLRs are represented by 8 bit. The resulting fixed-point performance is shown in Figure 4.5 (labeled "FP") for $64 \times 4$ and $128 \times 8$ systems. As can be seen, the fixed-point implementation is virtually indistinguishable from the floating-point golden model. In particular, the implementation loss is less than 0.05 dB SNR at 10% BLER.

FPGA Implementation Results

Table 4.5 summarizes the key (post-place-and-route) implementation results of the proposed approximate and exact soft-output data detectors for LTE-based massive MIMO wireless systems. We parameterized the architecture for $U$ and $B$ to explore the impact on the required FPGA resources and the corresponding throughput. The implementation results for antenna configurations of $128 \times 8$ and $64 \times 4$ are detailed in Table 4.5. In order to support a 75 Mb/s data rate for each LTE-A user in 20 MHz bandwidth, we use multiple instances of the preprocessing unit. Specifically, we used 8 and 5 instances of the approximate matrix inversion unit for the $128 \times 8$ and $64 \times 4$ systems, respectively. For the exact inverse, we used 6 and 3 regularized Gram matrix units for the $128 \times 8$ and $64 \times 4$ systems, respectively. In addition, we used one Cholesky decomposition unit and one forward/backward substitution unit for both cases to meet the data rate requirements. As shown in Table 4.5, all designs are capable of running at 317 MHz, and the critical path is the routing between different blocks of the detector. For the $128 \times 8$ and

$64 \times 4$ systems, the proposed units can achieve 603 Mb/s and 301 Mb/s, respectively. For the $64 \times 4$ system, the design meets the 300 Mb/s peak data rate requirement specified in LTE-A with 4 users and 20 MHz bandwidth. In addition, our design can scale beyond LTE-A specifications, i.e., the proposed designs can support up to 8 users and still achieve the 75 Mb/s per-user requirement.

In terms of resources used on the Virtex-7 XC7VX980T FPGA, the approximate soft-output data detector is smaller than the Cholesky-based unit. There are notable savings in logic slices and DSP48 units. For $64 \times 4$, $K = 3$ uses 56% fewer LUT slices and 29% fewer DSP48 units than the Cholesky-based unit. For $128 \times 8$, $K = 3$ uses 19% fewer LUT slices and 26% fewer DSP48 units than the Cholesky-based unit. We emphasize that the savings in hardware resources become significantly larger as the number of users $U$ increases.

Performance/Complexity Trade-off

Based on the simulated BLER results in Figure 4.5 and the associated FPGA implementation results, we are now ready to characterize the error-rate performance vs. hardware complexity trade-offs associated with the detector containing the proposed approximate matrix inversion and the Cholesky-based exact inversion. To this end, we show the associated hardware complexity against the minimum SNR required to achieve 10% BLER in Figure 4.6. Since both designs are dominated by multipliers, we define the hardware complexity as the number of multipliers required to achieve a

75 Mb/s per-user throughput. From Figure 4.6, we observe that the hardware complexity of the Cholesky-based detector is larger than that of the approximate inversion circuit for $K = 3$ and $K = 2$. In addition, for large ratios $B/U$ between the number of BS antennas and the number of users, we clearly see that the SNR performance of the approximate inverse with $K = 3$ and that of the exact inverse are very similar. For small ratios $B/U$, however, the performance difference between the approximate inverse and the exact inverse is rather large, which is reflected in the analysis shown in Section. Hence, the ratio $B/U$ determines whether an approximate or exact inversion is beneficial in a practical large-scale MIMO system. Note that for $128 \times 8$ and $64 \times 4$, the approximate inverse with $K = 2$ is unable to achieve 10% BLER (cf. Figure 4.5). We note that when considering 16-QAM modulation (rather than 64-QAM modulation, as shown here), the approximate inversion for $K = 2$ is capable of achieving performance similar to that of the exact inverse (see [22, 23] for corresponding simulation results).

Related FPGA Designs for Linear Data Detection

A host of FPGA designs for linear data detection in conventional (small-scale) MIMO systems have been proposed in the literature [28, 51, 52, 54, 80–83]. Unfortunately, all these designs differ in various ways. First, the corresponding architectures rely on different matrix inversion algorithms, such as the QR decomposition [51, 54, 80, 83], Gram-Schmidt orthogonalization [52, 84], LU decomposition [20], direct matrix

inversion [82], and divide-and-conquer methods [81, 85]. Second, none of the FPGA implementations generates soft outputs, with the exception of [86]. Third, the designs were implemented on different FPGA types. Since the soft-output detector implementations proposed in this thesis are for large-scale MIMO systems having hundreds of BS antennas and none of the small-scale MIMO detector designs in [28, 51, 54, 80–85] was implemented on a Xilinx Virtex-7 FPGA, a fair comparison of our design with the above-mentioned implementations is difficult. Hence, we decided to resort to the comparison with our own reference circuit, i.e., the Cholesky-based inverse, as shown in Section

Algorithm 1 Modified Gram-Schmidt CUDA kernel: computations performed at the $k$-th thread.
1. Input: $\mathbf{y}$, $\mathbf{H}$
2. Initialization:
   (a) $s = 0$, where $s$ is stored in shared memory
   (b) Fetch $\mathbf{y}$ and $\mathbf{H}$ to construct $\mathbf{V} = [\mathbf{H}\ \tilde{\mathbf{y}}]$ in shared memory
3. for step $i = 0$ to $2N_t - 1$ do
4.    if ($k = i$)
5.       $e_{i,i} = \|\mathbf{v}_i\|_2$
6.       $s = 1/e_{i,i}$
7.    end if
8.    syncthreads()
9.    $v_{k,i} = v_{k,i}\, s$
10.   syncthreads()
11.   if ($k \geq i$)
12.      $e_{i,k+1} = \mathbf{v}^H_i \mathbf{v}_{k+1}$
13.      $\mathbf{v}_{k+1} = \mathbf{v}_{k+1} - \mathbf{v}_i e_{i,k+1}$
14.   end if
15. end for
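For readers who want to check the arithmetic of Algorithm 1 on a CPU, the following is a minimal sequential sketch of modified Gram-Schmidt applied to the augmented matrix $[\mathbf{H}\ \mathbf{y}]$. It is not the CUDA kernel: the per-thread work of the inner update is written as an ordinary loop over $k$, the inputs are assumed to be real-valued (as after a real-valued decomposition), and the function name `mgs_qr` and the column-list data layout are choices made for this example only.

```python
def mgs_qr(h_cols, y):
    """Modified Gram-Schmidt on [H y] with real-valued inputs.

    h_cols: list of the columns of H; y: receive vector.
    Returns (q_cols, E) where E = [R y_hat], R is upper triangular,
    H = Q R, and y_hat = Q^T y.
    """
    n = len(h_cols)
    cols = [list(c) for c in h_cols] + [list(y)]
    E = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        # Steps 4-9: normalize column i.
        E[i][i] = sum(x * x for x in cols[i]) ** 0.5
        cols[i] = [x / E[i][i] for x in cols[i]]
        # Steps 11-14: remove the component along column i from all later
        # columns (done by parallel threads in the kernel, sequentially here).
        for k in range(i + 1, n + 1):
            E[i][k] = sum(a * b for a, b in zip(cols[i], cols[k]))
            cols[k] = [b - a * E[i][k] for a, b in zip(cols[i], cols[k])]
    return cols[:n], E

if __name__ == "__main__":
    q_cols, E = mgs_qr([[3.0, 4.0], [1.0, 2.0]], [1.0, 1.0])
    print(E)
```

The returned matrix has the form $[\mathbf{R}\ \hat{\mathbf{y}}]$, which is the input expected by the candidate search in Algorithm 2.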

Algorithm 2 Candidate search CUDA kernel: the $k$-th thread searches for the $k$-th candidate.
1. Input: $\mathbf{E} = [\mathbf{R}\ \hat{\mathbf{y}}]$
2. Initialization:
   (a) $Q = \sqrt{M}$
   (b) $\mathbf{p}^k = [0, 0, \ldots, 0, \Im(\Omega_k), \Re(\Omega_k)]$
3. $d_k = (\hat{y}_{2N_t-1} - r_{2N_t-1,2N_t-1}\, p^k_{2N_t-1})^2$
4. $d_k = d_k + \big(\hat{y}_{2N_t-2} - \sum_{i=2N_t-2}^{2N_t-1} r_{2N_t-2,i}\, p^k_i\big)^2$
5. for step $i = 2N_t - 3$ to $0$ do
6.    $b^k_i = \hat{y}_i$
7.    for step $j = 2N_t - 1$ to $i + 1$ do
8.       $b^k_i = b^k_i - r_{i,j}\, p^k_j$
9.    end for
      // Find the best outgoing node
10.   $\gamma_k = b^k_i / r_{i,i}$
11.   $p^k_i = \mathrm{round}\big(\tfrac{1}{2}(\gamma_k + Q - 1)\big) \cdot 2 - Q + 1$
12.   if ($|p^k_i| > Q - 1$) $p^k_i = \mathrm{sign}(p^k_i)(Q - 1)$
      // Update the distance of the $k$-th path
13.   $d_k = d_k + (b^k_i - r_{i,i}\, p^k_i)^2$
14. end for
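Steps 11 and 12 of Algorithm 2 map an unconstrained estimate to the nearest per-dimension constellation level in constant time, independent of the modulation order. A minimal sketch, assuming the real-valued levels are the odd integers $-(Q-1), \ldots, -1, +1, \ldots, Q-1$; the function names are hypothetical, and the exhaustive search is included only to cross-check the closed form.

```python
import math

def round_and_threshold(gamma, Q):
    """Nearest level in {-(Q-1), ..., -1, +1, ..., Q-1} using rounding and clipping."""
    # math.floor(x + 0.5) is used instead of round(), because Python's round()
    # rounds halves to the nearest even integer.
    p = 2 * math.floor(0.5 * (gamma + Q - 1) + 0.5) - (Q - 1)
    if abs(p) > Q - 1:
        p = math.copysign(Q - 1, p)   # clip to the outermost level
    return int(p)

def exhaustive_nearest(gamma, Q):
    """Reference: search all Q levels (cost grows with the modulation order)."""
    return min(range(-(Q - 1), Q, 2), key=lambda level: abs(gamma - level))

if __name__ == "__main__":
    for gamma in (-10.0, -2.2, 0.3, 2.9, 10.0):
        print(gamma, round_and_threshold(gamma, 4), exhaustive_nearest(gamma, 4))
```

The two functions agree except at exact ties (estimates that lie halfway between two levels), where either neighbor is an equally valid choice.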

Algorithm 3 LLR computation CUDA kernel: the $k$-th thread updates the $k$-th 0-hypothesis and the $k$-th 1-hypothesis.
1. $\mathbf{b}_k = \mathrm{demod}(\mathbf{p}^k)$, $d_k = d$
2. syncthreads()
3. if ($k < N_t \log_2(M)$)
4.    Initialization: $h^1_k = \infty$ and $h^0_k = \infty$
5.    for step $j = 0$ to $M - 1$ do
6.       if ($k$-th bit of $\mathbf{b}_j = 1$) and ($d_j < h^1_k$)
7.          $h^1_k = d_j$
8.       else if ($k$-th bit of $\mathbf{b}_j = 0$) and ($d_j < h^0_k$)
9.          $h^0_k = d_j$
10.      end if
11.   end for
12. end if
13. $L^D_k = h^0_k - h^1_k$
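The hypothesis update in Algorithm 3 amounts to a pair of running minima per bit position. The following sequential sketch illustrates this; the function name and the candidate-list representation (bit tuples plus distances) are assumptions made for the example, and the loop over bit positions replaces the per-thread parallelism of the kernel.

```python
import math

def max_log_llrs(candidate_bits, distances):
    """Max-log LLRs from a candidate list.

    candidate_bits[j] is the bit tuple of candidate j and distances[j] its
    Euclidean distance. For bit k the LLR is h0 - h1, where h0 (h1) is the
    smallest distance among candidates whose k-th bit is 0 (1).
    """
    llrs = []
    for k in range(len(candidate_bits[0])):
        h0 = h1 = math.inf
        for bits, d in zip(candidate_bits, distances):
            if bits[k] == 1:
                h1 = min(h1, d)
            else:
                h0 = min(h0, d)
        llrs.append(h0 - h1)
    return llrs

if __name__ == "__main__":
    bits = [(0, 1), (1, 1), (1, 0)]
    dist = [0.5, 2.0, 1.25]
    print(max_log_llrs(bits, dist))
```

If the list contains no candidate for one hypothesis of a bit, the corresponding minimum stays infinite in this sketch; a practical implementation would have to clip or otherwise handle such LLRs.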

Table 4.1: MIMO detection kernel time for 8192 MIMO symbols. Entries are kernel time / throughput, listed in order of increasing N (columns N = 1, N = 2, N = 3, N = 4).

Fermi
  2×2, 16-QAM: ms/946.8 Mb/s, ms/511.5 Mb/s
  4×4, 16-QAM: ms/ Mb/s, ms/406.0 Mb/s, ms/302.2 Mb/s, ms/239.7 Mb/s
  2×2, 64-QAM: ms/204.2 Mb/s, ms/102.8 Mb/s
  4×4, 64-QAM: ms/ Mb/s, ms/164.7 Mb/s, 1.75 ms/111.7 Mb/s, ms/76.5 Mb/s
Kepler
  2×2, 16-QAM: ms/ Mb/s, ms/701.1 Mb/s
  4×4, 16-QAM: ms/ Mb/s, ms/533.7 Mb/s, ms/351.7 Mb/s, ms/254.2 Mb/s
  2×2, 64-QAM: ms/423.7 Mb/s, ms/205.5 Mb/s
  4×4, 64-QAM: ms/519.2 Mb/s, ms/274.9 Mb/s, ms/ Mb/s, ms/ Mb/s

Figure 4.2: Throughput of the 4×4 N-way MIMO detector on GPU vs. workload size. Panels: (a) 16-QAM, Fermi; (b) 64-QAM, Fermi; (c) 16-QAM, Kepler; (d) 64-QAM, Kepler. Each panel plots throughput [Mb/s] against the number of subcarriers (up to 8,192) for N = 1, 2, 3, and 4.

Table 4.2: QR decomposition GPU kernel time (in ms) for 8192 MIMO symbols. Columns: GPU, configuration, N = 1, N = 2, N = 3, N = 4; rows: Fermi and Kepler.

Table 4.3: Total runtime for 8192 MIMO symbols including data transport time. Entries are runtime / throughput, listed in order of increasing N (columns N = 1, N = 2, N = 3, N = 4).

Fermi
  2×2, 16-QAM: ms/191.1 Mb/s, ms/147.7 Mb/s
  4×4, 16-QAM: ms/147.2 Mb/s, ms/115.9 Mb/s, ms/96.3 Mb/s, ms/79.9 Mb/s
  2×2, 64-QAM: ms/119.5 Mb/s, ms/72.6 Mb/s
  4×4, 64-QAM: ms/125.6 Mb/s, ms/89.9 Mb/s, ms/66.4 Mb/s, ms/0.6 Mb/s
Kepler
  2×2, 16-QAM: ms/224.3 Mb/s, ms/145.3 Mb/s
  4×4, 16-QAM: ms/188.0 Mb/s, ms/141.6 Mb/s, ms/108.3 Mb/s, ms/89.7 Mb/s
  2×2, 64-QAM: ms/200.5 Mb/s, ms/125.8 Mb/s
  4×4, 64-QAM: ms/198.9 Mb/s, ms/136.9 Mb/s, ms/94.8 Mb/s, ms/79.6 Mb/s

Table 4.4: Throughput comparison of the MIMO detection kernel with other GPU MIMO detectors. Columns: detector type, GPU, configuration, throughput [Mb/s], normalized throughput [(Mb/s)/(Core GHz)]. Detector types: Trellis-based [38], FPFSD [34], and N-way with N = 4. GPUs: Tesla C1060, Tesla C2070, GTX. Configurations: 16-QAM and 64-QAM for each detector.

Table 4.5: Implementation results on a Xilinx Virtex-7 XC7VX980T FPGA (percentages denote device utilization).

Antenna configuration (a)    128×8                 64×4
Inversion algorithm (b)      K = 3     Cholesky    K = 3     Cholesky
Clock frequency [MHz]
Throughput [Mb/s]
LUT slices                   (28%)     (34%)       (6%)      (12.9%)
FF slices                    (16%)     (17.4%)     (3.2%)    (3.2%)
DSP48 units                  (30%)     (40.2%)     (7%)      (9.14%)
Block RAMs                   (0.6%)    (2.17%)     (0.4%)    (1.07%)

(a) 128×8 refers to B = 128 BS antennas and U = 8 single-antenna users.
(b) K = 3 designates the approximate inversion with 3 Neumann series terms.

Figure 4.3: High-level VLSI architecture of the large-scale MIMO detection engine for 3GPP LTE-A.

Figure 4.4: High-level architecture of the systolic array for B = 3. The rows of $\mathbf{H}^T$ enter through registers (Reg) and feed the processing elements P00, P10, P11, P20, P21, and P22.

Figure 4.5: Block error-rate (BLER) performance comparison for (a) U = 4, (b) U = 8, and (c) U = 12 single-antenna users, where M = 64 and MCS = 28; "FP" designates the performance of a fixed-point implementation. Each panel plots BLER against SNR [dB] for the K = 1, K = 2, and K = 3 approximations and the exact inverse: (a) B = 64 and B = 128, with FP for B = 64, K = 3; (b) B = 64, B = 128, and B = 256, with FP for B = 128, K = 3; (c) B = 128 and B = 256.

Figure 4.6: Performance/complexity trade-off. Hardware complexity is defined as the number of DSP48E1 slices required to achieve the LTE-A uplink 75 Mb/s per-user peak throughput. The plot shows hardware complexity against the minimum SNR [dB] required to achieve 10% BLER for the Cholesky-based (Chol) and Neumann-series (K = 2, K = 3) designs; arrows indicate the trends towards more users and towards more BS antennas.

Chapter 5 Turbo Decoder

Turbo codes are capacity-approaching channel codes that can be decoded at high throughput and low power using dedicated hardware accelerators [87]. Hence, turbo codes are used in a large number of cellular wireless standards, such as 3GPP HSPA+ [88] and LTE-Advanced [79]. As a result, we utilize turbo codes extensively in our performance evaluation (see Section 2.1.6, Section and Section 3.8). A fast software-based turbo decoder implementation would therefore significantly reduce simulation time. In addition, software-based turbo decoders can enable better software-based wireless testbeds.

Recently, a number of software-based wireless testbeds have been developed to demonstrate the feasibility of software-based real-time communication systems on general purpose processors [89–91]. Implementations on such architectures are attractive for multiple reasons. Although turbo codes offer superior error-rate performance over convolutional codes, turbo decoding requires higher computational complexity [92]. Unsurprisingly, turbo decoding is typically carried out with specialized hardware accelerators, such as ASIC designs [93–97] or FPGA implementations [98], to enable high-throughput performance. Consequently, SDR systems such as the ones in [89–91] rely on convolutional codes instead of turbo codes to avoid the high

complexity of channel decoding. The use of convolutional codes, however, results in inferior error-correction performance (compared to turbo codes). In addition, LTE-Advanced specifies the use of turbo codes for both uplink and downlink. Hence, corresponding SDR receiver designs necessitate the development of software-based turbo decoder solutions that are capable of achieving high throughput.

In this chapter, we evaluate the performance of turbo decoder implementations on two different high-performance programmable processors: a quad-core Intel i7-3770K (Ivy Bridge) and an Nvidia GeForce GTX 680 (Kepler GK104). We design two parallel turbo decoder implementations with a similar feature set. Both proposed implementations support HSPA+ and LTE-Advanced and take advantage of the unique features of both platforms. In particular, we perform a variety of device-specific optimizations, such as the use of the linear-MAP approximation on the CPU and the use of shuffle instructions on the GPU, to maximize the throughput and/or to improve the error-rate performance. We conclude by comparing the throughput of both implementations.

5.1 Overview of Turbo Decoding

The high-level structure of a rate-1/3 3GPP turbo decoder is shown in Figure 5.1. The turbo decoder consists of two concatenated component decoders exchanging soft information in terms of the log-likelihood ratio (LLR) for each transmitted information bit through an interleaver (denoted by $\Pi$) and a deinterleaver (denoted by $\Pi^{-1}$). HSPA+ uses intra-row and inter-row permutations to generate the interleaver addresses [88],


whereas LTE-Advanced (LTE-A) uses a quadratic permutation polynomial (QPP) interleaver [99]. The component decoders are the same for HSPA+ and LTE-A.

Figure 5.1: High-level structure of a rate-1/3 3GPP turbo decoder.

5.1.1 Algorithm outline

Turbo decoding is carried out in multiple iterations (denoted by $I$), where each iteration consists of two component decoding phases. In each phase, a component decoder performs maximum a-posteriori (MAP) decoding using the BCJR algorithm [100], which generates so-called extrinsic LLRs given the LLRs obtained by the detector and the a-priori LLRs obtained from the other component decoder. The BCJR algorithm consists of one forward and one backward traversal on a trellis, which is defined by the underlying code. Specifically, to decode a codeword of $N$ information bits, the BCJR algorithm performs the following steps: (i) In the forward traversal step, it iteratively computes $N$ sets of forward state metrics, one for each transmitted information bit. (ii) In the backward traversal step, it iteratively computes $N$ sets of backward state metrics, one for each transmitted information bit. To compute the extrinsic LLRs,

the BCJR algorithm then combines the forward and backward state metrics. 5.1.
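As a small illustration of the QPP interleaver mentioned above, the following sketch generates the address table $\pi(i) = (f_1 i + f_2 i^2) \bmod K$ together with its inverse. The function names are hypothetical, and the coefficients $f_1 = 263$, $f_2 = 480$ for $K = 6144$ are quoted from memory of the LTE channel-coding specification; they should be checked against the standard before being relied upon.

```python
def qpp_interleaver(K, f1, f2):
    """Return the QPP address table pi[i] = (f1*i + f2*i^2) mod K."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]


def invert_permutation(pi):
    """Return the deinterleaver table, i.e. inv[pi[i]] = i."""
    inv = [0] * len(pi)
    for i, target in enumerate(pi):
        inv[target] = i
    return inv


# Assumed parameters for the longest LTE block (verify against the standard).
K, F1, F2 = 6144, 263, 480
PI = qpp_interleaver(K, F1, F2)
PI_INV = invert_permutation(PI)
```

Because the polynomial is evaluated once per block size, both tables can be computed ahead of time and reused for every decoding iteration.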

Figure 5.2: Structure of the 3GPP trellis. There are 8 states per trellis step and one step per transmitted information bit. The vector $s^k$ consists of all state metrics at trellis step $k$. The values $u_k$ and $p_k$ are the $k$th information bit and the parity bit (both ±1), respectively.

5.1.2 Branch-metric computation

HSPA+ and LTE-Advanced both operate on an 8-state trellis, which is illustrated in Figure 5.2. Let $s_j^{k+1}$ be the $j$th state associated with information bit $k+1$. There are two incoming branches into state $s_j^{k+1}$. Each incoming branch is associated with values $u_k$ and $p_k$, the $k$th information bit and the parity bit (both ±1), respectively.
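To make the trellis structure concrete, the following sketch enumerates all 16 branches and groups them by destination state. It rests on several assumptions that the text above does not spell out: the constituent encoder uses the feedback polynomial $1 + D^2 + D^3$ and the feedforward polynomial $1 + D + D^3$, the most recent register bit is the most significant bit of the state index, and bits are mapped as 0 → +1 and 1 → −1. The names `build_trellis`, `BRANCHES`, and `INCOMING` are hypothetical.

```python
def build_trellis():
    """List all branches (state, u, p, next_state) of the 8-state trellis.

    Assumed encoder: feedback 1 + D^2 + D^3, feedforward 1 + D + D^3.
    Assumed mapping to +/-1: bit 0 -> +1, bit 1 -> -1.
    """
    branches = []
    for state in range(8):
        s1, s2, s3 = (state >> 2) & 1, (state >> 1) & 1, state & 1
        for bit in (0, 1):
            feedback = bit ^ s2 ^ s3          # input to the shift register
            parity = feedback ^ s1 ^ s3       # parity output bit
            next_state = (feedback << 2) | (s1 << 1) | s2
            branches.append((state, 1 - 2 * bit, 1 - 2 * parity, next_state))
    return branches


BRANCHES = build_trellis()
# Group the branches by the state they enter.
INCOMING = {j: [b for b in BRANCHES if b[3] == j] for j in range(8)}
```

Under these assumptions every state has exactly two incoming branches, and only four distinct $(u_k, p_k)$ label pairs occur across the 16 branches.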

The branch metrics associated with states $s_i^k$ and $s_j^{k+1}$ are computed as follows:

$$\gamma(s_i^k, s_j^{k+1}) = 0.5\,(L_{\mathrm{sys}}^k + L_a^k)\,u_k + 0.5\,(L_p^k\,p_k).$$

Here, $L_{\mathrm{sys}}^k$ and $L_a^k$ are the systematic channel LLR and the a-priori LLR for the $k$th trellis step, respectively. In addition, the parity LLRs for the $k$th trellis step are $L_p^k = L_{p0}^k$ for MAP decoder 0 and $L_p^k = L_{p1}^k$ for MAP decoder 1. Note that we do not need to evaluate the branch metric $\gamma(s^k, s^{k+1})$ for all 16 possible branches (see Figure 5.2), as there are only four different branch metrics: $\gamma_k^0 = 0.5\,(L_{\mathrm{sys}}^k + L_a^k + L_p^k)$, $\gamma_k^1 = 0.5\,(L_{\mathrm{sys}}^k + L_a^k - L_p^k)$, $-\gamma_k^0$, and $-\gamma_k^1$.

5.1.3 Forward and backward state metric computation

The forward state metrics can be computed iteratively from trellis step to trellis step. The forward state metrics of step $k+1$ correspond to the vector $\alpha^{k+1} = [\alpha_1^{k+1}, \ldots, \alpha_8^{k+1}]$, where the $j$th forward state metric $\alpha_j^{k+1}$ only depends on two forward state metrics of step $k$. These state metrics are computed by

$$\alpha_j^{k+1} = \max^{*}_{i \in F} \left\{ \alpha_i^k + \gamma(s_i^k, s_j^{k+1}) \right\}, \tag{5.1}$$

where the set $F$ contains the two indices of the states in step $k$ connected to state $s_j^{k+1}$ (as defined by the trellis). The $\max^{*}\{\cdot\}$ operator is defined as

$$\max^{*}\{a, b\} = \max\{a, b\} + \log\left(1 + \exp(-|a - b|)\right), \tag{5.2}$$

where $\log(1 + \exp(-|a - b|))$ is a correction term. For the max-log approximation, we approximate $\max^{*}$ by $\max^{*}(a, b) \approx \max(a, b)$. In this case, one can scale the extrinsic LLRs by a factor of 0.7 to partially recover the error-rate performance loss induced by the approximation (see, e.g., [96, 101] for additional details).

Computation of the backward state metrics is similar to that of the forward trellis traversal in (5.1). The vector of backward state metrics, denoted by $\beta^k = [\beta_1^k, \ldots, \beta_8^k]$, is computed as

$$\beta_j^k = \max^{*}_{i \in B} \left\{ \beta_i^{k+1} + \gamma(s_j^k, s_i^{k+1}) \right\}. \tag{5.3}$$

Here, the set $B$ contains the indices of states in step $k+1$ connected to state $s_j^k$ as defined by the trellis.
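The relation between the exact $\max^{*}$ operator in (5.2), the max-log approximation, and a single instance of the forward update (5.1) can be sketched as follows. The function names and the numerical values of the state and branch metrics are purely illustrative assumptions.

```python
import math


def max_star(a, b):
    """Exact max* from (5.2): max plus the log-domain correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a, b):
    """Max-log approximation: the correction term is dropped."""
    return max(a, b)


def forward_metric(alpha_prev, incoming, op=max_star):
    """One instance of (5.1) for a single destination state j.

    alpha_prev: forward metrics of step k, indexed by state
    incoming:   the two (i, gamma) pairs of the branches entering state j
    """
    (i0, g0), (i1, g1) = incoming
    return op(alpha_prev[i0] + g0, alpha_prev[i1] + g1)


# Hypothetical metrics of step k and hypothetical branches into one state.
alpha_k = [0.0, -1.5, 0.4, 2.0, -0.3, 1.1, 0.0, -2.2]
incoming_j = [(0, 0.8), (1, -0.8)]
exact = forward_metric(alpha_k, incoming_j, max_star)
approx = forward_metric(alpha_k, incoming_j, max_log)
```

The exact operator equals $\log(e^a + e^b)$, so the max-log result is never larger than the exact one and falls short of it by at most $\log 2$, which is reached when both candidates are equal.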

5.1.4 LLR computation

After the forward and backward iterations have been carried out, the extrinsic LLRs for the $k$th bit are computed as

$$L_e^k = \max^{*}_{\{s^k, s^{k+1}\} \in U^{1}} \left\{ \alpha_i^k + \beta_j^{k+1} + \gamma(s_i^k, s_j^{k+1}) \right\} - \max^{*}_{\{s^k, s^{k+1}\} \in U^{-1}} \left\{ \alpha_i^k + \beta_j^{k+1} + \gamma(s_i^k, s_j^{k+1}) \right\} - L_{\mathrm{sys}}^k - L_p^k,$$

where the sets $U^{1}$ and $U^{-1}$ designate the set of states connected by paths where $u_k = 1$ and the set of states connected by paths where $u_k = -1$, respectively.

5.2 Turbo Decoder Implementations

At a high level, both the Intel Ivy Bridge and Nvidia Kepler architectures can be viewed as multi-core SIMD processors. For the Intel CPU, we explicitly deploy SIMD instructions via Intel intrinsics to vectorize the MAP decoding algorithm. For the Nvidia GPU, we used CUDA [72] to parallelize the workload. The CUDA compiler can easily vectorize the GPU computations, since in the CUDA programming model, threads execute the same set of computations, just on different input data.

To achieve high decoding throughput, the BCJR algorithm outlined in Section 5.1 needs to be vectorized. We deploy the vectorization scheme put forward in [102], which vectorizes the BCJR algorithm into SIMD operations on vectors with eight 8 bit elements to accelerate a UMTS turbo decoder on an Analog Devices DSP. In the following subsections, we compare and contrast our turbo decoder implementations for the Intel CPU and the Nvidia GPU.

5.2.1 SIMD data types

For the quad-core Intel Ivy Bridge processor, each core can execute SIMD instructions, supporting operations on various vector data types. Most hardware implementations of turbo decoders carry out fixed-point calculations and use 10 bit to 12 bit precision to achieve an error-rate performance close to a floating-point implementation [94, 96, 97]. To achieve high throughput on the CPU while maintaining good error-rate performance, we used vectors with eight 16 bit integer elements.

The targeted Nvidia GTX 680 consists of 8 Next Generation SM (SMX) units, where each SMX unit is roughly equivalent to an Intel Ivy Bridge core. An SMX unit can issue multiple 1024 bit SIMD instructions in parallel, where each instruction operates on vectors with 32 elements of 32 bit each. The architecture is optimized for single-precision floating-point operations (integer operations can be up to 6× slower). As a result, we used single-precision floating-point operations in our GPU implementation. Since the computations for turbo decoding consist of operations on vectors with 8 elements, we also decode at least 4 codewords in parallel to ensure full utilization of the 1024 bit SIMD instruction.

5.2.2 Memory allocation

Among all possible codes in HSPA+ and LTE-A, the longest code is the LTE codeword with K = 6144 information bits. The rate-1/3 encoded data of this codeword requires the largest amount of memory among all codewords. For our CPU implementation, we store all data as 16 bit values. The implementation requires 48 KB for input/output LLRs and 96 KB for forward state metrics. Since the amount of L2 cache per core is 256 KB, all data fits into the cache.

On the GPU, shared memory, a small amount of memory (48 KB per SMX) managed using explicit load and store instructions, can be used to cache data locally. Unfortunately, we cannot store our data in shared memory: we decode at least 4 codewords in parallel to ensure full utilization of the 1024 bit SIMD instruction, which requires at least 4× the amount of storage and outstrips the amount of available shared memory. Therefore, we store the input/output LLRs and forward state metrics in the device's memory, which has high access latency and thus reduces the throughput of the design.

5.2.3 Multi-mode interleaver lookup table

To support HSPA+ and LTE-A, we need to support both interleaver types. Generating the HSPA+ interleaver addresses is rather complicated [88]. To achieve high throughput, we decided to generate lookup tables that contain all possible interleaved and deinterleaved memory addresses instead of computing the addresses on-the-fly. For the Intel architecture, we store the lookup table in memory and rely on the fact that the entries in the lookup table will be cached. For the Nvidia architecture, we explicitly copy the correct entries of the lookup table at the start of the decoding iteration into constant memory, a small amount of read-only cache available on the GPU; this enables efficient lookup table accesses.

5.2.4 Max* operation

For the max-log approximation, we simply omit the correction term of the $\max^{*}$ operator, as defined in (5.2), and approximate $\max^{*}$ as a simple max operation followed by scaling the extrinsic LLRs by a factor of 0.7. For both the CPU and GPU implementations, the max-log approximation corresponds to a vertical (element-wise) max instruction.

Since the GPU supports element-wise logarithm and exponential functions, we can implement the log-MAP algorithm directly. Overall, the log-MAP algorithm requires one vertical max instruction, one vector subtraction, one vector absolute value, one call to the element-wise log function, one call to the element-wise exponential function, and one vector addition.

The Intel architecture does not support scalar or vector fixed-point logarithm or exponential functions. We therefore approximate the correction term $c(a, b) = \log(1 + \exp(-|a - b|))$ with a piece-wise linear function

$$c(a, b) = \max\{0.25\,(2.77 - |a - b|),\, 0\},$$

as proposed in [103]. We then add this correction term to $\max\{a, b\}$. This approximation requires 6 additional instructions compared to the max-log approximation: one vector subtraction, one vector absolute value, one vector shift, one vector maximum, and one vector addition.

5.2.5 Branch-metric computation

In order to compute all branch metrics for every trellis step $k$, we need to compute four branch metrics, $\gamma_k^0$ and $\gamma_k^1$ and the negated versions, $-\gamma_k^0$ and $-\gamma_k^1$ (see Section 5.1.2), which requires scalar operations only. To parallelize the workload on the CPU and GPU, we fetch 8 consecutive systematic channel LLRs, 8 consecutive parity LLRs, and 8 consecutive a-priori LLRs at a time. We then compute the branch metrics, $\{\gamma_{k+1}^1, \ldots, \gamma_{k+8}^1\}$ and $\{\gamma_{k+1}^0, \ldots, \gamma_{k+8}^0\}$, in parallel. Finally, we compute the negated versions. In total, the branch metric computation requires two vector additions and three vector subtractions.

5.2.6 Forward and backward traversal

Figures 5.3(a) and 5.3(b) depict the vectorized implementation of the forward and backward state-metric computation units in (5.1) and (5.3), respectively. Compared to the implementation in [102], we rearranged the placement of shuffles (data exchange between SIMD lanes) to increase the instruction-level parallelism (ILP). This approach does not increase the number of required shuffle operations and is beneficial for the Intel architecture, since multiple shuffles can be executed in parallel. Figure 5.3(c) depicts the computations used to generate the extrinsic LLRs, where $\beta^{+}$ and $\beta^{-}$ are intermediate values computed while computing $\beta^k$ (see Figure 5.3(b)).

Figure 5.3: Vectorized implementations of turbo decoder operations: (a) Vectorized computation of $\alpha^{k+1}$ for the 3GPP turbo code. The block vmax implements the vectorized element-wise $\max^{*}$ operator. (b) Vectorized computation of $\beta^{k-1}$ for the 3GPP turbo code. (c) Vectorized LLR computation for the 3GPP turbo code. The block hmax reduces 8 elements in the input vector to one element using the $\max^{*}$ operator.

On the Intel architecture, the $\alpha^k$ and $\beta^k$ computations can be implemented very efficiently using intrinsics. The vector $\gamma^k$ is constructed using one shuffle instruction. The rest of the $\alpha^k$ and $\beta^k$ computation consists of two vector additions, two 128 bit shuffles, and two element-wise max operations. Since we use 16 bit for the forward state metrics, these metrics can overflow. Hence, we re-normalize the metrics by subtracting $\alpha^k(1)$ from $\alpha^k$ [97]. To reduce the number of instructions required by this re-normalization step, we normalize the state metrics only every 8 trellis steps. As a result, the overhead of re-normalization is low, requiring three additional instructions (extract the first element, broadcast, and vector subtract) every 8 trellis steps. The same normalization scheme is used during the backward state metric computation phase.

In our previous work [27], we emulated shuffle instructions with shared memory load and store instructions on the Nvidia architecture. One new feature of Kepler is that it explicitly supports shuffle instructions. We therefore replaced the shared memory operations with shuffles. As a result, the number of instructions for the $\alpha^k$ and $\beta^k$ computation is similar to that of the CPU implementation. Since all computations are carried out in floating-point arithmetic, the forward and backward state metrics do not need to be normalized.
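The periodic re-normalization described above for the 16 bit CPU metrics can be illustrated with a scalar sketch. It is not the SIMD implementation: it uses plain Python integers, a max-log forward recursion, and random per-branch metrics in the range [-127, 127] as a stand-in for real branch metrics. The predecessor table assumes a 3-bit shift-register state numbering, and all names are hypothetical.

```python
import random

# Assumed state numbering: state = 4*s1 + 2*s2 + s3 for a 3-bit shift register,
# so state j is entered from states 2*(j % 4) and 2*(j % 4) + 1.
PREDS = [(2 * (j % 4), 2 * (j % 4) + 1) for j in range(8)]


def forward_maxlog(gammas, renorm_every=None):
    """Max-log forward recursion; returns final metrics and the peak magnitude."""
    alpha = [0] * 8
    peak = 0
    for k, g in enumerate(gammas):
        alpha = [max(alpha[i0] + g[j][0], alpha[i1] + g[j][1])
                 for j, (i0, i1) in enumerate(PREDS)]
        peak = max(peak, max(abs(a) for a in alpha))
        if renorm_every and (k + 1) % renorm_every == 0:
            ref = alpha[0]                      # extract the first element
            alpha = [a - ref for a in alpha]    # broadcast and subtract
    return alpha, peak


rng = random.Random(0)
GAMMAS = [[(rng.randint(-127, 127), rng.randint(-127, 127)) for _ in range(8)]
          for _ in range(4096)]
raw, raw_peak = forward_maxlog(GAMMAS)
norm, norm_peak = forward_maxlog(GAMMAS, renorm_every=8)
```

Subtracting a common offset leaves the differences between state metrics unchanged, which is all that the subsequent LLR computation depends on, while keeping the stored values well inside the 16 bit range in this example.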

5.2.7 LLR computation

As shown in Figure 5.3(c), the LLR computation consists of two vector addition instructions, two hmax functions, and a few scalar operations. The function hmax reduces the 8 elements of a vector to one element using the $\max^{*}$ operator; this can be accomplished using a tree reduction, which requires 3 shuffle instructions and 3 max operations on the CPU. Unfortunately, a tree reduction does not use all lanes of the SIMD instruction. To increase computational efficiency, we buffer the inputs to hmax for 8 consecutive stages in shared memory column-wise, instead of evaluating hmax one stage at a time. We then apply the element-wise max operation row-wise to find the maximum in each column.

5.2.8 Multi-codeword decoding

A transmitted frame for both LTE-A and HSPA+ consists of multiple codewords. Since both the CPU and the GPU are multi-core processors, we parallelize the workload across all available cores to maximize the peak decoding throughput. For the CPU, such a parallelization is straightforward. We assigned at least one codeword per core using OpenMP to maximize the core utilization and, hence, the throughput.

The GPU heavily relies on multi-threading to hide pipeline stalls and memory-access stalls. A corresponding implementation requires a large number of independent sets of instructions [72]. As a result, we assigned a large number of codewords to each SMX using CUDA. To reduce the number of codewords needed to achieve high throughput on the GPU, we employed a windowed decoding approach, which divides a codeword into $P$ sections (or windows) and processes these sections in parallel [27]. Since the forward and backward state metrics are computed from the first trellis stage to the last stage (in a continuous fashion), we exchange the forward and backward state metrics among different windows between the iterations. This approach reduces the performance loss associated with parallel processing. Nevertheless, there is a tradeoff between the number of windows and the decoding latency, which we will discuss in the following section.

5.3 Implementation Results

We now show our implementation results on an Intel Core i7-3770k, a quad-core Ivy Bridge processor, and an Nvidia GTX 680, a Kepler GK104 GPU. In our experiments, we used the Intel C++ Compiler 14.0 and the CUDA 5.5 toolkit. In the following, we first compare the error-rate performance in terms of the frame error rate (FER) with a floating-point reference decoder [96], and then we compare the throughput of our optimized turbo decoder implementations.
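The windowed decoding approach of Section 5.2 can be sketched as a partition of the trellis plus a hand-over of boundary state metrics between iterations. The text only states that metrics are exchanged among windows. The particular pattern below is an assumption: each window starts its forward pass from the metrics its left neighbour ended with in the previous iteration, and its backward pass from those its right neighbour ended with. All function names are hypothetical.

```python
def split_into_windows(K, P):
    """Return P equal-size (start, end) trellis ranges covering steps 0..K."""
    if K % P != 0:
        raise ValueError("this sketch assumes that P divides K")
    size = K // P
    return [(w * size, (w + 1) * size) for w in range(P)]


def next_initial_alphas(final_alphas, trellis_start):
    """Forward metrics each window starts from in the next iteration.

    Window w reuses what window w-1 ended with; window 0 keeps the known
    start of the trellis.
    """
    return [trellis_start] + final_alphas[:-1]


def next_initial_betas(final_betas, trellis_end):
    """Backward metrics each window starts from in the next iteration.

    final_betas[w] is what window w's backward pass produced at its left
    edge; window w reuses the value from window w+1.
    """
    return final_betas[1:] + [trellis_end]


WINDOW_SIZES = {P: 6144 // P for P in (1, 32, 64, 96, 128)}
```

For K = 6144, the window counts considered in the following experiments correspond to windows of 6144, 192, 96, 64, and 48 trellis steps.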

5.3.1 FER performance

We evaluate the FER in an additive white Gaussian noise (AWGN) channel. For the following experiments, we used the longest LTE codeword, which consists of K = 6144 information bits, and a code rate of 1/3. Since the GPU turbo decoder supports windowed decoding, we consider different numbers of windows (all of equal size), where the number of windows is $P \in \{1, 32, 64, 96, 128\}$. Figure 5.4(a) compares the FER of the CPU decoder using the linear approximation detailed in Section 5.2 and the GPU decoder using log-MAP decoding. Figure 5.4(b) compares the FER of all turbo decoders using the max-log-MAP algorithm. As expected, the use of 16 bit precision for the CPU implementation leads to a small loss (≈0.14 dB for the max-log-MAP case) in terms of FER. The linear approximation results in only a very small FER degradation (≈0.12 dB). For the GPU implementation, the FER performance with $P = 1$ matches the reference (floating-point) implementation; for larger numbers of windows $P$, the FER performance degrades only slightly. We note that the same behavior applies to the HSPA+ codes, which are not shown here for the sake of brevity.

5.3.2 Peak decoding throughput

Table 5.1 shows the peak throughput of our CPU and GPU implementations. The results for HSPA+ are similar; for the log-MAP algorithm and 6 decoding iterations, we achieved 41.6 Mbps and 36.7 Mbps on the CPU and GPU, respectively.
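The linear log-MAP results quoted above rely on the piece-wise linear correction term of Section 5.2. As a rough numerical check, the sketch below compares that term with the exact correction $\log(1 + \exp(-|a - b|))$ in floating point. The fixed-point scaling used on the CPU is not modelled, and the function names are hypothetical.

```python
import math


def exact_correction(a, b):
    """Exact log-MAP correction term log(1 + exp(-|a - b|))."""
    return math.log1p(math.exp(-abs(a - b)))


def linear_correction(a, b):
    """Piece-wise linear approximation max{0.25 * (2.77 - |a - b|), 0}."""
    return max(0.25 * (2.77 - abs(a - b)), 0.0)


def max_star_linear(a, b):
    """max* with the linear correction added to the plain maximum."""
    return max(a, b) + linear_correction(a, b)


# Largest deviation from the exact term on a grid of |a - b| in [0, 8].
DIFFS = [i / 100 for i in range(0, 801)]
MAX_ERR = max(abs(linear_correction(d, 0.0) - exact_correction(d, 0.0))
              for d in DIFFS)
```

The two terms nearly coincide at $|a - b| = 0$ (0.6925 versus $\log 2 \approx 0.6931$), and the linear term is exactly zero once $|a - b|$ exceeds 2.77.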

Table 5.1: GPU peak throughput for the K = 6144, rate-1/3 LTE turbo code. Columns: I; CPU (Mbps): max-log-MAP, linear log-MAP; GPU (Mbps): max-log-MAP, log-MAP.

As expected, the throughput for both implementations is inversely proportional to the number of decoding iterations $I$. The CPU implementation appears to be instruction bound, as additional instructions increase the runtime proportionally. The number of instructions required to compute a forward state metric is 7 using the max-log approximation, whereas the use of the linear log-MAP approximation requires 6 additional instructions. As a result, the number of instructions required for linear log-MAP is approximately 2× that of max-log-MAP. For 6 decoding iterations, the throughput of the linear log-MAP approximation is 40 Mbps, which is approximately 2× lower than the throughput of the max-log-MAP implementation (76.2 Mbps).

For the GPU implementation, we found that the instructions per cycle (IPC) is low, typically 1. Using the Nvidia profiler, we found that instruction operands are often unavailable (due to memory latency), leading to a low execution unit utilization. As a result, additional instructions (that do not require additional memory accesses) can execute on the idling execution units and thus do not significantly degrade the throughput. Hence, the throughput of log-MAP is not significantly lower than that of the max-log-MAP algorithm on the GPU.

For the CPU implementation, a workload of 8 parallel codewords is sufficient to reach the peak throughput. For the GPU implementation, significantly more codewords need to be processed in parallel. Given a codeword, increasing the number of windows, $P$, does not increase the peak throughput, as the number of computations and memory transactions required to process the codeword stays the same. Increasing $P$, however, is an effective way of reducing the number of codewords required to reach peak throughput. This trend is similar to our previous implementation [27]. On the Kepler architecture, we require 512 codewords for $P = 1$ and 16 codewords for $P = 32$ and above to reach the peak performance.

In summary, our GPU implementation is up to 2× slower than our CPU implementation for the max-log approximation, and only 1.2× slower for the optimal log-MAP algorithm. For the Nvidia Kepler architecture, the maximum IPC is 6 to 7, while the achieved IPC is 1. The achieved IPC is low because operands of the instructions are often not ready due to memory access latency, leading to low execution unit utilization and low throughput. Coupled with the fact that the CPU is clocked much faster than the GPU, the CPU implementation was able to outperform the GPU implementation.

We emphasize that it is possible to further improve the decoding throughput on the GPU. For example, we can reduce the number of memory accesses via data compression, which reduces the time for which execution units wait for operands.

Nevertheless, the CPU implementation seems to be better suited for SDR applications, since we achieve higher throughput with 2× fewer parallel codewords, which significantly reduces the latency of the entire system.

Figure 5.4: FER performance of (a) log-MAP decoding and (b) max-log-MAP decoding for K = 6144 and 6 decoding iterations on the CPU and GPU.
