Efficient Inverse Cholesky Factorization for Alamouti Matrices in G-STBC and Alamouti-like Matrices in OMP


Hufei Zhu, Ganghua Yang
Communications Technology Laboratory, Huawei Technologies Co. Ltd, P. R. China
Email: zhuhufei@huawei.com, yangyangganghua@huawei.com

Wen Chen
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, P. R. China
Email: wenchen@sjtu.edu.cn

Abstract—We apply the efficient inverse Cholesky factorization to Alamouti matrices in Groupwise Space-Time Block Code (G-STBC) and to Alamouti-like matrices in Orthogonal Matching Pursuit (OMP) for the sub-Nyquist sampling system. By utilizing some good properties of Alamouti and Alamouti-like matrices, we save about half the complexity. We then propose complete square-root algorithms for G-STBC and OMP, respectively. The proposed square-root G-STBC algorithm achieves an average speedup of 2.96 to 3.6 with respect to the sub-optimal G-STBC algorithm. On the other hand, when comparing the complexities of all the steps except the projection step, the complexity of the proposed square-root OMP algorithm is about 30% of that of the fast OMP algorithm by matrix inverse update.

I. INTRODUCTION

The equivalent channel matrix in Group-wise space-time block code (G-STBC) is an Alamouti matrix [1], [3], i.e., a block matrix consisting of $2 \times 2$ Alamouti sub-blocks $\begin{bmatrix} c_1 & c_2 \\ -c_2^* & c_1^* \end{bmatrix}$ (where $c_1$ and $c_2$ are complex scalars). Alamouti matrices have some good properties [3], which have been utilized to develop efficient matrix manipulations [1] and a fast recursive G-STBC algorithm [2]. On the other hand, in the sub-Nyquist sampling system [4], the Orthogonal Matching Pursuit (OMP) algorithm [5]–[7] computes the inverse of Alamouti-like matrices, i.e., block matrices consisting of $2 \times 2$ Alamouti-like sub-blocks $\begin{bmatrix} c_1 & c_2 \\ c_2^* & c_1^* \end{bmatrix}$ (obtained by deleting the negative sign in the Alamouti sub-block). Similar to Alamouti matrices, Alamouti-like matrices also have some good properties that can be utilized to reduce the computational complexity, as shown in this paper.

In this paper, the efficient inverse Cholesky factorization we proposed recently [8] is applied to Alamouti matrices in G-STBC and to Alamouti-like matrices in OMP. We exploit the properties of Alamouti and Alamouti-like matrices to reduce the complexity dramatically, and then develop efficient square-root algorithms for G-STBC and OMP, respectively.

In the following sections, $(\cdot)^T$, $(\cdot)^*$ and $(\cdot)^H$ denote matrix transposition, conjugation, and conjugate transposition, respectively. $\mathbf{I}_M$ is the identity matrix of size $M$, and $\mathbf{0}_M$ is the zero column vector of length $M$.

II. SYSTEM MODELS

In this section, we introduce the system model for G-STBC and that for OMP in the sub-Nyquist sampling system [4].

A. System Model for G-STBC

The considered G-STBC system [2] consists of $2M$ transmit antennas and $N$ ($\geq M$) receive antennas in a rich-scattering, flat-fading wireless channel. The transmit side de-multiplexes a single data stream $a_{11}, a_{12}, \ldots, a_{M1}, a_{M2}$ into $M$ layers. Each layer $a_{m1}, a_{m2}$ ($m = 1, 2, \ldots, M$) is encoded by an independent Alamouti encoder and transmitted from its respective 2 transmit antennas in 2 time slots. Let $h_{n,i}$ ($n = 1, 2, \ldots, N$ and $i = 1, 2, \ldots, 2M$) denote the channel coefficient from transmit antenna $i$ to receive antenna $n$, and let $x_{n1}$ and $x_{n2}$ ($n = 1, 2, \ldots, N$) denote the signals received at antenna $n$ in the 1st and 2nd time slots, respectively. Then the equivalent channel model [2] for G-STBC can be represented as

$$\mathbf{x} = \mathbf{H}\mathbf{a} + \mathbf{w}, \qquad (1)$$

where $\mathbf{w}$ is the complex Gaussian noise vector with statistically independent entries, the equivalent receive signal vector is

$$\mathbf{x} = \begin{bmatrix} x_{11} & x_{12}^* & \cdots & x_{N1} & x_{N2}^* \end{bmatrix}^T, \qquad (2)$$

the equivalent transmit signal vector is

$$\mathbf{a} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{M1} & a_{M2} \end{bmatrix}^T, \qquad (3)$$

and the $2N \times 2M$ equivalent channel matrix is

$$\mathbf{H} = \begin{bmatrix}
h_{1,1} & h_{1,2} & \cdots & h_{1,2M-1} & h_{1,2M} \\
-h_{1,2}^* & h_{1,1}^* & \cdots & -h_{1,2M}^* & h_{1,2M-1}^* \\
\vdots & \vdots & & \vdots & \vdots \\
h_{N,1} & h_{N,2} & \cdots & h_{N,2M-1} & h_{N,2M} \\
-h_{N,2}^* & h_{N,1}^* & \cdots & -h_{N,2M}^* & h_{N,2M-1}^*
\end{bmatrix}. \qquad (4)$$

Assume $E(\mathbf{a}\mathbf{a}^H) = \sigma_a^2 \mathbf{I}_{2M}$ and $E(\mathbf{w}\mathbf{w}^H) = \sigma_w^2 \mathbf{I}_{2N}$. The minimum mean-square error (MMSE) estimate of $\mathbf{a}$ is [2]

$$\hat{\mathbf{a}} = \mathbf{R}^{-1}\mathbf{H}^H\mathbf{x}, \qquad (5)$$

where

$$\mathbf{R} = \mathbf{H}^H\mathbf{H} + (\sigma_w^2/\sigma_a^2)\,\mathbf{I}_{2M}. \qquad (6)$$
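As a quick numerical illustration of the model (1)–(6), the following NumPy sketch (our own illustration, with arbitrary sizes and a QPSK constellation as assumptions) builds the Alamouti-blocked equivalent channel, forms the MMSE estimate, and checks that $\mathbf{R}$ indeed inherits the Alamouti block structure:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4                        # M Alamouti layers, N receive antennas
sigma_a2, sigma_w2 = 1.0, 0.1

# Equivalent 2N x 2M channel of (4): each 2x2 sub-block is an Alamouti
# block [[h1, h2], [-h2*, h1*]].
h = (rng.normal(size=(N, 2 * M)) + 1j * rng.normal(size=(N, 2 * M))) / np.sqrt(2)
H = np.zeros((2 * N, 2 * M), dtype=complex)
for n in range(N):
    for m in range(M):
        h1, h2 = h[n, 2 * m], h[n, 2 * m + 1]
        H[2 * n:2 * n + 2, 2 * m:2 * m + 2] = [[h1, h2],
                                               [-h2.conj(), h1.conj()]]

# QPSK layers, noise, and the received vector x = H a + w as in (1)-(3).
a = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2), size=2 * M)
w = np.sqrt(sigma_w2 / 2) * (rng.normal(size=2 * N) + 1j * rng.normal(size=2 * N))
x = H @ a + w

# MMSE estimate (5)-(6): a_hat = R^{-1} H^H x with R = H^H H + (sw^2/sa^2) I.
R = H.conj().T @ H + (sigma_w2 / sigma_a2) * np.eye(2 * M)
a_hat = np.linalg.solve(R, H.conj().T @ x)

# R inherits the Alamouti block structure [[c1, c2], [-c2*, c1*]].
B = R[0:2, 2:4]
print(np.isclose(B[1, 1], B[0, 0].conj()), np.isclose(B[1, 0], -B[0, 1].conj()))
```

Here `np.linalg.solve` stands in for the explicit inverse in (5); the square-root algorithms developed below avoid forming $\mathbf{R}^{-1}$ altogether.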

The recursive G-STBC algorithm [2] detects the $M$ Alamouti codes iteratively with the optimal ordering. In each iteration, the undetected Alamouti code with the highest signal-to-noise ratio (SNR) is detected by (5), and then its effect is cancelled.

B. System Model for OMP in the Sub-Nyquist Sampling System

Compressive sensing (CS) theory [9] asserts that a signal vector $\mathbf{s} \in \mathbb{C}^N$ with at most $K$ nonzero components can be exactly recovered from a measurement vector $\mathbf{y} \in \mathbb{C}^M$ with $M \ll N$. The signal vector $\mathbf{s}$ is called $K$-sparse. The measurement vector is

$$\mathbf{y} = \mathbf{T}\mathbf{s}, \qquad (7)$$

where the full-rank $M \times N$ matrix $\mathbf{T} = [\mathbf{t}_1\ \mathbf{t}_2\ \cdots\ \mathbf{t}_N]$ is commonly called the dictionary, and its $i$th column $\mathbf{t}_i$ is commonly called an atom [6]. The signal recovery problem is to find the sparse solution of the underdetermined linear system (7), i.e., to represent $\mathbf{y}$ as a weighted sum of the fewest atoms. The signal recovery algorithms include OMP [5]–[7], a greedy algorithm that selects atoms from $\mathbf{T}$ iteratively to approximate $\mathbf{y}$ gradually. In each iteration, OMP selects the atom best matching the current residual (i.e., the approximation error), renews the weights of all the atoms selected so far (to obtain the least-squares approximation of $\mathbf{y}$), and then updates the residual accordingly.

In [4], the signal vector (i.e., $X(f)$ in [4, left line 8 on page 387]) is conjugate symmetric, and so is the dictionary. Thus the OMP algorithm is slightly modified to select a symmetric pair of atoms in each iteration [4, left line 6 on page 387]. Accordingly, we modify the OMP algorithm described in [6, left lines 27–34 on page 2] to summarize the OMP algorithm in [4]. We also follow these conventions in [6]: the set $\Gamma^n$ is a set containing $n$ indices, and the matrix $\mathbf{T}_{\Gamma^n}$ is the sub-matrix of $\mathbf{T}$ containing only those columns of $\mathbf{T}$ with indices in $\Gamma^n$. Finally, we summarize the OMP algorithm in [4] as follows.

The OMP Algorithm in [4]:
1) Initialise $\Gamma^0 = \emptyset$ and the residual $\tilde{\mathbf{y}}_0 = \mathbf{y}$.
2) for $k = 1, 2, \ldots, K/2$:
(2a) Projection: $\boldsymbol{\phi}_k = \mathbf{T}^H \tilde{\mathbf{y}}_{k-1}$.
(2b) $i_k = \arg\max_{i=1}^{N} \left|(\boldsymbol{\phi}_k)_i\right|$.
(2c) Select a symmetric pair of atoms by $\Gamma^{2k} = \Gamma^{2k-2} \cup \{i_k\} \cup \{N - i_k + 1\}$.
(2d) Renew the weights of the selected atoms:
$$\mathbf{s}_{2k} = \left(\mathbf{T}^H_{\Gamma^{2k}} \mathbf{T}_{\Gamma^{2k}}\right)^{-1} \mathbf{T}^H_{\Gamma^{2k}} \mathbf{y}. \qquad (8)$$
(2e) Update the residual by $\tilde{\mathbf{y}}_k = \mathbf{y} - \mathbf{T}_{\Gamma^{2k}} \mathbf{s}_{2k}$.
3) Output: $\tilde{\mathbf{y}}_{K/2}$ and $\mathbf{s}_{2 \cdot K/2} = \mathbf{s}_K$.

The symmetric-pair selection in step (2c) is justified as follows. The dictionary is $\mathbf{T} = \bar{\mathbf{S}}\mathbf{F}\mathbf{D}$ [4, equation (9)]. $\mathbf{F} = [\mathbf{F}_{-L_0}\ \cdots\ \mathbf{F}_{L_0}]$ [4, right line 9 on page 380] is conjugate symmetric since $\mathbf{F}_{-i} = \mathbf{F}_i^*$ [4, equation (8)]. $\bar{\mathbf{S}}$ is real, since $\bar{S}_{ik} = \alpha_{ik}$ [4, right line 22 on page 380] and $\alpha_{ik}$ is real [4, the line below equation (4)]. Moreover, $\mathbf{D} = \mathrm{diag}\left(d_{-L_0}, \ldots, d_{L_0}\right)$ [4, right line 23 on page 380] has $d_{-i} = d_i^*$ [4, equation (6)]. Thus we can verify that $\mathbf{T} = \bar{\mathbf{S}}\mathbf{F}\mathbf{D} = [\bar{\mathbf{S}}\mathbf{F}_{-L_0}d_{-L_0}\ \cdots\ \bar{\mathbf{S}}\mathbf{F}_{L_0}d_{L_0}]$ satisfies $\bar{\mathbf{S}}\mathbf{F}_{-i}d_{-i} = (\bar{\mathbf{S}}\mathbf{F}_i d_i)^*$, i.e., $\mathbf{T} = \bar{\mathbf{S}}\mathbf{F}\mathbf{D}$ is conjugate symmetric.

III. EFFICIENT INVERSE CHOLESKY FACTORIZATIONS FOR ALAMOUTI MATRICES AND ALAMOUTI-LIKE MATRICES

We need to compute the matrix inverses in (5) and (8). $\mathbf{R}$ in (5) is an Alamouti matrix [2]. On the other hand, in (8) let

$$\mathbf{R}_\Gamma = \mathbf{T}^H_\Gamma \mathbf{T}_\Gamma. \qquad (9)$$

Obviously, $\mathbf{R}_{\Gamma^{2k}} = [\mathbf{t}_{i_1}\ \mathbf{t}^*_{i_1}\ \cdots\ \mathbf{t}_{i_k}\ \mathbf{t}^*_{i_k}]^H [\mathbf{t}_{i_1}\ \mathbf{t}^*_{i_1}\ \cdots\ \mathbf{t}_{i_k}\ \mathbf{t}^*_{i_k}]$ is an Alamouti-like matrix, since each of its $2 \times 2$ sub-blocks

$$\begin{bmatrix} \mathbf{t}^H_{i_p}\mathbf{t}_{i_q} & \mathbf{t}^H_{i_p}\mathbf{t}^*_{i_q} \\ \mathbf{t}^T_{i_p}\mathbf{t}_{i_q} & \mathbf{t}^T_{i_p}\mathbf{t}^*_{i_q} \end{bmatrix}$$

is an Alamouti-like sub-block.

To reuse the already listed equations for brevity, let us use $(i)_\Gamma$ to express equation $(i)$ with the subscript $k$ of each matrix replaced by $\Gamma^k$ and with overlines added to all vectors and scalars. Then $\mathbf{R}_k$ (i.e., the sub-matrix formed by the first $k$ rows and columns of $\mathbf{R}$) and $\mathbf{R}_{\Gamma^k}$ can be represented as [2]

$$\mathbf{R}_k = \begin{bmatrix} \mathbf{R}_{k-2} & \mathbf{v}_{k-1} & \mathbf{v}_k \\ \mathbf{v}^H_{k-1} & \beta_{k-1} & r_{k-1,k} \\ \mathbf{v}^H_k & r^*_{k-1,k} & \beta_k \end{bmatrix} \qquad (10)$$

and $(10)_\Gamma$, respectively. The inverse Cholesky factor [8] of $\mathbf{R}_k$ is the upper-triangular $\mathbf{F}_k$ that satisfies

$$\mathbf{F}_k \mathbf{F}^H_k = \mathbf{R}^{-1}_k. \qquad (11)$$

We can simply apply equations (1) and (7) in [8] twice to compute $\mathbf{F}_k$ from $\mathbf{F}_{k-2}$. Then we have

$$\mathbf{F}_k = \begin{bmatrix} \mathbf{F}_{k-2} & \mathbf{u}_{k-1} & \mathbf{u}_k \\ \mathbf{0}^H_{k-2} & \lambda_{k-1} & f_{k-1,k} \\ \mathbf{0}^H_{k-2} & 0 & \lambda_k \end{bmatrix}, \qquad (12)$$

where we can employ equation (7) in [8] to compute

$$\lambda_{k-1} = 1\Big/\sqrt{\beta_{k-1} - \mathbf{v}^H_{k-1}\mathbf{F}_{k-2}\mathbf{F}^H_{k-2}\mathbf{v}_{k-1}}, \qquad (13a)$$
$$\mathbf{u}_{k-1} = -\lambda_{k-1}\mathbf{F}_{k-2}\mathbf{F}^H_{k-2}\mathbf{v}_{k-1}. \qquad (13b)$$

We could also employ equation (7) in [8] to compute the last column of $\mathbf{F}_k$. However, we will find some algorithms that are more efficient, to reduce the computational complexity.

A. Inverse Cholesky Factorization for Alamouti Matrices

$\mathbf{R}_k$ is an Alamouti matrix, and so is $\mathbf{R}^{-1}_k$ [3, Lemma 1 (Invariance Under Inversion)]. Then the $LDL^H$ factors of $\mathbf{R}^{-1}_k$ are Alamouti matrices [3, Lemma 3 (Invariance Under Triangular Factorization)], which include the unit lower triangular $\mathbf{L}$ and the diagonal $\mathbf{D} = \mathrm{diag}\left(d_1, d_2, \ldots, d_k\right)$. Accordingly $\mathbf{L}\sqrt{\mathbf{D}} = \mathbf{L}\,\mathrm{diag}\left(\sqrt{d_1}, \ldots, \sqrt{d_k}\right)$ is also an Alamouti matrix, which is the unique Cholesky factor of $\mathbf{R}^{-1}_k$ [10, Theorem 4.2.5 (Cholesky Factorization)] and is equivalent to $\mathbf{F}_k$ (since the permuted $\mathbf{F}_k$ is the Cholesky factor of the correspondingly permuted $\mathbf{R}^{-1}_k$ [8]). Thus $\mathbf{F}_k$ must be an Alamouti matrix. Correspondingly, in (12),

$$f_{k-1,k} = 0, \qquad (14a)$$
$$\lambda_k = \lambda_{k-1}, \qquad (14b)$$
$$\mathbf{u}_k = \mathrm{Alamouti}(\mathbf{u}_{k-1}), \qquad (14c)$$

where (14c) means that $\mathbf{u}_k$ is obtained from $\mathbf{u}_{k-1}$ directly, since $[\mathbf{u}_{k-1}\ \mathbf{u}_k]$ is an Alamouti matrix.

B. Inverse Cholesky Factorization for Alamouti-like Matrices

It is easy to verify that the sum, difference, or product of two Alamouti-like matrices is another Alamouti-like matrix. Then we can prove that the inverse of an Alamouti-like matrix is another Alamouti-like matrix, following the proof of [3, Lemma 1 (Invariance Under Inversion)].

Now let us consider $(12)_\Gamma$. We employ $(13b)_\Gamma$ to obtain

$$\begin{bmatrix} \bar{\mathbf{u}}_k \\ \bar{f}_{k-1,k} \end{bmatrix} = -\bar{\lambda}_k\, \bar{\mathbf{F}}_{\Gamma^{k-1}} \bar{\mathbf{F}}^H_{\Gamma^{k-1}} \begin{bmatrix} \bar{\mathbf{v}}_k \\ \bar{r}_{k-1,k} \end{bmatrix}, \qquad (15)$$

where $\bar{\mathbf{F}}_{\Gamma^{k-1}} = \begin{bmatrix} \bar{\mathbf{F}}_{\Gamma^{k-2}} & \bar{\mathbf{u}}_{k-1} \\ \mathbf{0}^H_{k-2} & \bar{\lambda}_{k-1} \end{bmatrix}$. From (15) we deduce

$$\bar{\mathbf{u}}_k = -\bar{\lambda}_k \left(\hat{\mathbf{u}}_{k-1} + \hat{f}_{k-1,k}\,\bar{\mathbf{u}}_{k-1}\right), \qquad (16a)$$
$$\bar{f}_{k-1,k} = -\bar{\lambda}_k \bar{\lambda}_{k-1} \hat{f}_{k-1,k}, \qquad (16b)$$

where

$$\hat{f}_{k-1,k} = \bar{\mathbf{u}}^H_{k-1}\bar{\mathbf{v}}_k + \bar{\lambda}_{k-1}\bar{r}_{k-1,k}, \qquad (17a)$$
$$\hat{\mathbf{u}}_{k-1} = \bar{\mathbf{F}}_{\Gamma^{k-2}}\bar{\mathbf{F}}^H_{\Gamma^{k-2}}\bar{\mathbf{v}}_k. \qquad (17b)$$

We can avoid computing (17b) and obtain $\hat{\mathbf{u}}_{k-1}$ directly from $\bar{\mathbf{F}}_{\Gamma^{k-2}}\bar{\mathbf{F}}^H_{\Gamma^{k-2}}\bar{\mathbf{v}}_{k-1} = -\bar{\mathbf{u}}_{k-1}/\bar{\lambda}_{k-1}$, since $\bar{\mathbf{F}}_{\Gamma^{k-2}}\bar{\mathbf{F}}^H_{\Gamma^{k-2}} = \bar{\mathbf{R}}^{-1}_{\Gamma^{k-2}}$ is an Alamouti-like matrix. It can be seen from $(12)_\Gamma$ that the Alamouti-like sub-block in the last two rows and columns of $\bar{\mathbf{F}}_{\Gamma^k}\bar{\mathbf{F}}^H_{\Gamma^k}$ is

$$\begin{bmatrix} \bar{\lambda}_{k-1} & \bar{f}_{k-1,k} \\ 0 & \bar{\lambda}_k \end{bmatrix}\begin{bmatrix} \bar{\lambda}_{k-1} & 0 \\ \bar{f}^*_{k-1,k} & \bar{\lambda}_k \end{bmatrix} = \begin{bmatrix} \bar{\lambda}^2_{k-1} + |\bar{f}_{k-1,k}|^2 & \bar{f}_{k-1,k}\bar{\lambda}_k \\ \bar{\lambda}_k \bar{f}^*_{k-1,k} & \bar{\lambda}^2_k \end{bmatrix},$$

from which we deduce

$$\bar{\lambda}^2_k = \bar{\lambda}^2_{k-1} + |\bar{f}_{k-1,k}|^2. \qquad (18)$$

Finally we substitute (16b) into (18) to obtain

$$\bar{\lambda}_k = 1\Big/\sqrt{\bar{\lambda}^{-2}_{k-1} - |\hat{f}_{k-1,k}|^2}. \qquad (19)$$

Now we only need to compute (17a), (19) and (16) to obtain the last column of $\bar{\mathbf{F}}_{\Gamma^k}$.

IV. EFFICIENT SQUARE-ROOT ALGORITHMS FOR G-STBC AND OMP IN THE SUB-NYQUIST SAMPLING SYSTEM

A. Efficient Square-Root Algorithm for G-STBC

We use $\mathbf{F}_{2m}$ (i.e., the equivalent Cholesky factor of $\mathbf{R}^{-1}_{2m}$) instead of $\mathbf{R}^{-1}_{2m}$ in the recursive G-STBC algorithm [2] to develop a square-root algorithm for G-STBC. Similar to the square-root BLAST algorithm [8], in the initialization phase we permute the channel matrix $\mathbf{H}$ into

$$\mathbf{H}_{2M} = \left[\mathbf{h}_{2t_1-1}\ \mathbf{h}_{2t_1}\ \cdots\ \mathbf{h}_{2t_M-1}\ \mathbf{h}_{2t_M}\right] \qquad (20)$$

with $t_M, t_{M-1}, \ldots, t_1$ being the assumed successive detection order of the $M$ Alamouti codes, while in the iterative detection phase we perform interference cancellation in

$$\mathbf{z}_{2M} = \mathbf{H}^H_{2M}\mathbf{x}, \qquad (21)$$

and we block-upper-triangularize $\mathbf{F}_{2m}$ by

$$\mathbf{F}_{2m}\boldsymbol{\Sigma} = \begin{bmatrix} \mathbf{F}_{2m-2} & \mathbf{u}_{2m-1} & \mathbf{u}_{2m} \\ \mathbf{0}^H_{2m-2} & f_{2m-1,2m-1} & f_{2m-1,2m} \\ \mathbf{0}^H_{2m-2} & f_{2m,2m-1} & f_{2m,2m} \end{bmatrix}, \qquad (22)$$

where $\boldsymbol{\Sigma}$ is unitary. From (11), (21) and (5) we obtain $\hat{\mathbf{a}}_{2m} = \mathbf{F}_{2m}\mathbf{F}^H_{2m}\mathbf{z}_{2m}$, i.e., $\hat{\mathbf{a}}_{2m} = (\mathbf{F}_{2m}\boldsymbol{\Sigma})(\mathbf{F}_{2m}\boldsymbol{\Sigma})^H\mathbf{z}_{2m}$, into which we substitute (22) to estimate the last two entries of $\hat{\mathbf{a}}_{2m}$ by

$$\begin{bmatrix} \hat{a}_{2m-1} \\ \hat{a}_{2m} \end{bmatrix} = \begin{bmatrix} f_{2m-1,2m-1} & f_{2m-1,2m} \\ f_{2m,2m-1} & f_{2m,2m} \end{bmatrix}\begin{bmatrix} \mathbf{u}^H_{2m-1} & f^*_{2m-1,2m-1} & f^*_{2m,2m-1} \\ \mathbf{u}^H_{2m} & f^*_{2m-1,2m} & f^*_{2m,2m} \end{bmatrix}\mathbf{z}_{2m}. \qquad (23)$$

The estimates $\hat{a}_{2m-1}$ and $\hat{a}_{2m}$ are quantized into $a_{2m-1}$ and $a_{2m}$, respectively, whose effect is then cancelled in $\mathbf{z}_{2m}$ by [2]

$$\mathbf{z}_{2m-2} = \mathbf{z}^{[2]}_{2m} - \left[\mathbf{v}_{2m-1}\ \mathbf{v}_{2m}\right]\begin{bmatrix} a_{2m-1} & a_{2m} \end{bmatrix}^T, \qquad (24)$$

where $\mathbf{z}^{[2]}_{2m}$ is $\mathbf{z}_{2m}$ with the last two entries removed, and $\mathbf{v}_{2m-1}$ and $\mathbf{v}_{2m}$ are defined in $\mathbf{R}_{2m}$, as shown in (10).

Table I summarizes the proposed square-root G-STBC algorithm. Notice that the transformation $\boldsymbol{\Sigma}$ in step (b) can be performed by the Givens rotations for Alamouti matrices, which include inner rotations and outer rotations [1]. Generally, in the two minimum-length rows (found in step (a)) that can be represented as rows $2k-1$ and $2k$, the entries in the first $2k-2$ columns are all zeros. Let $\boldsymbol{\Omega}^n_{i,j}$ denote a complex Givens rotation matrix that rotates the $i$th entry and the $j$th entry in each row of $\mathbf{F}_{2m}$ and zeros the $i$th entry in the $n$th row [8]. Then the transformation $\boldsymbol{\Sigma}$ in step (b) can be performed by

$$\boldsymbol{\Sigma} = \prod_{j=k}^{m}\boldsymbol{\Omega}_{2j-1,2j}\prod_{j=k}^{m-1}\left(\boldsymbol{\Omega}_{2j-1,2j+1}\boldsymbol{\Omega}_{2j,2j+2}\right), \qquad (25)$$

where $\boldsymbol{\Omega}_{2j-1,2j}$ is an inner rotation that zeros two entries (i.e., $f_{2k-1,2j}$ and $f_{2k,2j-1}$) in an Alamouti sub-block and rotates the other two entries (i.e., $f_{2k-1,2j-1}$ and $f_{2k,2j}$) into real values, and $\boldsymbol{\Omega}_{2j-1,2j+1}\boldsymbol{\Omega}_{2j,2j+2}$ is an outer rotation that zeros the remaining two real values in a previously inner-rotated Alamouti sub-block [1]. Since $\mathbf{F}_{2m}$ is still an Alamouti matrix after an inner or outer rotation, only half of the entries need to be computed [1].
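To make the recursion (12)–(13) and the invariance properties (14) concrete, here is a minimal NumPy sketch (our own illustration, not the paper's implementation; `inv_chol_append` is a hypothetical helper name). It grows the inverse Cholesky factor one column at a time and, for an Alamouti-structured $\mathbf{R}$, one can then observe $f_{k-1,k} = 0$ and $\lambda_k = \lambda_{k-1}$ numerically:

```python
import numpy as np

def inv_chol_append(F, v, beta):
    # Grow the inverse Cholesky factor by one column following (12)-(13):
    # given upper-triangular F with F F^H = R^{-1}, return F' with
    # F' F'^H = R'^{-1} for the bordered matrix R' = [[R, v], [v^H, beta]].
    p = F @ (F.conj().T @ v)                           # R^{-1} v
    lam = 1.0 / np.sqrt(beta - (v.conj() @ p).real)    # (13a)
    u = -lam * p                                       # (13b)
    k = F.shape[0]
    out = np.zeros((k + 1, k + 1), dtype=complex)
    out[:k, :k], out[:k, k], out[k, k] = F, u, lam
    return out

rng = np.random.default_rng(1)
M, N = 3, 4
# Hermitian Alamouti-structured R = H^H H + I, with H built from
# 2x2 Alamouti blocks [[h1, h2], [-h2*, h1*]] as in (4).
H = np.zeros((2 * N, 2 * M), dtype=complex)
for n in range(N):
    for m in range(M):
        h1, h2 = rng.normal(size=2) + 1j * rng.normal(size=2)
        H[2 * n:2 * n + 2, 2 * m:2 * m + 2] = [[h1, h2],
                                               [-h2.conj(), h1.conj()]]
R = H.conj().T @ H + np.eye(2 * M)

F = np.array([[1.0 / np.sqrt(R[0, 0].real)]], dtype=complex)
for k in range(1, 2 * M):
    F = inv_chol_append(F, R[:k, k], R[k, k].real)

print(np.allclose(F @ F.conj().T, np.linalg.inv(R)))   # True
```

In this sketch every column still costs a full application of (13); the saving claimed in Sections III-A and III-B comes from replacing the even-column computations by the cheap relations (14) or (16)–(19).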

TABLE I
THE PROPOSED SQUARE-ROOT ALGORITHM FOR G-STBC

Initialization:
1) Permute $\mathbf{H}$ into $\mathbf{H}_{2M}$ by (20), and permute $\mathbf{a}_{2M} = \mathbf{a}$ accordingly.
2) Compute $\mathbf{R}_{2M}$ and $\mathbf{z}_{2M}$ by (6) and (21), respectively.
3) Compute $\mathbf{F}_2 = (1/\sqrt{r_{1,1}})\mathbf{I}_2$. Then compute $\mathbf{F}_{2m}$ from $\mathbf{F}_{2m-2}$ iteratively by (13), (14) and (12) for $m = 2, 3, \ldots, M$, to obtain $\mathbf{F}_{2M}$. Set $m = M$.

Iterative Detection:
(a) Find the minimum-length odd row and even row of $\mathbf{F}_{2m}$, and permute them to the last odd row and even row, respectively. Correspondingly permute $\mathbf{a}_{2m}$, $\mathbf{z}_{2m}$, and the rows and columns of $\mathbf{R}_{2m}$.
(b) Block-upper-triangularize $\mathbf{F}_{2m}$ by (22) to obtain $\mathbf{F}_{2m-2}$.
(c) Compute $\hat{a}_{2m-1}$ and $\hat{a}_{2m}$ by (23), and quantize them into $a_{2m-1}$ and $a_{2m}$, respectively.
(d) Cancel the effect of $a_{2m-1}$ and $a_{2m}$ in $\mathbf{z}_{2m}$ by (24) to obtain $\mathbf{z}_{2m-2}$. Remove $a_{2m-1}$ and $a_{2m}$ from $\mathbf{a}_{2m}$ to obtain $\mathbf{a}_{2m-2}$. Also remove the last two rows and columns of $\mathbf{R}_{2m}$ to obtain $\mathbf{R}_{2m-2}$.
(e) If $m > 1$, let $m = m - 1$ and go back to step (a) with $\mathbf{F}_{2m-2}$, $\mathbf{R}_{2m-2}$, $\mathbf{z}_{2m-2}$ and $\mathbf{a}_{2m-2}$.

B. Efficient Square-Root Algorithm for OMP in the Sub-Nyquist Sampling System

From (8), (9) and $(11)_\Gamma$ we can deduce

$$\mathbf{s}_{2k} = \bar{\mathbf{F}}_{\Gamma^{2k}}\bar{\mathbf{F}}^H_{\Gamma^{2k}}\mathbf{T}^H_{\Gamma^{2k}}\mathbf{y}. \qquad (26)$$

In (26), let

$$\mathbf{y}_{\Gamma^{2k}} = \mathbf{T}^H_{\Gamma^{2k}}\mathbf{y}, \qquad (27)$$

which satisfies

$$\mathbf{y}_{\Gamma^{2k}} = \begin{bmatrix} \mathbf{y}^T_{\Gamma^{2k-2}} & \mathbf{t}^H_{\gamma_{2k-1}}\mathbf{y} & \mathbf{t}^H_{\gamma_{2k}}\mathbf{y} \end{bmatrix}^T, \qquad (28)$$

where $\mathbf{t}_{\gamma_i}$ is the $i$th column of $\mathbf{T}_{\Gamma^{2k}}$. Then we substitute (27) and $(12)_\Gamma$ into (26), to finally obtain

$$\mathbf{s}_{2k} = \begin{bmatrix} \mathbf{s}^T_{2k-2} & \mathbf{0}^T_2 \end{bmatrix}^T + \left[\mathbf{f}_{2k-1}\ \mathbf{f}_{2k}\right]\left[\mathbf{f}_{2k-1}\ \mathbf{f}_{2k}\right]^H \mathbf{y}_{\Gamma^{2k}}, \qquad (29)$$

where $\mathbf{f}_{2k-1}$ and $\mathbf{f}_{2k}$ are the $(2k-1)$th and $2k$th columns of $\bar{\mathbf{F}}_{\Gamma^{2k}}$, respectively. Moreover, from $(11)_\Gamma$ we deduce

$$\bar{\mathbf{F}}_{\Gamma^2} = \begin{bmatrix} 1/\sqrt{r_{1,1}} & -r_{1,2}\Big/\sqrt{r_{1,1}\left(r^2_{1,1} - |r_{1,2}|^2\right)} \\ 0 & \sqrt{r_{1,1}/\left(r^2_{1,1} - |r_{1,2}|^2\right)} \end{bmatrix}. \qquad (30)$$

To obtain the square-root algorithm for OMP, we only need to use step (2d') instead of step (2d) in the OMP algorithm in [4] (summarized in Section II-B). Step (2d') is as follows:
if $k = 1$: compute $\mathbf{y}_{\Gamma^2}$, $\bar{\mathbf{F}}_{\Gamma^2}$ and $\mathbf{s}_2$ by (27), (30) and (26) (where $\mathbf{T}^H_{\Gamma^2}\mathbf{y} = \mathbf{y}_{\Gamma^2}$), respectively;
else: compute $\bar{\mathbf{F}}_{\Gamma^{2k}}$ from $\bar{\mathbf{F}}_{\Gamma^{2k-2}}$ by $(13)_\Gamma$, (17), (19), (16) and $(12)_\Gamma$ (where (17b) in (17) requires no computations, as mentioned above), and then compute $\mathbf{y}_{\Gamma^{2k}}$ and $\mathbf{s}_{2k}$ by (28) and (29), respectively, from $\mathbf{y}_{\Gamma^{2k-2}}$ and $\mathbf{s}_{2k-2}$.
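To illustrate the weight-renewal idea behind (26)–(29), the following NumPy sketch runs a plain one-atom-per-iteration OMP whose least-squares step reuses an incrementally grown inverse Cholesky factor. It is our own simplified illustration under stated assumptions: the symmetric-pair selection of step (2c) and the Alamouti-like shortcuts of Section III-B are omitted, and the name `omp_inv_chol` is hypothetical.

```python
import numpy as np

def omp_inv_chol(T, y, K):
    # Plain OMP whose weight renewal reuses an incrementally grown inverse
    # Cholesky factor F with F F^H = (T_G^H T_G)^{-1} (cf. (26)-(29)),
    # instead of solving the least-squares problem from scratch each round.
    support, F = [], np.zeros((0, 0), dtype=complex)
    r = y.astype(complex)
    for _ in range(K):
        i = int(np.argmax(np.abs(T.conj().T @ r)))   # best-matching atom
        TG = T[:, support]
        v = TG.conj().T @ T[:, i]                    # new column of T_G^H T_G
        beta = np.vdot(T[:, i], T[:, i]).real
        p = F @ (F.conj().T @ v)                     # (T_G^H T_G)^{-1} v
        lam = 1.0 / np.sqrt(beta - np.vdot(v, p).real)
        F = np.block([[F, (-lam * p)[:, None]],
                      [np.zeros((1, len(support))), np.array([[lam]], complex)]])
        support.append(i)
        TG = T[:, support]
        s = F @ (F.conj().T @ (TG.conj().T @ y))     # renew the weights
        r = y - TG @ s                               # update the residual
    return support, s
```

By construction the renewed weights coincide with the least-squares solution over the current support, so the residual stays orthogonal to all selected atoms, which is exactly the property the fast OMP variants compared in Section V must preserve.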
V. COMPLEXITY EVALUATION

In the proposed inverse Cholesky factorization for Alamouti matrices and that for Alamouti-like matrices, the complexities of computing the even columns (i.e., (14), or (17a), (19) and (16)) are negligible with respect to those of computing the odd columns (i.e., (13) and $(13)_\Gamma$). Thus, compared to the inverse Cholesky factorization [8] applied directly to Alamouti or Alamouti-like matrices, the proposed inverse Cholesky factorizations save about half the complexity.

TABLE II
COMPLEXITY COMPARISON BETWEEN THE V-BLAST ALGORITHM IN [8] AND THE PROPOSED G-STBC ALGORITHM

Step        | The V-BLAST Alg. [8]                                      | The G-STBC Alg.
Compute R   | $(\frac{1}{2}M^2N)$                                       | $(2M^2N)$
Compute F   | $(\frac{1}{3}M^3)$                                        | $(\frac{4}{3}M^3)$
Givens Rot. | $(\frac{1}{3}M^3, \frac{1}{9}M^3)$ to $(0)$               | $(\frac{8}{9}M^3, \frac{2}{9}M^3)$ to $(0)$
Sum         | from $(\frac{2}{3}M^3 + \frac{1}{2}M^2N, \frac{4}{9}M^3 + \frac{1}{2}M^2N)$ to $(\frac{1}{3}M^3 + \frac{1}{2}M^2N)$ | from $(\frac{20}{9}M^3 + 2M^2N, \frac{14}{9}M^3 + 2M^2N)$ to $(\frac{4}{3}M^3 + 2M^2N)$

Let $(j, k)$ denote the complexity of $j$ complex multiplications and $k$ complex additions, which is abbreviated to $(j)$ when $j = k$. Table II compares the computational complexity of the proposed square-root algorithm for G-STBC with that of the square-root algorithm for the Vertical Bell Laboratories Layered Space-Time architecture (V-BLAST) in [8]. Notice that in the Hermitian Alamouti matrix $\mathbf{R}_{2M}$ and the upper-triangular Alamouti matrix $\mathbf{F}_{2m}$, only $1/4$ of the entries need to be computed. In (25), $\boldsymbol{\Omega}_{2j-1,2j}$ and $\boldsymbol{\Omega}_{2j-1,2j+1}\boldsymbol{\Omega}_{2j,2j+2}$ only rotate the non-zero entries in the first $2j$ and $2j+2$ rows, respectively. The complexities of a complex inner rotation and a real outer rotation are $(4, 2)$ and $(2, 0)$, respectively. Thus the average complexity of (25) is about $\left(\frac{8}{9}M^3, \frac{2}{9}M^3\right)$, where the probabilities of $k = 1, 2, \ldots, m$ are assumed to be equal. The complexity of (25) is zero when the assumed detection order is the same as the actual order [8].

When the square-root V-BLAST algorithm [8] is directly applied to the equivalent channel model (1), which can be regarded as a $2M \times 2N$ V-BLAST system, the required complexity ranges from $\left(\frac{16}{3}M^3 + 4M^2N, \frac{32}{9}M^3 + 4M^2N\right)$ to $\left(\frac{8}{3}M^3 + 4M^2N\right)$. Thus, when $M = N$, the average speedup in floating-point operations (flops) of the proposed square-root G-STBC algorithm over the square-root V-BLAST algorithm [8] directly applied to (1) ranges from 2 to 2.19. On the other hand, the complexity of the MMSE sub-optimal ordered successive interference cancellation (OSIC) detector for G-STBC using the sorted QR decomposition (SQRD) [11] is $\left(8M^3 + 4M^2N\right)$.

[Fig. 1 plots flops versus the number of receive antennas for: the square-root BLAST algorithm (min/max mean values), the proposed square-root G-STBC algorithm (min/max mean values), the recursive G-STBC algorithm, the square-root BLAST algorithm applied directly to G-STBC (min/max), and the sub-optimal OSIC SQRD G-STBC algorithm. Fig. 2 plots flops versus N, the dimension of the original signal, for: the OMP algorithm by matrix inversion, by QR decomposition, by matrix inverse update, the OMP algorithm applying the inverse Cholesky factorization directly, and the proposed square-root OMP algorithm.]

Fig. 1. Comparison of computational complexity among the sub-optimal OSIC G-STBC algorithm in [11], the OSIC G-STBC algorithm obtained by directly applying the square-root V-BLAST algorithm [8], the proposed square-root OSIC G-STBC algorithm, the recursive G-STBC algorithm [2], and the square-root V-BLAST algorithm [8, equation (19)].

When $M = N$, the average speedup (in flops) of the proposed OSIC G-STBC detector over the sub-optimal OSIC G-STBC detector [11] ranges from 2.96 to 3.6.

Assuming $N = M$, we carried out numerical experiments to count the average flops per time slot of the presented G-STBC and V-BLAST algorithms, for different numbers of receive antennas $N$. The considered G-STBC system includes $M$ Alamouti sub-streams and $2M$ transmitters, while the considered V-BLAST system includes $M$ sub-streams and $M$ transmitters. All results are shown in Fig. 1, and they are consistent with the theoretical flop counts.

On the other hand, in the sub-Nyquist sampling system [4], we simulate the flops of the proposed square-root OMP algorithm, the original OMP algorithm using direct matrix inversion [5], the original OMP algorithm with the direct matrix inversion replaced by the inverse Cholesky factorization in [8], the fast OMP algorithm based on QR decomposition [5], [6], and the fast OMP algorithm by matrix inverse update [6], [7]. Our simulations do not include the flops of the projection step (i.e., step (2a)), since they are the same for all the presented OMP algorithms. As in the simulations for Fig. 5a in [12], we set $K = N/2$ and $M = K(\log_2 N - 2)$, and give flops for $N = 2^i$ ($i = 4, 5, \ldots, 10$). The results are shown in Fig. 2. Our simulations also show that the flops of the proposed square-root OMP algorithm are only about 30% of those of the fast OMP algorithm by matrix inverse update [6], [7].

Fig. 2. Comparison of flops among the presented OMP algorithms.

VI. CONCLUSION

When applying the efficient inverse Cholesky factorization [8] to Alamouti matrices in G-STBC and to Alamouti-like matrices in OMP for the sub-Nyquist sampling system [4], we utilize some good properties of Alamouti and Alamouti-like matrices to save about half the complexity. We then propose complete square-root algorithms for G-STBC and for OMP in [4], respectively. The proposed square-root G-STBC algorithm achieves an average speedup of 2 to 2.19 with respect to the square-root V-BLAST algorithm [8] directly applied to the equivalent channel model for G-STBC, and an average speedup of 2.96 to 3.6 with respect to the sub-optimal OSIC G-STBC detector in [11]. On the other hand, when comparing the complexities of all the steps except the projection step, the complexity of the proposed square-root OMP algorithm is about 30% of that of the fast OMP algorithm by matrix inverse update [6], [7].

REFERENCES

[1] J. Eilert, D. Wu, D. Liu, D. Wang, N. Al-Dhahir and H. Minn, "Complexity reduction of matrix manipulation for multi-user STBC-MIMO decoding," IEEE Sarnoff Symposium, April 30 - May 2, 2007.
[2] H. Zhu, W. Chen, B. Li and F. Gao, "A fast recursive algorithm for G-STBC," IEEE Trans. on Communications, vol. 59, no. 8, 2011.
[3] A. H. Sayed, W. M. Younis and A. Tarighat, "An invariant matrix structure in multiantenna communications," IEEE Signal Processing Letters, vol. 12, pp. 749-752, Nov. 2005.
[4] M. Mishali and Y. C. Eldar, "From theory to practice: Sub-Nyquist sampling of sparse wideband analog signals," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 375-391, Apr. 2010.
[5] J. Tropp and A. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. on Information Theory, vol. 53, no. 12, pp. 4655-4666, December 2007.
[6] T. Blumensath and M. E. Davies, "In greedy pursuit of new directions: (Nearly) orthogonal matching pursuit by directional optimisation," EUSIPCO 2007, Poznan, Poland, 2007.
[7] Y. Fang, L. Chen, J. Wu and B. Huang, "GPU implementation of orthogonal matching pursuit for compressive sensing," ICPADS 2011.
[8] H. Zhu, W. Chen, B. Li and F. Gao, "An improved square-root algorithm for V-BLAST based on efficient inverse Cholesky factorization," IEEE Trans. on Wireless Communications, vol. 10, no. 1, pp. 43-48, 2011.
[9] D. Donoho, "Compressed sensing," IEEE Trans. on Information Theory, vol. 52, no. 4, pp. 1289-1306, April 2006.
[10] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press, 1996.
[11] M. Gomaa and A. Ghrayeb, "A low complexity MMSE detector for multiuser layered space-time coded MIMO systems," CCECE 2007, 22-26 Apr. 2007.
[12] S. Kunis and H. Rauhut, "Random sampling of sparse trigonometric polynomials II - Orthogonal matching pursuit versus basis pursuit," Preprint, 2006. Available online at http://dsp.rice.edu/cs.