Singing voice enhancement for monaural music recordings with a cascade two-stage algorithm

Communication on Applied Mathematics and Computation, Vol. 32, No. 3, Sept. 2018
DOI 10.3969/j.issn.1006-6330.2018.03.007
Article ID 1006-6330(2018)03-0497-12; Document code A

YU Shiwei, ZHANG Hongjuan
(College of Sciences, Shanghai University, Shanghai 200444, China)

Abstract In this paper, taking into account that the singing voice is neither a purely harmonic nor a purely percussive sound, we propose a cascade method for monaural singing voice enhancement. Under this framework, robust principal component analysis (RPCA) is first applied to decompose the spectrogram of the music mixture into a sparse singing voice part and a low-rank background music part. Under this strong assumption, some percussive components of the background music (e.g., the bass drum) are prone to being incorrectly assigned to the vocal part, although these percussive components are inherently more repetitive than the singing voice. Therefore, the REPET technique is then applied to extract them, leaving a purer singing voice. Evaluations on the public MIR-1K dataset show that the proposed method improves separation performance compared with three state-of-the-art methods.

Key words singing voice enhancement; robust principal component analysis; repetition pattern extraction
2010 Mathematics Subject Classification 93A30

Received 2016-08-06; revised 2017-05-15. Funding: project No. 11501351. E-mail: zhanghongjuan@shu.edu.cn

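Chinese Library Classification TN 912.35

0 Introduction

Separating the singing voice from a music recording supports applications such as melody extraction [1], synchronization of lyrics with audio [2] and artist classification [3].

Several approaches to monaural singing voice separation have been proposed. Rafii and Pardo introduced the repetition pattern extraction technique (REPET), which separates the repeating musical background from the non-repeating voice [4-5]. Liutkus et al. extended this idea with adaptive filtering [6]. In [7], Huang et al. applied RPCA, treating the accompaniment as a low-rank component and the voice as a sparse component; sparse and low-rank decompositions for this task are examined further in [8]. In [9], Yang modelled both the singing voice and the accompaniment with low-rank representations over learned dictionaries. Tachibana et al. [10] enhanced the singing voice with a two-stage harmonic/percussive sound separation (HPSS) [11]. In [12], FitzGerald and Gainza combined median filtering [13], an HPSS technique, with factorization methods. In [14], Zhu et al. used a multi-stage non-negative matrix factorization (NMF).

With RPCA alone, part of the percussive accompaniment tends to remain in the sparse (vocal) component. This paper therefore applies REPET to the output of RPCA in order to remove the remaining repeating components from the estimated voice.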

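The proposed method is evaluated on the MIR-1K dataset.

1 Proposed method

A music spectrogram can be regarded as the sum of three components [15]: a harmonic part H (pitched instruments), a percussive part P (e.g., drums) and a vocal part V (the singing voice). The mixture is therefore modelled as

\[ X = H + P + V, \tag{1} \]

where X is the spectrogram of the mixture and H, P and V are the harmonic, percussive and vocal spectrograms, respectively.

1.1 Robust principal component analysis (RPCA)

RPCA decomposes the spectrogram X into a low-rank matrix A and a sparse matrix E by solving

\[ \min_{A,E} \; \|A\|_* + \lambda \|E\|_1, \quad \text{s.t. } X = A + E, \tag{2} \]

where \lambda > 0 is a trade-off parameter between the low-rank matrix A and the sparse matrix E. \|A\|_* is the nuclear norm of A (i.e., the sum of its singular values), which serves as a convex surrogate for the rank of A [16], and \|\cdot\|_1 is the L_1 norm, the sum of the absolute values of the matrix entries.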

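Problem (2) can be solved with the alternating direction method of multipliers (ADMM) [17]. The augmented Lagrangian function is

\[ L(A, E, Y, \mu) = \|A\|_* + \lambda \|E\|_1 + \langle Y, X - A - E \rangle + \frac{\mu}{2} \|X - A - E\|_F^2, \tag{3} \]

where Y is the Lagrange multiplier and \mu > 0 is a penalty parameter. The procedure is summarized in Algorithm 1.

Algorithm 1 ADMM for RPCA
Input: X, \lambda.
Output: A, E.
(i) Initialize: Y_0 = X / J(X), E_0 = 0, \mu_0 > 0, \rho > 1, t = 0.
(ii) U \Sigma V^T = \mathrm{svd}(X - E_t + \frac{1}{\mu_t} Y_t); A_{t+1} = U S_{1/\mu_t}(\Sigma) V^T;
(iii) E_{t+1} = S_{\lambda/\mu_t}(X - A_{t+1} + \frac{1}{\mu_t} Y_t);
(iv) Y_{t+1} = Y_t + \mu_t (X - A_{t+1} - E_{t+1});
(v) \mu_{t+1} = \rho \mu_t;
(vi) t := t + 1.
Steps (ii)-(vi) are repeated until convergence.

In Algorithm 1, \rho controls the growth of the penalty parameter \mu_t, S_\tau(\cdot) is the soft-thresholding (shrinkage) operator with threshold \tau, U and V are the matrices of singular vectors from the singular value decomposition (SVD), and \Sigma is the diagonal matrix of singular values.

As Fig. 1(c) shows, the sparse part obtained by RPCA contains not only the singing voice but also percussive components of the accompaniment [8]. RPCA alone therefore cannot fully separate the voice from the accompaniment.

Fig. 1 (a) Spectrogram of the MIR-1K [18] clip Ani-1-01 mixed at 0 dB; (b), (c) the two parts obtained by RPCA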
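The iteration in Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The function names, the stopping rule, the default `rho=1.5`, the choice `mu_0 = 1.25 / ||X||_2` and the definition `J(X) = max(||X||_2, ||X||_inf / lambda)` are assumptions borrowed from common inexact augmented Lagrangian code; the paper only requires `mu_0 > 0` and `rho > 1`.

```python
import numpy as np

def soft_threshold(Z, tau):
    """Element-wise shrinkage operator S_tau(Z)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def rpca_admm(X, lam=None, mu=None, rho=1.5, tol=1e-7, max_iter=500):
    """Split a nonzero matrix X into low-rank A and sparse E (sketch of Algorithm 1)."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))           # common default for lambda
    spec_norm = np.linalg.norm(X, 2)             # largest singular value of X
    J = max(spec_norm, np.abs(X).max() / lam)    # assumed form of J(X)
    Y = X / J                                    # step (i): Y_0 = X / J(X)
    if mu is None:
        mu = 1.25 / spec_norm                    # assumed mu_0 > 0
    E = np.zeros_like(X)
    x_norm = np.linalg.norm(X, "fro")
    for _ in range(max_iter):
        # step (ii): singular value thresholding gives the low-rank part
        U, s, Vt = np.linalg.svd(X - E + Y / mu, full_matrices=False)
        A = (U * soft_threshold(s, 1.0 / mu)) @ Vt
        # step (iii): entry-wise shrinkage gives the sparse part
        E = soft_threshold(X - A + Y / mu, lam / mu)
        # steps (iv)-(v): multiplier and penalty updates
        R = X - A - E
        Y = Y + mu * R
        mu = rho * mu
        if np.linalg.norm(R, "fro") <= tol * x_norm:
            break
    return A, E
```

For singing voice separation, `X` would be the magnitude spectrogram of the mixture, with `A` taken as the accompaniment estimate and `E` as the initial voice estimate.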

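RPCA alone is therefore not sufficient, and the proposed algorithm applies the REPET technique after RPCA to remove these residual components.

1.2 Repetition pattern extraction technique (REPET)

The REPET method of Rafii and Pardo [4-5] consists of three steps: estimating the repeating period, building the repeating segment model, and extracting the repeating structure.

(i) Repeating period. The mixture signal x is transformed by the short-time Fourier transform (STFT) into the spectrogram X, and the magnitude spectrogram V is taken from X. The autocorrelation of each row of V^2 (the element-wise square of V) gives the matrix B,

\[ B(i, j) = \frac{1}{m - j + 1} \sum_{k=1}^{m-j+1} V^2(i, k) \, V^2(i, k + j - 1). \tag{4} \]

Averaging B over its rows and normalizing by the first term gives the beat spectrum b,

\[ b(j) = \frac{1}{n} \sum_{i=1}^{n} B(i, j), \qquad b(j) \leftarrow \frac{b(j)}{b(1)}, \tag{5} \]

where i = 1, 2, \ldots, n (frequency), n = N/2 + 1, and j = 1, 2, \ldots, m (time lag).

(ii) Repeating segment model. The repeating period p is estimated from the beat spectrum b, and the spectrogram V is divided into r segments of length p. The repeating segment model S is the element-wise median of these r segments,

\[ S(i, l) = \operatorname*{median}_{k = 1, \ldots, r} \{ V(i, l + (k - 1)p) \}, \tag{6} \]

where i = 1, 2, \ldots, n (frequency), l = 1, 2, \ldots, p (time), k = 1, 2, \ldots, r, and p is the period length. Because the non-repeating voice varies from segment to segment while the repeating background does not, the median retains the repeating pattern and discards the non-repeating part.

(iii) Repeating structure. The repeating spectrogram W should not exceed the mixture spectrogram V in any time-frequency bin, i.e., W \le V element-wise. W is therefore obtained from S and V as

\[ W(i, l + (k - 1)p) = \min\{ S(i, l), \; V(i, l + (k - 1)p) \}. \tag{7} \]

Normalizing W by V gives the soft time-frequency mask M,

\[ M(i, j) = \frac{W(i, j)}{V(i, j)}, \qquad M(i, j) \in [0, 1]. \tag{8} \]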
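The three REPET steps in Eqs. (4)-(8) can be sketched in NumPy as follows. This is an illustrative sketch, not the reference implementation: the function names are hypothetical, the period `p` is passed in by the caller rather than picked automatically from the beat spectrum, and the last (possibly incomplete) segment is handled by padding with NaN and using `np.nanmedian`, which is a design choice made here.

```python
import numpy as np

def beat_spectrum(V):
    """Beat spectrum b of a magnitude spectrogram V (n bins x m frames), Eqs. (4)-(5)."""
    n, m = V.shape
    P = V ** 2
    B = np.empty((n, m))
    for j in range(m):  # j is the 0-based lag, i.e. lag j + 1 in Eq. (4)
        B[:, j] = np.sum(P[:, : m - j] * P[:, j:], axis=1) / (m - j)
    b = B.mean(axis=0)  # average over frequency bins
    return b / b[0]     # normalize by the zero-lag term

def repet_mask(V, p, eps=1e-12):
    """Soft mask M of the repeating structure for a given period p, Eqs. (6)-(8)."""
    n, m = V.shape
    r = int(np.ceil(m / p))                       # number of segments
    padded = np.full((n, r * p), np.nan)
    padded[:, :m] = V
    segments = padded.reshape(n, r, p)            # segments[i, k, l] = V(i, l + k p)
    S = np.nanmedian(segments, axis=1)            # Eq. (6): median over segments
    W = np.minimum(np.tile(S, (1, r))[:, :m], V)  # Eq. (7): W cannot exceed V
    M = W / np.maximum(V, eps)                    # Eq. (8), guarded against 0 / 0
    return np.clip(M, 0.0, 1.0)
```

Given the complex STFT `X`, the repeating background would then be `M * X` and the non-repeating part `(1 - M) * X`.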

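Here i = 1, 2, \ldots, n (frequency) and l = 1, 2, \ldots, p (time). Entries of M are close to 1 in bins dominated by the repeating structure W and close to 0 otherwise. Finally, the soft mask M is applied to the mixture spectrogram X to obtain the music spectrogram X_m and the voice spectrogram X_v,

\[ X_m(i, j) = M(i, j) \, X(i, j), \tag{9} \]
\[ X_v(i, j) = (1 - M(i, j)) \, X(i, j), \tag{10} \]

where i = 1, 2, \ldots, N and j = 1, 2, \ldots, m.

1.3 The proposed cascade algorithm (CA)

The flow of the proposed algorithm is shown in Fig. 2. First, RPCA decomposes the mixture spectrogram X into a low-rank part X_H and a sparse part X_V^1. The sparse part obtained by RPCA contains not only the singing voice but also percussive components of the accompaniment [8], so RPCA alone is not sufficient. Since these residual components are more repetitive than the singing voice, REPET is then applied to the sparse part to separate it into a repeating part X_P and a vocal part X_V. The repeating part X_P is added to the low-rank part X_H to form the accompaniment X_M,

X_M = X_P + X_H.

Fig. 2 Flow chart of the proposed cascade algorithm

The estimates X_{vocal} and X_{music} are then obtained by Wiener-type filtering,

\[ X_{vocal} = \frac{X_V}{X_V + X_M} \odot X, \tag{11} \]
\[ X_{music} = \frac{X_M}{X_V + X_M} \odot X, \tag{12} \]

where X is the spectrogram of the mixture, the division is element-wise and \odot denotes element-wise multiplication.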
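The recombination step of Eqs. (11)-(12) can be illustrated as below. This sketch assumes that the three components `X_H`, `X_P` and `X_V` have already been produced by the RPCA and REPET stages, and it uses their magnitudes to build the ratio masks; the use of magnitudes, the function name and the small constant `eps` are assumptions made for this example, since the extracted equations do not show whether magnitudes or powers are intended.

```python
import numpy as np

def cascade_wiener(X, X_H, X_P, X_V, eps=1e-12):
    """Wiener-type recombination of the cascade outputs, Eqs. (11)-(12).

    X   : complex STFT of the mixture
    X_H : low-rank part from RPCA
    X_P : repeating part that REPET extracts from the sparse RPCA output
    X_V : remaining vocal part after REPET
    """
    mag_V = np.abs(X_V)
    mag_M = np.abs(X_P + X_H)                 # accompaniment X_M = X_P + X_H
    denom = np.maximum(mag_V + mag_M, eps)    # guard against division by zero
    X_vocal = (mag_V / denom) * X             # Eq. (11)
    X_music = (mag_M / denom) * X             # Eq. (12)
    return X_vocal, X_music
```

Because the two masks sum to one wherever the denominator is nonzero, `X_vocal + X_music` reproduces the mixture `X`, so no energy of the input is discarded by this step.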

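A further Wiener-type soft masking step is then applied. The masks M_V and M_B for the voice and the background music are defined from X_{vocal} and X_{music} as

\[ M_V = \frac{X_{vocal}}{X_{vocal} + X_{music}}, \tag{13} \]
\[ M_B = \frac{X_{music}}{X_{vocal} + X_{music}}, \tag{14} \]

and the final spectrograms are

\[ X_{vocal} = M_V \odot X, \tag{15} \]
\[ X_{music} = M_B \odot X. \tag{16} \]

Finally, the inverse STFT (ISTFT) converts the spectrograms back to time-domain signals.

2 Experiments

2.1 Dataset

The experiments use the MIR-1K dataset compiled by Hsu and Jang [18]. It contains 1 000 song clips sampled at 16 kHz, with durations from 4 to 13 s. The clips are taken from karaoke recordings, in which the singing voice and the accompaniment are available separately. For each clip, the voice and the accompaniment are mixed at three signal-to-noise ratios: -5 dB (accompaniment louder), 0 dB (equal level) and +5 dB (voice louder). Each of the three conditions therefore contains 1 000 mixtures.

2.2 Evaluation

Separation quality is measured with the BSS EVAL toolbox v2.1, using the source to distortion ratio (SDR), the source to interference ratio (SIR) and the source to artifacts ratio (SAR). In addition, the normalized SDR (NSDR) is used,

\[ \mathrm{NSDR}(\hat{v}; v; x) = \mathrm{SDR}(\hat{v}; v) - \mathrm{SDR}(x; v), \tag{17} \]

where \hat{v} is the estimated voice, v is the original clean voice and x is the mixture. NSDR measures the improvement in SDR of the separated voice \hat{v} over the unprocessed mixture x.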
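Equation (17) can be illustrated with a short NumPy example. Note the assumption: the `sdr` function below is a simplified energy-ratio SDR, 10 log10(||v||^2 / ||v_hat - v||^2), written only for this illustration. The BSS EVAL toolbox used in the paper computes SDR through a projection-based decomposition of the estimate, so its values will generally differ.

```python
import numpy as np

def sdr(estimate, reference):
    """Simplified SDR in dB (assumption: plain energy ratio, not BSS EVAL)."""
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

def nsdr(v_hat, v, x):
    """Normalized SDR of Eq. (17): SDR gain of v_hat over the raw mixture x."""
    return sdr(v_hat, v) - sdr(x, v)

# Toy signals: a clean "voice", a mixture, and an estimate whose residual
# interference is ten times smaller in amplitude than in the mixture.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
voice = np.sin(2 * np.pi * 220.0 * t)
interference = 0.5 * np.sin(2 * np.pi * 90.0 * t)
mixture = voice + interference
estimate = voice + 0.1 * interference
gain_db = nsdr(estimate, voice, mixture)
```

Reducing the interference amplitude by a factor of ten lowers its energy by a factor of 100, so `gain_db` is 20 dB under this simplified definition.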

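The overall performance on the dataset is measured by the global NSDR (GNSDR), the length-weighted average of the NSDR values,

\[ \mathrm{GNSDR} = \frac{\sum_{i=1}^{N} w_i \, \mathrm{NSDR}(\hat{v}_i, v_i, x_i)}{\sum_{i=1}^{N} w_i}, \tag{18} \]

where w_i is the length of the i-th clip and N is the number of clips. Higher values of SDR, SAR, SIR, NSDR and GNSDR indicate better separation.

2.3 Compared methods

The following methods are compared.
REPET: the method proposed by Rafii and Pardo [4].
RPCA: the method proposed by Huang et al. [7].
MLRR: the method proposed by Yang [9].
CA: the cascade algorithm proposed in this paper.

For all methods the spectrogram is computed with the STFT, using a window length of 64 ms, a 1 024-point FFT and a frame parameter of 25%. RPCA uses \lambda = 1/\sqrt{\max(m, n)}, and CA likewise uses \lambda = 1/\sqrt{\max(m, n)} in its RPCA stage.

2.4 Results

Fig. 3 shows the GNSDR of the four methods at -5, 0 and +5 dB. At -5 dB and 0 dB the proposed method performs best (highest GNSDR). At +5 dB its GNSDR is compared against MLRR, REPET and RPCA in Fig. 3.

Fig. 4 shows the SDR, SIR and SAR of the four methods at -5, 0 and +5 dB, where the methods are labelled RPCA (P), REPET (R), MLRR (M) and the proposed algorithm (CA). At -5 dB and 0 dB, CA obtains the highest SDR and SIR, but a lower SAR. At +5 dB, CA obtains the highest SIR and the lowest SAR.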

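Fig. 3 GNSDR of RPCA, REPET, MLRR and the proposed cascade algorithm (CA) at -5, 0 and +5 dB on the MIR-1K dataset

Fig. 4 SDR, SIR and SAR of RPCA (P), REPET (R), MLRR (M) and the proposed cascade algorithm (CA) at -5, 0 and +5 dB on the MIR-1K dataset

At -5 dB and 0 dB, CA performs better than MLRR. Compared with RPCA and REPET, CA removes more of the interfering accompaniment (higher SIR), at the cost of more artifacts (lower SAR).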

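Figs. 5 and 6 show, for the 0 dB mixture, spectrograms of the separation results obtained by RPCA, REPET, MLRR and the proposed algorithm CA. Compared with RPCA and MLRR, the result of CA contains less residual accompaniment, and compared with RPCA and REPET it is closer to the clean reference.

3 Conclusion

This paper proposes a cascade algorithm for monaural singing voice enhancement that applies RPCA followed by REPET, so that repeating percussive components left in the sparse part by RPCA are removed from the estimated voice. Experiments on the 1 000 clips of the MIR-1K dataset show that the proposed method improves separation performance compared with the other methods.

Fig. 5 Spectrograms for the MIR-1K clip Ani-1-01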

Fig. 6 Spectrograms for the MIR-1K clip Ani-1-01

References

[1] Han J, Chen C W. Improving melody extraction using probabilistic latent component analysis [C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2011: 33-36.
[2] Fujihara H, Goto M, Ogata J, Komatani K, Ogata T, Okuno H G. Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals [C]//Proceedings of International Symposium on Multimedia, 2006: 257-264.
[3] Berenzweig A, Ellis D P W, Lawrence S. Using voice segments to improve artist classification of music [C]//Proceedings of AES 22nd International Conference: Virtual, Synthetic, and Entertainment Audio, 2002: 1-8.
[4] Rafii Z, Pardo B. Repeating pattern extraction technique (REPET): a simple method for music/voice separation [J]. IEEE Transactions on Audio, Speech and Language Processing, 2013, 21(1): 73-84.

[5] Rafii Z, Pardo B. A simple music/voice separation method based on the extraction of the repeating musical structure [C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2011: 221-224.
[6] Liutkus A, Rafii Z, Badeau R, Pardo B, Richard G. Adaptive filtering for music/voice separation exploiting the repeating musical structure [C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2012: 53-56.
[7] Huang P S, Chen S D, Smaragdis P, Johnson M H. Singing voice separation from monaural recordings using robust principal component analysis [C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2012: 57-60.
[8] Yang Y H. On sparse and low-rank matrix decomposition for singing voice separation [C]//Proceedings of ACM International Conference on Multimedia, 2012: 757-760.
[9] Yang Y H. Low-rank representation of both singing voice and music accompaniment via learned dictionaries [C]//Proceedings of International Society for Music Information Retrieval Conference, 2013: 427-432.
[10] Tachibana H, Ono N, Sagayama S. Singing voice enhancement in monaural music signals based on two-stage harmonic/percussive sound separation on multiple resolution spectrograms [J]. IEEE Transactions on Audio, Speech and Language Processing, 2014, 22(1): 228-237.
[11] Ono N, Miyamoto K, Roux J L, Kameoka H, Sagayama S. Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram [C]//Proceedings of European Signal Processing Conference, 2008: 1-4.
[12] FitzGerald D, Gainza M. Single channel vocal separation using median filtering and factorisation techniques [J]. ISAST Transactions on Electronic and Signal Processing, 2010, 4(1): 62-73.
[13] FitzGerald D. Harmonic/percussive separation using median filtering [C]//Proceedings of International Conference on Digital Audio Effects (DAFx-10), 2010.
[14] Zhu B, Li W, Li R, Xue X. Multi-stage non-negative matrix factorization for monaural singing voice separation [J]. IEEE Transactions on Audio, Speech and Language Processing, 2013, 21(10): 2096-2107.
[15] Ikemiya Y, Yoshii K, Itoyama K. Singing voice analysis and editing based on mutually dependent F0 estimation and source separation [C]//Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2015: 574-578.
[16] Candès E J, Li X, Ma Y, Wright J. Robust principal component analysis? [J]. Journal of the ACM, 2011, 58(3): 1-37.
[17] Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers [J]. Foundations and Trends in Machine Learning, 2011, 3(1): 1-122.
[18] Hsu C L, Jang J S R. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset [J]. IEEE Transactions on Audio, Speech and Language Processing, 2010, 18(2): 310-319.