$\ell_{2,1}$ Norm and its Applications

Yale Chang

1 Introduction

Depending on the structure of the constraints, sparsity can be obtained from three types of regularizers, each serving a different purpose.

1. Flat sparsity. This type of sparsity is often achieved by the $\ell_1$-norm regularizer. Optimization techniques include LARS, gradient search, and proximal methods.
2. Structural sparsity, including group feature detection, joint vector sparsity, hierarchical group features, etc. This sparsity is often obtained by the $\ell_2/\ell_1$-norm.
3. Matrix/tensor sparsity, such as matrix/tensor completion. The typical regularizer is the trace norm, which can be minimized by singular value thresholding.

2 Definition

Given a matrix $M \in \mathbb{R}^{n \times m}$, let $m^i$ denote its $i$-th row and $m_j$ its $j$-th column. The Frobenius norm of the matrix is defined as
\[
\|M\|_F = \sqrt{\sum_{i=1}^{n} \|m^i\|_2^2} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} M_{ij}^2} \tag{1}
\]
The $\ell_{2,1}$ norm of the matrix is defined as
\[
\|M\|_{2,1} = \sum_{i=1}^{n} \|m^i\|_2 = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} M_{ij}^2} \tag{2}
\]
The $\ell_{2,1}$ norm is rotationally invariant for rows: $\|MR\|_{2,1} = \|M\|_{2,1}$ for any rotation matrix $R$. The $\ell_{2,1}$ norm can be generalized to the $\ell_{r,p}$ norm [1]:
\[
\|M\|_{r,p} = \left( \sum_{i=1}^{n} \|m^i\|_r^p \right)^{1/p} = \left( \sum_{i=1}^{n} \left( \sum_{j=1}^{m} |M_{ij}|^r \right)^{p/r} \right)^{1/p} \tag{3}
\]
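To make the definitions concrete, here is a small NumPy sketch (the helper names are mine, not from any of the cited papers) that evaluates Eqs. (1)-(3) and numerically checks the rotational invariance of the $\ell_{2,1}$ norm.

```python
import numpy as np

def l21_norm(M):
    """l_{2,1} norm (Eq. 2): sum of the l2 norms of the rows of M."""
    return np.sum(np.linalg.norm(M, axis=1))

def lrp_norm(M, r, p):
    """Generalized l_{r,p} norm (Eq. 3): p-norm of the vector of row r-norms."""
    row_norms = np.sum(np.abs(M) ** r, axis=1) ** (1.0 / r)
    return np.sum(row_norms ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 4))

# The Frobenius norm (Eq. 1) coincides with the l_{2,2} norm.
assert np.isclose(np.linalg.norm(M, "fro"), lrp_norm(M, 2, 2))
# The l_{2,1} norm is the r = 2, p = 1 special case.
assert np.isclose(l21_norm(M), lrp_norm(M, 2, 1))
# Rotational invariance for rows: right-multiplying by an orthogonal R
# rotates each row without changing its l2 length, so Eq. (2) is unchanged.
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))
assert np.isclose(l21_norm(M @ R), l21_norm(M))
print(l21_norm(M))
```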
3 Applications

3.1 Rotational Invariant $L_1$-Norm PCA for Robust Subspace Factorization [2]

In the $R_1$-norm, distances along the spatial (attribute) dimensions are measured in $L_2$, while the summation over different data points uses $L_1$. Let $X = \{x_1, \dots, x_n\}$ be $n$ data points in $d$-dimensional space; in matrix form $X = (x_{ji})$, where index $j$ runs over spatial dimensions and index $i$ over data points. The $R_1$-norm is defined as
\[
\|X\|_{R_1} = \sum_{i=1}^{n} \left( \sum_{j=1}^{d} x_{ji}^2 \right)^{1/2} \tag{4}
\]
The formulation of standard PCA is
\[
\min_{U,V} J_{SVD} = \|X - UV\|_F^2 = \sum_{i=1}^{n} \|x_i - U v_i\|^2 \tag{5}
\]
The formulation of $R_1$-PCA is
\[
\min_{U,V} J_{R_1\text{-}PCA} = \|X - UV\|_{R_1} = \sum_{i=1}^{n} \|x_i - U v_i\| \tag{6}
\]
A common feature of previous approaches using the Frobenius norm and the $L_1$-norm is that they treat the two indexes $i$ and $j$ in the same way. However, these indexes have different meanings: $i$ runs over data points, while $j$ runs over spatial dimensions. In strict matrix notation this subtle distinction is easily lost; the $R_1$-norm captures it.

3.2 Robust ($L_{2,1}$) Feature Selection [1]

\[
\min_{W} J(W) = \|X^T W - Y\|_{2,1} + \gamma \|W\|_{2,1} \tag{7}
\]
where $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$, $W \in \mathbb{R}^{d \times c}$, $Y \in \mathbb{R}^{n \times c}$. The residual $\|W^T x_i - y_i\|_2$ is not squared, so outliers carry less weight than they would under the squared residual $\|W^T x_i - y_i\|_2^2$.

3.3 Robust Nonnegative Matrix Factorization [3]

The assumption of Gaussian noise leads to the formulation of standard NMF under the constraints $F \geq 0$, $G \geq 0$, while the assumption of Laplacian noise leads to the $L_{2,1}$-NMF formulation. The formulation of standard NMF is
\[
\min_{F \geq 0,\, G \geq 0} \|X - FG\|_F^2 \tag{8}
\]
where $X \in \mathbb{R}^{d \times n}$, $F \in \mathbb{R}^{d \times q}_{+}$, $G \in \mathbb{R}^{q \times n}_{+}$. The robust NMF ($L_{2,1}$-NMF) is formulated as
\[
\min_{F \geq 0,\, G \geq 0} \|X - FG\|_{2,1} \tag{9}
\]
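Objectives such as Eq. (7) are commonly minimized with the iterative reweighting trick of [1], where each row of $W$ receives the weight $1/(2\|w^i\|_2)$ and the resulting weighted ridge system is solved in closed form. Below is a minimal sketch of that idea applied to a simplified squared-loss variant of Eq. (7); the function name and toy data are illustrative, not from the paper.

```python
import numpy as np

def l21_regularized_regression(X, Y, gamma=1.0, n_iter=50, eps=1e-8):
    """Iteratively reweighted solver for
        min_W ||X^T W - Y||_F^2 + gamma * ||W||_{2,1}
    (a squared-loss variant of Eq. (7)); X is d x n, Y is n x c.
    Each pass solves a weighted ridge system in closed form."""
    d = X.shape[0]
    XXt, XY = X @ X.T, X @ Y
    # Ridge warm start so that no row begins exactly at zero.
    W = np.linalg.solve(XXt + gamma * np.eye(d), XY)
    for _ in range(n_iter):
        # Rows with small norm receive a large penalty weight, which drives
        # them toward zero -> row-sparse W, i.e. feature selection.
        row_norms = np.linalg.norm(W, axis=1)
        D = np.diag(1.0 / (2.0 * row_norms + eps))
        W = np.linalg.solve(XXt + gamma * D, XY)
    return W

# Toy usage: only the first 3 of 10 features are relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 100))
W_true = np.zeros((10, 2))
W_true[:3] = rng.standard_normal((3, 2))
Y = X.T @ W_true + 0.01 * rng.standard_normal((100, 2))
W = l21_regularized_regression(X, Y, gamma=5.0)
print(np.round(np.linalg.norm(W, axis=1), 3))  # last 7 row norms shrink toward 0
```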
3.4 $L_{\infty,1}$ Feature Selection [4]

\[
\min_{W \in \mathbb{R}^{d \times q}} J(X^T W) + \lambda \|W\|_{\infty,1} \tag{10}
\]
This formulation induces sparsity on the maximum absolute value of the elements of each row, thereby pushing all the elements of each row to zero. Sparse PCA/LDA enforce $L_1$-norm regularization on $W$, which induces element-wise sparsity in $W$ but does not necessarily achieve feature selection. For $L_{p,1}$-type regularization, increasing $p$ increases the sparsity sharing between the elements in each row; therefore the $L_{2,1}$ ($p = 2$) norm is more suitable for feature selection than $L_{1,1}$, and $p = \infty$ promises full sharing among the elements. The advantage of working on the central subspace is that the data may contain dimensions irrelevant to the task, and applying $\Phi(X^T W)$ avoids those noisy subspaces. Another reason is that the projection $W$ operates on the original features $X$ rather than on the non-linearly transformed features $\Phi(X)$, which makes it easier to interpret which of the original features are important.

3.5 Joint Feature Selection and Subspace Learning [5]

\[
\min_{W \in \mathbb{R}^{d \times q}} \|W\|_{2,1} + \mu \,\mathrm{Tr}(W^T X L X^T W) \tag{11}
\]
Note that the Laplacian matrix $L$ is constructed from the original data matrix $X$.

3.6 Feature Selection via Joint Embedding Learning and Sparse Regression [6]

\[
\arg\min_{W,\,Y:\, YY^T = I_{q \times q}} \mathrm{Tr}(Y L Y^T) + \beta\left( \|W^T X - Y\|_2^2 + \alpha \|W\|_{2,1} \right) \tag{12}
\]
where $Y \in \mathbb{R}^{q \times n}$, $L = (I_{n \times n} - S)^T (I_{n \times n} - S)$ is the graph Laplacian of Locally Linear Embedding, and $W \in \mathbb{R}^{d \times q}$.

3.7 Unsupervised Feature Selection Using Nonnegative Spectral Analysis [7]

\[
\min_{F^T F = I_q,\, F \geq 0,\, W} \mathrm{Tr}(F^T L F) + \alpha\left( \|X^T W - F\|_F^2 + \beta \|W\|_{2,1} \right) \tag{13}
\]
where $F \in \mathbb{R}^{n \times q}$, $L = I_{n \times n} - D^{-1/2} S D^{-1/2}$, and $W \in \mathbb{R}^{d \times q}$. All elements of $F$ are nonnegative by definition; however, without the nonnegativity constraint the optimal $F$ has mixed signs, which violates its definition. To address this problem, it is natural and reasonable to impose a nonnegativity constraint in the objective function. When both the nonnegativity and orthogonality constraints are satisfied, each row of $F$ has exactly one element greater than zero and all the others equal to zero. In this way the learned $F$ is more accurate and better able to provide discriminative information.

3.8 Multi-Task Feature Learning [8]

\[
\min_{A} \sum_{t=1}^{T} \sum_{i=1}^{m} L\left(y_{ti}, \langle a_t, U^T x_{ti} \rangle\right) + \gamma \|A\|_{2,1}^2 \tag{14}
\]
The second term couples the tasks and ensures that common features will be selected across them.
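The $\ell_{2,1}$-regularized objectives above (e.g., Eqs. (11)-(14)) are often handled with proximal-gradient methods, whose key ingredient is the proximal operator of the $\ell_{2,1}$ norm, i.e., row-wise group soft-thresholding. A minimal sketch (the helper name is mine):

```python
import numpy as np

def prox_l21(V, lam):
    """Proximal operator of lam * ||.||_{2,1}: row-wise soft-thresholding.
    Rows whose l2 norm is at most lam are set exactly to zero, which is
    how the regularizer performs feature (row) selection inside a
    proximal-gradient loop."""
    row_norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(row_norms, 1e-12))
    return scale * V

V = np.array([[3.0, 4.0],   # norm 5   -> shrunk to norm 3
              [0.3, 0.4],   # norm 0.5 -> zeroed
              [0.0, 2.0]])  # norm 2   -> zeroed (norm equals lam)
print(prox_l21(V, 2.0))
```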
3.9 High-Order Multi-Task Feature Learning [9]

\[
\min_{B} \sum_{t=1}^{T} \|X_t^T B_t - Y_t\|_F^2 + \alpha \|B_{(1)}\|_{2,1} + \beta\left( \|B_{(1)}\|_1 + \|B_{(2)}\|_1 \right) \tag{15}
\]
where $X_t \in \mathbb{R}^{d \times n}$, $B_t \in \mathbb{R}^{d \times c}$, $Y_t \in \mathbb{R}^{n \times c}$, $B_{(1)} \in \mathbb{R}^{d \times (c \cdot T)}$, $B_{(2)} \in \mathbb{R}^{c \times (d \cdot T)}$. Here $d$ is the number of features, $n$ is the number of samples, $T$ is the number of time points, and $c$ is the number of scores, i.e., the observations. In this case the $L_{2,1}$ norm enforces that different tasks (time points) select the same set of features.

3.10 Unsupervised Feature Selection for Linked Social Media Data [10]

\[
\min_{W} \mathrm{Tr}(W^T X L X^T W) + \beta \|W\|_{2,1} + \alpha \,\mathrm{Tr}\left( W^T X (I_n - F F^T) X^T W \right) \tag{16}
\]
\[
\text{s.t. } W^T (X X^T + \lambda I) W = I_c
\]
where $X \in \mathbb{R}^{m \times n}$, $m$ is the number of features, and $n$ is the number of samples. $W \in \mathbb{R}^{m \times c}$ assigns each data point a pseudo-class label, where $c$ is the number of pseudo-class labels. $L = D - S$ is a Laplacian matrix. $F = H(H^T H)^{-1/2}$ is the weighted social dimension indicator matrix, where $H \in \mathbb{R}^{n \times K}$ is the social dimension indicator matrix, which can be obtained through Modularity Maximization.

3.11 Multi-view Clustering and Feature Learning via Structured Sparsity [11]

A core assumption in MKL, as well as in many existing graph-based multi-view learning methods, is that all features in the same data source are considered equally important and given the same weight in data fusion, i.e., one weight is learned per kernel matrix or graph. However, one can expect the feature-wise importance to different learning tasks to vary significantly.

3.12 Robust Unsupervised Feature Selection [12]

\[
\min_{F,G,W} \|X - GF\|_{2,1} + \nu \,\mathrm{Tr}[G^T L G] + \alpha \|XW - G\|_{2,1} + \beta \|W\|_{2,1} \tag{17}
\]
\[
\text{s.t. } G \in \mathbb{R}^{n \times c}_{+},\; G^T G = I_c,\; F \in \mathbb{R}^{c \times d}_{+},\; W \in \mathbb{R}^{d \times c}
\]
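Across the models above, feature selection is typically carried out after optimization by ranking features according to the $\ell_2$ norm of the corresponding row of the learned $W$ and keeping the top-ranked ones; rows driven to zero by the $\ell_{2,1}$ penalty are discarded. A minimal sketch of this post-processing step (the helper name and toy $W$ are illustrative):

```python
import numpy as np

def select_features(W, k):
    """Score each feature by the l2 norm of its row of W and return the
    indices of the k highest-scoring (i.e., selected) features."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(scores)[::-1][:k]

# Toy W (d x c) whose last 7 rows were driven to zero by an l_{2,1} penalty.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 2))
W[3:] = 0.0
print(select_features(W, 3))  # indices 0, 1, 2 in some order
```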
References

[1] F. Nie, H. Huang, X. Cai, and C. H. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, in Advances in Neural Information Processing Systems, pp. 1813-1821, 2010.

[2] C. Ding, D. Zhou, X. He, and H. Zha, R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization, in Proceedings of the 23rd International Conference on Machine Learning, pp. 281-288, ACM, 2006.

[3] D. Kong, C. Ding, and H. Huang, Robust nonnegative matrix factorization using ℓ2,1-norm, in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 673-682, ACM, 2011.

[4] M. Masaeli, J. G. Dy, and G. M. Fung, From transformation-based dimensionality reduction to feature selection, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 751-758, 2010.

[5] Q. Gu, Z. Li, and J. Han, Joint feature selection and subspace learning, in IJCAI Proceedings - International Joint Conference on Artificial Intelligence, vol. 22, p. 1294, 2011.

[6] C. Hou, F. Nie, D. Yi, and Y. Wu, Feature selection via joint embedding learning and sparse regression, in IJCAI Proceedings - International Joint Conference on Artificial Intelligence, vol. 22, p. 1324, 2011.

[7] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu, Unsupervised feature selection using nonnegative spectral analysis, in AAAI, 2012.

[8] A. Argyriou, T. Evgeniou, and M. Pontil, Multi-task feature learning, in Advances in Neural Information Processing Systems, vol. 19, p. 41, 2007.

[9] H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen, High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer's disease progression prediction, in Advances in Neural Information Processing Systems, pp. 1277-1285, 2012.

[10] J. Tang and H. Liu, Unsupervised feature selection for linked social media data, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 904-912, ACM, 2012.

[11] H. Wang, F. Nie, and H. Huang, Multi-view clustering and feature learning via structured sparsity, in Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 352-360, 2013.

[12] M. Qian and C. Zhai, Robust unsupervised feature selection, in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1621-1627, AAAI Press, 2013.