$L_{2,1}$ Norm and its Applications

Yale Chang

1 Introduction

According to the structure of the constraints, sparsity can be obtained from three types of regularizers, each serving a different purpose:

1. Flat sparsity. This type of sparsity is typically achieved with an $l_1$-norm regularizer. Optimization techniques include LARS, gradient search, and proximal methods.

2. Structured sparsity, including group feature detection, joint vector sparsity, hierarchical group features, etc. This sparsity is often obtained with an $l_2/l_1$-norm regularizer.

3. Matrix/tensor sparsity, such as matrix/tensor completion. The typical regularizer is the trace norm, which can be solved by singular value thresholding.

2 Definition

Given a matrix $M \in \mathbb{R}^{n \times m}$, let $m^i$ denote its $i$-th row and $m_j$ its $j$-th column. The Frobenius norm of the matrix is defined as

$$\|M\|_F = \sqrt{\sum_{i=1}^{n} \|m^i\|_2^2} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} M_{ij}^2} \qquad (1)$$

The $l_{2,1}$ norm of the matrix is defined as

$$\|M\|_{2,1} = \sum_{i=1}^{n} \|m^i\|_2 = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} M_{ij}^2} \qquad (2)$$

The $l_{2,1}$ norm is rotationally invariant for rows: $\|MR\|_{2,1} = \|M\|_{2,1}$ for any rotation matrix $R$. The $l_{2,1}$ norm can be generalized to the $l_{r,p}$ norm [1]:

$$\|M\|_{r,p} = \left( \sum_{i=1}^{n} \|m^i\|_r^p \right)^{1/p} = \left( \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} |M_{ij}|^r \Big)^{p/r} \right)^{1/p} \qquad (3)$$
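As a quick sanity check on these definitions, the following sketch (Python with NumPy; the function names are my own, not part of any cited work) computes the three norms directly from equations (1)-(3) and numerically verifies the row-wise rotational invariance $\|MR\|_{2,1} = \|M\|_{2,1}$ for a random orthogonal $R$.

```python
import numpy as np

def frobenius_norm(M):
    """Eq. (1): square root of the sum of squared entries."""
    return np.sqrt((M ** 2).sum())

def l21_norm(M):
    """Eq. (2): sum of the l2 norms of the rows of M."""
    return np.linalg.norm(M, axis=1).sum()

def lrp_norm(M, r, p):
    """Eq. (3): l_{r,p} norm, i.e. the l_p norm of the vector of row-wise l_r norms."""
    row_norms = (np.abs(M) ** r).sum(axis=1) ** (1.0 / r)
    return (row_norms ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))

# l_{2,2} coincides with the Frobenius norm, l_{2,1} with the l_{2,1} norm.
assert np.isclose(lrp_norm(M, 2, 2), frobenius_norm(M))
assert np.isclose(lrp_norm(M, 2, 1), l21_norm(M))

# Rotational invariance for rows: ||MR||_{2,1} == ||M||_{2,1} for orthogonal R.
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
assert np.isclose(l21_norm(M @ R), l21_norm(M))
print(frobenius_norm(M), l21_norm(M), lrp_norm(M, 2, 1))
```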

3 Applications

3.1 Rotational Invariant $L_1$-norm PCA for Robust Subspace Factorization [2]

In the $R_1$-norm, distances in the spatial dimensions (attribute dimensions) are measured in $L_2$, while the summation over different data points uses $L_1$. Let $X = \{x_1, \ldots, x_n\}$ be $n$ data points in $d$-dimensional space. In matrix form $X = (x_{ji})$, where index $j$ runs over spatial dimensions and index $i$ runs over data points. The $R_1$-norm is defined as

$$\|X\|_{R_1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{d} x_{ji}^2} \qquad (4)$$

The formulation of standard PCA is

$$\min_{U,V} J_{SVD} = \|X - UV\|_F^2 = \sum_{i=1}^{n} \|x_i - U v_i\|^2 \qquad (5)$$

The formulation of $R_1$-PCA is

$$\min_{U,V} J_{R_1\text{-}PCA} = \|X - UV\|_{R_1} = \sum_{i=1}^{n} \|x_i - U v_i\| \qquad (6)$$

A common feature of previous approaches using the Frobenius norm and the $L_1$-norm is that they treat the two indexes $i$ and $j$ in the same way. However, these two indexes have different meanings: $i$ runs through data points, while $j$ runs through the spatial dimensions. In strict matrix format this subtle distinction is easy to lose; the $R_1$-norm captures it.

3.2 Robust ($L_{2,1}$) Feature Selection [1]

$$\min_W J(W) = \|X^T W - Y\|_{2,1} + \gamma \|W\|_{2,1} \qquad (7)$$

where $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, $W \in \mathbb{R}^{d \times c}$, $Y \in \mathbb{R}^{n \times c}$. The residual $\|W^T x_i - y_i\|_2$ is not squared, so outliers carry less weight than with the squared residual $\|W^T x_i - y_i\|_2^2$.

3.3 Robust Nonnegative Matrix Factorization [3]

The assumption of Gaussian noise leads to the formulation of standard NMF under the constraints $F \ge 0$, $G \ge 0$, while the assumption of Laplacian noise leads to the $L_{2,1}$ NMF formulation. The formulation of standard NMF is

$$\min_{F \ge 0,\, G \ge 0} \|X - FG\|_F^2 \qquad (8)$$

where $X \in \mathbb{R}^{d \times n}$, $F \in \mathbb{R}^{d \times q}_+$, $G \in \mathbb{R}^{q \times n}_+$. The robust NMF ($L_{2,1}$ NMF) is formulated as

$$\min_{F \ge 0,\, G \ge 0} \|X - FG\|_{2,1} \qquad (9)$$
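To make Section 3.2 concrete, here is a minimal iteratively reweighted least-squares (IRLS) sketch for objective (7), in the spirit of the reweighting scheme of [1]; the variable names, initialization, and fixed iteration count are my own choices rather than details taken from the paper. Each iteration solves a weighted ridge-like linear system in which samples with large residuals are down-weighted (robustness) and rows of $W$ with small norms are penalized more strongly (row sparsity).

```python
import numpy as np

def robust_feature_selection(X, Y, gamma=1.0, n_iter=50, eps=1e-8):
    """IRLS-style sketch for eq. (7): min_W ||X^T W - Y||_{2,1} + gamma ||W||_{2,1}.
    X: (d, n) data with samples as columns, Y: (n, c) targets. Returns W: (d, c)."""
    d, n = X.shape
    W = np.linalg.solve(X @ X.T + gamma * np.eye(d), X @ Y)   # ridge-like warm start
    for _ in range(n_iter):
        R = X.T @ W - Y                                        # per-sample residual rows, (n, c)
        # Samples with large residual norm get a small weight -> robustness to outliers.
        D = 1.0 / (2.0 * np.maximum(np.linalg.norm(R, axis=1), eps))    # (n,)
        # Rows of W with small norm get a large penalty weight -> whole rows shrink to zero.
        Dt = 1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps))   # (d,)
        A = (X * D) @ X.T + gamma * np.diag(Dt)                # X D X^T + gamma * D_tilde
        W = np.linalg.solve(A, (X * D) @ Y)                    # (d, c)
    return W

# Features are then ranked by the row norms ||w^j||_2 of the learned W.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))                  # d = 20 features, n = 100 samples
Y = np.eye(3)[rng.integers(0, 3, 100)]              # one-hot pseudo-labels, c = 3
W = robust_feature_selection(X, Y, gamma=0.5)
print(np.argsort(-np.linalg.norm(W, axis=1))[:5])   # indices of the top-5 rows/features
```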

3.4 $L_{\infty,1}$ Feature Selection [4]

$$\min_{W \in \mathbb{R}^{d \times q}} J(X^T W) + \lambda \|W\|_{\infty,1} \qquad (10)$$

This formulation induces sparsity on the maximum absolute value of the elements of each row, thereby pushing all the elements of each row to zero. Sparse PCA/LDA enforce $L_1$-norm regularization on $W$. This sets individual sparsity on the elements of $W$ but does not necessarily achieve feature selection. For $L_{p,1}$-type regularization, increasing $p$ increases the sparsity sharing between the elements in each row. Therefore the $L_{2,1}$ ($p = 2$) norm is more suitable for feature selection than $L_{1,1}$; moreover, $p = \infty$ promises full sharing of the elements.

The advantage of working on the central subspace is that the data may contain dimensions irrelevant to the task; applying $\Phi(X^T W)$ avoids those noisy subspaces. The other reason is that the projection $W$ operates on the original features $X$ rather than on the non-linearly transformed features $\Phi(X)$, which makes it more amenable to interpreting which of the original features are important.

3.5 Joint Feature Selection and Subspace Learning [5]

$$\min_{W \in \mathbb{R}^{d \times q}} \|W\|_{2,1} + \mu \,\mathrm{Tr}(W^T X L X^T W) \qquad (11)$$

Note that the Laplacian matrix $L$ is constructed from the original data matrix $X$.

3.6 Feature Selection via Joint Embedding and Sparse Regression [6]

$$\arg\min_{W,\,Y:\; Y Y^T = I_{q \times q}} \mathrm{Tr}(Y L Y^T) + \beta \big( \|W^T X - Y\|_2^2 + \alpha \|W\|_{2,1} \big) \qquad (12)$$

where $Y \in \mathbb{R}^{q \times n}$, $L = (I_{n \times n} - S)^T (I_{n \times n} - S)$ is the graph Laplacian of Locally Linear Embedding, and $W \in \mathbb{R}^{d \times q}$.

3.7 Unsupervised Feature Selection Using Nonnegative Spectral Analysis [7]

$$\min_{F^T F = I_q,\; F \ge 0,\; W} \mathrm{Tr}(F^T L F) + \alpha \big( \|X^T W - F\|_F^2 + \beta \|W\|_{2,1} \big) \qquad (13)$$

where $F \in \mathbb{R}^{n \times q}$, $L = I_{n \times n} - D^{-1/2} S D^{-1/2}$, and $W \in \mathbb{R}^{d \times q}$. Note that all the elements of $F$ are nonnegative by definition. Without the nonnegativity constraint, however, the optimal $F$ has mixed signs, which violates its definition. To address this problem, it is natural and reasonable to impose a nonnegativity constraint in the objective function. When both the nonnegativity and orthogonality constraints are satisfied, exactly one element in each row of $F$ is greater than zero and all the others are zero. In this way, the learned $F$ is more accurate and better able to provide discriminative information.

3.8 Multi-Task Feature Learning [8]

$$\min_{A} \sum_{t=1}^{T} \sum_{i=1}^{m} L\big(y_{ti}, \langle a_t, U^T x_{ti} \rangle\big) + \gamma \|A\|_{2,1}^2 \qquad (14)$$

The second term couples the tasks and ensures that common features will be selected across them.
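Several of the objectives above (e.g. (7), (12), (13)) pair a smooth loss with an $\|W\|_{2,1}$ penalty, so they can also be attacked with the proximal methods mentioned in the Introduction. The sketch below is my own illustration, not code from any of the cited papers: it implements the proximal operator of $\tau\|\cdot\|_{2,1}$, which shrinks each row of $W$ and sets weak rows exactly to zero, and applies it in plain proximal-gradient steps to a squared-loss variant of (7).

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||.||_{2,1}: row-wise soft-thresholding.
    Each row is scaled by max(0, 1 - tau / ||row||_2), so weak rows vanish entirely."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale

def prox_gradient(X, Y, lam=0.1, n_iter=200):
    """Proximal gradient on 0.5 * ||X^T W - Y||_F^2 + lam * ||W||_{2,1}
    (a squared-loss stand-in for eq. (7), used only to illustrate the proximal step)."""
    d = X.shape[0]
    W = np.zeros((d, Y.shape[1]))
    step = 1.0 / np.linalg.norm(X @ X.T, 2)     # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X @ (X.T @ W - Y)                # gradient of the smooth loss
        W = prox_l21(W - step * grad, step * lam)
    return W

# Usage: with X of shape (d, n) and Y of shape (n, c), prox_gradient(X, Y) returns a row-sparse W.
```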

3.9 High-Order Multi-Task Feature Learning [9]

$$\min_{B} \sum_{t=1}^{T} \|X_t^T B_t - Y_t\|_F^2 + \alpha \|B^{(1)}\|_{2,1} + \beta \big( \|B^{(1)}\|_1 + \|B^{(2)}\|_1 \big) \qquad (15)$$

where $X_t \in \mathbb{R}^{d \times n}$, $B_t \in \mathbb{R}^{d \times c}$, $Y_t \in \mathbb{R}^{n \times c}$, $B^{(1)} \in \mathbb{R}^{d \times (cT)}$, $B^{(2)} \in \mathbb{R}^{c \times (dT)}$. Here $d$ is the number of features, $n$ is the number of samples, $T$ is the number of time points, and $c$ is the number of scores (the observations). In this case, the $L_{2,1}$ norm enforces that different tasks (time points) select the same set of features.

3.10 Unsupervised Feature Selection for Linked Social Media Data [10]

$$\min_{W} \mathrm{Tr}(W^T X L X^T W) + \beta \|W\|_{2,1} + \alpha \,\mathrm{Tr}\big( W^T X (I_{n \times n} - F F^T) X^T W \big) \qquad (16)$$
$$\text{s.t.} \quad W^T (X X^T + \lambda I) W = I_c$$

where $X \in \mathbb{R}^{m \times n}$, $m$ is the number of features and $n$ is the number of samples. $W \in \mathbb{R}^{m \times c}$ assigns each data point a pseudo-class label, where $c$ is the number of pseudo-class labels. $L = D - S$ is a Laplacian matrix, $F = H (H^T H)^{-1/2}$ is the weighted social dimension indicator matrix, and $H \in \mathbb{R}^{K \times n}$ is the social dimension indicator matrix, which can be obtained through modularity maximization.

3.11 Multi-view Clustering and Feature Learning via Structured Sparsity [11]

A core assumption in MKL, as well as in many existing graph-based multi-view learning methods, is that all features in the same data source are considered equally important and are given the same weight in data fusion, i.e., one weight is learned per kernel matrix or graph. However, one can expect that the feature-wise importance for different learning tasks varies significantly.

3.12 Robust Unsupervised Feature Selection [12]

$$\min_{F,G,W} \|X - GF\|_{2,1} + \nu \,\mathrm{Tr}(G^T L G) + \alpha \|XW - G\|_{2,1} + \beta \|W\|_{2,1} \qquad (17)$$
$$\text{s.t.} \quad G \in \mathbb{R}^{n \times c}_+, \quad G^T G = I_c, \quad F \in \mathbb{R}^{c \times d}_+, \quad W \in \mathbb{R}^{d \times c}$$
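Most of the unsupervised objectives above (Sections 3.5-3.7, 3.10, 3.12) rely on a graph Laplacian built from the data itself. As a final illustration, here is a small sketch of my own showing the unnormalized Laplacian $L = D - S$ (as in Section 3.10) and the normalized Laplacian $L = I - D^{-1/2} S D^{-1/2}$ (as in Section 3.7); the kNN heat-kernel affinity used here is a common but illustrative choice, and the cited papers each specify their own graph construction.

```python
import numpy as np

def knn_heat_affinity(X, k=5, sigma=1.0):
    """Symmetric kNN affinity S from data X with shape (n, d), rows are samples."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                   # k nearest neighbours (skip self)
        S[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(S, S.T)                               # symmetrize

def laplacians(S):
    """Unnormalized L = D - S and normalized L = I - D^{-1/2} S D^{-1/2}."""
    deg = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_un = np.diag(deg) - S
    L_sym = np.eye(S.shape[0]) - D_inv_sqrt @ S @ D_inv_sqrt
    return L_un, L_sym

X = np.random.default_rng(3).standard_normal((50, 10))      # n = 50 samples, d = 10 features
L_un, L_sym = laplacians(knn_heat_affinity(X))
print(np.linalg.eigvalsh(L_sym)[:3])                         # nonnegative spectrum, smallest ~ 0
```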

References

[1] F. Nie, H. Huang, X. Cai, and C. H. Ding, Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization, in Advances in Neural Information Processing Systems, pp. 1813-1821, 2010.

[2] C. Ding, D. Zhou, X. He, and H. Zha, R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization, in Proceedings of the 23rd International Conference on Machine Learning, pp. 281-288, ACM, 2006.

[3] D. Kong, C. Ding, and H. Huang, Robust nonnegative matrix factorization using $\ell_{2,1}$-norm, in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 673-682, ACM, 2011.

[4] M. Masaeli, J. G. Dy, and G. M. Fung, From transformation-based dimensionality reduction to feature selection, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 751-758, 2010.

[5] Q. Gu, Z. Li, and J. Han, Joint feature selection and subspace learning, in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1294, 2011.

[6] C. Hou, F. Nie, D. Yi, and Y. Wu, Feature selection via joint embedding learning and sparse regression, in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1324, 2011.

[7] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu, Unsupervised feature selection using nonnegative spectral analysis, in AAAI, 2012.

[8] A. Argyriou, T. Evgeniou, and M. Pontil, Multi-task feature learning, Advances in Neural Information Processing Systems, vol. 19, p. 41, 2007.

[9] H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen, High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer's disease progression prediction, in Advances in Neural Information Processing Systems, pp. 1277-1285, 2012.

[10] J. Tang and H. Liu, Unsupervised feature selection for linked social media data, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 904-912, ACM, 2012.

[11] H. Wang, F. Nie, and H. Huang, Multi-view clustering and feature learning via structured sparsity, in Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 352-360, 2013.

[12] M. Qian and C. Zhai, Robust unsupervised feature selection, in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1621-1627, AAAI Press, 2013.