Matrix Factorization with Applications to Clustering Problems: Formulation, Algorithms and Performance
1 Date: Mar. 3rd, 2017. Matrix Factorization with Applications to Clustering Problems: Formulation, Algorithms and Performance. Presenter: Songtao Lu, Department of Electrical and Computer Engineering, Iowa State University
2 Outline
- Formulation: Spectral Clustering; Joint Factor Analysis and Latent Clustering
- Algorithms: SymNMF; Joint Factor Analysis and Latent Clustering; Deep Neural Networks for Clustering
- Other Problems
- Conclusion
3 Applications: Graph Partitioning
Figure: The two graphs above are the same graph, reorganized. It is drawn from the stochastic block model (SBM) with 1000 vertices, 5 balanced communities, within-cluster probability 1/50 and across-cluster probability 1/1000. Source: Emmanuel Abbe, http : // soc CD1.pdf
4 Clustering Problem Formulation
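5 Kernel K-means clustering
K-means clustering:
$$J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - m_k\|^2 \quad (1)$$
$$\phantom{J_K} = c^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j \quad (2)$$
where $m_k = \sum_{i \in C_k} x_i / n_k$ is the centroid of cluster $C_k$, which contains $n_k$ points, and $c^2 = \sum_i \|x_i\|^2$.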
6 Kernel K-means clustering
K-means clustering (matrix form):
$$J_K = \mathrm{Tr}(X^T X) - \mathrm{Tr}(H^T X^T X H) \quad (3)$$
$$H = (h_1, \ldots, h_K), \qquad h_k^T h_l = \delta_{kl}, \qquad h_k = (0, \ldots, 0, \underbrace{1, \ldots, 1}_{n_k}, 0, \ldots, 0)^T / n_k^{1/2}$$
With $W = X^T X$, minimizing $J_K$ becomes
$$\max_{H^T H = I,\ H \ge 0} J_W(H) = \mathrm{Tr}(H^T W H) \quad (4)$$
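The three forms (1), (2) and (3) of the K-means objective can be checked numerically. The following is a minimal NumPy sketch, not part of the original slides: the toy data, the cluster assignment `labels`, and all variable names are assumptions made for illustration. Columns of `X` are the points $x_i$, so that $W = X^T X$ as in (4).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): d features, n points stored as columns of X.
d, n, K = 3, 12, 3
X = rng.normal(size=(d, n))
labels = np.arange(n) % K  # arbitrary assignment of points to K clusters

# Form (1): sum of squared distances to the cluster centroids m_k.
J1 = 0.0
for k in range(K):
    Xk = X[:, labels == k]
    m_k = Xk.mean(axis=1, keepdims=True)
    J1 += np.sum((Xk - m_k) ** 2)

# Form (2): c^2 - sum_k (1/n_k) sum_{i,j in C_k} x_i^T x_j.
J2 = np.sum(X ** 2)
for k in range(K):
    Xk = X[:, labels == k]
    J2 -= np.sum(Xk.T @ Xk) / Xk.shape[1]

# Form (3): Tr(X^T X) - Tr(H^T X^T X H) with the scaled indicator matrix H.
H = np.zeros((n, K))
for k in range(K):
    idx = labels == k
    H[idx, k] = 1.0 / np.sqrt(idx.sum())
W = X.T @ X
J3 = np.trace(W) - np.trace(H.T @ W @ H)
```

All three values agree up to rounding, and `H.T @ H` is the identity, which is the constraint that appears in (4).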
7 Kernel K-means clustering
A nonlinear transformation (mapping):
$$x_i \to \phi(x_i) \quad (5)$$
Kernel K-means can be written as
$$\min_{\{C_k\}} \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - m_k\|^2 \quad (6)$$
where $m_k = \sum_{i \in C_k} \phi(x_i) / n_k$ is the centroid of cluster $C_k$, which contains $n_k$ points.
Kernel K-means is equivalent to
$$\max_{H} \sum_{k} \frac{1}{n_k} \sum_{i,j \in C_k} W_{i,j} = \mathrm{Tr}(H^T W H) \quad (7)$$
Kernel: $W_{i,j} = \phi(x_i)^T \phi(x_j)$
Membership matrix: $H = (h_1, \ldots, h_K)$, $h_k^T h_l = \delta_{kl}$, $h_k = (0, \ldots, 0, \underbrace{1, \ldots, 1}_{n_k}, 0, \ldots, 0)^T / n_k^{1/2}$
8 Spectral Clustering
9 Spectral Clustering vs. K-Means
10 Challenge of Spectral Clustering
$$W = \begin{bmatrix} W_1 & 0 & 0 \\ 0 & W_2 & 0 \\ 0 & 0 & W_3 \end{bmatrix} \quad (8)$$
If $\lambda_3(W_1) > \max(\lambda_1(W_2), \lambda_2(W_3))$, three leading eigenvectors of the similarity matrix are
11 Challenge of Spectral Clustering
Figure panels: Original Data; New Representation in Eigenvectors; Spectral Clustering (accuracy: 37.95%); SymNMF (accuracy: 88.78%).
D. Kuang, S. Yun and H. Park, SymNMF: nonnegative low rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, pp , July,
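12 Kernel K-means clustering
Equivalence between K-means and matrix factorization:
$$H = \arg\min_{H^T H = I,\ H \ge 0} -2\,\mathrm{Tr}(H^T W H) \quad (9)$$
$$\phantom{H} = \arg\min_{H^T H = I,\ H \ge 0} \|W\|_F^2 - 2\,\mathrm{Tr}(H^T W H) + \|H^T H\|_F^2 \quad (10)$$
$$\phantom{H} = \arg\min_{H^T H = I,\ H \ge 0} \|W - H H^T\|_F^2 \quad (11)$$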
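The step from (10) to (11) follows from expanding the Frobenius norm. The short derivation below is added for completeness; it uses only the constraint $H^T H = I_K$ and the cyclic property of the trace.

```latex
\begin{aligned}
\|W - HH^T\|_F^2
  &= \|W\|_F^2 - 2\,\mathrm{Tr}(W^T H H^T) + \|HH^T\|_F^2 \\
  &= \|W\|_F^2 - 2\,\mathrm{Tr}(H^T W H) + \mathrm{Tr}(H^T H\, H^T H) \\
  &= \|W\|_F^2 - 2\,\mathrm{Tr}(H^T W H) + \|H^T H\|_F^2 ,
\end{aligned}
\qquad \text{and } \|H^T H\|_F^2 = \|I_K\|_F^2 = K \text{ on the feasible set.}
```

Since $\|W\|_F^2$ and $K$ do not depend on $H$, the minimizers of (9), (10) and (11) coincide.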
13 Motivation
Relaxed version of K-means:
$$\max_{H^T H = I,\ H \ge 0} \mathrm{Tr}(H^T W H) \quad (12)$$
Spectral Clustering:
$$\min_{H} \|H H^T - W\|_F^2 \quad \text{subject to } H^T H = I$$
SymNMF:
$$\min_{H} \|H H^T - W\|_F^2 \quad \text{subject to } H \ge 0$$
Diagrams: the factors $H$, $H^T$ and the matrix $W$ for each formulation.
D. Kuang, S. Yun and H. Park, SymNMF: nonnegative low rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, pp , July,
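14 SymNMF for Clustering
Pipeline: data matrix with N samples and M features; construct the similarity matrix; apply SymNMF.
SymNMF:
$$\min_{X \in \mathbb{R}^{N \times K}} \|X X^T - Z\|_F^2 \quad \text{subject to } X \ge 0$$
$Z \in \mathbb{R}^{N \times N}$: pairwise similarity matrix
$X \in \mathbb{R}^{N \times K}$: clustering indicator matrix
Diagram: the factors $X$, $X^T$ and the matrix $Z$.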
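15 Joint factor analysis and latent clustering
$$\min_{W \in \mathbb{R}^{N \times F},\ H \in \mathbb{R}^{F \times M}} \|X - W H\|_F^2 \quad (13)$$
$$\text{s.t. } W \ge 0,\ H \ge 0 \quad (14)$$
16 Joint factor analysis and latent clustering
$$\min_{S, M} \|X - S M\|_F^2 \quad (15)$$
$$\text{s.t. } S(i,j) \in \{0, 1\},\ \|S(i,:)\|_0 = 1 \quad (16)$$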
17 Joint factor analysis and latent clustering
Step 1: dimension reduction via factorization (e.g. SVD, NMF)
Step 2: perform K-means clustering on the latent factor W
Drawbacks of the two-step approach:
- ignores latent cluster structure when performing dimension reduction
- uses naive factorization when clustering
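The two-step baseline described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `kmeans` and `two_step_clustering` are hypothetical, the factorization is a truncated SVD (one of the options the slide lists), and the K-means step is a plain Lloyd iteration with random initialization rather than any tuned implementation.

```python
import numpy as np

def kmeans(points, K, n_iter=50, seed=0):
    """Plain Lloyd iterations; rows of `points` are samples."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest centroid (squared Euclidean distance).
        dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = points[labels == k].mean(axis=0)
    return labels, centroids

def two_step_clustering(X, F, K, seed=0):
    """Step 1: rank-F truncated SVD of X (rows are samples). Step 2: K-means on W."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    W = U[:, :F] * s[:F]  # latent representation of each row of X
    labels, _ = kmeans(W, K, seed=seed)
    return labels, W
```

Because the factorization in step 1 never sees the cluster structure, nothing in this pipeline encourages the rows of `W` to be K-means friendly, which is the drawback the slide points out.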
18 Motivation
A real-world example: 2 clusters of documents taken from the Reuters text corpus, NMF with rank = 2. The figure shows the weights of each document on the two latent topics.
19 Latent clustering
$\{X(i,:)\}_{i=1}^{N}$ have latent representations drawn from K clusters.
The rows of W can be divided into K clusters:
$$\min_{W, H, S, M} \|X - W H\|_F^2 + \lambda \|W - S M\|_F^2 \quad (17)$$
$$\text{s.t. } W \ge 0,\ H \ge 0,\ \|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (18)$$
- $\lambda \ge 0$ is a pre-specified regularization parameter
- $S \in \mathbb{R}^{N \times K}$ denotes cluster membership; $S(i,k) = 1$ means that $W(i,:)$ belongs to cluster k
- $M \in \mathbb{R}^{K \times F}$ denotes the centroid matrix, where each centroid is $M(k,:)$
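20 DNN
$$\min_{W, S, M} \sum_{i=1}^{N} \|f(x_i; H) - S(i,:) M\|_2^2 \quad (19)$$
$$\text{s.t. } \|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (20)$$
where $w_i = f(x_i; H)$.
$$\min_{H, Z, S, M} \sum_{i=1}^{N} \ell\big(g(f(x_i; H); Z),\, x_i\big) + \frac{\lambda}{2} \|f(x_i; H) - S(i,:) M\|_2^2 \quad (21)$$
$$\text{s.t. } \|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (22)$$
where $\hat{x}_i = g(w_i; Z)$.
21 Algorithms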
22 Symmetric Nonnegative Matrix Factorization
$$\text{P1}: \quad \min_{X \in \mathbb{R}^{N \times K}} f(X) \triangleq \frac{1}{2} \|X X^T - Z\|_F^2 \quad \text{subject to } X \ge 0$$
Challenges:
- 4th-order polynomial (nonconvex) in X
- No Lipschitz continuous gradient
- General matrix $Z \in \mathbb{R}^{N \times N}$: non-symmetric, indefinite, contains negative entries
- K is any integer in [1, N]
23 SymNMF (two dimensional case, K = 1)
$$\min_{x} \|x x^T - Z\|_F^2 \quad \text{subject to } x \ge 0$$
where $x = [x_1, x_2]^T$. Hessian matrix:
$$H = \begin{pmatrix} 12 x_1^2 + 4 x_2^2 - 4 Z_{11} & 8 x_1 x_2 - 4 Z_{12} \\ 8 x_2 x_1 - 4 Z_{21} & 12 x_2^2 + 4 x_1^2 - 4 Z_{22} \end{pmatrix}$$
Figure panels: objective around $[0, 0]^T$ for Z positive definite, Z indefinite, and Z negative definite.
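The Hessian above can be checked against finite differences, and evaluating it at the origin gives $H = -4Z$. For symmetric $Z$ this makes the origin a local maximum, a saddle point, or a local minimum of the unconstrained objective when $Z$ is positive definite, indefinite, or negative definite, respectively. The sketch below is illustrative only: the function names and the example matrix `Z_indef` are assumptions, and the closed form is used as written on the slide, which presumes $Z_{12} = Z_{21}$.

```python
import numpy as np

def objective(x, Z):
    # f(x) = ||x x^T - Z||_F^2 (no 1/2 factor, matching the Hessian on the slide).
    return np.linalg.norm(np.outer(x, x) - Z, "fro") ** 2

def hessian(x, Z):
    # Closed form from the slide; assumes a symmetric Z.
    x1, x2 = x
    return np.array([
        [12 * x1**2 + 4 * x2**2 - 4 * Z[0, 0], 8 * x1 * x2 - 4 * Z[0, 1]],
        [8 * x2 * x1 - 4 * Z[1, 0], 12 * x2**2 + 4 * x1**2 - 4 * Z[1, 1]],
    ])

def numerical_hessian(x, Z, h=1e-3):
    # Central finite differences, used only as an independent check.
    n = len(x)
    Hn = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            Hn[i, j] = (objective(x + ei + ej, Z) - objective(x + ei - ej, Z)
                        - objective(x - ei + ej, Z) + objective(x - ei - ej, Z)) / (4 * h * h)
    return Hn

Z_indef = np.array([[1.0, 2.0], [2.0, -1.0]])  # hypothetical symmetric indefinite Z
x0 = np.array([0.3, 0.7])
H_origin = hessian(np.zeros(2), Z_indef)       # equals -4 * Z_indef
eigs_origin = np.linalg.eigvalsh(H_origin)     # one negative, one positive: a saddle
```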
24 Background and Motivation
Clustering with Symmetric Nonnegative Matrix Factorization (SymNMF):
- Probabilistic clustering [Zass et al 05]
- Community detection [Wang et al 11] [Ma et al 10]
- Overlapping community detection [Zhang et al 13]
- Graph partitioning and image segmentation [Park et al 15]
Figure: clustering accuracy of K-means variants, NMF variants, spectral clustering variants and SymNMF.
D. Kuang, S. Yun and H. Park, SymNMF: nonnegative low rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, pp , July,
25 Related Work
Existing Algorithms
Projected Gradient Descent (PGD) [Kuang, Park et al 12]:
$$X^{(t+1)} = \mathrm{proj}_+\big[X^{(t)} - \alpha \nabla f(X^{(t)})\big], \qquad \mathrm{proj}_+[X] = \max\{X, 0\}$$
Projected Newton Method (PNewton) [Kuang, Park et al 12]:
$$X^{(t+1)} = \mathrm{proj}_+\big[X^{(t)} - \alpha \big(\nabla^2 f(X^{(t)})\big)^{-1} \nabla f(X^{(t)})\big]$$
Disadvantages: no global convergence guarantee (no Lipschitz continuous gradient)
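The PGD update can be sketched directly from the formula above. This is a minimal illustration, not the implementation of [Kuang, Park et al 12]: the function names are hypothetical, and the step size $\alpha$ is chosen by a simple backtracking rule along the projection arc, which is an assumption added here so that the objective does not increase. The gradient of $f(X) = \frac{1}{2}\|XX^T - Z\|_F^2$ is $(2XX^T - Z - Z^T)X$, which also covers a non-symmetric $Z$.

```python
import numpy as np

def symnmf_objective(X, Z):
    return 0.5 * np.linalg.norm(X @ X.T - Z, "fro") ** 2

def symnmf_gradient(X, Z):
    # Gradient of 0.5 * ||X X^T - Z||_F^2; valid for non-symmetric Z as well.
    return (2.0 * X @ X.T - Z - Z.T) @ X

def pgd_symnmf(Z, K, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(Z.shape[0], K))
    history = [symnmf_objective(X, Z)]
    for _ in range(n_iter):
        G = symnmf_gradient(X, Z)
        alpha = 1.0
        while True:
            # Projected step: proj_+[X - alpha * grad f(X)].
            X_new = np.maximum(X - alpha * G, 0.0)
            decrease = history[-1] - symnmf_objective(X_new, Z)
            # Sufficient-decrease test (assumed rule), with a floor on alpha.
            if decrease >= 1e-4 * np.sum(G * (X - X_new)) or alpha < 1e-12:
                break
            alpha *= 0.5
        X = X_new
        history.append(symnmf_objective(X, Z))
    return X, history
```

A typical call is `X, history = pgd_symnmf(Z, K)`, where `history` records the objective value after each iteration.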
26 Related Work
Existing Algorithms (continued)
Eigenvalue-decomposition-based SymNMF [Huang, Sidiropoulos et al 14]:
- $Z = U_K \Sigma_K U_K^T$, and let $B \triangleq U_K \Sigma_K^{1/2}$
- Assume $Z = X X^T$
$$\min_{X, Q} \frac{1}{2} \|X - B Q\|_F^2 \quad \text{subject to } X \ge 0,\ Q^T Q = Q Q^T = I$$
Disadvantages:
- assumes there is an exact decomposition (i.e., $Z = X X^T$)
- there is no proof that the optimal objective value is 0
27 Related Work
Existing Algorithms (continued)
Alternating Non-negative Least Squares (ANLS) [Kuang, Park et al 15]:
$$\min_{X, Y} \frac{1}{2} \|X Y^T - Z\|_F^2 + \lambda \|X - Y\|_F^2 \quad \text{subject to } Y \ge 0,\ X \ge 0$$
Disadvantages: the KKT points of this problem are different from those of P1
Coordinate Descent (CD) [Vandaele, Gillis et al 16]:
$$\min_{X_{:,j} \ge 0} \|X_{:,j} X_{:,j}^T - R^{(j)}\|_F^2 \quad \text{where } R^{(j)} = Z - \sum_{k=1,\ k \ne j}^{K} X_{:,k} X_{:,k}^T$$
Disadvantages: no convergence guarantee to KKT points (the optimal solution of each subproblem is not unique)
28 New Formulation of SymNMF
Our formulation of SymNMF:
$$\text{P2}: \quad \min_{X \in \mathbb{R}^{N \times K},\ Y \in \mathbb{R}^{N \times K}} \frac{1}{2} \|X Y^T - Z\|_F^2 \quad \text{subject to } Y \ge 0,\ X = Y,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i$$
Advantages of the new formulation:
- variable splitting
- feasible set is compact (closed and bounded)
29 New Formulation of SymNMF
How to solve problem P2?
30 Alternating Direction Method of Multipliers (ADMM)
Problem:
$$\min_{x, y} h(x) + g(y) \quad \text{subject to } A x + B y = c$$
where $x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$, $A \in \mathbb{R}^{P \times N}$, $B \in \mathbb{R}^{P \times M}$, and $c \in \mathbb{R}^P$.
The augmented Lagrangian is
$$L(x, y; \lambda) = h(x) + g(y) + \lambda^T (A x + B y - c) + \frac{\rho}{2} \|A x + B y - c\|_2^2$$
where $\rho > 0$.
ADMM consists of the iterations [Boyd et al 04]
$$x^{(t+1)} = \arg\min_x L(x, y^{(t)}; \lambda^{(t)})$$
$$y^{(t+1)} = \arg\min_y L(x^{(t+1)}, y; \lambda^{(t)})$$
$$\lambda^{(t+1)} = \lambda^{(t)} + \rho (A x^{(t+1)} + B y^{(t+1)} - c)$$
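To make the three ADMM steps concrete, the sketch below applies them to a small convex problem chosen only for illustration (it is not from the slides): nonnegative least squares, with $h(x) = \frac{1}{2}\|Cx - b\|_2^2$, $g(y)$ the indicator of $y \ge 0$, and the constraint $x - y = 0$ (so $A = I$, $B = -I$, $c = 0$). The names `admm_nnls`, `C` and `b` are assumptions.

```python
import numpy as np

def admm_nnls(C, b, rho=10.0, n_iter=500):
    """ADMM for min 0.5*||C x - b||^2 + indicator(y >= 0) subject to x - y = 0."""
    n = C.shape[1]
    x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    lhs = C.T @ C + rho * np.eye(n)  # fixed system matrix of the x-update
    Ctb = C.T @ b
    for _ in range(n_iter):
        # x-update: minimize the augmented Lagrangian over x (a linear solve).
        x = np.linalg.solve(lhs, Ctb - lam + rho * y)
        # y-update: minimize over y >= 0, i.e. project x + lam/rho onto the orthant.
        y = np.maximum(x + lam / rho, 0.0)
        # Dual update on the multiplier of the constraint x - y = 0.
        lam = lam + rho * (x - y)
    return x, y, lam
```

Both subproblems have closed forms here because $h$ is quadratic and $g$ is an indicator function; in P2 the objective couples $X$ and $Y$, which is why the classical analysis does not carry over directly.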
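31 ADMM for SymNMF
Partial augmented Lagrangian:
$$L(X, Y; \Lambda) = \frac{1}{2} \|X Y^T - Z\|_F^2 + \langle Y - X, \Lambda \rangle + \frac{\rho}{2} \|Y - X\|_F^2$$
- $\Lambda \in \mathbb{R}^{N \times K}$: dual variables
- $\langle \cdot, \cdot \rangle$: inner product operator
- $\rho > 0$
Y-subproblem:
$$\min_{Y \ge 0,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i} \frac{1}{2} \|X Y^T - Z\|_F^2 + \langle Y - X, \Lambda \rangle + \frac{\rho}{2} \|Y - X\|_F^2$$
X-subproblem:
$$\min_{X} \frac{1}{2} \|X Y^T - Z\|_F^2 + \langle Y - X, \Lambda \rangle + \frac{\rho}{2} \|Y - X\|_F^2$$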
32 Comparison with Classical ADMM
Classical ADMM:
$$\min_{x \in \mathcal{X},\ y \in \mathcal{Y}} h(x) + g(y) \quad \text{subject to } A x + B y = c$$
P2:
$$\min_{X, Y} \frac{1}{2} \|X Y^T - Z\|_F^2 \quad \text{subject to } Y \ge 0,\ X = Y,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i$$
Challenges:
- Nonconvex objective
- Objective function is non-separable
- Recent analysis results of ADMM for nonconvex problems do not apply [Hong et al 16] [Pong et al 15]
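33 Nonconvex Splitting SymNMF (NS-SymNMF)
Parameter update (iteration-dependent penalty parameter):
$$\beta^{(t)} = \frac{6}{\rho} \|X^{(t)} (Y^{(t)})^T - Z\|_F^2$$
Primal update:
$$Y^{(t+1)} = \arg\min_{Y \ge 0,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i} \frac{1}{2} \|X^{(t)} Y^T - Z\|_F^2 + \frac{\rho}{2} \|Y - X^{(t)} + \Lambda^{(t)}/\rho\|_F^2 + \underbrace{\frac{\beta^{(t)}}{2} \|Y - Y^{(t)}\|_F^2}_{\text{proximal term}}$$
$$X^{(t+1)} = \arg\min_{X} \frac{1}{2} \|X (Y^{(t+1)})^T - Z\|_F^2 + \frac{\rho}{2} \|X - \Lambda^{(t)}/\rho - Y^{(t+1)}\|_F^2$$
Dual update:
$$\Lambda^{(t+1)} = \Lambda^{(t)} + \rho (Y^{(t+1)} - X^{(t+1)})$$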
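The X-update is an unconstrained least-squares problem, so setting its gradient to zero gives the linear system $X (Y^T Y + \rho I) = Z Y + \Lambda + \rho Y$. The sketch below implements only this X-update and the dual update; the function names are hypothetical, and the constrained Y-update, which needs a dedicated solver, is deliberately left out.

```python
import numpy as np

def x_update(Y, Z, Lam, rho):
    """Minimizer over X of 0.5*||X Y^T - Z||_F^2 + (rho/2)*||X - Lam/rho - Y||_F^2."""
    K = Y.shape[1]
    rhs = Z @ Y + Lam + rho * Y
    # Solve X (Y^T Y + rho I) = rhs; the K x K system matrix is symmetric positive definite.
    return np.linalg.solve(Y.T @ Y + rho * np.eye(K), rhs.T).T

def dual_update(Lam, X, Y, rho):
    """Lambda^{(t+1)} = Lambda^{(t)} + rho * (Y^{(t+1)} - X^{(t+1)})."""
    return Lam + rho * (Y - X)
```

The linear system is only $K \times K$, so this step stays cheap even when $N$ is large.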
34 Convergence Analysis of ADMM (convex)
Convex case [Boyd et al 11], [Hong, Luo 12]:
$$\|X^{(t)} - X^*\|_2 + \|Y^{(t)} - Y^*\|_2 \to 0 \quad \text{and} \quad \|\Lambda^{(t)} - \Lambda^*\|_2 \to 0$$
where $(X^*, Y^*; \Lambda^*)$ is the globally optimal primal-dual pair.
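35 Convergence Analysis (Convergence Rate)
Define the proximal gradient of the augmented Lagrangian function as
$$\tilde{\nabla} L(X, Y, \Lambda) \triangleq \begin{bmatrix} Y^T - \mathrm{proj}_{\mathcal{Y}}\big[Y^T - \nabla_Y L(X, Y, \Lambda)\big] \\ \nabla_X L(X, Y, \Lambda) \end{bmatrix}$$
where the operator
$$\mathrm{proj}_{\mathcal{Y}}(W) \triangleq \arg\min_{Y \ge 0,\ \|Y_{i,:}\|_2^2 \le \tau,\ \forall i} \|W - Y\|_F^2 .$$
We use the following quantity to measure the progress of the algorithm:
$$P(X^{(t)}, Y^{(t)}, \Lambda^{(t)}) \triangleq \|\tilde{\nabla} L(X^{(t)}, Y^{(t)}, \Lambda^{(t)})\|_F^2 + \|X^{(t)} - Y^{(t)}\|_F^2 ,$$
whose two terms the slide labels "primal gap" and "dual gap".
If $\lim_{t \to \infty} P(X^{(t)}, Y^{(t)}, \Lambda^{(t)}) = 0$, then a KKT point of P2 is obtained.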
36 Numerical Results
Data sets:
- Synthetic data sets
- Real data sets
Algorithms compared:
- PGD: Projected Gradient Descent [Kuang, Park et al 12]
- PNewton: Projected Newton Method [Kuang, Park et al 12]
- SNMF: Eigenvalue-based SymNMF [Huang, Sidiropoulos 14]
- ANLS: Alternating Non-negative Least Squares [Kuang, Park et al 15]
- CD: Coordinate Descent [Vandaele, Gillis et al 16]
37 Numerical Results
Initialization:
- 20 random initializations (Y or X follows an i.i.d. uniform distribution in the range $[0, \tau]$)
- Every algorithm starts from the same initial point
Parameter choices for NS-SymNMF (accelerating the convergence rate):
- Initialization: $\rho^{(0)} = N\tau$, where $\tau$ is the average of the column norm of Z
- $\rho^{(t+1)} = \min\{\rho^{(t)} / (1 - \epsilon/\rho^{(t)}),\ 6.1 N \tau\}$, where $\epsilon = 10^{-3}$
- $\beta^{(t)} = 6 \xi^{(t)} \|X^{(t)} (Y^{(t)})^T - Z\|_F^2 / \rho^{(t)}$, where $\xi^{(t+1)} = \min\{\xi^{(t)} / (1 - \epsilon/\xi^{(t)}),\ 1\}$ and $\xi^{(1)} =$
38 Numerical Results (Synthetic Data)
Data Set I: Random symmetric matrices
- $M \in \mathbb{R}^{N \times N}$ with i.i.d. Gaussian entries
- $Z = M + M^T$
Data Set II: Adjacency matrices
- N = 2000, K = 4
- The numbers of data points within each cluster are 300, 500, 800, 400.
- Data points $\{x_i\} \in \mathbb{R}$, $i = 1, \ldots, N$. Mean: 2, 3, 6, 8; Variance: 0.5.
- Gaussian function: $Z_{i,j} = \exp\big(-(x_i - x_j)^2 / (2\sigma^2)\big)$, where $\sigma^2 = 0.5$
Relative objective value: $\|X X^T - Z\|_F^2 / \|Z\|_F^2$
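The two synthetic data sets and the relative objective value can be generated as sketched below. This is an illustration under stated assumptions: the function names are hypothetical, the Gaussian entries of $M$ are taken to be standard normal (the slide only says i.i.d. Gaussian), and the seeds are arbitrary.

```python
import numpy as np

def random_symmetric(N, seed=0):
    """Data set I: Z = M + M^T, with M having i.i.d. standard Gaussian entries (assumed)."""
    M = np.random.default_rng(seed).normal(size=(N, N))
    return M + M.T

def gaussian_similarity(sizes=(300, 500, 800, 400), means=(2.0, 3.0, 6.0, 8.0),
                        variance=0.5, sigma2=0.5, seed=0):
    """Data set II: scalar points from Gaussian clusters and their Gaussian-kernel similarity."""
    rng = np.random.default_rng(seed)
    # Draw n points around each cluster mean with the given variance.
    x = np.concatenate([rng.normal(m, np.sqrt(variance), size=n)
                        for m, n in zip(means, sizes)])
    labels = np.repeat(np.arange(len(sizes)), sizes)
    diff = x[:, None] - x[None, :]
    Z = np.exp(-diff ** 2 / (2.0 * sigma2))  # Z_ij = exp(-(x_i - x_j)^2 / (2 sigma^2))
    return x, labels, Z

def relative_objective(X, Z):
    """||X X^T - Z||_F^2 / ||Z||_F^2."""
    return np.linalg.norm(X @ X.T - Z, "fro") ** 2 / np.linalg.norm(Z, "fro") ** 2
```

With the default arguments, `gaussian_similarity()` returns a 2000 x 2000 similarity matrix matching the sizes on the slide; smaller `sizes` are convenient for quick experiments.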
39 Numerical Results (Random symmetric matrices)
Data set I: Random symmetric matrices (full rank): N = 500, K = Monte Carlo (MC) trials
40 Numerical Results (Adjacency matrices) Swamp
Data set II: Adjacency matrices: N = 2000, K = 4, 20 MC trials
Optimality gap: $\|X - \mathrm{proj}_+[X - \nabla_X(g(X, Y))]\|$
41 Numerical Results (Optimality)
Check local optimality:
- initialize $\delta$ as 1
- decrease it by 0.01 each time
- check the minimum eigenvalue of T
More examples:
- fix the ratio of the number of nodes within each cluster (i.e., 3 : 5 : 8 : 4)
- test on different total numbers of nodes
Table columns: N, $\lambda_{\min}(T)$, $\delta$, Local Optimality (true), in %.
42 Numerical Results (Real Data Set)
Text Mining (dense similarity matrix):
- Vertices: documents
- Edges: similarity between two documents
- Datasets: Reuters; topic detection and tracking 2 (TDT2) [Cai, et al 11]
Social Network (sparse similarity matrix):
- Vertices: individuals
- Edges: relationship
- Datasets: -Enron [Leskovec et al 09]; Brightkite (location-based social networking) [Cho et al 11]; Facebook [McAuley et al 12]
43 Numerical Results
Text mining data (dense similarity matrix): topic detection and tracking 2 (TDT2), N = 8,939 documents, K = 25 classes, Gaussian function (similarity)
Algorithms: NS-SymNMF, PGD, ANLS, SNMF, CD
Mean and Variance: 1.01e-2±5.35e e-2±7,34e e-2±1.25e e e-2±1.21e-6
44 Numerical Results
Social network data (sparse similarity matrix): Brightkite (location-based social networking), N = 58,228 people, 428,156 edges, K = 50
Algorithms: NS-SymNMF, ANLS, SNMF, CD
Mean and Variance: 8.75e-1±9.52e e-1±1.93e e ±1.49e-3
45 Joint factor analysis and latent clustering
$$\min_{W, H, S, M, \{d_i\}_{i=1}^{N}} \|X - D W H\|_F^2 + \lambda \|W - S M\|_F^2 + \eta \|H\|_F^2 \quad (23)$$
$$\text{s.t. } W \ge 0,\ H \ge 0,\ \|W(i,:)\|_2 = 1,\ \forall i \quad (24)$$
$$D = \mathrm{diag}(d_1, \ldots, d_N) \quad (25)$$
$$\|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (26)$$
- $\eta \ge 0$ is a regularization parameter
- Euclidean-distance-based K-means clustering on the unit 2-norm ball is equivalent to correlation-based clustering
- $\eta \|H\|_F^2$ controls the scaling ambiguity
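46 Joint factor analysis and latent clustering
$$\min_{W, H, Z, S, M, \{d_i\}_{i=1}^{N}} \|X - D W H\|_F^2 + \lambda \|W - S M\|_F^2 + \eta \|H\|_F^2 + \mu \|W - Z\|_F^2$$
$$\text{s.t. } W \ge 0,\ H \ge 0,\ \|Z(i,:)\|_2 = 1,\ \forall i \quad (27)$$
$$D = \mathrm{diag}(d_1, \ldots, d_N) \quad (28)$$
$$\|S(i,:)\|_0 = 1,\ S(i,k) \in \{0, 1\},\ \forall i, k \quad (29)$$
- $\mu \ge 0$, and Z is a slack variable
- a large $\mu$ enforces $W \approx Z$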
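47 Alternating Optimization
W-update:
$$W = \arg\min_{W \ge 0} \|X - D W H\|_F^2 + \lambda \|W - S M\|_F^2 + \mu \|W - Z\|_F^2 \quad (30)$$
H-update:
$$H = \arg\min_{H \ge 0} \|X - D W H\|_F^2 + \eta \|H\|_F^2 \quad (31)$$
$d_i$-update:
$$d_i = \arg\min_{d_i} \|X(i,:) - d_i W(i,:) H\|_2^2 \quad (32)$$
$$d_i = \frac{X(i,:)\, b_i}{b_i^T b_i}, \quad \text{where } b_i^T = W(i,:) H$$
Z-update:
$$Z = \arg\min_{\|Z(i,:)\|_2 = 1,\ \forall i} \|Z - W\|_F^2 \quad (33)$$
S, M-update: K-means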
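Two of these updates have simple closed forms and can be written out directly. The sketch below covers only the $d_i$-update (32) and the Z-update (33); the function names and the small `eps` guard against division by zero are assumptions, and the W-, H- and K-means updates are not shown. The Z-update relies on the fact that the closest unit-norm vector to a nonzero row $W(i,:)$ is $W(i,:)/\|W(i,:)\|_2$.

```python
import numpy as np

def update_d(X, W, H, eps=1e-12):
    """Closed-form d_i = X(i,:) b_i / (b_i^T b_i) with b_i^T = W(i,:) H, for all rows at once."""
    B = W @ H  # row i of B is b_i^T
    return np.sum(X * B, axis=1) / np.maximum(np.sum(B * B, axis=1), eps)

def update_Z(W, eps=1e-12):
    """Row-wise projection of W onto the unit 2-norm sphere (assumes no all-zero rows)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, eps)
```

Each $d_i$ is a one-dimensional least-squares fit of row $X(i,:)$ onto $b_i^T$, so the residual $X(i,:) - d_i b_i^T$ is orthogonal to $b_i$ at the solution.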
48 Numerical Results
- Reuters text corpus, top 41 clusters, 8213 documents, words
- Test for various numbers of clusters k
- For each k, 10 Monte-Carlo trials by randomly picking k clusters
Locally consistent concept factorization (LCCF) [Cai et al 11]:
$$\min_{U, V} \|X - X U V^T\|_F^2 + \lambda\, \mathrm{Tr}(V^T L V) \quad \text{s.t. } U \ge 0,\ V \ge 0 \quad (34)$$
L: Graph Laplacian. Intuition: ambient proximity implies latent proximity.
49 DNN
50 Other Works
- Tensor decomposition
- Distributed matrix factorization
51 PARAFAC2
PARAllel FACtor analysis 2 (PARAFAC2) [Harshman 72]:
Cross-Language Information Retrieval [Chew 07] and Multilingual Document Clustering [Romeo 14]:
$$X(:, :, k) = F_k D_k(C) A^T, \quad k = 1, \ldots, K$$
Figure: each slice $X(:, :, k)$ (terms by documents) is factored into $F_k$ (terms by concepts), $D_k(C)$ (concept weights) and $A^T$ (concepts by documents); the slices $X(:, :, 1), \ldots, X(:, :, 5)$ correspond to the languages French, Chinese, Spanish, Italian and English.
52 Distributed Matrix Factorization
$$\min_{X, Y} \|X Y^T - Z\|_F^2$$
Distributed matrix factorization:
$$\min_{U_i, Y_i,\ \forall i} \sum_{i=1}^{N} \|U_i Y_i^T - Z_i\|_F^2 \quad \text{subject to } U_i = U_j,\ \forall (i, j) \in \mathcal{E}$$
Figure: a network of seven nodes, node i holding data $Z_i$ and local variables $(U_i, Y_i)$, connected by the edge set $\mathcal{E}$.
Convergence analysis:
- KKT points [Hong 16]
- Global optimal solution (future work)
53 Conclusion
- Clustering Formulation: Spectral Clustering; Joint Factor Analysis and Latent Clustering
- Algorithms: Joint Factor Analysis and Latent Clustering; SymNMF
- Other Problems: Tensor decomposition; Distributed Matrix Factorization
54 Thanks for Your Attention!
55 References
- D. Kuang, S. Yun, and H. Park, SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering, Journal of Global Optimization, vol. 62, no. 3, pp , Jul
- S. Lu, M. Hong, and Z. Wang, A nonconvex splitting method for symmetric nonnegative matrix factorization: Convergence analysis and optimality, IEEE Transactions on Signal Processing, Feb
- B. Yang, X. Fu, and N. D. Sidiropoulos, Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering, 2016
- B. Yang, X. Fu, and N. D. Sidiropoulos, Learning from hidden traits: Joint factor analysis and latent clustering, IEEE Transactions on Signal Processing, accepted, Sep
First-order methods of solving nonconvex optimization problems: Algorithms, convergence, and optimality
Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2018 First-order methods of solving nonconvex optimization problems: Algorithms, convergence, and optimality
More informationA Stochastic Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization
A Stochastic Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization Songtao Lu Mingyi Hong Zhengdao Wang Iowa State University Iowa State University Iowa State University Abstract Symmetric
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)
More informationRecent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables
Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong 2014 Workshop
More information1 Matrix notation and preliminaries from spectral graph theory
Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationDoes Alternating Direction Method of Multipliers Converge for Nonconvex Problems?
Does Alternating Direction Method of Multipliers Converge for Nonconvex Problems? Mingyi Hong IMSE and ECpE Department Iowa State University ICCOPT, Tokyo, August 2016 Mingyi Hong (Iowa State University)
More informationBeyond Heuristics: Applying Alternating Direction Method of Multipliers in Nonconvex Territory
Beyond Heuristics: Applying Alternating Direction Method of Multipliers in Nonconvex Territory Xin Liu(4Ð) State Key Laboratory of Scientific and Engineering Computing Institute of Computational Mathematics
More informationCS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu
CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu Feature engineering is hard 1. Extract informative features from domain knowledge
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationPart 5: Penalty and augmented Lagrangian methods for equality constrained optimization. Nick Gould (RAL)
Part 5: Penalty and augmented Lagrangian methods for equality constrained optimization Nick Gould (RAL) x IR n f(x) subject to c(x) = Part C course on continuoue optimization CONSTRAINED MINIMIZATION x
More informationInfeasibility Detection and an Inexact Active-Set Method for Large-Scale Nonlinear Optimization
Infeasibility Detection and an Inexact Active-Set Method for Large-Scale Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with James V. Burke, University of Washington Daniel
More informationConvex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014
Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationAsynchronous Non-Convex Optimization For Separable Problem
Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent
More informationLecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem
Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R
More informationUC Berkeley Department of Electrical Engineering and Computer Science. EECS 227A Nonlinear and Convex Optimization. Solutions 5 Fall 2009
UC Berkeley Department of Electrical Engineering and Computer Science EECS 227A Nonlinear and Convex Optimization Solutions 5 Fall 2009 Reading: Boyd and Vandenberghe, Chapter 5 Solution 5.1 Note that
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More information8.1 Concentration inequality for Gaussian random matrix (cont d)
MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration
More informationDuality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725
Duality in Linear Programs Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: proximal gradient descent Consider the problem x g(x) + h(x) with g, h convex, g differentiable, and
More informationOn Optimal Frame Conditioners
On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationAccelerated primal-dual methods for linearly constrained convex problems
Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize
More informationAlgorithms for constrained local optimization
Algorithms for constrained local optimization Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Algorithms for constrained local optimization p. Feasible direction methods Algorithms for constrained
More informationminimize x subject to (x 2)(x 4) u,
Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationWHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ????
DUALITY WHY DUALITY? No constraints f(x) Non-differentiable f(x) Gradient descent Newton s method Quasi-newton Conjugate gradients etc???? Constrained problems? f(x) subject to g(x) apple 0???? h(x) =0
More informationDual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725
Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationConvex Optimization. Newton s method. ENSAE: Optimisation 1/44
Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)
More informationSpectral Clustering. Spectral Clustering? Two Moons Data. Spectral Clustering Algorithm: Bipartioning. Spectral methods
Spectral Clustering Seungjin Choi Department of Computer Science POSTECH, Korea seungjin@postech.ac.kr 1 Spectral methods Spectral Clustering? Methods using eigenvectors of some matrices Involve eigen-decomposition
More informationFast Coordinate Descent methods for Non-Negative Matrix Factorization
Fast Coordinate Descent methods for Non-Negative Matrix Factorization Inderjit S. Dhillon University of Texas at Austin SIAM Conference on Applied Linear Algebra Valencia, Spain June 19, 2012 Joint work
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More information10 Numerical methods for constrained problems
10 Numerical methods for constrained problems min s.t. f(x) h(x) = 0 (l), g(x) 0 (m), x X The algorithms can be roughly divided the following way: ˆ primal methods: find descent direction keeping inside
More informationMatrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY
Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY OUTLINE Why We Need Matrix Decomposition SVD (Singular Value Decomposition) NMF (Nonnegative
More information10725/36725 Optimization Homework 4
10725/36725 Optimization Homework 4 Due November 27, 2012 at beginning of class Instructions: There are four questions in this assignment. Please submit your homework as (up to) 4 separate sets of pages
More information1 Non-negative Matrix Factorization (NMF)
2018-06-21 1 Non-negative Matrix Factorization NMF) In the last lecture, we considered low rank approximations to data matrices. We started with the optimal rank k approximation to A R m n via the SVD,
More informationIntroduction to Alternating Direction Method of Multipliers
Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction
More informationDual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725
Dual Methods Lecturer: Ryan Tibshirani Conve Optimization 10-725/36-725 1 Last time: proimal Newton method Consider the problem min g() + h() where g, h are conve, g is twice differentiable, and h is simple.
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationYou should be able to...
Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set
More informationBlock Coordinate Descent for Regularized Multi-convex Optimization
Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline
More informationDistributed Convex Optimization
Master Program 2013-2015 Electrical Engineering Distributed Convex Optimization A Study on the Primal-Dual Method of Multipliers Delft University of Technology He Ming Zhang, Guoqiang Zhang, Richard Heusdens
More informationCertifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering
Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationAccelerated Block-Coordinate Relaxation for Regularized Optimization
Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth
More informationNONNEGATIVE matrix factorization (NMF) has become
1 Efficient and Non-Convex Coordinate Descent for Symmetric Nonnegative Matrix Factorization Arnaud Vandaele 1, Nicolas Gillis 1, Qi Lei 2, Kai Zhong 2, and Inderjit Dhillon 2,3, Fellow, IEEE 1 Department
More informationLearning From Hidden Traits: Joint Factor Analysis and Latent Clustering
1 Learning From Hidden Traits: Joint Factor Analysis and Latent Clustering Bo Yang, Student Member, IEEE, Xiao Fu, Member, IEEE, Nicholas D. Sidiropoulos, Fellow, IEEE arxiv:165.6711v1 [cs.lg] 1 May 16
More informationALADIN An Algorithm for Distributed Non-Convex Optimization and Control
ALADIN An Algorithm for Distributed Non-Convex Optimization and Control Boris Houska, Yuning Jiang, Janick Frasch, Rien Quirynen, Dimitris Kouzoupis, Moritz Diehl ShanghaiTech University, University of
More informationSupport Vector Machines and Kernel Methods
2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University
More informationNetwork Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)
Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationAlternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization
Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last time: dual ascent min x f(x) subject to Ax = b where f is strictly convex and closed. Denote
More informationDual Ascent. Ryan Tibshirani Convex Optimization
Dual Ascent Ryan Tibshirani Conve Optimization 10-725 Last time: coordinate descent Consider the problem min f() where f() = g() + n i=1 h i( i ), with g conve and differentiable and each h i conve. Coordinate
More informationHYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX PROGRAMMING
SIAM J. OPTIM. Vol. 8, No. 1, pp. 646 670 c 018 Society for Industrial and Applied Mathematics HYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX
Applications of Linear Programming
Applications of Linear Programming lecturer: András London University of Szeged Institute of Informatics Department of Computational Optimization Lecture 9 Non-linear programming In case of LP, the goal
ELE539A: Optimization of Communication Systems Lecture 15: Semidefinite Programming, Detection and Estimation Applications
ELE539A: Optimization of Communication Systems Lecture 15: Semidefinite Programming, Detection and Estimation Applications Professor M. Chiang Electrical Engineering Department, Princeton University March
ECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)
Data dependent operators for the spatial-spectral fusion problem
Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.
Overlapping Communities
Overlapping Communities Davide Mottin Hasso Plattner Institute Graph Mining course Winter Semester 2017 Acknowledgements Most of this lecture is taken from: http://web.stanford.edu/class/cs224w/slides GRAPH
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY
Finding normalized and modularity cuts by spectral clustering. Ljubljana 2010, October
Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubljana 2010, October Outline Find
Newton's Method. Javier Peña Convex Optimization 10-725/36-725
Newton's Method Javier Peña Convex Optimization 10-725/36-725 Last time: dual correspondences. Given a function $f : \mathbb{R}^n \to \mathbb{R}$, we define its conjugate $f^* : \mathbb{R}^n \to \mathbb{R}$, $f^*(y) = \max_x \; y^T x - f(x)$. Properties and
ENGG5781 Matrix Analysis and Computations Lecture 10: Non-Negative Matrix Factorization and Tensor Decomposition
ENGG5781 Matrix Analysis and Computations Lecture 10: Non-Negative Matrix Factorization and Tensor Decomposition Wing-Kin (Ken) Ma 2017 2018 Term 2 Department of Electronic Engineering The Chinese University
Clustering. SVD and NMF
Clustering with the SVD and NMF Amy Langville Mathematics Department College of Charleston Dagstuhl 2/14/2007 Outline Fiedler Method Extended Fiedler Method and SVD Clustering with SVD vs. NMF Demos with
Linear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
Distributed Optimization via Alternating Direction Method of Multipliers
Distributed Optimization via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University ITMANET, Stanford, January 2011 Outline precursors dual decomposition
Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen
Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen
Fast Nonnegative Matrix Factorization with Rank-one ADMM
Fast Nonnegative Matrix Factorization with Rank-one ADMM Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 92093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,
Adaptive Primal Dual Optimization for Image Processing and Learning
Adaptive Primal Dual Optimization for Image Processing and Learning Tom Goldstein Rice University tag7@rice.edu Ernie Esser University of British Columbia eesser@eos.ubc.ca Richard Baraniuk Rice University
Convex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
Mini-Course 1: SGD Escapes Saddle Points
Mini-Course 1: SGD Escapes Saddle Points Yang Yuan Computer Science Department Cornell University Gradient Descent (GD) Task: $\min_x f(x)$. GD does iterative updates $x_{t+1} = x_t - \eta_t \nabla f(x_t)$. Gradient Descent
Kernel methods, kernel SVM and ridge regression
Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;
E5295/5B5749 Convex optimization with engineering applications. Lecture 5. Convex programming and semidefinite programming
E5295/5B5749 Convex optimization with engineering applications Lecture 5 Convex programming and semidefinite programming A. Forsgren, KTH 1 Lecture 5 Convex optimization 2006/2007 Convex quadratic program
Written Examination
Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes
A DECOMPOSITION PROCEDURE BASED ON APPROXIMATE NEWTON DIRECTIONS
Working Paper 01 09 Departamento de Estadística y Econometría Statistics and Econometrics Series 06 Universidad Carlos III de Madrid January 2001 Calle Madrid, 126 28903 Getafe (Spain) Fax (34) 91 624
Inverse Power Method for Non-linear Eigenproblems
Inverse Power Method for Non-linear Eigenproblems Matthias Hein and Thomas Bühler Anubhav Dwivedi Department of Aerospace Engineering & Mechanics 7th March, 2017 1 / 30 OUTLINE Motivation Non-Linear Eigenproblems
STA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 12: Graph Clustering Cho-Jui Hsieh UC Davis May 29, 2018 Graph Clustering Given a graph $G = (V, E, W)$. $V$: nodes $\{v_1, \dots, v_n\}$. $E$: edges
Sparse and Regularized Optimization
Sparse and Regularized Optimization In many applications, we seek not an exact minimizer of the underlying objective, but rather an approximate minimizer that satisfies certain desirable properties: sparsity
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton's method: for $C^2$-smooth, unconstrained problems allow
A New Trust Region Algorithm Using Radial Basis Function Models
A New Trust Region Algorithm Using Radial Basis Function Models Seppo Pulkkinen University of Turku Department of Mathematics July 14, 2010 Outline 1 Introduction 2 Background Taylor series approximations
Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Last time: quasi-Newton methods. Consider the problem $\min_x f(x)$ with $f$ convex, twice differentiable, $\mathrm{dom}(f) = \mathbb{R}$
EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6)
EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement to the material discussed in
Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30
Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained
Data Mining Techniques
Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!
Online Nonnegative Matrix Factorization with General Divergences
Online Nonnegative Matrix Factorization with General Divergences Vincent Y. F. Tan (ECE, Mathematics, NUS) Joint work with Renbo Zhao (NUS) and Huan Xu (GeorgiaTech) IWCT, Shanghai Jiaotong University
ECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
Algorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
Optimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
Lecture 6. Regression
Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron
Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via convex relaxations Yuejie Chi Department of Electrical and Computer Engineering Spring
ARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization
Inexact Alternating Direction Method of Multipliers for Separable Convex Optimization Hongchao Zhang hozhang@math.lsu.edu Department of Mathematics Center for Computation and Technology Louisiana State
10-725/36-725 Optimization Midterm Exam
10-725/36-725 Optimization Midterm Exam November 6, 2012 NAME: ANDREW ID: Instructions: This exam is 1hr 20mins long Except for a single two-sided sheet of notes, no other material or discussion is permitted
Newton's Method. Ryan Tibshirani Convex Optimization 10-725/36-725
Newton's Method Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: dual correspondences. Given a function $f : \mathbb{R}^n \to \mathbb{R}$, we define its conjugate $f^* : \mathbb{R}^n \to \mathbb{R}$. Properties and examples: $f^*(y) = \max_x$
Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big