Efficient Iterative Semi-Supervised Classification on Manifold

Mehrdad Farajtabar, Hamid R. Rabiee, Amirreza Shaban, Ali Soltani-Farani
Digital Media Lab, AICTC Research Center, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
{farajtabar, shaban, a soltani}@ce.sharif.edu, rabiee@sharif.edu

Abstract: Semi-Supervised Learning (SSL) has become a topic of recent research that effectively addresses the problem of limited labeled data. Many SSL methods have been developed based on the manifold assumption; among them, Local and Global Consistency (LGC) is a popular method. The problem with most of these algorithms, and in particular with LGC, is the fact that their naive implementations do not scale well to the size of data. Time and memory limitations are the major problems faced in large-scale settings. In this paper, we provide theoretical bounds for gradient descent, and, to overcome the aforementioned problems, a new approximate Newton's method is proposed. Moreover, convergence analysis and theoretical bounds for the time complexity of the proposed method are provided. We claim that the number of iterations of the proposed methods depends logarithmically on the number of data points, which is a considerable improvement compared to the naive implementations. Experimental results on real-world datasets confirm the superiority of the proposed methods over LGC's default iterative implementation and a state-of-the-art factorization method.

Keywords: Semi-supervised learning, Manifold assumption, Local and global consistency, Iterative method, Convergence analysis

I. INTRODUCTION

Semi-supervised learning has become a popular approach to the problem of classification with limited labeled data in recent years [1]. To use unlabeled data effectively in the learning process, certain assumptions regarding the possible labeling functions and the underlying geometry need to hold [2]. In many real-world classification problems, data points lie on a low-dimensional manifold. The manifold assumption states that the labeling function varies smoothly with respect to the underlying manifold [3]. Methods utilizing the manifold assumption have proven effective in many applications, including image segmentation [4], handwritten digit recognition, and text classification [5].

Regularization is essentially the soul of semi-supervised learning based on the manifold assumption. Manifold regularization is commonly formulated as a quadratic optimization problem,

min_x (1/2) x^T A x - b^T x,

where A ∈ R^{n×n} and b, x ∈ R^n. It is in effect equivalent to solving the system of linear equations A x = b. Fortunately, A is a sparse symmetric positive definite matrix. Naive solutions to this problem require O(n^3) operations to solve for x, while methods that take the sparse structure of A into account can cost much less. Taking the inverse of A directly is an obviously bad choice for several reasons. First, computing the inverse requires O(n^3) operations regardless of the sparse structure of A. Second, A may be near singular, in which case the inverse operation is numerically unstable. Last, the inverse of A is usually not sparse, in which case a large amount of memory is needed to store and process it. To elaborate, note that semi-supervised learning is especially advantageous when there is a large amount of unlabeled data, which leads to better utilization of the underlying manifold structure. For example, consider the huge number of unlabeled documents or images on the web that may be used to improve classification results.
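To make the preceding point concrete, the following minimal sketch (illustrative only, not from the paper; the toy graph, the 0.5 ridge term, and the library calls are assumptions) builds a sparse symmetric positive definite system of the kind described above and solves A x = b with a sparse direct solver instead of ever forming A^{-1}:

```python
# Sketch: a Laplacian-plus-ridge system solved sparsely, without a dense inverse.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
rng = np.random.default_rng(0)

# Toy sparse symmetric nonnegative weight matrix (stands in for a k-NN graph).
W = sp.random(n, n, density=0.01, random_state=0)
W = W + W.T
L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W   # combinatorial Laplacian (PSD)
A = (L + 0.5 * sp.eye(n)).tocsc()                      # ridge term makes it positive definite
b = rng.standard_normal(n)

x = spla.spsolve(A, b)                                 # exploits sparsity
print(np.linalg.norm(A @ x - b))                       # residual close to zero
```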
In these large-scale settings, ordinary implementations are not effective, because time and memory limitations are an important concern in SSL methods based on the manifold assumption []. There are commonly two approaches to overcoming this problem. First, one may reformulate the manifold regularization problem in a new form more suitable for large-scale settings. For example, [6] considers a linear base kernel and thus requires an inverse operation on a much smaller matrix, and [7] uses a sparsified manifold regularizer with core vector machines, which were recently proposed for scaling kernel methods to large-scale data. The second approach, which is the focus of this paper, relies on factorization, optimization, or iterative procedures to solve the original manifold regularization formulation. Iterative methods are of particular interest. Label propagation (LP) [8] is an iterative algorithm for computing the harmonic solution [9], which is a variation of the manifold regularization problem. Another naturally iterative manifold regularization algorithm is Local and Global Consistency (LGC) [10], upon which we build our work. Linear neighborhood propagation (LNP) [11] is yet another iterative method, which differs from other manifold learning methods mostly in the way the neighborhood graph is constructed. The problem with most of these iterative methods is that, although they are claimed to converge quickly, no analytical guarantee or proof is given for that claim.

In this paper we conduct a theoretical analysis of iterative methods for LGC. We apply gradient descent to LGC and derive an analytical bound for the number of iterations and its dependence on the number of data points. These bounds also hold for other manifold regularization problems such as the harmonic solution and Tikhonov regularization. We then show that LGC's iterative procedure may be improved through an approximation of the inverse Hessian and present a detailed convergence analysis. Again, a theoretical bound is derived for the number of iterations. We show that these iterative implementations require O(log n) sparse matrix-vector multiplications to compute LGC's solution with sufficient accuracy. It is then proved that LGC's iterative procedure is a special case of our proposed method. Finally, the proposed methods are compared with LGC's iterative procedure and a state-of-the-art factorization method based on Cholesky decomposition.

The rest of the paper is organized as follows. In Section II some related work in the domain of optimization, factorization, and iterative methods is introduced. Section III provides a basic overview of LGC and introduces the notation. Section IV provides a detailed analysis of gradient descent applied to LGC. In Section V we show how LGC's iterative procedure may be improved and derive further theoretical bounds. Section VI gives experimental results validating the derived bounds, after which the paper is concluded in Section VII.

II. RELATED WORKS

Methods such as LQ, LU, or Cholesky factorization overcome the problems of the inverse operation by factorizing A into matrices with special structure that greatly simplify computations, especially when A is sparse. In particular, Cholesky factorization best fits our problem by making use of the symmetry and positive definiteness of A. It decomposes A as P U^T U P^T, where P is a permutation matrix and U is upper triangular with positive diagonal elements. Heuristics are used to choose a matrix P that leads to a sparse U. In some instances these heuristics fail and the resulting algorithm may not be as computationally efficient as expected [12].

Iterative methods are another well-studied approach to the problem. Two views of the problem exist. Considering the problem in its optimization form, solutions such as gradient descent, conjugate gradient, steepest descent, and quasi-Newton methods become evident. Taking the machine learning viewpoint leads to more meaningful iterative methods, among them LP, LNP, and LGC, which were introduced in the previous section. LGC's iterative procedure is useful in many other applications, so improving and analyzing it may be helpful; for example, [13] proposed an iterative procedure based on LGC for ranking on the web, and [14] used similar ideas in image retrieval. As stated before, the problem with LGC's or LP's iterative procedure is that no analysis is provided of the number of iterations needed for convergence. Moreover, no explicit stopping criterion is mentioned, which is essential for bounding the number of iterations.

Gradient descent is one of the simplest iterative solutions to any optimization problem; however, beyond this simplicity, its linear convergence rate depends strongly on the condition number of the Hessian [15]. Conjugate gradient is a method especially designed to solve large systems of linear equations. A set of directions conjugate with respect to A is chosen.
In each iteration the objective function is minimized along one of these directions. In theory the method converges in at most n iterations, with each iteration costing as much as a sparse matrix-vector multiplication. While this makes conjugate gradient a suitable choice, its inherent numerical instability in finding conjugate directions can make the procedure slower than expected. [16], [] apply conjugate gradient to the harmonic solution, with results both superior and inferior to LP depending on the dataset in use. Quasi-Newton methods exhibit super-linear convergence: at each iteration the inverse Hessian in Newton's method is replaced by an approximation. These methods are not helpful unless the approximation is sparse; however, sparse quasi-Newton methods have an empirically lower convergence rate than low-storage quasi-Newton methods [17], so they are not useful here. Moreover, for our problem, in which the Hessian is constant, computing an approximation to the inverse Hessian at every iteration is costly. In our proposed algorithm we avoid this cost by computing a sufficiently precise and sparse approximation of the inverse Hessian once, at the start.

III. BASICS AND NOTATIONS

Consider the general problem of semi-supervised learning. Let X_u = {x_1, ..., x_u} and X_l = {x_{u+1}, ..., x_{u+l}} be the sets of unlabeled and labeled data points respectively, where n = u + l is the total number of data points. Also let y be a vector of length n with y_i = 0 for unlabeled x_i, and y_i equal to +1 or -1 according to the class label for labeled data points. Our goal is to predict the labels of X = X_u ∪ X_l as f, where f_i is the label associated with x_i for i = 1, ..., n. It is usual to construct a similarity graph over the data, using methods like weighted k-NN, for better performance and accuracy []. Let W be the n × n weight matrix

W_ij = exp(-||x_i - x_j||^2 / (2σ^2)),

where σ is the bandwidth parameter. Define the diagonal matrix D with nonzero entries D_ii = Σ_{j=1}^{n} W_ij. Symmetrically normalize W as S = D^{-1/2} W D^{-1/2}. The Laplacian matrix is L = I - S.
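As an illustration of the graph quantities just defined, here is a minimal sketch (assumptions: small dense data, brute-force k-NN; the function name and defaults are hypothetical) that builds W with Gaussian weights, the normalized similarity S = D^{-1/2} W D^{-1/2}, and the Laplacian L = I - S:

```python
# Sketch of the graph construction described above (not the authors' code).
import numpy as np
from scipy.spatial.distance import cdist

def normalized_laplacian(X, k=5, sigma=1.0):
    n = X.shape[0]
    D2 = cdist(X, X, 'sqeuclidean')              # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D2[i])[1:k + 1]          # k nearest neighbors, skipping self
        W[i, nn] = np.exp(-D2[i, nn] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                        # symmetrize the k-NN graph
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt               # symmetric normalization
    L = np.eye(n) - S                             # normalized Laplacian
    return W, S, L
```

In practice W and S would be stored as sparse matrices, since each row has only k nonzero entries.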

The family of manifold regularization algorithms can be formulated as the following optimization problem:

min_f f^T Q f + (f - y)^T C (f - y),     (3)

where Q is a regularization matrix, usually the Laplacian itself, and C is a diagonal matrix with C_ii equal to the importance of the i-th node sticking to its initial value y_i. The first term represents smoothness of the predicted labels with respect to the underlying manifold, and the second term is the squared error of the predicted labels compared with the initial ones, weighted by C. Choosing different Q and C leads to various manifold classification methods [5], [], [9], [3]. In LGC, Q = L and C = µI. It may easily be shown that the solution is

f* = (L + C)^{-1} C y = (1 - α)(I - αS)^{-1} y,     (4)

where α = 1/(1 + µ). The authors of [10] propose an iterative algorithm to compute this solution:

f_{t+1} = α S f_t + (1 - α) y.     (5)

Since 0 < α < 1 and the eigenvalues of S lie in [-1, 1], this iterative algorithm converges to the solution of LGC [10]. In summary, the manifold regularization problem casts into minimizing

R(f) = (1/2) f^T L f + (1/2)(f - y)^T C (f - y),     (6)

where the factor 1/2 does not change the minimizer and simplifies the gradient. Throughout the paper, R_t and f_t denote the objective value and the point at the t-th iteration of the algorithm, and R* and f* denote the corresponding optimal ones.

IV. ANALYSIS OF GRADIENT DESCENT

The gradient of (6) is ∇R = L f + C(f - y), which leads to the gradient descent update rule

f_{t+1} = f_t - α_t (L f_t + C(f_t - y)).     (7)

The stopping criterion is ||∇R|| ≤ ε. Choosing the step size α_t appropriately is essential for convergence. Following [15], applying exact line search to our problem ensures linear convergence, and at iteration t we have

t ≤ log((R_0 - R*) / (R_t - R*)) / log(1/z),     (8)

where z is a constant equal to 1 - λ_min(L + C) / λ_max(L + C). For a deeper analysis of the method we need the following lemma.

Lemma 1 [18]. If λ_m and λ_M are the smallest and largest eigenvalues of L respectively, then 0 = λ_m < λ_M ≤ 2.

Using the above lemma and the fact that C = µI, we have λ_min(L + C) = µ and λ_max(L + C) = µ + λ_M ≤ µ + 2.

Lemma 2. For the convex function R of f in (6) the following hold:

R(f) - R* ≥ ||∇R(f)||^2 / (2 λ_max(∇^2 R)),     (9)
R(f) - R* ≤ ||∇R(f)||^2 / (2 λ_min(∇^2 R)),     (10)
R(f) - R* ≤ (λ_max(∇^2 R) / 2) ||f - f*||^2,     (11)
||f - f*|| ≥ ||∇R(f)|| / λ_max(∇^2 R).     (12)

Proof: Considering that the Hessian is a constant matrix, the proofs of (9) and (10) can be found in standard optimization texts such as [15]. For (11) we need the following [15]:

R(h) ≤ R(f) + ∇R(f)^T (h - f) + (λ_max(∇^2 R) / 2) ||h - f||^2.     (13)

Substituting f* for f and f for h, and using ∇R(f*) = 0, we get

R(f) ≤ R(f*) + (λ_max(∇^2 R) / 2) ||f - f*||^2,     (14)

and the third inequality is proved. Combining it with (9) proves the fourth inequality.

Theorem 1. The maximum number of iterations of gradient descent with exact line search and fixed ε and µ is O(log n).

Proof: Consider the iteration t just before stopping, i.e., when ||∇R_t|| > ε and ||∇R_{t+1}|| ≤ ε. Using equation (9) and Lemma 1,

R_t - R* ≥ ||∇R_t||^2 / (2(λ_M + µ)) > ε^2 / (2(λ_M + µ)).     (15)

Inserting this into (8) yields

t ≤ log(2(λ_M + µ)(R_0 - R*) / ε^2) / log(1 + µ/λ_M).     (16)

In order to find an upper bound for R_0 - R*, inequality (11) is used:

R_0 - R* ≤ ((λ_M + µ)/2) ||f_0 - f*||^2 ≤ ((λ_M + µ)/2) n,     (17)

where in the last inequality we use the fact that f_0 = 0 and that the elements of f* are in [-1, 1]. Using this in (16) we reach

t ≤ log((λ_M + µ)^2 n / ε^2) / log(1 + µ/λ_M) ≤ log((2 + µ)^2 n / ε^2) / log(1 + µ/2),     (18)

which is O(log n).
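For concreteness, the sketch below (not the authors' implementation; function name and defaults are illustrative) applies update (7) with exact line search to the quadratic objective (6) with C = µI. For a quadratic, the exact step size along the negative gradient g has the closed form g^T g / (g^T (L + C) g). Dense arrays are used only to keep the sketch short; in practice L is sparse.

```python
# Sketch: gradient descent with exact line search for R(f) = 0.5 f'Lf + 0.5 (f-y)'C(f-y).
import numpy as np

def lgc_gradient_descent(L, y, mu=0.5, eps=1e-3, max_iter=10_000):
    n = L.shape[0]
    A = L + mu * np.eye(n)             # Hessian L + C with C = mu * I
    b = mu * y                         # so that grad R(f) = A f - b
    f = np.zeros(n)
    for t in range(max_iter):
        g = A @ f - b                  # gradient
        if np.linalg.norm(g) <= eps:   # stopping criterion ||grad R|| <= eps
            break
        alpha = (g @ g) / (g @ (A @ g))    # exact line search step for a quadratic
        f = f - alpha * g
    return f, t
```

Note that the step size is recomputed at every iteration from a couple of extra matrix-vector products, which matches the per-iteration cost discussion that follows.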

Each iteration of gradient descent in (7) consists of two steps. First, the step size α_t is computed, which takes a fixed number of matrix-vector multiplications. Next, the gradient L f_t + C(f_t - y) is computed, which costs the same. Considering that all the matrices involved are sparse, because L is constructed using k-NN and C is diagonal, each iteration amounts to a few sparse matrix-vector multiplications. Thus the total cost of each iteration is O(kn), where k is the neighborhood size used in constructing the similarity graph. Putting these together, we arrive at an O(kn log n) time complexity for computing the solution of LGC with gradient descent, i.e., an O(n log n) rate of growth with respect to the number of data points n, which is considerably less than the O(n^3) complexity of the naive inverse, or O(n^2) with sparsity taken into consideration.

It is easy to show that the analysis presented above is valid for other Laplacians L and matrices C, i.e., applying gradient descent to other manifold regularization methods, such as the harmonic solution and Tikhonov regularization, leads to the same bound. An interesting feature of the bound derived in (18) is that it is independent of the dataset in use: replacing λ_M by its upper bound in (18) eliminates the dependence of the bound on the data. This independence, together with the bound being sufficiently tight, makes it appropriate for data-independent practical implementation.

V. SPARSE APPROXIMATION OF NEWTON'S METHOD

Newton's update rule for our problem is

f_{t+1} = f_t - α_t (∇^2 R)^{-1} ∇R.     (19)

For our quadratic problem one iteration is sufficient to reach the optimum with α = 1; however, we wish to find a sparse approximation of the inverse Hessian. We show that using a sparse approximation of the inverse Hessian leads to an iterative method with an acceptable convergence rate. As an interesting result, in a special case our method reduces to LGC. We start by approximating the inverse Hessian:

(∇^2 R)^{-1} = (L + C)^{-1} = (I - S + C)^{-1} = [(I + C)(I - (I + C)^{-1} S)]^{-1} = (I - (I + C)^{-1} S)^{-1} (I + C)^{-1} = Σ_{i=0}^{∞} ((I + C)^{-1} S)^i (I + C)^{-1}.     (20)

The last equality holds because the eigenvalues of (I + C)^{-1} S are all less than one in magnitude. Using the first m terms of this series gives an approximation of the inverse Hessian:

(∇^2 R)^{-1} ≈ Σ_{i=0}^{m-1} ((I + C)^{-1} S)^i (I + C)^{-1}.     (21)

Rewriting Newton's method with the approximate inverse Hessian results in the update rule below:

f_{t+1} = f_t - Σ_{i=0}^{m-1} ((I + C)^{-1} S)^i (I + C)^{-1} (L f_t + C(f_t - y))
        = f_t - Σ_{i=0}^{m-1} ((I + C)^{-1} S)^i ((I - (I + C)^{-1} S) f_t - (I + C)^{-1} C y)
        = f_t - (I - ((I + C)^{-1} S)^m) f_t + Σ_{i=0}^{m-1} ((I + C)^{-1} S)^i (I + C)^{-1} C y
        = ((I + C)^{-1} S)^m f_t + Σ_{i=0}^{m-1} ((I + C)^{-1} S)^i (I + C)^{-1} C y.     (22)

In summary, it can be restated as

f_{t+1} = H^m f_t + g_m,     (23)

where

H = (I + C)^{-1} S,     (24)
g_m = Σ_{i=0}^{m-1} H^i (I + C)^{-1} C y.     (25)

This update rule is performed iteratively from an initial f_0 until the stopping criterion ||∇R|| ≤ ε is reached.

Theorem 2. The approximate Newton's method in (23) converges to the optimal solution of LGC.

Proof: Unfolding the update rule in (23) leads to

f_t = H^{mt} f_0 + Σ_{i=0}^{t-1} H^{mi} g_m = H^{mt} f_0 + Σ_{i=0}^{t-1} H^{mi} Σ_{j=0}^{m-1} H^j (I + C)^{-1} C y = H^{mt} f_0 + Σ_{i=0}^{mt-1} H^i (I + C)^{-1} C y.     (26)

Letting t tend to infinity gives the final solution. Since the magnitudes of the eigenvalues of H are less than one, H^{mt} f_0 → 0, and

lim_{t→∞} f_t = (I - H)^{-1} (I + C)^{-1} C y = (L + C)^{-1} C y,     (27)

which is equal to f* in (4).
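The update (23)-(25) can be sketched as follows for the LGC case C = µI (a minimal illustration under that assumption, not the authors' code; in practice one keeps S sparse and applies H repeatedly rather than forming H^m as a dense matrix):

```python
# Sketch of the approximate Newton update f_{t+1} = H^m f_t + g_m with C = mu * I.
# For m = 1 this is exactly the LGC iteration f_{t+1} = alpha*S*f_t + (1-alpha)*y.
import numpy as np

def approx_newton(S, y, mu=0.5, m=2, eps=1e-3, max_iter=10_000):
    n = S.shape[0]
    H = S / (1.0 + mu)                           # (I + C)^{-1} S with C = mu I
    c = (mu / (1.0 + mu)) * y                    # (I + C)^{-1} C y
    g_m = sum(np.linalg.matrix_power(H, i) @ c for i in range(m))
    Hm = np.linalg.matrix_power(H, m)
    A = (np.eye(n) - S) + mu * np.eye(n)         # L + C, used only for the stopping test
    f = np.zeros(n)
    for t in range(max_iter):
        if np.linalg.norm(A @ f - mu * y) <= eps:    # ||grad R|| <= eps
            break
        f = Hm @ f + g_m
    return f, t
```

Keeping m small preserves the sparsity of H^m when S comes from a k-NN graph, which is what keeps each iteration cheap.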

Theorem 3. For the approximate Newton's method in (23), the stopping criterion ||∇R|| ≤ ε is reached in O(log n) iterations with respect to the number of data points n.

Proof: Since f* is a fixed point of (23),

||f_{t+1} - f*|| = ||H^m f_t + g_m - (H^m f* + g_m)|| = ||H^m (f_t - f*)||.     (28)

H^m is symmetric, so ||H^m x|| ≤ ρ(H^m) ||x||, where ρ denotes the largest eigenvalue magnitude. Therefore

||f_t - f*|| ≤ ρ(H^m)^t ||f_0 - f*|| = ρ((I + C)^{-1} S)^{mt} ||f_0 - f*|| ≤ (1/(1 + µ))^{mt} ||f_0 - f*||.     (29)

By rewriting the above inequality one sees that the maximum number of iterations is bounded by

t ≤ log(||f_0 - f*|| / ||f_t - f*||) / (m log(1 + µ)).     (30)

As in gradient descent, consider the iteration t just before the stopping criterion is met, i.e., when ||∇R_t|| > ε and ||∇R_{t+1}|| ≤ ε. Using equation (12) we have

||f_t - f*|| ≥ ||∇R_t|| / λ_max(L + C) > ε / (λ_M + µ).     (31)

The maximum number of iterations is thus bounded above by

t ≤ log((λ_M + µ) ||f_0 - f*|| / ε) / (m log(1 + µ)) ≤ log((2 + µ) √n / ε) / (m log(1 + µ)),     (32)

where the last inequality uses λ_M ≤ 2, f_0 = 0, and ||f*|| ≤ √n. Similar to gradient descent, an O(log n) dependence on the number of data points is derived for our approximate Newton's method. The sparsity degree of H^m is k^m, so matrix-vector operations with this matrix cost O(k^m n). As the approximation becomes more exact, H^m becomes less sparse; thus as m increases the number of iterations decreases, as can be seen from (32), but the cost of each iteration grows. Empirically, m need not be chosen larger than 3, so we can treat it as a constant and obtain an O(k^3 n log n) dependence on the number of data points for the whole algorithm. Since k is chosen independently of n and is usually constant, the algorithm's time complexity grows as O(n log n) with respect to the number of data points.

Figure 1: Demonstration of the steps taken by gradient descent and the approximate Newton's method (two values of m) for two data points from MNIST. The algorithms start from the top-left point and move toward the optimal point at the bottom right.

Similar to gradient descent, the bound derived in (32) is independent of the dataset, which, together with its tightness, is a good feature for practical implementation. Experiments show that the bound derived here is tighter than the one for gradient descent, and of course the number of iterations for the approximate Newton's method is much smaller than that for gradient descent.

As a special case, we claim that for m = 1 the algorithm is the same as LGC's iterative procedure. Recalling that C = µI,

f_{t+1} = H f_t + g_1 = (I + C)^{-1} S f_t + (I + C)^{-1} C y = (1/(1 + µ)) S f_t + (µ/(1 + µ)) y = α S f_t + (1 - α) y,     (33)

which is the same as (5).

Figure 1 shows how increasing m affects the steps taken by the optimization algorithm, in contrast to the steps taken by gradient descent, for simulations on the MNIST dataset. Gradient descent is extremely dependent on the condition number of the Hessian; for high condition numbers it usually takes a series of zigzag steps to reach the optimum. Approximating the Newton step refines the search direction and decreases the zigzag effect. Figure 1 shows that with the larger value of m the steps form approximately a straight line. The Newton step for quadratic problems points directly to the optimal point; the trace of the approximate method with the larger m closely coincides with the true direction to the optimum, indicating how well the inverse Hessian is approximated in the proposed method. This is the reason for the small number of iterations the approximate method needs to converge compared with gradient descent. Experiments validating this improvement are presented in the next section.
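As a quick numeric sanity check of the special case above (the m = 1 update coinciding with the LGC rule (5)), the following toy snippet (random similarity matrix and labels, illustrative values) compares the two updates:

```python
# Check that H f + g_1 equals alpha * S f + (1 - alpha) * y for C = mu * I.
import numpy as np

rng = np.random.default_rng(0)
n, mu = 6, 0.5
M = np.abs(rng.standard_normal((n, n))); W = (M + M.T) / 2     # toy symmetric weights
d = W.sum(axis=1); S = W / np.sqrt(np.outer(d, d))             # D^{-1/2} W D^{-1/2}
y = rng.choice([-1.0, 1.0], size=n)
f = rng.standard_normal(n)

alpha = 1.0 / (1.0 + mu)
lhs = (S / (1.0 + mu)) @ f + (mu / (1.0 + mu)) * y             # H f + g_1
rhs = alpha * (S @ f) + (1.0 - alpha) * y                      # LGC iteration (5)
print(np.allclose(lhs, rhs))                                   # True
```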

VI. EXPERIMENTS

For the experiments three real-world datasets are used: MNIST for digit recognition, Covertype for forest cover prediction, and Classic for text categorization. These rather large datasets are chosen to better simulate a large-scale setting, for which naive solutions, such as the inverse operation, are not applicable in terms of memory and time.

MNIST is a collection of 60,000 handwritten digit samples. For classification we choose the data points of two digit classes, one of them digit 8. Each sample is of dimension 784. No preprocessing is done on the data. The forest Covertype dataset was collected for predicting forest cover type from cartographic variables. It includes seven classes and 581,012 samples of dimension 54. We randomly select samples of two of the cover types and normalize them such that each feature is in [0, 1]. The Classic collection is a benchmark dataset in text mining. It consists of 4 different document collections: CACM (3204 documents), CISI (1460 documents), CRAN (1398 documents), and MED (1033 documents). We try to separate the first category from the others. Terms are single words; the minimum term length is 3; a term appears in at least 3 documents and in at most 95% of the documents. Moreover, Porter's stemming is applied during preprocessing. Features are weighted with the TF-IDF scheme and normalized to unit length.

For all the datasets we use the same setting: adjacency matrices are constructed using 5-NN, with the bandwidth σ set to the mean of the standard deviation of the data. A fixed percentage of the data points is labeled. µ is set to .5, and choosing ε = .5 empirically ensures convergence to the optimal solutions. The number of iterations, accuracy, and distance to the optimum are reported as averages over runs with different random labelings.

The algorithms are run on the datasets and the results are depicted and discussed in the following. Figure 2 shows the number of iterations of the three iterative methods with respect to the number of data points. The solutions of the iterative methods have almost converged to the optimum, as depicted in Figure 3. LGC's default implementation is the worst among the three, gradient descent is second, and our approximate Newton's method has the fastest convergence rate, consistently across the three diverse datasets. Note that LGC corresponds to the approximate method with m = 1 and, as indicated in Figure 1, has a better direction than gradient descent, so it may be surprising that it needs more iterations than gradient descent. The key point is the line search: although the direction proposed by gradient descent is worse than that of LGC, exact line search lets gradient descent reach the optimum faster. If we combine our approximate method with an exact line search we obtain even fewer iterations; however, it was observed empirically that, due to the time consumed by the line search, there is no improvement in terms of running time. Another important point about the diagrams in Figure 2 is the order of growth with respect to the number of data points, which is consistent with the logarithmic growth derived in the previous sections. This makes LGC with an iterative implementation a good choice for large-scale SSL tasks. To illustrate how tight the bounds derived for the iterative methods are, we insert the parameters into equations (32) and (18) to get 9, 38, and 97 iterations for the approximate method with its two settings of m and for gradient descent, respectively, which may be compared with the empirical values in the diagrams of Figure 2. Interestingly, the diagrams show that the derived bounds are quite tight regardless of the dataset.
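The bound evaluation described above can be reproduced along the following lines; the formulas are the reconstructed data-independent forms of (18) and (32) (using λ_M ≤ 2), so treat these helper functions as an illustration rather than the authors' exact expressions:

```python
# Sketch: evaluating the data-independent iteration bounds (18) and (32).
import numpy as np

def gd_bound(n, mu, eps):
    # (18): t <= log((2 + mu)^2 * n / eps^2) / log(1 + mu / 2)
    return np.log((2 + mu) ** 2 * n / eps ** 2) / np.log(1 + mu / 2)

def approx_newton_bound(n, mu, eps, m):
    # (32): t <= log((2 + mu) * sqrt(n) / eps) / (m * log(1 + mu))
    return np.log((2 + mu) * np.sqrt(n) / eps) / (m * np.log(1 + mu))
```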
Figure 3 shows the accuracy of the iterative methods compared with a factorization method, CHOLMOD [19], which uses Cholesky factorization to solve systems of linear equations quickly. Since computing the exact solution via the inverse is impractical, we use a factorization method to solve for the exact solution and compare it with the solutions of the iterative methods. As seen from the diagrams, for all three datasets the solutions of the iterative methods are sufficiently close to the optimal solution, with the numbers of iterations shown in Figure 2.

Figure 4 compares the distance to the optimum of the different methods at each iteration and shows how these methods converge to the optimum. As expected from the previous results, the approximate Newton's method with the larger m has the fastest convergence, while LGC is the slowest. As stated before, the superiority of gradient descent over LGC in terms of the number of iterations is due to its line search, not to the direction chosen by the method.

Figure 5 shows the time needed to compute the solution. Figure 5a compares our approximate Newton's method with CHOLMOD, a state-of-the-art method for solving large systems of linear equations; the iterative method is clearly superior to CHOLMOD. Figure 5b compares the running times of the different iterative methods. Again the proposed method with the larger m is the best, but this time LGC performs better than gradient descent because of the overhead imposed by the line search. As the number of data points grows, the difference between the methods becomes more evident. The time growth is of order O(n log n), as predicted by Theorems 1 and 3.
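As a rough illustration of how the comparison against a factorization-based reference solution can be set up, the sketch below uses SciPy's sparse LU factorization as a stand-in for CHOLMOD (an assumption; the function name and µ default are illustrative) to obtain f* and measure the distance of an iterative solution from it:

```python
# Sketch: distance of an iterative solution to the "exact" factorization-based one.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def distance_to_optimum(L_sparse, y, f_iter, mu=0.5):
    n = L_sparse.shape[0]
    A = (L_sparse + mu * sp.eye(n)).tocsc()    # L + C with C = mu * I
    f_star = spla.splu(A).solve(mu * y)        # reference solution via sparse factorization
    return np.linalg.norm(f_iter - f_star)
```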

Figure 2: Number of iterations of the three iterative methods with respect to the number of data points, for (a) MNIST, (b) Covertype, and (c) Classic.

Figure 3: Accuracy of the iterative methods compared with CHOLMOD, for (a) MNIST, (b) Covertype, and (c) Classic.

Figure 4: Distance from the optimum for the three methods with respect to the iteration number, for (a) MNIST, (b) Covertype, and (c) Classic.

VII. CONCLUSION AND FUTURE WORKS

In this paper, a novel approximation to Newton's method is proposed for solving the manifold regularization problem, along with a theoretical analysis of the number of iterations. We proved that the number of iterations has a logarithmic dependence on the number of data points. We also applied gradient descent to this problem and proved that its number of iterations likewise grows logarithmically with the number of data points. The logarithmic dependence makes iterative methods a reasonable approach when a large amount of data is being classified. It is notable that the derived bounds are empirically tight independent of the dataset in use, which is practically an important feature of an algorithm. We derived LGC's iterative procedure as a special case of our proposed approximate Newton's method. Our method is based on an approximation of the inverse Hessian; the more exact the approximation, the better the chosen search direction. Experimental results confirm the improvement of our proposed method over LGC's iterative procedure without any loss in classification accuracy. The improvement of our approximate method over gradient descent is also shown both theoretically and empirically. A theoretical analysis of robustness against noise, incorporating a low-cost line search into the proposed method, and finding lower bounds or tighter upper bounds on the number of iterations are, to name a few, interesting problems that remain as future work.

Figure 5: Comparison of the time needed to compute the solution for the iterative methods and CHOLMOD, for (a) and (b) MNIST.

REFERENCES

[1] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, Carnegie Mellon University, 2005.
[2] O. Chapelle, B. Scholkopf, and A. Zien, Semi-supervised Learning. MIT Press, Cambridge, MA, 2006.
[3] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," Journal of Machine Learning Research, vol. 7, 2006.
[4] O. Duchenne, J. Audibert, R. Keriven, J. Ponce, and F. Ségonne, "Segmentation by transduction," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[5] M. Belkin and P. Niyogi, "Using manifold structure for partially labeled classification," in NIPS, 2002.
[6] V. Sindhwani, P. Niyogi, M. Belkin, and S. Keerthi, "Linear manifold regularization for large scale semi-supervised learning," in Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data, 2005.
[7] I. Tsang and J. Kwok, "Large-scale sparsified manifold regularization," in Advances in Neural Information Processing Systems 19, 2007.
[8] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CALD-02-107, 2002.
[9] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in ICML, 2003.
[10] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," in NIPS, 2003.
[11] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
[12] A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, ser. Prentice-Hall Series in Computational Mathematics. Prentice-Hall, 1981.
[13] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Scholkopf, "Ranking on data manifolds," in Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[14] J. He, M. Li, H. Zhang, H. Tong, and C. Zhang, "Manifold-ranking based image retrieval," in Proceedings of the 12th Annual ACM International Conference on Multimedia. ACM, 2004.
[15] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[16] A. Argyriou, "Efficient approximation methods for harmonic semi-supervised learning," Master's thesis, University College London, UK, 2004.
[17] J. Nocedal and S. Wright, Numerical Optimization. Springer Verlag, 1999.
[18] F. Chung, Spectral Graph Theory. American Mathematical Society, 1997.
[19] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, "Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate," ACM Transactions on Mathematical Software, vol. 35, October 2008.