Efficient Iterative Semi-Supervised Classification on Manifold


Efficient Iterative Semi-Supervised Classification on Manifold

Mehrdad Farajtabar, Hamid R. Rabiee, Amirreza Shaban, Ali Soltani-Farani
Digital Media Lab, AICTC Research Center, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. {farajtabar, shaban, a soltani}@ce.sharif.edu, rabiee@sharif.edu

Abstract — Semi-Supervised Learning (SSL) has become a topic of recent research that effectively addresses the problem of limited labeled data. Many SSL methods have been developed based on the manifold assumption; among them, Local and Global Consistency (LGC) is a popular method. The problem with most of these algorithms, and in particular with LGC, is that their naive implementations do not scale well with the size of the data. Time and memory limitations are the major problems faced in large-scale settings. In this paper, we provide theoretical bounds on gradient descent and, to overcome the aforementioned problems, propose a new approximate Newton's method. Moreover, a convergence analysis and theoretical bounds on the time complexity of the proposed method are provided. We show that the number of iterations of the proposed methods depends logarithmically on the number of data points, which is a considerable improvement over the naive implementations. Experimental results on real-world datasets confirm the superiority of the proposed methods over LGC's default iterative implementation and a state-of-the-art factorization method.

Keywords — Semi-supervised learning, Manifold assumption, Local and global consistency, Iterative method, Convergence analysis

I. INTRODUCTION

Semi-supervised learning has become a popular approach to the problem of classification with limited labeled data in recent years []. To use unlabeled data effectively in the learning process, certain assumptions regarding the possible labeling functions and the underlying geometry need to hold []. In many real-world classification problems, data points lie on a low-dimensional manifold. The manifold assumption states that the labeling function varies smoothly with respect to the underlying manifold [3]. Methods utilizing the manifold assumption have proven effective in many applications, including image segmentation [4], handwritten digit recognition, and text classification [5].

Regularization is essentially the soul of semi-supervised learning based on the manifold assumption. Manifold regularization is commonly formulated as a quadratic optimization problem, $\min_x \frac{1}{2} x^T A x - b^T x$, where $A \in \mathbb{R}^{n \times n}$ and $b, x \in \mathbb{R}^n$. It is in effect equivalent to solving the system of linear equations $Ax = b$. Fortunately, A is a sparse symmetric positive definite matrix. Naive solutions to this problem require $O(n^3)$ operations to solve for x, while methods that take the sparse structure of A into account can cost much less. Taking the inverse of A directly is an obviously bad choice for several reasons. First, taking the inverse requires $O(n^3)$ operations regardless of the sparse structure of A. Second, A may be near singular, in which case the inverse operation is numerically unstable. Lastly, the inverse of A is usually not sparse, in which case a large amount of memory is needed to store and process it.

To elaborate, note that semi-supervised learning is especially advantageous when there is a large amount of unlabeled data, which leads to better utilization of the underlying manifold structure. For example, consider the huge amount of unlabeled documents or images on the web which may be used to improve classification results.
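To make the equivalence above concrete (this derivation is standard and not specific to the paper), setting the gradient of the quadratic objective to zero recovers the linear system:

$$\nabla_x \Big( \tfrac{1}{2} x^T A x - b^T x \Big) = A x - b = 0 \quad \Longleftrightarrow \quad A x = b,$$

and since A is symmetric positive definite, this stationary point is the unique global minimizer.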
In such large-scale settings, ordinary implementations are not effective, because time and memory limitations are an important concern in SSL methods based on the manifold assumption []. There are commonly two approaches to overcoming this problem. First, one may reformulate the manifold regularization problem in a new form more suitable for large-scale settings. For example, [6] considers a linear base kernel and thus requires an inverse operation on a much smaller matrix, and [7] uses a sparsified manifold regularizer with core vector machines, which were recently proposed for scaling kernel methods up to large-scale data. The second approach, which is the focus of this paper, relies on factorization, optimization, or iterative procedures to solve the original manifold regularization formulation. Iterative methods are of particular interest. Label Propagation (LP) [8] is an iterative algorithm for computing the harmonic solution [9], which is a variation of the manifold regularization problem. Another naturally iterative manifold regularization algorithm is Local and Global Consistency (LGC) [], upon which we build our work. Linear Neighborhood Propagation (LNP) [] is yet another iterative method, which differs from other manifold learning methods mostly in the way it constructs the neighborhood graph. The problem with most of these iterative methods is that, although they are claimed to converge quickly, no analytical guarantee or proof of that claim is provided.

In this paper we conduct a theoretical analysis of iterative methods for LGC. We apply gradient descent to LGC and derive an analytical bound on the number of iterations and its dependence on the number of data points. These bounds also hold for other manifold regularization problems such as the harmonic solution and Tikhonov regularization. We then show that LGC's iterative procedure may be improved through an approximation of the inverse Hessian, and present a detailed convergence analysis. Again, a theoretical bound is derived on the number of iterations. We show that these iterative implementations require $O(\log n)$ sparse matrix-vector multiplications to compute LGC's solution with sufficient accuracy. It is then proved that LGC's iterative procedure is a special case of our proposed method. Finally, the proposed methods are compared with LGC's iterative procedure and a state-of-the-art factorization method based on Cholesky decomposition.

The rest of the paper is organized as follows. Section II introduces related work in the domain of optimization, factorization, and iterative methods. Section III provides a basic overview of LGC and introduces the notation. Section IV provides a detailed analysis of gradient descent applied to LGC. Section V then shows how LGC's iterative procedure may be improved and derives further theoretical bounds. Section VI gives experimental results validating the derived bounds, after which the paper is concluded in Section VII.

II. RELATED WORKS

Methods such as LQ, LU, or Cholesky factorization overcome the problems of the inverse operation by factorizing A into matrices with special structure that greatly simplify computations, especially when A is sparse. In particular, Cholesky factorization best fits our problem by exploiting the symmetry and positive definiteness of A. It decomposes A as $P U^T U P^T$, where P is a permutation matrix and U is upper triangular with positive diagonal elements. Heuristics are used to choose a permutation P that leads to a sparse U. In some instances these heuristics fail and the resulting algorithm may not be as computationally efficient as expected [12].

Iterative methods are another well-studied approach to the problem. Two views of the problem exist. Considering the problem in its optimization form, solutions such as gradient descent, conjugate gradient, steepest descent, and quasi-Newton methods become evident. Taking the machine learning viewpoint leads to more meaningful iterative methods, among them LP, LNP, and LGC, which were introduced in the previous section. LGC's iterative procedure is useful in many other applications, so improving and analyzing it may be helpful; for example, [13] proposed an iterative procedure based on LGC for ranking on the web, and [14] used similar ideas in image retrieval. As stated before, the problem with LGC's or LP's iterative procedure is that no analysis of the number of iterations required for convergence is provided. Moreover, no explicit stopping criterion is mentioned, which is essential for bounding the number of iterations.

Gradient descent is one of the simplest iterative solutions to any optimization problem; however, beyond this simplicity, its linear convergence rate depends strongly on the condition number of the Hessian [15]. Conjugate gradient is a method especially designed for solving large systems of linear equations: a set of directions that are conjugate with respect to A is chosen.
In each iteration, the objective function is minimized along one of these directions. Theoretically the method converges in at most n iterations, with each iteration costing as much as a sparse matrix-vector multiplication. While this makes conjugate gradient a suitable choice, its inherent numerical instability in finding conjugate directions can make the procedure slower than expected. [16], [] apply conjugate gradient to the harmonic solution, with results both superior and inferior to LP depending on the dataset in use. Quasi-Newton methods exhibit super-linear convergence: at each iteration, the inverse Hessian in Newton's method is replaced by an approximation. These methods are not helpful unless the approximation is sparse; however, sparse quasi-Newton methods have an empirically lower convergence rate than low-storage quasi-Newton methods [17], so they are of little use here. Moreover, for our problem, in which the Hessian is constant, computing an approximation to the inverse Hessian at every iteration is costly. In our proposed algorithm we avoid this cost by computing a sufficiently precise, and also sparse, approximation of the inverse Hessian once at the start.

III. BASICS AND NOTATIONS

Consider the general problem of semi-supervised learning. Let $X_u = \{x_1, \ldots, x_u\}$ and $X_l = \{x_{u+1}, \ldots, x_{u+l}\}$ be the sets of unlabeled and labeled data points respectively, where $n = u + l$ is the total number of data points. Also let y be a vector of length n with $y_i = 0$ for unlabeled $x_i$, and $y_i$ equal to $+1$ or $-1$ according to the class label for the labeled data points. Our goal is to predict the labels of $X = X_u \cup X_l$ as f, where $f_i$ is the label associated with $x_i$ for $i = 1, \ldots, n$. It is usual to construct a similarity graph over the data, using methods like weighted k-NN, for better performance and accuracy []. Let W be the $n \times n$ weight matrix with
$$W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),$$
where $\sigma$ is the bandwidth parameter. Define the diagonal matrix D with nonzero entries $D_{ii} = \sum_{j=1}^{n} W_{ij}$. W is symmetrically normalized as $S = D^{-1/2} W D^{-1/2}$. The Laplacian matrix is $L = I - S$.
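As an illustration, the following Python sketch (not from the original paper; it uses dense NumPy arrays and a simple k-NN rule, and the function and variable names are our own) constructs the weight matrix W, the normalized similarity S, and the Laplacian L described above:

import numpy as np
from scipy.spatial.distance import cdist

def build_graph(X, k=5, sigma=1.0):
    """Build the k-NN weight matrix W, the normalized similarity
    S = D^{-1/2} W D^{-1/2}, and the Laplacian L = I - S for data X of shape (n, d).
    Assumes no duplicate or isolated points; in practice a sparse representation
    (scipy.sparse) would be used to keep memory linear in n."""
    n = X.shape[0]
    dist = cdist(X, X)                      # pairwise Euclidean distances
    W = np.exp(-dist**2 / (2 * sigma**2))   # Gaussian weights
    np.fill_diagonal(W, 0.0)                # no self-loops
    # keep only the k nearest neighbors of each point, then symmetrize
    mask = np.zeros_like(W, dtype=bool)
    idx = np.argsort(dist, axis=1)[:, 1:k+1]
    rows = np.repeat(np.arange(n), k)
    mask[rows, idx.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    L = np.eye(n) - S
    return W, S, L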

The family of manifold regularization algorithms can be formulated as the following optimization problem:
$$\min_f \; \frac{1}{2} f^T Q f + \frac{1}{2} (f - y)^T C (f - y), \tag{3}$$
where Q is a regularization matrix, usually the Laplacian itself, and C is a diagonal matrix with $C_{ii}$ equal to the importance of the i-th node sticking to its initial value $y_i$. The first term represents the smoothness of the predicted labels with respect to the underlying manifold, and the second term is the squared error of the predicted labels compared with the initial ones, weighted by C. Choosing different Q's and C's leads to various manifold classification methods [5], [10], [9], [3]. In LGC, $Q = L$ and $C = \mu I$. It may easily be shown that the solution is
$$f^* = (L + C)^{-1} C y = (1 - \alpha)(I - \alpha S)^{-1} y, \tag{4}$$
where $\alpha = \frac{1}{\mu + 1}$. The authors of [10] propose an iterative algorithm to compute this solution:
$$f^{t+1} = \alpha S f^t + (1 - \alpha) y. \tag{5}$$
Since $0 < \alpha < 1$ and the eigenvalues of S are in $[-1, 1]$, this iterative algorithm converges to the solution of LGC [10]. In summary, the manifold regularization problem casts into minimizing
$$R(f) = \frac{1}{2} f^T L f + \frac{1}{2} (f - y)^T C (f - y). \tag{6}$$
Throughout the paper, $R^t$ and $f^t$ denote the objective value and the iterate at the t-th iteration of the algorithm, and $R^*$ and $f^*$ the corresponding optimal ones.

IV. ANALYSIS OF GRADIENT DESCENT

The gradient of (6) is $\nabla R = L f + C(f - y)$, which leads to the gradient descent update rule
$$f^{t+1} = f^t - \alpha_t \big( L f^t + C(f^t - y) \big). \tag{7}$$
The stopping criterion is $\|\nabla R\| \le \epsilon$. Choosing the step size $\alpha_t$ appropriately is essential for convergence. Following [15], applying exact line search to our problem ensures linear convergence, and at iteration t we have
$$t \le \frac{\log\big( (R^0 - R^*) / (R^t - R^*) \big)}{\log(1/z)}, \tag{8}$$
where z is a constant equal to $1 - \frac{\lambda_{\min}(L + C)}{\lambda_{\max}(L + C)}$. For a deeper analysis of the method we need the following lemma.

Lemma 1 ([18]). If $\lambda_m$ and $\lambda_M$ are the smallest and largest eigenvalues of L respectively, then $0 = \lambda_m < \lambda_M \le 2$.

Using the above lemma and the fact that $C = \mu I$, we have $\lambda_{\min}(L + C) = \mu$ and $\lambda_{\max}(L + C) = \mu + \lambda_M \le \mu + 2$.

Lemma 2. For the convex function R of f in (6) the following hold:
$$R(f) - R^* \ge \frac{\|\nabla R(f)\|^2}{2\,\lambda_{\max}(\nabla^2 R)}, \tag{9}$$
$$R(f) - R^* \le \frac{\|\nabla R(f)\|^2}{2\,\lambda_{\min}(\nabla^2 R)}, \tag{10}$$
$$R(f) - R^* \le \frac{\lambda_{\max}(\nabla^2 R)}{2} \|f - f^*\|^2, \tag{11}$$
$$\|\nabla R(f)\| \le \lambda_{\max}(\nabla^2 R)\, \|f - f^*\|. \tag{12}$$

Proof: Considering that the Hessian is a constant matrix, the proofs of (9) and (10) can be found in standard optimization texts such as [15]. For (11) we need the following [15]:
$$R(h) \le R(f) + \nabla R(f)^T (h - f) + \frac{\lambda_{\max}(\nabla^2 R)}{2} \|h - f\|^2. \tag{13}$$
Replacing f by $f^*$ and h by f, we get
$$R(f) \le R(f^*) + \frac{\lambda_{\max}(\nabla^2 R)}{2} \|f - f^*\|^2, \tag{14}$$
and the third inequality is proved. Combining this with (9), the fourth inequality is proved.

Theorem 1. The maximum number of iterations of gradient descent with exact line search and fixed $\epsilon$ and $\mu$ is $O(\log n)$.

Proof: Consider the iteration t just before stopping, i.e., when $\|\nabla R^t\| > \epsilon$ and $\|\nabla R^{t+1}\| \le \epsilon$. Using inequality (9) and Lemma 1,
$$R^t - R^* \ge \frac{\epsilon^2}{2(\lambda_M + \mu)}. \tag{15}$$
Inserting this into (8) yields
$$t \le \frac{\log\big( 2(\lambda_M + \mu)(R^0 - R^*) / \epsilon^2 \big)}{\log(1 + \mu/\lambda_M)}. \tag{16}$$
In order to find an upper bound for $R^0 - R^*$, inequality (11) is used:
$$R^0 - R^* \le \frac{\lambda_M + \mu}{2} \|f^0 - f^*\|^2 \le \frac{(\lambda_M + \mu)\, n}{2}, \tag{17}$$
where in the last inequality we use the fact that $f^0 = 0$ and the elements of $f^*$ are in $[-1, 1]$. Using this in (16) we reach
$$t \le \frac{\log\big( (\lambda_M + \mu)^2 n / \epsilon^2 \big)}{\log(1 + \mu/\lambda_M)} \le \frac{\log\big( (2 + \mu)^2 n / \epsilon^2 \big)}{\log(1 + \mu/2)}. \tag{18}$$
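For concreteness, a minimal NumPy sketch of gradient descent with exact line search for the quadratic objective (6) is given below. This is our own illustrative code, not the authors' implementation; it uses the standard closed-form exact-line-search step size for a quadratic, $\alpha_t = \|g\|^2 / \big(g^T (L + C) g\big)$ for gradient g:

import numpy as np

def lgc_gradient_descent(L, C, y, eps=1e-3, max_iter=10000):
    """Minimize R(f) = 0.5 f'Lf + 0.5 (f-y)'C(f-y) by gradient descent
    with exact line search; stops when ||grad R|| <= eps."""
    n = y.shape[0]
    f = np.zeros(n)
    A = L + C                             # constant Hessian of R
    b = C @ y
    for t in range(max_iter):
        g = A @ f - b                     # gradient: Lf + C(f - y)
        if np.linalg.norm(g) <= eps:
            break
        alpha = (g @ g) / (g @ (A @ g))   # exact line search for a quadratic
        f = f - alpha * g
    return f, t

With a sparse L (e.g. a scipy.sparse.csr_matrix), each iteration costs a constant number of sparse matrix-vector products, matching the per-iteration cost discussed next.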

Each iteration of gradient descent in (7) consists of two steps. First, the step size $\alpha_t$ is computed, which takes a fixed number of matrix-vector multiplications. Next, $L f^t + C(f^t - y)$ is computed, which costs the same. Since all the matrices involved are sparse (L is constructed using k-NN and C is diagonal), each iteration amounts to a few sparse matrix-vector multiplications. Thus the total cost of each iteration is $O(kn)$, where k is the neighborhood size used in constructing the similarity graph. Putting these together, we arrive at an $O(kn \log n)$ time complexity for computing the solution of LGC with gradient descent, i.e., an $O(n \log n)$ rate of growth with respect to the number of data points n, which is considerably less than the $O(n^3)$ complexity of the ordinary inverse in naive implementations, or $O(n^2)$ with sparsity taken into consideration.

It is easy to show that the analysis presented above is valid for other Laplacians L and other matrices C, i.e., applying gradient descent to other manifold regularization methods, such as the harmonic solution and Tikhonov regularization, leads to the same bound. An interesting feature of the bound derived in (18) is that it is independent of the dataset in use: replacing $\lambda_M$ by its upper bound 2 in (18) eliminates the dependence of the bound on the data. This independence, together with the bound being sufficiently tight, makes it appropriate for data-independent practical implementations.

V. SPARSE APPROXIMATION OF NEWTON'S METHOD

Newton's update rule for our problem is
$$f^{t+1} = f^t - \alpha (\nabla^2 R)^{-1} \nabla R. \tag{19}$$
For our quadratic problem, one iteration with $\alpha = 1$ is sufficient to reach the optimum; however, we wish to find a sparse approximation of the inverse Hessian. We show that using a sparse approximation of the inverse Hessian leads to an iterative method with an acceptable convergence rate. As an interesting result, it may be seen that in a special case our method reduces to LGC. We start with approximating the inverse Hessian:
$$(\nabla^2 R)^{-1} = (L + C)^{-1} = (I - S + C)^{-1} = \big(I - (I + C)^{-1} S\big)^{-1} (I + C)^{-1} = \sum_{i=0}^{\infty} \big((I + C)^{-1} S\big)^i (I + C)^{-1}. \tag{20}$$
The last equality holds because the eigenvalues of $(I + C)^{-1} S$ all have magnitude less than one. Using the first m terms of the above summation leads to an approximation of the inverse Hessian:
$$(\nabla^2 R)^{-1} \approx \sum_{i=0}^{m-1} \big((I + C)^{-1} S\big)^i (I + C)^{-1}. \tag{21}$$
Rewriting Newton's method with the approximated inverse Hessian results in the update rule below:
$$\begin{aligned}
f^{t+1} &= f^t - \widetilde{(\nabla^2 R)^{-1}} \big( L f^t + C(f^t - y) \big) \\
&= f^t - \sum_{i=0}^{m-1} \big((I + C)^{-1} S\big)^i (I + C)^{-1} \big( (I + C) f^t - S f^t - C y \big) \\
&= f^t - \sum_{i=0}^{m-1} \big((I + C)^{-1} S\big)^i \big( I - (I + C)^{-1} S \big) f^t + \sum_{i=0}^{m-1} \big((I + C)^{-1} S\big)^i (I + C)^{-1} C y \\
&= f^t - \Big( I - \big((I + C)^{-1} S\big)^m \Big) f^t + \sum_{i=0}^{m-1} \big((I + C)^{-1} S\big)^i (I + C)^{-1} C y \\
&= \big((I + C)^{-1} S\big)^m f^t + \sum_{i=0}^{m-1} \big((I + C)^{-1} S\big)^i (I + C)^{-1} C y.
\end{aligned} \tag{22}$$
In summary, the update can be restated as
$$f^{t+1} = H^m f^t + g_m, \tag{23}$$
where
$$H = (I + C)^{-1} S, \tag{24}$$
$$g_m = \sum_{i=0}^{m-1} H^i (I + C)^{-1} C y. \tag{25}$$
This update rule is performed iteratively from an initial $f^0$ until the stopping criterion $\|\nabla R\| \le \epsilon$ is reached.

Theorem 2. The approximate Newton's method in (23) converges to the optimal solution of LGC.

Proof: Unfolding the update rule in (23) leads to
$$f^t = H^{mt} f^0 + \sum_{i=0}^{t-1} H^{mi} g_m = H^{mt} f^0 + \sum_{i=0}^{t-1} H^{mi} \sum_{j=0}^{m-1} H^j (I + C)^{-1} C y = H^{mt} f^0 + \sum_{i=0}^{mt-1} H^i (I + C)^{-1} C y. \tag{26}$$
Letting $t \to \infty$ gives the final solution. Since the magnitudes of the eigenvalues of H are less than one, $H^{mt} f^0 \to 0$, and
$$\lim_{t \to \infty} f^t = (I - H)^{-1} (I + C)^{-1} C y = (L + C)^{-1} C y, \tag{27}$$
which is equal to $f^*$ in (4).
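The following sketch (ours, not the authors' code) implements the update (23)–(25), assuming $C = \mu I$ as in LGC, so that $H = S/(1+\mu)$. Instead of forming $H^m$ explicitly, H is applied m times per iteration, which has the same effect:

import numpy as np

def approx_newton_lgc(S, y, mu=0.5, m=2, eps=1e-3, max_iter=1000):
    """Approximate Newton iteration f <- H^m f + g_m for LGC with C = mu*I,
    where H = S / (1 + mu).  Stops when ||grad R|| <= eps."""
    n = y.shape[0]
    scale = 1.0 / (1.0 + mu)
    # g_m = sum_{i=0}^{m-1} H^i (I+C)^{-1} C y, with (I+C)^{-1} C y = (mu/(1+mu)) y
    v = (mu * scale) * y
    g_m = v.copy()
    for _ in range(m - 1):
        v = scale * (S @ v)               # multiply by H
        g_m += v
    f = np.zeros(n)
    for t in range(max_iter):
        grad = (f - S @ f) + mu * (f - y)  # gradient: Lf + C(f - y)
        if np.linalg.norm(grad) <= eps:
            break
        h = f
        for _ in range(m):
            h = scale * (S @ h)            # h = H^m f
        f = h + g_m
    return f, t

Setting m = 1 in this sketch recovers exactly LGC's update rule (5), $f \leftarrow \alpha S f + (1 - \alpha) y$ with $\alpha = 1/(1 + \mu)$.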

Theorem 3. For the approximate Newton's method in (23), the stopping criterion $\|\nabla R\| \le \epsilon$ is reached in $O(\log n)$ iterations with respect to the number of data points n.

Proof:
$$\|f^{t+1} - f^*\| = \|H^m f^t + g_m - H^m f^* - g_m\| = \|H^m (f^t - f^*)\|. \tag{28}$$
$H^m$ is symmetric, so $\|H^m x\| \le \lambda_{\max}(H^m)\, \|x\|$, and therefore
$$\|f^t - f^*\| \le \lambda_{\max}(H^m)\, \|f^{t-1} - f^*\| \le \cdots \le \lambda_{\max}(H^m)^t\, \|f^0 - f^*\| = \lambda_{\max}\big((I + C)^{-1} S\big)^{mt}\, \|f^0 - f^*\| \le \Big(\frac{1}{1 + \mu}\Big)^{mt} \|f^0 - f^*\|. \tag{29}$$
By rewriting the above inequality one can see that the maximum number of iterations is bounded by
$$t \le \frac{\log\big( \|f^0 - f^*\| / \|f^t - f^*\| \big)}{m \log(1 + \mu)}. \tag{30}$$
As with gradient descent, consider the iteration t just before the stopping criterion is met, i.e., when $\|\nabla R^t\| > \epsilon$ and $\|\nabla R^{t+1}\| \le \epsilon$. Using inequality (12) we have
$$\|f^t - f^*\| \ge \frac{\|\nabla R^t\|}{\lambda_{\max}(L + C)} > \frac{\epsilon}{\lambda_M + \mu}. \tag{31}$$
The maximum number of iterations is thus bounded above by
$$t \le \frac{\log\big( (\lambda_M + \mu)\, \|f^0 - f^*\| / \epsilon \big)}{m \log(1 + \mu)} \le \frac{\log\big( (2 + \mu) \sqrt{n} / \epsilon \big)}{m \log(1 + \mu)}, \tag{32}$$
where in the last inequality we again use $\lambda_M \le 2$, $f^0 = 0$, and $\|f^*\| \le \sqrt{n}$.

Similar to gradient descent, an $O(\log n)$ dependency on the number of data points is derived for our approximate Newton's method. The sparsity degree of $H^m$ is $k^m$, so matrix-vector operations with this matrix cost $O(k^m n)$. As the approximation becomes more exact, $H^m$ becomes less sparse: as m increases the number of iterations decreases, as can be seen from (32), but the cost of each iteration grows. Empirically it is seen that m should be chosen between 1 and 3, so we can treat it as a constant and achieve an $O(k^3 n \log n)$ dependence on the number of data for the whole algorithm. Also, since k is chosen independently of n and is usually constant, the growth of the algorithm's time complexity is $O(n \log n)$ with respect to the number of data points.

Figure 1: Demonstration of the steps taken by gradient descent and the approximate Newton's method (two settings of m) for two data points from MNIST. The algorithms start from the top-left point and move toward the optimal point located at the bottom right.

Similar to gradient descent, the bound derived in (32) is independent of the dataset, which, together with its tightness, is a good feature for practical implementations. Experiments show that the bound derived here is tighter than the one for gradient descent and, of course, the number of iterations of approximate Newton is much smaller than that of gradient descent.

As a special case, we claim that for m = 1 the algorithm is the same as LGC's iterative procedure. Remembering that $C = \mu I$,
$$f^{t+1} = H f^t + g_1 = (I + C)^{-1} S f^t + (I + C)^{-1} C y = \frac{1}{1 + \mu} S f^t + \frac{\mu}{1 + \mu} y = \alpha S f^t + (1 - \alpha) y, \tag{33}$$
which is the same as (5).

Figure 1 shows how increasing m affects the steps taken by the optimization algorithm, in contrast to the steps taken by gradient descent, for simulations on the MNIST dataset. Gradient descent is extremely dependent on the condition number of the Hessian; for high condition numbers it usually takes a series of zigzag steps to reach the optimum. Approximating the Newton step refines the search direction and decreases the zigzag effect. Figure 1 shows that the steps form approximately a straight line for the larger value of m. The Newton step for quadratic problems points directly to the optimal point, and the trace of the approximate method with the larger m coincides closely with the true direction to the optimum, indicating how well the inverse Hessian is approximated in the proposed method. This is the reason for the small number of iterations needed for convergence of the approximate method compared with gradient descent. Experiments validating the improvement are presented in the next section.
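As a quick sanity check of equation (33) and Theorem 2, the short script below (our own illustration, on a small random graph rather than the paper's datasets) verifies that the LGC iteration (5), which by (33) equals the approximate Newton update with m = 1, converges to the closed-form solution (4):

import numpy as np

# build a small random symmetric weight matrix and its normalized similarity S
rng = np.random.default_rng(0)
n = 50
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
S = D_inv_sqrt @ W @ D_inv_sqrt

mu = 0.5
alpha = 1.0 / (1.0 + mu)
y = np.zeros(n); y[:5] = 1.0; y[5:10] = -1.0     # a few labeled points

# closed-form LGC solution (4): f* = (L + C)^{-1} C y with C = mu*I
L = np.eye(n) - S
f_star = np.linalg.solve(L + mu * np.eye(n), mu * y)

# LGC iteration (5), i.e. approximate Newton with m = 1 (equation (33))
f = np.zeros(n)
for _ in range(200):
    f = alpha * (S @ f) + (1 - alpha) * y

print(np.allclose(f, f_star, atol=1e-6))          # True: the iteration reaches f*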

VI. EXPERIMENTS

For the experiments, three real-world datasets are used: MNIST for digit recognition, Covertype for forest cover prediction, and Classic for text categorization. These rather large datasets are chosen to better simulate a large-scale setting, for which naive solutions, such as the inverse operation, are not applicable in terms of memory and time.

MNIST is a collection of handwritten digit samples. For classification we choose the data points of two digit classes (one of them the digit 8); each sample is of dimension 784. No preprocessing is done on the data. The forest Covertype dataset was collected for predicting forest cover type from cartographic variables. It includes seven classes and 581,012 samples of dimension 54. We randomly select samples of types 1 and 2, and normalize them such that each feature is in [0, 1]. The Classic collection is a benchmark dataset in text mining. It consists of 4 different document collections: CACM (3204 documents), CISI (1460 documents), CRAN (1398 documents), and MED (1033 documents). We try to separate the first category from the others. Terms are single words; the minimum term length is 3; a term must appear in at least 3 documents and in at most 95% of the documents. Moreover, Porter's stemming is applied during preprocessing. Features are weighted with the TF-IDF scheme and normalized to unit length.

For all the datasets we use the same setting. Adjacency matrices are constructed using 5-NN, with the bandwidth set to the mean standard deviation of the data. A small fraction of the data points is labeled. µ is set to 0.5, and ε is set to 0.5, which empirically ensures convergence to the optimal solutions. The number of iterations, accuracy, and distance to the optimum are reported as averages over runs with different random labelings. The algorithms are run on the datasets and the results are depicted and discussed in the following.

Figure 2 shows the number of iterations of the three iterative methods with respect to the number of data points. The solutions of the iterative methods have almost converged to the optimum, as depicted in Figure 3. LGC's default implementation is the worst among the three; gradient descent is second; and our approximate Newton's method has the fastest convergence rate, consistently across the three diverse datasets. Note that LGC corresponds to the approximate method with m = 1 and, as indicated in Figure 1, has a better search direction than gradient descent, so it may be surprising that it needs more iterations than gradient descent. The key point is the line search: although the direction proposed by gradient descent is worse than the one used by LGC, exact line search causes gradient descent to reach the optimum faster. If we incorporate an exact line search into our approximate method, we reach even fewer iterations; however, it was observed empirically that, due to the time consumed by the line search, there is no improvement in terms of running time. Another important point about the diagrams in Figure 2 is the order of growth with respect to the number of data points, which is consistent with the logarithmic growth derived in the previous sections. This makes LGC with an iterative implementation a good choice for large-scale SSL tasks. To illustrate how tight the bounds derived for the iterative methods are, we substitute the parameters into equations (32) and (18) to obtain the theoretical bounds for the approximate method and for gradient descent, which may be compared with the empirical values in the diagrams of Figure 2. Interestingly, the diagrams show that the derived bounds are quite tight regardless of the dataset.
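To reproduce a comparison of this kind, the sketch below (ours; it uses SciPy's generic sparse direct solver as a stand-in for CHOLMOD, and the hypothetical build_graph and approx_newton_lgc helpers sketched earlier) contrasts the iterative solution with a direct sparse solve:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def compare_solvers(X, y, mu=0.5, m=2, eps=1e-3):
    """Solve (L + mu*I) f = mu*y directly and with the approximate Newton
    iteration, and report the distance between the two solutions."""
    _, S, L = build_graph(X, k=5)                  # helper sketched above
    n = X.shape[0]
    A = sp.csr_matrix(L) + mu * sp.identity(n)     # sparse system matrix
    f_direct = spsolve(A, mu * y)                  # generic sparse direct solve
    f_iter, iters = approx_newton_lgc(S, y, mu=mu, m=m, eps=eps)
    return np.linalg.norm(f_iter - f_direct), iters

For a faithful reproduction of the paper's baseline, a CHOLMOD binding (such as the one provided by the scikit-sparse package) would replace spsolve, and a sparse k-NN graph construction would replace the dense helper.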
Figure 3 shows the accuracy of the iterative methods compared with a factorization method, CHOLMOD [19], which uses Cholesky factorization to solve systems of linear equations quickly. Since computing the exact solution via the inverse is impractical, we use a factorization method to obtain the exact solution and compare it with the solutions of the iterative methods. As seen from the diagrams, for all three datasets the solutions of the iterative methods are sufficiently close to the optimal solution, with the number of iterations demonstrated in Figure 2.

Figure 4 compares the distance to the optimum of the different methods at each iteration and shows how these methods converge to the optimum. As expected from the previous results, the approximate Newton's method has the fastest convergence, while LGC is the slowest. As stated before, the superiority (in terms of the number of iterations) of gradient descent over LGC is due to its line search, not the direction chosen by the method.

Figure 5 shows the time needed to compute the solution. Figure 5a compares our approximate Newton's method with CHOLMOD, a state-of-the-art method for solving large systems of linear equations; the iterative method is clearly superior to CHOLMOD. Figure 5b compares the running times of the different iterative methods. Again the proposed method is the best, but this time LGC performs better than gradient descent, because of the overhead imposed by the line search. As the number of data points grows, the difference between the methods becomes more evident. The time growth is of order n log n, as predicted by Theorems 1 and 3.

VII. CONCLUSION AND FUTURE WORKS

In this paper, a novel approximation to Newton's method is proposed for solving the manifold regularization problem, along with a theoretical analysis of the number of iterations. We proved that the number of iterations has a logarithmic dependence on the number of data points. We also applied gradient descent to this problem and proved that its number of iterations also grows logarithmically with the number of data. This logarithmic dependence makes iterative methods a reasonable approach when a large amount of data is being classified. It is notable that the derived bounds are empirically tight independently of the dataset in use, which is practically an important feature of an algorithm.

Figure 2: Number of iterations of the three iterative methods (LGC, approximate Newton's method, gradient descent) with respect to the number of data points, on (a) MNIST, (b) Covertype, and (c) Classic.

Figure 3: Accuracy of the iterative methods compared with CHOLMOD, on (a) MNIST, (b) Covertype, and (c) Classic.

Figure 4: Distance from the optimum, $\|f^t - f^*\|$, of the three methods with respect to the iteration number, on (a) MNIST, (b) Covertype, and (c) Classic.

We derived LGC's iterative procedure as a special case of our proposed approximate Newton's method. Our method is based upon an approximation of the inverse Hessian; the more exact the approximation, the better the search direction. Experimental results confirm the improvement of our proposed method over LGC's iterative procedure without any loss in classification accuracy. The improvement of our approximate method over gradient descent is also shown both theoretically and empirically. A theoretical analysis of robustness against noise, incorporating a low-cost line search into the proposed method, and finding lower bounds on the number of iterations, or tighter upper bounds, are, to name a few, interesting problems that remain as future work.

Figure 5: Comparison of the time needed to compute the solution: (a) approximate Newton's method vs. CHOLMOD on MNIST; (b) running times of LGC, approximate Newton's method, and gradient descent on MNIST.

REFERENCES

[1] X. Zhu, "Semi-supervised learning with graphs," Ph.D. dissertation, Carnegie Mellon University, 2005.
[2] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.
[3] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[4] O. Duchenne, J. Audibert, R. Keriven, J. Ponce, and F. Ségonne, "Segmentation by transduction," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[5] M. Belkin and P. Niyogi, "Using manifold structure for partially labeled classification," in NIPS, 2002.
[6] V. Sindhwani, P. Niyogi, M. Belkin, and S. Keerthi, "Linear manifold regularization for large scale semi-supervised learning," in Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data, 2005.
[7] I. Tsang and J. Kwok, "Large-scale sparsified manifold regularization," in Advances in Neural Information Processing Systems, vol. 19, 2007.
[8] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CALD-02-107, 2002.
[9] X. Zhu, Z. Ghahramani, and J. D. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in ICML, 2003, pp. 912–919.
[10] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in NIPS, 2003.
[11] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
[12] A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, ser. Prentice-Hall Series in Computational Mathematics. Prentice-Hall, 1981.
[13] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on data manifolds," in Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[14] J. He, M. Li, H. Zhang, H. Tong, and C. Zhang, "Manifold-ranking based image retrieval," in Proceedings of the 12th Annual ACM International Conference on Multimedia. ACM, 2004.
[15] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[16] A. Argyriou, "Efficient approximation methods for harmonic semi-supervised learning," Master's thesis, University College London, UK, 2004.
[17] J. Nocedal and S. Wright, Numerical Optimization. Springer, 1999.
[18] F. Chung, Spectral Graph Theory. American Mathematical Society, 1997, no. 92.
[19] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, "Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate," ACM Transactions on Mathematical Software, vol. 35, October 2008.


CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares Robert Bridson October 29, 2008 1 Hessian Problems in Newton Last time we fixed one of plain Newton s problems by introducing line search

More information

TOPOLOGY FOR GLOBAL AVERAGE CONSENSUS. Soummya Kar and José M. F. Moura

TOPOLOGY FOR GLOBAL AVERAGE CONSENSUS. Soummya Kar and José M. F. Moura TOPOLOGY FOR GLOBAL AVERAGE CONSENSUS Soummya Kar and José M. F. Moura Department of Electrical and Computer Engineering Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail:{moura}@ece.cmu.edu)

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J Olver 8 Numerical Computation of Eigenvalues In this part, we discuss some practical methods for computing eigenvalues and eigenvectors of matrices Needless to

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification JMLR: Workshop and Conference Proceedings 1 16 A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification Chih-Yang Hsia r04922021@ntu.edu.tw Dept. of Computer Science,

More information

Lec10p1, ORF363/COS323

Lec10p1, ORF363/COS323 Lec10 Page 1 Lec10p1, ORF363/COS323 This lecture: Conjugate direction methods Conjugate directions Conjugate Gram-Schmidt The conjugate gradient (CG) algorithm Solving linear systems Leontief input-output

More information

When Dictionary Learning Meets Classification

When Dictionary Learning Meets Classification When Dictionary Learning Meets Classification Bufford, Teresa 1 Chen, Yuxin 2 Horning, Mitchell 3 Shee, Liberty 1 Mentor: Professor Yohann Tendero 1 UCLA 2 Dalhousie University 3 Harvey Mudd College August

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Improving Semi-Supervised Target Alignment via Label-Aware Base Kernels

Improving Semi-Supervised Target Alignment via Label-Aware Base Kernels Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Improving Semi-Supervised Target Alignment via Label-Aware Base Kernels Qiaojun Wang, Kai Zhang 2, Guofei Jiang 2, and Ivan Marsic

More information

Semi-supervised Eigenvectors for Locally-biased Learning

Semi-supervised Eigenvectors for Locally-biased Learning Semi-supervised Eigenvectors for Locally-biased Learning Toke Jansen Hansen Section for Cognitive Systems DTU Informatics Technical University of Denmark tjha@imm.dtu.dk Michael W. Mahoney Department of

More information

Multi-view Laplacian Support Vector Machines

Multi-view Laplacian Support Vector Machines Multi-view Laplacian Support Vector Machines Shiliang Sun Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China slsun@cs.ecnu.edu.cn Abstract. We propose a

More information

Fast Nonnegative Matrix Factorization with Rank-one ADMM

Fast Nonnegative Matrix Factorization with Rank-one ADMM Fast Nonnegative Matrix Factorization with Rank-one Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 9093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information