Learning Preferences with Millions of Parameters by Enforcing Sparsity

Xi Chen (1)(2), Bing Bai (1), Yanjun Qi (1), Qihang Lin (2), Jaime Carbonell (2)
(1) NEC Labs America, Princeton, NJ
(2) Carnegie Mellon University, Pittsburgh, PA

Abstract

We study the retrieval task that ranks a set of objects for a given query in the pairwise preference learning framework. Recently, researchers have found that raw features (e.g. words for text retrieval) and their pairwise features, which describe relationships between two raw features (e.g. word synonymy or polysemy), can greatly improve retrieval precision. However, most existing methods cannot scale up to problems with many raw features (e.g. the English vocabulary), due to the prohibitive computational cost of learning and the memory required to store a quadratic number of parameters. In this paper, we propose to learn a sparse representation of the pairwise features under the preference learning framework using l1 regularization. Based on stochastic gradient descent, an online algorithm is devised to enforce the sparsity using a mini-batch shrinkage strategy. On multiple benchmark datasets, we show that our method achieves better performance with fast convergence, and takes much less memory on models with millions of parameters.

Keywords: preference learning; sparse model; online learning; learning to rank; text mining

I. INTRODUCTION

Learning preferences among a set of objects (e.g. documents) given another object as query is a central task of information retrieval and text mining. One of the most natural frameworks for this task is pairwise preference learning, which expresses that one document is preferred over another given the query [1]. Most existing methods [2] learn the preference or relevance function by assigning a real-valued score to a feature vector describing a (query, object) pair. This feature vector normally includes a small number of hand-crafted features, such as the BM25 scores for the title or the whole text, instead of the very natural raw features [3]. A drawback of hand-crafted features is that they are often expensive and dataset-specific, requiring domain knowledge in preprocessing. In contrast, raw features are easily available and carry strong semantic information (such as word features in text mining).

In this paper we study a basic model presented in [4], [5] which uses the raw word features under the supervised pairwise preference learning framework and considers feature relationships in the model.¹ To be specific, let D be the dictionary size, i.e. the size of the query and document feature set.² Given a query q ∈ R^D and a document d ∈ R^D, the relevance score between q and d is modeled as

    f(q, d) = q^T W d = Σ_{i,j} W_{ij} Φ(q_i, d_j),    (1)

where Φ(q_i, d_j) = q_i · d_j and W_{ij} models the relationship/correlation between the i-th query feature q_i and the j-th document feature d_j. This is essentially a linear model with pairwise features Φ(·,·), and the parameter matrix W ∈ R^{D×D} is learned from labeled data. Compared to most existing models, the capacity of this model is very large because its D² free parameters can carefully model the relationship between each pair of words. From a semantic point of view, a notable strength of this model is that it can capture synonymy and polysemy, since it looks at all possible cross terms, and it can be tuned directly for the task of interest.
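As a concrete illustration of Eq. (1), the following sketch scores a single (query, document) pair under a sparse parameter matrix. It is purely illustrative, not code from the paper: the dictionary size, the 0.5% density and the random vectors are made-up values, and NumPy/SciPy are used only as convenient stand-ins.

```python
# Illustrative sketch of the bilinear score f(q, d) = q^T W d with a sparse W.
import numpy as np
from scipy.sparse import random as sparse_random

D = 1_000                                    # toy dictionary size
rng = np.random.default_rng(0)

W = sparse_random(D, D, density=0.005, random_state=0, format="csr")  # sparse parameters
q = rng.random(D); q /= np.linalg.norm(q)    # tf-idf-style query vector, unit length
d = rng.random(D); d /= np.linalg.norm(d)    # document vector, unit length

score = q @ (W @ d)                          # f(q, d), a single relevance score
print(score)
```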
Although it is very powerful, the basic model in Eq. (1) suffers from the following weaknesses, which hinder its wide application:

1) Memory issue: Given a large dictionary size D, a huge amount of memory is required to store the W matrix, whose size is quadratic in D. When D = 10,000, storing W needs nearly 1GB of RAM (10^8 double-precision entries at 8 bytes each, i.e. 0.8GB); when D = 30,000, it requires 8GB of RAM.

2) Generalization ability: With D² free parameters (the entries of W) and a limited number of training samples, the model can easily overfit. For a dictionary of size D = 10,000, there are D² = 10^8 free parameters to estimate, which is far too many for small corpora.

To address the above weaknesses, we propose to constrain W to be a sparse matrix, with zero entries for the pairs of words that are irrelevant to the preference learning task. If W is a highly sparse matrix, it consumes much less memory and has only a limited number of free parameters to be estimated.

¹ For the sake of clarity, we present the model in a text retrieval scenario. We use the term "words" for features, and "query" and "documents" for data instances, depending on their roles.
² In our model and algorithm, there is no need to restrict the query q and the document d to have the same feature size D; we make this assumption only for simplicity of exposition.
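A quick back-of-the-envelope check of the memory argument above (our own illustration; the 1% density and the CSR bookkeeping are example assumptions, not figures from the paper):

```python
# Rough memory estimate for a dense W versus a sparse W stored in CSR format.
D = 10_000
dense_bytes = D * D * 8                         # 10^8 double-precision entries
print(dense_bytes / 1e9, "GB for dense W")      # ~0.8 GB

nnz = int(0.01 * D * D)                         # e.g. 1% nonzero entries
sparse_bytes = nnz * (8 + 4) + (D + 1) * 4      # values + column indices + row pointers
print(sparse_bytes / 1e9, "GB for 1%-dense W")  # ~0.012 GB
```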

In other words, a sparse W matrix allows us to greatly scale up the dictionary size to model those non-frequent words which are often essential for the preference ordering. In addition, we obtain a faster and more scalable learning algorithm, since most entries of W are zero and the corresponding multiplications can be avoided. Another advantage of learning a sparse representation of W is its good interpretability: a zero W_{ij} indicates that the i-th query feature and the j-th document feature are not correlated for the specific task. A sparse W matrix accurately captures the correlation between pairs of words, and the learned correlated word pairs can explain the semantic rationale behind the preference ordering.

In order to enforce sparsity on W, inspired by the success of the sparse linear regression model lasso [6], we impose an entry-wise l1 regularization on W. It has been shown theoretically that, under certain conditions, l1 regularization can correctly uncover the underlying ground-truth sparsity pattern [7]. A practical challenge in using the l1 regularized model is to develop an efficient and scalable learning algorithm. Since in many preference learning applications (e.g. search engines) the model needs to be trained in a timely fashion and new (query, document) pairs may be added to the repository at any time, stochastic gradient descent in the online learning framework is the most desirable learning method for our task [8]. To enforce the l1 regularization, based on [9], we propose to perform a mini-batch shrinkage step every T iterations of stochastic gradient descent, which leads to a sparse solution. Moreover, to reduce the additional bias introduced by the l1 regularization, we further propose a refitting step which improves the preference prediction while keeping the learned sparsity pattern.

The key idea of this paper is that learning on a large number of corpus-independent raw features, or even on combinations of such features, is not impossible. Although the number of involved parameters is intimidatingly large, using sparsity as a powerful tool, the model can be well controlled and efficiently learned. We believe these ideas form a fresh perspective and will benefit many other retrieval-related tasks as well.

II. BASIC MODEL

Let us denote the set of documents in the corpus as {d_k}_{k=1}^{K} ⊂ R^D and the query as q ∈ R^D, where D is the dictionary size and the j-th dimension of a document/query vector indicates the frequency of occurrence of the j-th word, e.g. using tf-idf weighting and then normalizing to unit length [10]. Given a query q and a document d, we wish to learn a scoring function f(q, d) that measures the relevance of d to q. In this paper, we assume that f is a linear function which takes the form of Eq. (1); each entry of W represents a relationship between a pair of words.

A. Margin Rank Loss

Suppose we are given a set of tuples R (labeled data), where each tuple contains a query q, a preferred document d+ and an unpreferred (or lower ranked) document d−. We would like to learn a W such that q^T W d+ > q^T W d−, making the right preference prediction. For that purpose, given a tuple (q, d+, d−), we employ the widely adopted margin rank loss [11]:

    L_W(q, d+, d−) := h(q^T W d+ − q^T W d−) = max(0, 1 − q^T W d+ + q^T W d−),    (2)

where h(x) := max(0, 1 − x) is the well-known hinge loss function adopted in SVMs.
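A minimal sketch of the margin rank loss in Eq. (2) for a single tuple (q, d+, d−); the function name and array conventions are ours, not the authors':

```python
# Sketch of the margin rank loss of Eq. (2) for one tuple (q, d+, d-).
import numpy as np

def margin_rank_loss(W, q, d_pos, d_neg):
    """L_W(q, d+, d-) = max(0, 1 - q^T W d+ + q^T W d-)."""
    margin = q @ W @ d_pos - q @ W @ d_neg
    return max(0.0, 1.0 - margin)
```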
Our goal is to learn the W matrix which minimizes the loss in Eq. (2) summed over all tuples (q, d+, d−) in R:

    min_W (1/|R|) Σ_{(q,d+,d−)∈R} L_W(q, d+, d−).    (3)

B. Stochastic Subgradient Descent

In general, the size of R is very large and new tuples may be added to R in a streaming manner, which makes it difficult to directly optimize the objective in Eq. (3). To overcome this challenge, we adopt stochastic (sub)gradient descent (SGD) in an online learning framework [12]. Specifically, at each iteration we randomly draw a sample (q, d+, d−) from R, compute the subgradient³ of L_W(q, d+, d−) with respect to W as

    ∂L_W(q, d+, d−) = −q(d+ − d−)^T   if q^T W (d+ − d−) < 1,
                       0                otherwise,    (4)

and then update the W matrix accordingly. It has been shown that SGD achieves fast learning on large-scale datasets [8].

We suggest initializing W to the identity matrix, as this initializes the model to the same solution as a cosine similarity score. This strategy introduces a prior expressing that the weight matrix should be close to the identity matrix: we consider term correlations only when it is necessary to increase the score of a relevant document or, conversely, to decrease the score of an irrelevant document.

As for the learning rate η_t, we suggest a decaying learning rate η_t = C/√t, where C is a pre-defined constant serving as the initial learning rate. Intuitively, this should be better than a fixed learning rate: at the beginning, when W is far from the optimal solution W*, a larger learning rate is desirable since it leads to a significant decrease of the objective value; when W gets close to W*, a smaller rate should be adopted to avoid missing the optimal solution. We show the advantage of the decaying learning rate scheme over the fixed one in the experiment section.

³ Since L is a non-smooth function, it does not have a gradient but only a subgradient.
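The following sketch puts the pieces of this subsection together: the subgradient of Eq. (4), identity initialization, and a decaying learning rate. It is illustrative only; the tuple sampler `sample_tuple` (returning a random (q, d+, d−) from R), the iteration count and the value of C are hypothetical stand-ins.

```python
# Sketch of the plain SGD scheme of Section II-B (no sparsity yet).
import numpy as np

def subgradient(W, q, d_pos, d_neg):
    """Eq. (4): -q (d+ - d-)^T when the margin is violated, 0 otherwise."""
    if q @ W @ (d_pos - d_neg) < 1.0:
        return -np.outer(q, d_pos - d_neg)
    return np.zeros_like(W)

def sgd_train(sample_tuple, D, n_iters=100_000, C=2.0):
    W = np.eye(D)                          # start from the cosine-similarity solution
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = sample_tuple()
        eta_t = C / np.sqrt(t)             # decaying learning rate
        W -= eta_t * subgradient(W, q, d_pos, d_neg)
    return W
```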

III. PREFERENCE LEARNING WITH SPARSITY

As discussed in the introduction, the model and the learning algorithm of the previous section lead to a dense W matrix, which consumes a large amount of memory and generalizes poorly on small corpora. To address these problems, we learn a sparse model with a small number of nonzero entries in W. In order to obtain a sparse W, inspired by [6], we add an entry-wise l1 norm of W as a regularization term to the loss function and propose to optimize the following objective:

    min_W (1/|R|) Σ_{(q,d+,d−)∈R} L_W(q, d+, d−) + λ ||W||_1,    (5)

where ||W||_1 = Σ_{i,j=1}^{D} |W_{ij}| is the entry-wise l1 norm of W and λ is the regularization parameter which controls the sparsity level (the number of nonzero entries) of W. In general, a larger λ leads to a sparser W. On the other hand, a W that is too sparse will miss useful relationships among word pairs (consider a diagonal W as an extreme case). Therefore, in practice, we need to tune λ to obtain a W with a suitable sparsity level.

A. Training the Sparse Model

To optimize Eq. (5), we adopt a variant of the general sparse online learning scheme in [9]. In [9], after updating W^t at each iteration of SGD, a shrinkage step is performed by solving the following optimization problem:

    Ŵ^t = argmin_W (1/2) ||W − W^t||_F^2 + λ η_t ||W||_1,    (6)

and Ŵ^t is then used as the starting point for the next iteration. In Eq. (6), ||·||_F denotes the matrix Frobenius norm and η_t is the decaying learning rate for the t-th iteration. According to the next proposition, performing (6) shrinks those W^t_{ij} with absolute value less than λη_t to zero and hence leads to a sparse W matrix.

Proposition 1. The solution Ŵ^t of the optimization problem in Eq. (6) takes the following form:

    Ŵ^t_{ij} = W^t_{ij} + λη_t   if W^t_{ij} < −λη_t,
             = 0                  if −λη_t ≤ W^t_{ij} ≤ λη_t,    (7)
             = W^t_{ij} − λη_t   if W^t_{ij} > λη_t.

Although the shrinkage step leads to a sparse W, it is very expensive for a large dictionary size D. For example, when D = 10,000, it requires D² = 10^8 operations. Therefore, we suggest performing the shrinkage step only every T iterations. In general, a smaller T guarantees that the shrinkage step is done in a timely fashion, so that the entries of W do not grow too large and produce an inaccurate L_W(q, d+, d−); on the other hand, a smaller T increases the computational cost of training. In practice, T is chosen to balance these two considerations. The details are presented in Algorithm 1.

Note that when t is a multiple of T, we perform the shrinkage step with a cumulative regularization parameter on ||W||_1 accumulated from iteration t − T + 1 to t: λ Σ_{k=t−T+1}^{t} η_k. The reason for adopting a cumulative regularization parameter is the following simple fact: the W^T obtained by solving the sequence of successive optimization problems

    W^t = argmin_W (1/2) ||W − W^{t−1}||_F^2 + λ η_t ||W||_1,   t = 1, ..., T,

is identical to the one obtained by solving the single optimization problem

    W^T = argmin_W (1/2) ||W − W^0||_F^2 + λ (Σ_{t=1}^{T} η_t) ||W||_1.

This equivalence can easily be proved using Proposition 1, as shown in [9]. Although gradient updates are taken in between, so that Algorithm 1 is not exactly identical to performing the shrinkage step at every iteration, the sparse W learned by Algorithm 1 is empirically a good approximation.

Algorithm 1
Initialization: W^0 ∈ R^{D×D}, T, learning rate sequence {η_t}.
Iterate for t = 1, 2, ... until convergence of W^t:
  1) Randomly draw a tuple (q, d+, d−) ∈ R.
  2) Compute the subgradient of L_{W^{t−1}}(q, d+, d−) with respect to W: ∂L_{W^{t−1}}(q, d+, d−).
  3) Update W^t = W^{t−1} − η_t ∂L_{W^{t−1}}(q, d+, d−).
  4) If (t mod T = 0):
       W^t ← argmin_W (1/2) ||W − W^t||_F^2 + λ (Σ_{k=t−T+1}^{t} η_k) ||W||_1.
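A compact sketch of Algorithm 1 follows, showing how the soft-thresholding of Proposition 1 is applied every T iterations with the cumulative regularization parameter. It is illustrative only; the tuple sampler and the default hyperparameter values are ours, not the paper's.

```python
# Sketch of Algorithm 1: SGD updates plus a shrinkage (soft-thresholding) step
# every T iterations, using the cumulative parameter lambda * sum of eta_k.
import numpy as np

def soft_threshold(W, tau):
    """Closed-form solution of Eq. (6)/(7): shrink every entry of W toward zero by tau."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def sparse_sgd_train(sample_tuple, D, n_iters=100_000, C=2.0, lam=1e-4, T=100):
    W = np.eye(D)
    acc_eta = 0.0                                   # learning rates since last shrinkage
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = sample_tuple()
        eta_t = C / np.sqrt(t)
        acc_eta += eta_t
        if q @ W @ (d_pos - d_neg) < 1.0:           # subgradient step, Eq. (4)
            W += eta_t * np.outer(q, d_pos - d_neg)
        if t % T == 0:                              # mini-batch shrinkage step
            W = soft_threshold(W, lam * acc_eta)
            acc_eta = 0.0
    return W
```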
B. Refitting the Sparse Model

From Eq. (7), we see that the l1 regularization not only shrinks the weights of uncorrelated word pairs to zero but also reduces the absolute values of the weights for correlated word pairs. This additional bias introduced by the l1 regularization often harms prediction performance. To reduce this bias, we propose to refit the model without l1 regularization while enforcing the sparsity pattern of W obtained from Algorithm 1.

More precisely, after learning the sparse Ŵ from Algorithm 1, let Ω be the set of indices of the nonzero entries of Ŵ, i.e. Ω = {(i, j) | Ŵ_{ij} ≠ 0}. Given a matrix W ∈ R^{D×D}, let P_Ω(W) ∈ R^{D×D} be the matrix defined as

    P_Ω(W)_{ij} = W_{ij}   if (i, j) ∈ Ω,
                = 0        if (i, j) ∉ Ω,

where P_Ω is called the projection operator which projects W onto Ω.

In the refitting step, given Ω, we minimize the following objective function:

    min_W (1/|R|) Σ_{(q,d+,d−)∈R} L_{P_Ω(W)}(q, d+, d−).    (8)

We still adopt SGD to minimize Eq. (8), but replace L_W(q, d+, d−) with L_{P_Ω(W)}(q, d+, d−). Using the chain rule for subgradients, the subgradient of L_{P_Ω(W)}(q, d+, d−) takes the following form:

    ∂L_{P_Ω(W)}(q, d+, d−) = −P_Ω(q(d+ − d−)^T)   if q^T P_Ω(W)(d+ − d−) < 1,
                               0                    otherwise.

In the experiment section, we show that prediction performance improves after the refitting step.
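A sketch of the refitting step, reusing the hypothetical tuple sampler from the earlier sketches: the support Ω of the learned sparse matrix is frozen and the subgradient is projected onto it. Again this is an illustration under our own naming conventions, not the authors' code.

```python
# Sketch of the refitting step of Section III-B: keep the sparsity pattern
# Omega of the learned sparse W fixed and rerun SGD with projected updates.
import numpy as np

def refit(W_sparse, sample_tuple, n_iters=100_000, C=2.0):
    mask = (W_sparse != 0)                           # Omega: support of the sparse solution
    W = W_sparse.copy()
    for t in range(1, n_iters + 1):
        q, d_pos, d_neg = sample_tuple()
        eta_t = C / np.sqrt(t)
        if q @ W @ (d_pos - d_neg) < 1.0:            # W stays supported on Omega,
            grad = -np.outer(q, d_pos - d_neg)       # so q^T W d equals q^T P_Omega(W) d
            W -= eta_t * np.where(mask, grad, 0.0)   # P_Omega applied to the subgradient
    return W
```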

IV. EXPERIMENT

A. Experiment Setup

Pairwise preference learning as discussed in this paper belongs to the more general framework of learning to rank, which is a key topic in information retrieval research [2]. Learning to rank methods are usually evaluated on standard benchmark data such as the TREC data or LETOR [3]. However, TREC has only a limited number of queries available, which makes training a large number of features very difficult. On the other hand, LETOR and most other learning-to-rank datasets (e.g. the Microsoft Learning to Rank datasets and the Yahoo! Learning to Rank Challenge datasets) contain only a few hundred (or even fewer) features, such as BM25 or PageRank scores, instead of the actual word features, and are therefore not adequate for evaluating our method. It would be ideal to test the proposed method on click-through data from web search logs, but such data are not publicly available.

As pointed out by a seminal paper [13], preference learning and multiclass classification can be modeled in a unified framework. It is therefore natural to adopt multiclass classification datasets (with many different classes) to evaluate our proposed preference learning models. More precisely, we construct training samples (q, d+, d−) ∈ R where q and d+ are in the same class while q and d− belong to different classes.

In our experiments, we use several benchmark multiclass classification datasets, including the text datasets 20 Newsgroups (2NG) and RCV1 [15] (we adopt the preprocessing method in [14] and remove the multi-labelled instances, which results in 53 different classes), and the digit recognition dataset MNIST. For 2NG and RCV1, we use the normalized tf-idf of the 10,000 most frequent words as the document features. For MNIST, the normalized grey-scale pixel values are used as features. Some statistics of these datasets are shown in Table I.

Table I
THE STATISTICS OF THE EXPERIMENTAL DATASETS

                          2NG       RCV1      MNIST
No. of training samples   11,314    15,564    60,000
No. of testing samples    7,532     518,571   10,000
No. of classes            20        53        10
Dictionary size D         10,000    10,000    784
No. of free parameters    10^8      10^8      614,656

For the experiments, we use the cosine similarity as the baseline, i.e. W = I, and mainly compare the following methods:
1) W is a diagonal matrix.
2) W is unconstrained and trained by SGD with the fixed learning rate η = 0.1.
3) W is unconstrained and trained by SGD with decaying learning rates η_t = C/√t, where C = 2.
4) W is sparse and trained by Algorithm 1 with decaying learning rates η_t = C/√t, where C = 2, followed by the corresponding refitting step.¹⁰

Note that for the decaying learning rate case, we tried a wide range of initial rates C and found that C = 2 provides the most rapid decrease of the objective value, so C is set to 2 throughout the paper. As for the regularization parameter λ, when the dictionary size is large, say D = 10,000 for the text data, we choose λ so that the density (percentage of nonzero entries) of W is between 0.5% and 1%.
When D is relatively small, say 784 for the MNIST dataset, we set λ so that the density of W is roughly 5%. This way of setting λ yields a W sparse enough to provide memory savings and easy interpretation, while leaving enough nonzero entries in W to guarantee reasonably good preference learning performance on the testing datasets.

We compare the performance of each method by the following metrics:
1) Test error rate: for a testing tuple (q, d+, d−), if q^T W d+ ≤ q^T W d−, the test error is increased by 1; the total is normalized by the test sample size.
2) Mean average precision (MAP) [16].
We also examine the semantic information encoded in the W matrix for the text data.

¹⁰ For a fair comparison, we use exactly the same training samples in the refitting step as in the sparse learning procedure, without any additional samples.
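To make the evaluation setup concrete, here is a small sketch (our own construction, with hypothetical helper names) of how tuples (q, d+, d−) can be drawn from a multiclass dataset and how the test error rate defined above is computed; X denotes a feature matrix and y the class labels.

```python
# Sketch of tuple construction from a multiclass dataset and of the test
# error rate defined above; entirely illustrative, not the authors' code.
import numpy as np

def make_tuples(X, y, n_tuples, rng):
    """Draw tuples (q, d+, d-): q and d+ share a class, d- comes from another class."""
    tuples = []
    while len(tuples) < n_tuples:
        i, j, k = rng.integers(0, len(X), size=3)
        if i != j and y[i] == y[j] and y[i] != y[k]:
            tuples.append((X[i], X[j], X[k]))
    return tuples

def test_error_rate(W, tuples):
    """Fraction of tuples where the preferred document does not score higher."""
    errors = sum(1 for q, d_pos, d_neg in tuples if q @ W @ d_pos <= q @ W @ d_neg)
    return errors / len(tuples)
```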

B. Results

1) Learning Curves: It has been shown that the test error rate is monotonic in the area under the ROC curve (AUC). In Figure 1, we show the curves of the test error rate and the corresponding density of W versus the number of training iterations, i.e. the number of accesses of the training data. Each row in Figure 1 corresponds to one dataset.

Figure 1. The test error rate and the density of W for the three benchmark datasets: (a) test error curve for 2NG; (b) test error curve for RCV1; (c) test error curve for MNIST; (d) density of W for 2NG; (e) density of W for RCV1; (f) density of W for MNIST.

We have the following observations:
1) For SGD with the fixed learning rate, the test error rate decreases very slowly and the curve is relatively flat compared to the other methods.
2) The schemes with decaying learning rates (the two sparse models and the dense model trained with the decaying rate) have similarly good convergence rates.
3) Using the same regularization parameter, sparse SGD achieves the sparsest model. Among the dense models, SGD with the decaying learning rate reaches a sparser model than the naive SGD with the fixed learning rate. This is further evidence that decaying learning rates are superior to fixed ones.
4) Refitting the sparse model clearly achieves the best test error rate while keeping the density of the W matrix low.

2) Retrieval Performance and Memory: In this section, we present the mean average precision (MAP), the test error rate and the memory required to store W for each method after training. The results are presented in Table II. We use an abbreviation for each method due to the space limitation of the table: "Identity" stands for W = I, and "Diagonal" means that W is a diagonal matrix where only the diagonal entries are learned. "SGD FLR" means SGD with the fixed learning rate, while "SGD DLR" means SGD with the decaying learning rate; both are trained on the basic model without l1 regularization. "Sparse" and "Sparse-R" are the sparse models without and with refitting, respectively.

The results on MAP are essentially consistent with those on the test error rate. The sparse method with the decaying learning rate performs similarly to its dense counterpart but takes much less memory. Moreover, the sparse model after refitting achieves the best performance.

3) Anecdotal Evidence: The sparse model can also provide useful semantic information. For example, we can easily read off the document words most related to a given query word from the sparse W matrix. More precisely, each row of W represents the strength of the correlation (either positive or negative) between the document words and a specific query word. Given the i-th query word, we sort the absolute values of the i-th row of W in descending order and pick the first few nonzero entries; those selected entries of W represent the document words most correlated with the given query word. In Table III, we present query words from different categories and the five most correlated document words according to the sparse W learned from Eq. (5). We can see that the most correlated document words clearly come from the same topics as the query words.

Table II
RETRIEVAL PERFORMANCE AND MEMORY. ITEMS IN BOLD FONTS ARE THE BEST AMONG METHODS TESTED. For each dataset, (a) 2NG, (b) RCV1 and (c) MNIST, the table reports MAP, test error and memory (MB) for the methods Identity, Diagonal, SGD FLR, SGD DLR, Sparse and Sparse-R.

Table III
THE EXAMPLES OF LEARNED RELATED WORD PAIRS IN 2NG

Query word     Five most related document words
clinton        clinton government health people gay
cpu            mac drive scsi card jon
graphics       graphics tiff image color polygon
handgun        gun weapons handgun militia fbi
hockey         hockey game espn colorado team
motorcycle     bike brake turbo rpi cylinder
religions      god religions bible christian jesus

The learned pairs also provide some interesting semantic information. For example, in 2NG, "fbi" is closely related to "handgun", and "colorado" to "hockey", which indicates that there might be a popular hockey team in Colorado (in fact, the Colorado Avalanche); "government" is related to "clinton", which indicates that Clinton might be a famous politician (the former US president Bill Clinton).

V. CONCLUSIONS

In contrast to traditional preference learning or learning to rank with a few hundred hand-crafted features, our basic model directly performs learning on actual words and considers their pairwise relationships between query and document. Although the pairwise relationships of words improve performance and provide additional semantic information, the basic model suffers from heavy storage requirements and parameter overfitting. To overcome these drawbacks, we introduce sparsity to the model, achieved by l1 regularization with an efficient online learning algorithm. We show on multiple benchmark datasets that our method achieves good performance with fast convergence, while remaining sparse during training (small memory consumption), on a model with hundreds of millions of parameters.

For future work, we will explore extended models that learn group structure or even hierarchical structure of the words using group lasso [17] type regularization. Prior knowledge on the structure of W can also be imposed.

REFERENCES

[1] J. Fürnkranz and E. Hüllermeier, "Pairwise preference learning and ranking," in ECML, 2003.
[2] T.-Y. Liu, Learning to Rank for Information Retrieval. Now Publishers Inc, 2009.
[3] T. Liu, J. Xu, T. Qin, W. Xiong, and H. Li, "LETOR: Benchmark dataset for research on learning to rank for information retrieval," in SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007.
[4] D. Grangier and S. Bengio, "A discriminative kernel-based approach to rank images from text queries," IEEE Trans. PAMI, vol. 30, no. 8, 2008.
[5] B. Bai, J. Weston, R. Collobert, and D. Grangier, "Supervised semantic indexing," in ECIR, 2009.
[6] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, 1996.
[7] M. J. Wainwright, "Sharp thresholds for noisy and high-dimensional recovery of sparsity using l1-constrained quadratic programming (lasso)," IEEE Transactions on Information Theory, vol. 55, 2009.
[8] L. Bottou and Y. LeCun, "Large-scale on-line learning," in Advances in Neural Information Processing Systems 15. MIT Press, 2004.
[9] J. Duchi and Y. Singer, "Efficient learning using forward-backward splitting," in Advances in Neural Information Processing Systems 23, 2009.
[10] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern Information Retrieval. Addison-Wesley, Harlow, England, 1999.
[11] R. Herbrich, T. Graepel, and K. Obermayer, Large Margin Rank Boundaries for Ordinal Regression. MIT Press, Cambridge, MA, 2000.
[12] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in ICML, 2003.
[13] F. Aiolli and A. Sperduti, "Learning preferences for multiclass problems," in Advances in Neural Information Processing Systems 18, 2004.
[14] R. Bekkerman and M. Scholz, "Data weaving: Scaling up the state-of-the-art in data clustering," in CIKM, 2008.
[15] D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization," Journal of Machine Learning Research, vol. 5, 2004.
[16] G. Cormack and T. Lynam, "Statistical precision of information retrieval evaluation," in SIGIR, 2006.
[17] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society, Series B, vol. 68(1), 2006.
