Scalable Term Selection for Text Categorization


Jingyang Li
National Lab of Intelligent Tech. & Sys.
Department of Computer Sci. & Tech.
Tsinghua University, Beijing, China

Maosong Sun
National Lab of Intelligent Tech. & Sys.
Department of Computer Sci. & Tech.
Tsinghua University, Beijing, China

In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, June 2007. © 2007 Association for Computational Linguistics.

Abstract

In text categorization, term selection is an important step for the sake of both categorization accuracy and computational efficiency. Different dimensionalities are expected under different practical resource restrictions of time or space. Traditionally in text categorization, the same scoring or ranking criterion is adopted for all target dimensionalities; such criteria, e.g. χ² or IG, consider both the discriminability and the coverage of a term. In this paper, the poor accuracy at low dimensionalities is attributed to the small average vector length of the documents. Scalable term selection is proposed to optimize the term set at a given dimensionality according to an expected average vector length. Discriminability and coverage are measured separately; by adjusting the ratio of their weights in a combined criterion, the expected average vector length can be reached, which means a good compromise between the specificity and the exhaustivity of the term subset. Experiments show that the accuracy is considerably improved at lower dimensionalities, and that for larger term subsets there is the possibility of lowering the average vector length for a lower computational cost. The interesting observations might inspire further investigations.

1 Introduction

Text categorization is a classical text information processing task which has been studied adequately (Sebastiani, 2002). A typical text categorization process usually involves these phases: document indexing, dimensionality reduction, classifier learning, classification and evaluation. The vector space model is frequently used for text representation (document indexing); dimensions of the learning space are called terms, or features in a general machine learning context. Term selection is often necessary because:

Many irrelevant terms have a detrimental effect on categorization accuracy due to overfitting (Sebastiani, 2002).

Some text categorization tasks have many relevant but redundant features, which also hurt categorization accuracy (Gabrilovich and Markovitch, 2004).

Considerations on computational cost: (i) Many sophisticated learning machines are very slow at high dimensionalities, such as LLSF (Yang and Chute, 1994) and SVMs. (ii) In Asian languages, the term set is often very large and redundant, which makes learning and predicting really slow. (iii) In some practical cases the computational resources (time or space) are restricted, as with hand-held devices, real-time applications and frequently retrained systems. (iv) Some deeper analysis or feature reconstruction techniques rely on matrix factorization (e.g. LSA based on SVD), which might be computationally intractable when the dimensionality is large.

Sometimes an aggressive term selection might be needed, particularly for (iii) and (iv). But it is notable that the dimensionality is not always directly connected to the computational cost; this issue will be touched on in Section 6.

Although we have many general feature selection techniques, domain-specific ones are preferred (Guyon and Elisseeff, 2003). Another reason for ad hoc term selection techniques is that many other pattern classification tasks have no sparseness problem (in this study sparseness means that a sample vector has few nonzero elements, not that the high-dimensional learning space has few training samples). As a basic motivation of this study, we hypothesize that the low accuracy at low dimensionalities is mainly due to the sparseness problem.

Many term selection techniques have been presented and some of them have been experimentally shown to be high-performing, such as Information Gain (IG), χ² (Yang and Pedersen, 1997; Rogati and Yang, 2002) and Bi-Normal Separation (Forman, 2003). Every one of them adopts a criterion that scores and ranks the terms; for a target dimensionality d, term selection is simply done by picking out the top-d terms from the ranked term set. These high-performing criteria have a common characteristic: both discriminability and coverage are implicitly considered.

discriminability: how unbalanced the distribution of the term is among the categories.

coverage: how many documents the term occurs in.

(Borrowing the terminologies from document indexing, we can say the specificity of a term set corresponds to the discriminability of each term, and the exhaustivity of a term set corresponds to the coverage of each term.)

The main difference among these criteria is to what extent the discriminability or the coverage is emphasized. For instance, empirically IG prefers high-frequency terms more than χ² does, which means IG emphasizes coverage more than χ² does. The problem is that these criteria are nonparametric and produce the same ranking for any target dimensionality. Small term sets face the specificity-exhaustivity dilemma. If the sparseness really is the main reason for the low performance of a small term set, the specificity should be moderately sacrificed to improve the exhaustivity for a small term set; that is to say, the term selection criterion should consider coverage more than discriminability. Contrariwise, coverage could be considered less for a large term set, because we need worry little about the sparseness problem and the computational cost might decrease.

The remainder of this paper is organized as follows: Section 2 describes the document collections used in this study, as well as other experiment settings; Section 3 investigates the relation between sparseness (measured by average vector length) and categorization accuracy; Section 4 explains the basic idea of scalable term selection and proposes a potential approach; Section 5 carries out experiments to evaluate the approach, during which some empirical rules are observed to complete the approach; Section 6 makes some further observations and discussions based on Section 5; Section 7 gives a concluding remark.

2 Experiment Settings

2.1 Document Collections

Two document collections are used in this study.

CE (Chinese Encyclopedia): This is from the electronic version of the Chinese Encyclopedia.
We choose a Chinese corpus as the primary document collection because Chinese text (like that of other Asian languages) has a very large term set, and a satisfying subset is usually not smaller than (Li et al., 2006); on the contrary, a dimensionality lower than suffices for general English text categorization (Yang and Pedersen, 1997; Rogati and Yang, 2002). For the computational cost reasons mentioned in Section 1, Chinese text categorization would benefit more from a high-performing aggressive term selection. This collection contains 55 categories and documents (9:1 split into training set and test set). Each document belongs to only one category. Each category contains documents. This collection was also used by Li et al. (2006).

20NG (20 Newsgroups): This classical English document collection is chosen as a secondary collection in this study to test the generality of the proposed approach. Some figures about this collection (the counterparts of Figures 1-4 for CE) are not shown in this paper because they are similar to CE's.

2.2 Other Settings

For the CE collection, character bigrams are chosen as the indexing unit for their high performance (Li et al., 2006); but the bigram term set suffers from its high dimensionality. This is exactly the case we tend to tackle. For the 20NG collection, the indexing units are stemmed words (stemming by Porter's Stemmer, tartarus.org/martin/porterstemmer/). Both term sets are df-cut by the most conservative threshold (df ≥ 2). The sizes of the two candidate term sets are |T_CE| = and |T_20NG| = .

Term weighting is done by

    tfidf(t_i, d_j) = log(tf(t_i, d_j) + 1) * log(N_d / (df(t_i) + 1)),

in which t_i denotes a term, d_j denotes a document, and N_d denotes the total number of documents. (In our experiments this form of tfidf always outperforms the basic form tfidf(t_i, d_j) = tf(t_i, d_j) * log(N_d / (df(t_i) + 1)).)

The classifiers used in this study are support vector machines (Joachims, 1998; Gabrilovich and Markovitch, 2004; Chang and Lin, 2001). The kernel type is set to linear, which is fast and sufficient for text categorization. Also, Brank et al. (2002) pointed out that the complexity and sophistication of the criterion itself is more important to the success of the term selection method than its compatibility in design with the classifier.

Performance is evaluated by the microaveraged F1-measure. For single-label tasks, microaveraged precision, recall and F1 have the same value. χ² is used as the term selection baseline for its popularity and high performance. (IG was also reported to be good; in our previous experiments, χ² is generally superior to IG.) In this study, features are always selected globally, which means the global score of a term is the maximum of its category-specific values (Sebastiani, 2002).

3 Average Vector Length (AVL)

In this study, vector length (how many different terms a document holds after term selection) is used as a straightforward sparseness measure for a document (Brank et al., 2002). Generally, document sizes have a lognormal distribution (Mitzenmacher, 2003). In our experiments, vector lengths are also found to be nearly lognormally distributed, as shown in Figure 1. If the correctly classified documents and the wrongly classified documents are investigated separately, they both yield a nearly lognormal distribution. Also in Figure 1, wrongly classified documents show a relatively large proportion at small vector lengths. Figure 2 demonstrates this with more clarity. Thus the hypothesis formed in Section 1 is confirmed: there is a strong correlation between the degree of sparseness and the categorization error rate.

[Figure 1: Vector Length Distributions (smoothed), on CE Document Collection; probability density vs. vector length for correctly classified, wrongly classified and all documents.]

[Figure 2: Error Rate vs. Vector Length (smoothed), on CE Collection, 5000 Dimensions by χ².]

Therefore, it is quite a straightforward thought to measure the sparseness of a term subset (or, more precisely, its exhaustivity) by the corresponding average vector length (AVL) over all documents. (Due to the lognormal distribution of vector length, it might seem more plausible to average the logarithmic vector length; however, for a fixed set of documents, (1/D) Σ_j log(d_j) holds a nearly fixed ratio to log((1/D) Σ_j d_j), in which D denotes the number of documents and d_j denotes the vector length of document j.)
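To make the sparseness measure concrete, the following is a minimal Python sketch (not from the paper) that builds tfidf-weighted document vectors over a selected term subset, using the logarithmic tf form from Section 2.2, and reports the resulting average vector length (AVL). All function and variable names here are illustrative assumptions, not the authors' implementation.

    import math
    from collections import Counter

    def build_vectors(docs, term_subset):
        """docs: list of token lists; term_subset: set of selected terms.
        Returns tfidf-weighted sparse vectors restricted to term_subset,
        using tfidf = log(tf + 1) * log(N_d / (df + 1)) as in Section 2.2."""
        n_docs = len(docs)
        # document frequency of each selected term
        df = Counter(t for doc in docs for t in set(doc) if t in term_subset)
        vectors = []
        for doc in docs:
            tf = Counter(t for t in doc if t in term_subset)
            vec = {t: math.log(f + 1) * math.log(n_docs / (df[t] + 1))
                   for t, f in tf.items()}
            vectors.append(vec)
        return vectors

    def average_vector_length(vectors):
        """AVL: mean number of distinct selected terms per document vector."""
        return sum(len(v) for v in vectors) / max(len(vectors), 1)

    # toy usage
    docs = [["a", "b", "b", "c"], ["b", "c", "d"], ["a", "d", "d"], ["c", "c", "e"]]
    vecs = build_vectors(docs, term_subset={"a", "b", "d"})
    print(average_vector_length(vecs))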

In the remainder of this paper, (log) AVL is an important metric used to assess and control the sparseness of a term subset.

4 Scalable Term Selection (STS)

Since the performance drop at low dimensionalities was attributed to low AVLs in the previous section, a scalable term selection criterion should automatically adapt its preference for high coverage to different target dimensionalities.

4.1 Measuring Discriminability and Coverage

The first step is to measure the discriminability and the coverage of a term separately. A basic guideline is that these two metrics should not be highly (positively) correlated; intuitively, they should have a slight negative correlation. The correlation of the two metrics can be estimated visually from their joint distribution. A bunch of term selection metrics were explored by Forman (2003). df (document frequency) is a straightforward choice to measure coverage. Since df follows Zipf's law (an inverse power law), log(df) is adopted.

High-performing term selection criteria themselves might not be good candidates for the discriminability metric because they take coverage into account. For example, Figure 3 shows that χ² is not satisfying. (For readability, the grayness is proportional to the log probability density in Figure 3, Figure 4 and Figure 12.) By comparison, probability ratio (Forman, 2003) is a more direct metric of discriminability:

    PR(t_i, c) = P(t_i | c+) / P(t_i | c-) = (df(t_i, c+) / df(c+)) / (df(t_i, c-) / df(c-)).

It is a symmetric ratio, so log(PR) is likely to be more appropriate. For multi-class categorization, a global value can be obtained as PR_max(t_i) = max_c PR(t_i, c), like χ²_max for χ² (Yang and Pedersen, 1997; Rogati and Yang, 2002; Sebastiani, 2002); for brevity, PR denotes PR_max hereafter. The joint distribution of log(PR) and log(df) is shown in Figure 4. We can see that the distribution is quite even and the two metrics have a slight negative correlation.

[Figure 3: (log(df), χ²) Distribution, on CE.]

[Figure 4: (log(df), log(PR)) Distribution, on CE.]

4.2 Combined Criterion

Now we have the two metrics, log(PR) for discriminability and log(df) for coverage, and a parametric term selection criterion comes forth:

    ζ(t_i; λ) = ( λ / log(PR(t_i)) + (1 - λ) / log(df(t_i)) )^(-1).

A weighted harmonic average is adopted here because either metric being too small is a severe detriment. λ ∈ [0, 1] is the weight for log(PR), which denotes how much the discriminability is emphasized. When the dimensionality is fixed, a smaller λ leads to a larger AVL and a larger λ leads to a smaller AVL.
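As an illustration of the combined criterion above, here is a small Python sketch (not from the paper) that computes log(df), log(PR_max) and the weighted harmonic-mean score ζ(t; λ), then keeps the top-k terms. The function name score_terms and the smoothing of zero counts are assumptions made only to keep the example runnable.

    import math
    from collections import defaultdict

    def score_terms(doc_terms, doc_labels, lam, k):
        """doc_terms: list of sets of terms per document; doc_labels: category per document.
        Returns the top-k terms under zeta(t; lam), the weighted harmonic mean of
        log(PR_max) (discriminability) and log(df) (coverage)."""
        categories = set(doc_labels)
        df = defaultdict(int)                            # overall document frequency
        df_pos = defaultdict(lambda: defaultdict(int))   # df of term within each category
        n_cat = defaultdict(int)                         # number of documents per category
        for terms, label in zip(doc_terms, doc_labels):
            n_cat[label] += 1
            for t in terms:
                df[t] += 1
                df_pos[t][label] += 1
        n_docs = len(doc_terms)
        scores = {}
        for t, d in df.items():
            pr_max = 0.0
            for c in categories:
                pos = (df_pos[t][c] + 0.5) / (n_cat[c] + 1.0)               # smoothed P(t | c+)
                neg = (d - df_pos[t][c] + 0.5) / (n_docs - n_cat[c] + 1.0)  # smoothed P(t | c-)
                pr_max = max(pr_max, pos / neg)
            disc = math.log(pr_max) if pr_max > 1.0 else 1e-9   # log(PR_max), floored to stay positive
            cov = math.log(d + 1.0)                             # log(df), shifted to stay positive
            scores[t] = 1.0 / (lam / disc + (1.0 - lam) / cov)  # weighted harmonic mean
        return sorted(scores, key=scores.get, reverse=True)[:k]

In the paper's procedure, λ is not set by hand but chosen so that the selected subset reaches a target AVL, as described next.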

The optimal λ should be a function of the expected dimensionality k:

    λ*(k) = arg max_λ F1(S_k(λ)),

in which the term subset S_k(λ) ⊂ T is selected by ζ(·; λ), |S_k(λ)| = k, and F1 is the default evaluation criterion. Naturally, this optimal λ leads to a corresponding optimal AVL, denoted AVL*(k). For a concrete implementation, we should have an empirical function that estimates AVL*(k) (or λ*(k)). In the next section, the values of AVL* (as well as λ*) for some k are figured out by experimental search; then an empirical formula for the estimated AVL*(k) comes forth. It is interesting and inspiring that, by adding the corpus AVL as a parameter, this formula is universal for different document collections, which makes the whole idea valuable.

5 Experiments and Implementation

5.1 Experiments

The expected dimensionalities (k) chosen for experimentation are, for CE: 500, 1000, 2000, 4000, ..., 32000, 64000; for 20NG: 500, 1000, 2000, ..., 16000, and the whole term set. (STS is tested on the whole T for 20NG but not for CE, because (i) T_CE is too large and time-consuming for training and testing, and (ii) χ² was previously tested on larger k and the performance (F1) is not stable for larger k.)

For a given document collection and a given target dimensionality, there is a corresponding AVL for each λ, and vice versa (within the possible value range of AVL). According to the observations in Section 5.2, AVL rather than λ is the direct concern, because it is more intrinsic; but λ is the quantity that can be tuned directly. So, in the experiments, we vary AVL by tuning λ to produce it, which means calculating λ(AVL). AVL(λ) is a monotone function and fast to compute. For a given AVL, the corresponding λ can be quickly found by a Newton iteration in [0, 1] (a bisection-style sketch of this search is given at the end of this subsection). In fact, AVL(λ) is not a continuous function, so λ is only tuned to get an acceptable match to the target AVL.

For each k, by the above way of fitting λ, we manually adjust AVL (only over integers) until F1(S_k(λ(AVL))) peaks. In this way, Figures 5-11 show manually tuned best-performing results, which serve as observations for deriving the empirical formulas.

Figure 5 shows the F1 peaks at different dimensionalities. Compared to χ², STS has a considerable potential superiority at low dimensionalities. The corresponding values of AVL* are shown in Figure 6, along with the AVLs of χ²-selected term subsets. The dotted lines show the trend of AVL*; at the overall dimensionality |T_CE| they have the same AVL. We can see that log(AVL*) is almost proportional to log(k) when k is not too large. The corresponding values of λ* are shown in Figure 7; the relation between λ* and log(k) is nearly linear.

Now it is necessary to explain why an empirical estimate of AVL*(k) derived from the straight line in Figure 6 can be used instead of AVL*(k) in practice. One important but not plotted property is that the performance of STS is not very sensitive to a small change of AVL. For instance, at k = 4000, AVL* = 120 and the F1 peak is %, and for AVL = 110 and 130 the corresponding F1 values are % and %; at the same k, the F1 of χ² selection is %. This characteristic of STS guarantees that the empirical estimate has a performance very close to that of AVL*(k); due to limited space, the performance curve of the estimate is not plotted in Section 5.2.

The same experiments are done on 20NG and the results are shown in Figure 8, Figure 9 and Figure 10. The performance improvements are not as significant as on the CE collection; this will be discussed in Section 6.2. The conspicuous relations between AVL*, λ* and k remain the same.
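The search for λ given a target AVL exploits the fact that AVL(λ) is monotone (decreasing) in λ. The sketch below (not from the paper) uses a simple bisection instead of the Newton iteration mentioned above, reusing the hypothetical score_terms scorer from the earlier sketch; since AVL(λ) is a step function, it only aims for an acceptable match.

    def avl_of_subset(doc_terms, subset):
        """Average number of selected terms per document; doc_terms are sets of terms."""
        s = set(subset)
        return sum(len(terms & s) for terms in doc_terms) / max(len(doc_terms), 1)

    def lambda_for_avl(doc_terms, doc_labels, k, target_avl, tol=1.0, iters=30):
        """Bisection on lambda in [0, 1]: a larger lambda yields a smaller AVL (monotone)."""
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            lam = (lo + hi) / 2.0
            subset = score_terms(doc_terms, doc_labels, lam, k)  # top-k terms under zeta(.; lam)
            avl = avl_of_subset(doc_terms, subset)
            if abs(avl - target_avl) <= tol:
                return lam
            if avl > target_avl:
                lo = lam   # AVL too large -> emphasize discriminability more
            else:
                hi = lam   # AVL too small -> emphasize coverage more
        return (lo + hi) / 2.0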
5.2 Algorithm Completion

In Figure 6 and Figure 9, the ratios of log(AVL*(k)) to log(k) are not the same on CE and 20NG. Taking into account the corpus AVL (the AVL produced by the whole term set), AVL_T_CE = and AVL_T_20NG = , we guess that log(AVL*(k)) / log(AVL_T) is capable of keeping the same ratio to log(k) for both CE and 20NG. This hypothesis is confirmed (though not for too high dimensionalities) by Figure 11; Section 6.2 contains some discussion on this.

[Figure 5: Performance Comparison (F1, %), on CE.]

[Figure 8: Performance Comparison (F1, %), on 20NG.]

[Figure 6: AVL* Comparison, on CE.]

[Figure 9: AVL* Comparison, on 20NG.]

[Figure 7: Optimal Weights of log(PR), on CE.]

[Figure 10: Optimal Weights of log(PR), on 20NG.]

[Figure 11: log(AVL*(k)) / log(AVL_T) versus dimensionality, on both CE and 20NG.]

From the figure, we get the value of this ratio (the base of the logarithm is set to e):

    γ = log(AVL*(k)) / (log(AVL_T) log(k)) = ,

which should be a universal constant for all text categorization tasks. So the empirical estimation of AVL*(k) is given by

    AVL~*(k) = exp(γ log(AVL_T) log(k)) = AVL_T^(γ log(k)),

and the final criterion is

    ζ*(t_i, k) = ζ(t_i; λ(AVL~*(k))) = ζ(t_i; λ(AVL_T^(γ log(k)))),

in which λ(·) can be calculated as in Section 5.1. The target dimensionality, k, is involved as a parameter, so the approach is named scalable term selection. As stated in Section 5.1, AVL~*(k) has a performance very close to that of AVL*(k), and its performance is not plotted here. (A small numerical sketch of this estimation step is given after the figures below.)

6 Further Observation and Discussion

6.1 Comparing the Selected Subsets

An investigation shows that for a quite large range of λ, term rankings by ζ(t_i; λ) and χ²(t_i) have a strong correlation (the Spearman rank correlation coefficient is bigger than 0.999). In order to compare the two criteria's preferences for discriminability and coverage, the selected subsets at different dimensionalities are shown in Figure 12 (the corresponding term density distribution was shown in Figure 4) and Figure 13.

[Figure 12: Selection Area Comparison of STS and χ² at Various Dimensionalities, on CE; axes log(df) vs. log(PR).]

[Figure 13: Selection Area Comparison of STS and χ² at Various Dimensionalities, on 20NG; axes log(df) vs. log(PR).]
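To show how the empirical formula is applied end to end, here is a brief Python sketch (not from the paper): given the corpus AVL and the constant γ, it estimates the target AVL for a dimensionality k and then selects the term subset, reusing the hypothetical lambda_for_avl and score_terms helpers from the earlier sketches. The value of GAMMA below is a placeholder, not the constant measured in the paper.

    import math

    GAMMA = 0.1  # placeholder value; the paper derives its own constant from Figure 11

    def estimated_avl(corpus_avl, k, gamma=GAMMA):
        """Empirical estimate: AVL~*(k) = AVL_T ** (gamma * log(k))."""
        return corpus_avl ** (gamma * math.log(k))

    def sts_select(doc_terms, doc_labels, k, corpus_avl):
        """Scalable term selection: pick lambda so the top-k subset reaches AVL~*(k)."""
        target = estimated_avl(corpus_avl, k)
        lam = lambda_for_avl(doc_terms, doc_labels, k, target)
        return score_terms(doc_terms, doc_labels, lam, k)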

For different dimensionalities, the selection areas of STS are represented by boundary lines, and the selection areas of χ² are represented by different levels of grayness. In Figure 12, STS shows its superiority at low dimensionalities by placing more emphasis on the coverage of terms. In Figure 13, STS shows its superiority at high dimensionalities by placing more emphasis on the discriminability of terms; lower coverage yields a smaller index size and a lower computational cost. At any dimensionality, STS yields a relatively fixed bound for either discriminability or coverage, rather than a compromise between them as χ² does; this is attributable to the harmonic averaging.

6.2 Adaptability of STS

There are actually two kinds of sparseness in a (vectorized) document collection:

collection sparseness: the high-dimensional learning space contains few training samples;

document sparseness: a document vector has few nonzero dimensions.

In this study, only the document sparseness is investigated. The collection sparseness might be a hidden factor influencing the actual performance on different document collections. This might explain why the explicit characteristics of STS are not the same on CE and 20NG (comparing with χ²; see Figure 5, Figure 6, Figure 8 and Figure 9):

CE. The significant F1 improvements at low dimensionalities sacrifice the shortness of AVL. In some learning process implementations, it is AVL rather than k that determines the computational cost; in many other cases, k is the determinant. Furthermore, possible post-processing, like matrix factorization, might benefit from a low k.

20NG. The F1 improvements at low dimensionalities are not quite significant, but AVL remains at a lower level. For higher k, there is little difference in F1, but the smaller AVL yields a lower computational cost than χ².

Nevertheless, STS shows a stable behavior across various dimensionalities and quite different document collections. The existence of the universal constant γ empowers it to be adaptive and practical. As shown in Figure 11, STS draws the relative log AVL*(k) onto the same straight line, γ log(k), for different document collections. This might mean that the relative AVL is an intrinsic demand of the term subset size k.

7 Conclusion

In this paper, Scalable Term Selection (STS) is proposed and supposed to be more adaptive than traditional high-performing criteria, viz. χ², IG, BNS, etc. The basic idea of STS is to measure discriminability and coverage separately, and to adjust the relative importance between them to produce an optimal term subset of a given size. Empirically, a constant relation between the target dimensionality and the optimal relative average vector length is found, which turns the idea into an implementation. STS showed considerable adaptivity and stability across various dimensionalities and quite different document collections. An increase in categorization accuracy at low dimensionalities and a decrease in computational cost at high dimensionalities were observed.

Some observations are notable: the log-linear relation between the optimal average vector length (AVL*) and the dimensionality, the semi-log-linear relation between the weight λ* and the dimensionality, and the universal constant γ. For future work, STS needs to be conducted on more document collections to check whether γ is really universal. In addition, there could be other implementations of the general idea, via other metrics of discriminability and coverage, other weighted combination forms, or other term subset evaluations.
Acknowledgement

The research is supported by the National Natural Science Foundation of China under grant numbers and .

References

Janez Brank, Marko Grobelnik, Nataša Milic-Frayling, and Dunja Mladenic. 2002. Interaction of feature selection methods and linear classification models. In Workshop on Text Learning held at ICML 2002.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines. Software available at cjlin/libsvm.

George Forman. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3.

Evgeniy Gabrilovich and Shaul Markovitch. 2004. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 41, New York, NY, USA. ACM Press.

Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3.

Thorsten Joachims. 1998. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML '98, number 1398. Springer Verlag, Heidelberg, DE.

Jingyang Li, Maosong Sun, and Xian Zhang. 2006. A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization. In Proceedings of COLING-ACL '06. Association for Computational Linguistics, July.

Michael Mitzenmacher. 2003. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1.

Monica Rogati and Yiming Yang. 2002. High-performing feature selection for text classification. In Proceedings of CIKM '02. ACM Press.

Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47.

Yiming Yang and Christopher G. Chute. 1994. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems (TOIS), 12(3).

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Douglas H. Fisher, editor, Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US. Morgan Kaufmann Publishers, San Francisco, US.
