Fast Logistic Regression for Text Categorization with Variable-Length N-grams


Fast Logistic Regression for Text Categorization with Variable-Length N-grams Georgiana Ifrim *, Gökhan Bakır +, Gerhard Weikum * * Max-Planck Institute for Informatics Saarbrücken, Germany + Google Switzerland GmbH Zürich, Switzerland

Outline: Background; Related Work; Our Approach (Structured Logistic Regression); Experiments; Conclusion.

Background: Text Categorization using Machine Learning. Typical steps: training docs -> feature selection -> build classifier -> classify test docs. Typical document representation: bag-of-words. Document: "This phone is good." Document vector: x = (..., is, phone, good, this, ...); classifier weights: β = (..., 0.1, 0.3, 0.7, 0.3, ...). With this representation we lose structure, order, and context. Should we care?
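A minimal sketch of this bag-of-words setup; the toy vocabulary and weights below mirror the slide's example but are otherwise our own assumptions, not the paper's model:

```python
# Bag-of-words scoring with a logistic classifier (illustrative vocabulary
# and weights only; word order is completely ignored).
import math

vocab = ["is", "phone", "good", "this"]                    # toy vocabulary
beta = {"is": 0.1, "phone": 0.3, "good": 0.7, "this": 0.3} # toy weights

def bow_features(doc):
    """Binary bag-of-words features: which vocabulary words occur."""
    tokens = set(doc.lower().rstrip(".").split())
    return {w: 1.0 for w in vocab if w in tokens}

def p_positive(doc):
    """Logistic-regression probability P(y=1 | doc, beta)."""
    score = sum(beta[w] * v for w, v in bow_features(doc).items())
    return 1.0 / (1.0 + math.exp(-score))

print(p_positive("This phone is good."))  # same score for any word order
```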

Background - Motivation. +1: "This product is not bad, it's actually quite good." -1: "This product is not good, it's actually quite bad." Bag-of-words representation for both samples: (actually, bad, it's, good, not, product, quite, This). N-gram representation: +1: (This, product, is, not, bad, it's, actually, quite, good, This product, product is, not bad, ..., quite good, This product is, product is not, is not bad, ..., it's actually quite good, ...). -1: (This, product, is, not, good, it's, actually, quite, bad, This product, product is, not good, ..., quite bad, This product is, product is not, is not good, ..., it's actually quite bad, ...).

Background - Motivation. A simple solution to this problem: use syntax clues, i.e. n-grams for fixed n (e.g. all bigrams), via feature engineering. Problem with this idea: feature selection depends on the application, language, domain, and corpus size. Ideally: regard text as a sequence of word or character tokens and consider as features all the possible subsequences (n-grams) in the training set. (But: with u = # unigrams, there are d = O(u^n) n-grams; see the sketch below.)
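For concreteness, a small sketch of what enumerating all word n-grams up to a maximum length looks like; this is purely illustrative, since the paper's point is precisely that this space is never materialized explicitly:

```python
# Enumerate every contiguous word n-gram of length 1..n_max in a document.
def ngrams_up_to(tokens, n_max):
    """Yield all word n-grams of length 1 through n_max."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

tokens = "this product is not bad".split()
print(list(ngrams_up_to(tokens, 3)))
# With u distinct unigrams, the candidate space grows as O(u^n), which is
# why generating all features up front quickly becomes infeasible.
```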

Our Contribution. A new (Structured) Logistic Regression algorithm able to select a compact set of discriminative variable-length n-grams. SLR exploits the structure of the n-gram space in order to transform the learning problem into a search problem. SLR learns features such as: "not bad", "quite good", "not good", "quite bad".

Related Work. Efficient, regularized learning algorithms: SVM^perf (Joachims 2006); sparse logistic regression (Genkin 2007: BBR; Komarek 2003: LR-TRIRLS; Efron 2004; Zhang 2001). Text categorization using word/char n-grams: Markov chain models of fixed order (n-gram language models, Peng 2004) and variable order (PST - Prediction Suffix Trees, Dekel 2004; PPM, PPM* - Prediction by Partial Matching, Cleary 1997); SVM with string kernel (Lodhi 2001); efficient feature selection + SVM (Zhang 2006). Optimization approaches to solving logistic regression: Newton algorithms (O(d^2) memory); limited-memory BFGS (O(d)); conjugate gradient (O(d), Nocedal 2006); iterative scaling (O(d), Jin 2003); cyclic coordinate descent (O(d), Shevade 2003, Zhang 2001).

The Problem. Given: X, a set of samples, and a training set D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} with y_i ∈ {0, 1}. Features: all n-grams in the training set; d = # distinct n-grams in the training set; β = (β_1, β_2, ..., β_d) is the parameter vector. Log-likelihood of the training set for logistic regression: l(β) = Σ_{i=1..N} [ y_i β^T x_i − log(1 + e^{β^T x_i}) ]. Goal: learn a mapping f : X -> {0, 1} from the given training set D. Solved by: finding the parameter vector β that maximizes l(β), where P(y_i = 1 | x_i, β) = e^{β^T x_i} / (1 + e^{β^T x_i}).
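As a quick check of the definitions above, a minimal sketch of the log-likelihood and class probability on dense toy vectors (the real setting uses sparse n-gram features; the data here is made up):

```python
# Log-likelihood and class probability for binary logistic regression,
# following the formulas above.
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    scores = X @ beta
    return float(np.sum(y * scores - np.log1p(np.exp(scores))))

def p_positive(beta, x):
    """P(y=1 | x, beta) = exp(beta^T x) / (1 + exp(beta^T x))."""
    s = x @ beta
    return float(np.exp(s) / (1.0 + np.exp(s)))

# Toy data: 3 documents, 4 binary features
X = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]], dtype=float)
y = np.array([1, 0, 1], dtype=float)
print(log_likelihood(np.zeros(4), X, y))  # = -3 * log(2) at beta = 0
```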

Our Approach. Solve logistic regression by coordinate-wise gradient ascent: β_j^{new} = β_j^{old} + ε · ∂l/∂β_j(β^{old}), with step size ε found by line search. In each iteration: estimate the ascent direction and the step size. Computing the full gradient vector ∇l(β) = ( ∂l/∂β_1(β), ∂l/∂β_2(β), ..., ∂l/∂β_d(β) ) is not feasible for the space of all possible n-grams in the training set (u = # unigrams, d = O(u^n)). Our ascent direction is sparse: ( 0, 0, ..., ∂l/∂β_j(β), ..., 0, 0 ), where j = argmax_{j ∈ {1,...,d}} | ∂l/∂β_j(β) |.
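A sketch of one such coordinate-wise update, using an explicit feature matrix and a crude backtracking line search; this is our simplification for illustration, since SLR finds the best coordinate by searching the implicit n-gram space rather than computing the full gradient:

```python
# One coordinate-wise gradient-ascent step on an explicit feature matrix.
import numpy as np

def gradient(beta, X, y):
    """d l / d beta_j = sum_i x_ij * (y_i - P(y=1 | x_i, beta))."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

def coordinate_step(beta, X, y, step_sizes=(1.0, 0.5, 0.25, 0.1)):
    """Update only the coordinate with the largest |gradient|."""
    g = gradient(beta, X, y)
    j = int(np.argmax(np.abs(g)))
    best_beta, best_ll = beta, -np.inf
    for eps in step_sizes:                 # crude line search over step sizes
        cand = beta.copy()
        cand[j] += eps * g[j]
        s = X @ cand
        ll = float(np.sum(y * s - np.log1p(np.exp(s))))
        if ll > best_ll:
            best_beta, best_ll = cand, ll
    return best_beta

X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]], dtype=float)
y = np.array([1, 0, 1], dtype=float)
beta = coordinate_step(np.zeros(3), X, y)
```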

Searching & Pruning. Goal: find the coordinate j such that j = argmax_{j ∈ {1,...,d}} | ∂l/∂β_j(β) |. For any super-sequence s_{j'} ⊇ s_j, we can compute an upper bound on its gradient value as a function of the subsequence s_j: |gradient(s_{j'})| ≤ µ(s_j). We can therefore prune the search space rooted at s_j whenever µ(s_j) ≤ δ, where δ is the best (possibly suboptimal) gradient value found so far.

Searching & Pruning: branch-and-bound strategy. [Figure: search tree over sequences rooted at the unigrams a, b, c, with nodes annotated by their gradient, the current best value δ, and the bound µ, e.g. grad(a) = 0.5 (δ = 0.5, µ = 1); grad(abc) = 0.75 (δ = 0.75, µ = 0.9); grad(aaa) = 0.7 (δ = 0.75, µ = 0.72).] If µ(aaa) = 0.72, then the gradient of any super-sequence of aaa is not better than 0.72; since the current best is δ = 0.75, the subtree below aaa can be pruned. Goal, as before: find the coordinate j = argmax_{j ∈ {1,...,d}} | ∂l/∂β_j(β) |, using |gradient(s_{j'})| ≤ µ(s_j) for s_{j'} ⊇ s_j.

Upper Bound. Theorem: for any super-sequence s_{j'} ⊇ s_j it holds that | ∂l/∂β_{j'}(β) | ≤ µ(β_j), where µ(β_j) = max{ Σ_{i: y_i = 1, s_j ∈ x_i} x_{ij} · 1 / (1 + e^{β^T x_i}), Σ_{i: y_i = 0, s_j ∈ x_i} x_{ij} · e^{β^T x_i} / (1 + e^{β^T x_i}) }.
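Putting the bound and the pruning together, a compact sketch of the best-coordinate search over right-extended n-grams with binary occurrence features; the function and variable names are ours and this is not the original SLR implementation:

```python
# Branch-and-bound search for the n-gram with the largest |gradient|,
# using the bound above (binary features: x_ij = 1 iff n-gram s_j occurs
# in document i).
import numpy as np

def best_ngram(docs, y, scores, n_max=5):
    """docs: list of token lists; y: 0/1 labels; scores: beta^T x_i per doc."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))  # P(y=1 | x_i)
    y = np.asarray(y, dtype=float)

    def matches(ngram):
        """Indices of documents containing the n-gram contiguously."""
        n = len(ngram)
        return [i for i, d in enumerate(docs)
                if any(tuple(d[k:k + n]) == ngram for k in range(len(d) - n + 1))]

    def grad_and_bound(idx):
        g = float(sum(y[i] - p[i] for i in idx))               # gradient of coord.
        mu = max(sum(1.0 - p[i] for i in idx if y[i] == 1),    # bound valid for
                 sum(p[i] for i in idx if y[i] == 0))          # any super-sequence
        return g, mu

    best, delta = None, 0.0
    stack = [(t,) for t in {tok for d in docs for tok in d}]   # unigram seeds
    while stack:
        s = stack.pop()
        idx = matches(s)
        if not idx:
            continue
        g, mu = grad_and_bound(idx)
        if abs(g) > delta:
            best, delta = s, abs(g)
        if mu > delta and len(s) < n_max:                      # prune if mu <= delta
            stack.extend({tuple(docs[i][k:k + len(s) + 1])     # extend to the right
                          for i in idx
                          for k in range(len(docs[i]) - len(s))
                          if tuple(docs[i][k:k + len(s)]) == s})
    return best, delta

docs = [["not", "bad", "quite", "good"], ["not", "good", "quite", "bad"]]
print(best_ngram(docs, y=[1, 0], scores=[0.0, 0.0]))
```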

Experiments: test collections. IMDB: movie genre classification. Categories: Crime vs. Drama; 7,440 documents; 63,623 word unigrams, 800,000 n-grams up to trigrams, 1.9 million n-grams up to 5-grams. Task: given movie plots, learn the genre of the movies. CHINESE: topic detection for Chinese text. Corpus: TREC-5 People's Daily News (3,600 documents, 4,961 characters). Categories: (1) Politics, Law and Society; (2) Literature and Arts; (3) Education, Science and Culture; (4) Sports; (5) Theory and Academy; (6) Economics.

Experiments: compared methods. Linear-training-time SVM: SVM^perf (Joachims, KDD 2006). Bayesian logistic regression: BBR (Genkin et al., Technometrics 2007). Structured Logistic Regression: SLR (Ifrim et al.). Methodology: for SVM and BBR, generate the space of variable-length n-grams explicitly, e.g. for n = 3, generate all unigrams, bigrams and trigrams. Evaluate all methods w.r.t. training run-times and micro-/macro-averaged F1. Parameters: SVM: C = 100 (fixed), C in {100, ..., 500} (tuning); BBR: -p 1 -t 1 (fixed), -p 1 -t 1 C 5 autosearch (tuning); SLR: number of optimization iterations, set by a fixed threshold (0.005) on the aggregated change in score predictions, or varied between 200 and 1,000 for tuning.
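For reference, a short sketch of the micro- and macro-averaged F1 measures used in the comparison; these are the standard definitions with made-up counts, not the authors' evaluation code:

```python
# Micro- and macro-averaged F1 from per-category (tp, fp, fn) counts.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    macro = sum(f1(*c) for c in counts) / len(counts)    # average of per-class F1
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    micro = f1(tp, fp, fn)                               # F1 of pooled counts
    return micro, macro

print(micro_macro_f1([(90, 10, 5), (40, 20, 30), (5, 2, 8)]))
```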

IMDB, Macro-avg F1. 5-fold CV; plots show macro-avg F1 (micro-avg F1 is similar). n = maximum n-gram length, i.e. all n-grams up to length n. [Two plots: macro-avg F1 (roughly 65 to 75) for SVM, BBR and SLR on word n-grams and on char n-grams, for n = 1, 3, 5 and n unrestricted.] SLR is as good as the state of the art in terms of generalization ability.

IMDB, Running time. 5-fold CV; average running time per CV split, averaged over topics. n = maximum n-gram length, i.e. all n-grams up to length n. [Two plots: running time in minutes (0 to 50) for SVM, BBR and SLR on word n-grams and on char n-grams, for n = 1, 3, 5 and n unrestricted.] SLR is more than one order of magnitude faster than both SVM and BBR.

CHINESE, Macro-avg F1. n = maximum n-gram length, i.e. all n-grams up to length n. [Plot: macro-avg F1 (roughly 65 to 83) for SVM, BBR and SLR on char n-grams, for n = 1, 3, 5 and n unrestricted.] SLR is as good as the state of the art in terms of generalization ability. Char n-grams avoid the need for word segmentation in Asian text.

CHINESE, Running time. Training running time; n = maximum n-gram length, i.e. all n-grams up to length n. [Plot: running time in minutes (0 to 14) for SVM, BBR and SLR on char n-grams, for n = 1, 3, 5 and n unrestricted.] SLR is more than one order of magnitude faster than both SVM and BBR.

Interpretability: a look at the models (IMDB: Crime vs. Drama).
Crime, word n-grams: gang war (0.0078), in the middle of (0.0067), hired by (0.0058), from prison (0.0048).
Crime, char n-grams: "urde" (0.040), "gang" (0.039), "olice" (0.038), "crime" (0.034).
Drama, word n-grams: they meet (-0.0069), group of people (-0.0066), finds himself (-0.0015), an affair with (-0.0014).
Drama, char n-grams: "lov" (-0.020), "otion" (-0.015), "urney" (-0.014), "choo" (-0.011).
Char n-gram models select characteristic substrings (implicit stemming, robustness to morphological variations).

Conclusion. Structured Logistic Regression: a new algorithm for efficiently learning a classifier with unbounded n-grams. No prior feature selection is needed, and there is no restriction on the n-gram length. Using n-grams for text classification is beneficial and can be done efficiently.

Future Work. Looking at other text categorization applications (spam filtering, e-mail categorization, etc.); applying this model to supervised information extraction; carrying the same gradient projection ideas to other learning problems.

Thank you! Open source software for SLR: http://www.mpi-inf.mpg.de/~ifrim/slr