Machine Learning for NLP: Supervised learning techniques. Saturnino Luz. ESSLLI 07, Dublin, Ireland.

Machine Learning for NLP: Supervised learning techniques
Saturnino Luz, Dept. of Computer Science, Trinity College Dublin, Ireland
ESSLLI 07, Dublin, Ireland

Outline
- Introduction to supervised learning: main concepts, notation, some supervised methods
- NLP applications (case studies):
  - Text categorisation (in detail): data representation; target and categorisation functions; algorithms (probabilistic, symbolic); evaluation
  - Word-sense disambiguation (briefly)

The uses of supervised learning
Supervised learning is possibly the type of machine learning method most widely used in natural language processing applications.
- A supervised learner has access to a "teacher" which describes the function to be learnt over a number of training examples. In practice, the teacher is an annotated data set (corpus, etc.).
- The inductive process builds an approximation $\hat{f}$ of a target function $f$.

Classification tasks
Supervised learning methods are usually employed in learning classification tasks. Some notation:
- $D = \{d_1, \ldots, d_{|D|}\}$: a set of data instances.
- $C = \{c_1, \ldots, c_{|C|}\}$: a set of categories with respect to which instances will be classified.
In its most general form, a target function has signature $f : D \to 2^C$.
Classes are usually defined through manual annotation, or labelling.

Representing the classification function
Multi-label classification (MLC; the general form of $f$, above): an instance may have any number of categories (e.g. classification of news articles).
In practice, one often uses (category-specific) single-label classifiers of the form $\hat{f} : D \times C \to \{0,1\}$, such that:

$$\hat{f}(d, c) = \begin{cases} 1 & \text{if } d \text{ belongs to class } c \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Multi-label classification can then be treated as a composition of single-label classifiers: make $\hat{f}(d) = \{c \mid \hat{f}(d, c) = 1\}$ (assuming the $c$'s are stochastically independent).
Controlled vocabulary keyword assignment and document classification into web hierarchies are examples of MLC, whereas ambiguity resolution in NLP is an example of single-label classification (SLC). Spam filtering is an example of the binary case of SLC.

Data representation
As we have seen in lecture 1, the data presented to the learning algorithm need to be uniformly represented.
We assume that instances will be represented as feature vectors $\vec{d}_i = \langle t_1, \ldots, t_n \rangle$ whose values $t_1, \ldots, t_n \in T$ will vary depending on the data representation scheme chosen.
E.g.: the training instances for last lecture's draughts player are represented by 5-dimensional integer-valued feature vectors: $\langle bp(b), rp(b), bk(b), \ldots \rangle$.

Performance measure
Classification performance can be evaluated in different ways, depending on the domain. Commonly used measures are accuracy and error.
For most NLP applications it usually makes more sense to measure performance in terms of precision, recall, and combinations of these two measures such as F scores and break-even points.

Algorithms: Hard vs. Soft classification
A classification function $\hat{f} : D \times C \to \{0,1\}$ performs what we call hard classification.
Another possibility is to allow $\hat{f}$ to range over real values in the interval $[0,1]$ (i.e. $\hat{f}(d, c) = v$, where $0 \le v \le 1$).
This soft classification approach is equivalent to ranking the classes in $C$ according to their appropriateness to each instance $d$.
Soft classification can be turned into hard classification via thresholding.
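The relation between soft classification, ranking and thresholding can be sketched in a few lines of Python. This is only an illustration: the function names, the scores and the uniform threshold of 0.5 are assumptions made for the example.

```python
def rank_categories(scores):
    """Order categories from most to least appropriate for one instance."""
    return sorted(scores, key=scores.get, reverse=True)


def harden(scores, thresholds):
    """Map soft scores f(d, c) in [0, 1] to hard decisions in {0, 1}."""
    return {c: int(v >= thresholds[c]) for c, v in scores.items()}


# Hypothetical soft scores for one document d and three categories.
scores = {"earn": 0.35, "acq": 0.81, "wheat": 0.05}
thresholds = {"earn": 0.5, "acq": 0.5, "wheat": 0.5}

ranking = rank_categories(scores)
decisions = harden(scores, thresholds)
print(ranking)
print(decisions)
```

Using one threshold per category, rather than a single global value, leaves room for categories of very different generality.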

Classifier induction and training experience
Inducing a classification function $\hat{f}$ through supervised learning involves a train-and-test strategy.
The data set $D$ is split into:
- a training set, $D_t$,
- a test set, $D_e$, and
- (sometimes) a validation set, $D_v$, reserved for parameter tuning.
The strategy:
1. induce $\hat{f}$ from $D_t$
2. tune parameters on $D_v$
3. test by comparing $\hat{f}$ to $f$ on $D_e$

Data sparsity and cross validation
Important: no data used in training ($D_t$) should be used for testing ($D_e$). Evaluation must show that $\hat{f}$ generalises to unseen data (i.e. that overfitting has been avoided).
In many areas, NLP in particular, data sparsity is a problem. ($k$-fold) cross validation is often used to deal with it:
- build $k$ classifiers $\hat{f}_1, \ldots, \hat{f}_k$ over $k$ partitions $D_1, \ldots, D_k$ of $D$;
- the train-and-test procedure is applied iteratively on pairs $\langle D_t^i, D_e^i \rangle$ of training and test partitions, where $D_t^i = D \setminus D_i$ and $D_e^i = D_i$ for $1 \le i \le k$.
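The k-fold partitioning scheme can be sketched as follows. This is a minimal illustration under the assumption that instances are held in a Python list; the function name `k_fold_splits` and the interleaved way of forming the partitions are choices made for the example.

```python
def k_fold_splits(data, k):
    """Yield (training, test) pairs, one per partition D_1, ..., D_k."""
    folds = [data[i::k] for i in range(k)]  # k disjoint partitions of D
    for i in range(k):
        test = folds[i]  # D_e^i = D_i
        # D_t^i = D \ D_i: everything outside the held-out partition
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


data = list(range(10))  # stand-in for ten annotated instances
splits = list(k_fold_splits(data, k=5))
for train, test in splits:
    print(sorted(train), test)
```

Each instance is used for testing exactly once and never appears in the training set of the classifier that is tested on it.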

Algorithms: inference methods
Symbolic, numeric and meta-classification methods.
- Numeric methods implement classification indirectly: the classification function $\hat{f}$ outputs a numerical score, and hard classification is obtained via thresholding. E.g.: probabilistic classifiers, regression methods, ...
- Symbolic methods usually implement hard classification directly. E.g.: decision trees, decision rules, ...
- Meta-classification methods combine results from independent classifiers. E.g.: classifier ensembles, committees, ...

Probabilistic classifiers
Probabilistic classifiers output an estimate of the conditional probability $P(c \mid \vec{d}) = \hat{f}(d, c)$ that an instance represented as $\vec{d}$ should be classified as $c$.
The elements of $\vec{d}$ are treated as random variables $T_i$ ($1 \le i \le |T|$).
We would need to estimate probabilities for all possible representations, i.e. $P(c \mid T_1, \ldots, T_{|T|})$.
This is too costly in practice: in the discrete case, with $n$ possible nominal values per feature, that is $O(n^{|T|})$.
Independence assumptions help.

Conditional independence assumption
Using Bayes' rule we get

$$P(c \mid \vec{d}_j) = \frac{P(c)\,P(\vec{d}_j \mid c)}{P(\vec{d}_j)} \qquad (2)$$

Naïve Bayes classifiers assume that $T_1, \ldots, T_{|T|}$ are independent of each other given the target category:

$$P(\vec{d} \mid c) = \prod_{k=1}^{|T|} P(t_k \mid c) \qquad (3)$$

- maximum a posteriori hypothesis: choose the $c$ that maximises the posterior in (2)
- maximum likelihood hypothesis: choose the $c$ that maximises $P(\vec{d}_j \mid c)$

Variants of Naive Bayes classifiers
- multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and
- multinomial models, where the variables represent count data [McCallum and Nigam, 1998].
- Numeric data representation: attributes are represented by continuous probability distributions. Using Gaussian distributions, the conditionals can be estimated as

$$P(T_i = t \mid c) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\mu)^2}{2\sigma^2}} \qquad (4)$$

  where $\mu$ and $\sigma$ are the mean and standard deviation of $T_i$ among the instances of category $c$.
- Non-parametric kernel density estimation has also been proposed [John and Langley, 1995].
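As an illustration of how (2), (3) and (4) fit together, the sketch below picks the maximum a posteriori category for a real-valued feature vector. It is a minimal example under stated assumptions: the category names, priors and per-feature (mean, standard deviation) pairs are invented, and logarithms are used only to avoid multiplying many small numbers.

```python
import math


def gaussian(t, mu, sigma):
    """Gaussian density used as the estimate of P(T_i = t | c), as in (4)."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def map_class(instance, priors, params):
    """Return argmax_c P(c) * prod_k P(t_k | c); P(d) is constant and dropped."""
    def log_score(c):
        score = math.log(priors[c])
        for t, (mu, sigma) in zip(instance, params[c]):
            score += math.log(gaussian(t, mu, sigma))
        return score
    return max(priors, key=log_score)


# Hypothetical model: two categories, two features, (mu, sigma) per feature.
priors = {"c1": 0.5, "c2": 0.5}
params = {"c1": [(0.0, 1.0), (0.0, 1.0)], "c2": [(5.0, 1.0), (5.0, 1.0)]}

print(map_class([0.2, -0.1], priors, params))
print(map_class([4.8, 5.3], priors, params))
```

With equal priors, as here, the maximum a posteriori and maximum likelihood hypotheses coincide.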

Uses of NB in NLP
- Information retrieval [Robertson and Jones, 1988]
- Text categorisation (see [Sebastiani, 2002] for a survey)
- Spam filters
- Word sense disambiguation [Gale et al., 1992]

A symbolic method: decision trees
Symbolic methods offer the advantage that their classification decisions are easily interpretable.
Decision trees:
- data are represented as vectors of discrete-valued (or discretised) attributes
- classification proceeds through binary tests on highly discriminative features
- the test sequence is encoded as a tree structure, for example:

    outlook
      sunny    -> humidity
                    high   -> no
                    normal -> yes
      overcast -> yes
      rainy    -> windy
                    true   -> no
                    false  -> yes

A sample data set (Mitchell, 97)

     outlook   temperature  humidity  windy  play
  1  sunny     hot          high      false  no
  2  sunny     hot          high      true   no
  3  overcast  hot          high      false  yes
  4  rainy     mild         high      false  yes
  5  rainy     cool         normal    false  yes
  6  rainy     cool         normal    true   no
  7  overcast  cool         normal    true   yes
  8  sunny     mild         high      false  no
  9  sunny     cool         normal    false  yes
 10  rainy     mild         normal    false  yes
 11  sunny     mild         normal    true   yes
 12  overcast  mild         high      true   yes
 13  overcast  hot          normal    false  yes
 14  rainy     mild         high      true   no

Divide-and-conquer learning strategy
Choose the features that better divide the instance space. E.g. the distribution of attributes for the tennis weather task:
[Figure: bar charts of play=yes and play=no counts for each value of outlook (overcast, rainy, sunny), temperature (cool, hot, mild), humidity (high, normal) and windy (false, true)]

Uses of decision trees in NLP
- Parsing [Haruno et al., 1999, Magerman, 1995]
- Text categorisation [Lewis and Ringuette, 1994]
- Word-sense disambiguation, POS tagging, ...

Instance-based methods
The majority of instance-based learners are lazy learners.
"The importance of being lazy": instead of estimating the target function once for the whole instance space, estimate it locally and differently for each new instance.
A family of related techniques:
- k-nearest neighbour
- Locally weighted regression
- Radial basis functions
- Case-based reasoning

k-nearest neighbours
Key idea: just store all training examples $\langle x_i, f(x_i) \rangle$.
Nearest neighbour: given a query instance $x_q$, first locate the nearest training example $x_n$, then estimate $\hat{f}(x_q) \leftarrow f(x_n)$.
k-nearest neighbour for:
- classification (discrete-valued target function): given $x_q$, take a vote among its $k$ nearest neighbours, or
- regression (real-valued $f$): take the mean of the $f$ values of the $k$ nearest neighbours:

$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$$

When to consider using nearest neighbour
Advantages:
- Training is very fast
- Learns complex target functions
- Doesn't lose information
Disadvantages:
- Slow at query time
- Easily fooled by irrelevant attributes
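Both uses of k-nearest neighbour can be sketched in a few lines. The example below assumes numeric feature vectors compared by Euclidean distance, which is one possible choice among several; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train, xq, k):
    """Return the k stored examples (x_i, f(x_i)) closest to the query xq."""
    return sorted(train, key=lambda ex: math.dist(ex[0], xq))[:k]


def knn_classify(train, xq, k):
    """Discrete-valued target: majority vote among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(train, xq, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, xq, k):
    """Real-valued target: mean of f over the k nearest neighbours."""
    return sum(y for _, y in k_nearest(train, xq, k)) / k


labelled = [((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
            ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
numeric = [((0,), 0.0), ((1,), 1.0), ((2,), 2.0), ((10,), 10.0)]

print(knn_classify(labelled, (4.5, 5), k=3))
print(knn_regress(numeric, (1.2,), k=3))
```

"Training" here amounts to keeping the list of examples; all the work, including the sort over the whole training set, happens at query time, which is the trade-off listed above.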

Uses of instance-based methods in NLP
- POS tagging [Daelemans et al., 1996]
- Text categorisation, where kNN ranks among the best performing techniques [Yang and Chute, 1994, Sebastiani, 2002]
- Speech synthesis [Daelemans and van den Bosch, 1996]
- Probability estimation for statistical parsing [Hogan, 2007]
- Information extraction [Zavrel and Daelemans, 2003]

Support Vector Machines
SVMs have become a very popular ML technique in recent years due to their scalability to high dimensionality and robustness to overfitting.
SVMs can be explained in geometrical terms as follows:
- Decision surfaces: planes $\sigma_1, \ldots, \sigma_n$ in a $|T|$-dimensional space which separate positive and negative training examples.
- Given $\sigma_1, \sigma_2, \ldots$, find a $\sigma_i$ which separates positive from negative examples by the widest possible margin.

An example: 2-d case...
Assume positive and negative instances are linearly separable (decision surfaces are $(|T|-1)$-hyperplanes):
[Figure: positive and negative training examples in the plane, with the "best" decision surface $\sigma_i$ separating them by the widest margin]

Non linearly-separable data
Use a kernel function to project the original feature space onto a higher-dimensional space. E.g. [Russell and Norvig, 2003]:
[Figure: (a) two-dimensional data whose true decision boundary is a circle; (b) the same data after mapping to the three-dimensional input space $(x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$]

Uses of SVM in NLP
- Text categorisation [Joachims, 1998], a task for which the method's ability to handle large feature sets seems to be particularly useful
- Tagging, parsing [Collins and Duffy, 2002]

Meta-classifiers: ensembles
Basic idea: apply different classifiers $\hat{f}_1, \hat{f}_2, \ldots$ to the classification task and then combine the outputs appropriately.
Ensembles are characterised according to:
- the kinds of classifiers $\hat{f}_i$ they employ: ideally, these classifiers should be as independent as possible
- the way they combine multiple classifier outputs. Examples: majority voting (for committees of binary classifiers), weighted linear combination (for probabilistic outputs), dynamic classifier selection, adaptive classifier selection, ...
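Two of the combination schemes named above, majority voting and weighted linear combination, can be sketched as follows. The keyword-based committee members and the weights are hypothetical and serve only to make the example runnable.

```python
def majority_vote(classifiers, d):
    """Committee of binary classifiers: return 1 if more than half vote 1."""
    votes = [f(d) for f in classifiers]
    return int(2 * sum(votes) > len(votes))


def weighted_combination(classifiers, weights, d):
    """Weighted linear combination of the members' (soft) outputs."""
    total = sum(weights)
    return sum(w * f(d) for f, w in zip(classifiers, weights)) / total


# Hypothetical committee members: each fires on one keyword.
committee = [
    lambda d: int("stake" in d),
    lambda d: int("shares" in d),
    lambda d: int("merger" in d),
]
doc = {"stake", "shares", "filing"}

vote = majority_vote(committee, doc)
soft = weighted_combination(committee, [0.5, 0.3, 0.2], doc)
print(vote, soft)
```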

Meta-classifiers: Boosting
The basic idea:
- all classifiers in the ensemble are obtained via the same learning method
- classifiers are trained sequentially, rather than independently (i.e. the training of $\hat{f}_i$ takes into account the performance of $\hat{f}_1, \ldots, \hat{f}_{i-1}$)
- ADABOOST: each pair $\langle d_j, c_i \rangle$ is assigned an importance weight $h^t_{ij}$ in $\hat{f}_t$, which represents how hard it is to get a correct decision for $\langle d_j, c_i \rangle$ in $\hat{f}_1, \ldots, \hat{f}_t$

Uses of meta-classification methods in NLP
- Parsing [Collins and Koo, 2005]
- Word-sense disambiguation [Pedersen, 2000, Escudero et al., 2000]
- Text categorisation [Sebastiani, 2002]

Case study: text categorisation
Text categorisation is a task which consists of assigning pre-defined symbolic labels (categories) to natural language texts (see Lecture 1).
Two approaches:
- category-pivoted categorisation: given a category $c$, the classifier must find all text documents $d$ that should be filed under $c$.
- document-pivoted categorisation: given a text document $d$, the classifier must find all categories $c$ under which $d$ should be filed.

Text categorisation data
REUTERS-21578: a commonly used corpus

    <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" ID="96">
    <DATE>26 FEB :32:37.30</DATE>
    <TOPICS><D>acq</D></TOPICS>
    <PLACES><D>usa</D></PLACES>
    <PEOPLE></PEOPLE>
    <ORGS></ORGS>
    <EXCHANGES></EXCHANGES>
    <COMPANIES></COMPANIES>
    <TEXT>
    <TITLE> INVESTMENT FIRMS CUT CYCLOPS <CYL> STAKE </TITLE>
    <DATELINE> WASHINGTON, Feb 26 </DATELINE>
    <BODY>A group of affiliated New York investment firms said they lowered their stake in Cyclops Corp to 260,500 shares, or 6.4 pct of the total outstanding common stock, from 370,500 shares, or 9.2 pct. In a filing with the Securities and Exchange Commission, the group, led by Mutual Shares Corp, said it sold 110,000 Cyclops common shares on Feb 17 and 19 for 10.0 mln dlrs. Reuter </BODY>
    </TEXT>
    </REUTERS>

TC and supervised learning: preliminaries
- Assume a document-pivoted categorisation perspective.
- $D$: a set of textual documents $d_i$, i.e. a corpus, with sub-corpora for training ($D_t$), testing ($D_e$) and validation ($D_v$).
- Document labels correspond to the category set $C$. In the Reuters corpus, for instance, some of the elements of this set would be acq (labelling documents about company acquisitions), housing, vatican, etc.

Labelling constraints
Multi-label text categorisation (MLTC): a document may belong to any number of categories.
In MLTC, for a given document $d$ in $D$, one might have subsets of $C$, $C_m = \{c_1, \ldots, c_k\}$ such that $k > 1$ and $\hat{f}(d, c_1) = \ldots = \hat{f}(d, c_k) = 1$. Example: REUTERS news articles.
If $|C_m| = 1$ for all $d \in D$, we have what is called a single-label classifier (SLTC). Example: Boolean labelling, as in spam filtering.

Category generality
The generality of a category $c_i$ in the context of a text classification system, given a corpus $\Omega$, is defined as the proportion of documents that belong to $c_i$:

$$g_\Omega(c_i) = \frac{|\{d_j \in \Omega \text{ s.t. } f(d_j, c_i) = 1\}|}{|\Omega|} \qquad (5)$$

Analogously, one can define category generality for training sets, $g_{Tr}(c_i)$, validation sets, $g_{Tv}(c_i)$, and test sets, $g_{Te}(c_i)$.

Texts as feature vectors
The representation strategy most commonly adopted in text categorisation (and information retrieval) consists of:
- selecting a set of terms $T$ in the corpus (also known as features),
- encoding all documents $d_j$ as vectors $\vec{d}_j = \langle t_{1j}, \ldots, t_{|T|j} \rangle$,
where $t_{kj}$ represents how much feature $k$ contributes to the semantics of text $d_j$.

Possible implementations
Variations on the vector representation:
- different ways of defining terms (features): compounds, semantic dependencies, ...
- different ways of computing weights: which words, phrases, etc. are the most relevant? How do we quantify relevance?
One can eliminate large numbers of candidate features before any statistical processing is done.

Before indexing...
Some pre-processing of texts may help reduce dimensionality even before any indexing is done or weights are computed.
One often removes stop words as the first step of text processing. These include topic-neutral types and function words (prepositions, conjunctions, articles).
Stemming is also frequently employed. It involves clustering together types that share the same morphological root. For example: { cluster, clustering, clustered, ... }

Different ways of defining features
Words or phrases? Should we represent "Saddam Hussein" as a single feature or as two distinct features? How about "White House", and "Text categorisation"?
Richer representation schemes: the Darmstadt Indexing Approach uses position of terms in $d$, properties of terms, category generality, etc.
How about semantics?

Different ways of assigning values to features
Alternative ways of assigning values to document vectors also affect classification. Notation:
- $\#(t_k, d_j)$ = number of occurrences of $t_k$ in $d_j$
- $\#_{Tr}(t_k)$ = number of documents in $Tr$ in which $t_k$ occurs
Three approaches:
- sets of words: binary values indicate presence or absence of the feature in the document
- bags (or multi-sets) of words: values quantify the number of occurrences of a feature
- relative frequency: term-frequency scores or probabilistic weights
Computing weights by relative frequency:

$$tfidf(t_k, d_j) = \#(t_k, d_j) \log \frac{|Tr|}{\#_{Tr}(t_k)} \qquad (6)$$
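The tfidf weighting in (6) can be sketched directly from the two counts it uses. The example assumes documents are already tokenised into lists of terms and uses the natural logarithm; the function name and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each term of each document by #(t, d) * log(|Tr| / #Tr(t))."""
    n_docs = len(docs)
    # #Tr(t): number of documents in which t occurs at least once
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)  # #(t, d): occurrences of t in d
        vectors.append({t: term_freq[t] * math.log(n_docs / doc_freq[t])
                        for t in term_freq})
    return vectors


docs = [["stake", "shares", "stake"],
        ["shares", "wheat"],
        ["wheat", "export", "wheat"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that occurs in every training document receives weight 0, since log(1) = 0, which matches the intuition that it does not discriminate between documents.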

Dimensionality Reduction
DR: a processing step whose goal is to reduce the size of the vector space from $|T|$ to $|T'| \ll |T|$. $T'$ is called the Reduced Term Set.
Benefits of DR:
- Lower computational cost for ML
- Avoidance of overfitting (training on constitutive features, rather than contingent ones)
Rule-of-thumb: overfitting is avoided if the number of training examples is proportional to the size of $T'$ (For TC, experiments have suggested a ratio of texts per feature).

DR by feature selection vs. DR by feature extraction
- DR by term selection, or Term Space Reduction (TSR): $T'$ is a subset of $T$. Order $T$ by a score $r_i(t_k)$ that quantifies the relevance of $t_k$ as an indicator of $c_i$, and choose the top $|T'|$ terms.
- DR by term extraction: the terms in $T'$ are not of the same type as the terms in $T$, but are obtained by combinations or transformations of the original ones. E.g.: in DR by term extraction, if the terms in $T$ are words, the terms in $T'$ may not be words at all.

Term Space Reduction
There are two ways to reduce the term space:
- TSR by term wrapping: the ML algorithm itself is used to reduce term space dimensionality.
- TSR by term filtering: terms are ranked according to their importance ($r_i(t_k)$) for the TC task and the highest-scoring ones are chosen.
Performance is measured in terms of aggressiveness, the ratio between the sizes of the original and reduced feature sets: $\frac{|T|}{|T'|}$.
A detailed comparison of TSR techniques can be found in [Yang and Pedersen, 1997].

Local vs. Global DR
DR can be done for each category or for the whole set of categories:
- Local DR: for each category $c_i$, a set $T'_i$ of terms ($|T'_i| \ll |T|$) is chosen for classification w.r.t. $c_i$. Different term sets are used for different categories.
- Global DR: a set $T'$ of terms ($|T'| \ll |T|$) is chosen for all categories $C = \{c_1, \ldots, c_{|C|}\}$.
Local scores can be combined into a global ranking through
- sum: $r_{sum}(t_k) = \sum_{i=1}^{|C|} r_i(t_k)$,
- weighted average: $r_{wavg}(t_k) = \sum_{i=1}^{|C|} P(c_i)\, r_i(t_k)$, or
- maximisation: $r_{max}(t_k) = \max_{i=1}^{|C|} r_i(t_k)$.
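The three ways of turning local scores into a global ranking can be sketched as below. The data layout (a dictionary of per-category score tables), the category priors and all numeric values are assumptions made for the example; a term missing from a category's table is treated as having local score 0.

```python
def combine_local_scores(local, priors):
    """local[c][t] holds r_i(t_k); priors[c] holds P(c_i)."""
    terms = sorted({t for scores in local.values() for t in scores})
    r_sum = {t: sum(local[c].get(t, 0.0) for c in local) for t in terms}
    r_wavg = {t: sum(priors[c] * local[c].get(t, 0.0) for c in local)
              for t in terms}
    r_max = {t: max(local[c].get(t, 0.0) for c in local) for t in terms}
    return r_sum, r_wavg, r_max


# Hypothetical local relevance scores for two categories.
local = {"acq": {"stake": 0.4, "wheat": 0.0},
         "grain": {"stake": 0.1, "wheat": 0.5}}
priors = {"acq": 0.75, "grain": 0.25}

r_sum, r_wavg, r_max = combine_local_scores(local, priors)
print(r_sum, r_wavg, r_max)
```

In this toy case the sum cannot separate the two terms, the weighted average favours the term tied to the more general category, and the maximum favours the term with the single strongest association.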

Filtering by document frequency
The simplest TSR technique:
1. Remove stop-words, etc. (see slide on pre-processing steps)
2. Order all features $t_k$ in $T$ according to the number of documents in which they occur ($\#_{Tr}(t_k)$)
3. Choose $T' = \{t_1, \ldots, t_n\}$ s.t. it contains the $n$ highest scoring $t_k$
Advantages:
- Low computational cost
- DR up to a factor of 10 with just a small reduction in effectiveness (as reported in [Yang and Pedersen, 1997])

Information theoretic TSR
Assume a multi-variate Bernoulli model where features have value 0 or 1 depending on whether they occur in a document (sample space: $\Omega = 2^D$).
$T$ and $C$ are Boolean random variables for events (sets) generated by choices of terms and categories.
Notation: $P(T = 1 \mid C = 1)$, abbreviated as $P(t \mid c)$, is the probability that term $t$ occurs in a document classified under category $c$; similarly, $P(T = 0, C = 1)$, abbreviated as $P(\bar{t}, c)$, is the joint probability that $t$ does not occur in a document and the document belongs to category $c$, etc.
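The three steps of filtering by document frequency can be sketched as follows. The function name, the stop-word list and the toy documents are hypothetical; ties in document frequency are broken alphabetically, which is an arbitrary choice made to keep the output deterministic.

```python
from collections import Counter


def select_by_document_frequency(docs, n, stop_words=frozenset()):
    """Return the n terms that occur in the largest number of documents."""
    # Step 1 and 2: drop stop words, then count documents per term (#Tr(t_k)).
    doc_freq = Counter(t for doc in docs for t in set(doc)
                       if t not in stop_words)
    # Step 3: keep the n highest-scoring terms.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


docs = [["the", "stake", "shares"],
        ["the", "shares", "merger"],
        ["the", "shares", "stake"]]
reduced = select_by_document_frequency(docs, n=2, stop_words={"the"})
print(reduced)
```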

Some TSR ranking functions
Functions commonly used in feature selection for text categorisation [Sebastiani, 2002]:

Name                          Definition
Document frequency            $\#(t, c) = P(t \mid c)$
DIA factor                    $z(t, c) = P(c \mid t)$
Expected mutual information   $I(t, c) = \sum_{c' \in \{c, \bar{c}\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c') \log \frac{P(t', c')}{P(t')P(c')}$
Mutual information            $MI(t, c) = P(t, c) \log \frac{P(t, c)}{P(t)P(c)}$
Chi-square                    $\chi^2(t, c) = \frac{|D_t|\,[P(t, c)P(\bar{t}, \bar{c}) - P(t, \bar{c})P(\bar{t}, c)]^2}{P(t)P(\bar{t})P(c)P(\bar{c})}$
NGL coefficient               $NGL(t, c) = \frac{\sqrt{|D_t|}\,[P(t, c)P(\bar{t}, \bar{c}) - P(t, \bar{c})P(\bar{t}, c)]}{\sqrt{P(t)P(\bar{t})P(c)P(\bar{c})}}$
Odds ratio                    $OR(t, c) = \frac{P(t \mid c)\,[1 - P(t \mid \bar{c})]}{[1 - P(t \mid c)]\,P(t \mid \bar{c})}$
GSS coefficient               $GSS(t, c) = P(t, c)P(\bar{t}, \bar{c}) - P(t, \bar{c})P(\bar{t}, c)$

TSR by Expected Mutual Information

    fsI(T, c, a) : T
      var: C_l, T_t, T_l : list
      for each d ∈ D
        if f(d, c) = true do append(d, C_l)
        for each t ∈ d
          put(⟨t, d⟩, T_t)
          put(⟨t, 0⟩, T_l)
      P(c) ← |C_l| / |D|
      for each t in T_t
        D_tc   ← {d | d ∈ C_l ∧ ⟨t, d⟩ ∈ T_t},   D_t̄c   ← {d | d ∈ C_l ∧ ⟨t, d⟩ ∉ T_t}
        D_tc̄  ← {d | d ∉ C_l ∧ ⟨t, d⟩ ∈ T_t},   D_t̄c̄  ← {d | d ∉ C_l ∧ ⟨t, d⟩ ∉ T_t}
        P(t, c)  ← |D_tc| / |D|,    P(t̄, c)  ← |D_t̄c| / |D|
        P(t, c̄) ← |D_tc̄| / |D|,   P(t̄, c̄) ← |D_t̄c̄| / |D|
        P(t) ← (|D_tc| + |D_tc̄|) / |D|
        remove(⟨t, 0⟩, T_l)
        i ← P(t, c) log [P(t, c) / (P(t)P(c))] + P(t̄, c) log [P(t̄, c) / (P(t̄)P(c))]
            + P(t̄, c̄) log [P(t̄, c̄) / (P(t̄)P(c̄))] + P(t, c̄) log [P(t, c̄) / (P(t)P(c̄))]
        add(⟨t, i⟩, T_l)
      sort T_l by expected mutual information scores
      return first |T_l| / a elements of T_l

The algorithm above is simply meant to illustrate how the estimation of probabilities works in very general terms. A practical implementation would not involve as many counting operations and would need to avoid zero probabilities for cases where terms do not co-occur with categories. (More about the latter in slide 54.)
With respect to counting, for each term-category pair it would suffice to estimate $P(c)$, $P(t)$ and a single joint or conditional, say $P(t \mid c) = \frac{|D_{tc}|}{|C_l|}$ (or $P(t \mid c) = \frac{|D_{tc}| + 1}{|C_l| + |T_l|}$, using a Laplace estimator), and derive the remaining values from it through straightforward applications of the properties of conditional probabilities:

$$P(t, c) = P(t \mid c)\,P(c) \qquad (11)$$
$$P(\bar{t}, c) = (1 - P(t \mid c))\,P(c) \qquad (12)$$
$$P(t, \bar{c}) = P(t) - P(t, c) \qquad (13)$$
$$P(\bar{t}, \bar{c}) = (1 - P(t)) - P(\bar{t}, c) \qquad (14)$$
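A compact Python sketch of term selection by expected mutual information is given below. It departs from the pseudocode in two labelled ways: documents are assumed to be sets of terms with a Boolean label for the category, and zero probabilities are avoided by adding one to each of the four cells of the term-category contingency table, which is a simpler smoothing scheme than the Laplace estimator on $P(t \mid c)$ mentioned above. All names and data are hypothetical.

```python
import math


def expected_mutual_information(docs, labels, term):
    """I(t, c) estimated from a smoothed 2x2 term/category contingency table."""
    counts = [[0, 0], [0, 0]]  # counts[term present][document in category]
    for doc, in_c in zip(docs, labels):
        counts[int(term in doc)][int(in_c)] += 1
    n = len(docs) + 4  # assumption: add-one smoothing on each of the 4 cells
    p = [[(counts[t][c] + 1) / n for c in (0, 1)] for t in (0, 1)]
    p_t = [p[t][0] + p[t][1] for t in (0, 1)]  # marginals over the term
    p_c = [p[0][c] + p[1][c] for c in (0, 1)]  # marginals over the category
    return sum(p[t][c] * math.log(p[t][c] / (p_t[t] * p_c[c]))
               for t in (0, 1) for c in (0, 1))


def select_terms(docs, labels, aggressiveness):
    """Keep the top |T| / a terms, ranked by expected mutual information."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, reverse=True,
                    key=lambda t: expected_mutual_information(docs, labels, t))
    return ranked[:max(1, len(vocab) // aggressiveness)]


docs = [{"stake", "shares"}, {"merger", "stake"},
        {"wheat", "export"}, {"wheat", "shares"}]
labels = [True, True, False, False]  # f(d, c) for a hypothetical category c
print(select_terms(docs, labels, aggressiveness=2))
```

Note that a term such as "wheat", which is a strong indicator of absence from the category, scores as high as "stake": expected mutual information rewards any dependence between term and category, positive or negative.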

Sample ranking
Top-ranked words for REUTERS category acq according to expected mutual information:
stake, merger, acquisition, vs, shares, buy, acquire, qtr, cts, usair, shareholders, buys

Naive Bayes text categorisation
Categorisation status value (CSV) function: a soft classification function for each category $c_i \in C$:

$$\hat{f}_i : D \to \mathbb{R}$$

A hard classifier status value, $\hat{f}_i^h : D \to \{0,1\}$, can then be defined as follows:

$$\hat{f}_i^h(d) = \begin{cases} 1 & \text{if } \hat{f}_i(d) \ge \tau_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (15)$$

Thresholds can be determined analytically or experimentally.

Experimental thresholds
- CSV thresholding or SCut: SCut stands for optimal thresholding on the confidence scores of category candidates. Vary $\tau_i$ on $D_v$ and choose the value that maximises effectiveness.
- Proportional thresholding: choose $\tau_i$ s.t. the generality measure $g_{Tr}(c_i)$ is closest to $g_{Tv}(c_i)$.
- RCut or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.
See [Yang, 2001] for a recent survey of thresholding strategies.

CSV for multi-variate Bernoulli models
Starting from the independence assumption and Bayes' rule,

$$P(\vec{d} \mid c) = \prod_{k=1}^{|T|} P(t_k \mid c) \qquad P(c \mid \vec{d}_j) = \frac{P(c)\,P(\vec{d}_j \mid c)}{P(\vec{d}_j)}$$

derive a monotonically increasing function of $P(c \mid \vec{d})$:

$$\hat{f}(d, c) = \sum_{i=1}^{|T|} t_i \log \frac{P(t_i \mid c)\,[1 - P(t_i \mid \bar{c})]}{P(t_i \mid \bar{c})\,[1 - P(t_i \mid c)]} \qquad (16)$$

We need to estimate $2|T|$, rather than $2^{|T|}$, parameters.
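SCut thresholding can be sketched as a search over candidate thresholds on the validation set. The sketch assumes that effectiveness is measured by the F1 score and that the candidate thresholds are the observed CSV values themselves; both are choices made for the example, as are the scores and expert judgements.

```python
def f1(decisions, gold):
    """F1 of hard decisions against expert judgements (0.0 if no true positives)."""
    tp = sum(1 for d, g in zip(decisions, gold) if d and g)
    fp = sum(1 for d, g in zip(decisions, gold) if d and not g)
    fn = sum(1 for d, g in zip(decisions, gold) if g and not d)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut(scores, gold):
    """Vary tau over the validation scores and keep the most effective value."""
    best_tau, best_f1 = None, -1.0
    for tau in sorted(set(scores)):
        effectiveness = f1([s >= tau for s in scores], gold)
        if effectiveness > best_f1:
            best_tau, best_f1 = tau, effectiveness
    return best_tau, best_f1


# Hypothetical CSV values and expert judgements for one category on D_v.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
gold = [True, True, False, True, False]
tau, effectiveness = scut(scores, gold)
print(tau, effectiveness)
```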

Alternative: multinomial model
An alternative implementation of the Naïve Bayes Classifier is described in [Mitchell, 1997]. In this approach, words appear as values rather than names of attributes.
A document representation for this slide would look like this:

$$\vec{d} = \langle a_1 = \text{"an"},\ a_2 = \text{"alternative"},\ a_3 = \text{"implementation"}, \ldots \rangle$$

Problem: each attribute's value would range over the entire vocabulary. Many values would be missing for a typical document.

Dealing with missing values
What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then $\hat{P}(a_i \mid v_j) = 0$, and...

$$\hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = 0$$

The typical solution is a Bayesian estimate for $\hat{P}(a_i \mid v_j)$:

$$\hat{P}(a_i \mid v_j) \leftarrow \frac{n_c + mp}{n + m}$$

where
- $n$ is the number of training examples for which $v = v_j$,
- $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$,
- $p$ is the prior estimate for $\hat{P}(a_i \mid v_j)$,
- $m$ is the weight given to the prior (i.e. the number of "virtual" examples).

Learning in multinomial models

Algorithm 1: NB Probability estimation

    NB_Learn(D_t, C)
      /* collect all tokens that occur in D_t */
      T ← all distinct words and other tokens in D_t
      /* calculate P(c_j) and P(t_k | c_j) */
      for each target value c_j in C do
        D_t^j ← subset of D_t for which target value is c_j
        P(c_j) ← |D_t^j| / |D_t|
        Text_j ← concatenation of all texts in D_t^j
        n ← total number of tokens in Text_j
        for each word t_k in T do
          n_k ← number of times word t_k occurs in Text_j
          P(t_k | c_j) ← (n_k + 1) / (n + |T|)
        done
      done

Sample Classification Algorithm
One could calculate posterior probabilities for soft classification,

$$\hat{f}(d) = P(c) \prod_{k=1}^{n} P(t_k \mid c)$$

and use thresholding as before. Or, for SLTC, implement hard categorisation directly:

Algorithm 2: MNB Categorisation Status Function

    CSV_MNB(d : D)
      positions ← all word positions in d that contain tokens found in T
      return c_nb, where
        c_nb = arg max_{c_i ∈ C} P(c_i) ∏_{k ∈ positions} P(t_k | c_i)
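Algorithms 1 and 2 translate into Python fairly directly. The sketch below follows the same estimates, with two assumptions: documents are pre-tokenised lists paired with a single category, and the product in Algorithm 2 is computed as a sum of logarithms to avoid numerical underflow. The training data are invented.

```python
import math
from collections import Counter


def nb_learn(training_docs):
    """Estimate P(c_j) and P(t_k | c_j) from (tokens, category) pairs."""
    vocab = {t for tokens, _ in training_docs for t in tokens}
    priors, cond = {}, {}
    for c in {cat for _, cat in training_docs}:
        texts = [tokens for tokens, cat in training_docs if cat == c]
        priors[c] = len(texts) / len(training_docs)
        counts = Counter(t for tokens in texts for t in tokens)
        n = sum(counts.values())  # total number of tokens in Text_j
        cond[c] = {t: (counts[t] + 1) / (n + len(vocab)) for t in vocab}
    return vocab, priors, cond


def csv_mnb(tokens, vocab, priors, cond):
    """Return argmax_c P(c) * prod P(t_k | c) over positions with known tokens."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in vocab)
    return max(priors, key=log_score)


train = [("stake shares merger".split(), "acq"),
         ("shares stake buy".split(), "acq"),
         ("wheat export tonnes".split(), "grain")]
vocab, priors, cond = nb_learn(train)
print(csv_mnb("buy stake unknown".split(), vocab, priors, cond))
print(csv_mnb("wheat tonnes".split(), vocab, priors, cond))
```

Because of the add-one estimate, the conditional probabilities for each category sum to one over the vocabulary and none of them is zero, so a single unseen word-category pair cannot veto a category.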

Naive but subtle
The conditional independence assumption is clearly false:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

...but NB works well anyway. Why?
- The posteriors $\hat{P}(v_j \mid x)$ don't need to be correct. We need only that:

$$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n \mid v_j)$$

- Naive Bayes posteriors are often unrealistically close to 1 or 0.
- See [Domingos and Pazzani, 1996] for analysis.

Other Probabilistic Classifiers
One could also represent data as real-valued vectors (e.g. normalised TFxIDF) and assume an underlying normal distribution to estimate the probabilities.
Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
- adopting weighted document vectors, rather than binary-valued ones
- introducing document length normalisation, in order to correct distortions in $CSV_i$ introduced by long documents
- relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness)

Decision tree classifiers and learners
A decision tree is a tree with:
- internal nodes labelled by terms
- edges labelled by tests (on the weight the term from which they depart has in the document)
- leaves labelled by categories
Given a decision tree, categorisation of a document $d_j$ is done by recursively testing the values of the internal nodes of the tree against those in $d_j$ until a leaf is reached.
Simplest case: $d_j$ consists of binary values.

Example: a (binary) Decision Tree
[Figure: binary decision tree for the category WHEAT, with internal nodes testing the presence or absence of the terms wheat, farm, commodity, bushels, agriculture, export, tonnes, winter and soft, and leaves labelled WHEAT or ¬WHEAT]
The tree is equivalent to the rule:
if (wheat ∧ farm) or (wheat ∧ commodity) or (bushels ∧ export) or (wheat ∧ tonnes) or (wheat ∧ agriculture) or (wheat ∧ winter ∧ ¬soft) then WHEAT else ¬WHEAT

A learning algorithm

Algorithm 3: Decision tree learning

    DTreeLearn(D_t : 2^D, T : 2^T, default : C) : tree
      if isEmpty(D_t) then
        return default
      else if ∏_{d_i ∈ D_t} f(d_i, c_j) = 1 then   /* all d_i have class c_j */
        return c_j
      else if isEmpty(T) then
        return MajorityCateg(D_t)
      else
        t_best ← ChooseFeature(T, D_t)
        tree ← new dtree with root = t_best
        for each v_k ∈ t_best do
          D_t^k ← {d_l ∈ D_t | t_best has value v_k in d_l}
          sbt ← DTreeLearn(D_t^k, T \ {t_best}, MajorityCateg(D_t))
          add a branch to tree with label v_k and subtree sbt
        done
        return tree

How do we implement ChooseFeature?
Finding the right feature to partition the training set is essential.
Entropy (the $H(\cdot)$ function below), AKA self-information, measures the amount of uncertainty w.r.t. a probability distribution. In other words, entropy is a measure of how much we learn when we observe an event occurring in accordance with this distribution.
One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by candidate features $t_k$:

$$H(D) = -\frac{|D_c|}{|D|} \log \frac{|D_c|}{|D|} - \frac{|D_{\bar{c}}|}{|D|} \log \frac{|D_{\bar{c}}|}{|D|} \qquad (17)$$

where $|D_c|$ ($|D_{\bar{c}}|$) is the number of positive (negative) instances filed under category $c$ in $D$.

$$IG(T, D) = H(D) - \left[ \frac{|D_t|}{|D|} H(D_t) + \frac{|D_{\bar{t}}|}{|D|} H(D_{\bar{t}}) \right] \qquad (18)$$

where $D_t$ and $D_{\bar{t}}$ are the subsets of $D$ containing instances for which $T$ has value 1 and 0, respectively.
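One possible implementation of ChooseFeature by information gain is sketched below and applied to the sample weather data set given earlier. Two assumptions are made: logarithms are taken to base 2, and the two-way split in (18) is generalised to one child node per attribute value, since the weather attributes are not all binary.

```python
import math
from collections import Counter


def entropy(labels):
    """H(D) over the class distribution of a list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the mother node minus the weighted entropies of its children."""
    gain = entropy([r[target] for r in rows])
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


def choose_feature(rows, features, target):
    return max(features, key=lambda f: information_gain(rows, f, target))


FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
DATA = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
rows = [dict(zip(FIELDS, line.split())) for line in DATA.splitlines()]

for f in FIELDS[:4]:
    print(f, round(information_gain(rows, f, "play"), 3))
print(choose_feature(rows, FIELDS[:4], "play"))
```

On this data outlook yields the largest gain, which is consistent with it being the root of the weather decision tree.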

Important Issues
- Choosing the right feature (from $T$) to partition the training set: choose the feature with the highest information gain.
- Avoiding overfitting: memorising all observations from $Tr$ vs. extracting patterns and extrapolating to unseen examples in $Tv$ and $Te$.
- Occam's razor: the most likely hypothesis is the simplest one which is consistent with all observations.
- Pruning: remove over-specific branches.

A DT for the "earnings" category
[Figure, after [Manning and Schütze, 1999]: decision tree with nodes n1 to n7, each annotated with its number of documents and $P(c \mid n)$; node n1 has $P(c \mid n_1) = 0.3$. The tree splits on cts (cts < 2 vs. cts >= 2), then on net (net < 1 vs. net >= 1) and vs (vs < 2 vs. vs >= 2). An accompanying plot shows the decision boundaries in the net/cts plane and the probability that a document at node n4 belongs to category c = "earnings".]

Calculating node probabilities
One can assign probabilities to a leaf node (i.e. the probability that a new document $d$ belonging to that node should be filed under category $c$) as follows (using add-one smoothing):

$$P(c \mid d_n) = \frac{|D_{cn}| + 1}{|D_{cn}| + |D_{\bar{c}n}| + 2} \qquad (19)$$

where
- $P(c \mid d_n)$ is the probability that a document $d_n$ which ended up in node $n$ belongs to category $c$,
- $|D_{cn}|$ ($|D_{\bar{c}n}|$) is the number of (training) documents in node $n$ which have been assigned category $c$ ($\bar{c}$).

Text Classifier Evaluation
Evaluation of TC systems is usually done experimentally rather than analytically.
Analytical evaluation is difficult due to the subjective nature of the task.
Experimental evaluation aims at measuring classifier effectiveness, that is, its ability to make correct classification decisions for the largest possible number of documents.
We have already seen two measures used in experimental evaluation: precision and recall. Today we will characterise these measures more precisely and see other ways of evaluating TC.

Precision and recall

Precision (π), with respect to category c_i, may be defined as the following conditional probability:

$$\pi = P(f(d_x, c_i) = T \mid \hat{f}(d_x, c_i) = T) \qquad (20)$$

that is, the probability that, if a document d_x has been classified under c_i, this decision is correct.

Analogously, recall (ρ) may be defined as follows:

$$\rho = P(\hat{f}(d_x, c_i) = T \mid f(d_x, c_i) = T) \qquad (21)$$

that is, the probability that, if a random document d_x is meant to be filed under c_i, it will be classified as such.

Calculating precision and recall

[Figure: the set of all texts, containing two overlapping sets, Target ($f(d_x, c_i) = T$) and Selected ($\hat{f}(d_x, c_i) = T$). Their intersection is TP_i, the rest of Selected is FP_i, the rest of Target is FN_i, and everything outside both sets is TN_i.]

$$\pi_i = \frac{TP_i}{TP_i + FP_i} \qquad \rho_i = \frac{TP_i}{TP_i + FN_i}$$
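The set picture above translates directly into set operations. The following sketch computes local precision and recall for one category from a Selected set and a Target set; the document identifiers are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the example, not something the slides prescribe.

```python
def precision_recall(selected, target):
    """Local precision and recall for one category.

    selected: set of documents the classifier filed under the category
    target:   set of documents the expert filed under the category
    """
    tp = len(selected & target)   # correctly selected
    fp = len(selected - target)   # selected but not targeted
    fn = len(target - selected)   # targeted but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical document identifiers.
target = {"d1", "d2", "d3"}
selected = {"d2", "d3", "d4", "d5"}
pi, rho = precision_recall(selected, target)
```

Here two of the four selected documents are correct and two of the three targeted documents are found.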

Combining local into global measures

Local estimates of the probabilities in (20) and (21) may be combined to yield estimates for the classifier as a whole. The contingency table below summarises precision and recall over a category set:

Category set C = {c_1, c_2, ...}       Expert judgement
                                       YES                             NO
TC system judgement    YES             TP = Σ_{i=1}^{|C|} TP_i         FP = Σ_{i=1}^{|C|} FP_i
                       NO              FN = Σ_{i=1}^{|C|} FN_i         TN = Σ_{i=1}^{|C|} TN_i

Effectiveness averaging

Two different methods may be used to calculate global values for π and ρ: microaveraging and macroaveraging.

Microaveraging:

$$\pi^{\mu} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \qquad (22)$$

$$\rho^{\mu} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} \qquad (23)$$

Macroaveraging

Precision macroaveraging is calculated as follows:

$$\pi^{M} = \frac{\sum_{i=1}^{|C|} \pi_i}{|C|} \qquad (24)$$

Recall macroaveraging is calculated as follows:

$$\rho^{M} = \frac{\sum_{i=1}^{|C|} \rho_i}{|C|} \qquad (25)$$

where π_i and ρ_i are local scores.

An example

Judgement: E(xpert) and S(ystem)

                            Sport       Politics    World
                            E    S      E    S      E    S
Brazil beat Venezuela       T    F      F    F      F    T
US defeated Afghanistan     F    T      T    T      T    F
Elections in Wicklow        F    F      T    T      F    F
Elections in Peru           F    F      F    T      T    T
Precision (local)           π = 0       π = 0.67    π = 0.5
Recall (local)              ρ = 0       ρ = 1       ρ = 0.5

$$\pi^{\mu} = \frac{0 + 2 + 1}{(0 + 1) + (2 + 1) + (1 + 1)} = 0.5 \qquad \pi^{M} = \frac{0 + 0.67 + 0.5}{3} \approx 0.39$$
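The worked example can be checked mechanically. The sketch below encodes the (expert, system) judgements from the table and applies the microaveraging and macroaveraging definitions; the data structure and variable names are choices made for illustration only.

```python
# (expert, system) judgements per category, in the order of the table rows:
# Brazil beat Venezuela, US defeated Afghanistan,
# Elections in Wicklow, Elections in Peru.
judgements = {
    "Sport":    [(True, False), (False, True), (False, False), (False, False)],
    "Politics": [(False, False), (True, True), (True, True), (False, True)],
    "World":    [(False, True), (True, False), (False, False), (True, True)],
}


def counts(pairs):
    """Return (TP, FP, FN) for one category."""
    tp = sum(1 for e, s in pairs if e and s)
    fp = sum(1 for e, s in pairs if s and not e)
    fn = sum(1 for e, s in pairs if e and not s)
    return tp, fp, fn


per_category = [counts(pairs) for pairs in judgements.values()]

# Microaveraging: pool the counts over categories, then take the ratio.
micro_precision = (sum(tp for tp, _, _ in per_category)
                   / sum(tp + fp for tp, fp, _ in per_category))
micro_recall = (sum(tp for tp, _, _ in per_category)
                / sum(tp + fn for tp, _, fn in per_category))

# Macroaveraging: compute local scores first, then average them.
macro_precision = (sum(tp / (tp + fp) for tp, fp, _ in per_category)
                   / len(per_category))
macro_recall = (sum(tp / (tp + fn) for tp, _, fn in per_category)
                / len(per_category))
```

Microaveraging gives every decision equal weight, whereas macroaveraging gives every category equal weight, which is why the poorly handled Sport category pulls the macroaveraged precision below the microaveraged one.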

Other measures

Since one knows (by experimentation) which documents fall into TP, FP, TN and FN, one may also estimate accuracy (A) and error (E):

$$A = \frac{TP + TN}{TP + TN + FP + FN} \qquad (26)$$

$$E = \frac{FP + FN}{TP + TN + FP + FN} = 1 - A \qquad (27)$$

These measures, however, are not widely used in TC because they are less sensitive to variations in the number of correct decisions than π and ρ.

Fallout and ROC curves

A less frequently used measure is fallout:

$$\mathrm{Fallout}_i = \frac{FP_i}{FP_i + TN_i} \qquad (28)$$

Fallout measures the proportion of non-targeted items that were mistakenly selected. In certain fields recall-fallout trade-offs are more common than precision-recall ones.

The receiver operating characteristic, or ROC curve, shows how different levels of fallout influence recall (also called sensitivity).
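A small numerical sketch illustrates why accuracy is a blunt instrument for TC. The counts below are hypothetical: 1000 documents, of which only 10 belong to the category. The function names are chosen for the example.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def error(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)


def fallout(fp, tn):
    return fp / (fp + tn)


def recall(tp, fn):
    return tp / (tp + fn)


# Hypothetical classifier A rejects every document.
rejector = dict(tp=0, tn=990, fp=0, fn=10)
# Hypothetical classifier B finds half of the relevant documents.
finder = dict(tp=5, tn=990, fp=0, fn=5)

acc_rejector = accuracy(**rejector)                    # 990 / 1000
acc_finder = accuracy(**finder)                        # 995 / 1000
rec_rejector = recall(rejector["tp"], rejector["fn"])  # 0 / 10
rec_finder = recall(finder["tp"], finder["fn"])        # 5 / 10
```

Accuracy moves by only half a percentage point between the two classifiers, while recall moves from 0 to 0.5, because the many true negatives dominate the accuracy ratio.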

Alternatives to effectiveness

Efficiency is often used as an additional criterion in TC evaluation. It may be measured with respect to training or classification.

The utility criterion, from decision theory, is sometimes used. An obvious example of the application of utility measures is spam filtering, where failing to discard spam is less serious than discarding a legitimate message.

Combining precision and recall

Neither π nor ρ makes much sense in isolation. Classifiers can be tuned to maximise one at the expense of the other.

TC evaluation is therefore done in terms of measures that combine π and ρ. We will examine two such measures: the breakeven point and the F functions.

Breakeven point

The breakeven point is the value at which π equals ρ, as determined by the following process:

A plot of π as a function of ρ is computed by varying the thresholds τ_i for the CSV function from 1 to 0. With the threshold set to 1, only those documents that the classifier is totally sure belong to the category will be selected, so π will tend to 1 and ρ to 0. As we decrease τ_i, π will decrease but ρ will increase.

The breakeven point is the value (of ρ or π) at which the plot intersects the ρ = π line.

The F functions

The idea behind F measures [van Rijsbergen, 1979, ch. 7] is to assign a degree of importance to ρ and π.

Let β be a factor (0 ≤ β ≤ ∞) quantifying such degree of importance. The F_β function can be computed as follows:

$$F_{\beta} = \frac{(\beta^2 + 1)\,\pi\rho}{\beta^2 \pi + \rho} \qquad (29)$$

A β value of 1 assigns equal importance to precision and recall.

The breakeven of a classifier is always less than or equal to its F_β for β = 1.
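Both combined measures are easy to sketch. The code below implements F_β as in (29) and approximates the breakeven point by sweeping the threshold over the observed classifier scores from high to low, keeping the threshold at which precision and recall are closest. The scores, the function names, and the conventions for empty denominators are assumptions made for the example.

```python
def f_beta(precision, recall, beta=1.0):
    """F function of equation (29); returns 0.0 if the denominator is zero."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


def precision_recall_at(scores, targets, threshold):
    """Precision and recall when documents scoring >= threshold are selected."""
    selected = [score >= threshold for score in scores]
    tp = sum(1 for sel, tgt in zip(selected, targets) if sel and tgt)
    fp = sum(1 for sel, tgt in zip(selected, targets) if sel and not tgt)
    fn = sum(1 for sel, tgt in zip(selected, targets) if tgt and not sel)
    precision = tp / (tp + fp) if tp + fp else 1.0  # nothing selected
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def breakeven(scores, targets):
    """Return (threshold, precision, recall) where |precision - recall| is smallest."""
    best = None
    for threshold in sorted(set(scores), reverse=True):
        p, r = precision_recall_at(scores, targets, threshold)
        if best is None or abs(p - r) < abs(best[1] - best[2]):
            best = (threshold, p, r)
    return best


# Hypothetical classifier scores (CSV values) and expert judgements.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
targets = [True, True, False, True, False, False]
tau, p_at_tau, r_at_tau = breakeven(scores, targets)
```

With finitely many documents the two curves may never meet exactly, which is why the sketch returns the closest point rather than insisting on equality. Under (29), β = 0 reduces F_β to precision, and increasing β shifts the weight towards recall.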

Comparison of existing TC systems

Type             Systems
non-learning     WORD
probabilistic    PropBayes, Bim, Nb
decision tree    C4.5, Ind
decision rules   Swap-1, Ripper, etc.
regression       LLSF
online linear    BWinnow
batch linear     Rocchio
neural nets      Classi
example based    k-NN, Gis-W
SVM              SVMLight
ensemble         AdaBoost

From [Sebastiani, 2002]. See also [Yang and Liu, 1999].

A final example: WSD

Consider the following occurrences of the word "bank":

RIV  y be? Then he ran down along the bank, toward a narrow, muddy path.
FIN  four bundles of small notes the bank cashier got it into his head
RIV  ross the bridge and on the other bank you only hear the stream, the
RIV  beneath the house, where a steep bank of earth is compacted between
FIN  op but is really the branch of a bank. As I set foot inside, despite
FIN  raffic police also belong to the bank. More foolhardy than entering
FIN  require a number. If you open a bank account, the teller identifies
RIV  circular movement, skirting the bank of the River Jordan, then turn

The WSD learning task is to learn to distinguish between the meanings "financial institution" (FIN) and "the land alongside a river" (RIV).

Task, data representation, performance measures

WSD can be described as a categorisation task in which senses (FIN, RIV) are the labels (C), and the representation of instances (D) comes from the context surrounding the words to be disambiguated.

E.g.: for T = {along, cashier, stream, muddy, ...}, we could have d_1 = ⟨along = 1, cashier = 0, stream = 0, muddy = 1, ...⟩ and f(d_1) = RIV.

Performance can be measured as in text categorisation.

A decision tree

Using the algorithm above (slide 61) we get this decision tree:

[Figure: decision tree with binary (1/0) tests on the features on, river, when, money and from, and leaves labelled RIV or FIN.]

Trained on a small training set with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}.
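The binary representation of d_1 described above can be produced from a context line with a few lines of Python. This is only a sketch: the tokenisation (lower-casing and keeping alphabetic runs) and the function name are assumptions, and the vocabulary is the four-word excerpt of T given in the example.

```python
import re


def featurise(context, vocabulary):
    """Map a context string to binary word-presence features."""
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    return {term: int(term in tokens) for term in vocabulary}


vocabulary = ["along", "cashier", "stream", "muddy"]
d1 = featurise(
    "Then he ran down along the bank, toward a narrow, muddy path.",
    vocabulary,
)
# d1 == {'along': 1, 'cashier': 0, 'stream': 0, 'muddy': 1}
```

Vectors of this kind, paired with their sense labels, are what a decision tree learner would take as its training set.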

Accuracy, k-NN

This simple decision tree obtains F_1 scores of for RIV and for FIN.

A 3-NN classifier for the same data, using numeric vectors whose values correspond to the distance of the various features to the keyword (negative integers for words to the left of the keyword, positive integers for words to the right of the keyword), obtained F_1 scores of 0.88 and 0.91, respectively.

References

M. Collins and N. Duffy. Convolution kernels for natural language. Advances in Neural Information Processing Systems, 14.

M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25-69.

W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. MBT: A memory-based part of speech tagger generator. In Proc. of Fourth Workshop on Very Large Corpora, pages 14-27.

Walter Daelemans and Antal van den Bosch. Language-independent data-oriented grapheme to phoneme conversion. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, editors, Progress in Speech Synthesis. Springer.

Pedro Domingos and Michael J. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In International Conference on Machine Learning. URL citeseer.nj.nec.com/domingos96beyond.html.

Gerard Escudero, Lluís Màrquez, and German Rigau. Boosting applied to word sense disambiguation. In Ramon López De Mántaras and Enric Plaza, editors, Proceedings of ECML-00, 11th European Conference on Machine Learning, Barcelona. Springer Verlag.

William Gale, Kenneth Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26.

M. Haruno, S. Shirai, and Y. Ooyama. Using decision trees to construct a practical parser. Machine Learning, 34(1).

Deirdre Hogan. Coordinate noun phrase disambiguation in a generative parsing model. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.

Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz.

George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In Philippe Besnard and Steve Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI 95), San Francisco, CA, USA, August. Morgan Kaufmann Publishers.

D.D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93.

David M. Magerman. Statistical decision-tree models for parsing. In Meeting of the Association for Computational Linguistics. URL citeseer.ist.psu.edu/magerman95statistical.html.
