Machine Learning for NLP: Supervised learning techniques. Saturnino Luz. ESSLLI 07, Dublin, Ireland.
Machine Learning for NLP: Supervised learning techniques
Saturnino Luz, Dept. of Computer Science, Trinity College Dublin, Ireland
ESSLLI 07, Dublin, Ireland

Outline
- Introduction to supervised learning
  - main concepts, notation
  - some supervised methods
- NLP applications (case studies):
  - Text categorisation (in detail)
    - Data representation
    - Target and categorisation functions
    - Algorithms (probabilistic, symbolic)
    - Evaluation
  - Word-sense disambiguation (briefly)
The uses of supervised learning
- Supervised learning is possibly the type of machine learning method most widely used in natural language processing applications.
- A supervised learner has access to a "teacher" which describes the function to be learnt over a number of training examples:
  - in practice, an annotated data set (corpus, etc.)
  - the inductive process will build an approximation $\hat{f}$ of a target function $f$

Classification tasks
- Supervised learning methods are usually employed in learning classification tasks.
- Some notation:
  - $D = \{d_1, \ldots, d_{|D|}\}$: a set of data instances.
  - $C = \{c_1, \ldots, c_{|C|}\}$: a set of categories with respect to which instances will be classified.
- In its most general form, a target function will have signature $f : D \to 2^C$
- Classes are usually defined through manual annotation, or labelling.
Representing the classification function
- Multi-label classification (MLC; the general form of $f$, above): an instance may have any number of categories (e.g. classification of news articles).
- In practice, one often uses (category-specific) single-label classifiers (SLC) of the form $\hat{f} : D \times C \to \{0,1\}$, s.t.:
  $\hat{f}(d, c) = \begin{cases} 1 & \text{if } d \text{ belongs to class } c \\ 0 & \text{otherwise} \end{cases}$  (1)
- Multi-label classification as a composition of single-label classifiers: make $\hat{f}(d) = \{c \mid \hat{f}(d, c) = 1\}$ (assuming the $c$'s are stochastically independent).
- Controlled vocabulary keyword assignment and document classification into web hierarchies are examples of MLC, whereas ambiguity resolution in NLP is an example of SLC.
- Spam filtering is an example of the binary case of SLC.

Data representation
- As we have seen in lecture 1, the data presented to the learning algorithm need to be uniformly represented.
- We assume that instances will be represented as feature vectors $\vec{d}_i = \langle t_1, \ldots, t_n \rangle$ whose values $t_1, \ldots, t_n \in T$ will vary depending on the data representation scheme chosen.
- E.g.: our training instances for last lecture's draughts player are represented by 5-dimensional integer-valued feature vectors: $\langle bp(b), rp(b), bk(b), \ldots \rangle$
Performance measure
- Classification performance can be evaluated in different ways, depending on the domain.
- Commonly used measures are accuracy and error.
- For most NLP applications it usually makes more sense to measure performance in terms of precision, recall, and combinations of these two measures, such as F scores and break-even points.

Algorithms: Hard vs. Soft classification
- A classification function $\hat{f} : D \times C \to \{0,1\}$ performs what we call hard classification.
- Another possibility is to allow $\hat{f}$ to range over real values in the interval $[0,1]$ (i.e. $\hat{f}(d, c) = v$, where $0 \le v \le 1$).
- This soft classification approach is equivalent to ranking the classes in $C$ according to their appropriateness to each instance $d$.
- Soft classification can be turned into hard classification via thresholding.
Classifier induction and training experience
- Inducing a classification function $\hat{f}$ through supervised learning involves a train-and-test strategy.
- The data set $D$ is split into:
  - a training set, $D_t$,
  - a test set, $D_e$, and
  - (sometimes) a validation set, $D_v$, reserved for parameter tuning.
- The strategy:
  - induce $\hat{f}$ from $D_t$
  - tune parameters on $D_v$
  - test by applying $\hat{f}$ to $D_e$ and comparing $\hat{f}$ to $f$ on $D_e$

Data sparsity and cross validation
- Important: no data used in training ($D_t$) should be used for testing ($D_e$). Evaluation must show that $\hat{f}$ generalises to unseen data (i.e. that overfitting has been avoided).
- In many areas, NLP in particular, data sparsity is a problem.
- ($k$-fold) cross validation is often used to deal with it:
  - build $k$ classifiers $\hat{f}_1, \ldots, \hat{f}_k$ over $k$ partitions $D_1, \ldots, D_k$ of $D$;
  - the train-and-test procedure is applied iteratively on pairs $\langle D_t^i, D_e^i \rangle$ of training and test partitions, where $D_t^i = D \setminus D_i$ and $D_e^i = D_i$ for $1 \le i \le k$.
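The k-fold procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `k_fold_splits` and the interleaved way of forming the partitions are choices made for the example, not part of the lecture.

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs: test is partition D_i, train is D minus D_i."""
    folds = [data[i::k] for i in range(k)]  # k disjoint partitions of D
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


# Toy usage: ten instance identifiers, five folds.
splits = list(k_fold_splits(list(range(10)), k=5))
```

Each instance is used for testing exactly once, and no instance appears in both the training and the test partition of the same fold.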
Algorithms: inference methods
- Symbolic, numeric and meta-classification methods.
- Numeric methods implement classification indirectly: the classification function $\hat{f}$ outputs a numerical score, and hard classification is obtained via thresholding
  - e.g.: probabilistic classifiers, regression methods, ...
- Symbolic methods usually implement hard classification directly
  - e.g.: decision trees, decision rules, ...
- Meta-classification methods combine results from independent classifiers
  - e.g.: classifier ensembles, committees, ...

Probabilistic classifiers
- Probabilistic classifiers output an estimate of the conditional probability $P(c \mid \vec{d}) = \hat{f}(d, c)$ that an instance represented as $\vec{d}$ should be classified as $c$.
- The elements of $\vec{d}$ are treated as random variables $T_i$ ($1 \le i \le |T|$).
- We would need to estimate probabilities for all possible representations, i.e. $P(c \mid T_1, \ldots, T_{|T|})$.
- This is too costly in practice: for the discrete case with $n$ possible nominal values per variable, that is $O(n^{|T|})$.
- Independence assumptions help.
Conditional independence assumption
- Using Bayes' rule we get
  $P(c \mid \vec{d}_j) = \frac{P(c) P(\vec{d}_j \mid c)}{P(\vec{d}_j)}$  (2)
- Naïve Bayes classifiers: assume $T_1, \ldots, T_{|T|}$ are independent of each other given the target category:
  $P(\vec{d} \mid c) = \prod_{k=1}^{|T|} P(t_k \mid c)$  (3)
- maximum a posteriori hypothesis: choose the $c$ that maximises (2)
- maximum likelihood hypothesis: choose the $c$ that maximises $P(\vec{d}_j \mid c)$

Variants of Naive Bayes classifiers
- multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and
- multinomial models, where the variables represent count data [McCallum and Nigam, 1998]
- Numeric data representation: attributes represented by continuous probability distributions
  - using Gaussian distributions, the conditionals can be estimated as
    $P(T_i = t \mid c) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(t - \mu)^2}{2\sigma^2}}$  (4)
  - Non-parametric kernel density estimation has also been proposed [John and Langley, 1995]
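As a rough sketch of how equations (3) and (4) combine into a maximum a posteriori decision, assume each feature of each category has already been summarised by a mean and a standard deviation. The names `gaussian_likelihood` and `map_category`, and the toy priors and parameters, are hypothetical and serve only as illustration.

```python
import math


def gaussian_likelihood(t, mu, sigma):
    """Equation (4): density of a Gaussian with mean mu and std. dev. sigma at t."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def map_category(instance, priors, params):
    """Return the category c maximising P(c) * prod_k P(t_k | c), as in (2) and (3)."""
    def score(c):
        s = priors[c]
        for t, (mu, sigma) in zip(instance, params[c]):
            s *= gaussian_likelihood(t, mu, sigma)
        return s
    return max(priors, key=score)


# Assumed toy model: two categories, two real-valued features each.
priors = {"pos": 0.5, "neg": 0.5}
params = {
    "pos": [(1.0, 1.0), (2.0, 0.5)],   # (mean, std. dev.) per feature
    "neg": [(-1.0, 1.0), (0.0, 0.5)],
}
```

The denominator $P(\vec{d}_j)$ of (2) is the same for every category, so it can be dropped when only the arg max is needed.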
Uses of NB in NLP
- Information retrieval [Robertson and Jones, 1988]
- Text categorisation (see [Sebastiani, 2002] for a survey)
- Spam filters
- Word sense disambiguation [Gale et al., 1992]

A symbolic method: decision trees
- Symbolic methods offer the advantage that their classification decisions are easily interpretable.
- Decision trees:
  - data represented as vectors of discrete-valued (or discretised) attributes
  - classification through binary tests on highly discriminative features
  - test sequence encoded as a tree structure

  outlook
  - sunny: humidity
    - high: no
    - normal: yes
  - overcast: yes
  - rainy: windy
    - true: no
    - false: yes
A sample data set (Mitchell, 97)

      outlook    temperature  humidity  windy  play
   1  sunny      hot          high      false  no
   2  sunny      hot          high      true   no
   3  overcast   hot          high      false  yes
   4  rainy      mild         high      false  yes
   5  rainy      cool         normal    false  yes
   6  rainy      cool         normal    true   no
   7  overcast   cool         normal    true   yes
   8  sunny      mild         high      false  no
   9  sunny      cool         normal    false  yes
  10  rainy      mild         normal    false  yes
  11  sunny      mild         normal    true   yes
  12  overcast   mild         high      true   yes
  13  overcast   hot          normal    false  yes
  14  rainy      mild         high      true   no

Divide-and-conquer learning strategy
- Choose the features that best divide the instance space.
- E.g. distribution of attributes for the tennis weather task:
  [Figure: bar charts of play=yes vs. play=no counts for each value of outlook (overcast, rainy, sunny), temperature (cool, hot, mild), humidity (high, normal) and windy (false, true)]
Uses of decision trees in NLP
- Parsing [Haruno et al., 1999, Magerman, 1995]
- Text categorisation [Lewis and Ringuette, 1994]
- Word-sense disambiguation, POS tagging, ...

Instance-based methods
- The majority of instance-based learners are lazy learners.
- "The importance of being lazy": instead of estimating the target function once for the whole instance space, estimate it locally and differently for each new instance.
- A family of related techniques:
  - k-nearest neighbour
  - Locally weighted regression
  - Radial basis functions
  - Case-based reasoning
k-nearest neighbours
- Key idea: just store all training examples $\langle x_i, f(x_i) \rangle$
- Nearest neighbour: given query instance $x_q$, first locate the nearest training example $x_n$, then estimate $\hat{f}(x_q) \leftarrow f(x_n)$
- k-nearest neighbour for:
  - classification (discrete-valued target function): given $x_q$, take a vote among its $k$ nearest neighbours, or
  - regression (real-valued $f$): take the mean of the $f$ values of the $k$ nearest neighbours
    $\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}$

When To Consider Using Nearest Neighbour
- Advantages:
  - Training is very fast
  - Can learn complex target functions
  - Doesn't lose information
- Disadvantages:
  - Slow at query time
  - Easily fooled by irrelevant attributes
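A minimal sketch of both uses of k-nearest neighbours follows. It assumes numeric feature vectors and Euclidean distance (the slides do not fix a distance measure), and the function names are hypothetical.

```python
import math
from collections import Counter


def k_nearest(query, examples, k):
    """examples: list of (vector, target) pairs; return the k closest to query."""
    return sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]


def knn_classify(query, examples, k):
    """Discrete-valued target: majority vote among the k nearest neighbours."""
    votes = Counter(target for _, target in k_nearest(query, examples, k))
    return votes.most_common(1)[0][0]


def knn_regress(query, examples, k):
    """Real-valued target: mean of the f values of the k nearest neighbours."""
    neighbours = k_nearest(query, examples, k)
    return sum(target for _, target in neighbours) / len(neighbours)


# Toy data, assumed for illustration only.
labelled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
valued = [((0,), 1.0), ((1,), 2.0), ((10,), 30.0)]
```

"Training" here amounts to storing `labelled` or `valued`; all the work is done in `k_nearest` at query time, which is why such learners are fast to train and slow to query.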
Uses of instance-based methods in NLP
- POS tagging [Daelemans et al., 1996]
- Text categorisation, where kNN ranks among the best performing techniques [Yang and Chute, 1994, Sebastiani, 2002]
- Speech synthesis [Daelemans and van den Bosch, 1996]
- Probability estimation for statistical parsing [Hogan, 2007]
- Information extraction [Zavrel and Daelemans, 2003]

Support Vector Machines
- SVMs have become a very popular ML technique in recent years due to their scalability to high dimensionality and robustness to overfitting.
- SVMs can be explained in geometrical terms as follows:
  - Decision surfaces: planes $\sigma_1, \ldots, \sigma_n$ in a $|T|$-dimensional space which separate positive and negative training examples.
  - Given $\sigma_1, \sigma_2, \ldots$, find a $\sigma_i$ which separates positive from negative examples by the widest possible margin.
An example: 2-d case...
- Assume positive and negative instances are linearly separable (decision surfaces are $(|T| - 1)$-hyperplanes):
  [Figure: positive and negative training examples in two dimensions, with several candidate separating lines and the "best" decision surface $\sigma_i$ marked]

Non linearly-separable data
- Use a kernel function to project the original feature space onto a higher-dimensional space. E.g. [Russell and Norvig, 2003]:
  [Figure: (a) two-dimensional data whose true decision boundary is the circle $x_1^2 + x_2^2 \le 1$; (b) the same data after mapping to the three-dimensional input space $(x_1^2, x_2^2, \sqrt{2} x_1 x_2)$]
Uses of SVM in NLP
- Text categorisation [Joachims, 1998], a task for which the method's ability to handle large feature sets seems to be particularly useful.
- Tagging, parsing [Collins and Duffy, 2002]

Meta-classifiers: ensembles
- Basic idea: apply different classifiers $\hat{f}_1, \hat{f}_2, \ldots$ to the classification task and then combine the outputs appropriately.
- Ensembles are characterised according to:
  - the kinds of classifiers $\hat{f}_i$ they employ: ideally, these classifiers should be as independent as possible
  - the way they combine multiple classifier outputs. Examples: majority voting (for committees of binary classifiers), weighted linear combination (for probabilistic outputs), dynamic classifier selection, adaptive classifier selection, ...
Meta-classifiers: Boosting
- The basic idea: all classifiers in the ensemble are obtained via the same learning method.
- Classifiers are trained sequentially, rather than independently (i.e. the training of $\hat{f}_i$ takes into account the performance of $\hat{f}_1, \ldots, \hat{f}_{i-1}$).
- ADABOOST: each pair $\langle d_j, c_i \rangle$ is assigned an importance weight $h^t_{ij}$ in $\hat{f}_t$, which represents how hard it is to get a correct decision for $\langle d_j, c_i \rangle$ in $\hat{f}_1, \ldots, \hat{f}_t$.

Uses of meta-classification methods in NLP
- Parsing [Collins and Koo, 2005]
- Word-sense disambiguation [Pedersen, 2000, Escudero et al., 2000]
- Text categorisation [Sebastiani, 2002]
Case study: text categorisation
- Text categorisation is a task which consists of assigning pre-defined symbolic labels (categories) to natural language texts (see Lecture 1).
- Two approaches:
  - category-pivoted categorisation: given a category $c$, the classifier must find all text documents $d$ that should be filed under $c$.
  - document-pivoted categorisation: given a text document $d$, the classifier must find all categories $c$ under which $d$ should be filed.

Text categorisation data
REUTERS-21578: a commonly used corpus

  <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" ID="96">
  <DATE>26 FEB :32:37.30</DATE>
  <TOPICS><D>acq</D></TOPICS>
  <PLACES><D>usa</D></PLACES>
  <PEOPLE></PEOPLE>
  <ORGS></ORGS>
  <EXCHANGES></EXCHANGES>
  <COMPANIES></COMPANIES>
  <TEXT>
  <TITLE> INVESTMENT FIRMS CUT CYCLOPS <CYL> STAKE </TITLE>
  <DATELINE> WASHINGTON, Feb 26 </DATELINE>
  <BODY>A group of affiliated New York investment firms said they lowered their stake in Cyclops Corp to 260,500 shares, or 6.4 pct of the total outstanding common stock, from 370,500 shares, or 9.2 pct. In a filing with the Securities and Exchange Commission, the group, led by Mutual Shares Corp, said it sold 110,000 Cyclops common shares on Feb 17 and 19 for 10.0 mln dlrs. Reuter </BODY>
  </TEXT>
  </REUTERS>
TC and supervised learning: preliminaries
- Assume a document-pivoted categorisation perspective.
- $D$: a corpus, i.e. a set of textual documents $d_i$, with sub-corpora for training ($D_t$), testing ($D_e$) and validation ($D_v$).
- Document labels correspond to the category set $C$. In the Reuters corpus, for instance, some of the elements of this set would be acq (labelling documents about company acquisitions), housing, vatican, etc.

Labelling constraints
- Multi-label text categorisation (MLTC): a document may belong to any number of categories. In MLTC, for a given document $d$ in $D$, one might have subsets of $C$, $C_m = \{c_1, \ldots, c_k\}$ such that $k > 1$ and $\hat{f}(d, c_1) = \ldots = \hat{f}(d, c_k) = 1$
  - Example: REUTERS news articles
- If $|C_m| = 1$ for all $d \in D$, we have what is called single-label text categorisation (SLTC).
- Boolean (binary) labelling: e.g. spam filtering.
Category generality
- The generality of a category $c_i$ in the context of a text classification system, given a corpus $\Omega$, is defined as the proportion of documents that belong to $c_i$:
  $g_\Omega(c_i) = \frac{|\{d_j \in \Omega \text{ s.t. } f(d_j, c_i) = 1\}|}{|\Omega|}$  (5)
- Analogously, one can define category generality for training sets, $g_{Tr}(c_i)$, validation sets, $g_{Tv}(c_i)$, and test sets, $g_{Te}(c_i)$.

Texts as feature vectors
- The representation strategy most commonly adopted in text categorisation (and information retrieval) consists of:
  - selecting a set of terms $T$ in the corpus (also known as features),
  - encoding all documents $d_j$ as vectors $\vec{d}_j = \langle t_{1j}, \ldots, t_{|T|j} \rangle$
- $t_{kj}$ represents how much feature $k$ contributes to the semantics of text $d_j$.
Possible implementations
- Variations on the vector representation:
  - different ways of defining terms (features): compounds, semantic dependencies, ...
  - different ways of computing weights: which words, phrases, etc. are the most relevant? How do we quantify relevance?
- One can eliminate large numbers of candidate features before any statistical processing is done.

Before indexing...
- Some pre-processing of texts may help reduce dimensionality even before any indexing is done or weights are computed.
- One often removes stop words as the first step of text processing. These include topic-neutral types and function words (prepositions, conjunctions, articles).
- Stemming is also frequently employed. It involves clustering together types that share the same morphological root. For example: { cluster, clustering, clustered, ... }
Different ways of defining features
- Words or phrases? Should we represent "Saddam Hussein" as a single feature or as two distinct features? How about "White House" and "Text categorisation"?
- Richer representation schemes
  - Darmstadt Indexing Approach: position of terms in $d$, properties of terms, category generality, etc.
- How about semantics?

Different ways of assigning values to features
- Alternative ways of assigning values to document vectors also affect classification.
  - $\#(t_k, d_j)$ = number of occurrences of $t_k$ in $d_j$
  - $\#_{Tr}(t_k)$ = number of documents in $Tr$ in which $t_k$ occurs
- Three approaches:
  - sets of words: binary values indicate presence or absence of the feature in the document
  - bags (or multi-sets) of words: values quantify the number of occurrences of a feature
  - relative frequency: term-frequency scores or probabilistic weights
- Computing weights by relative frequency:
  $tfidf(t_k, d_j) = \#(t_k, d_j) \log \frac{|Tr|}{\#_{Tr}(t_k)}$  (6)
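Equation (6) can be sketched as follows, assuming documents are already tokenised into lists of terms. The function name `tfidf_vectors` and the three toy documents are hypothetical; no normalisation is applied, since the slide's formula has none.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (the training set Tr).

    Returns one {term: weight} dict per document, following equation (6).
    """
    n_docs = len(docs)
    # #_Tr(t_k): number of documents in which each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # #(t_k, d_j)
        vectors.append({t: n * math.log(n_docs / doc_freq[t]) for t, n in counts.items()})
    return vectors


docs = [["stake", "shares", "stake"], ["wheat", "shares"], ["wheat", "tonnes"]]
vectors = tfidf_vectors(docs)
```

A term that occurs in every document gets weight 0, reflecting that it does not help to tell documents apart.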
Dimensionality Reduction
- DR: a processing step whose goal is to reduce the size of the vector space from $|T|$ to $|T'| \ll |T|$. $T'$ is called the reduced term set.
- Benefits of DR:
  - Lower computational cost for ML
  - Avoidance of overfitting (training on constitutive features, rather than contingent ones)
- Rule of thumb: overfitting is avoided if the number of training examples is proportional to the size of $T'$ (For TC, experiments have suggested a ratio of texts per feature)

DR by feature selection vs. DR by feature extraction
- DR by term selection, or Term Space Reduction (TSR): $T'$ is a subset of $T$. Order $T$ by a score $r_i(t_k)$ that quantifies the relevance of $t_k$ as an indicator of $c_i$ and choose the top terms.
- DR by term extraction: the terms in $T'$ are not of the same type as the terms in $T$, but are obtained by combinations or transformations of the original ones.
  - E.g.: in DR by term extraction, if the terms in $T$ are words, the terms in $T'$ may not be words at all.
Term Space Reduction
- There are two ways to reduce the term space:
  - TSR by term wrapping: the ML algorithm itself is used to reduce term space dimensionality
  - TSR by term filtering: terms are ranked according to their importance ($r_i(t_k)$) for the TC task and the highest-scoring ones are chosen
- Performance is measured in terms of aggressiveness, the ratio between original and reduced feature set: $\frac{|T|}{|T'|}$
- A detailed comparison of TSR techniques can be found in [Yang and Pedersen, 1997].

Local vs. Global DR
- DR can be done for each category or for the whole set of categories:
  - Local DR: for each category $c_i$, a set $T'_i$ of terms ($|T'_i| \ll |T|$) is chosen for classification w.r.t. $c_i$. Different term sets are used for different categories.
  - Global DR: a set $T'$ of terms ($|T'| \ll |T|$) is chosen for all categories $C = \{c_1, \ldots, c_{|C|}\}$
- Local scores can be combined into a global ranking through
  - sum: $r_{sum}(t_k) = \sum_{i=1}^{|C|} r_i(t_k)$,
  - weighted average: $r_{wavg}(t_k) = \sum_{i=1}^{|C|} P(c_i) r_i(t_k)$, or
  - maximisation: $r_{max}(t_k) = \max_{i=1}^{|C|} r_i(t_k)$
Filtering by document frequency
- The simplest TSR technique:
  1. Remove stop-words, etc. (see slide on pre-processing steps)
  2. Order all features $t_k$ in $T$ according to the number of documents in which they occur ($\#_{Tr}(t_k)$)
  3. Choose $T' = \{t_1, \ldots, t_n\}$ s.t. it contains the $n$ highest scoring $t_k$
- Advantages:
  - Low computational cost
  - DR up to a factor of 10 with just a small reduction in effectiveness (as reported in [Yang and Pedersen, 1997])

Information theoretic TSR
- Assume a multi-variate Bernoulli model where features will have value 0 or 1 depending on whether they occur in a document (sample space: $\Omega = 2^D$).
- $T$ and $C$ are Boolean random variables for events (sets) generated by choices of terms and categories.
- Notation: $P(T = 1, C = 1)$, abbreviated as $P(t, c)$, is the probability that term $t$ occurs in a document classified under category $c$; similarly, $P(T = 0, C = 1)$ is abbreviated as $P(\bar{t}, c)$, and $P(t \mid c)$ stands for the probability that $t$ occurs in a document given that the document is of category $c$, etc.
Some TSR ranking functions
Functions commonly used in feature selection for text categorisation [Sebastiani, 2002]:
- Document frequency: $\#(t, c) = P(t \mid c)$
- DIA factor: $z(t, c) = P(c \mid t)$
- Expected mutual information: $I(t, c) = \sum_{c' \in \{c, \bar{c}\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c') \log \frac{P(t', c')}{P(t')P(c')}$
- Mutual information: $MI(t, c) = P(t, c) \log \frac{P(t, c)}{P(t)P(c)}$
- Chi-square: $\chi^2(t, c) = \frac{|D_t| \, [P(t, c)P(\bar{t}, \bar{c}) - P(t, \bar{c})P(\bar{t}, c)]^2}{P(t)P(\bar{t})P(c)P(\bar{c})}$
- NGL coefficient: $NGL(t, c) = \frac{\sqrt{|D_t|} \, [P(t, c)P(\bar{t}, \bar{c}) - P(t, \bar{c})P(\bar{t}, c)]}{\sqrt{P(t)P(\bar{t})P(c)P(\bar{c})}}$
- Odds ratio: $OR(t, c) = \frac{P(t \mid c)[1 - P(t \mid \bar{c})]}{[1 - P(t \mid c)]P(t \mid \bar{c})}$
- GSS coefficient: $GSS(t, c) = P(t, c)P(\bar{t}, \bar{c}) - P(t, \bar{c})P(\bar{t}, c)$

TSR by Expected Mutual Information

  fs_I(T, c, a) : T_s
    var: C_l, T_t, T_l : list
    for each $d \in D$
      if $f(d, c) = true$ do append($d$, $C_l$)
      for each $t \in d$
        put($\langle t, d \rangle$, $T_t$)
        put($\langle t, 0 \rangle$, $T_l$)
    $P(c) \leftarrow |C_l| / |D|$
    for each $t$ in $T_t$
      $D_{tc} \leftarrow \{d \mid d \in C_l \wedge \langle t, d \rangle \in T_t\}$, $D_{\bar{t}\bar{c}} \leftarrow \{d \mid d \notin C_l \wedge \langle t, d \rangle \notin T_t\}$
      $D_{t\bar{c}} \leftarrow \{d \mid d \notin C_l \wedge \langle t, d \rangle \in T_t\}$, $D_{\bar{t}c} \leftarrow \{d \mid d \in C_l \wedge \langle t, d \rangle \notin T_t\}$
      $P(t, c) \leftarrow |D_{tc}| / |D|$, $P(\bar{t}, \bar{c}) \leftarrow |D_{\bar{t}\bar{c}}| / |D|$
      $P(t, \bar{c}) \leftarrow |D_{t\bar{c}}| / |D|$, $P(\bar{t}, c) \leftarrow |D_{\bar{t}c}| / |D|$
      $P(t) \leftarrow (|D_{tc}| + |D_{t\bar{c}}|) / |D|$
      remove($\langle t, 0 \rangle$, $T_l$)
      $i \leftarrow P(t, c) \log \frac{P(t, c)}{P(t)P(c)} + P(\bar{t}, c) \log \frac{P(\bar{t}, c)}{P(\bar{t})P(c)} + P(\bar{t}, \bar{c}) \log \frac{P(\bar{t}, \bar{c})}{P(\bar{t})P(\bar{c})} + P(t, \bar{c}) \log \frac{P(t, \bar{c})}{P(t)P(\bar{c})}$
      add($\langle t, i \rangle$, $T_l$)
    sort $T_l$ by expected mutual information scores
    return first $|T_l| / a$ elements of $T_l$

The algorithm above is simply meant to illustrate how the estimation of probabilities works in very general terms. A practical implementation would not involve as many counting operations and would need to avoid zero probabilities for cases where terms do not co-occur with categories. (More about the latter in the slide on dealing with missing values.)
With respect to counting, for each term-category pair, it would suffice to estimate $P(c)$, $P(t)$ and a single joint or conditional, say $P(t \mid c) = \frac{|D_{tc}|}{|C_l|}$ (or $P(t \mid c) = \frac{|D_{tc}| + 1}{|C_l| + |T_l|}$, using a Laplace estimator), and derive the remaining values from it through straightforward applications of the properties of conditional probabilities:
  $P(t, c) = P(t \mid c)P(c)$  (11)
  $P(\bar{t}, c) = (1 - P(t \mid c))P(c)$  (12)
  $P(t, \bar{c}) = P(t) - P(t, c)$  (13)
  $P(\bar{t}, \bar{c}) = (1 - P(t)) - P(\bar{t}, c)$  (14)
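The shortcut described above can be sketched as a single function that takes the three estimates and derives the four joint probabilities through (11) to (14) before summing the expected mutual information terms. The function name, the use of base-2 logarithms and the convention that a zero joint probability contributes nothing to the sum are assumptions made for this illustration.

```python
import math


def expected_mutual_information(p_c, p_t, p_t_given_c):
    """I(t, c) computed from P(c), P(t) and P(t|c) via identities (11)-(14)."""
    joint = {
        (1, 1): p_t_given_c * p_c,           # (11) P(t, c)
        (0, 1): (1.0 - p_t_given_c) * p_c,   # (12) P(not t, c)
    }
    joint[(1, 0)] = p_t - joint[(1, 1)]            # (13) P(t, not c)
    joint[(0, 0)] = (1.0 - p_t) - joint[(0, 1)]    # (14) P(not t, not c)
    marg_t = {1: p_t, 0: 1.0 - p_t}
    marg_c = {1: p_c, 0: 1.0 - p_c}
    score = 0.0
    for (t, c), p in joint.items():
        if p > 0.0:  # assumed convention: 0 log 0 = 0
            score += p * math.log2(p / (marg_t[t] * marg_c[c]))
    return score
```

When the term and the category are independent the score is 0; when the term occurs in exactly the documents of the category (with $P(c) = 0.5$) it reaches 1 bit.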
Sample ranking
Top-ranked words for REUTERS category acq according to expected mutual information, in rank order: stake, merger, acquisition, vs, shares, buy, acquire, qtr, cts, usair, shareholders, buys

Naive Bayes text categorisation
- Categorisation status value (CSV) function: a soft classification function for each category $c_i \in C$:
  $\hat{f}_i : D \to \mathbb{R}$
- A hard classifier status value, $\hat{f}_i^h : D \to \{0,1\}$, can then be defined as follows:
  $\hat{f}_i^h(d) = \begin{cases} 1 & \text{if } \hat{f}_i(d) \ge \tau_i, \\ 0 & \text{otherwise.} \end{cases}$  (15)
- Thresholds can be determined analytically or experimentally.
Experimental thresholds
- CSV thresholding, or SCut (SCut stands for optimal thresholding on the confidence scores of category candidates): vary $\tau_i$ on $D_v$ and choose the value that maximises effectiveness.
- Proportional thresholding: choose $\tau_i$ s.t. the generality measure $g_{Tr}(c_i)$ is closest to $g_{Tv}(c_i)$.
- RCut, or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.
- See [Yang, 2001] for a recent survey of thresholding strategies.

CSV for multi-variate Bernoulli models
- Starting from the independence assumption and Bayes' rule
  $P(\vec{d} \mid c) = \prod_{k=1}^{|T|} P(t_k \mid c)$
  $P(c \mid \vec{d}_j) = \frac{P(c) P(\vec{d}_j \mid c)}{P(\vec{d}_j)}$
- derive a monotonically increasing function of $P(c \mid \vec{d})$:
  $\hat{f}(d, c) = \sum_{i=1}^{|T|} t_i \log \frac{P(t_i \mid c)[1 - P(t_i \mid \bar{c})]}{P(t_i \mid \bar{c})[1 - P(t_i \mid c)]}$  (16)
- Need to estimate $2|T|$, rather than $2^{|T|}$, parameters.
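A small sketch of the Bernoulli CSV function (16), followed by thresholding as in (15). It assumes the per-term conditionals have already been estimated and lie strictly between 0 and 1; the function names, the toy probabilities and the threshold value are hypothetical.

```python
import math


def bernoulli_csv(doc, p_t_c, p_t_notc):
    """Equation (16). doc: list of 0/1 term indicators t_i;
    p_t_c[i] = P(t_i | c), p_t_notc[i] = P(t_i | not c)."""
    return sum(
        t * math.log(pc * (1.0 - pn) / (pn * (1.0 - pc)))
        for t, pc, pn in zip(doc, p_t_c, p_t_notc)
    )


def hard_decision(score, tau):
    """Equation (15): 1 if the CSV reaches the threshold tau, else 0."""
    return 1 if score >= tau else 0


# Assumed toy estimates for two terms.
p_t_c = [0.8, 0.5]
p_t_notc = [0.2, 0.5]
score = bernoulli_csv([1, 0], p_t_c, p_t_notc)
```

Only terms present in the document ($t_i = 1$) contribute to the sum, and a term that is equally likely in and out of the category contributes 0.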
Alternative: multinomial model
- An alternative implementation of the Naïve Bayes classifier is described in [Mitchell, 1997].
- In this approach, words appear as values rather than names of attributes.
- A document representation for this slide would look like this:
  $\vec{d} = \langle a_1 = \text{"an"}, a_2 = \text{"alternative"}, a_3 = \text{"implementation"}, \ldots \rangle$
- Problem: each attribute's value would range over the entire vocabulary. Many values would be missing for a typical document.

Dealing with missing values
- What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then $\hat{P}(a_i \mid v_j) = 0$, and...
  $\hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = 0$
- The typical solution is a Bayesian estimate for $\hat{P}(a_i \mid v_j)$:
  $\hat{P}(a_i \mid v_j) \leftarrow \frac{n_c + mp}{n + m}$
  where
  - $n$ is the number of training examples for which $v = v_j$,
  - $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$,
  - $p$ is the prior estimate for $\hat{P}(a_i \mid v_j)$,
  - $m$ is the weight given to the prior (i.e. the number of "virtual" examples).
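The Bayesian estimate above is a one-liner; the sketch below (with a hypothetical function name) also shows that choosing a uniform prior $p = 1/|V|$ and $m = |V|$, for a vocabulary of size $|V|$, gives the familiar add-one (Laplace) estimate.

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate (n_c + m*p) / (n + m) of P(a_i | v_j)."""
    return (n_c + m * p) / (n + m)


# An unseen attribute value no longer gets probability 0:
unseen = m_estimate(n_c=0, n=10, p=0.5, m=2)

# Uniform prior over an assumed vocabulary of 5 values, m = 5: Laplace estimate.
laplace = m_estimate(n_c=3, n=10, p=1 / 5, m=5)
```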
Learning in multinomial models

Algorithm 1: NB probability estimation

  NB_Learn($D_t$, $C$)
    /* collect all tokens that occur in $D_t$ */
    $T \leftarrow$ all distinct words and other tokens in $D_t$
    /* calculate $P(c_j)$ and $P(t_k \mid c_j)$ */
    for each target value $c_j$ in $C$ do
      $D_t^j \leftarrow$ subset of $D_t$ for which target value is $c_j$
      $P(c_j) \leftarrow |D_t^j| / |D_t|$
      $Text_j \leftarrow$ concatenation of all texts in $D_t^j$
      $n \leftarrow$ total number of tokens in $Text_j$
      for each word $t_k$ in $T$ do
        $n_k \leftarrow$ number of times word $t_k$ occurs in $Text_j$
        $P(t_k \mid c_j) \leftarrow \frac{n_k + 1}{n + |T|}$
      done
    done

Sample Classification Algorithm
- Could calculate posterior probabilities for soft classification
  $\hat{f}(d) = P(c) \prod_{k=1}^{n} P(t_k \mid c)$
  and use thresholding as before.
- Or, for SLTC, implement hard categorisation directly:

Algorithm 2: MNB Categorisation Status Function

  CSV_MNB($d : D$)
    positions $\leftarrow$ all word positions in $d$ that contain tokens found in $T$
    return $c_{nb}$, where
      $c_{nb} = \arg\max_{c_i \in C} P(c_i) \prod_{k \in positions} P(t_k \mid c_i)$
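The two algorithms can be sketched in Python as follows. This is an illustration under assumptions, not the lecture's implementation: documents are taken to be lists of tokens paired with a single category, the function names mirror the pseudocode, and scores are summed in log space (a common way of avoiding numerical underflow that leaves the arg max unchanged).

```python
import math
from collections import Counter


def nb_learn(training_docs):
    """training_docs: list of (tokens, category) pairs.

    Returns the vocabulary T, the priors P(c_j) and the Laplace-smoothed
    conditionals P(t_k | c_j) = (n_k + 1) / (n + |T|).
    """
    vocab = {tok for tokens, _ in training_docs for tok in tokens}
    categories = {cat for _, cat in training_docs}
    priors, conditionals = {}, {}
    for cat in categories:
        docs = [tokens for tokens, c in training_docs if c == cat]
        priors[cat] = len(docs) / len(training_docs)
        counts = Counter(tok for tokens in docs for tok in tokens)  # n_k
        n = sum(counts.values())                                    # tokens in Text_j
        conditionals[cat] = {t: (counts[t] + 1) / (n + len(vocab)) for t in vocab}
    return vocab, priors, conditionals


def csv_mnb(tokens, vocab, priors, conditionals):
    """Return arg max over categories of P(c) * prod P(t_k | c), computed in log space.
    Tokens not found in the vocabulary are skipped, as in Algorithm 2."""
    def log_score(cat):
        return math.log(priors[cat]) + sum(
            math.log(conditionals[cat][t]) for t in tokens if t in vocab
        )
    return max(priors, key=log_score)


# Toy training set, assumed for illustration only.
training = [
    (["stake", "merger", "shares"], "acq"),
    (["wheat", "tonnes", "export"], "wheat"),
    (["merger", "buy"], "acq"),
]
vocab, priors, conditionals = nb_learn(training)
```

For example, `csv_mnb(["merger", "stake"], vocab, priors, conditionals)` selects `"acq"` on this toy data.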
Naive but subtle
- The conditional independence assumption is clearly false
  $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
- ...but NB works well anyway. Why?
  - posteriors $\hat{P}(v_j \mid x)$ don't need to be correct; we need only that:
    $\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = \arg\max_{v_j \in V} P(v_j) P(a_1, \ldots, a_n \mid v_j)$
- Naive Bayes posteriors are often unrealistically close to 1 or 0; see [Domingos and Pazzani, 1996] for analysis.

Other Probabilistic Classifiers
- One could also represent data as real-valued vectors (e.g. normalised TFxIDF) and assume an underlying normal distribution to estimate the probabilities.
- Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
  - adopting weighted document vectors, rather than binary-valued ones
  - introducing document length normalisation, in order to correct distortions in $CSV_i$ introduced by long documents
  - relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness)
Decision tree classifiers and learners
- A decision tree is a tree with:
  - internal nodes labelled by terms
  - edges labelled by tests (on the weight the term from which they depart has in the document)
  - leaves labelled by categories
- Given a decision tree, categorisation of a document $d_j$ is done by recursively testing the values of the internal nodes of the tree against those in $d_j$ until a leaf is reached.
- Simplest case: $d_j$ consists of binary values.

Example: a (binary) Decision Tree
[Figure: binary decision tree for the WHEAT category; internal nodes test presence or absence of the terms wheat, farm, commodity, bushels, export, agriculture, tonnes, winter and soft, and leaves are labelled WHEAT or $\neg$WHEAT]

The tree is equivalent to the rule:
  if (wheat $\wedge$ farm) or (wheat $\wedge$ commodity) or (bushels $\wedge$ export) or (wheat $\wedge$ tonnes) or (wheat $\wedge$ agriculture) or (wheat $\wedge$ winter $\wedge$ $\neg$soft) then WHEAT else $\neg$WHEAT
A learning algorithm

Algorithm 3: Decision tree learning

  DTreeLearn($D_t : 2^D$, $T : 2^T$, default $: C$) : tree
    if isempty($D_t$) then
      return default
    else if $\prod_{d_i \in D_t} f(d_i, c_j) = 1$ then   /* all $d_i$ have class $c_j$ */
      return $c_j$
    else if isempty($T$) then
      return MajorityCateg($D_t$)
    else
      $t_{best} \leftarrow$ ChooseFeature($T$, $D_t$)
      tree $\leftarrow$ new dtree with root $= t_{best}$
      for each $v_k \in t_{best}$ do
        $D_t^k \leftarrow \{d_l \in D_t \mid t_{best} \text{ has value } v_k \text{ in } d_l\}$
        sbt $\leftarrow$ DTreeLearn($D_t^k$, $T \setminus \{t_{best}\}$, MajorityCateg($D_t$))
        add a branch to tree with label $v_k$ and subtree sbt
      done
      return tree

How do we implement ChooseFeature?
- Finding the right feature to partition the training set is essential.
- Entropy (the $H(\cdot)$ function, below), AKA self-information, measures the amount of uncertainty w.r.t. a probability distribution. In other words, entropy is a measure of how much we learn when we observe an event occurring in accordance with this distribution.
- One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by candidate features $t_k$:
  $H(D) = -\frac{|D_c|}{|D|} \log \frac{|D_c|}{|D|} - \frac{|D_{\bar{c}}|}{|D|} \log \frac{|D_{\bar{c}}|}{|D|}$  (17)
  where $|D_c|$ ($|D_{\bar{c}}|$) is the number of positive (negative) instances filed under category $c$ in $D$.
  $IG(T, D) = H(D) - \left[ \frac{|D_t|}{|D|} H(D_t) + \frac{|D_{\bar{t}}|}{|D|} H(D_{\bar{t}}) \right]$  (18)
  where $D_t$ and $D_{\bar{t}}$ are the subsets of $D$ containing instances for which $T$ has value 1 and 0, respectively.
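Equations (17) and (18) can be sketched for a Boolean feature and a Boolean category as follows. The function names are hypothetical and base-2 logarithms are assumed; the usage lines apply them to the windy and play columns of the sample weather data set (true/yes encoded as True).

```python
import math


def entropy(labels):
    """Equation (17): binary entropy of a list of Boolean category labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(feature, labels):
    """Equation (18): feature and labels are parallel lists of Booleans."""
    with_t = [c for t, c in zip(feature, labels) if t]
    without_t = [c for t, c in zip(feature, labels) if not t]
    n = len(labels)
    return entropy(labels) - (
        len(with_t) / n * entropy(with_t) + len(without_t) / n * entropy(without_t)
    )


# windy and play columns of the 14-instance weather data set, rows 1 to 14.
windy = [False, True, False, False, False, True, True,
         False, False, False, True, True, False, True]
play = [False, False, True, True, True, False, True,
        False, True, True, True, True, True, False]
gain_windy = information_gain(windy, play)
```

ChooseFeature would compute this gain for every candidate feature and return the one with the highest value; multi-valued features such as outlook need the weighted sum extended over all values.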
Important Issues
- Choosing the right feature (from $T$) to partition the training set
  - Choose the feature with the highest information gain
- Avoiding overfitting:
  - memorising all observations from $Tr$, versus
  - extracting patterns and extrapolating to unseen examples in $Tv$ and $Te$
- Occam's razor: the most likely hypothesis is the simplest one which is consistent with all observations
- Pruning: remove over-specific branches

A DT for the "earnings" category
[Figure: decision tree for the category c = "earnings" [Manning and Schütze, 1999]. Nodes n1 to n7 are annotated with their number of documents and P(c | n), with P(c | n1) = 0.3 at the root; the splits are cts < 2 vs. cts >= 2, net < 1 vs. net >= 1, and vs < 2 vs. vs >= 2. A second panel shows the decision boundaries in the net/cts plane and the probability that a document at node n4 belongs to category c.]
Calculating node probabilities
- One can assign probabilities to a leaf node (i.e. the probability that a new document $d$ belonging to that node should be filed under category $c$) as follows (using add-one smoothing, i.e. one pseudo-count for each of the two outcomes $c$ and $\bar{c}$):
  $P(c \mid d_n) = \frac{|D_{cn}| + 1}{|D_{cn}| + |D_{\bar{c}n}| + 2}$  (19)
  where
  - $P(c \mid d_n)$ is the probability that a document $d_n$ which ended up in node $n$ belongs to category $c$,
  - $|D_{cn}|$ ($|D_{\bar{c}n}|$) is the number of (training) documents in node $n$ which have been assigned category $c$ ($\bar{c}$).

Text Classifier Evaluation
- Evaluation of TC systems is usually done experimentally rather than analytically.
- Analytical evaluation is difficult due to the subjective nature of the task.
- Experimental evaluation aims at measuring classifier effectiveness, that is, its ability to make correct classification decisions for the largest possible number of documents.
- We have already seen two measures used in experimental evaluation: precision and recall.
- Today we will characterise these measures more precisely and see other ways of evaluating TC.
Precision and recall
- Precision ($\pi$), with respect to category $c_i$, may be defined as the following conditional probability:
  $\pi = P(f(d_x, c_i) = T \mid \hat{f}(d_x, c_i) = T)$  (20)
  i.e. the probability that, if a document $d_x$ has been classified as $c_i$, this decision is correct.
- Analogously, recall ($\rho$) may be defined as follows:
  $\rho = P(\hat{f}(d_x, c_i) = T \mid f(d_x, c_i) = T)$  (21)
  i.e. the probability that, if a random document $d_x$ is meant to be filed under $c_i$, it will be classified as such.

Calculating precision and recall
[Figure: Venn diagram over all texts, showing the target set ($f(d_x, c_i) = T$) and the selected set ($\hat{f}(d_x, c_i) = T$), with regions labelled $TP_i$, $FP_i$, $FN_i$ and $TN_i$]
  $\pi_i = \frac{TP_i}{TP_i + FP_i}$
  $\rho_i = \frac{TP_i}{TP_i + FN_i}$
Combining local into global measures

Local estimates of the probabilities in (20) and (21) may be combined to yield estimates for the classifier as a whole. The contingency table below summarises precision and recall over a category set:

Category set C = {c_1, c_2, ...}    Expert judgement: YES                 Expert judgement: NO
TC system judgement: YES            $TP = \sum_{i=1}^{|C|} TP_i$          $FP = \sum_{i=1}^{|C|} FP_i$
TC system judgement: NO             $FN = \sum_{i=1}^{|C|} FN_i$          $TN = \sum_{i=1}^{|C|} TN_i$

Effectiveness averaging

Two different methods may be used to calculate global values for $\pi$ and $\rho$: microaveraging and macroaveraging.

Microaveraging:

$$\pi^{\mu} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \tag{22}$$

$$\rho^{\mu} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} \tag{23}$$
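Microaveraging, as in (22) and (23), pools the per-category counts before dividing. A small sketch follows; the counts are invented for illustration and the function name is an assumption.

```python
def micro_average(counts):
    """Microaveraged precision and recall.

    counts: list of (TP_i, FP_i, FN_i) tuples, one per category.
    Assumes at least one document was selected and at least one targeted.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical counts: one frequent category and one rare category
pi_micro, rho_micro = micro_average([(10, 2, 5), (1, 3, 1)])
```

Because the counts are summed first, frequent categories dominate the microaveraged scores.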
Macroaveraging

Precision macroaveraging is calculated as follows:

$$\pi^{M} = \frac{\sum_{i=1}^{|C|} \pi_i}{|C|} \tag{24}$$

Recall macroaveraging is calculated as follows:

$$\rho^{M} = \frac{\sum_{i=1}^{|C|} \rho_i}{|C|} \tag{25}$$

where $\pi_i$ and $\rho_i$ are local scores.

An example

Judgement: E(xpert) and S(ystem)

Category                   Sport      Politics   World
                           E   S      E   S      E   S
Brazil beat Venezuela      T   F      F   F      F   T
US defeated Afghanistan    F   T      T   T      T   F
Elections in Wicklow       F   F      T   T      F   F
Elections in Peru          F   F      F   T      T   T
Precision (local):         π = 0      π = 0.67   π = 0.5
Recall (local):            ρ = 0      ρ = 1      ρ = 0.5

$$\pi^{\mu} = \frac{0 + 2 + 1}{(0 + 1) + (2 + 1) + (1 + 1)} = 0.5 \qquad \pi^{M} = \frac{0 + 0.67 + 0.5}{3} = 0.39$$
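The example table can be checked numerically. The sketch below encodes each (Expert, System) judgement pair from the table as booleans, in row order, and computes the local, macroaveraged and microaveraged scores; the variable and function names are illustrative.

```python
# (expert, system) judgements per category, in the row order of the table:
# Brazil beat Venezuela, US defeated Afghanistan,
# Elections in Wicklow, Elections in Peru
judgements = {
    "Sport":    [(True, False), (False, True), (False, False), (False, False)],
    "Politics": [(False, False), (True, True), (True, True), (False, True)],
    "World":    [(False, True), (True, False), (False, False), (True, True)],
}


def local_scores(pairs):
    """Return (TP, FP, FN, precision, recall) for one category."""
    tp = sum(1 for e, s in pairs if e and s)
    fp = sum(1 for e, s in pairs if not e and s)
    fn = sum(1 for e, s in pairs if e and not s)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall


scores = {c: local_scores(p) for c, p in judgements.items()}

# Macroaveraging: average the local scores over categories, as in (24), (25)
macro_precision = sum(s[3] for s in scores.values()) / len(scores)
macro_recall = sum(s[4] for s in scores.values()) / len(scores)

# Microaveraging: pool the counts first, as in (22), (23)
tp = sum(s[0] for s in scores.values())
fp = sum(s[1] for s in scores.values())
fn = sum(s[2] for s in scores.values())
micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
```

Macroaveraging gives every category the same weight, so the poorly handled Sport category pulls the macroaveraged precision below the microaveraged one.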
Other measures

Since one knows (by experimentation) which documents fall into TP, FP, TN and FN, one may also estimate Accuracy (A) and Error (E):

$$A = \frac{TP + TN}{TP + TN + FP + FN} \tag{26}$$

$$E = \frac{FP + FN}{TP + TN + FP + FN} = 1 - A \tag{27}$$

These measures, however, are not widely used in TC because they are less sensitive to variations in the number of correct decisions than $\pi$ and $\rho$.

Fallout and ROC curves

A less frequently used measure is fallout:

$$\mathrm{Fallout}_i = \frac{FP_i}{FP_i + TN_i} \tag{28}$$

Fallout measures the proportion of non-targeted items that were mistakenly selected.
In certain fields recall-fallout trade-offs are more common than precision-recall ones.
The receiver operating characteristic, or ROC curve, shows how different levels of fallout influence recall (sensitivity).
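A short sketch of (26) to (28), with an invented contingency table for a rare category, illustrates why accuracy is less informative than recall in TC. The counts and function names are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of correct decisions, as in (26)."""
    return (tp + tn) / (tp + tn + fp + fn)


def error(tp, fp, fn, tn):
    """Proportion of incorrect decisions, as in (27); equals 1 - accuracy."""
    return (fp + fn) / (tp + tn + fp + fn)


def fallout(fp, tn):
    """Proportion of non-targeted items mistakenly selected, as in (28)."""
    return fp / (fp + tn) if fp + tn else 0.0


# Hypothetical rare category: 10 target documents out of 1000, and a
# classifier that never selects anything.
acc = accuracy(tp=0, fp=0, fn=10, tn=990)
err = error(tp=0, fp=0, fn=10, tn=990)
```

Here accuracy is 0.99 even though recall is 0, because the large TN count dominates the denominator.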
Alternatives to effectiveness

Efficiency is often used as an additional criterion in TC evaluation. It may be measured with respect to training or classification.
The utility criterion, from decision theory, is sometimes used. An obvious example of the application of utility measures is spam filtering, where failing to discard spam is less serious than discarding a legitimate message.

Combining precision and recall

Neither $\pi$ nor $\rho$ makes much sense in isolation. Classifiers can be tuned to maximise one at the expense of the other.
TC evaluation is therefore done in terms of measures that combine $\pi$ and $\rho$.
We will examine two such measures: the breakeven point and the F functions.
Breakeven point

The breakeven point is the value at which $\pi$ equals $\rho$, as determined by the following process:
A plot of $\pi$ as a function of $\rho$ is computed by varying the thresholds $\tau_i$ for the CSV function from 1 to 0. With the threshold set to 1, only those documents that the classifier is totally sure belong to the category will be selected, so $\pi$ will tend to 1 and $\rho$ to 0; as we decrease $\tau_i$, $\pi$ will decrease but $\rho$ will increase.
The breakeven point is the value (of $\rho$ or $\pi$) at which the plot intersects the $\rho = \pi$ line.

The F functions

The idea behind F measures [van Rijsbergen, 1979, ch. 7] is to assign a degree of importance to $\rho$ and $\pi$.
Let $\beta$ be a factor ($0 \le \beta \le \infty$) quantifying such degree of importance.
The $F_\beta$ function can be computed via the following:

$$F_\beta = \frac{(\beta^2 + 1)\,\pi\rho}{\beta^2 \pi + \rho} \tag{29}$$

A $\beta$ value of 1 assigns equal importance to precision and recall.
The breakeven of a classifier is always less than or equal to its $F_\beta$ for $\beta = 1$.
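Both combined measures can be sketched in a few lines. The code below implements $F_\beta$ as in (29) and approximates the breakeven point by sweeping the threshold over the observed CSV scores, from high to low. The scored documents are invented, and taking the average of $\pi$ and $\rho$ at the threshold where they are closest is an interpolation assumed here for the case where they never coincide exactly.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta as in (29); returns 0.0 when the denominator is zero."""
    denom = beta ** 2 * precision + recall
    return (beta ** 2 + 1) * precision * recall / denom if denom else 0.0


def breakeven_point(scored):
    """Approximate breakeven point for one category.

    scored: non-empty list of (csv_score, is_target) pairs with at least
    one target document. Thresholds are the observed scores, high to low.
    """
    n_target = sum(1 for _, is_target in scored if is_target)
    best = None  # (|precision - recall|, interpolated value)
    for tau in sorted({s for s, _ in scored}, reverse=True):
        selected = [is_target for s, is_target in scored if s >= tau]
        tp = sum(selected)
        precision = tp / len(selected)
        recall = tp / n_target
        gap = abs(precision - recall)
        if best is None or gap < best[0]:
            best = (gap, (precision + recall) / 2)
    return best[1]


# Hypothetical CSV scores with expert judgements
scored = [(0.9, True), (0.8, True), (0.7, False),
          (0.6, True), (0.4, False), (0.2, False)]
bep = breakeven_point(scored)   # precision and recall meet at tau = 0.7
f1_at_bep = f_beta(bep, bep)    # F_1 equals the common value when pi == rho
```

With this form of (29), $\beta = 0$ reduces $F_\beta$ to $\pi$, and increasing $\beta$ moves it towards $\rho$.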
Comparison of existing TC systems

From [Sebastiani, 2002]. See also [Yang and Liu, 1999].

Type             Systems
non-learning     WORD
probabilistic    PropBayes, Bim, Nb
decision tree    C4.5, Ind
decision rules   Swap-1, Ripper, etc.
regression       LLSF
online linear    BWinnow
batch linear     Rocchio
neural nets      Classi
example based    k-NN, Gis-W
SVM              SVMLight
ensemble         AdaBoost

A final example: WSD

Consider the following occurrences of the word "bank":

RIV  y be? Then he ran down along the bank, toward a narrow, muddy path.
FIN  four bundles of small notes the bank cashier got it into his head
RIV  ross the bridge and on the other bank you only hear the stream, the
RIV  beneath the house, where a steep bank of earth is compacted between
FIN  op but is really the branch of a bank. As I set foot inside, despite
FIN  raffic police also belong to the bank. More foolhardy than entering
FIN  require a number. If you open a bank account, the teller identifies
RIV  circular movement, skirting the bank of the River Jordan, then turn

The WSD learning task is to learn to distinguish between the meanings "financial institution" (FIN) and "the land alongside a river" (RIV).
Task, data representation, performance measures

WSD can be described as a categorisation task where:
senses (FIN, RIV) are labels (C);
the representation of instances (D) comes from the context surrounding the words to be disambiguated.
E.g.: for T = {along, cashier, stream, muddy, ...}, we could have
$d_1 = \langle \text{along} = 1, \text{cashier} = 0, \text{stream} = 0, \text{muddy} = 1, \ldots \rangle$ and $f(d_1) = \text{RIV}$.
Performance can be measured as in text categorisation.

A decision tree

Using the algorithm above (slide 61) we get this decision tree:

[Figure: decision tree whose internal nodes test the features on, river, when, money and from, each with branches 1 and 0, and whose leaves are labelled RIV or FIN.]

Trained on a small training set with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}.
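The binary context representation used for $d_1$ can be sketched as follows. The tokenisation (lower-casing and keeping alphabetic strings) and the function name are assumptions made for illustration; the context is the first RIV occurrence of "bank" from the concordance, and T is the four-word feature set from the example.

```python
import re


def context_features(context, vocabulary):
    """Binary feature vector: 1 if the term occurs in the context, else 0."""
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    return {term: int(term in tokens) for term in vocabulary}


T = ["along", "cashier", "stream", "muddy"]
d1 = context_features(
    "Then he ran down along the bank, toward a narrow, muddy path.", T)
label_d1 = "RIV"  # f(d1), taken from the annotated concordance line
```

Each annotated occurrence then yields one (feature vector, sense) pair for the learner.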
Accuracy, k-NN

This simple decision tree obtains F_1 scores of [...] for RIV and [...] for FIN.
A 3-NN classifier for the same data, using numeric vectors with values corresponding to the distance of the various features to the keyword (negative integers for words to the left of the keyword, positive integers for words to the right of the keyword), obtained F_1 scores of 0.88 and 0.91, respectively.

References

M. Collins and N. Duffy. Convolution kernels for natural language. Advances in Neural Information Processing Systems, 14.
M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1): 25-69.
W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. MBT: A memory-based part of speech tagger generator. In Proc. of Fourth Workshop on Very Large Corpora, pages 14-27.
Walter Daelemans and Antal van den Bosch. Language-independent data-oriented grapheme to phoneme conversion. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, editors, Progress in Speech Synthesis. Springer.
Pedro Domingos and Michael J. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In International Conference on Machine Learning. URL citeseer.nj.nec.com/domingos96beyond.html.
Gerard Escudero, Lluís Màrquez, and German Rigau. Boosting applied to word sense disambiguation. In Ramon López De Mántaras and Enric Plaza, editors, Proceedings of ECML-00, 11th European Conference on Machine Learning, Barcelona. Springer Verlag.
William Gale, Kenneth Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26.
M. Haruno, S. Shirai, and Y. Ooyama. Using decision trees to construct a practical parser. Machine Learning, 34(1).
Deirdre Hogan. Coordinate noun phrase disambiguation in a generative parsing model. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.
Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz.
George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In Philippe Besnard and Steve Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI 95), San Francisco, CA, USA, August. Morgan Kaufmann Publishers.
D.D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93.
David M. Magerman. Statistical decision-tree models for parsing. In Meeting of the Association for Computational Linguistics. URL citeseer.ist.psu.edu/magerman95statistical.html.
More informationMachine Learning, Midterm Exam
10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationCS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning
CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we
More informationArtificial Intelligence Roman Barták
Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their
More informationLINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning
LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationTools of AI. Marcin Sydow. Summary. Machine Learning
Machine Learning Outline of this Lecture Motivation for Data Mining and Machine Learning Idea of Machine Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication
More informationApplied Logic. Lecture 4 part 2 Bayesian inductive reasoning. Marcin Szczuka. Institute of Informatics, The University of Warsaw
Applied Logic Lecture 4 part 2 Bayesian inductive reasoning Marcin Szczuka Institute of Informatics, The University of Warsaw Monographic lecture, Spring semester 2017/2018 Marcin Szczuka (MIMUW) Applied
More informationData Mining Part 4. Prediction
Data Mining Part 4. Prediction 4.3. Fall 2009 Instructor: Dr. Masoud Yaghini Outline Introduction Bayes Theorem Naïve References Introduction Bayesian classifiers A statistical classifiers Introduction
More information