
1 Machine Learning for NLP: Supervised learning techniques
Saturnino Luz, Dept. of Computer Science, Trinity College Dublin, Ireland. ESSLLI 07, Dublin, Ireland.

Outline
Introduction to supervised learning: main concepts, notation, some supervised methods.
NLP applications (case studies):
Text categorisation (in detail): data representation, target and categorisation functions, algorithms (probabilistic, symbolic), evaluation.
Word-sense disambiguation (briefly).

2 The uses of supervised learning
Supervised learning is possibly the type of machine learning method most widely used in natural language processing applications.
A supervised learner has access to a teacher which describes the function to be learnt over a number of training examples; in practice, an annotated data set (corpus, etc).
The inductive process builds an approximation f̂ of a target function f.

Classification tasks
Supervised learning methods are usually employed in learning of classification tasks. Some notation:
D = {d_1, ..., d_|D|}: a set of data instances.
C = {c_1, ..., c_|C|}: a set of categories with respect to which instances will be classified.
In its most general form, a target function will have signature f : D → 2^C.
Classes are usually defined through manual annotation, or labelling.

3 Representing the classification function
Multi-label classification (general form of f, above): an instance may have any number of categories (e.g. classification of news articles).
In practice, one often uses (category-specific) single-label classifiers of the form f̂ : D × C → {0,1}, s.t.:

    f̂(d, c) = 1 if d belongs to class c, 0 otherwise   (1)

Multi-label classification as a composition of single-label classifiers: make f̂(d) = {c | f̂(d, c) = 1} (assuming the c's are stochastically independent).
Controlled-vocabulary keyword assignment and document classification into web hierarchies are examples of MLC, whereas ambiguity resolution in NLP is an example of SLC. Spam filtering is an example of the binary case of SLC.

Data representation
As we have seen in lecture 1, the data presented to the learning algorithm needs to be uniformly represented.
We assume that instances will be represented as feature vectors d_i = ⟨t_1, ..., t_n⟩ whose values t_1, ..., t_n ∈ T will vary depending on the data representation scheme chosen.
E.g.: our training instances for last lecture's draughts player are represented by 5-dimensional integer-valued feature vectors ⟨bp(b), rp(b), bk(b), ...⟩.

4 Performance measure
Classification performance can be evaluated in different ways, depending on the domain. Commonly used measures are accuracy and error.
For most NLP applications it usually makes more sense to measure performance in terms of precision, recall, and combinations of these two measures such as F scores and break-even points.

Algorithms: hard vs. soft classification
A classification function f̂ : D × C → {0,1} performs what we call hard classification.
Another possibility is to allow f̂ to range over real values in the interval [0,1] (i.e. f̂(d, c) = v, where 0 ≤ v ≤ 1).
This soft classification approach is equivalent to ranking the classes in C according to their appropriateness to each instance d.
Soft classification can be turned into hard classification via thresholding.

5 Classifier induction and training experience
Inducing a classification function f̂ through supervised learning involves a train-and-test strategy. The data set D is split into:
a training set D_t, a test set D_e, and (sometimes) a validation set D_v, which is reserved for parameter tuning.
The strategy: induce f̂ from D_t, tune parameters on D_v, then test by comparing f̂ to f on D_e.

Data sparsity and cross validation
Important: no data used in training (D_t) should be used for testing (D_e). Evaluation must show that f̂ generalises to unseen data (i.e. that overfitting has been avoided).
In many areas, NLP in particular, data sparsity is a problem. (k-fold) cross validation is often used to deal with it: build k classifiers f̂_1, ..., f̂_k over k partitions D_1, ..., D_k of D.
The train-and-test procedure is applied iteratively on pairs ⟨D_t^i, D_e^i⟩ of training and test partitions, where D_t^i = D \ D_i and D_e^i = D_i, for 1 ≤ i ≤ k.
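
As an illustration of the k-fold procedure above, a minimal sketch in Python; the train and evaluate functions are placeholders for whatever learner and effectiveness measure are being used, not something defined in these notes:

    import random

    def k_fold_cross_validation(D, k, train, evaluate, seed=0):
        """Split data set D into k partitions and apply train-and-test k times."""
        data = list(D)
        random.Random(seed).shuffle(data)
        partitions = [data[i::k] for i in range(k)]        # D_1, ..., D_k
        scores = []
        for i in range(k):
            D_e = partitions[i]                            # test partition D_i
            D_t = [d for j, p in enumerate(partitions) if j != i for d in p]  # D \ D_i
            f_hat = train(D_t)                             # induce classifier on D_t^i
            scores.append(evaluate(f_hat, D_e))            # test on D_e^i
        return sum(scores) / k                             # average effectiveness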

6 Algorithms: inference methods
Symbolic, numeric and meta-classification methods.
Numeric methods implement classification indirectly: the classification function f̂ outputs a numerical score; hard classification via thresholding. E.g.: probabilistic classifiers, regression methods, ...
Symbolic methods usually implement hard classification directly. E.g.: decision trees, decision rules, ...
Meta-classification methods combine results from independent classifiers. E.g.: classifier ensembles, committees, ...

Probabilistic classifiers
Probabilistic classifiers output an estimation of the conditional probability P(c | d̄) = f̂(d̄, c) that an instance represented as d̄ should be classified as c.
Treat the elements of d̄ as random variables T_i (1 ≤ i ≤ |T|).
We would need to estimate probabilities for all possible representations, i.e. P(c | T_1, ..., T_n). Too costly in practice: for the discrete case with n possible nominal values that is O(n^|T|).
Independence assumptions help.

7 Conditional independence assumption
Using Bayes rule we get

    P(c | d̄_j) = P(c) P(d̄_j | c) / P(d̄_j)   (2)

Naïve Bayes classifiers assume T_1, ..., T_n are independent of each other given the target category:

    P(d̄ | c) = ∏_{k=1}^{|T|} P(t_k | c)   (3)

Maximum a posteriori hypothesis: choose the c that maximises (2).
Maximum likelihood hypothesis: choose the c that maximises P(d̄_j | c).

Variants of Naive Bayes classifiers
Multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and multinomial models, where the variables represent count data [McCallum and Nigam, 1998].
Numeric data representation: attributes represented by continuous probability distributions. Using Gaussian distributions, the conditionals can be estimated as

    P(T_i = t | c) = (1 / (σ√(2π))) e^{−(t−µ)² / (2σ²)}   (4)

Non-parametric kernel density estimation has also been proposed [John and Langley, 1995].
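
A minimal sketch of estimating and applying the Gaussian conditional in (4); the helper names and the per-class data format are illustrative assumptions, not part of the slides:

    import math

    def gaussian_mle(values):
        """Estimate mean and standard deviation of a feature within one class."""
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        return mu, math.sqrt(var) or 1e-9   # fall back to a tiny sigma if variance is zero

    def gaussian_conditional(t, mu, sigma):
        """P(T_i = t | c) under a Gaussian model, as in equation (4)."""
        return (1.0 / (sigma * math.sqrt(2 * math.pi))) * \
               math.exp(-((t - mu) ** 2) / (2 * sigma ** 2))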

8 Uses of NB in NLP
Information retrieval [Robertson and Jones, 1988]
Text categorisation (see [Sebastiani, 2002] for a survey)
Spam filters
Word sense disambiguation [Gale et al., 1992]

A symbolic method: decision trees
Symbolic methods offer the advantage that their classification decisions are easily interpretable.
Decision trees:
data represented as vectors of discrete-valued (or discretised) attributes;
classification through binary tests on highly discriminative features;
test sequence encoded as a tree structure.
[Figure: decision tree for the tennis weather data]
outlook = sunny → humidity: high → no; normal → yes
outlook = overcast → yes
outlook = rainy → windy: true → no; false → yes

9 A sample data set (Mitchell, 97)

      outlook   temperature  humidity  windy  play
   1  sunny     hot          high      false  no
   2  sunny     hot          high      true   no
   3  overcast  hot          high      false  yes
   4  rainy     mild         high      false  yes
   5  rainy     cool         normal    false  yes
   6  rainy     cool         normal    true   no
   7  overcast  cool         normal    true   yes
   8  sunny     mild         high      false  no
   9  sunny     cool         normal    false  yes
  10  rainy     mild         normal    false  yes
  11  sunny     mild         normal    true   yes
  12  overcast  mild         high      true   yes
  13  overcast  hot          normal    false  yes
  14  rainy     mild         high      true   no

Divide-and-conquer learning strategy
Choose the features that better divide the instance space.
[Figure: distribution of play=yes vs. play=no for each value of the attributes outlook, temperature, humidity and windy in the tennis weather task]

10 Uses of decision trees in NLP
Parsing [Haruno et al., 1999, Magerman, 1995]
Text categorisation [Lewis and Ringuette, 1994]
Word-sense disambiguation, POS tagging, ...

Instance-based methods
The majority of instance-based learners are lazy learners.
The importance of being lazy: instead of estimating the target function once for the whole instance space, estimate it locally and differently for each new instance.
A family of related techniques: k-nearest neighbour, locally weighted regression, radial basis functions, case-based reasoning.

11 k-nearest neighbours
Key idea: just store all training examples ⟨x_i, f(x_i)⟩.
Nearest neighbour: given a query instance x_q, first locate the nearest training example x_n, then estimate f̂(x_q) ← f(x_n).
k-nearest neighbour, for:
classification (discrete-valued target function): given x_q, take a vote among its k nearest neighbours, or
regression (real-valued f): take the mean of the f values of the k nearest neighbours:

    f̂(x_q) ← (1/k) Σ_{i=1}^{k} f(x_i)

When to consider using nearest neighbour
Advantages: training is very fast; learns complex target functions; doesn't lose information.
Disadvantages: slow at query time; easily fooled by irrelevant attributes.
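
A minimal k-NN sketch following the description on this page; the Euclidean distance and the (x_i, f(x_i)) list format are illustrative assumptions:

    import math
    from collections import Counter

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def knn(examples, x_q, k, regression=False):
        """examples: list of (x_i, f(x_i)) pairs; x_q: the query instance."""
        neighbours = sorted(examples, key=lambda ex: euclidean(ex[0], x_q))[:k]
        values = [f_x for _, f_x in neighbours]
        if regression:                                  # real-valued f: mean of neighbours
            return sum(values) / k
        return Counter(values).most_common(1)[0][0]     # discrete f: majority vote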

12 Uses of instance-based methods in NLP
POS tagging [Daelemans et al., 1996]
Text categorisation, where kNN ranks among the best performing techniques [Yang and Chute, 1994, Sebastiani, 2002]
Speech synthesis [Daelemans and van den Bosch, 1996]
Probability estimation for statistical parsing [Hogan, 2007]
Information extraction [Zavrel and Daelemans, 2003]

Support Vector Machines
SVMs have become a very popular ML technique in recent years due to their scalability to high dimensionality and robustness to overfitting.
SVMs can be explained in geometrical terms as follows:
Decision surfaces: planes σ_1, ..., σ_n in a |T|-dimensional space which separate positive and negative training examples.
Given σ_1, σ_2, ..., find a σ_i which separates positive from negative examples by the widest possible margin.

13 An example: the 2-d case
Assume positive and negative instances are linearly separable (decision surfaces are (|T|−1)-hyperplanes):
[Figure: positive and negative points in the plane, with several separating lines and the "best" decision surface σ_i shown as the one with the widest margin]

Non-linearly-separable data
Use a kernel function to project the original feature space onto a higher-dimension space. E.g. [Russell and Norvig, 2003]:
[Figure: (a) the true decision boundary, a circle in the original 2-d input space; (b) mapping to the three-dimensional input space (x_1², x_2², √2·x_1x_2), where the data become linearly separable]
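
A sketch of the explicit quadratic feature map mentioned above; the sample points are made up for illustration:

    import math

    def quadratic_feature_map(x1, x2):
        """Map a 2-d point to the 3-d space (x1^2, x2^2, sqrt(2)*x1*x2).

        Points inside/outside a circle x1^2 + x2^2 = r^2 in the original space
        become separable by a plane in the mapped space."""
        return (x1 ** 2, x2 ** 2, math.sqrt(2) * x1 * x2)

    # e.g. a point on the unit circle and one outside it (illustrative values)
    print(quadratic_feature_map(1.0, 0.0))   # (1.0, 0.0, 0.0)
    print(quadratic_feature_map(2.0, 1.0))   # (4.0, 1.0, 2.828...)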

14 Uses of SVM in NLP
Text categorisation [Joachims, 1998], a task for which the method's ability to handle large feature sets seems to be particularly useful.
Tagging, parsing [Collins and Duffy, 2002], ...

Meta-classifiers: ensembles
Basic idea: apply different classifiers f̂_1, f̂_2, ... to the classification task and then combine their outputs appropriately.
Ensembles are characterised according to:
the kinds of classifiers f̂_i they employ: ideally, these classifiers should be as independent as possible;
the way they combine multiple classifier outputs. Examples: majority voting (for committees of binary classifiers), weighted linear combination (for probabilistic outputs), dynamic classifier selection, adaptive classifier selection, ...

15 Meta-classifiers: boosting
The basic idea: all classifiers in the ensemble are obtained via the same learning method.
Classifiers are trained sequentially, rather than independently (i.e. the training of f̂_i takes into account the performance of f̂_1, ..., f̂_{i−1}).
ADABOOST: each pair ⟨d_j, c_i⟩ is assigned an importance weight h_ij^t in the training of f̂_t, which represents how hard it is to get a correct decision for ⟨d_j, c_i⟩ for f̂_1, ..., f̂_{t−1}.

Uses of meta-classification methods in NLP
Parsing [Collins and Koo, 2005]
Word-sense disambiguation [Pedersen, 2000, Escudero et al., 2000]
Text categorisation [Sebastiani, 2002], ...

16 Case study: text categorisation
Text categorisation is a task which consists of assigning pre-defined symbolic labels (categories) to natural language texts (see Lecture 1). Two approaches:
category-pivoted categorisation: given a category c, the classifier must find all text documents d that should be filed under c;
document-pivoted categorisation: given a text document d, the classifier must find all categories c under which d should be filed.

Text categorisation data
REUTERS-21578: a commonly used corpus

  <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" ID="96">
  <DATE>26 FEB :32:37.30</DATE>
  <TOPICS><D>acq</D></TOPICS>
  <PLACES><D>usa</D></PLACES>
  <PEOPLE></PEOPLE>
  <ORGS></ORGS>
  <EXCHANGES></EXCHANGES>
  <COMPANIES></COMPANIES>
  <TEXT>
  <TITLE>INVESTMENT FIRMS CUT CYCLOPS <CYL> STAKE</TITLE>
  <DATELINE>WASHINGTON, Feb 26</DATELINE>
  <BODY>A group of affiliated New York investment firms said they lowered their
  stake in Cyclops Corp to 260,500 shares, or 6.4 pct of the total outstanding
  common stock, from 370,500 shares, or 9.2 pct. In a filing with the Securities
  and Exchange Commission, the group, led by Mutual Shares Corp, said it sold
  110,000 Cyclops common shares on Feb 17 and 19 for 10.0 mln dlrs. Reuter</BODY>
  </TEXT>
  </REUTERS>

17 TC and supervised learning: preliminaries
Assume a document-pivoted categorisation perspective.
D: set of textual documents d_i; a corpus D, and sub-corpora for training (D_t), testing (D_e) and validation (D_v).
Document labels correspond to the category set C. In the Reuters corpus, for instance, some of the elements of this set would be acq (labelling documents about company acquisitions), housing, vatican, etc.

Labelling constraints
Multi-label text categorisation (MLTC): a document may belong to any number of categories.
In MLTC, for a given document d in D, one might have subsets of C, C_m = {c_1, ..., c_k}, such that k > 1 and f̂(d, c_1) = ... = f̂(d, c_k) = 1. Example: REUTERS news articles.
If |C_m| = 1 for all d ∈ D, we have what is called a single-label classifier (SLTC).
Boolean labelling: spam filtering.

18 Category generality
The generality of a category c_i in the context of a text classification system, given a corpus Ω, is defined as the proportion of documents that belong to c_i:

    g_Ω(c_i) = |{d_j ∈ Ω s.t. f(d_j, c_i) = 1}| / |Ω|   (5)

Analogously, one can define category generality for training sets, g_Tr(c_i), validation sets, g_Tv(c_i), and test sets, g_Te(c_i).

Texts as feature vectors
The representation strategy most commonly adopted in text categorisation (and information retrieval) consists of:
selecting a set of terms T in the corpus (also known as features),
encoding each document d_j as a vector d̄_j = ⟨t_{1j}, ..., t_{|T|j}⟩,
where t_{kj} represents how much feature k contributes to the semantics of text d_j.

19 Possible implementations
Variations on the vector representation:
different ways of defining terms (features): compounds, semantic dependencies, ...
different ways of computing weights: which words, phrases etc are the most relevant? how do we quantify relevance?
One can eliminate large numbers of candidate features before any statistical processing is done.

Before indexing...
Some pre-processing of texts may help reduce dimensionality even before any indexing is done or weights computed.
One often removes stop words as the first step of text processing. These include topic-neutral types and function words (prepositions, conjunctions, articles).
Stemming is also frequently employed. It involves clustering together types that share the same morphological root. For example: {cluster, clustering, clustered, ...}

20 Different ways of defining features
Words or phrases? Should we represent Saddam Hussein as a single feature or as two distinct features? How about White House, and Text categorisation?
Richer representation schemes: the Darmstadt Indexing Approach: position of terms in d, properties of terms, category generality, etc. How about semantics?

Different ways of assigning values to features
Alternative ways of assigning values to document vectors also affect classification.
#(t_k, d_j) = number of occurrences of t_k in d_j
#_Tr(t_k) = number of documents in Tr in which t_k occurs
Three approaches:
sets of words: binary values indicate presence or absence of the feature in the document;
bags (or multi-sets) of words: values quantify the number of occurrences of a feature;
relative frequency: term-frequency scores or probabilistic weights.
Computing weights by relative frequency:

    tfidf(t_k, d_j) = #(t_k, d_j) · log( |Tr| / #_Tr(t_k) )   (6)
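
A minimal sketch of the tfidf weighting in (6) over a toy training corpus; the example documents are made up for illustration:

    import math
    from collections import Counter

    def tfidf_weights(documents):
        """documents: list of token lists (the training corpus Tr).
        Returns one {term: tfidf} dictionary per document."""
        n_docs = len(documents)
        doc_freq = Counter()                       # #_Tr(t_k)
        for doc in documents:
            doc_freq.update(set(doc))
        weights = []
        for doc in documents:
            tf = Counter(doc)                      # #(t_k, d_j)
            weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
        return weights

    corpus = [["shares", "stake", "bank"], ["bank", "river", "bank"], ["stake", "acquisition"]]
    print(tfidf_weights(corpus)[0])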

21 Dimensionality reduction
DR: a processing step whose goal is to reduce the size of the vector space from |T| to |T'| ≪ |T|. T' is called the reduced term set.
Benefits of DR:
lower computational cost for ML;
avoidance of overfitting (training on constitutive features, rather than contingent ones).
Rule of thumb: overfitting is avoided if the number of training examples is proportional to the size of T' (for TC, experiments have suggested suitable ratios of training texts per feature).

DR by feature selection vs. DR by feature extraction
DR by term selection, or Term Space Reduction (TSR): T' is a subset of T. Order T by a score r_i(t_k) that quantifies the relevance of t_k as an indicator of c_i and choose the top terms.
DR by term extraction: the terms in T' are not of the same type as the terms in T, but are obtained by combinations or transformations of the original ones. E.g.: in DR by term extraction, if the terms in T are words, the terms in T' may not be words at all.

22 Term Space Reduction
There are two ways to reduce the term space:
TSR by term wrapping: the ML algorithm itself is used to reduce term space dimensionality;
TSR by term filtering: terms are ranked according to their importance (r_i(t_k)) for the TC task and the highest-scoring ones are chosen.
TSR is measured in terms of aggressiveness, the ratio between the original and the reduced feature set: |T| / |T'|.
A detailed comparison of TSR techniques can be found in [Yang and Pedersen, 1997].

Local vs. global DR
DR can be done for each category or for the whole set of categories:
local DR: for each category c_i, a set T'_i of terms (|T'_i| ≪ |T|) is chosen for classification w.r.t. c_i. Different term sets are used for different categories.
global DR: a set T' of terms (|T'| ≪ |T|) is chosen for all categories C = {c_1, ..., c_|C|}.
Local scores can be combined into a global ranking (see the sketch below) through:
sum: r_sum(t_k) = Σ_{i=1}^{|C|} r_i(t_k),
weighted average: r_wavg(t_k) = Σ_{i=1}^{|C|} P(c_i) r_i(t_k), or
maximisation: r_max(t_k) = max_{i=1,...,|C|} r_i(t_k).
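
A sketch of the three combination schemes above; the local score table and category priors are illustrative values, not from the slides:

    def combine_local_scores(local_scores, priors):
        """local_scores: {term: {category: r_i(t_k)}}; priors: {category: P(c_i)}.
        Returns global rankings under the sum, weighted-average and max schemes."""
        r_sum, r_wavg, r_max = {}, {}, {}
        for term, per_cat in local_scores.items():
            r_sum[term] = sum(per_cat.values())
            r_wavg[term] = sum(priors[c] * s for c, s in per_cat.items())
            r_max[term] = max(per_cat.values())
        return r_sum, r_wavg, r_max

    scores = {"stake": {"acq": 0.9, "earn": 0.2}, "wheat": {"acq": 0.1, "earn": 0.05}}
    print(combine_local_scores(scores, {"acq": 0.3, "earn": 0.7}))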

23 Filtering by document frequency
The simplest TSR technique:
1. Remove stop-words, etc. (see the slide on pre-processing steps).
2. Order all features t_k in T according to the number of documents in which they occur (#_Tr(t_k)).
3. Choose T' = {t_1, ..., t_n} s.t. it contains the n highest-scoring t_k.
Advantages: low computational cost; DR up to a factor of 10 with just a small reduction in effectiveness (as reported in [Yang and Pedersen, 1997]).

Information-theoretic TSR
Assume a multi-variate Bernoulli model where features will have value 0 or 1 depending on whether they occur in a document (sample space: Ω = 2^D).
T and C are Boolean random variables for events (sets) generated by choices of terms and categories.
Notation: P(T = 1, C = 1), abbreviated as P(t, c), is the probability that term t occurs in a document classified under category c; similarly for P(T = 0, C = 1), abbreviated as P(t̄, c). P(t | c) denotes the probability that t occurs in a document of category c, etc.

24 Some TSR ranking functions
Functions commonly used in feature selection for text categorisation [Sebastiani, 2002]:

  Document frequency            #(t, c) = P(t | c)
  DIA factor                    z(t, c) = P(c | t)
  Expected mutual information   I(t, c) = Σ_{c'∈{c,c̄}} Σ_{t'∈{t,t̄}} P(t', c') log[ P(t', c') / (P(t') P(c')) ]
  Mutual information            MI(t, c) = log[ P(t, c) / (P(t) P(c)) ]
  Chi-square                    χ²(t, c) = |D_t| [P(t, c) P(t̄, c̄) − P(t, c̄) P(t̄, c)]² / [P(t) P(t̄) P(c) P(c̄)]
  NGL coefficient               NGL(t, c) = √|D_t| [P(t, c) P(t̄, c̄) − P(t, c̄) P(t̄, c)] / √[P(t) P(t̄) P(c) P(c̄)]
  Odds ratio                    OR(t, c) = P(t | c) [1 − P(t | c̄)] / ([1 − P(t | c)] P(t | c̄))
  GSS coefficient               GSS(t, c) = P(t, c) P(t̄, c̄) − P(t, c̄) P(t̄, c)

TSR by expected mutual information

  fsI(T, c, a): T'
    var: Cl, Tt, Tl : list
    for each d ∈ D
        if f(d, c) = true then append(d, Cl)
        for each t ∈ d
            put(⟨t, d⟩, Tt)
            put(⟨t, 0⟩, Tl)
    P(c) ← |Cl| / |D|
    for each t in Tt
        Dtc ← {d | d ∈ Cl and ⟨t, d⟩ ∈ Tt};   Dt̄c ← {d | d ∈ Cl and ⟨t, d⟩ ∉ Tt}
        Dtc̄ ← {d | d ∉ Cl and ⟨t, d⟩ ∈ Tt};   Dt̄c̄ ← {d | d ∉ Cl and ⟨t, d⟩ ∉ Tt}
        P(t, c) ← |Dtc| / |D|;   P(t̄, c) ← |Dt̄c| / |D|
        P(t, c̄) ← |Dtc̄| / |D|;   P(t̄, c̄) ← |Dt̄c̄| / |D|
        P(t) ← (|Dtc| + |Dtc̄|) / |D|
        remove(⟨t, 0⟩, Tl)
        i ← P(t, c) log[P(t, c) / (P(t) P(c))] + P(t̄, c) log[P(t̄, c) / (P(t̄) P(c))] +
            P(t̄, c̄) log[P(t̄, c̄) / (P(t̄) P(c̄))] + P(t, c̄) log[P(t, c̄) / (P(t) P(c̄))]
        add(⟨t, i⟩, Tl)
    sort Tl by expected mutual information scores
    return the first |Tl| / a elements of Tl

The algorithm above is simply meant to illustrate how the estimation of the probabilities works in very general terms. A practical implementation would not involve as many counting operations, and would need to take into account the need to avoid zero probabilities for cases where terms do not co-occur with categories (more about the latter in slide 54). With respect to counting, for each term-category pair it would suffice to estimate P(c), P(t) and a single joint or conditional, say P(t | c) = |Dtc| / |Cl| (or P(t | c) = (|Dtc| + 1) / (|Cl| + |Tl|), using a Laplace estimator), and derive the remaining values from it through straightforward applications of the properties of conditional probabilities:

    P(t, c) = P(t | c) P(c)   (11)
    P(t̄, c) = (1 − P(t | c)) P(c)   (12)
    P(t, c̄) = P(t) − P(t, c)   (13)
    P(t̄, c̄) = (1 − P(t)) − P(t̄, c)   (14)
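
A compact Python version of the same estimation, sketched under the assumption that documents are given as (set_of_terms, in_category) pairs; the zero-probability cells are simply skipped here rather than smoothed:

    import math

    def expected_mutual_information(docs, term):
        """docs: list of (terms, in_c) pairs, where terms is a set and in_c a bool.
        Returns I(term, c) as in the table above (0 log 0 treated as 0)."""
        n = len(docs)
        n_tc = sum(1 for terms, in_c in docs if term in terms and in_c)
        n_tnc = sum(1 for terms, in_c in docs if term in terms and not in_c)
        n_ntc = sum(1 for terms, in_c in docs if term not in terms and in_c)
        n_ntnc = n - n_tc - n_tnc - n_ntc
        p_t, p_c = (n_tc + n_tnc) / n, (n_tc + n_ntc) / n
        cells = [(n_tc / n, p_t * p_c), (n_tnc / n, p_t * (1 - p_c)),
                 (n_ntc / n, (1 - p_t) * p_c), (n_ntnc / n, (1 - p_t) * (1 - p_c))]
        return sum(p * math.log(p / q) for p, q in cells if p > 0 and q > 0)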

25 Sample ranking
Top-ranked words for the REUTERS category acq according to expected mutual information:
stake, merger, acquisition, vs, shares, buy, acquire, qtr, cts, usair, shareholders, buys

Naive Bayes text categorisation
Categorisation status value (CSV) function: a soft classification function for each category c_i ∈ C: f̂_i : D → R.
A hard classifier, f̂_i^h : D → {0,1}, can then be defined as follows:

    f̂_i^h(d) = 1 if f̂_i(d) ≥ τ_i, 0 otherwise   (15)

Thresholds can be determined analytically or experimentally.

26 Experimental thresholds
CSV thresholding or SCut: SCut stands for optimal thresholding on the confidence scores of category candidates: vary τ_i on D_v and choose the value that maximises effectiveness.
Proportional thresholding: choose τ_i s.t. the generality measure g_Tr(c_i) is closest to g_Tv(c_i).
RCut or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.
See [Yang, 2001] for a recent survey of thresholding strategies.

CSV for multi-variate Bernoulli models
Starting from the independence assumption and Bayes rule,

    P(d̄ | c) = ∏_{k=1}^{|T|} P(t_k | c)        P(c | d̄_j) = P(c) P(d̄_j | c) / P(d̄_j)

derive a monotonically increasing function of P(c | d̄):

    f̂(d, c) = Σ_{i=1}^{|T|} t_i log( P(t_i | c) [1 − P(t_i | c̄)] / ( P(t_i | c̄) [1 − P(t_i | c)] ) )   (16)

We need to estimate 2|T|, rather than 2^|T|, parameters.
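
A minimal sketch of the log-odds CSV in (16), assuming the conditional probabilities have already been estimated (and smoothed away from 0 and 1); the argument names are illustrative:

    import math

    def bernoulli_csv(doc_terms, vocabulary, p_t_given_c, p_t_given_not_c):
        """doc_terms: set of terms in the document; vocabulary: the term set T.
        p_t_given_c / p_t_given_not_c: dicts mapping each term to P(t|c) and P(t|c̄)."""
        csv = 0.0
        for t in vocabulary:
            t_i = 1 if t in doc_terms else 0          # binary feature value
            ratio = (p_t_given_c[t] * (1 - p_t_given_not_c[t])) / \
                    (p_t_given_not_c[t] * (1 - p_t_given_c[t]))
            csv += t_i * math.log(ratio)
        return csv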

27 Alternative: multinomial model
An alternative implementation of the Naïve Bayes classifier is described in [Mitchell, 1997]. In this approach, words appear as values rather than names of attributes.
A document representation for this slide would look like this: d = ⟨a_1 = an, a_2 = alternative, a_3 = implementation, ...⟩
Problem: each attribute's value would range over the entire vocabulary. Many values would be missing for a typical document.

Dealing with missing values
What if none of the training instances with target value v_j have attribute value a_i? Then P̂(a_i | v_j) = 0, and...

    P̂(v_j) ∏_i P̂(a_i | v_j) = 0

The typical solution is a Bayesian estimate for P̂(a_i | v_j):

    P̂(a_i | v_j) ← (n_c + m·p) / (n + m)

where n is the number of training examples for which v = v_j, n_c the number of examples for which v = v_j and a = a_i, p is a prior estimate for P̂(a_i | v_j), and m is the weight given to the prior (i.e. the number of virtual examples).

28 Learning in multinomial models

  Algorithm 1: NB probability estimation
  NB_Learn(D_t, C)
    /* collect all tokens that occur in D_t */
    T ← all distinct words and other tokens in D_t
    /* calculate P(c_j) and P(t_k | c_j) */
    for each target value c_j in C do
        D_t^j ← subset of D_t for which the target value is c_j
        P(c_j) ← |D_t^j| / |D_t|
        Text_j ← concatenation of all texts in D_t^j
        n ← total number of tokens in Text_j
        for each word t_k in T do
            n_k ← number of times word t_k occurs in Text_j
            P(t_k | c_j) ← (n_k + 1) / (n + |T|)
        done
    done

Sample classification algorithm
Could calculate posterior probabilities for soft classification,

    f̂(d) = P(c) ∏_{k=1}^{n} P(t_k | c)

and use thresholding as before. Or, for SLTC, implement hard categorisation directly:

  Algorithm 2: MNB categorisation status function
  CSV_MNB(d : D)
    positions ← all word positions in d that contain tokens found in T
    return c_nb, where
        c_nb = argmax_{c_i ∈ C} P(c_i) ∏_{k ∈ positions} P(t_k | c_i)
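
A minimal runnable sketch of Algorithms 1 and 2 in Python; the data format (a list of (tokens, label) pairs) is an assumption for illustration, and log probabilities are used to avoid underflow:

    import math
    from collections import Counter, defaultdict

    def nb_learn(D_t):
        """D_t: list of (tokens, c_j) pairs. Returns priors, conditionals and T."""
        vocab = {t for tokens, _ in D_t for t in tokens}
        priors, cond = {}, defaultdict(dict)
        for c in {label for _, label in D_t}:
            docs_c = [tokens for tokens, label in D_t if label == c]
            priors[c] = len(docs_c) / len(D_t)
            counts = Counter(t for tokens in docs_c for t in tokens)
            n = sum(counts.values())
            for t in vocab:                            # Laplace-smoothed P(t_k | c_j)
                cond[c][t] = (counts[t] + 1) / (n + len(vocab))
        return priors, cond, vocab

    def csv_mnb(d, priors, cond, vocab):
        """Hard classification: argmax over log P(c) + sum of log P(t_k | c)."""
        positions = [t for t in d if t in vocab]
        return max(priors, key=lambda c: math.log(priors[c]) +
                   sum(math.log(cond[c][t]) for t in positions))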

29 Naive but subtle
The conditional independence assumption,

    P(a_1, a_2, ..., a_n | v_j) = ∏_i P(a_i | v_j)

is clearly false... but NB works well anyway. Why?
The posteriors P̂(v_j | x) don't need to be correct; we need only that

    argmax_{v_j ∈ V} P̂(v_j) ∏_i P̂(a_i | v_j) = argmax_{v_j ∈ V} P(v_j) P(a_1, ..., a_n | v_j)

Naive Bayes posteriors are often unrealistically close to 1 or 0; see [Domingos and Pazzani, 1996] for an analysis.

Other probabilistic classifiers
One could also represent data as real-valued vectors (e.g. normalised TFxIDF) and assume an underlying normal distribution to estimate the probabilities.
Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
adopting weighted document vectors, rather than binary-valued ones;
introducing document length normalisation, in order to correct distortions in CSV_i introduced by long documents;
relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness).

30 Decision tree classifiers and learners
A decision tree is a tree with:
internal nodes labelled by terms;
edges labelled by tests (on the weight the term from which they depart has in the document);
leaves labelled by categories.
Given a decision tree, categorisation of a document d_j is done by recursively testing the values of the internal nodes of the tree against those in d_j until a leaf is reached.
Simplest case: d_j consists of binary values.

Example: a (binary) decision tree
[Figure: binary decision tree for the Reuters category WHEAT, testing the terms wheat, farm, commodity, bushels, export, agriculture, tonnes, winter and soft]
if (wheat ∧ farm) or
   (wheat ∧ commodity) or
   (bushels ∧ export) or
   (wheat ∧ tonnes) or
   (wheat ∧ agriculture) or
   (wheat ∧ winter ∧ ¬soft)
then WHEAT else ¬WHEAT

31 A learning algorithm

  Algorithm 3: Decision tree learning
  DTreeLearn(D_t : 2^D, T : 2^T, default : C) : tree
    if isempty(D_t) then
        return default
    else if ∏_{d_i ∈ D_t} f(d_i, c_j) = 1 then   /* all d_i have class c_j */
        return c_j
    else if isempty(T) then
        return MajorityCateg(D_t)
    else
        t_best ← ChooseFeature(T, D_t)
        tree ← new dtree with root = t_best
        for each value v_k of t_best do
            D_t^k ← {d_l ∈ D_t | t_best has value v_k in d_l}
            sbt ← DTreeLearn(D_t^k, T \ {t_best}, MajorityCateg(D_t))
            add a branch to tree with label v_k and subtree sbt
        done
        return tree

How do we implement ChooseFeature?
Finding the right feature to partition the training set is essential.
Entropy (the H(·) function in (17)), AKA self-information, measures the amount of uncertainty w.r.t. a probability distribution. In other words, entropy is a measure of how much we learn when we observe an event occurring in accordance with this distribution.
One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by candidate features t_k:

    H(D) = −(|D_c| / |D|) log(|D_c| / |D|) − (|D_c̄| / |D|) log(|D_c̄| / |D|)   (17)

where |D_c| (|D_c̄|) is the number of positive (negative) instances filed under category c in D.

    IG(T, D) = H(D) − [ (|D_t| / |D|) H(D_t) + (|D_t̄| / |D|) H(D_t̄) ]   (18)

where D_t and D_t̄ are the subsets of D containing instances for which T has value 1 and 0, respectively.
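
A minimal sketch of (17) and (18) for binary features and a binary category, assuming instances are given as (features_dict, is_positive) pairs:

    import math

    def entropy(instances):
        """H(D) for a binary category: instances is a list of (features, is_pos)."""
        n = len(instances)
        if n == 0:
            return 0.0
        p = sum(1 for _, is_pos in instances if is_pos) / n
        return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

    def information_gain(instances, feature):
        """IG(T, D): entropy of D minus the weighted entropy of its split on `feature`."""
        d_t = [inst for inst in instances if inst[0].get(feature, 0) == 1]
        d_not_t = [inst for inst in instances if inst[0].get(feature, 0) == 0]
        n = len(instances)
        return entropy(instances) - (len(d_t) / n * entropy(d_t) +
                                     len(d_not_t) / n * entropy(d_not_t))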

32 Important issues
Choosing the right feature (from T) to partition the training set: choose the feature with the highest information gain.
Avoiding overfitting: memorising all observations from Tr vs. extracting patterns that extrapolate to unseen examples in Tv and Te.
Occam's razor: the most likely hypothesis is the simplest one which is consistent with all observations.
Pruning: remove over-specific branches.

A DT for the earnings category
[Figure: decision tree for the Reuters "earnings" category [Manning and Schütze, 1999]. Internal nodes test the weights of the terms cts, net and vs (e.g. cts < 2 vs. cts ≥ 2, net < 1 vs. net ≥ 1, vs < 2 vs. vs ≥ 2); each node n_i shows its number of documents and the probability P(c | n_i) that a document at that node belongs to the category c = "earnings"; an accompanying plot shows the corresponding decision boundaries in the cts/net plane.]

33 Calculating node probabilities
One can assign probabilities to a leaf node (i.e. the probability that a new document d belonging to that node should be filed under category c) as follows (using add-one smoothing):

    P(c | d_n) = (|D_cn| + 1) / (|D_cn| + |D_c̄n| + 2)   (19)

where P(c | d_n) is the probability that a document d_n which ended up in node n belongs to category c, and |D_cn| (|D_c̄n|) is the number of (training) documents in node n which have been assigned category c (c̄).

Text classifier evaluation
Evaluation of TC systems is usually done experimentally rather than analytically. Analytical evaluation is difficult due to the subjective nature of the task.
Experimental evaluation aims at measuring classifier effectiveness, that is, its ability to make correct classification decisions for the largest possible number of documents.
We have already seen two measures used in experimental evaluation: precision and recall. Today we will characterise these measures more precisely and see other ways of evaluating TC.

34 Precision and recall
Precision (π) with respect to category c_i may be defined as the following conditional probability:

    π = P( f(d_x, c_i) = T | f̂(d_x, c_i) = T )   (20)

the probability that if a document d_x has been classified as c_i, this decision is correct.
Analogously, recall (ρ) may be defined as follows:

    ρ = P( f̂(d_x, c_i) = T | f(d_x, c_i) = T )   (21)

the probability that if a random document is meant to be filed under c_i, it will be classified as such.

Calculating precision and recall
[Figure: Venn diagram over the set of all texts, showing the target set (f(d_x, c_i) = T), the selected set (f̂(d_x, c_i) = T), and the regions TP_i, FP_i, FN_i and TN_i]

    π_i = TP_i / (TP_i + FP_i)        ρ_i = TP_i / (TP_i + FN_i)

35 Combining local into global measures
Local estimates of the probabilities in (20) and (21) may be combined to yield estimates for the classifier as a whole. The contingency table below summarises precision and recall over a category set C = {c_1, c_2, ...}:

                                    Expert judgement
                                    YES                          NO
  TC system     YES    TP = Σ_{i=1}^{|C|} TP_i      FP = Σ_{i=1}^{|C|} FP_i
  judgement     NO     FN = Σ_{i=1}^{|C|} FN_i      TN = Σ_{i=1}^{|C|} TN_i

Effectiveness averaging
Two different methods may be used to calculate global values for π and ρ:
Microaveraging:

    π^µ = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FP_i)   (22)

    ρ^µ = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FN_i)   (23)

36 Macroaveraging
Precision macroaveraging is calculated as follows:

    π^M = Σ_{i=1}^{|C|} π_i / |C|   (24)

Recall macroaveraging is calculated as follows:

    ρ^M = Σ_{i=1}^{|C|} ρ_i / |C|   (25)

where π_i and ρ_i are local scores.

An example
Judgement: E(xpert) and S(ystem), for the categories Sport, Politics and World:

                              Sport     Politics   World
                              E   S     E   S      E   S
  Brazil beat Venezuela       T   F     F   F      F   T
  US defeated Afghanistan     F   T     T   T      T   F
  Elections in Wicklow        F   F     T   T      F   F
  Elections in Peru           F   F     F   T      T   T

  Precision (local):          π = 0     π = 0.67   π = 0.5
  Recall (local):             ρ = 0     ρ = 1      ρ = 0.5

  π^µ = (0 + 2 + 1) / (1 + 3 + 2) = 0.5      π^M = (0 + 0.67 + 0.5) / 3 ≈ 0.39
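
A small sketch that recomputes the micro- and macro-averaged scores for the example above from the per-category contingency counts:

    def micro_macro(counts):
        """counts: list of (TP_i, FP_i, FN_i) tuples, one per category."""
        tp = sum(c[0] for c in counts)
        fp = sum(c[1] for c in counts)
        fn = sum(c[2] for c in counts)
        micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
        locals_p = [t / (t + f) if t + f else 0.0 for t, f, _ in counts]
        locals_r = [t / (t + f) if t + f else 0.0 for t, _, f in counts]
        macro_p = sum(locals_p) / len(counts)
        macro_r = sum(locals_r) / len(counts)
        return micro_p, micro_r, macro_p, macro_r

    # Sport, Politics, World from the example: (TP, FP, FN)
    print(micro_macro([(0, 1, 1), (2, 1, 0), (1, 1, 1)]))
    # (0.5, 0.6, 0.3888..., 0.5)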

37 Other measures
Since one knows (by experimentation) which documents fall into TP, FP, TN and FN, one may also estimate accuracy (A) and error (E):

    A = (TP + TN) / (TP + TN + FP + FN)   (26)

    E = (FP + FN) / (TP + TN + FP + FN) = 1 − A   (27)

These measures, however, are not widely used in TC due to the fact that they are less sensitive to variations in the number of correct decisions than π and ρ.

Fallout and ROC curves
A less frequently used measure is fallout:

    Fallout_i = FP_i / (FP_i + TN_i)   (28)

Fallout measures the proportion of non-targeted items that were mistakenly selected. In certain fields recall-fallout trade-offs are more common than precision-recall ones.
The receiver operating characteristic, or ROC curve, shows how different levels of fallout influence recall or sensitivity.

38 Alternatives to effectiveness
Efficiency is often used as an additional criterion in TC evaluation. It may be measured with respect to training or classification.
The utility criterion, from decision theory, is sometimes used. An obvious example of the application of utility measures is spam filtering, where failing to discard spam is less serious than discarding a legitimate message.

Combining precision and recall
Neither π nor ρ makes much sense in isolation: classifiers can be tuned to maximise one at the expense of the other.
TC evaluation is done in terms of measures that combine π and ρ. We will examine two such measures: the breakeven point and the F functions.

39 Breakeven point
The breakeven point is the value at which π equals ρ, as determined by the following process:
A plot of π as a function of ρ is computed by varying the threshold τ_i for the CSV function from 1 to 0 (with the threshold set to 1, only those documents that the classifier is totally sure belong to the category will be selected, so π will tend to 1 and ρ to 0; as we decrease τ_i, precision will decrease, but ρ will increase).
The breakeven point is the value (of ρ or π) at which the plot intersects the ρ = π line.

The F functions
The idea behind the F measures [van Rijsbergen, 1979, ch. 7] is to assign a degree of importance to ρ and π. Let β be a factor (0 ≤ β ≤ ∞) quantifying such a degree of importance. The F_β function can be computed as follows:

    F_β = (β² + 1) π ρ / (β² π + ρ)   (29)

A β value of 1 assigns equal importance to precision and recall.
The breakeven of a classifier is always less than or equal to its F_β for β = 1.
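
A tiny sketch of (29), together with its β = 1 special case; the example values reuse the Politics category scores from the earlier example:

    def f_beta(precision, recall, beta=1.0):
        """F_beta as in equation (29); beta = 1 weighs precision and recall equally."""
        if precision == 0 and recall == 0:
            return 0.0
        return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

    print(f_beta(0.67, 1.0))         # F_1 for the Politics category above, ~0.80
    print(f_beta(0.5, 0.5, beta=2))  # a recall-weighted F_2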

40 Comparison of existing TC systems
From [Sebastiani, 2002]; see also [Yang and Liu, 1999].

  Type             Systems
  non-learning     WORD
  probabilistic    PropBayes, Bim, Nb
  decision tree    C4.5, Ind
  decision rules   Swap-1, Ripper, etc.
  regression       LISF
  online linear    BWinnow
  batch linear     Rocchio
  neural nets      Classi
  example based    k-NN, Gis-W
  SVM              SVMLight
  ensemble         AdaBoost

A final example: WSD
Consider the following occurrences of the word "bank":

  RIV  y be? Then he ran down along the bank, toward a narrow, muddy path.
  FIN  four bundles of small notes the bank cashier got it into his head
  RIV  ross the bridge and on the other bank you only hear the stream, the
  RIV  beneath the house, where a steep bank of earth is compacted between
  FIN  op but is really the branch of a bank. As I set foot inside, despite
  FIN  raffic police also belong to the bank. More foolhardy than entering
  FIN  require a number. If you open a bank account, the teller identifies
  RIV  circular movement, skirting the bank of the River Jordan, then turn

The WSD learning task is to learn to distinguish between the meanings financial institution (FIN) and the land alongside a river (RIV).

41 Task, data representation, performance measures
WSD can be described as a categorisation task where:
senses (FIN, RIV) are labels (C);
the representation of instances (D) comes from the context surrounding the words to be disambiguated. E.g., for T = {along, cashier, stream, muddy, ...}, we could have d_1 = ⟨along = 1, cashier = 0, stream = 0, muddy = 1, ...⟩ and f(d_1) = RIV.
Performance can be measured as in text categorisation.

A decision tree
Using the algorithm above (slide 61) we get this decision tree, trained on a small training set with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}:
[Figure: decision tree for the bank WSD task, testing the presence (1) or absence (0) of the context words on, river, when, money and from, with leaves labelled RIV and FIN]
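
A minimal sketch of building the binary context-word representation described above; the term set and sentence come from the example, while the naive tokenisation is an assumption for illustration:

    def context_features(sentence, term_set):
        """Binary bag-of-context-words vector for a WSD instance."""
        tokens = set(sentence.lower().replace(",", " ").replace(".", " ").split())
        return {t: int(t in tokens) for t in term_set}

    T = ["along", "cashier", "stream", "muddy"]
    d_1 = context_features(
        "Then he ran down along the bank, toward a narrow, muddy path.", T)
    print(d_1)   # {'along': 1, 'cashier': 0, 'stream': 0, 'muddy': 1}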

42 Accuracy, k-nn This simple decision tree obtains F 1 scores of for RIV and for FIN. A 3-NN classifier for the same data using numeric vectors with values corresponding to the distance of the various features to the keyword (negative integers for words to the left of the keyword, positive integers for words to the right of the keyword) obtained F 1 scores of 0.88 and 0.91, respectively Saturnino Luz: ESSLLI 07 Dublin Ireland M. Collins and N. Duffy. Convolution kernels for natural language. Advances in Neural Information Processing Systems, 14: , M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1): 25 69, W. Daelemans, J. Zavrel, P. Berck, and S. Gillis. MBT: A memory-based part of speech tagger generator. In Proc. of Fourth Workshop on Very Large Corpora, pages 14 27, Walter Daelemans and Antal van den Bosch. Language-independent data-oriented grapheme to phoneme conversion. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, editors, Progress in Speech Synthesis. Springer, Pedro Domingos and Michael J. Pazzani. Beyond independence: Conditions for the optimality of the simple bayesian classifier. In International Conference on Machine Learning, pages , URL citeseer.nj.nec.com/domingos96beyond.html. References Gerard Escudero, Lluís Màrquez, and German Rigau. Boosting applied to word sense disambiguation. In Ramon López De Mántaras and Enric Plaza, editors, Proceedings of ECML-00, 11th European Conference on Machine Learning, pages , Barcelona, Springer Verlag. William Gale, Kenneth Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26: , M. Haruno, S. Shirai, and Y. Ooyama. Using decision trees to construct a practical parser. Machine Learning, 34(1): , Deirdre Hogan. Coordinate noun phrase disambiguation in a generative parsing model. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages , Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pages , Chemnitz, George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In Besnard, Philippe and Steve Hanks, editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI 95), pages , San Francisco, CA, USA, August Morgan Kaufmann Publishers. D.D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81 93, Saturnino Luz: ESSLLI 07 Dublin Ireland David M. Magerman. Statistical decision-tree models for parsing. In Meeting of the Association for Computational Linguistics, pages , URL citeseer.ist.psu.edu/magerman95statistical.html. 85 Saturnino Luz: ESSLLI 07 Dublin Ireland

Symbolic methods in TC: Decision Trees

Symbolic methods in TC: Decision Trees Symbolic methods in TC: Decision Trees ML for NLP Lecturer: Kevin Koidl Assist. Lecturer Alfredo Maldonado https://www.cs.tcd.ie/kevin.koidl/cs0/ kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie 01-017 A symbolic

More information

Symbolic methods in TC: Decision Trees

Symbolic methods in TC: Decision Trees Symbolic methods in TC: Decision Trees ML for NLP Lecturer: Kevin Koidl Assist. Lecturer Alfredo Maldonado https://www.cs.tcd.ie/kevin.koidl/cs4062/ kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie 2016-2017 2

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Dimensionality reduction

Dimensionality reduction Dimensionality reduction ML for NLP Lecturer: Kevin Koidl Assist. Lecturer Alfredo Maldonado https://www.cs.tcd.ie/kevin.koidl/cs4062/ kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie 2017 Recapitulating: Evaluating

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

More information

Machine Learning for NLP: Unsupervised learning techniques Dept. of Computer Science, Trinity College Dublin, Ireland ESSLLI 07 Dublin Ireland

Machine Learning for NLP: Unsupervised learning techniques Dept. of Computer Science, Trinity College Dublin, Ireland ESSLLI 07 Dublin Ireland Machine Learning for NLP: Unsupervised learning techniques Saturnino Luz Dept. of Computer Science, Trinity College Dublin, Ireland ESSLLI 07 Dublin Ireland Supervised vs. unsupervised learning So far

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology SE lecture revision 2013 Outline 1. Bayesian classification

More information

Data classification (II)

Data classification (II) Lecture 4: Data classification (II) Data Mining - Lecture 4 (2016) 1 Outline Decision trees Choice of the splitting attribute ID3 C4.5 Classification rules Covering algorithms Naïve Bayes Classification

More information

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

More information

Chapter 6: Classification

Chapter 6: Classification Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant

More information

COMP 328: Machine Learning

COMP 328: Machine Learning COMP 328: Machine Learning Lecture 2: Naive Bayes Classifiers Nevin L. Zhang Department of Computer Science and Engineering The Hong Kong University of Science and Technology Spring 2010 Nevin L. Zhang

More information

Algorithms for Classification: The Basic Methods

Algorithms for Classification: The Basic Methods Algorithms for Classification: The Basic Methods Outline Simplicity first: 1R Naïve Bayes 2 Classification Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology Decision trees Special Course in Computer and Information Science II Adam Gyenge Helsinki University of Technology 6.2.2008 Introduction Outline: Definition of decision trees ID3 Pruning methods Bibliography:

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification

More information

Generative Models. CS4780/5780 Machine Learning Fall Thorsten Joachims Cornell University

Generative Models. CS4780/5780 Machine Learning Fall Thorsten Joachims Cornell University Generative Models CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell, Chapter 6.9-6.10 Duda, Hart & Stork, Pages 20-39 Bayes decision rule Bayes theorem Generative

More information

Stephen Scott.

Stephen Scott. 1 / 28 ian ian Optimal (Adapted from Ethem Alpaydin and Tom Mitchell) Naïve Nets sscott@cse.unl.edu 2 / 28 ian Optimal Naïve Nets Might have reasons (domain information) to favor some hypotheses/predictions

More information

Classification: Rule Induction Information Retrieval and Data Mining. Prof. Matteo Matteucci

Classification: Rule Induction Information Retrieval and Data Mining. Prof. Matteo Matteucci Classification: Rule Induction Information Retrieval and Data Mining Prof. Matteo Matteucci What is Rule Induction? The Weather Dataset 3 Outlook Temp Humidity Windy Play Sunny Hot High False No Sunny

More information

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4] 1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured

More information

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas

CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Naïve Nets (Adapted from Ethem

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision

More information

Machine Learning for natural language processing

Machine Learning for natural language processing Machine Learning for natural language processing Classification: Naive Bayes Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 20 Introduction Classification = supervised method for

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

Bayesian Learning Features of Bayesian learning methods:

Bayesian Learning Features of Bayesian learning methods: Bayesian Learning Features of Bayesian learning methods: Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more

More information

Decision Trees. Gavin Brown

Decision Trees. Gavin Brown Decision Trees Gavin Brown Every Learning Method has Limitations Linear model? KNN? SVM? Explain your decisions Sometimes we need interpretable results from our techniques. How do you explain the above

More information

Generative Models for Classification

Generative Models for Classification Generative Models for Classification CS4780/5780 Machine Learning Fall 2014 Thorsten Joachims Cornell University Reading: Mitchell, Chapter 6.9-6.10 Duda, Hart & Stork, Pages 20-39 Generative vs. Discriminative

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes Optimal

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

Classification II: Decision Trees and SVMs

Classification II: Decision Trees and SVMs Classification II: Decision Trees and SVMs Digging into Data: Jordan Boyd-Graber February 25, 2013 Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah Digging into Data: Jordan Boyd-Graber ()

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

day month year documentname/initials 1

day month year documentname/initials 1 ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

More information

Decision Trees. Tirgul 5

Decision Trees. Tirgul 5 Decision Trees Tirgul 5 Using Decision Trees It could be difficult to decide which pet is right for you. We ll find a nice algorithm to help us decide what to choose without having to think about it. 2

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Decision Support. Dr. Johan Hagelbäck.

Decision Support. Dr. Johan Hagelbäck. Decision Support Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Decision Support One of the earliest AI problems was decision support The first solution to this problem was expert systems

More information

Introduction. Decision Tree Learning. Outline. Decision Tree 9/7/2017. Decision Tree Definition

Introduction. Decision Tree Learning. Outline. Decision Tree 9/7/2017. Decision Tree Definition Introduction Decision Tree Learning Practical methods for inductive inference Approximating discrete-valued functions Robust to noisy data and capable of learning disjunctive expression ID3 earch a completely

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

naive bayes document classification

naive bayes document classification naive bayes document classification October 31, 2018 naive bayes document classification 1 / 50 Overview 1 Text classification 2 Naive Bayes 3 NB theory 4 Evaluation of TC naive bayes document classification

More information

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order MACHINE LEARNING Definition 1: Learning is constructing or modifying representations of what is being experienced [Michalski 1986], p. 10 Definition 2: Learning denotes changes in the system That are adaptive

More information

Machine Learning 2nd Edi7on

Machine Learning 2nd Edi7on Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edi7on CHAPTER 9: Decision Trees ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins alpaydin@boun.edu.tr h1p://www.cmpe.boun.edu.tr/~ethem/i2ml2e

More information

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

More information

EECS 349:Machine Learning Bryan Pardo

EECS 349:Machine Learning Bryan Pardo EECS 349:Machine Learning Bryan Pardo Topic 2: Decision Trees (Includes content provided by: Russel & Norvig, D. Downie, P. Domingos) 1 General Learning Task There is a set of possible examples Each example

More information

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning

Question of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis

More information

Classification Algorithms

Classification Algorithms Classification Algorithms UCSB 290N, 2015. T. Yang Slides based on R. Mooney UT Austin 1 Table of Content roblem Definition Rocchio K-nearest neighbor case based Bayesian algorithm Decision trees 2 Given:

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Naïve Bayes, Maxent and Neural Models

Naïve Bayes, Maxent and Neural Models Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

Decision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1

Decision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Decision Trees Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, 2018 Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Roadmap Classification: machines labeling data for us Last

More information

DATA MINING LECTURE 10

DATA MINING LECTURE 10 DATA MINING LECTURE 10 Classification Nearest Neighbor Classification Support Vector Machines Logistic Regression Naïve Bayes Classifier Supervised Learning 10 10 Illustrating Classification Task Tid Attrib1

More information

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University Decision Tree Learning Mitchell, Chapter 3 CptS 570 Machine Learning School of EECS Washington State University Outline Decision tree representation ID3 learning algorithm Entropy and information gain

More information

Decision Trees. Danushka Bollegala

Decision Trees. Danushka Bollegala Decision Trees Danushka Bollegala Rule-based Classifiers In rule-based learning, the idea is to learn a rule from train data in the form IF X THEN Y (or a combination of nested conditions) that explains

More information

CLASSIFICATION NAIVE BAYES. NIKOLA MILIKIĆ UROŠ KRČADINAC

CLASSIFICATION NAIVE BAYES. NIKOLA MILIKIĆ UROŠ KRČADINAC CLASSIFICATION NAIVE BAYES NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com WHAT IS CLASSIFICATION? A supervised learning task of determining the class of an instance; it is

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step

More information

Rule Generation using Decision Trees

Rule Generation using Decision Trees Rule Generation using Decision Trees Dr. Rajni Jain 1. Introduction A DT is a classification scheme which generates a tree and a set of rules, representing the model of different classes, from a given

More information

Classification and Prediction

Classification and Prediction Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Day 6: Classification and Machine Learning

Day 6: Classification and Machine Learning Day 6: Classification and Machine Learning Kenneth Benoit Essex Summer School 2014 July 30, 2013 Today s Road Map The Naive Bayes Classifier The k-nearest Neighbour Classifier Support Vector Machines (SVMs)

More information

Text Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)

Text Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others) Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Numerical Learning Algorithms

Numerical Learning Algorithms Numerical Learning Algorithms Example SVM for Separable Examples.......................... Example SVM for Nonseparable Examples....................... 4 Example Gaussian Kernel SVM...............................

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Decision Trees. Each internal node : an attribute Branch: Outcome of the test Leaf node or terminal node: class label.

Decision Trees. Each internal node : an attribute Branch: Outcome of the test Leaf node or terminal node: class label. Decision Trees Supervised approach Used for Classification (Categorical values) or regression (continuous values). The learning of decision trees is from class-labeled training tuples. Flowchart like structure.

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Bayesian Classification. Bayesian Classification: Why?

Bayesian Classification. Bayesian Classification: Why? Bayesian Classification http://css.engineering.uiowa.edu/~comp/ Bayesian Classification: Why? Probabilistic learning: Computation of explicit probabilities for hypothesis, among the most practical approaches

More information

Tutorial 6. By:Aashmeet Kalra

Tutorial 6. By:Aashmeet Kalra Tutorial 6 By:Aashmeet Kalra AGENDA Candidate Elimination Algorithm Example Demo of Candidate Elimination Algorithm Decision Trees Example Demo of Decision Trees Concept and Concept Learning A Concept

More information

Outline. Training Examples for EnjoySport. 2 lecture slides for textbook Machine Learning, c Tom M. Mitchell, McGraw Hill, 1997

Outline. Training Examples for EnjoySport. 2 lecture slides for textbook Machine Learning, c Tom M. Mitchell, McGraw Hill, 1997 Outline Training Examples for EnjoySport Learning from examples General-to-specific ordering over hypotheses [read Chapter 2] [suggested exercises 2.2, 2.3, 2.4, 2.6] Version spaces and candidate elimination

More information

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite

More information

Decision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore

Decision Trees. CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore Decision Trees Claude Monet, The Mulberry Tree Slides from Pedro Domingos, CSC411/2515: Machine Learning and Data Mining, Winter 2018 Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore Michael Guerzhoy

More information

Machine Learning Alternatives to Manual Knowledge Acquisition

Machine Learning Alternatives to Manual Knowledge Acquisition Machine Learning Alternatives to Manual Knowledge Acquisition Interactive programs which elicit knowledge from the expert during the course of a conversation at the terminal. Programs which learn by scanning

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Machine Learning & Data Mining

Machine Learning & Data Mining Group M L D Machine Learning M & Data Mining Chapter 7 Decision Trees Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University Top 10 Algorithm in DM #1: C4.5 #2: K-Means #3: SVM

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Tools of AI. Marcin Sydow. Summary. Machine Learning

Tools of AI. Marcin Sydow. Summary. Machine Learning Machine Learning Outline of this Lecture Motivation for Data Mining and Machine Learning Idea of Machine Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication

More information

Applied Logic. Lecture 4 part 2 Bayesian inductive reasoning. Marcin Szczuka. Institute of Informatics, The University of Warsaw

Applied Logic. Lecture 4 part 2 Bayesian inductive reasoning. Marcin Szczuka. Institute of Informatics, The University of Warsaw Applied Logic Lecture 4 part 2 Bayesian inductive reasoning Marcin Szczuka Institute of Informatics, The University of Warsaw Monographic lecture, Spring semester 2017/2018 Marcin Szczuka (MIMUW) Applied

More information

Data Mining Part 4. Prediction

Data Mining Part 4. Prediction Data Mining Part 4. Prediction 4.3. Fall 2009 Instructor: Dr. Masoud Yaghini Outline Introduction Bayes Theorem Naïve References Introduction Bayesian classifiers A statistical classifiers Introduction

More information