Empirical Approaches to Multilingual Lexical Acquisition. Lecturer: Timothy Baldwin


Lecture 2: Introduction to Machine Learning

Machine Learning (ML). Hypothesis: pre-existing data repositories contain a lot of potentially important information. Mission of ML: find it. Definition of ML: automatic extraction of valid, novel, useful and comprehensible knowledge (rules, regularities, patterns, constraints, models) from arbitrary sets of data.

Underlying Motivation. We are drowning in data, but starving for knowledge! Data = raw information. Knowledge = the set of patterns or models behind the data.

ML Example: the Cool/Cute Classifier. According to my 2 y.o. son:

Entity                Class   Entity       Class
self                  cute    sports car   cool
self as baby          ???     tiger        cool
big brother (4 y.o.)  cool    Hello Kitty  cute
big sister (6 y.o.)   cute    spoon        ???
Mummy                 cute    water        ???

What would we predict the class for the following to be: train, koala, book on ML?

Yeah yeah, but what's in it for me? Scenario 1: You are a supermarket manager, wishing to boost sales without increasing expenditure. Solution: Strategically position products to entice consumers to spend more: beer next to chips? beer next to bathroom cleaner? ASSOCIATION RULES


Scenario 2: You are an archaeologist in charge of classifying a mountain of fossilised bones, and want to quickly identify any "finds of the century" before sending the bones off to a museum. Solution: Identify bones which are of different size/dimensions/characteristics to others in the sample and/or pre-identified bones. CLUSTERING, OUTLIER DETECTION


Scenario 3: You are an archaeologist in charge of classifying a mountain of fossilised bones, and want to come up with a consistent way of determining the species and type of each bone which doesn't require specialist skills. Solution: Identify some easily measurable properties of bones (size, shape, number of lumps, ...) and compare any new bones to a pre-classified DB of bones. SUPERVISED CLASSIFICATION


Scenario 4: You are in charge of developing the next release of Coca Cola, and want to be able to estimate how well received a given recipe will be. Solution: Carry out taste tests over various recipes with varying proportions of sugar, caramel, caffeine, phosphoric acid, coca leaf extract, ... (and any number of secret new ingredients), and estimate the function which predicts customer satisfaction from these numbers. REGRESSION


Machine Learning vs. Data Mining. Machine learning tends to: be more concerned with theory than applications; largely ignore questions of run time/scalability. Data mining tends to: be more concerned with (business) applications than theory; talk a lot about databases and run time/scalability. There is a fuzzy dividing line between the two: Google Munich is hiring "machine learning" and not "data mining" experts (according to Google sponsored links, ca. 15/1/2008).

We will refer to everything as "machine learning" for the purposes of this course. In doing so, we will tend to shy away from many of the high-end scalability issues.

Lexical Acquisition: Machine Learning or Data Mining? Is lexical acquisition more related to machine learning or data mining? As good empiricists, we answer the question via Google counts (carried out on 15/1/2008):

                    lexical acquisition    (alone)
machine learning    6,460                  1,710,000
data mining         3,920                  3,540,000
(alone)             10,900                 9,630,000,000

Pointwise mutual information is an information-theoretic measure of the relative association between two events:

$$I(A; B) = \log_2 \frac{P(A, B)}{P(A)\,P(B)}$$

NB: the pointwise mutual information between two independent events is 0.

$I(\text{lexical acquisition}; \text{machine learning}) \approx 11.7$
$I(\text{lexical acquisition}; \text{data mining}) \approx 9.9$

Conclusion: lexical acquisition has more of a machine learning than a data mining orientation.
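
As a sanity check on these figures, here is a minimal Python sketch; the pmi helper and the hits table are ours, populated from the Google counts quoted above:

```python
from math import log2

# Google hit counts quoted above (15/1/2008); N = approx. total indexed pages
N = 9_630_000_000
hits = {
    "lexical acquisition": 10_900,
    "machine learning": 1_710_000,
    "data mining": 3_540_000,
    ("lexical acquisition", "machine learning"): 6_460,
    ("lexical acquisition", "data mining"): 3_920,
}

def pmi(a, b):
    """I(a; b) = log2( P(a, b) / (P(a) * P(b)) ), with MLE probabilities."""
    return log2((hits[(a, b)] / N) / ((hits[a] / N) * (hits[b] / N)))

print(pmi("lexical acquisition", "machine learning"))  # ~11.7
print(pmi("lexical acquisition", "data mining"))       # ~9.9
```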

DATA REPRESENTATION

Terminology. The input to a machine learning system consists of: Instances: the individual, independent examples of a concept (also known as exemplars). Attributes: measuring aspects of an instance (also known as features). Concepts: things that we aim to learn. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Example: weather.nominal dataset

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
...       ...          ...       ...    ...

Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Example: weather.nominal dataset, as above, with each row of the table highlighted as an instance (INSTANCE 1 = the first row, INSTANCE 2 = the second, ...). Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Example: weather.nominal dataset, as above, with each column of the table highlighted as an attribute (ATTRIBUTE 1 = Outlook, ATTRIBUTE 2 = Temperature, ...). Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

What's a Concept? Styles of learning concepts: Association learning: detecting associations between features. Clustering: grouping similar instances into clusters. Classification learning: predicting a discrete class. Regression: predicting a numeric quantity. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Association learning. Detect frequent patterns, associations, correlations, or causal structures among sets of items or objects in a dataset. Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database. Any kind of structure is considered "interesting", with no a priori sense of what we hope to predict. Potential for a massive number of association rules. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Top-10 Association Rules for weather.nominal

# java weka.associations.Apriori -t data/weather.nominal.arff

1. humidity=normal windy=FALSE ==> play=yes
2. temperature=cool ==> humidity=normal
3. outlook=overcast ==> play=yes
4. temperature=cool play=yes ==> humidity=normal
5. outlook=rainy windy=FALSE ==> play=yes
6. outlook=rainy play=yes ==> windy=FALSE
7. outlook=sunny humidity=high ==> play=no
8. outlook=sunny play=no ==> humidity=high
9. temperature=cool windy=FALSE ==> humidity=normal play=yes
10. temperature=cool humidity=normal windy=FALSE ==> play=yes

Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Full weather.nominal Dataset

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      TRUE   no

Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)
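
As a cross-check against this table, a small Python sketch computing the support and confidence of rule 3 from the previous slide (outlook=overcast ==> play=yes); the encoding of the table as a list of tuples is ours:

```python
# Verify rule 3 (outlook=overcast ==> play=yes) against weather.nominal.
data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

antecedent = [r for r in data if r[0] == "overcast"]
both = [r for r in antecedent if r[4] == "yes"]
print(len(both) / len(data))        # support:    4/14 ~ 0.29
print(len(both) / len(antecedent))  # confidence: 4/4  = 1.0
```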

Clustering. Finding groups of items that are similar. Clustering is unsupervised: the class of an example is not known. Success is often measured subjectively. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Clustering over weather.nominal: the same dataset excerpt as above. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Example Clusters for weather.nominal

Outlook   Temperature  Humidity  Windy  Cluster  Play
sunny     hot          high      FALSE  0        no
sunny     hot          high      TRUE   0        no
overcast  hot          high      FALSE  0        yes
rainy     mild         high      FALSE  1        yes
rainy     cool         normal    FALSE  1        yes
rainy     cool         normal    TRUE   1        no
overcast  cool         normal    TRUE   1        yes
sunny     mild         high      FALSE  0        no
sunny     cool         normal    FALSE  1        yes
rainy     mild         normal    FALSE  1        yes
sunny     mild         normal    TRUE   1        yes
overcast  mild         high      TRUE   1        yes
overcast  hot          normal    FALSE  0        yes
rainy     mild         high      TRUE   1        no

Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Classification learning. The scheme is provided with the actual outcome or class. The learning algorithm is provided with a set of classified training data. Measure success on fresh data for which the class labels are known (test data). Classification learning is supervised. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

[Diagram: a Learner is trained on labelled Training data (classes A and B) to produce a Classifier; the Classifier then assigns a Classification (A or B) to each instance in the Test data, including a new test instance of unknown class.] Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Example Predictions for weather.nominal

Outlook   Temperature  Humidity  Windy  Actual  Classified
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   (yes)   no
overcast  mild         high      TRUE   (yes)   yes
overcast  hot          normal    FALSE  (yes)   yes
rainy     mild         high      TRUE   (no)    yes

Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

A Word on Supervision. Supervised methods have prior knowledge of a closed set of classes, and set out to discover and categorise new instances according to those classes. Unsupervised methods dynamically discover the classes in the process of categorising the instances [STRONG DEFINITION]; OR: unsupervised methods categorise instances without the aid of pre-classified data [WEAK DEFINITION]. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Regression. Classification learning, but the class is (continuous) numeric. Learning is supervised. Also known as numeric prediction. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Example Predictions for weather.nominal

Outlook   Humidity  Windy  Play  Actual Temp  Classified Temp
sunny     85        FALSE  no    85
sunny     90        TRUE   no    80
overcast  86        FALSE  yes   83
rainy     96        FALSE  yes   70
rainy     80        FALSE  yes   68
rainy     70        TRUE   no    65
overcast  65        TRUE   yes   64
sunny     95        FALSE  no    72
sunny     70        FALSE  yes   69
rainy     80        FALSE  yes   75
sunny     70        TRUE   yes   (75)         68.8
overcast  90        TRUE   yes   (72)         76.2
overcast  75        FALSE  yes   (81)         70.6
rainy     91        TRUE   no    (71)         76.5

Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

What's in an Attribute? Each instance is described by a fixed feature vector of attributes. Attributes generally come in two types: 1. nominal; 2. continuous. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Nominal Attributes. Values are distinct symbols (e.g. {sunny, overcast, rainy}); the values themselves serve only as labels or names. Also called categorical, enumerated, or discrete (NB: enumerated and discrete imply an order, which tends not to exist). Special case: dichotomy ("boolean" attribute). No relation is implied among nominal values (no ordering or distance measure), and only equality tests can be performed. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

Continuous Attributes. Ratio quantities are real-valued attributes with a well-defined zero point and (usually) no explicit upper bound. Also called numeric. Example: the attribute distance; the distance between an object and itself is zero. All mathematical operations are allowed. Tan et al. (2006: pp. 19-35), Witten and Frank (1999: pp. 37-55), Witten and Frank (2005: pp. 41-60)

(VERY) BASICS OF PROBABILITY AND INFORMATION THEORY

Basics of Probability Theory

Joint probability $P(A, B)$: the probability of both A and B occurring, $P(A, B) = P(A \cap B)$

[Venn diagram: events A and B as overlapping regions within the sample space $\Omega$]

$P(\text{ace}, \text{heart}) = \frac{1}{52}$, $P(\text{heart}, \text{red}) = \frac{1}{4}$

Conditional probability $P(A \mid B)$: the probability of A occurring given the occurrence of B,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

[Venn diagram: events A and B as overlapping regions within the sample space $\Omega$]

$P(\text{ace} \mid \text{heart}) = \frac{1}{13}$, $P(\text{heart} \mid \text{red}) = \frac{1}{2}$

Multiplication rule: $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$

Prior probability $P(A)$: the probability of A occurring, given no additional knowledge about A.

Posterior probability $P(A \mid B)$: the probability of A occurring, given background knowledge about event(s) B leading up to A.

Independence: A and B are independent iff $P(A \cap B) = P(A)\,P(B)$
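
A quick worked check of the independence condition, using the card examples from the earlier slides (exact arithmetic via Python's fractions module):

```python
from fractions import Fraction

p_ace, p_heart, p_red = Fraction(4, 52), Fraction(13, 52), Fraction(26, 52)

# ace and heart: P(ace, heart) = 1/52 = P(ace) * P(heart)  -> independent
print(Fraction(1, 52) == p_ace * p_heart)   # True

# heart and red: P(heart, red) = 13/52 != P(heart) * P(red) -> dependent
print(Fraction(13, 52) == p_heart * p_red)  # False
```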

Binomial Distributions. A binomial distribution results from a series of independent trials with only two outcomes (i.e. Bernoulli trials), e.g. multiple coin tosses (H, T, H, H, ..., T). The probability of an event with probability p occurring exactly m out of n times is given by:

$$P(m; n, p) = \frac{n!}{m!\,(n-m)!}\, p^m (1-p)^{n-m}$$
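
A minimal sketch of this formula in Python, using math.comb for the binomial coefficient n!/(m!(n-m)!):

```python
from math import comb

def binom_pmf(m, n, p):
    """P(m; n, p): probability of exactly m successes in n Bernoulli trials."""
    return comb(n, m) * p**m * (1 - p) ** (n - m)

# e.g. exactly one head in 10 tosses of a coin that lands heads 10% of the time
print(binom_pmf(1, 10, 0.1))  # ~0.387, the peak of the plot on the next slide
```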

Binomial Example: [plot of $P(m; 10, p = 0.1)$ against $m = 0, \ldots, 10$]

Entropy. Given a probability distribution, the information (in bits) required to predict an event is the distribution's entropy or information value. The entropy of a discrete random event x with possible states 1, ..., n is:

$$H(x) = -\sum_{i=1}^{n} P(i) \log_2 P(i) = \frac{\operatorname{freq}(\bullet) \log_2 \operatorname{freq}(\bullet) - \sum_{i=1}^{n} \operatorname{freq}(i) \log_2 \operatorname{freq}(i)}{\operatorname{freq}(\bullet)}$$

where $\operatorname{freq}(\bullet) = \sum_{i=1}^{n} \operatorname{freq}(i)$ and $0 \log_2 0 \overset{\text{def}}{=} 0$
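
A sketch of the frequency-count form in Python; the 0 log2 0 = 0 convention is implemented by skipping zero counts, and the fair-die value matches the first of the loaded-dice plots below:

```python
from math import log2

def entropy(freqs):
    """Entropy (in bits) of a distribution given as raw frequency counts."""
    total = sum(freqs)
    return -sum(f / total * log2(f / total) for f in freqs if f > 0)

print(entropy([1, 1, 1, 1, 1, 1]))  # fair die: log2(6) ~ 2.58 bits
print(entropy([1, 0, 0, 0, 0, 0]))  # certain outcome: 0 bits
```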

Interpreting Entropy Values. A high entropy value means x is "boring" (uniform/flat); a low entropy value means x is "varied" (peaky).

Entropy of Loaded Dice: [bar plots of $P(i)$, $i = 1, \ldots, 6$, for five different dice]

(1) uniform die: $H(x) = 2.58$
(2) $H(x) = 2.16$
(3) $H(x) = 2.32$
(4) $H(x) = 1.56$
(5) all probability mass on one outcome: $H(x) = 0$

Estimating the Probabilities. The most obvious way of generating the probabilities is via maximum likelihood estimation (MLE), using the frequency counts in the training data:

$$\hat{P}(c_j) = \frac{\operatorname{freq}(c_j)}{\sum_k \operatorname{freq}(c_k)} \qquad \hat{P}(x_i \mid c_j) = \frac{\operatorname{freq}(x_i, c_j)}{\operatorname{freq}(c_j)}$$
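
A sketch of these estimates over the weather.nominal data shown earlier, taking Play as the class and Outlook as an attribute (the variable names are ours):

```python
from collections import Counter

play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]

class_freq = Counter(play)
p_yes = class_freq["yes"] / sum(class_freq.values())  # P^(yes) = 9/14
p_sunny_given_yes = sum(
    1 for o, c in zip(outlook, play) if o == "sunny" and c == "yes"
) / class_freq["yes"]                                 # P^(sunny | yes) = 2/9
print(p_yes, p_sunny_given_yes)
```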

Based on this, a term frequency representation such as: $x = \langle 233, 0, 2, 0, 0, 1, 0, \ldots \rangle$ would turn into something like: $x = \langle 0.1, 0, 0.0002, 0, 0, 0.0001, 0, \ldots \rangle$

EVALUATION

Baselines vs. Benchmarks. Baseline = a naive method which we would expect any reasonably well-developed method to better, e.g. for a novice marathon runner, the time it takes to walk 42km. Benchmark = an established rival technique which we are pitching our method against, e.g. for a marathon runner, the time of our last marathon run / the world record time / 3 hours / ... "Baseline" is often used as an umbrella term for both meanings.

The Importance of Baselines. Baselines are important in establishing whether any proposed method is doing better than "dumb and simple"; dumb methods often work surprisingly well. Baselines are valuable in getting a sense for the intrinsic difficulty of a given task (cf. accuracy = 5% vs. 99%). In formulating a baseline, we need to be sensitive to the importance of positives and negatives in the classification task, e.g. the limited utility of a baseline which labels every instance "unsuitable" for a classification task aimed at detecting potential sites for new diamond mines.

True/false Positives/negatives. The basis of evaluation metrics is generally a confusion matrix:

          Predicted Y          Predicted N
Actual Y  true positive (TP)   false negative (FN)
Actual N  false positive (FP)  true negative (TN)

Accuracy. The simplest form of evaluation is in terms of classification accuracy (aka accuracy or success rate). Accuracy is the proportion of instances for which we have correctly predicted the label. For a binary classifier:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN}$$

Tan et al. (2006: pp. 172-184), Witten and Frank (2005: pp. 144-145)

Alternatively, we sometimes talk about the error rate:

$$\mathrm{ER} = \frac{FP + FN}{TP + FP + FN + TN}$$

N.B. $\mathrm{ER} = 1 - \mathrm{ACC}$. Tan et al. (2006: pp. 172-184), Witten and Frank (2005: pp. 144-145)

Precision and Recall. If we wish to focus on only how well we have identified the positives and not what we have correctly ignored (or equivalently, performance relative to a single class of interest), we use precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Tan et al. (2006: pp. 172-184), Witten and Frank (2005: pp. 144-145)

F-score. In applications where we make individual decisions for each data point rather than generating a monolithic ranking (e.g. spam filtering), F-score gives us an overall picture of system performance:

$$\text{F-score} = \frac{(1 + \beta^2)\,P\,R}{R + \beta^2 P}$$

where P = precision and R = recall [weighted harmonic mean]. Set β depending on how much we care about false negatives vs. false positives (cf. intelligence filtering vs. spam filtering); conventionally β = 1. Tan et al. (2006: pp. 172-184), Witten and Frank (2005: pp. 144-145)
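
Tying the last few slides together, a sketch computing accuracy, precision, recall and F-score (with β = 1) over the four weather.nominal test instances classified earlier, treating play=yes as the positive class:

```python
actual    = ["yes", "yes", "yes", "no"]
predicted = ["no",  "yes", "yes", "yes"]

tp = sum(a == p == "yes" for a, p in zip(actual, predicted))           # 2
fp = sum(a == "no" and p == "yes" for a, p in zip(actual, predicted))  # 1
fn = sum(a == "yes" and p == "no" for a, p in zip(actual, predicted))  # 1
tn = sum(a == p == "no" for a, p in zip(actual, predicted))            # 0

accuracy  = (tp + tn) / (tp + fp + fn + tn)  # 0.5
precision = tp / (tp + fp)                   # 2/3
recall    = tp / (tp + fn)                   # 2/3
beta = 1
f_score = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
print(accuracy, precision, recall, f_score)  # F-score = 2/3
```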

Bias and Variance in Evaluation. The (training) bias of a classifier is the average distance between the expected value and the estimated value. The (test) variance of a classifier is the standard deviation of the estimated value around the expected value. The noise in a dataset is the inherent variability of the training data. In evaluation, we aim to minimise classifier bias and variance (but there's not a lot we can do about noise!). Tan et al. (2006: pp. 281-283), Witten and Frank (2005: pp. 144-153), Kohavi (1995)

Holdout. Train a classifier over a fixed training dataset, and evaluate it over a held-out test dataset. Advantages: simple to work with; high reproducibility. Disadvantages: trade-off between more training and more test data (variance vs. bias); representativeness of the training and test data. Tan et al. (2006: pp. 186-187), Witten and Frank (2005: pp. 144-153), Kohavi (1995)

Random Subsampling. Perform holdout over multiple iterations, randomly selecting the training and test data (maintaining a fixed size for each dataset) on each iteration. Evaluate by taking the average across the iterations. Advantages: reduction in variance and bias over the holdout method. Disadvantages: trade-off between more training and more test data. Tan et al. (2006: pp. 186-187), Witten and Frank (2005: pp. 144-153), Kohavi (1995)

Cross Validation: Input. Take our entire dataset. [Figure: the full dataset as a single block] Tan et al. (2006: pp. 186-187), Witten and Frank (2005: pp. 144-153), Kohavi (1995)

Cross Validation: Partitioning. Split the dataset up into N equal-sized partitions $P_i$: [Figure: the dataset divided into partitions $P_1, P_2, \ldots, P_{10}$] Tan et al. (2006: pp. 186-187), Witten and Frank (2005: pp. 144-153), Kohavi (1995)

Cross Validation: Folds 1, 2, 3, ... For each $i = 1 \ldots N$, take $P_i$ as the test data and $\{P_j : j \neq i\}$ as the training data. [Figures: one per fold, with partition $P_i$ of $P_1, \ldots, P_{10}$ highlighted as the test data] And so on, for each fold $i$. Tan et al. (2006: pp. 186-187), Witten and Frank (2005: pp. 144-153), Kohavi (1995)

Cross Validation: Evaluate. Evaluate according to the average across the N iterations. N is generally set to 10 (but possibly run over multiple iterations, with different partitions). Advantages: minimises bias and variance; makes effective use of the training data. Disadvantages: efficiency. Tan et al. (2006: pp. 186-187), Witten and Frank (2005: pp. 144-153), Kohavi (1995)
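
A minimal sketch of the full procedure; the function names, and the assumption of some evaluate(train, test) routine that trains and scores a classifier, are ours:

```python
import random

def cross_validation_splits(data, n=10, seed=0):
    """Shuffle, split into n roughly equal partitions P_1..P_n, and yield
    (train, test) pairs with each partition held out in turn."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    partitions = [idx[i::n] for i in range(n)]
    for i in range(n):
        test = [data[j] for j in partitions[i]]
        train = [data[j] for k in range(n) if k != i for j in partitions[k]]
        yield train, test

# scores = [evaluate(train, test)
#           for train, test in cross_validation_splits(dataset)]
# then report the average of scores across the n folds
```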

Stratified Cross Validation. To further reduce variance and bias, we can additionally stratify the data = partition the data so as to maintain the overall class distribution within individual partitions. Subtle questions about whether we are accurately modelling novel data instances, or giving our classifier a push in the right direction? Tan et al. (2006: pp. 186-187), Witten and Frank (2005: pp. 144-153), Kohavi (1995)

Summary. What are the 4 basic flavours of machine learning? What is the difference between supervised and unsupervised ML? How is data standardly represented in ML? What is entropy, and what is its relevance to ML? What are accuracy, precision, recall and F-score? What are the different ways of partitioning up the data in evaluation, and what are the relative strengths and weaknesses of each?

References

Kohavi, Ron. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), 1137-1143.

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. 2006. Introduction to Data Mining. Addison Wesley.

Witten, Ian H., and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, USA: Morgan Kaufmann.

Witten, Ian H., and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, USA: Morgan Kaufmann, second edition.