Mining Classification Knowledge


Mining Classification Knowledge: Remarks on Non-Symbolic Methods. Jerzy Stefanowski, Institute of Computing Sciences, Poznań University of Technology. SE lecture, revision 2013.

Outline
1. Bayesian classification
2. K nearest neighbors
3. Linear discrimination
4. Artificial neural networks
5. Other remarks

Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for a hypothesis (decision); among the most practical approaches to certain types of learning problems.
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities.
- Applications: quite effective in some problems, e.g. text classification.
- Good mathematical background and a reference point for other methods.

Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown. Let H be a hypothesis that X belongs to class C. For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
- P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; reflects the background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows from Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as posterior = likelihood x prior / evidence.
MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
Practical difficulty: requires initial knowledge of many probabilities and carries a significant computational cost.
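As a quick numeric illustration of Bayes' theorem and the MAP choice (a minimal sketch with invented numbers, not taken from the lecture):

```python
# Hypothetical two-hypothesis example; priors P(h) and likelihoods P(D|h) are made up.
priors = {"h1": 0.3, "h2": 0.7}
likelihoods = {"h1": 0.8, "h2": 0.2}   # P(D|h)

# Evidence P(D) = sum_h P(D|h) P(h)
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posteriors P(h|D) via Bayes' theorem
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
print(posteriors)      # {'h1': ~0.63, 'h2': ~0.37}

# MAP hypothesis: argmax_h P(D|h) P(h)  (the evidence term can be ignored)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)           # 'h1'
```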

Bayesian Classifiers
Consider each attribute and the class label as random variables. Given a record with attributes (A1, A2, ..., An), the goal is to predict class C. Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An). Can we estimate P(C | A1, A2, ..., An) directly from the data?

Bayesian Classifiers
Approach: compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes' theorem:
P(C | A1, A2, ..., An) = P(A1, A2, ..., An | C) P(C) / P(A1, A2, ..., An)
Choose the value of C that maximizes P(C | A1, A2, ..., An). This is equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C).
How do we estimate P(A1, A2, ..., An | C)?

Naïve Bayes Classifier
Assume independence among the attributes Ai when the class is given:
P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)
P(Ai | Cj) can be estimated for all Ai and Cj. A new point is classified as Cj if P(Cj) Π_i P(Ai | Cj) is maximal. This greatly reduces the amount of data that must be collected: only class-conditional value counts are needed.

Probabilities for weather data
The weather data (Witten & Eibe):

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No

Counting attribute values per class gives the frequency and probability tables on the next slide.

Probabilities for weather data

Outlook       Yes  No   P(|yes)  P(|no)
Sunny          2    3     2/9     3/5
Overcast       4    0     4/9     0/5
Rainy          3    2     3/9     2/5

Temperature   Yes  No   P(|yes)  P(|no)
Hot            2    2     2/9     2/5
Mild           4    2     4/9     2/5
Cool           3    1     3/9     1/5

Humidity      Yes  No   P(|yes)  P(|no)
High           3    4     3/9     4/5
Normal         6    1     6/9     1/5

Windy         Yes  No   P(|yes)  P(|no)
False          6    2     6/9     2/5
True           3    3     3/9     3/5

Play          Yes  No   P(yes)   P(no)
               9    5    9/14     5/14

A new day: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of the two classes:
for yes: 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
for no: 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
Conversion into probabilities by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
(Witten & Eibe)
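A minimal sketch of this computation in Python, using the counts from the tables above (the function and variable names are mine, not from the slides). Attributes whose value is missing (None) are simply skipped, which matches the treatment of missing values on the next slide:

```python
# Conditional probabilities P(value | class) from the tables above,
# plus the class priors P(yes) = 9/14 and P(no) = 5/14.
cond_probs = {
    "yes": {"Sunny": 2/9, "Overcast": 4/9, "Rainy": 3/9,
            "Hot": 2/9, "Mild": 4/9, "Cool": 3/9,
            "High": 3/9, "Normal": 6/9,
            "False": 6/9, "True": 3/9},
    "no":  {"Sunny": 3/5, "Overcast": 0/5, "Rainy": 2/5,
            "Hot": 2/5, "Mild": 2/5, "Cool": 1/5,
            "High": 4/5, "Normal": 1/5,
            "False": 2/5, "True": 3/5},
}
priors = {"yes": 9/14, "no": 5/14}

def naive_bayes(instance):
    """Return normalized class probabilities for a list of attribute values.

    Missing values may be passed as None; they are omitted from the product.
    """
    likelihoods = {}
    for cls in priors:
        p = priors[cls]
        for value in instance:
            if value is not None:                 # skip missing attributes
                p *= cond_probs[cls][value]
        likelihoods[cls] = p
    total = sum(likelihoods.values())
    return {cls: p / total for cls, p in likelihoods.items()}

# The new day: Sunny, Cool, High, True
print(naive_bayes(["Sunny", "Cool", "High", "True"]))
# -> {'yes': ~0.205, 'no': ~0.795}
```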

Missing values
Training: the instance is not included in the frequency count for that attribute value-class combination.
Classification: the attribute is simply omitted from the calculation.
Example: Outlook = ?, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of yes = 3/9 x 3/9 x 3/9 x 9/14 = 0.0238
Likelihood of no = 1/5 x 4/5 x 3/5 x 5/14 = 0.0343
P(yes) = 0.0238 / (0.0238 + 0.0343) = 41%
P(no) = 0.0343 / (0.0238 + 0.0343) = 59%
(Witten & Eibe)
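Continuing the sketch above (the naive_bayes helper is my own illustration, not the authors' code), the same example with Outlook missing would be:

```python
# Outlook is unknown, so it is passed as None and dropped from the product.
print(naive_bayes([None, "Cool", "High", "True"]))
# -> {'yes': ~0.41, 'no': ~0.59}
```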

Naïve Bayes: discussion
Naïve Bayes works surprisingly well, even when the independence assumption is clearly violated. Why? Because classification does not require accurate probability estimates, as long as the maximum probability is assigned to the correct class. However, adding too many redundant attributes will cause problems (e.g. identical attributes). Note also that many numeric attributes are not normally distributed (consider kernel density estimators).

Instance-Based Methods
Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches:
- k-nearest neighbor: instances represented as points in a Euclidean space.
- Locally weighted regression: constructs a local approximation.

Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
(Figure: compute the distance from the test record to the training records, then choose the k nearest records.)

k-Nearest Neighbor Algorithm (the case of a discrete set of classes)
1. Take the instance x to be classified.
2. Find the k nearest neighbors of x in the training data.
3. Determine the class c of the majority of the instances among the k nearest neighbors.
4. Return the class c as the classification of x.
The distance function is composed from difference metrics d_a defined for each pair of instances x_i and x_j; a minimal sketch is shown below.
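A minimal k-NN sketch in Python, assuming numeric attributes and Euclidean distance (names such as knn_classify and the toy data are mine, not from the lecture):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(x, training_data, k=3):
    """Classify x by majority vote among its k nearest training instances.

    training_data is a list of (attribute_vector, class_label) pairs.
    """
    neighbors = sorted(training_data, key=lambda pair: euclidean(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny made-up example with two 2-D classes.
data = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((3.1, 2.9), "-")]
print(knn_classify((1.1, 1.0), data, k=3))   # -> '+'
```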

Classification & Decision Boundaries
(Figure: a query point q1 among positive and negative training examples. With 1-NN, q1 is classified as positive; with 5-NN, q1 is classified as negative.)

The distance function
Simplest case: one numeric attribute. The distance is the difference between the two attribute values involved (or a function thereof).
Several numeric attributes: normally, Euclidean distance is used and the attributes are normalized.
Nominal attributes: the distance is set to 1 if the values are different and 0 if they are equal.
Are all attributes equally important? Weighting the attributes might be necessary.

Instance-based learning
The distance function defines what is learned. Most instance-based schemes use Euclidean distance:
d(a(1), a(2)) = sqrt( (a1(1) - a1(2))^2 + (a2(1) - a2(2))^2 + ... + (ak(1) - ak(2))^2 )
where a(1) and a(2) are two instances with k attributes. Taking the square root is not required when comparing distances.
Another popular metric is the city-block (Manhattan) metric, which adds the differences without squaring them.

Normalization and other issues
Different attributes are measured on different scales and need to be normalized:
a_i = (v_i - min v_i) / (max v_i - min v_i)   or   a_i = (v_i - Avg(v_i)) / StDev(v_i)
where v_i is the actual value of attribute i.
Nominal attributes: distance is either 0 or 1.
Common policy for missing values: they are assumed to be maximally distant (given normalized attributes).
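A small sketch of both normalizations (min-max rescaling and z-score standardization), assuming the attribute values are stored in a plain Python list:

```python
from statistics import mean, stdev

def min_max_normalize(values):
    """Rescale values to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_normalize(values):
    """Standardize values: (v - mean) / standard deviation."""
    avg, sd = mean(values), stdev(values)
    return [(v - avg) / sd for v in values]

temps = [64, 68, 70, 75, 83]   # made-up numeric attribute values
print(min_max_normalize(temps))
print(z_score_normalize(temps))
```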

Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions: calculate the mean value of the k nearest neighbors.
Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors:
w_i = 1 / d(x_q, x_i)^2
Similarly, we can distance-weight the instances for real-valued target functions (a sketch of the weighted vote follows below).
Robust to noisy data by averaging the k nearest neighbors.
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
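A possible distance-weighted variant of the earlier knn_classify sketch (again my own illustration, reusing euclidean and the toy data from above), where each neighbor votes with weight 1/d^2:

```python
from collections import defaultdict

def knn_classify_weighted(x, training_data, k=3):
    """Distance-weighted k-NN: each neighbor votes with weight 1 / d(x_q, x_i)^2."""
    neighbors = sorted(training_data, key=lambda pair: euclidean(x, pair[0]))[:k]
    votes = defaultdict(float)
    for attrs, label in neighbors:
        d = euclidean(x, attrs)
        votes[label] += 1.0 / (d ** 2 + 1e-12)   # small constant guards against d == 0
    return max(votes, key=votes.get)

print(knn_classify_weighted((1.1, 1.0), data, k=3))   # -> '+'
```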

Disadvantages of the NN Algorithm
- The NN algorithm has large storage requirements, because it has to store all the data.
- The NN algorithm is slow during instance classification, because all the training instances have to be visited.
- The accuracy of the NN algorithm degrades as the noise in the training data increases.
- The accuracy of the NN algorithm degrades as the number of irrelevant attributes increases.

Condensed NN Algorithm
The Condensed NN algorithm was introduced to reduce the storage requirements of the NN algorithm. It finds a subset S of the training data D such that each instance in D can be correctly classified by the NN algorithm applied to the subset S. The average reduction achieved by the algorithm varies between 60% and 80%. A sketch of one way to build such a subset is shown below.
(Figure: the full data set D and the reduced subset S.)
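A rough sketch in the spirit of Hart's condensed NN (my own simplification, not necessarily the exact variant meant in the lecture; it reuses knn_classify and the toy data from the earlier sketch):

```python
def condensed_nn(training_data):
    """Greedily build a subset S of the training data.

    An instance is added to S whenever 1-NN on the current S misclassifies it;
    passes over the data repeat until no instance is added.
    """
    S = [training_data[0]]                       # seed with an arbitrary instance
    changed = True
    while changed:
        changed = False
        for attrs, label in training_data:
            if knn_classify(attrs, S, k=1) != label:
                S.append((attrs, label))
                changed = True
    return S

print(len(condensed_nn(data)), "of", len(data), "instances kept")
```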

Remarks on Lazy vs. Eager Learning
Instance-based learning uses lazy evaluation; decision-tree and Bayesian classification use eager evaluation.
Key differences:
- A lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D; an eager method cannot, since it has already committed to a global approximation before seeing the query.
- Efficiency: lazy methods spend less time training but more time predicting.
- Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function; an eager method must commit to a single hypothesis that covers the entire instance space.

Linear Classification
Binary classification problem.
(Figure: two classes of points, x and o, separated by a line; the data above the blue line belongs to class x, the data below the red line to class o.)
Examples: SVM, perceptron, probabilistic classifiers.

Discriminative Classifiers
Advantages:
- prediction accuracy is generally high (as compared to Bayesian methods in general)
- robust: works when training examples contain errors
- fast evaluation of the learned target function
Criticism:
- long training time
- the learned function (weights) is difficult to understand
- domain knowledge is not easy to incorporate (which is easy in the form of priors on the data or distributions)

Neural Networks
Analogy to biological systems (indeed, a great example of a good learning system). Massive parallelism, allowing for computational efficiency. The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule.

A Neuron
(Figure: inputs x_0, x_1, ..., x_n with weights w_0, w_1, ..., w_n, a bias µ_k, a weighted sum, and an activation function f producing the output y.)
The n-dimensional input vector x is mapped into the variable y by means of a scalar product and a nonlinear function mapping.

A Neuron (continued)
For example, with the sign function as the activation:
y = sign( Σ_{i=0..n} w_i x_i + µ_k )

Artificial Neural Networks (ANN)

X1  X2  X3  Y
 1   0   0  0
 1   0   1  1
 1   1   0  1
 1   1   1  1
 0   0   1  0
 0   1   0  0
 0   1   1  1
 0   0   0  0

Output Y is 1 if at least two of the three inputs are equal to 1.

Artificial Neural Networks (ANN)
The same truth table can be produced by the model (verified by the sketch below):
Y = I( 0.3 X1 + 0.3 X2 + 0.3 X3 - 0.4 > 0 ),  where I(z) = 1 if z is true and 0 otherwise.
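A quick check of this threshold unit against the truth table above (plain Python; the helper name is mine):

```python
def unit(x1, x2, x3):
    """Threshold unit: Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)."""
    return int(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0)

for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            expected = int(x1 + x2 + x3 >= 2)   # "at least two inputs are 1"
            assert unit(x1, x2, x3) == expected
print("threshold unit reproduces the truth table")
```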

Artificial Neural Networks (ANN)
The model is an assembly of interconnected nodes and weighted links. Each output node sums its input values according to the weights of its links and compares the result against some threshold t (the perceptron model):
Y = I( Σ_i w_i X_i - t > 0 )   or   Y = sign( Σ_i w_i X_i - t )

Multi-Layer Perceptron
(Figure: input vector x_i feeding the input nodes, then hidden nodes, then output nodes producing the output vector.)
I_j = Σ_i w_ij O_i + θ_j                 (net input to unit j)
O_j = 1 / (1 + e^(-I_j))                 (sigmoid output of unit j)
Err_j = O_j (1 - O_j)(T_j - O_j)         (error at an output unit)
Err_j = O_j (1 - O_j) Σ_k Err_k w_jk     (error at a hidden unit)
θ_j = θ_j + (l) Err_j                    (bias update)
w_ij = w_ij + (l) Err_j O_i              (weight update, learning rate l)

Network Training
The ultimate objective of training is to obtain a set of weights that classifies almost all the examples in the training data correctly.
Steps:
- Initial weights are set randomly.
- Input examples are fed into the network one by one.
- Activation values for the hidden nodes are computed.
- The output vector is computed once the activation values of all hidden nodes are available.
- Weights are adjusted using the error (desired output - actual output).
A small sketch of these steps is given below.
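A compact sketch of the training loop for one hidden layer, implementing the update formulas from the multi-layer perceptron slide in plain Python (layer sizes, learning rate and the toy training set are my own choices, not from the lecture):

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set: the "at least two of the three inputs are 1" truth table.
examples = [((x1, x2, x3), int(x1 + x2 + x3 >= 2))
            for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]

n_in, n_hid, lr = 3, 4, 0.5
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]  # input -> hidden
b_h  = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
w_ho = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]                          # hidden -> output
b_o  = random.uniform(-0.5, 0.5)

def forward(x):
    """I_j = sum_i w_ij O_i + theta_j ; O_j = 1 / (1 + e^-I_j)."""
    o_h = [sigmoid(sum(w_ih[j][i] * x[i] for i in range(n_in)) + b_h[j]) for j in range(n_hid)]
    o_out = sigmoid(sum(w_ho[j] * o_h[j] for j in range(n_hid)) + b_o)
    return o_h, o_out

for epoch in range(5000):
    for x, target in examples:
        o_h, o_out = forward(x)
        # Errors: Err_out = O(1-O)(T-O); hidden Err_j = O_j(1-O_j) * Err_out * w_jo
        err_out = o_out * (1 - o_out) * (target - o_out)
        err_h = [o_h[j] * (1 - o_h[j]) * err_out * w_ho[j] for j in range(n_hid)]
        # Updates: w_ij += (l) Err_j O_i ; theta_j += (l) Err_j
        for j in range(n_hid):
            w_ho[j] += lr * err_out * o_h[j]
            b_h[j]  += lr * err_h[j]
            for i in range(n_in):
                w_ih[j][i] += lr * err_h[j] * x[i]
        b_o += lr * err_out

# After training, the network outputs should be close to the targets.
for x, target in examples:
    print(x, target, round(forward(x)[1], 2))
```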

Neural Networks
Advantages:
- prediction accuracy is generally high
- robust: works when training examples contain errors
- the output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
- fast evaluation of the learned target function
Criticism:
- long training time
- the learned function (weights) is difficult to understand
- domain knowledge is not easy to incorporate