Data Mining: Concepts and Techniques
Classification and Prediction (Chapter 6.4-6)
CSE-4412: Data Mining, February 8, 2007


Chapter 6: Classification and Prediction

1. What is classification? What is prediction?
2. Issues regarding classification and prediction
3. Classification by decision tree induction
4. Bayesian classification
5. Rule-based classification
6. Classification by backpropagation
7. Support Vector Machines (SVM)
8. Summary

Basic Idea (Again)

Use old tuples with known classes to classify new tuples with unknown classes.
E.g., the tuples are customers:
- old tuples: previous and current customers
- new tuples: prospective customers
- question: Is the customer a good credit risk?
- classes (answers): good, fair, poor
Why not just use the class prior probabilities over the old tuples? Yes, why not?

Use the Attributes

Okay, the tuples have attributes. Use the attribute values to do better classification.
Idea: Given a new tuple (e.g., <25, $72k, student>), use just the old tuples that match exactly to decide.
Would this work? What are the problems with this approach?
Still a good idea. How can we fix this approach?

Bayesian Classification

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities. Based on Bayes' theorem.
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Standard: can be computationally intractable, but provides a standard of optimal decision making against which other methods can be measured.

Bayes' Theorem: Basics

Let X be a data sample ("evidence") whose class label is unknown.
Let H be the hypothesis that X belongs to class C.
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X.
- P(H) (prior probability): the initial probability. E.g., X will buy a computer, regardless of age, income, ...
- P(X): the probability that the sample data is observed.
- P(X|H) (likelihood): the probability of observing the sample X given that the hypothesis holds. E.g., given that X will buy a computer, the probability that X is age 31..40 with medium income.

Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

Informally: posterior = likelihood × prior / evidence.
Predict that X belongs to class Ci iff the probability P(Ci|X) is the highest among the P(Ck|X) for all k classes.
Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost.

Towards a Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
Suppose there are k classes C1, C2, ..., Ck.
Classification derives the maximum a posteriori class, i.e., the maximal P(Ci|X). From Bayes' theorem:

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
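As a quick sanity check of the decision rule, here is a minimal Python sketch of MAP classification. The numbers reuse the P(Ci) and P(X|Ci) values from the worked buys_computer example a few slides below; the helper name posterior_scores is illustrative, not from the text.

```python
# Minimal sketch of the MAP decision rule: pick the class Ci maximizing
# P(X|Ci) * P(Ci). P(X) is dropped because it is the same for every class.

def posterior_scores(priors, likelihoods):
    """Return the unnormalized posterior P(X|Ci) * P(Ci) per class."""
    return {c: likelihoods[c] * priors[c] for c in priors}

priors = {"yes": 9 / 14, "no": 5 / 14}        # P(Ci), from the example below
likelihoods = {"yes": 0.044, "no": 0.019}     # P(X|Ci), from the example below

scores = posterior_scores(priors, likelihoods)
print(max(scores, key=scores.get))  # 'yes' (0.028 vs. 0.007)
```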

Derivation of the Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class (i.e., there are no dependence relations between attributes):

    P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

This greatly reduces the computational cost: only the class distribution needs to be counted.
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
- If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))    and    P(xk|Ci) = g(xk, μCi, σCi)

Naïve Bayesian Classifier: Training Dataset

Classes: C1: buys_computer = yes; C2: buys_computer = no.
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair).

    age     income  student  credit_rating  buys_computer
    <=30    high    no       fair           no
    <=30    high    no       excellent      no
    31..40  high    no       fair           yes
    >40     medium  no       fair           yes
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31..40  low     yes      excellent      yes
    <=30    medium  no       fair           no
    <=30    low     yes      fair           yes
    >40     medium  yes      fair           yes
    <=30    medium  yes      excellent      yes
    31..40  medium  no       excellent      yes
    31..40  high    yes      fair           yes
    >40     medium  no       excellent      no
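The two per-attribute likelihood forms above translate directly into code. A minimal sketch, with illustrative function names (gaussian, categorical_likelihood) that are not from the text:

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): density used for a continuous attribute Ak."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def categorical_likelihood(value, attr, class_tuples):
    """P(xk | Ci) for a categorical attribute: the fraction of Ci's
    tuples (given as dicts) whose attribute `attr` equals `value`."""
    return sum(1 for t in class_tuples if t[attr] == value) / len(class_tuples)

# E.g., over the table above, categorical_likelihood("yes", "student",
# yes_tuples) would give 6/9, as computed on the next slide.
```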

Naïve Bayesian Classifier: An Example

P(Ci):
    P(buys_computer = yes) = 9/14 = 0.643
    P(buys_computer = no) = 5/14 = 0.357
Compute P(X|Ci) for each class:
    P(age = <=30 | buys_computer = yes) = 2/9 = 0.222
    P(age = <=30 | buys_computer = no) = 3/5 = 0.6
    P(income = medium | buys_computer = yes) = 4/9 = 0.444
    P(income = medium | buys_computer = no) = 2/5 = 0.4
    P(student = yes | buys_computer = yes) = 6/9 = 0.667
    P(student = yes | buys_computer = no) = 1/5 = 0.2
    P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
    P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
    P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X | buys_computer = no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
    P(X | buys_computer = yes) × P(buys_computer = yes) = 0.028
    P(X | buys_computer = no) × P(buys_computer = no) = 0.007
Therefore, X belongs to the class buys_computer = yes.

The Zero-Probability Problem

Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability P(X|Ci) = ∏(k=1..n) P(xk|Ci) will be zero.
E.g., suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10).
Use the Laplacian correction (or Laplacian estimator): add 1 to each count. E.g.:
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
The corrected probability estimates are close to their uncorrected counterparts, but no estimate is ever zero.
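The Laplacian correction is one line of arithmetic per value. A sketch reproducing the income example above (the helper name is illustrative):

```python
from collections import Counter

def laplace_probs(values, domain):
    """Laplacian correction: add 1 to every value's count, so no
    estimated probability is ever zero."""
    counts = Counter(values)
    total = len(values) + len(domain)   # 1000 tuples + 3 added pseudo-counts
    return {v: (counts[v] + 1) / total for v in domain}

income = ["medium"] * 990 + ["high"] * 10   # income = low never occurs
probs = laplace_probs(income, domain=["low", "medium", "high"])
# {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```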

Naïve Bayesian Classifier: Evaluation (Advantages)

- Easy to implement and maintain.
- Easy to update incrementally with new training tuples.
- Good results obtained in many domains.
- No issues with overfitting the model.
- Reasonably immune to noise.
- Can work with missing values in the data, both in training and when classifying.
- Noise in the data (incorrect values) gets balanced out, to some extent.

Naïve Bayesian Classifier: Evaluation (Disadvantages)

- Assumes class-conditional independence, and therefore loses accuracy. In practice, dependencies do exist among variables. E.g., hospital patient profiles: age, family history, etc.; symptoms: fever, cough, etc.; diseases: lung cancer, diabetes, etc. Dependencies among these cannot be modeled by a naïve Bayesian classifier.
- Black box: the model cannot be interpreted.
How to deal with these dependencies? Bayesian belief networks.

Bayesian Belief Networks

A Bayesian belief network allows a subset of the variables to be conditionally independent.
A graphical model of causal relationships:
- Represents dependencies among the variables.
- Gives a specification of the joint probability distribution.
- Nodes: random variables.
- Links: dependencies. E.g., in a network over X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.
- The graph has no loops or cycles.

Bayesian Belief Network: An Example

[Figure: a network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea, in which FamilyHistory and Smoker are the parents of LungCancer.]
The conditional probability table (CPT) for the variable LungCancer (LC), given its parents FamilyHistory (FH) and Smoker (S):

          (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC     0.8      0.5       0.7       0.1
    ~LC    0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of a node's parents.
Derivation of the probability of a particular combination of values x1, ..., xn of X from the CPTs:

    P(x1, ..., xn) = ∏(i=1..n) P(xi | Parents(Xi))
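The product formula makes each joint-probability entry a few multiplications. A sketch for three nodes of the lung-cancer network, assuming hypothetical priors for FamilyHistory and Smoker (the slide only gives LungCancer's CPT):

```python
# P(LC | FH, S): the CPT from the slide, keyed by (FH, S) truth values.
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors for the root nodes -- NOT given on the slide.
p_fh, p_s = 0.3, 0.4

def joint(fh, s, lc):
    """P(fh, s, lc) = P(fh) * P(s) * P(lc | fh, s): the network
    factorization P(x1, ..., xn) = prod_i P(xi | Parents(Xi)),
    restricted to these three nodes."""
    pf = p_fh if fh else 1.0 - p_fh
    ps = p_s if s else 1.0 - p_s
    pl = p_lc[(fh, s)] if lc else 1.0 - p_lc[(fh, s)]
    return pf * ps * pl

print(joint(True, True, True))  # 0.3 * 0.4 * 0.8 = 0.096
```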

Training Bayesian Networks

Several scenarios:
- Network structure known and all variables observable: learn only the CPTs.
- Network structure known, some hidden variables: use a gradient descent (greedy hill-climbing) method, analogous to neural network learning.
- Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
- Structure unknown, hidden variables: no good algorithms are known for this purpose!
Ref.: D. Heckerman, "Bayesian Networks for Data Mining".

Using IF-THEN Rules for Classification

Represent the knowledge in the form of IF-THEN rules:
    R: IF age = youth AND student = yes THEN buys_computer = yes
(rule antecedent/precondition vs. rule consequent).
Assessment of a rule: coverage and accuracy (a code sketch of these two measures follows below).
    n_covers = number of tuples covered by R
    n_correct = number of tuples correctly classified by R
    coverage(R) = n_covers / |D|     (D: training data set)
    accuracy(R) = n_correct / n_covers
If more than one rule is triggered, conflict resolution is needed:
- Size ordering: assign the highest priority to the triggering rule that has the toughest requirements (i.e., the most attribute tests).
- Class-based ordering: decreasing order of prevalence or misclassification cost per class.
- Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts.

Rule Extraction from a Decision Tree

Rules are easier to understand than large trees.
One rule is created for each path from the root to a leaf: each attribute-value pair along a path forms a conjunction, and the leaf holds the class prediction.
The rules are mutually exclusive and exhaustive.
Example: rule extraction from our buys_computer decision tree (root age? with branches <=30, 31..40, >40; the <=30 branch tests student?, the >40 branch tests credit_rating?):
    IF age = young AND student = no THEN buys_computer = no
    IF age = young AND student = yes THEN buys_computer = yes
    IF age = mid-age THEN buys_computer = yes
    IF age = old AND credit_rating = excellent THEN buys_computer = yes
    IF age = old AND credit_rating = fair THEN buys_computer = no
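A minimal sketch of the two rule-quality measures, with a rule represented as an (antecedent predicate, consequent class) pair; this representation and the toy data are illustrative, not from the text:

```python
def coverage_and_accuracy(rule, data):
    """coverage(R) = n_covers / |D|; accuracy(R) = n_correct / n_covers.
    `rule` is (antecedent, consequent); `data` is a list of
    (tuple, class) pairs, with tuples as attribute dicts."""
    antecedent, consequent = rule
    covered = [(t, c) for t, c in data if antecedent(t)]
    n_covers = len(covered)
    n_correct = sum(1 for _, c in covered if c == consequent)
    return n_covers / len(data), (n_correct / n_covers if n_covers else 0.0)

# R: IF age = youth AND student = yes THEN buys_computer = yes
rule = (lambda t: t["age"] == "youth" and t["student"] == "yes", "yes")
data = [({"age": "youth", "student": "yes"}, "yes"),
        ({"age": "youth", "student": "no"}, "no"),
        ({"age": "senior", "student": "yes"}, "yes")]
print(coverage_and_accuracy(rule, data))  # (0.333..., 1.0)
```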

Rule Extraction from the Training Data

Sequential covering algorithms extract rules directly from the training data. Typical examples: FOIL, AQ, CN2, RIPPER.
Rules are learned sequentially; each rule for a given class Ci should cover many tuples of Ci but none (or few) of the tuples of the other classes.
Steps:
- Rules are learned one at a time.
- Each time a rule is learned, the tuples covered by the rule are removed.
- The process repeats on the remaining tuples until a termination condition holds, e.g., there are no more training examples, or the quality of the rule returned falls below a user-specified threshold.
Comparison with decision-tree induction: a decision tree learns a whole set of rules simultaneously.

Learn-One-Rule

Start with the most general rule possible: condition = empty.
Add new attribute tests by adopting a greedy depth-first strategy: pick the test that most improves the rule quality.
Rule-quality measures consider both coverage and accuracy. FOIL gain (in FOIL and RIPPER) assesses the information gained by extending the condition:

    FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) − log2(pos / (pos + neg)))

It favors rules that have high accuracy and cover many positive tuples.
Rule pruning is based on an independent set of test tuples:

    FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.
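Both FOIL measures are direct to compute. A sketch, with illustrative argument names: pos/neg are the counts before extending the rule, pos2/neg2 the primed counts after adding the candidate test:

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg))),
    where the primed counts are taken after extending the rule."""
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if the pruned
    version of R scores higher."""
    return (pos - neg) / (pos + neg)

# A candidate test that keeps 6 of 8 positives but only 2 of 12 negatives:
print(foil_gain(pos=8, neg=12, pos2=6, neg2=2))  # ~5.44: a good extension
```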

Classification as a Mathematical Mapping

Classification predicts categorical class labels. E.g., personal homepage classification:
    xi = (x1, x2, x3, ...), yi = +1 or −1
    x1: number of occurrences of the word "homepage"
    x2: number of occurrences of the word "welcome"
Mathematically: x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}. We want a function f: X → Y.
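To make the mapping concrete, a tiny sketch of the homepage example: extract the two word counts as x, then apply a linear f. The weights and threshold are hand-picked illustrative values, not from the text:

```python
def features(text):
    """x = (x1, x2): counts of the words 'homepage' and 'welcome'."""
    words = text.lower().split()
    return (words.count("homepage"), words.count("welcome"))

def f(x, w=(1.0, 1.0), b=-0.5):
    """A hypothetical linear classifier f: X -> Y = {+1, -1}."""
    return +1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

print(f(features("welcome to my homepage")))   # +1
print(f(features("quarterly sales report")))   # -1
```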

Linear Classification

[Figure: points of class 'x' above a red line and points of class 'o' below it in the plane.]
A binary classification problem: the data above the red line belong to class 'x'; the data below the red line belong to class 'o'.
Examples: SVM, perceptron, probabilistic classifiers.

Discriminative Classifiers

Advantages:
- Prediction accuracy is generally high, as compared to Bayesian methods in general.
- Robust: works when training examples contain errors.
- Fast evaluation of the learned target function (Bayesian networks are normally slow).
Disadvantages:
- Long training time.
- Difficult to understand the learned function (weights), whereas Bayesian networks can be used easily for pattern discovery.
- Not easy to incorporate domain knowledge, which is easy in Bayesian methods in the form of priors on the data or distributions.

Perceptron and Winnow

Notation: vectors x, w; scalars x, y, w.
Input: {(x1, y1), ...}
Output: a classification function f(x) such that f(xi) > 0 for yi = +1 and f(xi) < 0 for yi = −1.
The decision boundary is f(x): w·x + b = 0, i.e., w1·x1 + w2·x2 + b = 0.
[Figure: a separating line in the (x1, x2) plane.]
Perceptron: update w additively.
Winnow: update w multiplicatively.

Classification by Backpropagation

A (nonlinear) neural network: a set of connected input/output units where each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class labels of the input tuples.
Also referred to as connectionist learning, due to the connections between units.
Backpropagation: a neural network learning algorithm.
The field was started by psychologists and neurobiologists to develop and test computational analogues of neurons.
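A minimal sketch of the additive perceptron update on toy data; Winnow would differ only in multiplying the weights on a mistake instead of adding. The training loop, learning rate, and data are illustrative:

```python
def train_perceptron(data, n_features, lr=1.0, epochs=20):
    """data: list of (x, y) with x a feature list and y in {+1, -1}.
    On each misclassified example, update additively:
    w <- w + lr * y * x  and  b <- b + lr * y."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in data:
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * f <= 0:  # wrong side of (or on) the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data: class +1 roughly above the line x1 + x2 = 4.
data = [([1.0, 1.0], -1), ([2.0, 0.0], -1), ([3.0, 3.0], +1), ([2.0, 4.0], +1)]
w, b = train_perceptron(data, n_features=2)
print(w, b)  # coefficients of a separating line w1*x1 + w2*x2 + b = 0
```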

Neural Network as a Classifier

Advantages:
- High tolerance to noisy data.
- Ability to classify untrained patterns.
- Well suited for continuous-valued inputs and outputs.
- Successful on a wide array of real-world data.
- Algorithms are inherently parallel.
Disadvantages:
- Long training time.
- Requires a number of parameters that are typically best determined empirically, e.g., the network topology or structure.
- Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the hidden units in the network.

A Neuron (= a Perceptron)

[Figure: inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum with bias −θk, which passes through an activation function f to produce the output y.]
For example:

    y = sign(∑(i=0..n) wi·xi + μk)

The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping (μk is the bias term, drawn as −θk in the figure).
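The single neuron above is a few lines of code. A sketch, with hand-picked weights and bias wired to behave as an AND gate (an illustrative choice, not from the text):

```python
def neuron(x, w, mu):
    """y = sign(sum_{i} w_i * x_i + mu_k): one threshold unit.
    mu plays the role of the bias (drawn as -theta_k in the figure)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + mu
    return +1 if s >= 0 else -1

# A 2-input AND gate over {0, 1} inputs:
print(neuron([1, 1], w=[1.0, 1.0], mu=-1.5))  # +1
print(neuron([1, 0], w=[1.0, 1.0], mu=-1.5))  # -1
```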

A Multi-Layer Feed-Forward Neural Network

[Figure: an input vector X feeds the input layer, one hidden layer, and an output layer producing the output vector.]
The update equations, for learning rate l, unit output Oj, and target value Tj:

    Ij = ∑(i) wij·Oi + θj                     (net input to unit j)
    Oj = 1 / (1 + e^(−Ij))                    (sigmoid output of unit j)
    Errj = Oj·(1 − Oj)·(Tj − Oj)              (error at an output unit)
    Errj = Oj·(1 − Oj)·∑(k) Errk·wjk          (error at a hidden unit)
    wij = wij + l·Errj·Oi                     (weight update)
    θj = θj + l·Errj                          (bias update)

How Does a Multi-Layer Neural Network Work?

The inputs to the network correspond to the attributes measured for each training tuple.
Inputs are fed simultaneously into the units making up the input layer. They are then weighted and fed simultaneously to a hidden layer. The number of hidden layers is arbitrary, although usually only one is used.
The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction.
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

Defining a Network Topology

First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0].
Use one input unit per domain value, each initialized to 0.
For the output: if used for classification with more than two classes, use one output unit per class.
Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

Backpropagation

Iteratively process a set of training tuples and compare the network's prediction with the actual known target value.
For each training tuple, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value.
Modifications are made in the backwards direction: from the output layer, through each hidden layer, down to the first hidden layer; hence the name backpropagation.
Steps:
1. Initialize the weights (to small random numbers) and the biases in the network.
2. Propagate the inputs forward (by applying the activation function).
3. Backpropagate the error (by updating the weights and biases).
4. Check the terminating condition (e.g., when the error is very small).
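Putting the four steps and the update equations from the earlier slide together, here is a compact single-hidden-layer sketch in plain Python. The architecture, learning rate, and XOR data are illustrative choices, not from the text:

```python
import math, random

def sigmoid(I):
    return 1.0 / (1.0 + math.exp(-I))

def train_backprop(data, n_in, n_hidden, lr=0.5, epochs=5000):
    """One hidden layer, sigmoid units, updates exactly as on the slides:
    Err_j = O_j(1-O_j)(T_j-O_j) at the output unit,
    Err_j = O_j(1-O_j) * sum_k Err_k * w_jk at a hidden unit,
    w_ij += lr * Err_j * O_i,  theta_j += lr * Err_j."""
    rnd = random.Random(0)
    # Step 1: initialize weights and biases to small random numbers.
    w_h = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    th_h = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_o = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    th_o = rnd.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, t in data:
            # Step 2: propagate the inputs forward.
            O_h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + th)
                   for ws, th in zip(w_h, th_h)]
            O = sigmoid(sum(w * o for w, o in zip(w_o, O_h)) + th_o)
            # Step 3: backpropagate the error (hidden errors use the old w_o).
            err_o = O * (1 - O) * (t - O)
            err_h = [o * (1 - o) * err_o * w for o, w in zip(O_h, w_o)]
            w_o = [w + lr * err_o * o for w, o in zip(w_o, O_h)]
            th_o += lr * err_o
            for j in range(n_hidden):
                w_h[j] = [w + lr * err_h[j] * xi for w, xi in zip(w_h[j], x)]
                th_h[j] += lr * err_h[j]
    return w_h, th_h, w_o, th_o

# XOR: the classic target that a single-layer perceptron cannot fit.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
params = train_backprop(data, n_in=2, n_hidden=2)
```

Here the terminating condition (step 4) is simply a fixed epoch count, to keep the sketch short; a real implementation would stop once the mean squared error falls below a threshold.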

Backpropagation and Interpretability

Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights. However, the number of epochs can be exponential in n, the number of inputs, in the worst case.
Rule extraction from networks (network pruning):
- Simplify the network structure by removing the weighted links that have the least effect on the trained network.
- Then perform link, unit, or activation-value clustering.
- The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden-unit layers.
Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented as rules.